Published on Wed Sep 01 2021

Toward More General Embeddings for Protein Design: Harnessing Joint Representations of Sequence and Structure

Mansoor, S., Baek, M., Madan, U., Horvitz, E.

Abstract

Protein embeddings learned from aligned sequences have been leveraged in a wide array of tasks in protein understanding and engineering. These sequence embeddings are generated through semi-supervised training on millions of sequences with deep neural models that have hundreds of millions of parameters, and their performance on target tasks continues to improve with increasing complexity. We report a more data-efficient approach that encodes protein information through joint, semi-supervised training on protein sequence and structure. We show that the method encodes both types of information to form a rich embedding space that can be used for downstream prediction tasks. We also show that incorporating rich structural information into the context under consideration boosts the model's performance in predicting the effects of single mutations. We attribute the increase in accuracy to the value of leveraging proximity within the enriched representation to identify sequentially and spatially close residues that would be affected by the mutation, using experimentally validated or predicted structures.
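To make the notion of "sequentially and spatially close residues" concrete, the following is a minimal illustrative sketch, not the authors' model. It assumes one C-alpha coordinate per residue and picks out residues near a mutation site either by sequence offset or by Euclidean distance. The function name `residues_near_mutation`, the sequence window, and the 8 Å distance cutoff are hypothetical choices made only for this example; the abstract does not specify how proximity is defined.

```python
import math

def residues_near_mutation(ca_coords, site, seq_window=2, dist_cutoff=8.0):
    """Return indices of residues close to `site` in sequence or in space.

    ca_coords   : list of (x, y, z) C-alpha coordinates, one per residue
    site        : 0-based index of the mutated residue
    seq_window  : assumed maximum sequence offset counted as "close"
    dist_cutoff : assumed maximum distance (in angstroms) counted as "close"
    """
    near = []
    for i, xyz in enumerate(ca_coords):
        if i == site:
            continue  # skip the mutated residue itself
        seq_close = abs(i - site) <= seq_window
        spatially_close = math.dist(xyz, ca_coords[site]) <= dist_cutoff
        if seq_close or spatially_close:
            near.append(i)
    return near

# Toy chain: 10 residues spaced 3.8 angstroms apart along the x-axis,
# with the last residue folded back so that it sits near residue 0.
toy_coords = [(3.8 * i, 0.0, 0.0) for i in range(10)]
toy_coords[9] = (0.0, 5.0, 0.0)

print(residues_near_mutation(toy_coords, site=0, seq_window=1))
# prints [1, 2, 9]
```

In this toy chain, residue 9 is far from residue 0 in sequence but close in space, so a sequence-only context would miss it. This is the kind of neighbor that structural information can surface, whether the coordinates come from an experimentally determined or a predicted structure.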