Research topics
Protein language models for understanding
and guiding evolution
Nature is the most powerful protein engineer, and evolution is nature’s design process. We have been using machine learning algorithms to better understand the rules of natural protein evolution, with the ultimate goal of designing better proteins. We leverage sophisticated algorithms called neural language models that can learn evolutionary patterns from millions of natural protein sequences (Fig. 11A).
To demonstrate that language models can learn evolution, we first apply them to the task of predicting evolutionary dynamics, which has diverse applications ranging from tracing the progression of viral outbreaks to understanding the history of life on Earth. Our main conceptual advance is to reconstruct a global landscape of protein evolution from local evolutionary predictions, an approach we call evolutionary velocity (evo-velocity). Evo-velocity predicts the direction of evolution in landscapes of viral proteins evolving over years and of eukaryotic proteins evolving over geologic eons (Fig. 11B).
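The core evo-velocity idea can be sketched in a few lines. The scoring table below is a toy stand-in for a real protein language model, and all names are illustrative rather than taken from the published code:

```python
# Toy position-specific log-likelihoods, log P(residue | position),
# standing in for the per-residue output of a protein language model.
TOY_LOGPROBS = {
    0: {"M": -0.1, "V": -2.5},
    1: {"K": -0.3, "R": -1.2},
    2: {"L": -0.2, "I": -0.9},
}

def evo_velocity(seq_a: str, seq_b: str, logprobs=TOY_LOGPROBS) -> float:
    """Mean log-likelihood change over mutated positions, A -> B.

    A positive value means the model assigns B higher likelihood,
    so evolution is predicted to flow from A toward B.
    """
    diffs = [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]
    if not diffs:
        return 0.0
    return sum(logprobs[i][seq_b[i]] - logprobs[i][seq_a[i]]
               for i in diffs) / len(diffs)
```

The sign of the velocity gives the predicted local direction of evolution; stitching these local predictions together over many related sequences yields the global landscape described above.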
The same ideas can also be used to design new proteins. By testing mutations with higher language-model likelihood than the wildtype (i.e., with positive evo-velocity), we find a surprisingly large number of variants with improved fitness. This enables highly efficient, machine-learning-guided antibody affinity maturation against diverse viral antigens without providing the model with knowledge of the antigen, the protein structure, or task-specific training data (Fig. 11C), and may inform the design of other types of proteins as well.
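The selection rule can be sketched as follows. The scoring table is again a toy stand-in for a general protein language model, and the function names are ours, not from the published work:

```python
# Toy per-position log-likelihoods standing in for a language model.
TOY_SCORES = {
    0: {"M": -0.1, "V": -2.5, "A": -3.0},
    1: {"K": -0.3, "R": -1.2, "A": -0.1},
    2: {"L": -0.2, "I": -0.9, "A": -2.0},
}

def seq_logprob(seq: str) -> float:
    """Sum of per-position log-likelihoods under the toy model."""
    return sum(TOY_SCORES[i][aa] for i, aa in enumerate(seq))

def propose_variants(wildtype: str, alphabet: str = "MKLVRIA"):
    """Return single mutants scoring above wildtype, best first."""
    wt_score = seq_logprob(wildtype)
    out = []
    for i in range(len(wildtype)):
        for aa in alphabet:
            if aa == wildtype[i] or aa not in TOY_SCORES[i]:
                continue
            mutant = wildtype[:i] + aa + wildtype[i + 1:]
            gain = seq_logprob(mutant) - wt_score
            if gain > 0:  # positive evo-velocity relative to wildtype
                out.append((mutant, gain))
    return sorted(out, key=lambda t: -t[1])
```

In practice the shortlisted variants would then be synthesized and screened experimentally, with the model likelihood serving only to prioritize candidates.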
While our earlier work established that language models can guide evolution using sequence alone, a protein's function is inherently linked to its underlying structure. To improve predictive capability, we incorporate structural information with a structure-informed language model and demonstrate substantial gains across protein families and definitions of protein fitness. Importantly, we show that protein backbone coordinates alone, with no side-chain information, are sufficient to learn principles of binding that generalize to protein complexes and enable antibody engineering with unprecedented efficiency. Together, these results lay the groundwork for more potent and resilient therapeutic design in completely unsupervised settings, even in the absence of task-specific training data (Fig. 12).
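One simple, illustrative way to use both sequence and structure signals is a consensus rule that accepts only mutations favored by both models. This sketch is our own simplification with made-up scores, not necessarily the published criterion:

```python
def joint_accept(seq_delta: float, struct_delta: float) -> bool:
    """Accept a mutation only if both the sequence language model and the
    structure-conditioned model score it above wildtype."""
    return seq_delta > 0 and struct_delta > 0

# Hypothetical (sequence-model, structure-model) score changes vs. wildtype.
candidates = {
    "A101G": (0.8, 0.5),   # favored by both models -> accept
    "S30W":  (1.2, -0.4),  # sequence-only support  -> reject
    "T57K":  (-0.2, 0.9),  # structure-only support -> reject
}
accepted = [m for m, (s, t) in candidates.items() if joint_accept(s, t)]
```

Requiring agreement between the two models is one way such a consensus could trade a shorter candidate list for higher confidence in each candidate.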
Publication highlights
Unsupervised evolution of protein and antibody complexes with a structure-informed language model. (PDF) (Science)
Varun R. Shanker, Theodora U. J. Bruun, Brian L. Hie, Peter S. Kim. Science 385, 46-53 (2024). doi: 10.1126/science.adk8946.
Supplementary Materials
Efficient evolution of human antibodies from general protein language models. (PDF) (Nat. Biotechnol.)
Brian L. Hie, Varun R. Shanker, Duo Xu, Theodora U. J. Bruun, Payton A. Weidenbacher, Shaogeng Tang, Wesley Wu, John E. Pak, Peter S. Kim. Nature Biotechnology (2023). doi: 10.1038/s41587-023-01763-2.
Evolutionary velocity with protein language models predicts evolutionary dynamics of diverse proteins. (PDF) (Cell Systems)
Brian L. Hie, Kevin K. Yang, Peter S. Kim. Cell Systems (2022).