A new application of diffusion models: Microsoft launches the protein generation framework EvoDiff

Author | Xie Niannian

Recently, Microsoft launched a general framework called EvoDiff that it claims can generate "high-fidelity" and "diverse" proteins given only a protein sequence.

This technology is significant because proteins, the molecules that carry out key functions in our cells, are involved in essentially every disease. By studying proteins, we can uncover the mechanisms of disease and find ways to slow or reverse it.

And by creating proteins, we can develop entirely new drugs and treatments.

Currently, the process of designing proteins is complex and expensive, but EvoDiff could change that. It does not require structural information of the target protein, thus eliminating the most tedious steps.

This technology has potential applications in the creation of enzymes for novel therapeutics and drug delivery methods, as well as the development of novel enzymes for industrial chemical reactions.

Paper link:
https://www.biorxiv.org/content/10.1101/2023.09.11.556673v1.full.pdf

GitHub address:
https://github.com/microsoft/evodiff

Protein production is expensive

The current process of designing proteins in the laboratory is expensive from a computational and human resource perspective.

This process involves two key steps.

  • First, a protein structure needs to be found that can perform a specific task in the body.

  • Second, one needs to find a protein sequence, that is, the sequence of amino acids that make up the protein, that is likely to "fold" into that structure.

Only when a protein folds correctly into a three-dimensional shape can it perform its intended function. This process requires extensive computing and human resources and is therefore costly.

But sometimes we don’t have to make things too complicated.

Recently, Microsoft introduced a general framework called EvoDiff. Microsoft says this framework can generate high-fidelity, diverse proteins given only a protein sequence.

Unlike other protein generation frameworks, EvoDiff does not require any information about the structure of the target protein, thus eliminating what is often the most laborious step.

▲ Figure: The process of producing protein

EvoDiff framework

At the heart of the EvoDiff framework is a 640-million-parameter model trained on protein data from different species and functional categories.

The data used to train the model comes from the OpenFold data set, which provides sequence alignments, and from UniRef50, a subset of the UniProt data.

UniProt is a database of protein sequence and functional information maintained by the UniProt Consortium. Using this data, the EvoDiff team was able to train a powerful model for tasks such as protein generation.

EvoDiff is essentially a diffusion model, similar in architecture to modern image generation models such as Stable Diffusion and DALL-E 2. It starts from a protein sequence made up almost entirely of noise and removes that noise step by step, gradually arriving at a plausible protein sequence.
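To make the denoising idea concrete, here is a minimal toy sketch in Python. It is not EvoDiff's code. It assumes a masking-style corruption, in which "noise" means replacing residues with a mask symbol, and a reverse process that restores one position at a time in random order. The names `MASK`, `corrupt`, `toy_denoiser`, and `denoise` are hypothetical, and `toy_denoiser` simply picks residues at random where a real framework would query a trained neural network.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK = "#"  # hypothetical symbol marking a "noised" position


def corrupt(sequence, num_masked, rng):
    """Forward process: replace num_masked randomly chosen residues with MASK."""
    positions = rng.sample(range(len(sequence)), num_masked)
    noised = list(sequence)
    for pos in positions:
        noised[pos] = MASK
    return "".join(noised)


def toy_denoiser(sequence, position, rng):
    """Hypothetical stand-in for a trained model: picks a residue at random."""
    return rng.choice(AMINO_ACIDS)


def denoise(sequence, rng):
    """Reverse process: fill masked positions one at a time, in random order."""
    current = list(sequence)
    masked = [i for i, ch in enumerate(current) if ch == MASK]
    rng.shuffle(masked)
    for pos in masked:
        current[pos] = toy_denoiser("".join(current), pos, rng)
    return "".join(current)


rng = random.Random(0)
start = MASK * 30                 # fully noised starting point
generated = denoise(start, rng)   # a 30-residue toy "protein"
print(generated)
```

Because the stand-in denoiser is random, the output is not a meaningful protein. The sketch only illustrates the control flow: begin with pure noise and replace it with residues step by step.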

Diffusion modeling is increasingly used outside the field of image generation. It can be used to design new proteins, as EvoDiff does, and also to create music and synthesize speech. Its range of applications keeps expanding.

Unlike traditional protein generation frameworks, EvoDiff designs proteins based not on their structure but on their sequence space. This means it can synthesize a special type of protein, called disordered proteins, which do not end up folding into a specific three-dimensional structure.

Nonetheless, these disordered proteins still play important roles in biology and disease. They can enhance or reduce the activity of other proteins, thereby affecting how the organism functions, which makes them important for understanding both normal biological processes and the mechanisms of disease.

EvoDiff will advance protein engineering

Ava Amini, an EvoDiff co-author and a senior researcher at Microsoft, emphasized the importance of generating proteins from sequence, noting the advantages of this approach in terms of versatility, scale, and modularity.

Amini also mentioned that the diffusion framework gives the team control over the design process, so that generated proteins can be steered toward specific functional goals.

Amini believes that EvoDiff can not only create new proteins, but also fill "gaps" in existing protein designs. For example, if part of a protein binds to another protein, the model can generate a sequence of the protein's amino acids around that part that meets a series of criteria. This means that EvoDiff can help scientists design more types of proteins, thus broadening the application fields of proteins.
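The gap-filling idea can be sketched as follows. This is a toy illustration under stated assumptions, not EvoDiff's API: the motif string, the `build_template` and `infill` helpers, and the `MASK` symbol are all hypothetical, and random sampling again stands in for a trained model. The point it shows is that the functional segment stays fixed while only the surrounding positions are generated.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK = "#"  # hypothetical symbol for positions still to be generated


def build_template(motif, total_length, motif_start):
    """Place a fixed motif inside an otherwise masked sequence."""
    if motif_start < 0 or motif_start + len(motif) > total_length:
        raise ValueError("motif does not fit in the requested length")
    template = [MASK] * total_length
    template[motif_start:motif_start + len(motif)] = list(motif)
    return "".join(template)


def infill(template, rng):
    """Fill only masked positions; residues already present are kept as-is.

    Random choice is a stand-in for a trained model's prediction.
    """
    return "".join(
        rng.choice(AMINO_ACIDS) if ch == MASK else ch for ch in template
    )


rng = random.Random(0)
motif = "GSHMK"  # hypothetical binding segment, for illustration only
template = build_template(motif, total_length=40, motif_start=10)
designed = infill(template, rng)
print(template)
print(designed)
```

In a real system, the criteria the surrounding sequence must meet would come from the trained model and any conditioning signals, not from uniform sampling.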

Kevin Yang, a senior researcher at Microsoft, said that EvoDiff will be open source. This open-source tool can be used to make enzymes for new therapies and drug delivery methods, as well as new enzymes for industrial chemical reactions.

The team anticipates that EvoDiff will advance the evolution of protein engineering from the traditional structure-function paradigm to programmable, sequence-first design.

With EvoDiff, the team demonstrated an important point: protein generation does not have to rely on a specific structure, and the protein sequence alone can be enough to work effectively. This means they can enable more applications by controllably engineering new proteins.

But for now, it is important to note that the research behind EvoDiff has not been peer-reviewed, at least not yet. Sarah Alamdari, a Microsoft data scientist involved in the project, acknowledged that more scaling work needs to be done before the framework can be used commercially.

Next steps

The current EvoDiff model has only 640 million parameters. Scaling it to billions of parameters might improve generation quality. The team also hopes to condition EvoDiff on text, chemical information, or other inputs that specify the desired function.

The EvoDiff team also plans to test the proteins produced by their model in the lab to determine whether they are viable. If the test results prove feasible, they will start developing the next generation framework.

References

[1] https://techcrunch.com/2023/09/14/microsoft-open-sources-evodiff-a-novel-protein-generating-ai/
