This AI-based protein language model, Ankh, unlocks general-purpose protein sequence modeling

The comparison between the syntax and semantics of natural language and the structure and function of protein sequences has fundamentally changed how researchers study the language of life. Although this comparison has inherent value as a historical landmark that helped bring NLP methods (e.g., language models) into protein research, findings from NLP do not fully translate to the protein language. In particular, scaling up protein language models may have much larger consequences than scaling up NLP models.

The observation that language models with very large parameter counts, trained for very many steps, still show appreciable learning gradients, and are therefore considered underfit, encourages a somewhat false proportionality between the size of a model and the richness of its learned representations. As a result, the search for more accurate or relevant protein representations has gradually turned into a search for larger models, which demand more computing power and are therefore harder to access. Notably, PLM sizes have recently grown from millions (10⁶) to billions (10⁹) of parameters. The authors take ProtTrans's ProtT5-XL-U50, an encoder-decoder model pre-trained on the UniRef50 database with 3B parameters for training and 1.5B for inference, as the benchmark that defines the prior state of the art (SOTA) in protein language modeling.
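As a rough illustration of what working with this prior SOTA looks like in practice, the sketch below loads only the encoder half of ProtT5-XL-U50, which is why inference needs roughly half the training-time parameters. It is not taken from the paper: it assumes the Rostlab/prot_t5_xl_uniref50 checkpoint hosted on Hugging Face, the transformers library, and a purely hypothetical example sequence.

```python
# Minimal sketch (not from the paper): per-residue embeddings with the
# ProtT5-XL-U50 encoder. Assumes transformers + sentencepiece are installed
# and the Rostlab/prot_t5_xl_uniref50 checkpoint is available.
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")  # encoder only, ~1.5B params
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # hypothetical example sequence
# ProtT5 expects space-separated residues, with rare amino acids mapped to X.
sequence = " ".join(re.sub(r"[UZOB]", "X", sequence))

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    residue_embeddings = model(**inputs).last_hidden_state  # shape: (1, L + 1, 1024)
```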

The RITA family of language models was a first step toward scaling principles for protein sequence modeling, showing how model performance changes with size. RITA comprises four models with sizes growing from 85M to 300M to 680M to 1.2B parameters. A similar pattern was later confirmed by ProGen2, a suite of protein language models trained on different sequence datasets and scaled up to 6.4B parameters. Finally, ESM-2, the most recent addition as of this study's publication, is a suite of general-purpose protein language models that likewise shows performance rising proportionally with scale, from 650M to 3B to 15B parameters.
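For a concrete sense of these scales, the sketch below, which is illustrative rather than taken from the study, loads the 650M-parameter ESM-2 variant through Hugging Face transformers and prints its parameter count; the facebook/esm2_t33_650M_UR50D checkpoint name and the toy sequence are assumptions for this example.

```python
# Minimal sketch (illustrative): load the 650M ESM-2 checkpoint and
# confirm its scale by counting parameters.
import torch
from transformers import AutoTokenizer, EsmModel

name = "facebook/esm2_t33_650M_UR50D"  # the 3B and 15B variants use the same interface
tokenizer = AutoTokenizer.from_pretrained(name)
model = EsmModel.from_pretrained(name)
model.eval()

print(f"parameters: {sum(p.numel() for p in model.parameters()):,}")  # roughly 650M

inputs = tokenizer("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", return_tensors="pt")  # toy sequence
with torch.no_grad():
    residue_embeddings = model(**inputs).last_hidden_state  # shape: (1, L + 2, 1280)
```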

The seemingly simple "bigger is better" view of PLMs ignores several factors, including computational cost and the practical difficulty of designing and deploying such models. It raises the barrier to entry for innovative research and limits the room for further scaling in applications. Although model size undoubtedly influences these goals, it is not the only lever. Scaling the pre-training dataset is conditional: larger datasets are not always preferable to smaller datasets of higher quality. The authors argue that scaling the language models themselves is conditional in the same way, i.e., larger models are not necessarily better than smaller models built with protein-knowledge-guided optimization.

The primary goal of this study is to integrate knowledge-guided optimization into an iterative empirical framework that promotes accessible research innovation through practical resources. Because their model "unlocks" the language of life by learning better representations of its "letters," the amino acids, the researchers named their project "Ankh," after the ancient Egyptian symbol for the key of life. They develop two lines of evidence to assess and improve Ankh's generality.

The first is outperforming the SOTA on a broad set of structure and function benchmarks, together with a generation analysis for protein engineering on High-N (family-based) and One-N (single-sequence-based) applications, where N is the number of input sequences. The second is achieving this performance through a survey of optimized attributes covering not only the model architecture but also the software and hardware used to build, train, and deploy the model. To suit different application needs and computational budgets, they provide two pre-trained models, Ankh Large and Ankh Base, and for convenience refer to the flagship model, Ankh Large, simply as Ankh. The pre-trained models are available on their GitHub page, along with details on how to run the source code.
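As a quick sense of how the released models can be used, the following sketch assumes the ankh Python package from the project's GitHub page exposes loader functions along the lines of its README (load_large_model / load_base_model); the sequence shown is a made-up example.

```python
# Minimal sketch (assumes the `ankh` package API as documented in the repo).
import torch
import ankh

model, tokenizer = ankh.load_large_model()   # flagship model, referred to simply as "Ankh"
# model, tokenizer = ankh.load_base_model()  # lighter-weight alternative
model.eval()

sequences = [list("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]  # made-up example, one residue per token
batch = tokenizer.batch_encode_plus(
    sequences,
    add_special_tokens=True,
    padding=True,
    is_split_into_words=True,
    return_tensors="pt",
)
with torch.no_grad():
    residue_embeddings = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
    ).last_hidden_state
```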


Check out the paper and GitHub. All credit for this research goes to the researchers on this project. Also, don't forget to join our Reddit page, Discord channel, and email newsletter, where we share the latest AI research news, cool AI projects, and more.


Anish Teeku is a Consultant Trainee at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects that harness the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He enjoys connecting with people and collaborating on interesting projects.

