  LongeCity
              Advocacy & Research for Unlimited Lifespans


A Generative, Foundational AI Model for Genetics



#1 Steve H


Posted Yesterday, 11:00 PM


The Arc Institute, a nonprofit research organization, has published a manuscript on its creation of Evo 2, an AI foundation model that is capable of both understanding and building full genomes of organisms.

A new step in understanding biology

The authors of this paper, a group of researchers largely from the Arc Institute and well-known universities in California, begin by discussing Evo 2’s unprecedented size. The original Evo was trained only on organisms that lack nuclei (prokaryotes). Evo 2 was also trained on organisms with nuclei (eukaryotes), a classification that includes everything from amoebae to human beings. In total, its training set included 9.3 trillion base pairs.

The researchers created two variants, one with 7 billion parameters (7B) and another with 40 billion parameters (40B). Both use a context window of one million base pairs. The model is open source: the training code, the inference code, the model parameters, and the training data, which originates from OpenGenome2, have all been released.

This paper goes into detail describing how the model was trained. Like the commonly known large language models (LLMs), this model was fundamentally trained to predict the next “token”; instead of predicting the next word in the English language, however, Evo 2 was built to predict the next DNA base. The model was built on StripedHyena 2, a convolutional, multi-hybrid architecture that processes its training information through different, interleaved types of layers (stripes).
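To make the idea of next-base prediction concrete, here is a minimal sketch in Python. It is not Evo 2's architecture: it assumes a toy k-mer counting model (an order-k Markov chain), and the names `train_counts`, `next_base_probs`, and `toy_corpus` are hypothetical. It only illustrates the training objective described above, which is to estimate the probability of the next base given the bases that came before it.

```python
from collections import Counter, defaultdict

BASES = "ACGT"

def train_counts(sequences, k=3):
    """Count which base follows each k-length context in the training data."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - k):
            counts[seq[i:i + k]][seq[i + k]] += 1
    return counts

def next_base_probs(counts, context, k=3, pseudocount=1.0):
    """Return P(next base | last k bases), with additive smoothing."""
    ctx = context[-k:]
    seen = counts.get(ctx, Counter())
    total = sum(seen[b] for b in BASES) + pseudocount * len(BASES)
    return {b: (seen[b] + pseudocount) / total for b in BASES}

# Hypothetical toy training data; a real model sees trillions of bases.
toy_corpus = ["ATGGCCATGGCCATGGCC", "ATGGCCATGGAA"]
model = train_counts(toy_corpus, k=3)

# Predict the base that follows a sequence ending in "ATG".
probs = next_base_probs(model, "CCATG", k=3)
predicted = max(probs, key=probs.get)
print(predicted, probs)
```

A real foundation model replaces the lookup table with a neural network and a far longer context, but the output has the same shape: a probability for each of the four bases at the next position.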

Predicting the effects of mutations

The researchers found that Evo 2 was able to predict whether a genetic mutation would impact essential function, which had never been accomplished before in eukaryotes. Evo 2 learned to predict the likelihood of mutations as they relate to start and stop codons; this, the researchers claimed, means that it has an understanding of such “fundamental genetic features” despite being trained solely on base pairs and never being taught what they mean.

Furthermore, by testing its predictions against known effects in RNA sequences, the researchers determined that the model was able to accurately ascertain whether any given mutation would affect the essential function of the sequence. It was even able to grasp that mutations in noncoding regions could have significant consequences. The 40B model was found to be substantially better than the 7B model at this.
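One common way to turn a sequence likelihood model into a mutation scorer is to compare the likelihood of the mutated sequence with that of the reference. The sketch below assumes that approach; this article does not spell out the paper's exact scoring procedure. It again uses a toy k-mer model in place of Evo 2, and `fit`, `log_likelihood`, and `variant_score` are hypothetical names.

```python
import math
from collections import Counter, defaultdict

BASES = "ACGT"

def fit(sequences, k=2):
    """Toy stand-in for a trained model: count next-base frequencies per context."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - k):
            counts[seq[i:i + k]][seq[i + k]] += 1
    return counts

def log_likelihood(counts, seq, k=2, pseudocount=1.0):
    """Sum of log P(base | previous k bases) over the sequence."""
    total = 0.0
    for i in range(k, len(seq)):
        seen = counts.get(seq[i - k:i], Counter())
        denom = sum(seen[b] for b in BASES) + pseudocount * len(BASES)
        total += math.log((seen[seq[i]] + pseudocount) / denom)
    return total

def variant_score(counts, reference, position, alt_base, k=2):
    """Log-likelihood of the mutated sequence minus that of the reference.

    A strongly negative score means the model finds the mutation unlikely.
    """
    mutated = reference[:position] + alt_base + reference[position + 1:]
    return log_likelihood(counts, mutated, k) - log_likelihood(counts, reference, k)

# Hypothetical training data and reference sequence.
model = fit(["ATGGCCATGGCCATGGCC"] * 5)
reference = "ATGGCCATGGCC"
score = variant_score(model, reference, position=5, alt_base="T")
print(f"score for C>T at position 5: {score:.2f}")
```

The intuition is that a model trained on functional genomes assigns lower probability to sequences that break patterns it has seen preserved, so a large drop in likelihood is a signal that the mutation may be disruptive.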

This held true even for sequences derived from human beings. Mutations in the BRCA1 gene are associated with breast cancer, and the 40B version of Evo 2 was able to predict whether any given mutation in this gene would be dangerous, especially when it was specifically supervised to do so, even beating out specialized models made for the purpose. This, the researchers note, is in spite of the model being trained on only one reference human genome within its expansive dataset; its predictions are fundamentally derived from how organisms work, not humans in particular.

Grasping genetics from the ground up

The researchers took a close look at Evo 2’s thought process. They found that it was able to accurately identify features associated with CRISPR-related phage sequences within E. coli bacteria. Rather than memorizing the bacterial phages themselves, the model identified the CRISPR spacers instead. Similarly, the model was able to identify frameshift mutations and premature stop codons. It was also able to take the exons and introns that it had learned from the human genome and recognize them in the woolly mammoth genome, which it had never been trained on.
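For readers unfamiliar with the two mutation types mentioned above, the following Python sketch shows what each one looks like in a coding sequence. It is a simple rule-based check, not how Evo 2 detects them (the model learns such features from raw sequence). The function names and example sequences are hypothetical, and the check assumes the sequences start in frame at the first base.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def codons(cds):
    """Split a coding sequence into complete three-base codons."""
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]

def first_stop_index(cds):
    """Return the index of the first stop codon, or None if there is none."""
    for idx, codon in enumerate(codons(cds)):
        if codon in STOP_CODONS:
            return idx
    return None

def classify(reference_cds, mutated_cds):
    """Label a mutation as a frameshift, a premature stop, or other."""
    # An insertion or deletion that is not a multiple of three shifts the frame.
    if (len(mutated_cds) - len(reference_cds)) % 3 != 0:
        return "frameshift"
    ref_stop = first_stop_index(reference_cds)
    mut_stop = first_stop_index(mutated_cds)
    if mut_stop is not None and (ref_stop is None or mut_stop < ref_stop):
        return "premature stop"
    return "other"

reference = "ATGGCCAAATGGTAA"   # ATG GCC AAA TGG TAA
nonsense = "ATGGCCTAATGGTAA"    # third codon AAA changed to TAA
frameshift = "ATGGCAAATGGTAA"   # one base deleted from the second codon

print(classify(reference, nonsense))
print(classify(reference, frameshift))
```

Both mutation types usually truncate or scramble the resulting protein, which is why a model that flags them without being told what a codon is can be said to have picked up on the reading frame.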

As this is a generative AI, the researchers set it to the task of generating genomes. The genomes it created were found to have many natural features, including reasonable chromatin accessibility, although the authors judged its performance based on other algorithms and did not actually create any physical structures based on Evo 2’s outputs. They posit that their model can, with further training related to sequences and their associated functions, be used to generate effective genetic structures.

To prevent this open-source model from being used for bioterrorism, the researchers intentionally excluded infectious diseases from its training set, and they red-teamed their model to ensure that it was no better than random chance at generating or understanding the effects of infectious diseases. However, they did note that they cannot prevent malefactors from further training the model on such data.

This model may have significant benefits for diagnosing and treating both mitochondrial dysfunction and genomic instability, such as by identifying and better understanding the age-related mutations that give some cells or mitochondria a reproductive advantage over others at the expense of the overall organism. It may even be possible for future research to use this model to test individual people for mutated cells or even to create individually targeted gene therapies. It is still a foundational model, however, and nothing based on Evo 2 has been put to such tasks.

This manuscript was published on the Arc Institute’s website and not in a peer-reviewed publication. However, the depth and detail of this paper’s explanations, along with its authorship by researchers from reputable institutions, lend weight to its claims. As this is an open-source tool for the research community, it should soon become clear whether it can be used to advance oncology, develop treatments for genetic diseases, or directly impact aging at the genetic level.


View the article at lifespan.io



