November 29, 2022

Adapting language models to track virus variants

Scientists from the U.S. Department of Energy's (DOE) Argonne National Laboratory and a team of collaborators have won the 2022 Gordon Bell Special Prize for High Performance Computing-Based COVID-19 Research for their new method of quickly identifying how a virus evolves. Their work in training large language models (LLMs) to discover variants of SARS-CoV-2 has implications to biology beyond COVID-19.

A form of artificial intelligence (AI), LLMs are generally meant to summarize and translate texts or predict what words might come next based on what the model learned in an initial training stage. For instance, an LLM can be trained—with the help of gigantic language datasets—to translate text from English to Spanish.

The researchers who won this year's award leveraged Argonne's powerful supercomputing and AI resources to develop and apply LLMs toward tracking how a virus can mutate into more dangerous or more transmissible variants.

When a virus evolves, it mutates into new variants that can be similar to past variants or even more deadly than previous iterations. When a particular variant is considered more dangerous or harmful, it is labeled as a variant of concern (VOC). Discovering these VOCs quickly and efficiently can save lives by providing scientists with time to design and develop effective vaccines and treatment strategies.

Existing methods to track these variants can be slow. To solve this problem, computational biologist Arvind Ramanathan and his colleagues at Argonne together with collaborators from the University of Chicago, NVIDIA, Cerebras Inc., University of Illinois at Chicago, Northern Illinois University, California Institute of Technology, New York University and Technical University of Munich set out to create a means of identifying VOCs. Their paper, "GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics," is the culmination of the team's findings.

"When the pandemic began, we had several of these really harmful variants of the virus, like the Delta variant," Ramanathan said. "It resulted in a large death toll. But Delta evolved as a consequence of certain mutations that were happening when the virus was facing the human hosts. It's a process of evolution of the virus inside of the human cell."

Their work resulted in the first genome-scale language model (GenSLM), which is a model that can analyze genes and rapidly identify VOCs. The model discussed in the paper was trained on data from the COVID-19 pandemic, and the hope is that models like this could potentially give health officials the tools they need to quickly respond to rising variants. GenSLM is the first whole genome-scale foundation model that can be altered and applied to other prediction tasks similar to VOC identification.

While these evolutionary variants may seem to crop up randomly to the human eye, tracking them is of the utmost concern. As such, the work of Ramanathan and his colleagues could seriously alter how we stay on top of viral outbreaks.

The language of evolution

Previous work demonstrated that LLMs based on the amino acid language of proteins can be used both to track the evolution of proteins and to design entirely new proteins with novel structure and function. However, Ramanathan points out that, to his knowledge, the research he and his colleagues performed was the first attempt at running an LLM-based model at the gene level.

"Large language models are key for achieving the AI for science vision across diverse science domains," said Venkatram Vishwanath, co-author of the study and data science lead at the Argonne Leadership Computing Facility (ALCF), a DOE Office of Science user facility.

That said, proteins are still two steps away from the core biological process that Ramanathan and his team were interested in. In the cell, genes are first transcribed onto something called messenger RNA (mRNA). This mRNA exits the cell's nucleus and makes its way into ribosomes, where the ribosomes synthesize proteins. In a sense, you could consider genes to be a message containing instructions on how to build proteins. And since proteins function somewhat similar to conversational messages, it would make sense to apply language models to define them.

While previous experiments had proven that language models were adept at explaining evolutionary changes at the protein level, scientists needed to go deeper if they wanted to identify VOCs at the gene level.

Machine learning played a pivotal role in this research, and the models needed information to learn about VOCs. The Bacterial and Viral Bioinformatics Resource Center web resources as well as the Houston Hospital System provided integrated data and analysis tools to support this work. The researchers analyzed 1.5 million high-quality SARS-CoV-2 complete genome sequences from the resource center and 16,545 total sequences from Houston to better understand the virus.

Previously, without this GenSLMs, VOCs needed to be identified by individually going through every protein and mapping each mutation to see if any mutations were of interest. This is incredibly labor- and time-intensive, and GenSLMs should help make this process easier.

The team has proven that these models can help advance biology research, and they now want to understand how far they can push the approach. Ramanathan believes that their work could lay the groundwork for a future pandemic observatory. He also suggests that protein engineering applications could come from this work, or even the modeling of entire organisms.

The right tools for the job

Powerful supercomputing assets were vital in the success of this work. The researchers used both Polaris, a Hewlett Packard Enterprise system, and the Cerebras CS-2 AI platform at the ALCF. This research also relied on NVIDIA's Selene supercomputer. While Polaris and Selene are powerful supercomputers accelerated by GPUs (graphics processing units), the CS-2 system is different. The CS-2 AI-accelerator system, part of the ALCF AI Testbed, is highly optimized for learning-based tasks.

"Polaris is the new ALCF supercomputer with four GPUs on a single node, and we have 560 of these nodes," said Ramanathan. "This really helps us scale the end-to-end workflow, including the training process, across multiple nodes in a much more convenient manner. And because of the amount of memory and node-local storage that is available on a single node, we can load or we can basically stage the data in certain ways that let us utilize the entire machine's power for doing these types of complex calculations."

Given the critical time crunch to obtain results, the team also relied on Cerebras CS-2 machines to aid in their work in addition to the Polaris and Selene systems. Specifically, they utilized the Cerebras Wafer-Scale Engine and ended up requiring both a single CS-2 machine as well as a cluster of 16 CS-2 to achieve the desired accuracy and perplexity results in less than a day.

"A key challenge in this problem is dealing with long sequence lengths and tackling these foundation models at the scale of the viral genome," said Ramanathan. "This process can benefit from systems with large memory capabilities, such as the CS-2 system architecture with their Memory-X and Swarm-X infrastructure, and this makes it easier to load and train on these very long sequences."

"Resources such as Polaris, CS-2 systems and the various AI accelerator systems at the ALCF AI Testbed are helping us advance the use of these models for scientific research," said Vishwanath.

More information: Maxim Zvyagin et al, GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics, bioRxiv (2022). DOI: 10.1101/2022.10.10.511571

Provided by Argonne National Laboratory

Citation: Adapting language models to track virus variants (2022, November 29) retrieved 16 July 2024 from https://medicalxpress.com/news/2022-11-language-track-virus-variants.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.

Explore further

Researchers bring innovative AI and simulation tools to the COVID-19 battlefront

31 shares

Feedback to editors

Patient-driven discovery reveals potential target for autoimmune diseases

21 minutes ago

Study reveals dual role of protein in cancer treatment

39 minutes ago

Genetically modified mice reveal protein receptor's role in metabolic health

41 minutes ago

Scientists develop potential stealth cancer therapy

1 hour ago

Study on origins of schizophrenia in the brain offers hope for targeted treatments, better diagnosis

1 hour ago

New study finds early detection of miRNAs in maternal blood may offer potential for predicting preeclampsia

5 hours ago

Friends of at-risk youth may need extra support for their mental well-being

15 hours ago

Better together: Spatial arrangement of three immune cells is key to attacking tumors, study finds

16 hours ago

A widespread practice among athletes harms both performance and health, says study

17 hours ago

Scientists identify 'unconventional' new pathway for TB vaccines

18 hours ago

Load comments (0)

Adapting language models to track virus variants

The language of evolution

The right tools for the job

Patient-driven discovery reveals potential target for autoimmune diseases

Study reveals dual role of protein in cancer treatment

Genetically modified mice reveal protein receptor's role in metabolic health

Scientists develop potential stealth cancer therapy

Study on origins of schizophrenia in the brain offers hope for targeted treatments, better diagnosis

New study finds early detection of miRNAs in maternal blood may offer potential for predicting preeclampsia

Friends of at-risk youth may need extra support for their mental well-being

Better together: Spatial arrangement of three immune cells is key to attacking tumors, study finds

A widespread practice among athletes harms both performance and health, says study

Scientists identify 'unconventional' new pathway for TB vaccines

Researchers bring innovative AI and simulation tools to the COVID-19 battlefront

Scientists articulate new data standards for AI models

Evolutionary analysis shows SARS-CoV-2 variants converging

Collaborative AI effort unraveling SARS-CoV-2 mysteries wins Gordon Bell Special Prize

Computer model predicts dominant SARS-CoV-2 variants

Model analyzes how viruses escape the immune system

Scientists identify 'unconventional' new pathway for TB vaccines

Preclinical data suggest antioxidant strategy to address mitochondrial dysfunction caused by SARS-CoV-2 virus

Off-the-shelf wearable trackers provide clinically-useful information for patients with heart disease

Even mild SARS-CoV-2 infections have a long-term impact on the immune system, finds study

Accurate and continuous remote monitoring of step length can be sensitive marker for neurological diseases and aging

Beyond algorithms: The role of human empathy in AI-enhanced therapy

Phys.org

Tech Xplore

Science X

Adapting language models to track virus variants

The language of evolution

The right tools for the job

Patient-driven discovery reveals potential target for autoimmune diseases

Study reveals dual role of protein in cancer treatment

Genetically modified mice reveal protein receptor's role in metabolic health

Scientists develop potential stealth cancer therapy

Study on origins of schizophrenia in the brain offers hope for targeted treatments, better diagnosis

New study finds early detection of miRNAs in maternal blood may offer potential for predicting preeclampsia

Friends of at-risk youth may need extra support for their mental well-being

Better together: Spatial arrangement of three immune cells is key to attacking tumors, study finds

A widespread practice among athletes harms both performance and health, says study

Scientists identify 'unconventional' new pathway for TB vaccines

Related Stories

Researchers bring innovative AI and simulation tools to the COVID-19 battlefront

Scientists articulate new data standards for AI models

Evolutionary analysis shows SARS-CoV-2 variants converging

Collaborative AI effort unraveling SARS-CoV-2 mysteries wins Gordon Bell Special Prize

Computer model predicts dominant SARS-CoV-2 variants

Model analyzes how viruses escape the immune system

Recommended for you

Scientists identify 'unconventional' new pathway for TB vaccines

Preclinical data suggest antioxidant strategy to address mitochondrial dysfunction caused by SARS-CoV-2 virus

Off-the-shelf wearable trackers provide clinically-useful information for patients with heart disease

Even mild SARS-CoV-2 infections have a long-term impact on the immune system, finds study

Accurate and continuous remote monitoring of step length can be sensitive marker for neurological diseases and aging

Beyond algorithms: The role of human empathy in AI-enhanced therapy

Newsletter sign up

Donate and enjoy an ad-free experience