Researchers at the National Institute of Standards and Technology (NIST) have developed a new statistical tool that they have used to predict protein function. Not only could it help with the difficult task of altering proteins in practically useful ways, but it also works by fully interpretable methods, an advantage over the conventional artificial intelligence (AI) that has aided protein engineering in the past.
The new tool, called LANTERN, could prove useful in work ranging from biofuel production to crop improvement and the development of new disease treatments. Proteins, as building blocks of biology, are a key element in all of these tasks. But while it is comparatively easy to make changes to the strand of DNA that serves as the blueprint for a given protein, it remains difficult to determine which specific base pairs, the rungs on the DNA ladder, are the keys to producing a desired effect. Finding these keys has been the job of AI built from deep neural networks (DNNs), which, though effective, are notoriously opaque to human understanding.
Described in a new paper published in the Proceedings of the National Academy of Sciences, LANTERN shows the ability to predict the genetic changes needed to create useful differences in three different proteins. One is the spike protein on the surface of the SARS-CoV-2 virus, which causes COVID-19; understanding how changes in the DNA can alter this spike protein could help epidemiologists predict the future of the pandemic. The other two are well-known lab workhorses: the LacI protein from the E. coli bacterium and the green fluorescent protein (GFP) used as a marker in biology experiments. Selecting these three subjects allowed the NIST team to show not only that their tool works, but also that its results are interpretable, an important feature for industry, which needs predictive methods that help with understanding the underlying system.
“We have an approach that is fully interpretable and has no loss of predictive power,” said Peter Tonner, a NIST statistician and computational biologist and the main developer of LANTERN. “There’s a common assumption that if you want one of these things, you can’t have the other. We’ve shown that sometimes you can have both.”
The problem facing the NIST team can be imagined as interacting with a complex machine that has a vast control panel filled with thousands of unlabeled switches: the device is a gene, a strand of DNA that encodes a protein, and the switches are base pairs on the strand. Every switch affects the device’s output in some way. If your job is to make the machine work differently in a specific way, which switches should you flip?
Because the answer may require changes to several base pairs, scientists must flip some combination of them, measure the result, then choose a new combination and measure again. The number of possible combinations is daunting.
“The number of potential combinations may be greater than the number of atoms in the universe,” Tonner said. “All possibilities could never be measured. That’s a ridiculously large number.”
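To give a rough sense of the scale Tonner describes, the following Python sketch counts DNA variants under stated assumptions. The gene length of 1,000 base pairs and the figure of 10^80 atoms (a commonly cited order-of-magnitude estimate) are illustrative assumptions, not numbers from the NIST study.

```python
from math import comb

# Assumption: a commonly cited order-of-magnitude estimate, not from the study.
ATOMS_IN_UNIVERSE = 10**80

# Assumption: a hypothetical gene length chosen only for illustration.
GENE_LENGTH = 1000


def variants_with_k_changes(length: int, k: int) -> int:
    """Count sequences that differ from a reference at exactly k positions.

    Choose which k positions change, then pick one of the 3 other
    bases (out of A, C, G, T) at each changed position.
    """
    return comb(length, k) * 3**k


def total_sequences(length: int) -> int:
    """Count every possible DNA sequence of the given length."""
    return 4**length


# Smallest number of simultaneous changes whose variant count
# already exceeds the assumed number of atoms.
smallest_k = next(
    k for k in range(GENE_LENGTH + 1)
    if variants_with_k_changes(GENE_LENGTH, k) > ATOMS_IN_UNIVERSE
)

print("single-change variants:", variants_with_k_changes(GENE_LENGTH, 1))
print("changes needed to exceed the atom estimate:", smallest_k)
print("all sequences exceed the atom estimate:",
      total_sequences(GENE_LENGTH) > ATOMS_IN_UNIVERSE)
```

The counts are exact integers, so the comparison does not depend on floating-point rounding. A different assumed gene length changes the numbers but not the conclusion that exhaustive measurement is out of reach.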
Because of the amount of data involved, DNNs have been tasked with sorting through a sample of the data and predicting which base pairs need to be flipped. In this they have proved successful, as long as you do not ask for an explanation of how they arrive at their answers. They are often described as “black boxes” because their inner workings are inscrutable.
“It’s very difficult to understand how DNNs make their predictions,” said NIST physicist David Ross, one of the co-authors of the paper. “And that’s very important if you want to use these predictions to design something new.”
LANTERN, on the other hand, is explicitly designed to be understandable. Part of its explainability comes from its use of interpretable parameters to represent the data it analyzes. Rather than allowing the number of these parameters to grow extraordinarily large and often inscrutable, as is the case with DNNs, each parameter in LANTERN’s calculations has a purpose that is meant to be intuitive, helping users understand what these parameters mean and how they influence LANTERN’s predictions.
The LANTERN model represents protein mutations using vectors, widely used mathematical tools often portrayed visually as arrows. Each arrow has two properties: its direction indicates the effect of the mutation, while its length represents the strength of that effect. When two proteins have vectors that point in the same direction, LANTERN indicates that the proteins have similar function.
The directions of these vectors often correspond to biological mechanisms. For example, LANTERN learned a direction associated with protein folding in all three data sets the team studied. (Folding plays a critical role in how a protein functions, so identifying this factor across the data sets was an indication that the model works as intended.) When making predictions, LANTERN simply adds these vectors together, a method that users can follow when examining its predictions.
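The vector picture above can be illustrated with a minimal Python sketch. The mutation names, the two-dimensional vectors, and the helper functions are all hypothetical and chosen only for illustration; they are not LANTERN’s code or learned values, and the step that turns a summed vector into a predicted measurement is not described here and is left out.

```python
import math

# Hypothetical mutations with made-up 2-D vectors (not learned values).
MUTATION_VECTORS = {
    "A10G": (0.9, 0.1),
    "C25T": (1.8, 0.2),   # same direction as A10G, twice the length
    "G40A": (-0.1, 0.7),  # points in a different direction
}


def combine(mutations):
    """Add the vectors of several mutations component by component."""
    vectors = [MUTATION_VECTORS[name] for name in mutations]
    return tuple(sum(components) for components in zip(*vectors))


def strength(vector):
    """Length of the arrow, read here as the strength of the effect."""
    return math.hypot(*vector)


def cosine_similarity(u, v):
    """Close to 1.0 when two arrows point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (strength(u) * strength(v))


combined = combine(["A10G", "G40A"])
print("combined vector:", combined)
print("strength of combined effect:", strength(combined))
print("A10G vs C25T direction similarity:",
      cosine_similarity(MUTATION_VECTORS["A10G"], MUTATION_VECTORS["C25T"]))
```

Because the combination is a plain sum, a user can trace any combined vector back to the individual mutations that produced it, which is the kind of inspection the article describes.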
Other labs had already used DNNs to predict which switch flips would make useful changes to the three proteins in question, so the NIST team decided to pit LANTERN against the DNNs’ results. The new approach was not merely good enough; according to the team, it achieves a new state of the art in predictive accuracy for this type of problem.
“LANTERN matched or outperformed almost every alternative approach in predictive accuracy,” Tonner said. “It outperforms all other approaches in predicting changes to LacI, and it has comparable predictive accuracy for GFP for all but one. For SARS-CoV-2, it has higher predictive accuracy than all alternatives other than one type of DNN, which matched LANTERN’s accuracy but did not beat it.”
LANTERN determines which sets of switches have the greatest effect on a given attribute of the protein (its folding stability, for example) and summarizes how the user can tweak that attribute to achieve a desired effect. In a way, LANTERN transforms the many switches on our machine’s panel into a few simple dials.
“It reduces thousands of switches to maybe five small dials that you can turn,” Ross said. “It tells you that the first dial will have a big effect, the second will have a different but smaller effect, the third even smaller, and so on. So as an engineer, this tells me that I can focus on the first and second dial to get the result I need. LANTERN explains all this to me and is incredibly helpful.”
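The idea of ordering dials by the size of their effect can be sketched in Python. The article does not say how LANTERN ranks its dials, so the score used here (the mean absolute component along each dimension) is a stand-in assumption, and the mutation names and three-dimensional vectors are made up.

```python
# Hypothetical mutation vectors; each position in a tuple is one "dial".
EXAMPLE_VECTORS = {
    "m1": (1.0, 0.1, 0.4),
    "m2": (-2.0, 0.0, 0.5),
    "m3": (1.5, -0.2, -0.3),
}


def rank_dials(vectors):
    """Return (dial_index, score) pairs, largest score first.

    Assumption: a dial's score is the mean absolute component of all
    mutation vectors along that dimension. This is an illustrative
    stand-in, not the ranking used by LANTERN.
    """
    rows = list(vectors.values())
    n_dims = len(rows[0])
    scores = [
        sum(abs(row[d]) for row in rows) / len(rows)
        for d in range(n_dims)
    ]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


for dial, score in rank_dials(EXAMPLE_VECTORS):
    print(f"dial {dial}: mean absolute effect {score:.2f}")
```

With a ranking like this, an engineer could concentrate on the one or two highest-scoring dials, in the spirit of Ross’s description.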
Rajmonda Caceres, a scientist at MIT’s Lincoln Laboratory who is familiar with the method behind LANTERN, said she liked the interpretability of the tool.
“There are not many AI methods applied to biology that are explicitly designed for interpretability,” said Caceres, who is not affiliated with the NIST study. “When biologists see the results, they can see which mutation contributes to the change in the protein. This level of interpretation allows for more interdisciplinary research, so that biologists can understand how the algorithm learns and can generate further insights about the biological system.”
Tonner said that while he is pleased with the results, LANTERN is not a panacea for AI’s explainability problem. He said that exploring alternatives to DNNs more widely would benefit the overall effort to create explainable, trustworthy AI.
“In the context of predicting genetic effects on protein function, LANTERN is the first example of something that rivals DNNs in predictive power while still being fully interpretable,” Tonner said. “It provides a specific solution to a specific problem. We hope that it can be applied to others and that this work will inspire the development of new interpretable approaches. We do not want predictive AI to remain a black box.”