A new artificial intelligence (AI) tool can use concentration data to classify chemical reaction mechanisms, making predictions with 99.6% accuracy on realistically noisy data. Igor Larrosa and Jordi Burés from the University of Manchester have made the model freely available to help automate the discovery and development of organic reactions.
“Kinetic data contains far more information than chemists have traditionally been able to extract,” comments Larrosa. Deep-learning models “don’t just match what chemists could do with previous kinetic tools, they surpass them,” he argues.
Larrosa adds that chemistry is at a unique turning point for AI tools. The Manchester chemists therefore set out to design a model with features suited to reaction classification. Burés and Larrosa combined two different neural networks: first, a long short-term memory (LSTM) neural network tracks changes in concentration over time; a fully connected neural network then processes the LSTM’s output.
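The two-network combination can be sketched as follows. This is a minimal, illustrative forward pass only, not the authors' actual architecture: the layer sizes, weights, and the exponential-decay test profile are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical sizes: 1 input channel (a concentration), 16 hidden units,
# 20 output classes (one per candidate mechanism)
D, H, C = 1, 16, 20

# LSTM weights: one matrix per gate (input, forget, output, candidate),
# each acting on the concatenation [input; previous hidden state]
W = {g: rng.normal(0, 0.1, (H, D + H)) for g in "ifog"}
b = {g: np.zeros(H) for g in "ifog"}

# Fully connected head mapping the final hidden state to class logits
W_fc = rng.normal(0, 0.1, (C, H))
b_fc = np.zeros(C)

def classify(profile):
    """Run a concentration-vs-time profile through the LSTM, then the FC head."""
    h = np.zeros(H)  # hidden state
    c = np.zeros(H)  # cell state
    for conc in profile:
        x = np.concatenate(([conc], h))
        i = sigmoid(W["i"] @ x + b["i"])  # input gate
        f = sigmoid(W["f"] @ x + b["f"])  # forget gate
        o = sigmoid(W["o"] @ x + b["o"])  # output gate
        g = np.tanh(W["g"] @ x + b["g"])  # candidate cell update
        c = f * c + i * g
        h = o * np.tanh(c)
    return softmax(W_fc @ h + b_fc)      # one probability per mechanism

# A fake exponential-decay profile stands in for real kinetic data
probs = classify(np.exp(-0.1 * np.arange(50)))
```

With untrained random weights the output is close to uniform; training would shape these probabilities to favour the correct mechanism.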
The final model contains 576,000 trainable parameters. Parameters, Larrosa explains, represent “mathematical operations performed on the kinetic profile data”. These operations generate the probability that the data originates from each mechanism. “For comparison, AlphaFold uses 21 million parameters and GPT-3 175 billion,” he adds.
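To make the parameter count and the probability output concrete, here is a small sketch. The layer sizes are illustrative assumptions, not the ones that give the paper's 576,000 total; the formulas (four gates per LSTM layer, weights plus biases per layer) are standard.

```python
import numpy as np

# Hypothetical layer sizes, chosen only for illustration
D, H, C = 1, 16, 20   # input channels, hidden units, mechanism classes

# An LSTM layer has four gates, each with a weight matrix over
# [input; hidden state] plus a bias vector
lstm_params = 4 * ((D + H) * H + H)

# The fully connected head maps the hidden state to one logit per mechanism
fc_params = H * C + C

total = lstm_params + fc_params  # every one is a trainable parameter

# The final logits become mechanism probabilities via a softmax;
# uniform logits give a uniform 1/20 probability per mechanism
logits = np.zeros(C)
probs = np.exp(logits) / np.exp(logits).sum()
```

Scaling H and stacking layers is how such models reach hundreds of thousands of parameters.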
Catalytic Insights
Burés and Larrosa trained the model on 5 million simulated kinetic samples, each labelled with the one of 20 common catalytic reaction mechanisms it corresponds to. Once the model learns to recognise the features of the kinetic data associated with each reaction mechanism, it “applies those rules to new input kinetic data to classify it,” Burés says. The first of the twenty is the simplest catalytic mechanism, described by the Michaelis–Menten model. Burés and Larrosa classified the rest as mechanisms involving bicatalytic steps, catalyst activation steps, and catalyst deactivation steps, with the latter being the largest group.
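A simulated kinetic sample of the Michaelis–Menten kind can be generated by integrating its rate law numerically. The sketch below uses forward-Euler integration with illustrative rate constants (Vmax, Km, and the initial concentration are assumptions, not values from the paper):

```python
import numpy as np

# Michaelis-Menten parameters (illustrative values only)
Vmax, Km = 1.0, 0.5          # maximum rate, Michaelis constant
S0, dt, steps = 1.0, 0.01, 1000  # initial substrate conc., time step, step count

# Forward-Euler integration of d[S]/dt = -Vmax*[S] / (Km + [S])
S = np.empty(steps + 1)
S[0] = S0
for t in range(steps):
    rate = Vmax * S[t] / (Km + S[t])
    S[t + 1] = S[t] - dt * rate
# S is now a concentration-vs-time profile: substrate decays toward zero,
# roughly zero-order while [S] >> Km and first-order once [S] << Km
```

Repeating such simulations across mechanisms and rate constants, and labelling each profile with its generating mechanism, is the general recipe for building a training set like the one described above.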
Experimental data are inevitably noisy and difficult to interpret, Burés adds, so simulated data are needed for good classification performance. “Experimental data and the corresponding chemists’ conclusions should not be used for training, because the resulting model would at best be as accurate as the average chemist, and probably less accurate,” he says.
To test the trained model, Burés and Larrosa used more simulated data and found only 38 misclassifications in 100,000 samples. To more closely mimic real-world experiments, the chemists added noise to the data. This dropped the model’s accuracy to 99.6% for realistic levels of noise, and to 83% for what Larrosa calls “the ludicrous extremes of noisy data”.
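One simple way to corrupt clean simulated profiles for such a stress test is to add Gaussian noise scaled to each profile's range. The snippet below is a sketch under that assumption; the noise levels and the exponential-decay stand-in profile are illustrative, not the paper's actual settings.

```python
import numpy as np

rng = np.random.default_rng(42)

# A clean simulated kinetic profile (exponential decay stands in for real data)
t = np.linspace(0, 10, 100)
clean = np.exp(-0.5 * t)

def add_noise(profile, rel_sigma):
    """Add Gaussian noise scaled to the profile's range, mimicking measurement error."""
    scale = rel_sigma * (profile.max() - profile.min())
    return profile + rng.normal(0.0, scale, profile.shape)

realistic = add_noise(clean, 0.02)  # modest noise level (illustrative)
extreme = add_noise(clean, 0.30)    # very heavy noise level (illustrative)
```

Evaluating classification accuracy as rel_sigma grows then traces out how performance degrades from near-perfect on clean data to the noisier regimes described above.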
The chemists also applied the model to data from previously published experiments. “We can’t know the correct answers for these, but the model suggested chemically sound mechanisms,” says Larrosa. The results also provided new insights into how catalysts degrade in reactions such as ring-closing olefin metathesis and cycloadditions. “Understanding the catalyst’s decomposition pathways is very important to be able to reproduce the process,” Larrosa emphasises.
Marwin Segler of Microsoft Research AI4Science called the research “an amazing demonstration of how machine learning can help creative scientists make sense of nature and solve tough chemical problems”. “We need better tools like this to discover new reactions to make new drugs and materials, and to make chemistry greener,” he says. “It highlights how powerful simulated data can be for training, and we can expect a lot more.”