2024

Chen, Jiayuan; Guo, Kehan; Liu, Zhen; Isayev, Olexandr; Zhang, Xiangliang
Uncertainty-Aware Yield Prediction with Multimodal Molecular Features Journal Article
In: AAAI, vol. 38, no. 8, pp. 8274–8282, 2024, ISSN: 2374-3468.
Tags: Organic reactions
@article{Chen2024,
title = {Uncertainty-Aware Yield Prediction with Multimodal Molecular Features},
author = {Jiayuan Chen and Kehan Guo and Zhen Liu and Olexandr Isayev and Xiangliang Zhang},
doi = {10.1609/aaai.v38i8.28668},
issn = {2374-3468},
year = {2024},
date = {2024-03-25},
urldate = {2024-03-25},
journal = {AAAI},
volume = {38},
number = {8},
pages = {8274--8282},
publisher = {Association for the Advancement of Artificial Intelligence (AAAI)},
abstract = {Predicting chemical reaction yields is pivotal for efficient chemical synthesis, an area that focuses on the creation of novel compounds for diverse uses.
Yield prediction demands accurate representations of reactions for forecasting practical transformation rates. Yet, the uncertainty issues broadcasting in real-world situations prohibit current models to excel in this task owing to the high sensitivity of yield activities and the uncertainty in yield measurements. Existing models often utilize single-modal feature representations, such as molecular fingerprints, SMILES sequences, or molecular graphs, which is not sufficient to capture the complex interactions and dynamic behavior of molecules in reactions. In this paper, we present an advanced Uncertainty-Aware Multimodal model (UAM) to tackle these challenges. Our approach seamlessly integrates data sources from multiple modalities by encompassing sequence representations, molecular graphs, and expert-defined chemical reaction features for a comprehensive representation of reactions. Additionally, we address both the model and data-based uncertainty, refining the model's predictive capability. Extensive experiments on three datasets, including two high throughput experiment (HTE) datasets and one chemist-constructed Amide coupling reaction dataset, demonstrate that UAM outperforms the state-of-the-art methods. The code and used datasets are available at https://github.com/jychen229/Multimodal-reaction-yield-prediction.},
keywords = {Organic reactions},
pubstate = {published},
tppubtype = {article}
}

Zhang, Shuhao; Makoś, Małgorzata Z.; Jadrich, Ryan B.; Kraka, Elfi; Barros, Kipton; Nebgen, Benjamin T.; Tretiak, Sergei; Isayev, Olexandr; Lubbers, Nicholas; Messerly, Richard A.; Smith, Justin S.
Exploring the frontiers of condensed-phase chemistry with a general reactive machine learning potential Journal Article
In: Nat. Chem., 2024.
Tags: Active learning, ANI, Organic reactions
@article{Zhang2024,
title = {Exploring the frontiers of condensed-phase chemistry with a general reactive machine learning potential},
author = {Shuhao Zhang and Ma{\l}gorzata Z. Mako{\'s} and Ryan B. Jadrich and Elfi Kraka and Kipton Barros and Benjamin T. Nebgen and Sergei Tretiak and Olexandr Isayev and Nicholas Lubbers and Richard A. Messerly and Justin S. Smith},
doi = {10.1038/s41557-023-01427-3},
year = {2024},
date = {2024-03-07},
urldate = {2024-03-07},
journal = {Nat. Chem.},
publisher = {Springer Science and Business Media LLC},
abstract = {Atomistic simulation has a broad range of applications from drug design to materials discovery. Machine learning interatomic potentials (MLIPs) have become an efficient alternative to computationally expensive ab initio simulations. For this reason, chemistry and materials science would greatly benefit from a general reactive MLIP, that is, an MLIP that is applicable to a broad range of reactive chemistry without the need for refitting. Here we develop a general reactive MLIP (ANI-1xnr) through automated sampling of condensed-phase reactions. ANI-1xnr is then applied to study five distinct systems: carbon solid-phase nucleation, graphene ring formation from acetylene, biofuel additives, combustion of methane and the spontaneous formation of glycine from early earth small molecules. In all studies, ANI-1xnr closely matches experiment (when available) and/or previous studies using traditional model chemistry methods. As such, ANI-1xnr proves to be a highly general reactive MLIP for C, H, N and O elements in the condensed phase, enabling high-throughput in silico reactive chemistry experimentation.},
keywords = {Active learning, ANI, Organic reactions},
pubstate = {published},
tppubtype = {article}
}
2023

Zhao, Qiyuan; Anstine, Dylan M.; Isayev, Olexandr; Savoie, Brett M.
Δ² machine learning for reaction property prediction Journal Article
In: Chem. Sci., vol. 14, no. 46, pp. 13392–13401, 2023.
Tags: AIMNet, Machine learning potential, Organic reactions
@article{Zhao2023b,
title = {$\Delta^{2}$ machine learning for reaction property prediction},
author = {Qiyuan Zhao and Dylan M. Anstine and Olexandr Isayev and Brett M. Savoie},
doi = {10.1039/d3sc02408c},
year = {2023},
date = {2023-11-29},
urldate = {2023-11-29},
journal = {Chem. Sci.},
volume = {14},
number = {46},
pages = {13392--13401},
publisher = {Royal Society of Chemistry (RSC)},
abstract = {The emergence of Δ-learning models, whereby machine learning (ML) is used to predict a correction to a low-level energy calculation, provides a versatile route to accelerate high-level energy evaluations at a given geometry. However, Δ-learning models are inapplicable to reaction properties like heats of reaction and activation energies that require both a high-level geometry and energy evaluation. Here, a Δ2-learning model is introduced that can predict high-level activation energies based on low-level critical-point geometries. The Δ2 model uses an atom-wise featurization typical of contemporary ML interatomic potentials (MLIPs) and is trained on a dataset of ∼167 000 reactions, using the GFN2-xTB energy and critical-point geometry as a low-level input and the B3LYP-D3/TZVP energy calculated at the B3LYP-D3/TZVP critical point as a high-level target. The excellent performance of the Δ2 model on unseen reactions demonstrates the surprising ease with which the model implicitly learns the geometric deviations between the low-level and high-level geometries that condition the activation energy prediction. The transferability of the Δ2 model is validated on several external testing sets where it shows near chemical accuracy, illustrating the benefits of combining ML models with readily available physical-based information from semi-empirical quantum chemistry calculations. Fine-tuning of the Δ2 model on a small number of Gaussian-4 calculations produced a 35% accuracy improvement over DFT activation energy predictions while retaining xTB-level cost. The Δ2 model approach proves to be an efficient strategy for accelerating chemical reaction characterization with minimal sacrifice in prediction accuracy.},
keywords = {AIMNet, Machine learning potential, Organic reactions},
pubstate = {published},
tppubtype = {article}
}

Liu, Zhen; Moroz, Yurii S.; Isayev, Olexandr
The challenge of balancing model sensitivity and robustness in predicting yields: a benchmarking study of amide coupling reactions Journal Article
In: Chem. Sci., vol. 14, no. 39, pp. 10835–10846, 2023.
Tags: AIMNet, Organic reactions
@article{Liu2023,
title = {The challenge of balancing model sensitivity and robustness in predicting yields: a benchmarking study of amide coupling reactions},
author = {Zhen Liu and Yurii S. Moroz and Olexandr Isayev},
doi = {10.1039/d3sc03902a},
year = {2023},
date = {2023-10-11},
urldate = {2023-10-11},
journal = {Chem. Sci.},
volume = {14},
number = {39},
pages = {10835--10846},
publisher = {Royal Society of Chemistry (RSC)},
abstract = {Accurate prediction of reaction yield is the holy grail for computer-assisted synthesis prediction, but current models have failed to generalize to large literature datasets. To understand the causes and inspire future design, we systematically benchmarked the yield prediction task. We carefully curated and augmented a literature dataset of 41 239 amide coupling reactions, each with information on reactants, products, intermediates, yields, and reaction contexts, and provided 3D structures for the molecules. We calculated molecular features related to 2D and 3D structure information, as well as physical and electronic properties. These descriptors were paired with 4 categories of machine learning methods (linear, kernel, ensemble, and neural network), yielding valuable benchmarks about feature and model performance. Despite the excellent performance on a high-throughput experiment (HTE) dataset (R2 around 0.9), no method gave satisfactory results on the literature data. The best performance was an R2 of 0.395 ± 0.020 using the stack technique. Error analysis revealed that reactivity cliff and yield uncertainty are among the main reasons for incorrect predictions. Removing reactivity cliffs and uncertain reactions boosted the R2 to 0.457 ± 0.006. These results highlight that yield prediction models must be sensitive to the reactivity change due to the subtle structure variance, as well as be robust to the uncertainty associated with yield measurements.},
keywords = {AIMNet, Organic reactions},
pubstate = {published},
tppubtype = {article}
}

Zhao, Qiyuan; Vaddadi, Sai Mahit; Woulfe, Michael; Ogunfowora, Lawal A.; Garimella, Sanjay S.; Isayev, Olexandr; Savoie, Brett M.
Comprehensive exploration of graphically defined reaction spaces Journal Article
In: Sci Data, vol. 10, pp. 145, 2023.
Tags: dataset, Organic reactions
@article{Zhao2023,
title = {Comprehensive exploration of graphically defined reaction spaces},
author = {Qiyuan Zhao and Sai Mahit Vaddadi and Michael Woulfe and Lawal A. Ogunfowora and Sanjay S. Garimella and Olexandr Isayev and Brett M. Savoie},
doi = {10.1038/s41597-023-02043-z},
year = {2023},
date = {2023-03-15},
urldate = {2023-03-15},
journal = {Sci Data},
volume = {10},
pages = {145},
publisher = {Springer Science and Business Media LLC},
abstract = {Existing reaction transition state (TS) databases are comparatively small and lack chemical diversity. Here, this data gap has been addressed using the concept of a graphically-defined model reaction to comprehensively characterize a reaction space associated with C, H, O, and N containing molecules with up to 10 heavy (non-hydrogen) atoms. The resulting dataset is composed of 176,992 organic reactions possessing at least one validated TS, activation energy, heat of reaction, reactant and product geometries, frequencies, and atom-mapping. For 33,032 reactions, more than one TS was discovered by conformational sampling, allowing conformational errors in TS prediction to be assessed. Data is supplied at the GFN2-xTB and B3LYP-D3/TZVP levels of theory. A subset of reactions were recalculated at the CCSD(T)-F12/cc-pVDZ-F12 and ωB97X-D2/def2-TZVP levels to establish relative errors. The resulting collection of reactions and properties are called the Reaction Graph Depth 1 (RGD1) dataset. RGD1 represents the largest and most chemically diverse TS dataset published to date and should find immediate use in developing novel machine learning models for predicting reaction properties.},
keywords = {dataset, Organic reactions},
pubstate = {published},
tppubtype = {article}
}