Outsmarting Quantum Chemistry through Transfer Learning

Computer simulations are foundational to theoretical chemistry. Quantum-mechanical (QM) methods provide the highest accuracy for simulating molecules but have difficulty scaling to large systems. Empirical interatomic potentials (classical force fields) are scalable, but lack transferability to new systems and are hard to systematically improve. Automated, data-driven machine learning is close to achieving the best of both approaches. Here we use transfer learning to retrain a general purpose neural network potential, ANI-1x, on a dataset of gold standard QM calculations (CCSD(T)/CBS level) that is relatively small but designed to optimally span chemical space. The resulting potential, ANI-1ccx, approaches CCSD(T)/CBS accuracy on benchmarks for reaction thermochemistry, isomerization, and drug-like molecular torsions. ANI-1ccx is broadly applicable to materials science, biology and chemistry, and billions of times faster than the parent CCSD(T)/CBS calculations.

Read the ChemRxiv preprint for full details.

Transfer learning starts from a pretrained ANI-1x DFT model, then retrains it on higher-accuracy CCSD(T)*/CBS data with some parameters held fixed during training.
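The idea of retraining with some parameters held fixed can be illustrated with a toy model. The sketch below is not the ANI-1ccx training code: the one-hidden-unit network, the parameter names (w1, b1, w2, b2), the learning rate, and the synthetic "high-accuracy" data are all assumptions made for illustration. It only shows the mechanism of updating the unfrozen parameters while the frozen ones keep their pretrained values.

```python
import math


def predict(params, x):
    """Toy one-hidden-unit network: y = w2 * tanh(w1 * x + b1) + b2."""
    hidden = math.tanh(params["w1"] * x + params["b1"])
    return params["w2"] * hidden + params["b2"]


def mse(params, data):
    """Mean squared error of the model over (x, y) pairs."""
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)


def retrain(params, data, frozen, lr=0.05, steps=2000):
    """Plain gradient descent that skips every parameter named in `frozen`."""
    params = dict(params)  # copy, so the pretrained values stay untouched
    n = len(data)
    for _ in range(steps):
        grads = {name: 0.0 for name in params}
        for x, y in data:
            h = math.tanh(params["w1"] * x + params["b1"])
            err = params["w2"] * h + params["b2"] - y
            d_pre = 2.0 * err * params["w2"] * (1.0 - h * h)
            grads["w2"] += 2.0 * err * h / n
            grads["b2"] += 2.0 * err / n
            grads["w1"] += d_pre * x / n
            grads["b1"] += d_pre / n
        for name in params:
            if name not in frozen:  # frozen parameters are never updated
                params[name] -= lr * grads[name]
    return params


# Hypothetical stand-in for a model pretrained on abundant lower-accuracy data.
pretrained = {"w1": 1.0, "b1": 0.0, "w2": 1.0, "b2": 0.0}

# Hypothetical small "high-accuracy" set: a scaled and shifted target curve.
high_accuracy = [(i / 4.0, 1.1 * math.tanh(i / 4.0) + 0.05) for i in range(-8, 9)]

# Keep the hidden layer fixed; retrain only the output layer.
retrained = retrain(pretrained, high_accuracy, frozen={"w1", "b1"})

print("MSE before retraining:", mse(pretrained, high_accuracy))
print("MSE after retraining: ", mse(retrained, high_accuracy))
```

Freezing part of the model reduces the number of parameters that the small data set has to constrain, which is the usual motivation for this kind of transfer learning.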

Accuracy in predicting reaction and isomerization energy differences on the (a) HC7/11 and (b) ISOL6 benchmarks, relative to CCSD(T)/CBS. Methods compared are the ANI-1ccx transfer learning potential, ANI-1x trained only on DFT data, the DFT reference (ωB97X-D3), and our coupled cluster extrapolation scheme CCSD(T)*/CBS.
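As a rough illustration of how such a comparison can be scored, the sketch below computes a reaction energy difference and two common error statistics against a reference. The caption does not say which statistic is plotted, so mean absolute error and root mean square error are assumptions here, as are the function names and the made-up energy values (which are not benchmark data).

```python
import math


def reaction_energy(reactant_energies, product_energies):
    """Energy difference of a reaction: sum over products minus sum over reactants."""
    return sum(product_energies) - sum(reactant_energies)


def mean_absolute_error(predicted, reference):
    """Average absolute deviation of predictions from the reference values."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


def root_mean_square_error(predicted, reference):
    """Square root of the mean squared deviation from the reference values."""
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(reference))


# Made-up energy differences in kcal/mol, for illustration only.
reference_differences = [10.0, -4.0, 2.5]  # stands in for the reference method
model_differences = [11.0, -5.0, 2.5]      # stands in for a model under test

mae = mean_absolute_error(model_differences, reference_differences)
rmse = root_mean_square_error(model_differences, reference_differences)
print("MAE: ", mae)
print("RMSE:", rmse)
```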

Accuracy in predicting atomization energies Ea on the GDB-10to13 benchmark relative to CCSD(T)*/CBS.
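For readers unfamiliar with the quantity, an atomization energy can be computed from a molecule's total energy and the energies of its isolated atoms. The sketch below assumes the sign convention in which Ea is the energy required to separate the molecule into atoms; the opposite sign is also in use. The function name and all numerical values are hypothetical and do not come from any calculation.

```python
from collections import Counter


def atomization_energy(molecular_energy, elements, isolated_atom_energies):
    """Ea = (sum of isolated-atom energies) - (total molecular energy).

    Assumed sign convention: a bound molecule gives a positive Ea.
    """
    counts = Counter(elements)
    atoms_total = sum(n * isolated_atom_energies[el] for el, n in counts.items())
    return atoms_total - molecular_energy


# Hypothetical energies in Hartree, chosen only to make the arithmetic easy.
isolated = {"H": -0.5, "C": -37.8}
methane_like = ["C", "H", "H", "H", "H"]

ea = atomization_energy(-40.5, methane_like, isolated)
print("Atomization energy (Hartree):", ea)
```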

Accelerating QM with Neural Networks

Overall, this work builds on the ANAKIN-ME method used to develop the ANI-1 potential [1], which is, to the best of our knowledge, the first example of an extensible and transferable (universal) ML atomistic potential for organic molecules. To train such models, we also developed a data set of 22 million structural conformations (conformers) from ~60K distinct organic molecules [2]. This data set was built by exhaustively sampling the subset of the GDB-11 data set containing between one and eight C, N, and O atoms.

References:

  1. Smith, J. S., Isayev, O. & Roitberg, A. E. ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost. Chem. Sci. 8, 3192–3203 (2017).
  2. Smith, J. S., Isayev, O. & Roitberg, A. E. Data Descriptor: ANI-1, A data set of 20 million calculated off-equilibrium conformations for organic molecules. Sci. Data 4, (2017).