Synthetic data enable experiments in atomistic machine learning

John L. A. Gardner; Zoé Faure Beaulieu; Volker L. Deringer

doi:10.1039/D2DD00137C

John L. A. Gardner,^a Zoé Faure Beaulieu^a and Volker L. Deringer*^a

Author affiliations

* Corresponding authors

^a Department of Chemistry, Inorganic Chemistry Laboratory, University of Oxford, Oxford OX1 3QR, UK
E-mail: volker.deringer@chem.ox.ac.uk

Abstract

Machine-learning models are increasingly used to predict properties of atoms in chemical systems. There have been major advances in developing descriptors and regression frameworks for this task, typically starting from (relatively) small sets of quantum-mechanical reference data. Larger datasets of this kind are becoming available, but remain expensive to generate. Here we demonstrate the use of a large dataset that we have “synthetically” labelled with per-atom energies from an existing ML potential model. The cheapness of this process, compared to the quantum-mechanical ground truth, allows us to generate millions of datapoints, in turn enabling rapid experimentation with atomistic ML models from the small- to the large-data regime. This approach allows us here to compare regression frameworks in depth, and to explore visualisation based on learned representations. We also show that learning synthetic data labels can be a useful pre-training task for subsequent fine-tuning on small datasets. In the future, we expect that our open-sourced dataset, and similar ones, will be useful in rapidly exploring deep-learning models in the limit of abundant chemical data.
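The pre-training idea in the abstract can be illustrated with a deliberately small toy example. The sketch below is not the authors' model or dataset: it assumes a plain linear regressor, random 8-dimensional "descriptors", and two hypothetical labelling functions (`cheap_label` standing in for an existing ML potential, `reference_label` standing in for the expensive quantum-mechanical ground truth). It only shows the general pattern of pre-training on abundant synthetic labels and then fine-tuning on a handful of reference labels.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def fit(w, xs, ys, steps, lr=0.1):
    """Full-batch gradient descent on mean squared error, starting from w."""
    w = list(w)
    n = len(xs)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = dot(w, x) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def mse(w, xs, ys):
    return sum((dot(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
DIM = 8  # assumed descriptor length, chosen only for illustration

def sample(n):
    """Random toy descriptors; real atomic descriptors would go here."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(n)]

# Hypothetical labellers: the cheap one is close to, but not equal to,
# the expensive reference (both are made-up linear functions).
W_CHEAP = [1.0, -0.8, 0.6, -0.4, 0.2, 0.5, -0.7, 0.9]
W_REF = [w + 0.05 for w in W_CHEAP]

def cheap_label(x):
    return dot(W_CHEAP, x)

def reference_label(x):
    return dot(W_REF, x)

# Abundant synthetic data, scarce reference data, held-out test data.
x_syn = sample(500)
y_syn = [cheap_label(x) for x in x_syn]
x_small = sample(5)
y_small = [reference_label(x) for x in x_small]
x_test = sample(200)
y_test = [reference_label(x) for x in x_test]

zeros = [0.0] * DIM
pretrained = fit(zeros, x_syn, y_syn, steps=200)          # pre-train on synthetic labels
fine_tuned = fit(pretrained, x_small, y_small, steps=200)  # fine-tune on few reference labels
scratch = fit(zeros, x_small, y_small, steps=200)          # baseline: no pre-training

fine_tuned_mse = mse(fine_tuned, x_test, y_test)
scratch_mse = mse(scratch, x_test, y_test)
print("fine-tuned test MSE:", fine_tuned_mse)
print("from-scratch test MSE:", scratch_mse)
```

With only five reference points for eight parameters, the from-scratch model is underdetermined, whereas the pre-trained model starts close to the reference function. This is the small-data situation in which a synthetic pre-training task would be expected to help in this toy setting.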

Article information

DOI: https://doi.org/10.1039/D2DD00137C
Article type: Paper
Submitted: 09 Dec 2022
Accepted: 20 Mar 2023
First published: 20 Mar 2023

This article is Open Access

Digital Discovery, 2023, 2, 651-662

J. L. A. Gardner, Z. Faure Beaulieu and V. L. Deringer, Digital Discovery, 2023, 2, 651 DOI: 10.1039/D2DD00137C

This article is licensed under a Creative Commons Attribution 3.0 Unported Licence. You can use material from this article in other publications without requesting further permissions from the RSC, provided that the correct acknowledgement is given.
