Improving machine learning performance on small chemical reaction data with unsupervised contrastive pretraining

Mingjian Wen; Samuel M. Blau; Xiaowei Xie; Shyam Dwaraknath; Kristin A. Persson

doi:10.1039/D1SC06515G

Mingjian Wen,^a Samuel M. Blau,^a Xiaowei Xie,^bc Shyam Dwaraknath^d and Kristin A. Persson*^ef

Author affiliations

* Corresponding authors

^a Energy Technologies Area, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA

^b College of Chemistry, University of California, Berkeley, CA 94720, USA

^c Materials Science Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA

^d Luxembourg Institute of Science and Technology, Luxembourg

^e Department of Materials Science and Engineering, University of California, Berkeley, CA 94720, USA

^f Molecular Foundry, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
E-mail: kapersson@lbl.gov

Abstract

Machine learning (ML) methods have great potential to transform chemical discovery by accelerating the exploration of chemical space and drawing scientific insights from data. However, modern chemical reaction ML models, such as those based on graph neural networks (GNNs), must be trained on a large amount of labelled data to avoid overfitting, which would otherwise lead to low accuracy and poor transferability. In this work, we propose a strategy to leverage unlabelled data to learn accurate ML models for small labelled chemical reaction data. We focus on an old and prominent problem, classifying reactions into distinct families, and build a GNN model for this task. We first pretrain the model on unlabelled reaction data using unsupervised contrastive learning and then fine-tune it on a small number of labelled reactions. The contrastive pretraining learns by making the representations of two augmented versions of a reaction similar to each other but distinct from those of other reactions. We propose chemically consistent reaction augmentation methods that protect the reaction center and find that they are key for the model to extract relevant information from unlabelled data to aid the reaction classification task. The transfer-learned model outperforms a supervised model trained from scratch by a large margin. Further, it consistently performs better than models based on traditional rule-driven reaction fingerprints, which have long been the default choice for small datasets, as well as those based on reaction fingerprints derived from masked language modelling. In addition to reaction classification, the effectiveness of the strategy is tested on regression datasets; the learned GNN-based reaction fingerprints can also be used to navigate the chemical reaction space, which we demonstrate by querying for similar reactions. The strategy can be readily applied to other predictive reaction problems to uncover the power of unlabelled data for learning better models with a limited supply of labels.
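The contrastive objective described in the abstract can be illustrated with a small sketch. The abstract does not name the loss function, so the code below assumes a SimCLR-style normalized temperature-scaled cross-entropy (NT-Xent) loss. The function name `nt_xent_loss`, the `temperature` value, and the random vectors standing in for GNN reaction embeddings are all hypothetical. Plain Python is used for readability; actual pretraining would need an automatic differentiation framework.

```python
import math
import random


def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def nt_xent_loss(view_a, view_b, temperature=0.1):
    """NT-Xent loss (assumed here, not confirmed by the abstract).

    view_a[i] and view_b[i] are embeddings of two augmented versions of
    reaction i. Each embedding is pulled towards its partner and pushed
    away from the embeddings of all other reactions in the batch.
    """
    n = len(view_a)
    z = [_normalize(v) for v in list(view_a) + list(view_b)]
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same reaction
        # Scaled similarities to every embedding except the anchor itself.
        logits = {
            k: sum(p * q for p, q in zip(z[i], z[k])) / temperature
            for k in range(2 * n)
            if k != i
        }
        # Numerically stable log-sum-exp over the positive and all negatives.
        peak = max(logits.values())
        log_denominator = peak + math.log(
            sum(math.exp(value - peak) for value in logits.values())
        )
        total += log_denominator - logits[pos]
    return total / (2 * n)


# Toy data: random vectors stand in for learned reaction embeddings.
random.seed(0)
reactions = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
augmented = [[x + 0.01 * random.gauss(0, 1) for x in v] for v in reactions]
unrelated = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]

print("loss, matching views: ", nt_xent_loss(reactions, augmented))
print("loss, unrelated views:", nt_xent_loss(reactions, unrelated))
```

In this toy setting the loss is lower when the second view really is a slightly perturbed copy of the first, which is the behaviour the pretraining step is meant to encourage. The small random perturbation is only a placeholder for the chemically consistent augmentations that the abstract describes.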

This article is part of the themed collections "Machine Learning and Artificial Intelligence: A cross-journal collection" and "Most popular 2022 physical and theoretical chemistry articles".

Article information

DOI: https://doi.org/10.1039/D1SC06515G
Article type: Edge Article
Submitted: 22 Nov 2021
Accepted: 09 Jan 2022
First published: 11 Jan 2022
This article is Open Access

All publication charges for this article have been paid for by the Royal Society of Chemistry

Chem. Sci., 2022, 13, 1446-1458

Permissions

M. Wen, S. M. Blau, X. Xie, S. Dwaraknath and K. A. Persson, Chem. Sci., 2022, 13, 1446, DOI: 10.1039/D1SC06515G

This article is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported Licence. You can use material from this article in other publications, without requesting further permission from the RSC, provided that the correct acknowledgement is given and it is not used for commercial purposes.

To request permission to reproduce material from this article in a commercial publication, please go to the Copyright Clearance Center request page.

If you are an author contributing to an RSC publication, you do not need to request permission provided correct acknowledgement is given.

If you are the author of this article, you do not need to request permission to reproduce figures and diagrams provided correct acknowledgement is given. If you want to reproduce the whole article in a third-party commercial publication (excluding your thesis/dissertation for which permission is not required) please go to the Copyright Clearance Center request page.

Chemical Science
