Identifying and embedding transferability in data-driven representations of chemical space

Tim Gould; Bun Chan; Stephen G. Dale; Stefan Vuckovic

doi:10.1039/D4SC02358G

Tim Gould,^a Bun Chan,^b Stephen G. Dale^ac and Stefan Vuckovic*^d

Author affiliations

* Corresponding authors

^a Queensland Micro- and Nanotechnology Centre, Griffith University, Nathan, Qld 4111, Australia

^b Graduate School of Engineering, Nagasaki University, Bunkyo 1-14, Nagasaki 852-8521, Japan

^c Institute of Functional Intelligent Materials, National University of Singapore, 4 Science Drive 2, Singapore

^d Department of Chemistry, University of Fribourg, Fribourg, Switzerland
E-mail: stefan.vuckovic@unifr.ch

Abstract

Transferability, especially in the context of model generalization, is a paradigm of all scientific disciplines. However, the rapid advancement of machine-learned model development threatens this paradigm, as it can be difficult to understand how transferability is embedded (or missed) in complex models developed using large training data sets. Two related open problems are how to identify, without relying on human intuition, what makes training data transferable, and how to embed transferability into training data. To solve both problems for ab initio chemical modelling, an indispensable tool in everyday chemistry research, we introduce a transferability assessment tool (TAT) and demonstrate it on a controllable data-driven model for developing density functional approximations (DFAs). We reveal that human intuition in the curation of training data introduces chemical biases that can hamper the transferability of data-driven DFAs. We use our TAT to motivate three transferability principles, one of which introduces the key concept of transferable diversity. Finally, we propose data curation strategies for general-purpose machine learning models in chemistry that identify and embed the transferability principles.

Article information

DOI: https://doi.org/10.1039/D4SC02358G
Article type: Edge Article
Submitted: 10 Apr 2024
Accepted: 02 Jun 2024
First published: 21 Jun 2024
This article is Open Access

All publication charges for this article have been paid for by the Royal Society of Chemistry

Chem. Sci., 2024, 15, 11122-11133

T. Gould, B. Chan, S. G. Dale and S. Vuckovic, Chem. Sci., 2024, 15, 11122 DOI: 10.1039/D4SC02358G

This article is licensed under a Creative Commons Attribution 3.0 Unported Licence. You can use material from this article in other publications without requesting further permissions from the RSC, provided that the correct acknowledgement is given.
