Themed collection Data-driven discovery in the chemical sciences
List of participants
Faraday Discuss., 2025,256, 696-699
https://doi.org/10.1039/D4FD90072C
Poster list
Faraday Discuss., 2025,256, 691-695
https://doi.org/10.1039/D4FD90071E
Concluding remarks: Faraday Discussion on data-driven discovery in the chemical sciences
Some brief introductory comments on the history of this field are provided, along with some personal views on the discussion topics covered, concluding with three future challenges for this area.
Faraday Discuss., 2025,256, 664-690
https://doi.org/10.1039/D4FD00174E
Large property models: a new generative machine-learning formulation for molecules
We have built the first transformers trained on the property-to-molecular-graph task, which we dub “large property models”. A key ingredient is supplementing these models during training with relatively basic but abundant chemical property data.
Faraday Discuss., 2025,256, 104-119
https://doi.org/10.1039/D4FD00113C
Analysis of uncertainty of neural fingerprint-based models
Assessment of uncertainty estimates of neural fingerprint-based models by comparing deep learning-based models with combinations of neural fingerprints and classical machine learning algorithms that employ established uncertainty calibration methods.
Faraday Discuss., 2025,256, 551-567
https://doi.org/10.1039/D4FD00095A
Spiers Memorial Lecture: How to do impactful research in artificial intelligence for chemistry and materials science
We discuss how machine learning researchers view and approach problems in chemistry and provide our considerations for maximizing impact when researching machine learning for chemistry.
Faraday Discuss., 2025,256, 10-60
https://doi.org/10.1039/D4FD00153B
Making the InChI FAIR and sustainable while moving to inorganics
The InChI standard facilitates chemical compound identification across platforms, with v1.07 fixing numerous issues and enhancing transparency via GitHub. This update aims to better represent molecular inorganic compounds, addressing previous limitations.
Faraday Discuss., 2025,256, 503-519
https://doi.org/10.1039/D4FD00145A
Prediction rigidities for data-driven chemistry
We demonstrate the wide utility of prediction rigidities, a family of metrics derived from the loss function, in understanding the robustness of machine learning (ML) model predictions.
Faraday Discuss., 2025,256, 322-344
https://doi.org/10.1039/D4FD00101J
Specialising and analysing instruction-tuned and byte-level language models for organic reaction prediction
We evaluate FlanT5 and ByT5 across tokenisation, pretraining, finetuning and inference and benchmark their impact on organic reaction prediction tasks.
Faraday Discuss., 2025,256, 413-433
https://doi.org/10.1039/D4FD00104D
A critical reflection on attempts to machine-learn materials synthesis insights from text-mined literature recipes
Machine-learned regression or classification models built from historical materials synthesis datasets have limited utility in guiding the predictive synthesis of novel materials, but anomalous recipes can inspire surprising new synthesis strategies.
Faraday Discuss., 2025,256, 614-638
https://doi.org/10.1039/D4FD00112E
Data-efficient fine-tuning of foundational models for first-principles quality sublimation enthalpies
We present an accurate and data-efficient protocol for fine-tuning the MACE-MP-0 foundational model for a given system. Our model achieves kJ/mol in predicting sublimation enthalpies and below 1% error in the density of ice polymorphs.
Faraday Discuss., 2025,256, 120-138
https://doi.org/10.1039/D4FD00107A
Beyond theory-driven discovery: introducing hot random search and datum-derived structures
Long high-temperature anneals driven by Ephemeral Data-Derived Potentials (EDDPs) and combined with AIRSS, termed hot-AIRSS, enable the exploration of low-energy configurations of complex materials.
Faraday Discuss., 2025,256, 61-84
https://doi.org/10.1039/D4FD00134F
Re-evaluating retrosynthesis algorithms with Syntheseus
Syntheseus provides reference models and search algorithms as well as metrics to evaluate and improve synthesis planning tools.
Faraday Discuss., 2025,256, 568-586
https://doi.org/10.1039/D4FD00093E
Sequence determinants of protein phase separation and recognition by protein phase-separated condensates through molecular dynamics and active learning
We investigate three related questions: can we identify the sequence determinants that lead to protein self-interactions and phase separation? Can we understand and design new sequences that selectively bind to protein condensates? Can we design multiphasic condensates?
Faraday Discuss., 2025,256, 235-254
https://doi.org/10.1039/D4FD00099D
Modelling ligand exchange in metal complexes with machine learning potentials
We introduce a strategy to train machine learning potentials using MACE, an equivariant message-passing neural network, for metal–ligand complexes in explicit solvents.
Faraday Discuss., 2025,256, 156-176
https://doi.org/10.1039/D4FD00140K
Web-BO: towards increased accessibility of Bayesian optimisation (BO) for chemistry
Improving accessibility of data-driven optimisation for chemical tasks via a graphical user interface.
Faraday Discuss., 2025,256, 221-234
https://doi.org/10.1039/D4FD00109E
Discovery of highly anisotropic dielectric crystals with equivariant graph neural networks
We adopt the latest approaches in equivariant graph neural networks to develop a model that can predict the full dielectric tensor of crystals, discovering crystals with almost isotropic connectivity but highly anisotropic dielectric tensors.
Faraday Discuss., 2025,256, 255-274
https://doi.org/10.1039/D4FD00096J
Accurate and reliable thermochemistry by data analysis of complex thermochemical networks using Active Thermochemical Tables: the case of glycine thermochemistry
Active Thermochemical Tables (ATcT) are employed to resolve existing inconsistencies surrounding the thermochemistry of glycine and produce accurate enthalpies of formation for this system.
Faraday Discuss., 2025,256, 345-372
https://doi.org/10.1039/D4FD00110A
Knowledge distillation of neural network potential for molecular crystals
Knowledge distillation improved the neural network potential for organic molecular crystals.
Faraday Discuss., 2025,256, 139-155
https://doi.org/10.1039/D4FD00090K
Embedding human knowledge in material screening pipeline as filters to identify novel synthesizable inorganic materials
A set of 6 filters based on chemical rules, human intuition, and practical constraints is developed to screen for synthesizable compounds. When applied to over 100 000 generated compounds in 60 phase diagrams, 27 are identified as possibly.
Faraday Discuss., 2025,256, 587-600
https://doi.org/10.1039/D4FD00120F
How big is big data?
The advent of larger datasets in materials science poses unique challenges in modeling, infrastructure, and data diversity and quality.
Faraday Discuss., 2025,256, 483-502
https://doi.org/10.1039/D4FD00102H
Optical materials discovery and design with federated databases and machine learning
New hypothetical compounds are reported in a collection of online databases. By combining active learning with density-functional theory calculations, this work screens through such databases for materials with optical applications.
Faraday Discuss., 2025,256, 459-482
https://doi.org/10.1039/D4FD00092G
Leveraging natural language processing to curate the tmCAT, tmPHOTO, tmBIO, and tmSCO datasets of functional transition metal complexes
Leveraging natural language processing models including transformers, we curate four distinct datasets: tmCAT for catalysis, tmPHOTO for photophysical activity, tmBIO for biological relevance, and tmSCO for magnetism.
Faraday Discuss., 2025,256, 275-303
https://doi.org/10.1039/D4FD00087K
Are we fitting data or noise? Analysing the predictive power of commonly used datasets in drug-, materials-, and molecular-discovery
We derive maximum and realistic performance bounds based on experimental errors for commonly used machine learning (ML) datasets for regression and classification and compare them to the reported performance of ML models.
Faraday Discuss., 2025,256, 304-321
https://doi.org/10.1039/D4FD00091A
Predictive crystallography at scale: mapping, validating, and learning from 1000 crystal energy landscapes
We demonstrate the reliability and scalability of computational crystal structure prediction (CSP) methods for small, rigid organic molecules by performing in-depth CSP investigations for over 1000 such compounds.
Faraday Discuss., 2025,256, 434-458
https://doi.org/10.1039/D4FD00105B
Integration of generative machine learning with the heuristic crystal structure prediction code FUSE
We integrate generative machine learning with heuristic crystal structure prediction in FUSE. The combined result shows superior performance over both components, accelerating the pace at which we will be able to predict and discover new compounds.
Faraday Discuss., 2025,256, 85-103
https://doi.org/10.1039/D4FD00094C
Mapping inorganic crystal chemical space
We enumerate binary, ternary, and quaternary element and species combinations and present a two-dimensional representation of inorganic crystal chemical space, labelled according to whether the combinations pass standard chemical filters and whether they appear in known databases.
Faraday Discuss., 2025,256, 601-613
https://doi.org/10.1039/D4FD00063C
Discovering synthesis targets: general discussion
Faraday Discuss., 2025,256, 639-663
https://doi.org/10.1039/D4FD90064B
Discovering trends in big data: general discussion
Faraday Discuss., 2025,256, 520-550
https://doi.org/10.1039/D4FD90063D
Discovering structure–property correlations: general discussion
Faraday Discuss., 2025,256, 373-412
https://doi.org/10.1039/D4FD90062F
Discovering chemical structure: general discussion
Faraday Discuss., 2025,256, 177-220
https://doi.org/10.1039/D4FD90061H
About this collection
We are delighted to share with you a selection of the papers associated with a Faraday Discussion on Data-driven discovery in the chemical sciences. More information about the related event may be found here: http://rsc.li/data-fd2024. Additional articles will be added to the collection as they are published. The final versions of all the articles presented and a record of the discussions will be published after the event.
The Discussion will involve four central themes – each focused on different aspects of chemical "discovery", and each aiming to promote the exchange of ideas between the molecular and materials communities: Discovering chemical structure, Discovering structure–property correlations, Discovering synthesis targets, Discovering trends in big data.
On behalf of the Scientific Committee, we hope you join us and participate in this exciting event, and that you enjoy these articles and the record of the discussion.