Seoin Back*a, Alán Aspuru-Guzik†bc, Michele Ceriottid, Ganna Gryn'ovaef, Bartosz Grzybowskighi, Geun Ho Guj, Jason Heink, Kedar Hippalgaonkarlm, Rodrigo Hormázabaln, Yousung Jung†op, Seonah Kimq, Woo Youn Kimr, Seyed Mohamad Moosavis, Juhwan Noht, Changyoung Parkn, Joshua Schrieru, Philippe Schwallerv, Koji Tsudawxy, Tejs Veggez, O. Anatole von Lilienfeld†caaab and Aron Walshacad
aDepartment of Chemical and Biomolecular Engineering, Institute of Emergent Materials, Sogang University, Seoul, Republic of Korea. E-mail: sback@sogang.ac.kr
bDepartments of Chemistry, Computer Science, University of Toronto, St. George Campus, Toronto, ON, Canada
cAcceleration Consortium and Vector Institute for Artificial Intelligence, Toronto, ON M5S 1M1, Canada
dLaboratory of Computational Science and Modeling (COSMO), École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
eHeidelberg Institute for Theoretical Studies (HITS gGmbH), 69118, Heidelberg, Germany
fInterdisciplinary Center for Scientific Computing, Heidelberg University, 69120, Heidelberg, Germany
gCenter for Algorithmic and Robotized Synthesis (CARS), Institute for Basic Science (IBS), Ulsan, Republic of Korea
hInstitute of Organic Chemistry, Polish Academy of Sciences, Warsaw, Poland
iDepartment of Chemistry, Ulsan National Institute of Science and Technology, Ulsan, Republic of Korea
jDepartment of Energy Engineering, Korea Institute of Energy Technology (KENTECH), Naju, 58330, Republic of Korea
kDepartment of Chemistry, University of British Columbia, Vancouver, BC V6T 1Z1, Canada
lSchool of Materials Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore
mInstitute of Materials Research and Engineering, Agency for Science, Technology and Research, 2 Fusionopolis Way, 08-03, Singapore 138634, Singapore
nLG AI Research, Seoul, Republic of Korea
oDepartment of Chemical and Biomolecular Engineering, KAIST, Daejeon, Republic of Korea
pSchool of Chemical and Biological Engineering, Interdisciplinary Program in Artificial Intelligence, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
qDepartment of Chemistry, Colorado State University, 1301 Center Avenue, Fort Collins, CO 80523, USA
rDepartment of Chemistry, KAIST, Daejeon, Republic of Korea
sChemical Engineering & Applied Chemistry, University of Toronto, Toronto, Ontario M5S 3E5, Canada
tChemical Data-Driven Research Center, Korea Research Institute of Chemical Technology, Daejeon, 34114, Republic of Korea
uDepartment of Chemistry, Fordham University, The Bronx, NY 10458, USA
vLaboratory of Artificial Chemical Intelligence (LIAC) & National Centre of Competence in Research (NCCR) Catalysis, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
wGraduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba 277-8561, Japan
xCenter for Basic Research on Materials, National Institute for Materials Science, Tsukuba, Ibaraki 305-0044, Japan
yRIKEN Center for Advanced Intelligence Project, Tokyo 103-0027, Japan
zDepartment of Energy Conversion and Storage, Technical University of Denmark, 301 Anker Engelunds vej, Kongens Lyngby, Copenhagen, 2800, Denmark
aaDepartments of Chemistry, Materials Science and Engineering, and Physics, University of Toronto, St George Campus, Toronto, ON, Canada
abMachine Learning Group, Technische Universität Berlin and Berlin Institute for the Foundations of Learning and Data, 10587, Berlin, Germany
acDepartment of Materials, Imperial College London, London SW7 2AZ, UK
adDepartment of Physics, Ewha Womans University, Seoul, Republic of Korea
First published on 6th December 2023
In light of the pressing need for practical materials and molecular solutions to renewable energy and health problems, to name just two examples, one wonders how to accelerate research and development in the chemical sciences, so as to address the time it takes to bring materials from initial discovery to commercialization. Artificial intelligence (AI)-based techniques, in particular, are having a transformative and accelerating impact on many, if not most, technological domains. To shed light on these questions, the authors and participants gathered in person for the ASLLA Symposium on the theme of ‘Accelerated Chemical Science with AI’ at Gangneung, Republic of Korea. We present the findings, ideas, comments, and often contentious opinions expressed during four panel discussions related to the respective general topics: ‘Data’, ‘New applications’, ‘Machine learning algorithms’, and ‘Education’. All discussions were recorded, transcribed into text using OpenAI's Whisper, and summarized using LG AI Research's EXAONE LLM, followed by revision by all authors. For the broader benefit of current researchers, educators in higher education, and academic bodies such as associations, publishers, librarians, and companies, we provide chemistry-specific recommendations and summarize the resulting conclusions.
In addition to brief talks from the attendees, the conference placed emphasis on panel discussions on the themes of Data, New applications, Machine learning (ML) algorithms, and Education. This Perspective aims to effectively communicate the insights and discussions from these panels to the broader research community.
Numerous recent review and perspective articles have extensively explored the role of data science, ML and AI in various domains of experimental chemistry, including general chemistry,1 synthetic chemistry and chemical reactions,2–5 as well as theoretical topics such as chemical compound space exploration6 and force-field development.7,8 Additionally, recent reviews have addressed the application of autonomous research systems in materials science,9–16 organic chemistry,17–19 inorganic chemistry,20 porous materials,21 nanoscience,22,23 drug formulation24,25 and biomaterials.26 Reviews also exist on the topic of self-driving laboratories27,28 and their low-cost incarnations.29 While previous recommendations have covered ‘best practices’ in machine learning for chemistry,30 including uncertainty quantification,31 our focus in this Perspective is to present specific recommendations derived from a very rich set of panel discussions by many active researchers in the field, rather than to reiterate those already discussed themes. We refer the reader to these works for more in-depth discussions.
Continuing with the focus on AI, the Whisper program32 was used to transcribe the panel discussions, and EXAONE33 was used to generate automated summaries. These algorithmically generated summaries served as the initial drafts of the sections that follow, which we subsequently edited and annotated to ensure clarity. Through this process, it became clear that the panel discussions encompassed overlapping topics, highlighting the shared challenges in the field of AI in chemical science. To underscore these critical challenges, we have reorganized the discussions into common themes: data, new applications, ML algorithms, and education.
Furthermore, the availability of high/multi-fidelity benchmark datasets is essential.35 The benefits of improved training data efficiency when using multi-level learning in chemical compound space have been demonstrated on multiple occasions.36,37
When dealing with high-cost, high-fidelity data acquisition, the development of automated workflows that incorporate uncertainty quantification,38 encompassing both epistemic (the model's inability to fit the data distribution) and aleatoric (noise in the data) uncertainties, along with active learning, can be beneficial. Moreover, delta-learning methods and the incorporation of physical rules as inductive bias within machine learning algorithms have been shown to reduce the amount of required data.39 Furthermore, sampling techniques such as entropic sampling and self-learning population annealing can serve as effective data acquisition strategies. These techniques enable effective weighting of the density of states of the final property in relation to input descriptors, facilitating a comprehensive understanding of different regions of the chemical space. In addition to forward models, observations have suggested that machine learning can also contribute to knowledge-augmented data generation within a discrete and sparse chemical space, particularly in the context of inverse generative design.12,40,41
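As a concrete illustration of the epistemic/aleatoric distinction mentioned above, the toy sketch below uses a bootstrap ensemble of simple polynomial fits; the data, model, and ensemble size are illustrative assumptions, and real workflows would typically use ensembles of neural-network models or calibrated Bayesian estimators.

```python
import numpy as np

# Toy sketch: separating epistemic from aleatoric uncertainty with a
# bootstrap ensemble of simple models. All data and model choices here
# are illustrative assumptions, not a production workflow.

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + rng.normal(0.0, 0.1, size=x.size)  # noisy "measurements"

# Fit an ensemble of straight lines on bootstrap resamples of the data.
preds = []
for _ in range(100):
    idx = rng.integers(0, x.size, x.size)
    coef = np.polyfit(x[idx], y[idx], 1)
    preds.append(np.polyval(coef, x))
preds = np.array(preds)

# Epistemic uncertainty: model disagreement; shrinks as data are added.
epistemic = preds.std(axis=0)
# Aleatoric uncertainty: irreducible noise left in the residuals.
aleatoric = float(np.sqrt(np.mean((y - preds.mean(axis=0)) ** 2)))
```

In an active-learning loop, new expensive calculations would be requested where the epistemic component is largest, since only that component can be reduced by acquiring more data.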
Despite the significant emphasis on developing theoretical strategies for efficiently constructing databases with high-fidelity data, there is a need for additional efforts to ensure that these databases are also user-friendly for interdisciplinary research, i.e., permit even non-domain-expert AI practitioners to interact with the data with minimal intervention. This accessibility is essential for facilitating the testing of new algorithmic developments. For instance, when the first large quantum dataset with coordinates and multiple molecular properties for more than 100,000 small organic molecules, QM9, was published in 2014, the total energy was included alongside the free atomic energy. While experts can easily compute derived properties from this information, such as reaction energies (by subtracting the total energies of constitutional isomers) or atomization energies (by subtracting the free atomic energies from the total energy for a given stoichiometry), this process can pose an unnecessary barrier for non-experts, requiring them to invest time and effort in understanding the underlying definitions of basic chemical properties. Hence, the development of easy-to-use, web browser-based interfaces for predictive models is of great importance.42 At the same time, the systematic management of meta-information remains important to ensure the reliability of the constructed database. For example, tools such as AiiDA43 and NoMaD44,45 record comprehensive data provenance for ‘static’ materials simulations.
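To make the derivation of such properties concrete, a minimal sketch of the atomization-energy arithmetic described above might look as follows; the free-atom and total energies used here are illustrative placeholders, not actual QM9 values.

```python
# Minimal sketch of deriving an atomization energy from a molecular total
# energy and free-atom energies, as described for QM9-style data. The
# energy values below are illustrative placeholders, not actual QM9 numbers.

FREE_ATOM_ENERGY = {"H": -0.500, "C": -37.845, "N": -54.583, "O": -75.064}  # Hartree, placeholders

def atomization_energy(total_energy, stoichiometry):
    """Atomization energy = sum of free-atom energies minus the molecular total energy."""
    e_atoms = sum(FREE_ATOM_ENERGY[el] * n for el, n in stoichiometry.items())
    return e_atoms - total_energy

# Example: methane (CH4) with an illustrative total energy.
e_at = atomization_energy(-40.476, {"C": 1, "H": 4})
```

Shipping such derived quantities (or the few lines needed to compute them) alongside the raw data removes exactly the kind of barrier for non-experts discussed above.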
Finally, it is important to distinguish between multiple categories of datasets: smaller, more accurate, and computationally challenging ones that serve specific practical purposes, and datasets specifically designed for benchmarking ML models. This differentiation helps avoid situations where research focuses solely on improving model performance to surpass benchmarks without effectively translating those advancements into practical applications (i.e., overfitting to the benchmark). In this context, dynamic management of databases within the relevant research community proves to be fruitful, as discussed below.
(1) Tasks: clearly defined tasks with precise mathematical interpretation, physical meaning, and chemical purpose.
(2) Accessibility: availability of easily accessible gold-standard datasets in a standardized format, publicly accessible and ready for use.
(3) Metrics: specification of one or more proposed quantitative metrics for each task to measure success.
(4) Evaluation: continuously updated leaderboards that rank state-of-the-art methods and/or data splits that allow us to better track model improvements and generalization to out-of-domain (OOD) data.
(5) Discovery: ability to generate new data as needed, by “Augmenting with chemical knowledge.”
Large databases of reactions, such as USPTO,52 Pistachio,53 Reaxys,54 and SciFinder,55 do exist. However, the knowledge contained within these databases falls short regarding quality, diversity, and accessibility. For instance, while USPTO offers open access, its quality may be lower than that of Reaxys, which is higher in quality but available only through limited, paid access. Reproducibility has also become a point of concern. Additionally, despite the vast amount of experimental data available in these reaction databases, only a limited number of reaction types have sufficiently large numbers of examples, typically a few hundred or more, which hinders the development of practical, useful, and general AI models.56 Efforts such as the Open Reaction Database are notable for trying to address these limitations,57 but it remains populated with data from USPTO, with only a few hundred brand-new entries; this poses the question of how best to incentivize synthetic chemists to deposit their results (both positive and negative) into such databases.
Correspondingly, purely data-driven approaches in organic synthesis planning would greatly benefit from maximal training data efficiency. Potential solutions to enhance efficiency include delta-learning and transfer learning,58 multi-level learning,36,37 and few-shot learning techniques.59 However, the challenge of sparse data becomes particularly pronounced when attempting to identify the scope of “impossible” reactions. If a certain reaction is not listed in a database, one often assumes it cannot happen. However, this assumption is reliable mainly for the types of reactions that occur often. As mentioned earlier, such classes are relatively limited in number and occurrence.60
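The delta-learning idea mentioned above can be illustrated with a toy sketch: a small model is trained only on the difference between a cheap baseline and expensive reference data, so far fewer expensive labels are needed. All functions and data here are synthetic assumptions for illustration.

```python
import numpy as np

# Toy sketch of delta-learning: train a small model on the *difference*
# between a cheap baseline method and expensive reference data. The
# correction is typically much smoother than the target itself, so it
# can be learned from few expensive labels. Data here are synthetic.

x = np.linspace(0.0, 1.0, 20)
cheap = np.sin(3.0 * x)                  # inexpensive, approximate method
expensive = np.sin(3.0 * x) + 0.3 * x    # high-fidelity reference
delta = expensive - cheap                # smooth, easy-to-learn correction

coef = np.polyfit(x, delta, 1)           # a tiny model suffices for the delta
predicted = cheap + np.polyval(coef, x)  # delta-learned prediction
```

The same pattern underlies transfer learning between fidelity levels: the model only has to capture what the cheap method misses, not the full physics.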
When high-quality datasets are lacking, an alternative, albeit more labor-intensive approach, is expert coding within programs like Chematica or AllChemy. These programs can perform advanced-level synthesis planning, even for complex natural products.61
One conclusion reached with broad consensus is the ever-increasing need for improved-quality and open databases in all AI-related efforts, not only for reaction data but also for describing rules of chemical reactivity, or the properties of experimentally available and virtual ligands to find new catalysts.62,63 Moreover, new featurization schemes may be necessary, particularly ones that capture stereochemistry, steric hindrance, and long-range interactions in reactions on complex scaffolds.
Similarly, it was emphasized that the codes utilized to generate the data should be accessible, unless licensed, and well-documented. Such practices align with the increasing adoption of FAIR (Findable, Accessible, Interoperable, and Reusable) policies in the scientific community.64,65 Another related challenge is facilitating broader access to proprietary data and/or establishing new repositories where researchers can deposit results of both successful and, importantly, unsuccessful experiments they have conducted.
On the former issue, the panelists agreed that professional non-profit organizations, such as the American Chemical Society (ACS), should consider opening up their extensive repositories or, at the very least, enabling broader academic access. Currently, the SciFinder dataset contains approximately 100 million reactions, yet it remains completely inaccessible for downloads, severely limiting systematic data analyses. Given its status as a non-profit organization, the ACS is seen to have an ethical obligation to share the datasets it accumulates. While the CAS Common Registry initiative66 is appreciated, restricted licensing hinders research progress. Thinking more broadly, policies that require disseminating a complete set of data and code as a requirement for publication will help accelerate progress in this field. ACS has started defining research data policy recommendations to achieve this goal.67 An excellent example of this is RSC's new journal Digital Discovery,68 which has a dedicated data and code reviewer to assess submitted materials for documentation and reproducibility.
Simultaneously, the experimental literature tends to exhibit bias towards “success stories”, while failed experiments often go unreported.76,77 This bias can arise from various factors, such as the superior performance or ease of synthesis and characterization of certain materials for unrelated applications. Consequently, the available data on chemical space for exploration with AI becomes limited, impeding the discovery of genuinely novel systems. From a modeling perspective, a data point perceived as a “failure” in experimental terms can be just as valuable for training models as a data point from a “successful” experiment. Although the concept of a “Journal of Failed Research” remains elusive, the panel suggested that well-documented and openly available metadata from experiments, regardless of outcomes, could address this limitation by providing theoreticians with more extensive and diverse training sets in terms of structure and composition. Moreover, it was highlighted that the context of an experiment matters in defining what constitutes a “failed experiment”. For instance, a seemingly failed experiment in one context may actually lead to successful outcomes or the discovery of new compounds in a different context.
During the discussions, the topic of how AI empowers creativity in chemistry was addressed. It was acknowledged that AI is ultimately a tool that accelerates technological advancements and scientific discoveries. The progress made in this field has undeniably expedited the pace of invention. It can also be argued that AI enhances the occurrence of “eureka moments” by facilitating new insights and understanding. This aspect is intricately linked to the exploration of new concepts and the perception of reality. As a creative discipline, chemistry is driven by scientists motivated to uncover novel phenomena, unencumbered by pre-established physical laws. For example, this could involve stabilizing challenging structures, creating unconventional solvation environments, or discovering previously unknown and aesthetically pleasing spin states. Therefore, by leveraging AI to comprehend existing knowledge and venture into unexplored territories, creative pursuits in chemistry can be truly enhanced. In particular, the question of what it entails for AI to gain scientific understanding based on data is especially relevant due to the advent of large language models (LLMs) and their applications to chemistry.78–81 In this context, philosophical and conceptual frameworks like the one proposed by Krenn et al. are needed.82
The concept of self-driving labs was also discussed,9 with considerations given to the expenses associated with building, maintaining, and operating such facilities, especially when tailored for testing various optimization algorithms. The idea of “virtual labs” emerged as an alternative, where multi-level modeling is utilized to mimic real-world experiments. For example, in the context of batteries, simulations running on materials could be linked to single-cell and battery-pack configurations to understand the key influences from microstructure to system performance.
There is also a need to approach data dynamically. Building data in a multi-modal capacity to capture different scales or incorporating new experiments and calculations is critical for aiding chemical discovery. It is crucial to emphasize the importance of top-down approaches, starting from the meso/micro-scale phase-field60 and seamlessly coupling them with ML potentials92 for autonomous parameterization. Additionally, to enable more meaningful AI-driven discoveries, it is highly desirable to restrict the search to compounds that are easy to synthesize and provide synthesis recipes.
An additional example of algorithmic developments, partially inspired by chemical applications, involves the construction of models that incorporate physical symmetries into their structure. In the case of interatomic potentials, since the early stages of this field the crucial insight has been the requirement for models to be exactly invariant to rotations, translations, and atom index permutations.93 More recently, these ideas have been expanded to create physics inspired models that build upon covariant features/representations, an extension motivated by the widespread presence of vectorial and tensorial targets in quantum chemistry.94 It is noteworthy that these developments have progressed independently and in parallel with similar efforts in computer science,95 albeit formulated using different terminology and with less mathematical generality.
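A minimal illustration of the invariance requirement described above: the sorted list of pairwise interatomic distances is unchanged by rotations, translations, and atom-index permutations. This toy descriptor is purely illustrative; production interatomic potentials use far richer invariant or equivariant representations.

```python
import numpy as np

# Toy sketch of the symmetry requirement for interatomic potentials:
# features must be invariant to rotation, translation, and permutation
# of atom indices. Sorted pairwise distances satisfy all three, though
# real models use much richer invariant/equivariant representations.

def invariant_features(coords):
    """Sorted pairwise distances of an N x 3 coordinate array."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)
    iu = np.triu_indices(len(coords), k=1)
    return np.sort(dists[iu])

# A random orthogonal transform + translation + permutation leaves the
# features unchanged.
rng = np.random.default_rng(1)
atoms = rng.normal(size=(5, 3))
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
moved = atoms[rng.permutation(5)] @ q.T + np.array([1.0, -2.0, 0.5])
assert np.allclose(invariant_features(atoms), invariant_features(moved))
```

Building such symmetries directly into the model, rather than hoping they are learned from data, is the key design insight referenced above.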
During the panel discussions, intriguing questions were raised regarding the potential integration of data-centered and expert methods and the extent to which this integration could be achieved.96,97 Hybrid approaches were proposed as a means to leverage the encoded knowledge of experts while maintaining the flexibility and adaptability of data-driven approaches. It was also observed that the raw reaction rules derived from either of these approaches can be significantly enhanced through further refinement using quantum mechanical (QM) or molecular mechanical (MM) calculations. For instance, MM methods can be employed to calculate strains and estimate the applicability of reaction rules to cyclization reactions.98
Another notable example of a hybrid approach involves breaking down the barriers between different methodologies. This includes merging electronic structure theory and machine learning99 or creating a unified framework that combines simulations and experimental data.100 Such models have the potential to learn by effectively integrating diverse sources of information.
Considering these fundamental AI advancements for enabling chemical discovery, it was noted that most multi-objective, multi-fidelity constrained problems addressed in self-driving labs today tend to prioritize higher performance based on predefined objectives. However, to advance chemistry knowledge, algorithms need to be further tailored for interpretability, extrapolation to learn new science, and hypothesis testing, which fundamentally require different approaches. A recent example involves dedicated exploration of the Pareto front, allowing the extraction of local correlations with near-optimal performance to aid in result understanding.104,105
The subsequent topic of discussion revolved around using the acceleration and discovery of new molecules/materials successfully validated in the lab as metrics of success in applying machine learning in chemistry. However, going beyond the speed of material development, true discovery of new concepts,82 such as topological materials, remains elusive. This led to the question of exploring deeper paths in AI to unlock such possibilities.106 One potential avenue is considering an automatic system that generates novel questions, although formulating the problems is typically within the domain of human experts. In scientific discovery, anomalies or outliers often lead to new findings. Optimization algorithms are already designed to find regions of high uncertainty in the parameter space, which are often unexplored. Rewarding data points in those regions, even if only a small percentage results in actual discoveries, can lead to the discovery of genuinely new phenomena. Additionally, digitizing existing knowledge in chemistry and creating a comprehensive corpus of our current understanding can help define a concept of “known unknowns” for AI, making the idea less vague and facilitating exploration beyond what is already known. An example was shared regarding an automated robotic system developed by David MacMillan's group at Princeton University, which achieved “accelerated serendipity” by assembling molecules with no known history of interactions and rewarding accidental reactivity.107 This approach resulted in discovering new reactions or improved methods for existing reactions. Furthermore, emphasizing the uncertainty quantification of AI models was highlighted as a critical step, as rewarding areas of large uncertainty in active learning frameworks necessitates the quantification and understanding of the epistemic and aleatoric uncertainty of the models,38,108,109 and the errors at each step.
Various educational strategies were explored during the discussions. One extreme example is Nanyang Technological University in Singapore, where university students are mandated to have coursework in computational thinking, data science, and machine learning. Similarly, Imperial College London and the Technical University of Denmark have university-wide initiatives to incorporate data and machine learning competencies within the undergraduate curriculum. Another approach involves offering dedicated single courses such as “Data Science for Chemistry” or “Autonomous Discovery” as upper-level electives.110,111 Some participants shared experiences of incorporating aspects of ML/AI/data science into existing courses or pedagogical laboratory experiences.112–116 Some of these adaptations were driven by the restrictions imposed by the COVID-19 pandemic. For instance, alternative machine-learning-oriented “computational labs” were developed as substitutes for traditional wet labs. Additionally, remote-control access to laboratory equipment117 and mailing students Lego kits to build and operate their own autonomous systems were also explored. A recent review of low-cost self-driving laboratories collects many of the above efforts in comprehensive categorizations.118
Many science and engineering degree programs already require computer programming or numerical computing courses. Historically, these courses were taught in FORTRAN or MATLAB, although the recent trend is to move towards Python, which has become the standard language for machine learning. There are both advantages and disadvantages to having this course taught by a computer science department, considering university politics and topical relevance to students. On the one hand, departments may be protective of their specific areas of study, and other departments may lack the staffing necessary to support the teaching of new classes. On the other hand, students often benefit from direct applications of programming to their primary coursework, which may be lacking in broader service courses. Regardless of how it is offered, it is crucial that students learn elementary programming as early as possible, as it serves as a foundational skill for the other topics covered in the curriculum. It also enables students to undertake projects in their final year focusing on automation or modeling. By adopting this approach, we can create a new generation of students proficient in coding.
Data management encompasses various aspects, including importing, visualization, and adhering to scientific practices such as FAIR data principles. It also involves the development of ontologies, schema, and understanding of intellectual property rights. Incorporating data management into the education of all chemists is crucial, as data generation is inherent to the field, and funding agencies as well as publishers require data management policies. One approach to instilling these practices is to have students create data management plans for projects or upload data from teaching labs to actual repositories. Emphasizing the importance of reporting every repetition of an experiment is essential. Comprehensive data management practices will greatly benefit students when preparing papers, and reinforcing these practices throughout their undergraduate and graduate education is highly valuable.
Statistics is a well-established field and requires no introduction. However, an ideal curriculum would place greater emphasis on computational approaches to statistics.119
Another potential application for training is using “body cam” footage or similar technologies to provide mentorship in the laboratory. The COVID-19 pandemic, with its need for remote work and limited laboratory occupancy, presented opportunities for pilot projects exploring augmented reality. In these projects, a trainer could supervise trainees from a remote location and provide relevant information directly into the trainee's field of view.
When incorporating new computational material into coursework, trade-offs need to be made. Constructive overlaps can be found by substituting programming exercises for lengthy symbolic derivations or incorporating data analysis and sharing exercises instead of traditional laboratory report writing assignments. However, it is inevitable that some content will need to be removed. For instance, some institutions have chosen to reduce math components or replace manual experimental laboratory work with computer-based assignments, which has been well-received by students but has also caused tension within departments. Another approach could involve creating summer coding “bootcamps” that provide anywhere from 1–12 weeks of intensive coding experiences for undergraduate and graduate students, leveraging theory faculty members and inviting guest speakers. However, it is important to recognize that these extracurricular experiences may not engage all students and require faculty to donate their time. More case studies are needed to further explore these trade-offs, and the evolution of curriculum is expected to progress slowly.
Barriers and challenges exist in promoting the incorporation of machine learning into chemistry education. Many chemists outside of the subfield may not perceive it as essential and may lack the necessary skills to teach the material. However, there is value in providing a rigorous education based on fundamentals, and statistical data analysis may serve as a starting point that can act as a gateway to statistical learning methods.
The prospect of increased productivity through AI and autonomous research presents an opportunity to elevate the career value of chemists. However, changing perceptions about the role of AI in chemistry and securing investments in autonomous laboratories remain challenges. To attract attention and support, it is not enough to have robots in laboratories; the robots should engage in groundbreaking chemistry and contribute to discoveries that would otherwise be impossible.
Footnote
† The symposium was organized by Yousung Jung, Alán Aspuru-Guzik, and O. Anatole von Lilienfeld. The authors are listed in alphabetical order, except for the first author who took charge of organizing the initial draft written by all co-authors who contributed to different sections.
This journal is © The Royal Society of Chemistry 2024