Claudio Avila*, Adam West, Anna C. Vicini, William Waddington, Christopher Brearley, James Clarke and Andrew M. Derrick
Pfizer UK R&D Ltd, Pharmaceutical Sciences Small Molecule (PSSM), Chemical Research and Development (CRD), Discovery Park House, Ramsgate Road, CT13 9NJ, Sandwich, UK. E-mail: Claudio.Avila@science.cl
First published on 24th July 2024
Across the chemical sciences, synthesis planning is a key aspect for defining synthesis routes, starting from idea generation, combining literature searches and laboratory experimentation, and including scaling-up considerations for large scale manufacturing. This iterative process, which relies heavily on information sharing, is crucial in pharmaceutical development, where drug candidates are transformed into commercially viable Active Pharmaceutical Ingredients (APIs), impacting the access to medicines for billions of people. In this work, we demonstrate that by capturing chemical pathway ideas digitally, at the point of conception, we can systematically merge these ideas with synthetic knowledge derived from predictive algorithms. This serves as a preliminary step for further route evaluation. To achieve this, we introduce a new method for storing, analysing, and displaying chemical information using graph databases and graph representations, illustrated with the commercial synthesis planning of the GLP-1 receptor agonist Lotiglipron. Compared to traditional methods, graph databases naturally fit the substrate-arrow-product model traditionally used by chemists, offering a modern alternative to store and access chemical knowledge. This framework facilitates a universal chemistry approach, allowing data from many different sources and organisations to be shared and combined, and enabling new ways to optimise the complete route selection process.
In pharmaceutical development, the selection of a synthetic route for commercial manufacturing of an Active Pharmaceutical Ingredient (API) can be a long iterative process.4 The search for an API (or target molecule) begins by identifying the ‘pharmacophore’, defined by the IUPAC as “an ensemble of steric and electronic features that is necessary to ensure the optimal supramolecular interactions with a specific biological target and to trigger (or block) its biological response”.5 Starting from this, research groups propose a diversity of targets (an array of analogous molecules with similar therapeutic potential), which are used during the testing phases of the drug discovery process,6 and elaborate a simple synthetic route to rapidly access them. From this process, an enabling chemistry route emerges, often used to deliver early campaigns to fund clinical and toxicological trials, typically in the kilogram production range and under intense time constraints.
Once the API moves into the later stages of clinical development (human trials), this route is seldom orientated toward the objectives of a commercial manufacturing process, which involves achieving commercial feasibility often at multi-tonne scales and meeting the quality attributes required by regulatory agencies.7,8 As a result, the synthesis planning for the commercial manufacturing stage starts with scarce data relevant to the final objective.3,9
To progress towards a more optimal synthetic route, all possible theoretical ideas need to be gathered. This may include full synthetic routes, fragments of routes, or individual reactions that may be of use later. This step is often called ‘idea generation’ or ‘brainstorming’,10 and it requires a large number of contributors to achieve a diverse set of ideas, to ensure the optimal solution is amongst this initial set.11 The number of ideas generated at this stage is usually large (50 or more routes), which hinders how data is displayed and makes decision making difficult. Effective idea visualisation is essential to proceed to the next step.
The ideas are then organised and triaged, serving as the basis for the execution of the synthesis plan in the laboratory. The consequences of decisions made during this triage, where limited experimental data is accessible, propagate through all stages of development and even to the final commercial API manufacturing process.12
In pharmaceutical development, a common approach used for route selection is to apply the SELECT12,13 criteria (or similar standards used across the chemical industry14). SELECT includes a series of factors to account for Safety: including process safety, and exposure to substances harmful to health; Environmental: the volume of natural resources consumed, and the generation of substances harmful to the environment;15 Legal: intellectual property rights to produce the drug, and legal requirements for control and use of intermediates and reagents; Economics: meeting the cost of goods targeted for commercialisation, and the investment required to support the desired production; Control: including the chemistry and physical process parameters (PP) and the control of the quality attributes (QA);16 and Throughput: availability of raw materials, and timescale of the manufacturing process.13
The practical implementation follows a series of logical steps, in which a multidisciplinary team is required to aggregate multiple types and sources of information (theoretical, semantic, qualitative, quantitative, etc.). Traditionally, the input of experienced process scientists is needed to cover process chemistry, process safety, regulatory compliance, process engineering, and manufacturing considerations.9,12 Their input is combined with initial laboratory trials, literature references, and any other accessible source of information, such as results from modelling tools regarding process safety17 and physical property predictions like solubility.18 The objective is to determine the most promising routes, prioritising those with the highest value and likelihood of success. Following a comprehensive scrutiny of these factors, directed by SELECT, a few routes emerge as promising choices for subsequent investigation in the laboratory. Ultimately, a singular route is selected as the commercial route for large scale manufacturing.
After the initial route selection, the work focuses on process development, i.e. identifying the necessary unit operations and the process conditions for each step. During this stage, the individual steps of the selected route may be refined; this can include step reordering or the decision to telescope one or more steps.19 More often, however, changes are limited to unit operations and reaction conditions (solvents, catalysts, reagent selection, temperature, etc.). Very occasionally, a major drawback is identified that requires revisiting the route selection process. In such instances, the data collected up to that point are not discarded but are utilised to (re)evaluate alternative synthesis paths.
Fig. 1A depicts a flow diagram illustrating the complete development process, while Fig. 1B shows the data flow of a random molecule synthesis, illustrating the steps considered by the described traditional approach. These network diagrams show individual routes appearing as branches from the main target molecule, formed by a group of single reaction steps and the corresponding intermediate molecules.
The determination of an optimum route is a multi-factor problem, and the human-led solution is vulnerable to bias, even when applying the SELECT criteria rigorously.3,20 Lack of supporting information or chemistry knowledge leads to some ideas being left aside without adequate assessment and decisions can align behind a single local optimum just because more is known about this route.21 One key barrier for the traditional approach is the inherent human bias to gravitate towards familiar methodologies and well-established procedures.22 Decisions often draw from the experiences and successes of the scientists involved in the process, leading to a potential lack of diversity in the exploration of new synthetic pathways. This gap is confirmed by the emergence of generics into the market, appearing as soon as key patents have expired, with alternative viable commercial routes in other branches of the chemical space.23 Branches that may have even been present in the original synthesis planning stage but were not explored.
Several challenges contribute to perpetuating this problem. For instance, the lack of centralised data systems, the use of individual applications acting in isolation, and the absence of common data formats and repositories24 currently act as barriers preventing a systematic analysis of the entire collection of ideas and data.25,26 In addition, the lack of utilisation of equivalent information from past projects poses a significant obstacle in chemical research.27 Despite differences in target molecules, steps, and transformations, the potential for transferable knowledge remains largely untapped due to the absence of supportive systems.28,29 This results in missed opportunities for cross-project insights, leading to redundant efforts and resource inefficiencies.
To overcome these barriers, there is a need to implement supportive systems to enable the systematic capture, organisation, and dissemination of knowledge across various projects.30,31 Such systems could take the form of centralised databases, collaborative platforms, or knowledge management tools designed to promote information sharing and collaboration.32,33 In theory, as Fig. 1C illustrates, for synthesis route planning these ideas could be programmatically enriched with additional data e.g. data from additional experimental sources, literature references, theoretical or predictive information, etc.; subsequently enabling the calculation of metrics intrinsic to the entire dataset.34,35 Moreover, algorithms could be applied across the complete information network,36–38 potentially unveiling unprecedented insights beyond the scope of conventional human capabilities.
This paper illustrates how introducing key digitalisation elements can evolve the traditional route selection methods into a more advanced approach. Storing chemical knowledge directly on graph databases enables direct digitalisation of human inputs, real-time access to scientists, holistic overviews for decision makers, and the enrichment of the metadata by direct application of software and algorithms on the database. Finally, implementing advanced network metrics will facilitate identification of the global optimum route, which in turn could facilitate access and lower cost of medicines for patients globally.
Today, graph representations are emerging in a variety of scientific fields.40–42 In chemistry, graph representations have been used to visualise synthesis paths,36,43,44 and to create representations of large chemical networks.36,45,46 These representations have proved useful to tackle problems in risk management of chemical threats,31 and to organise results delivered by predictive retrosynthesis software,45,47 in an accessible way for chemists and data scientists.48 Building upon these insights, this paper proposes advancing beyond mere visualisation to leveraging graph databases as comprehensive data management systems.
A graph database is a management system that stores information in the form of nodes (objects containing properties or attributes that describe the data it represents) and relationships or edges (arrows defining connections between nodes, indicating the nature of the relationship and its direction),49 providing a powerful new method of storing highly complex and variable information.50–52 Defining a graph database for chemistry is straightforward because the traditional representation of chemical reactions, conceptually describing the transformation of chemical substances from reactants to products as a diagram, aligns naturally with a graph structure (this symbolic translation is depicted in Fig. 2A). For instance, a starting material (a molecule node) is transformed into a product (another molecule node) by a chemical reaction (a reaction node), in which the edges provide the relationship (if a substrate is a precursor or a product).
Furthermore, in a graph database, molecule nodes can contain specific properties such as names or identifiers (SMILES representations, InChI keys, IUPAC names, etc.), as well as other fields capturing specific physical or chemical information. In the same way, a reaction node can contain specific properties such as reaction conditions, yields, selectivity, purity, and other scores. Additional nodes can be created representing different objects without interfering with any of the data and relationships established. In this work, a route node was also created to be able to store individual route attributes and introduce more advanced network analysis. Extra nodes can be added to the graph database containing users, projects, ideas information, etc., which might be helpful to add additional management layers, but we omitted them in this paper for clarity.
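As an illustration of the molecule and reaction nodes described above, the following is a minimal in-memory sketch in Python. It is not the schema used in this work: the class names, the relationship labels `SUBSTRATE_OF` and `PRODUCES`, the property keys, and the esterification example are all assumptions chosen for illustration, and a SMILES string is used directly as the node key.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    label: str                                    # "Molecule" or "Reaction" in this sketch
    properties: dict = field(default_factory=dict)


@dataclass
class ChemistryGraph:
    nodes: dict = field(default_factory=dict)     # node_id -> Node
    edges: list = field(default_factory=list)     # (source_id, relationship, target_id)

    def add_node(self, node_id, label, **properties):
        self.nodes[node_id] = Node(node_id, label, properties)

    def add_reaction(self, reaction_id, substrates, products, **properties):
        """Create a reaction node and connect it to existing molecule nodes."""
        self.add_node(reaction_id, "Reaction", **properties)
        for molecule_id in substrates:
            self.edges.append((molecule_id, "SUBSTRATE_OF", reaction_id))
        for molecule_id in products:
            self.edges.append((reaction_id, "PRODUCES", molecule_id))


# Hypothetical example: an esterification (water by-product omitted for brevity).
graph = ChemistryGraph()
graph.add_node("CCO", "Molecule", name="ethanol")
graph.add_node("CC(=O)O", "Molecule", name="acetic acid")
graph.add_node("CCOC(C)=O", "Molecule", name="ethyl acetate")
graph.add_reaction("rxn-1", ["CCO", "CC(=O)O"], ["CCOC(C)=O"],
                   conditions="acid catalyst, reflux")
```

Because properties are stored as free-form dictionaries, further fields (scores, purity, SELECT annotations) could be attached to a node without changing the structure of the graph.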
Some rules need to be established to preserve the logic of the graph database. For instance, a molecule node must be unique, which means that if different representations are available for the same molecule, all of these should be gathered under a single node. This ensures consistency and facilitates the implementation of efficient searching strategies. Conversely, reaction nodes are unlimited, allowing all the possible options for a transformation to be captured.
Currently, most of the retrosynthesis software packages available use this type of graph representation,53 and translating this into a graph database schema results in a basic data model (Fig. 2B). A graph data model could become extremely complex depending on the number of nodes and relationships defined, the extent of the layers, and the processes modelled.54 Applying the basic data model proposed allows for connecting different molecule nodes to various reaction nodes, potentially resulting in large interaction networks (an example is given in Fig. 2C), from which the path from any molecule to any desired product can be established. The proposed schema was successfully used in a number of internal projects at Pfizer. An example is given for the route selection of Lotiglipron (PF-07081532), a GLP-1 (Glucagon-Like Peptide-1) receptor agonist developed for indications including type-2 diabetes and obesity.55 The results obtained are used to illustrate some of the concepts introduced, and specific details are presented in the ESI Section.†
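The uniqueness rule for molecule nodes can be sketched as a small registry that maps every known representation to a single node. This is only an illustration under a strong assumption: the caller asserts that representations registered together describe the same molecule. A real system would more likely canonicalise structures with a cheminformatics toolkit. The class and method names are hypothetical.

```python
class MoleculeRegistry:
    """Keeps one node per molecule, whatever representation it arrives in."""

    def __init__(self):
        self.aliases = {}      # node_id -> set of representations
        self.node_of = {}      # representation -> node_id
        self._next_id = 0

    def register(self, *representations):
        """Return the single node id for these representations,
        creating a node or merging existing ones as needed."""
        hits = sorted({self.node_of[r] for r in representations
                       if r in self.node_of})
        if hits:
            node_id = hits[0]
            # Fold any other matching nodes into the first one.
            for duplicate in hits[1:]:
                for alias in self.aliases.pop(duplicate):
                    self.aliases[node_id].add(alias)
                    self.node_of[alias] = node_id
        else:
            node_id = self._next_id
            self._next_id += 1
            self.aliases[node_id] = set()
        for r in representations:
            self.aliases[node_id].add(r)
            self.node_of[r] = node_id
        return node_id


registry = MoleculeRegistry()
first = registry.register("CCO", "ethanol")
second = registry.register("ethanol", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
other = registry.register("CC(=O)O")
```

Here `first` and `second` resolve to the same node because they share the name "ethanol", while the acetic acid SMILES receives its own node.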
In the context of commercial route selection, node properties were customised to include specific aspects concerning the SELECT criteria. A summary table for each of the nodes defined for this work is presented in Fig. 3A. Each type of node contains specific properties associated with the particular domain it represents (molecules, reactions, routes). During the data capture process, the scientists can provide insights in any form: chemical drawings, captured by the molecule nodes; reaction conditions and metadata, captured by the reaction nodes; and rankings, scores, comments, and suggestions, all of them channelled to the most appropriate graph structure aligned with any of the SELECT aspects.
Subsequently, the types of nodes defined can capture any information relevant to the process. For instance, labels for impurities and side products can be incorporated within the relationships or the reaction nodes. While impurities and side products are strictly molecules, their role within a specific reaction determines their labelling. Occasionally, an impurity may serve as the desired product in a different process, and vice versa. Therefore, capturing this information should focus on the specific chemical transformation rather than the molecules themselves. This example aligns with the proposed data model.
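The point that a molecule's role belongs to the transformation rather than to the molecule can be illustrated with role labels stored on the relationships. The identifiers and role names below are hypothetical and serve only to show the idea.

```python
# Each edge carries the role the molecule plays in that particular reaction.
# Identifiers and role names are illustrative only.
edges = [
    {"reaction": "rxn-A", "molecule": "M3", "role": "PRODUCT"},
    {"reaction": "rxn-A", "molecule": "M4", "role": "IMPURITY"},
    {"reaction": "rxn-B", "molecule": "M4", "role": "PRODUCT"},
]


def roles_of(molecule_id, edges):
    """Map each reaction to the role the given molecule plays in it."""
    return {e["reaction"]: e["role"] for e in edges
            if e["molecule"] == molecule_id}


# The same molecule node is an impurity in one reaction
# and the desired product in another.
m4_roles = roles_of("M4", edges)
```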
Fig. 3B shows a small network from which two routes can be identified, illustrating how specific information can be stored in the corresponding route nodes. From a higher-level perspective, this approach allows connections to all the different multidisciplinary aspects in a single data structure, in a similar way to that applied in a variety of other graph database applications.52,56
In this context, once a suitable storage solution was identified (the graph database), the focus was placed on capturing the information directly from the scientists and, at the same time, enhancing the process by allowing them to build on other scientists' contributions. In order to achieve this dynamic construction, we proposed a standardised three-step procedure including: (a) the idea capture at source (using a scientist user interface, or by accessing insights already digitalised); (b) a translation layer to enrich and transform the captured data to fit the graph data model (algorithm layer); and (c) the storage of this information into the graph database. Fig. 4A illustrates the complete process.
The creation of a basic rendering tool was required to capture direct ideas from scientists (using commercial drawing packages such as ChemDraw or BIOVIA Draw), producing outputs that were channelled to the graph database across an intranet network. In this case, a drawing canvas was embedded, automatically loading a molecule selected by the scientist from the network being created (which served as the basis for the upcoming idea). Within this canvas, the scientist could make any change or disconnection, adjusting fragments or synthons into feasible molecules (without undefined atoms). Upon submission, the different fragments were incorporated into the graph database and the relationships created (an example of such a rendering interface is shown in Fig. 4B). If the desired starting point was not already present in the graph, disconnections should be made from any other suitable molecule linked to the main target, ensuring all the ideas are connected (no isolated nodes are allowed). If a more appropriate connection is introduced later, this will be automatically reflected in the graph structure.
From a system perspective, when the scientist proceeds to submit, an algorithm picks all the individual fragments from the canvas (extracted as SMILES or InChI keys), individually searching for them in the graph database. If a fragment is not found, a new molecule node and a reaction node are created, with the corresponding ‘relationships’ established (substrate to reaction node, reaction to product node, or any other edge). Conversely, if the fragment is already available in the database, only the reaction node is created with the corresponding relationships. While this process takes place, an additional data enrichment step can be implemented by using the individual identifiers and querying them in other accessible systems (Fig. 4A, centre).
In addition to capturing the scientist's rendering, a form was deployed alongside to include additional metadata, such as reaction conditions, chances of success, scalability, etc. Similarly, this information was parsed, channelled, and stored directly into the properties of the corresponding nodes (an example is provided in ESI Section 1).† Moreover, additional automated mechanisms could also be implemented to gather ideas from various sources, including ingesting data from other database systems, extracting ideas from literature references, or retrieving previously registered ideas in the chemistry sections of ELNs. After performing validation checks to preserve the graph integrity, such as confirming that the molecules remain unique in the database and that the corresponding encoding is correct (done by a preliminary cross-validation search), as well as including the potential enrichment of the nodes and relationships, the data is subsequently written into the graph database. Fig. 4B shows a resulting network (small) as seen from a graph database (in this case using Neo4j).
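The submission logic described above can be sketched as a get-or-create step for molecules followed by the unconditional creation of a reaction node. This is a simplified, in-memory illustration rather than the deployed algorithm: the dictionary layout, the function name `submit_idea`, the relationship labels, and the placeholder fragment identifiers are all assumptions.

```python
import itertools

_reaction_ids = itertools.count(1)


def submit_idea(graph, substrates, products, **metadata):
    """Store one drawn disconnection.

    Molecule nodes are created only for fragments not yet in the graph;
    a new reaction node is always created.
    Returns the reaction id and the list of newly created molecules.
    """
    created = []
    for identifier in list(substrates) + list(products):
        if identifier not in graph["molecules"]:
            graph["molecules"][identifier] = {}
            created.append(identifier)
    reaction_id = f"rxn-{next(_reaction_ids)}"
    graph["reactions"][reaction_id] = dict(metadata)
    for s in substrates:
        graph["edges"].append((s, "SUBSTRATE_OF", reaction_id))
    for p in products:
        graph["edges"].append((reaction_id, "PRODUCES", p))
    return reaction_id, created


# Placeholder identifiers stand in for SMILES or InChI keys.
idea_graph = {"molecules": {"TARGET": {}}, "reactions": {}, "edges": []}
first_id, first_new = submit_idea(idea_graph, ["FRAG-1", "FRAG-2"], ["TARGET"])
second_id, second_new = submit_idea(idea_graph, ["FRAG-3"], ["FRAG-1"],
                                    comment="alternative access to FRAG-1")
```

In the second submission only `FRAG-3` is new, so a single molecule node is added, while the reaction node and its relationships are created as before.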
Depositing the data in a centralised graph allowed other scientists to visualise the contributions as they were created, being able to add on top of them (introducing additional metadata in already created nodes), or use this information to generate new ideas. For Lotiglipron, a summary is provided in ESI Section 2,† also showing the raw visualisation of the data in the graph database.
For Lotiglipron, a first attempt to enrich the human-generated network of ideas consisted of capturing some interesting routes generated by predictive software, and adding them manually into the graph. For this, we used the software package ASKCOS,59 which is currently under development by the Machine Learning for Pharmaceutical Discovery and Synthesis (MLPDS) consortium.60 ASKCOS is a retrosynthesis package designed to generate machine learning-driven synthesis routes, allowing users to input a target molecule, and then generating potential synthetic routes based on models that have been trained with a variety of different chemical databases. Implementation details can be found in ESI Section 3.† In this case, ideas were manually transferred to the graph database, and we anticipate this process will be straightforward in the future by ingesting data automatically from any of those tools.
Additional manual filtering steps were needed to remove the noise accompanying suitable predictions. Usually, these algorithms work well with simpler molecules, but they still tend to fail when facing complex transformations such as those found in the pharmaceutical industry, including heterocycle formation, bypassing unwanted or unsafe chemistry, or providing alternatives to generate desired chiral species (as in the Lotiglipron case). This noise can also affect human creativity and decision making by overwhelming the graph and masking areas of interest that could be expanded. For this reason, we proposed a separate process, gathering and filtering synthetic ideas in parallel, and merging with human contributions only at the end of a brainstorm cycle. The direct value of these algorithms is still under scrutiny,53 and current failures are attributed to the lack of appropriate training data repositories. The datasets currently available are reactions stored in ELNs (still requiring individual curation), information captured in public repositories such as patents (for instance, datasets created from USPTO data), or data extracted automatically from literature sources. Most of these cases still contain biases towards positive data (only successful cases reported). We suggest this situation could be completely overturned by using the graph database itself as a source of curated data for algorithm training.
Fig. 5 shows the resulting network obtained for Lotiglipron. In Fig. 5A, the ‘scientist view’ integrates both human and ASKCOS predicted ideas. It is designated ‘scientist view’ since this shows how the data is presented to the scientist on a user interface, where duplicate molecules are allowed for clarity, aiding in the identification of individual branches of the tree structure. Below, Fig. 5B shows the corresponding native graph database representation, illustrating how the data is organised and stored within the database. While the information presented in both views is the same, the scientist visualisation differs from the graph database representation, where molecule nodes are unique.
When the database contains a significant number of transformations, specific graph features begin to appear, such as cyclic recurrences. In these, a substrate A feeds into a reaction to produce substrate B, which then feeds back into substrate A. Also, key reactions and key intermediates can be identified by observing ‘hub’ nodes, to which many relationships point (examples of this are intermediates 211 and 294 in the graph database diagram, Fig. 5B).
The Lotiglipron graph was subjected to an automated algorithm designed to identify individual synthesis routes. This algorithm works by transforming the initial cyclic graph, which represents all possible synthesis pathways, into an acyclic type of graph (this operation is performed outside the database, on a duplicated dataset). In the acyclic form, pathways do not loop back on themselves, simplifying the structure and making it easier to analyse. Once the graph is converted, straightforward and well-known methods such as depth-first search (DFS) and breadth-first search (BFS)61 are applied to identify and extract the individual synthesis routes efficiently. These methods systematically explore the nodes and edges, traversing the entire network and assigning specific node identifiers as suggested in Fig. 2 and 3.
As Fig. 6 shows, the algorithm identified six routes obtained from human suggestions, and six additional routes obtained from artificial suggestions. These nodes were subsequently enriched with further annotations and data concerning the specific route properties, and this higher-level layer allowed the optimisation of the route selection. After this step, the graph database was ready to be interrogated (ESI Section 4† provides instructions to reconstruct the full Lotiglipron network in a graph database).
The native Neo4j language Cypher was directly used for querying the database. This approach differs from others suggested in the literature, where programming languages have been created to interact with chemical information.62 Decoupling the query language from the chemistry language introduces robustness and allows experts from other fields to manipulate the data without any prior chemistry knowledge.
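Both features mentioned above, hub nodes and cyclic recurrences, can be detected with elementary graph operations. The sketch below uses plain Python on a toy edge list. The node names are hypothetical, and the "number of incoming relationships" criterion for a hub is an assumption about how hubs might be ranked.

```python
from collections import Counter

# Directed edges: molecule -> reaction -> molecule (toy data).
edges = [("A", "r1"), ("r1", "B"), ("B", "r2"), ("r2", "A"),
         ("C", "r3"), ("r3", "B"), ("D", "r4"), ("r4", "B")]


def in_degree(edges):
    """Count incoming relationships per node; high counts suggest hubs."""
    return Counter(target for _, target in edges)


def has_cycle(edges):
    """Depth-first search that reports whether any path loops back on itself."""
    successors = {}
    for source, target in edges:
        successors.setdefault(source, []).append(target)
        successors.setdefault(target, [])
    state = {}                       # node -> "visiting" or "done"

    def visit(node):
        state[node] = "visiting"
        for nxt in successors[node]:
            if state.get(nxt) == "visiting":
                return True          # reached a node still on the current path
            if nxt not in state and visit(nxt):
                return True
        state[node] = "done"
        return False

    return any(visit(n) for n in list(successors) if n not in state)


hub, hub_degree = in_degree(edges).most_common(1)[0]
cyclic = has_cycle(edges)
```

In this toy network, molecule `B` receives three incoming relationships and the path A → r1 → B → r2 → A forms a cycle.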
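A simplified sketch of route extraction by depth-first search is given below. It departs from the procedure described above in one respect: instead of first converting the graph to an acyclic copy, it prunes any branch that would revisit a molecule already on the current path, which has a similar effect for this toy case. The function name, the `producers` mapping, and the node identifiers are assumptions.

```python
from itertools import product


def enumerate_routes(target, producers, path=()):
    """Return every route to `target` as a tuple of reaction ids.

    producers maps a molecule to a list of (reaction_id, substrates).
    Molecules without an entry are treated as starting materials.
    """
    if target in path:               # would loop back on itself: discard
        return []
    if target not in producers:      # starting material: nothing to make
        return [()]
    routes = []
    for reaction_id, substrates in producers[target]:
        # Every substrate of the reaction needs its own sub-route.
        sub_routes = [enumerate_routes(s, producers, path + (target,))
                      for s in substrates]
        for combination in product(*sub_routes):
            steps = tuple(step for part in combination for step in part)
            routes.append((reaction_id,) + steps)
    return routes


# Toy network: r3 regenerates B from A, creating a cycle between A and B.
producers = {
    "T": [("r1", ["B"]), ("r5", ["C"])],
    "B": [("r2", ["A"])],
    "A": [("r3", ["B"]), ("r4", ["S"])],
}
routes = enumerate_routes("T", producers)
```

Two routes are found for target `T`: the three-step sequence through `r1`, `r2` and `r4`, and the single step `r5`. The cyclic branch through `r3` is discarded.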
An example query that the graph database is optimised to answer is ‘find the shortest route from the target molecule up to the starting materials under specific constraints’. This is a shortest-path problem, commonly solved with Dijkstra's algorithm. This query can deliver the route with the minimum number of steps, while at the same time minimising the weight of any of the constraints imposed. Elaborating the queries and constraints goes beyond the scope of this paper, but we envision the calculation of additional metrics cascaded across the entire network during the enrichment phase, and the use of those values for resolving the query. As an example, some of the network metrics considered are listed below:
- Environmental metrics,34,63 for instance estimating the process mass intensity (PMI) of each step and determining the routes with the lowest cumulative PMI value (cPMI).
- Safety,26 determining the potential flammability and explosiveness of all the substrates and intermediates, and obtaining the average route safety by weighting the accumulated hazards across all the route steps.
- Cost of syntheses,36 determining the costs of the individual steps (observed or estimated), and applying rules such as excluding the use of precious metals that would make the large scale process unviable.
- Legal right to operate,25 identifying disconnections in the network which are common to industrial patents.
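To make the shortest-route query concrete, the following is a minimal Python sketch of Dijkstra's algorithm walking backwards from the target to the nearest starting material. It is an illustration only, not the Cypher query used in this work. Each step carries a non-negative weight that could stand for any per-step metric such as those listed above; the weights, identifiers, and the simplification that each reaction has a single principal substrate are all assumptions.

```python
import heapq


def cheapest_route(target, precursors, starting_materials):
    """Dijkstra search from the target back to any starting material.

    precursors maps a product to a list of (substrate, reaction_id, weight),
    with non-negative weights. Returns (total_weight, reaction_ids) with the
    reactions listed in retrosynthetic order, or None if no route exists.
    """
    queue = [(0.0, target, [])]
    settled = {}                     # molecule -> best total weight found
    while queue:
        total, molecule, steps = heapq.heappop(queue)
        if molecule in starting_materials:
            return total, steps
        if molecule in settled and settled[molecule] <= total:
            continue
        settled[molecule] = total
        for substrate, reaction_id, weight in precursors.get(molecule, []):
            heapq.heappush(queue,
                           (total + weight, substrate, steps + [reaction_id]))
    return None


# Two hypothetical routes to T: two steps via I1, three steps via I2 and I3.
unit_weights = {
    "T": [("I1", "r1", 1), ("I2", "r3", 1)],
    "I1": [("S1", "r2", 1)],
    "I2": [("I3", "r4", 1)],
    "I3": [("S2", "r5", 1)],
}
fewest_steps = cheapest_route("T", unit_weights, {"S1", "S2"})

# Penalising step r1 (for example, a costly or hazardous step) changes the answer.
penalised = dict(unit_weights, T=[("I1", "r1", 10), ("I2", "r3", 1)])
lowest_weight = cheapest_route("T", penalised, {"S1", "S2"})
```

With unit weights the search returns the route with the fewest steps; once a step is penalised, the longer route becomes preferable, which is the behaviour a constrained query would rely on.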
The shortest path query implemented without any constraint returned routes 11 and 12 (Fig. 6), using Dijkstra's algorithm. However, this result only considered the number of steps and not the full SELECT criteria. Upon careful examination, both suggested transformations are theoretically possible but there is a lack of scalability information. For instance, route 11 is impractical because the starting materials are not commercially available in the required form. A similar problem is observed with route 12. In these cases, the predictive algorithm did not further expand the starting materials, only adding partial information. This highlights an important point: queries like this are useful only if they include information on commodities or commercially available materials. To make meaningful comparisons, all routes should be traced back to compounds of similar complexity to assess their respective potential effectively. The same point is valid for all the parameters covered by the SELECT criteria.
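One way to guard against this problem is to check, for each candidate route, whether every starting material is flagged as commercially available. The sketch below assumes such a flag exists as a simple set of identifiers; the function name and the data are hypothetical.

```python
def unresolved_starting_materials(route, reactions, commercially_available):
    """Return the route's starting materials that are not commercially available.

    A starting material is any molecule consumed by the route
    but not produced by one of its own steps.
    """
    produced = {p for r in route for p in reactions[r]["products"]}
    consumed = {s for r in route for s in reactions[r]["substrates"]}
    return (consumed - produced) - commercially_available


# Hypothetical two-step route in which one starting material cannot be sourced.
reactions = {
    "r1": {"substrates": ["I1"], "products": ["T"]},
    "r2": {"substrates": ["S1", "S2"], "products": ["I1"]},
}
unresolved = unresolved_starting_materials(["r1", "r2"], reactions, {"S1"})
```

A route with a non-empty result would need further expansion before it can be compared fairly with routes that terminate in commodity materials.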
Another area where this holistic approach could aid is in the environmental and carbon footprint of the process. Incorporating factors such as chemical hazards, waste minimisation, and energy consumption as part of the decision-making process will help processes move towards net zero and further lower production costs. Some aspects of Green Chemistry could be applied today across an entire graph network, such as PMI as previously described.
Regarding material costs and availability, accuracy during the synthesis planning is difficult to achieve. Often decision makers turn to non-specialised catalogue companies for a guide price and then apply an economy of scale factor. Here is another opportunity for a connected graph database: historical and up-to-date pricing for raw materials, pre-GMP7 intermediates and transformations8 can be built into the graph, allowing predictive algorithms to make better cost estimates of these new entities. This layer enables the optimisation of realistic production costs, which may raise the priority of other route options often overlooked based purely on synthetic characteristics. Realising new synthetic routes based on cost and synthetic novelty offers companies options either to protect intellectual property or to allow freedom to operate. Thus, protecting or disclosing different routes is based on different weightings of the design parameters, weightings that may change over the lifecycle of the product as new information becomes available and is added to the graph.
Just considering the API alone, large pharmaceutical companies typically cover a significant proportion of the business space. This operational model requires large resource investments to operate. To become more agile, some companies are becoming more modular and outsourcing activities to contract research organisations (CROs), universities and research institutes. In this paradigm, a key limitation is the efficiency of data and knowledge transfer between organisations. Domain specific information lacks standardisation and its application is not systematic. Inevitably data is lost, transferred in a non-digital manner, or not shared at all. In the proposed approach, specific graphical user interfaces could be designed to allow controlled portions of the graph to be shared with these third parties, allowing enrichment of the chemical information directly in the database. Access control would be exercised based on centralised policies of data protection.
In addition, chemical process development traditionally relies on inputs from chemists, scientists, and engineers following well-established workflows. However, decisions are still subjective and lack algorithmic support. Critical data from large-scale manufacturing and other sources remain disconnected. Our approach introduces a method to define commercial synthesis routes. Furthermore, it can be expanded to encompass various engineering and operational aspects, including unit operations and physical transformations, effectively bridging the gap between route formulation and manufacturing.
The emergence of graph databases is opening completely new horizons for chemical sciences. At the molecular level, traditional chemistry representations can be directly represented as a graph. Graph databases excel in representing and storing complex relationships between entities, making them well-suited for capturing the intricate structures of molecules and, at the same time, providing a suitable place to store individual domain properties. At the synthetic route level, the ability to identify relationships between various chemical entities, reactions, and properties, and to enable simple query and data retrieval, would satisfy to a large extent many scientists' needs.
Capturing chemical data in graph databases empowers data reusability, offering access to knowledge from past projects. It transforms the dynamics of the decision-making by providing a contemporaneous picture, which changes with new information. Such platforms emerge as a way of training advanced AI models, addressing one of the main problems these tools are facing today.
Finally, a long-term proposal involves creating a universal chemistry data framework centred around graph database technology, accessible to all. Identifying the most suitable agency, forum, or legal body to establish these standards poses challenges. IUPAC stands out as a natural candidate due to its international recognition and expertise in chemical sciences. However, implementation is complex, requiring data sharing and cooperation among organisations. Commercial entities can also contribute significantly by providing necessary infrastructure. For instance, platforms similar to Wikipedia, which are privately funded but open to the public, could serve as models for facilitating widespread data sharing and collaboration. Alternatively, a consortium of companies, universities, and institutions could financially support the creation of a global centralised repository, resembling the features described in this paper. This repository could host public and private layers, allowing each company private access while enabling disclosure of legally protected or obsolete parts of their explored chemical space to the public. Such a centralised repository could bring an unprecedented level of collaboration across the chemical sciences and bridge the gap to many other disciplines.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00120f
This journal is © The Royal Society of Chemistry 2024