Silvia Chines a, Christiane Ehrt b, Marco Potowski‡ a, Felix Biesenkamp a, Lars Grützbach a, Susanne Brunner c, Frederik van den Broek d, Shilpa Bali d, Katja Ickstadt c and Andreas Brunschweiger *a
a TU Dortmund University, Department of Chemistry and Chemical Biology, Otto-Hahn-Str. 6, 44227, Dortmund, Germany. E-mail: andreas.brunschweiger@tu-dortmund.de
b Universität Hamburg, Bundesstr. 43, 20146, Hamburg, Germany
c TU Dortmund University, Department of Statistics, Vogelpothsweg 87, 44227, Dortmund, Germany
d Elsevier B.V., Radarweg 29, 1043 NX Amsterdam, The Netherlands
First published on 1st September 2022
Databases contain millions of reactions for compound synthesis, rendering selection of reactions for forward synthetic design of small molecule screening libraries, such as DNA-encoded libraries (DELs), a big data challenge. To support reaction space navigation, we developed the computational workflow Reaction Navigator. Reaction files from a large chemistry database were processed using the open-source KNIME Analytics Platform. Initial processing steps included a customizable filtering cascade that removed reactions with a high probability to be incompatible with DEL, as they would e.g. damage the genetic barcode, to arrive at a comprehensive list of transformations for DEL design with applicability potential. These reactions were displayed and clustered by user-defined molecular reaction descriptors which are independent of reaction core substitution patterns. Thanks to clustering, these can be searched manually to identify reactions for DEL synthesis according to desired reaction criteria, such as ring formation or sp3 content. The workflow was initially applied for mapping chemical reaction space for aromatic aldehydes as an exemplary functional group often used in DEL synthesis. Exemplary reactions have been successfully translated to DNA-tagged substrates and can be applied to library synthesis. The versatility of the Reaction Navigator was then shown by mapping reaction space for different reaction conditions, for amines as a second set of starting materials, and for data from a second database.
DELs have proven to be productive for hit identification for many target classes such as hydrolases, kinases and epigenetic targets.3–7
Despite its substantial success, this technology still poses many challenges. One particular challenge is the accessible chemical space of “DEL-compatible” chemistry, which is defined by the limits of DNA chemical stability. Extreme pH values, strong oxidants, mutagens and many metal catalysts are prohibited, especially in combination with forcing reaction conditions.8 To date, the validated DEL chemical reaction space has been quite narrow, as stated in the review article by Franzini and Randolph9 in 2016 and further supported by Fitzgerald and Paegel10 in 2021. Both concluded that DEL design depended on “a few good reactions”,10 namely robust transformations that tolerate a vast scope of building blocks. However, the limited toolbox of validated DEL chemistry has been criticized in the literature as it biases chemical space coverage.9 Limited encoded chemical space coverage might have contributed to the lower productivity of DELs in terms of identification of clinical development candidates as compared to other hit-finding technologies.11
As chemical reactions are a key source of scaffold diversity, expanding the chemical reaction space for DNA-encoded libraries is an attractive strategy to diversify encoded library design. Accordingly, in recent years, considerable effort has been dedicated to implementing diverse reactions into DEL design.12 These efforts included the application of modern methods in catalysis, such as photocatalysis13 and micelles,14 the use of resins for DNA immobilization,15 and the development of barcoding strategies, such as solid-support-initiated encoded chemistry16 and PNA-encoded chemistry (peptide nucleic acids).17 Despite the technological advances that enable a larger DEL reaction scope, the main modus operandi for reaction selection remains to screen scientific journals according to combinatorial chemistry awareness, to the medicinal chemist's toolbox, or to concepts such as DOS (diversity-oriented synthesis).18,19 An alternative to individual researchers' knowledge of chemical reactions would be a data science-based approach allowing for navigation of huge chemistry databases such as USPTO,20 Reaxys21 and CAS.22 Data science tools have already been applied successfully to support chemists,23,24 and remarkable work has been conducted, mainly covering three aspects: predicting synthetic strategies (Fig. 1A), classifying reactions, and designing libraries (Fig. 1B).
MEGAN (Molecule Edit Graph Attention Network) from the Jastrzębski group25 and DeepReact+ by Gong et al.26 stand out for the optimization of reaction conditions, while the landmark research works of Segler et al.27 and Chematica from the Grzybowski group, as well as MEGAN, support retrosynthetic analysis.25,28 For reaction classification and mapping, some examples are: the attention-based neural network developed by Schwaller et al.;29 the ReactionCode by Delannée et al., which classifies, searches and balances reactions;30 the Reaction Recommender by Ghiandoni et al., which assists with reaction selection via a multi-label classification algorithm;31 and the reaction difference fingerprints developed by Andronov et al.32 Lastly, in the area of library design, two tools shall be mentioned: eDESIGNER, developed in 2020 by Martín et al.,33 and SynthI, presented by Zabolotna et al.34 The former uses established reactions in the DEL field, and the latter, though useful for building block-based library design, is not tailored to DEL chemistry. Thus, there is a need for computational tools that aid in the identification of suitable reactions for DEL synthesis from large chemistry databases. However, chemists who lack programming knowledge may find it challenging to adapt existing computational workflows to a task such as DEL chemistry design, because these workflows are based on complex algorithms such as machine learning, deep learning or attention networks.
With our computational workflow called “Reaction Navigator”, we aim to efficiently identify a comprehensive suite of reactions for DNA-encoded library synthesis from building blocks sharing a common functional group. This tool charts the chemical reaction space to facilitate the process of reaction identification and, as a consequence, of DEL design. The workflow is user-friendly, as it is based on pre-programmed nodes within the KNIME Analytics Platform interface, which process, analyze and visualize large data sets.35 This software has already proven its utility in the chemical sciences36 and it is supported by a large community via the KNIME Forum and the KNIME Hub.37
Fig. 2 (A) Input and rationale: reactions starting with the carboxaldehyde building block with unknown products. (B) Complete scheme of the KNIME workflow with the five modules. LGs: leaving groups.
The approx. 44000 reactions remaining after the filtering step in module I were subsequently scored according to their reagents/catalysts. As the distinction between reagents and catalysts was not clear in the provided dataset, we merged them under the name mediators. The compatibility with DNA-encoded chemistry was judged by three criteria: the potential of a reagent to cause mutagenic lesions, its redox potential in particular, and the practical feasibility. The first two criteria affect the DNA stability and, as a consequence, its reliability as a barcode at later stages. The third criterion influences the applicability of the reaction conditions in a DEL environment; for example, working under inert atmosphere with 96-well plates could be challenging. KNIME report I describes the scoring system that we devised. A score of “0” means that the mediator has a high potential to damage DNA; many transition metal-based reagents fall into this class. A “1” was given to reagents that need to be tested, and a “2” to reagents with high probability for or proven DNA compatibility, for instance proline-based organocatalysts. Finally, the score “3” was assigned to catalysts that proved to be DNA-compatible. Furthermore, solvents with a higher boiling point were given preference, as they are more suitable for microliter-scale reactions. When choosing a cluster of interesting reactions, the reaction conditions can be sorted by score, reaction temperature, and solvent boiling point, and the top-ranking conditions can be selected for testing on DNA-encoded substrates. The scored reactions were processed in the second module. In this step, reactions yielding unstable products such as acetals, yielding side products, or involving duplicate identical reactants were excluded.43,44 The last part of this module deleted salts and additives from the reaction schemes, as they would have affected the subsequent descriptor calculation (KNIME report II in the ESI†).
This module removed roughly 11000 reactions.
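The ranking of reaction conditions described above can be illustrated with a minimal Python sketch. It is not the KNIME implementation: the 0 to 3 scale follows the text, but the lookup table `MEDIATOR_SCORES`, the `Condition` record, the default score for unknown mediators and the preference for lower temperatures are assumptions made for illustration only.

```python
from dataclasses import dataclass

# Hypothetical mediator-to-score lookup. The 0-3 scale follows the text;
# the individual assignments are illustrative assumptions.
MEDIATOR_SCORES = {
    "transition metal reagent": 0,
    "untested reagent": 1,
    "proline-based organocatalyst": 2,
    "validated DNA-compatible catalyst": 3,
}


@dataclass
class Condition:
    mediator: str
    temperature_c: float
    solvent_bp_c: float


def score(cond):
    # Assumption: unknown mediators default to 1 ("needs to be tested").
    return MEDIATOR_SCORES.get(cond.mediator, 1)


def rank(conditions):
    # Highest score first, then lowest temperature (assumed preference),
    # then highest solvent boiling point (preferred according to the text).
    return sorted(
        conditions,
        key=lambda c: (-score(c), c.temperature_c, -c.solvent_bp_c),
    )


# Toy input, not taken from the data set.
conditions = [
    Condition("transition metal reagent", 25, 153),
    Condition("proline-based organocatalyst", 60, 189),
    Condition("proline-based organocatalyst", 25, 100),
    Condition("validated DNA-compatible catalyst", 80, 66),
]
ranked = rank(conditions)
for c in ranked:
    print(score(c), c.mediator, c.temperature_c, c.solvent_bp_c)
```

Sorting on a single tuple key keeps the three criteria in a fixed priority order, which mirrors a multi-column sort in a table view.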
With 33000 reactions in hand, we manually added 33 published DEL reactions, some of which were already present in the provided dataset, to serve as landmarks in the final map and facilitate navigation of the chemical reaction space. The complete list is depicted in Fig. S6 in the ESI.† Some reference examples are the Ugi four-component reaction,45 the Biginelli three-component reaction,46 and the SnAP reaction.47
The descriptor calculation revealed that two reactions featured prominently in the data set: reductive amination accounted for approx. 7500 reactions, while aldol condensation of aldehydes and ketones corresponded to roughly 800 entries. We decided to remove them by similarity search, as they represented the equivalent of the hay in the haystack for our purpose. This additional filter left us with around 18000 transformations. Notably, only one sixth of the initial data set was kept at this point.
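A similarity filter of this kind can be sketched as follows. The text does not state which fingerprint or similarity measure was used, so the Tanimoto coefficient on sets of fingerprint bits, the threshold of 0.8 and the names `tanimoto` and `remove_similar` are assumptions for illustration.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def remove_similar(reactions, query, threshold=0.8):
    """Keep only reactions whose similarity to the query is below the threshold."""
    return [
        name
        for name, fingerprint in reactions.items()
        if tanimoto(fingerprint, query) < threshold
    ]


# Toy fingerprints (sets of "on" bits), not derived from real reactions.
reactions = {
    "rxn_a": {1, 2, 3, 4},
    "rxn_b": {1, 2, 3, 9},
    "rxn_c": {7, 8},
}
query = {1, 2, 3, 4}  # stands in for an over-represented reaction type

kept = remove_similar(reactions, query)
print(kept)
```

Lowering the threshold removes more reactions, so the cut-off controls how aggressively the over-represented reaction type is pruned.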
Fig. 4 Charting and classification. (A) Scatter plot of the clustered reactions in the 3D chemical space. The dimensions are the three variables RINGS, BONDS and HETEROATOMS and the colors of the dots depend on the cluster affiliation. The small dots belong to the Reaxys data set, while the big dots represent the reference reactions. The cluster affiliation confirms the chemical similarity as emphasized by the reference reactions circled in yellow: the two closer purple dots are the Ugi and the Cushman reactions, while the green dot is the Petasis reaction. (B) Pie chart featuring the classes proportion in the data set. (C) Highlighted areas in the scatter plot, identified by the rule-based classes, and examples for each area.64–67 Enlarged in the ESI.†
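To make the idea of a substitution-independent reaction descriptor concrete, here is a small Python sketch. The exact definition of RINGS, BONDS and HETEROATOMS is not given in this section; the sketch assumes each is the product count minus the summed reactant counts, and it groups reactions with identical descriptor triples as a simple stand-in for the clustering step. All names and numbers are hypothetical.

```python
from collections import defaultdict

KEYS = ("rings", "bonds", "heteroatoms")


def descriptor(reactant_counts, product_counts):
    # Assumed definition: change in each count from reactants to product.
    return tuple(
        product_counts[k] - sum(r[k] for r in reactant_counts) for k in KEYS
    )


def group_by_descriptor(reactions):
    # Reactions with the same descriptor triple land in the same group.
    groups = defaultdict(list)
    for name, (reactants, product) in reactions.items():
        groups[descriptor(reactants, product)].append(name)
    return dict(groups)


# Toy counts; rxn_1 and rxn_2 differ in size but share the same change.
reactions = {
    "rxn_1": (
        [{"rings": 1, "bonds": 8, "heteroatoms": 1},
         {"rings": 0, "bonds": 3, "heteroatoms": 1}],
        {"rings": 2, "bonds": 12, "heteroatoms": 2},
    ),
    "rxn_2": (
        [{"rings": 2, "bonds": 15, "heteroatoms": 3},
         {"rings": 0, "bonds": 5, "heteroatoms": 1}],
        {"rings": 3, "bonds": 21, "heteroatoms": 4},
    ),
    "rxn_3": (
        [{"rings": 1, "bonds": 8, "heteroatoms": 1},
         {"rings": 0, "bonds": 2, "heteroatoms": 1}],
        {"rings": 1, "bonds": 11, "heteroatoms": 2},
    ),
}

groups = group_by_descriptor(reactions)
print(groups)
```

Because only differences enter the descriptor, substituents that are carried unchanged through the reaction cancel out, which is one way to obtain independence from the substitution pattern.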
In this last module, we investigated three aspects of the chemical reaction space: versatile starting materials, common vs. rare catalysts and the accessible scaffolds. Firstly, by counting the number of clusters in which reactants were found, we could identify versatile starting materials, which form a variety of diverse scaffolds. Ten of these are shown in the ESI (Fig. S17†). The most prominent ones were malononitrile, found in 42 clusters, and dimethylcyclohexane-1,3-dione, found in 35 clusters.
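Counting the clusters in which each reactant occurs can be sketched in a few lines of Python. The input format (pairs of reactant name and cluster number) and the function name `clusters_per_reactant` are assumptions; the toy records below are not taken from the data set.

```python
from collections import defaultdict


def clusters_per_reactant(records):
    """Return (reactant, number of distinct clusters), most versatile first."""
    seen = defaultdict(set)
    for reactant, cluster in records:
        seen[reactant].add(cluster)
    return sorted(
        ((reactant, len(clusters)) for reactant, clusters in seen.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Toy (reactant, cluster) pairs for illustration only.
records = [
    ("malononitrile", 3),
    ("malononitrile", 7),
    ("malononitrile", 7),  # repeated cluster is counted once
    ("malononitrile", 12),
    ("aniline", 3),
    ("aniline", 5),
    ("nitromethane", 5),
]

versatility = clusters_per_reactant(records)
print(versatility)
```

Using a set per reactant ensures that a reactant appearing many times within one cluster is not mistaken for a versatile one.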
Secondly, we aimed to identify versatile metal catalysts and reagents with the intent of testing their DNA compatibility, as they potentially give access to many different reactions for DEL design. To do so, we extracted the metal centers (metal ions in salts or complexes as well as the elements) from the database and analyzed their shares (Table S6 in the ESI†). The pie chart on the left-hand side in Fig. 5 shows that 70% of reactions did not contain any reagents in the provided data set, and that the metal centers we defined covered slightly more than half of the remaining portion. The pie chart on the right-hand side in Fig. 5 shows that among the defined metal centers, the most common are zinc in the form of zinc powder and diethylzinc with 365 reactions, and titanium as titanium tetrachloride with 280 reactions. Copper as copper(II) bis(trifluoromethanesulfonate) and silver, mainly in the form of silver acetate, followed with 202 and 197 reactions, respectively. Molybdenum and hafnium, with 3 and 4 reactions, respectively, were examples of rare metal centers in this database.
Fig. 5 Pie chart on the left-hand side: 70% of the reaction data did not contain a mediator in the provided table and 14% showed mediators that fell outside the defined mediators (e.g. organocatalysts). Pie chart on the right-hand side: shares of the defined metal centers in the provided data set. Zinc accounted for 19%, titanium for 15%, silver and copper for 10% and palladium for 9%. The rest of the metal centers were grouped as rare (see ESI for the full detailed Table S6†).
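The extraction of metal centers and the calculation of their shares can be sketched as below. The keyword lookup `METAL_KEYWORDS`, the labels "none" and "other", and the toy mediator list are assumptions for illustration; the actual workflow and its list of defined metal centers are described in the ESI.

```python
from collections import Counter

# Hypothetical keyword-to-element lookup for a handful of metal centers.
METAL_KEYWORDS = {
    "zinc": "Zn",
    "titanium": "Ti",
    "copper": "Cu",
    "silver": "Ag",
    "palladium": "Pd",
}


def metal_center(mediator):
    """Map a mediator name to a metal center, 'other', or 'none'."""
    if not mediator:
        return "none"  # no mediator listed for this reaction
    lowered = mediator.lower()
    for keyword, symbol in METAL_KEYWORDS.items():
        if keyword in lowered:
            return symbol
    return "other"  # e.g. organocatalysts outside the defined centers


def shares(mediators):
    """Fraction of reactions per metal center."""
    counts = Counter(metal_center(m) for m in mediators)
    total = sum(counts.values())
    return {center: n / total for center, n in counts.items()}


# Toy input of ten reactions, not taken from the data set.
mediators = [None] * 7 + ["L-proline", "zinc powder", "diethylzinc"]
result = shares(mediators)
print(result)
```

A plain substring match is the simplest possible rule and would need refinement for real reagent names, for example to handle element symbols inside formulas.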
Additionally, we were curious to learn about a possible correlation between metal centers and reaction clusters. This information would guide the choice of catalysts when optimizing reaction conditions. A note must be made as a premise: on average, roughly 70% of the entries in each cluster featured an undefined mediator, for example organocatalysts. It is striking that four out of the five clusters mentioned above, containing aldol additions or condensations, aminoalkylations or amidoalkylations and cyanation reactions, show a similar pattern in terms of catalysts, as visible in the heat map in Fig. S18 in the ESI,† plotting clusters vs. metal centers. Indeed, clusters 12, 48 and 52 each comprise 40% zinc-catalyzed and 30% titanium-catalyzed reactions. A clear correlation is visible for zinc and cluster 48, as depicted in the bar chart and respective pie chart in Fig. 6. Fig. 6 also illustrates one exemplary reaction from this cluster.53
Fig. 6 Correlation between cluster 48 and the zinc metal center, with an example of a zinc-catalyzed reaction in cluster 48.68
A similar but less marked trend is detectable for titanium, which promoted the same reaction types. This observation may signify that these two metals are often employed to catalyze the same class of reactions, in some cases even in combination. Another correlation could be noticed for palladium and cluster 50, which consists of acylations. The comprehensive heat map and the results of the whole analysis are described extensively in the ESI (Fig. S21A–C†).
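The cluster vs. metal center heat map boils down to a cross-tabulation that is normalized per cluster. A minimal Python sketch is given below; the input format and the function name `cluster_metal_fractions` are assumptions, and the toy records are chosen for illustration rather than taken from the data set.

```python
from collections import Counter, defaultdict


def cluster_metal_fractions(records):
    """For each cluster, return the fraction of reactions per metal center."""
    counts = defaultdict(Counter)
    for cluster, metal in records:
        counts[cluster][metal] += 1
    return {
        cluster: {
            metal: n / sum(metal_counts.values())
            for metal, n in metal_counts.items()
        }
        for cluster, metal_counts in counts.items()
    }


# Toy (cluster, metal center) pairs for illustration only.
records = (
    [(48, "Zn")] * 4
    + [(48, "Ti")] * 3
    + [(48, "other")] * 3
    + [(50, "Pd")] * 2
    + [(50, "other")] * 2
)

fractions = cluster_metal_fractions(records)
for cluster, row in fractions.items():
    print(cluster, row)
```

Normalizing within each cluster makes clusters of different size comparable, so a row dominated by one metal stands out regardless of how many reactions the cluster contains.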
Finally, we aimed to analyze the accessibility of frequently used or “privileged” scaffolds in medicinal chemistry from the aldehyde functional group. For this, we referred to the work of Taylor et al., who enumerated the most common scaffolds in approved drugs.54 We searched for the first 15 entries from that publication in our data set and could determine tetrahydropyran to be the most commonly formed scaffold in 35 reactions, followed by pyridine with 19 reactions (Fig. S22 in the ESI†). In contrast, three out of 15 scaffolds were not detected in this dataset: cephem, indole and penam.
Additionally, our analysis uncovered four reactions that are likely not suitable for DEL synthesis as they employ nucleosides as starting materials. We would call them anti-reactions, as they may compromise DNA integrity. They involve malononitrile, epoxides and acrylic aldehyde as shown in Fig. S23 in the ESI.†55–58 Such reactions may be feasible on nucleobase-protected DNA barcodes.
The first reaction extracted was a metal-catalyzed quinoline synthesis. The synthesis of this scaffold was reported in several publications using diverse metals that should be compatible with an on-DNA approach.59 Therefore, we started to investigate the transfer of the quinoline synthesis onto a CPG-bound 10mer pyrimidine–aldehyde conjugate 1 with aniline 2a and N-Boc propargylamine 3a in the presence of different metal salts in a mixture of dimethylacetamide and triethyl orthoformate at 50 °C (Scheme 1 and Table S8,† entries 1–5). In the absence of any metal salt, no product formation was detectable (Table S8,† entry 1). With FeCl3, Yb(OTf)3 or Sc(OTf)3 as promoter, the desired product 4 was not formed either (Table S8,† entries 2–4). To our delight, in the presence of Cu(OTf)2 the desired DNA-quinoline 4 was formed with 59% conversion (Table S8,† entry 5), a finding which is consistent with literature reports that Cu(OTf)2 is a suitable catalyst for the quinoline synthesis.59 Encouraged by these results, we further improved the product formation by varying the concentrations of aniline 2a (Table S8,† entries 5–7), N-Boc propargylamine 3a (Table S8,† entries 5, 8–10) and the promoter Cu(OTf)2 (Table S8,† entries 9 and 11), and by investigating the impact of the solvent (Table S8,† entries 11 and 17) as well as the reaction temperature (Table S8,† entries 12 and 18). The optimal reaction conditions were the treatment of CPG-bound TC–aldehyde conjugate 1 with aniline 2a (200 mM), N-Boc propargylamine 3a (400 mM) and Cu(OTf)2 (20 mM) in dimethylacetamide at 80 °C overnight (Table S8,† entry 17). The DNA-quinoline conjugate 4 was formed with a conversion of 75%, along with 16% of undefined by-products, which could be easily removed by semipreparative HPLC.
Scheme 1 On-DNA reaction to form the quinoline scaffold. 10mer TC = 5′-TTC CTC TCC T-3′-CPG. AMA = 30% aqueous ammonia/40% aqueous methylamine, 1:1 (vol/vol).
With the optimized reaction conditions in hand, we started to transfer the reaction onto a CPG-coupled 10mer ATGC–aldehyde conjugate 5. However, besides product formation with 60% conversion, we observed 33% DNA degradation (Table S9,† entry 1). To reduce the DNA damage, we fine-tuned the reaction conditions further by using lower amounts of the promoter Cu(OTf)2 or reducing the reaction time (Table S9,† entries 2–4). To our delight, the reaction proceeded at a 5 mM concentration of the promoter with a conversion of 67% and an acceptable degree of DNA damage of 14% (Table S9,† entry 3). Based on these findings, we investigated the scope of the quinoline synthesis on CPG-bound ATGC oligonucleotide–aldehyde conjugate 5 using diverse anilines 2 (Scheme 2). Independent of the electronic properties of the substituents, reactions with fluorine, bromine, ethyl, tert-butyl as well as methoxy groups proceeded smoothly, leading to the desired products (6b–q). However, the position of the substituents had an impact on the reactivity of the anilines. While substituents in para or meta position resulted in moderate to excellent conversions of 61 to 87%, substituents in ortho position lowered the conversions, down to 38%. With bulkier substituents such as bromine or tert-butyl in ortho position, the conversion dropped to below 5% (6h and 6m). For meta substituents the formation of regioisomers is possible; however, regioisomers were not detected by the analytical methods. Cyano-substituted aniline led to the desired product 6r with a lower conversion of 19%. In contrast, hydroxy and ester functions were well tolerated with conversions up to 69% (6s–6w). The methyl ester is not stable under cleavage conditions and reacts in the presence of AMA to give the corresponding methyl amide (6u). Anilines containing two different substituents were also competent substrates for the quinoline synthesis, with conversions of 75 and 83% (6x and 6y).
Lastly, we tested further alkynes 3. Tert-butyl (S)-2-ethynylpiperidine-1-carboxylate can be used in the quinoline synthesis, although the conversion to product 6aa was low at 18%. Phenylacetylene reacted smoothly to give the desired product 6ab with 61% conversion, even if the product is quite aromatic and therefore a less attractive scaffold. In most cases, the formation of undefined by-products as well as an acceptable degree of DNA damage were observed. However, the target DNA-quinoline conjugates 6 could be isolated by semi-preparative HPLC purification in all cases.
The second interesting scaffold-forming reaction was promoted by iron(III) or nickel(II) salts and produced a substituted pyrrole scaffold from an aldehyde, an amine and β-ketocarbonyls with nitromethane as solvent.60 Only a few conditions were screened, and good product formation was achieved at 80 °C for 6 hours (Table S13 in the ESI†). Notably, both the temperature and the reaction time were lower than in the original publication. The reaction of five different amines 8 was explored on the aldehyde substrate 7: benzylamine, p-methoxyaniline, p-nitroaniline, 2-aminopyridine and cyclopentylamine (Scheme 3). We could clearly detect product formation for the electron-rich benzylamine and p-methoxyaniline, while the electron-poor p-nitroaniline and 2-aminopyridine, as well as cyclopentylamine, were not competent substrates (Table S14 in the ESI†). Currently, we are investigating a broader scope of starting materials for this reaction.
The third transformation was a retro-aza-Michael addition involving a ring expansion reaction, promoted by diethylaluminium iodide (Et2AlI). The reaction between a cyclopropane-based thioester 12 and an imine was expected to form a diversely substituted pyrrolidine scaffold.61 When applied to the protected DNA substrate 11 on controlled pore glass (CPG), the reaction did not yield any product 14; instead, we observed significant DNA degradation. Therefore, we decided to test MgI2 as a suggested alternative mediator, but this Lewis acid surprisingly cleaved the C6-amino linker at the 5′-terminus of the oligonucleotide (Scheme 4 and Table S16 in the ESI†).
Scheme 4 On-DNA reaction to form the pyrrolidine scaffold. Product formation was detected by MALDI-TOF. 10mer TC = 5′-TTC CTC TCC T-3′-CPG, DCE = dichloroethane.
Experimental data from unsuccessful reactions are still very valuable as they allow us to improve the filtering cascade.
In the scatter plot in Fig. S25,† where some examples are highlighted with the respective reaction cores, it is noticeable that reactions clustered together either form similar scaffolds or involve similar transformations.
In both cases, a striking reduction in the number of reactions was observed during the treatment of reactions with leaving groups. From 10000 reactions, only ca. 1400 aldehyde-starting and 1000 amine-starting reactions were clustered, respectively. The leaving group treatment could still be improved by machine learning, for example by using RXNMapper to identify the missing substructures in the products, which paves the way for fruitful collaborations.69
Despite its simplicity, the Reaction Navigator gave us deep insight into the potentially DEL-compatible chemical reaction space (Fig. 4) and valuable information to guide future research, such as correlations between metal catalysts and certain reaction types, reactant versatility and scaffold accessibility. Above all, the charted reactions suggested some examples that were successfully tested on DNA-tagged substrates and could potentially be implemented in library synthesis.
However, limitations remain that are due to incomplete knowledge about DNA chemical stability under given reaction conditions. These are exemplified by the identification of clusters of reactions that have a low probability of being translatable to DEL, such as Grignard reactions, which need to be performed under strictly dry conditions, or titanium tetrachloride-promoted reactions, which will set free large amounts of hydrogen chloride. Data quality is a second limitation. In fact, a standardized method for presenting reaction schemes and reaction conditions data is indispensable for further development of cheminformatics or machine learning techniques. Promising advancements in this direction have been proposed, on the one hand, with ELNs (Electronic Laboratory Notebooks)62 and, on the other hand, with the creation of the ORD (Open Reaction Database) in 2021.63 Furthermore, manual steps such as the scoring process and the leaving group treatment might be improved with machine learning techniques to reduce human intervention to a minimum. Finally, the chemical stability of DNA needs to be investigated under a multitude of reaction conditions to improve the filter cascade, as indicated by the failed reaction.
Obvious applications and extensions of the Reaction Navigator are experimental testing of more examples from the charted reactions on DNA-tagged substrates, and an application of the workflow to further functional groups commonly used in DEL synthesis such as amines, aryl/alkyl halides, and carboxylic acids. Additionally, the Reaction Navigator could be tailored to navigate the plausible reaction space for other classes of (bio-)molecules such as proteins, peptides, or natural products that are to be diversified by chemical methods. In summary, the Reaction Navigator is a data science tool with a low barrier to use for the non-expert, which can aid the chemist in decision making. With this rationale, we expect to see a surge of approaches directed toward the expansion of the chemical and biological reaction space in library design.
Footnotes
† Electronic supplementary information (ESI) available. See https://doi.org/10.1039/d2sc02474h
‡ Current address: Serengen GmbH, Emil-Figge-Str. 76a, 44227, Dortmund, Germany.
§ Copyright© 2022 Elsevier Limited except certain content provided by third parties. Reaxys® is a trademark of Elsevier Limited. Reaxys data were made accessible to our research project via the Elsevier R&D Collaboration Network.
This journal is © The Royal Society of Chemistry 2022