Daniel
Bosch
a,
Jun
Wang
b and
Lluís
Blancafort
*a
aDepartament de Química, Institut de Química Computacional i Catàlisi, Universitat de Girona. Facultat de Ciències, C/M. A. Capmany 69, 17003 Girona, Spain. E-mail: lluis.blancafort@udg.edu
bJiangsu Key Laboratory for Chemistry of Low-Dimensional Materials, Jiangsu Engineering Laboratory for Environment Functional Materials, Huaiyin Normal University, No. 111 West Changjiang Road, Huaian 223300, Jiangsu, P. R. China
First published on 4th July 2022
Eumelanin is the biopolymer responsible for photoprotection in living beings and holds great promise as a smart biomaterial, but its detailed structure has not been characterized experimentally. Theoretical models are urgently needed to improve our knowledge of eumelanin's function and exploit its properties, but the enormous amount of possible oligomer components has made modelling not possible until now. Here we show that the stability and lowest vertical optical absorption of 5,6-dihydroxyindole (DHI) eumelanin dimer components can be modeled with deep neural networks, using fingerprint-like molecular representations as input. In spite of the modest data set size, average errors of only 6 and 9% for stability and S1 absorption energy are obtained. Our fingerprints code the connectivity and oxidation patterns of the dimers in a straightforward, unambiguous way and can be extended to larger oligomers. This proof-of-principle work shows that machine learning can be applied to help solve the structural challenge of melanin.
In spite of these promising aspects, eumelanin's potential as a smart biomaterial has not been realized yet, mostly because its atomic-level structure is unknown.1 Eumelanin is formed by oligomers of 5,6-dihydroxyindole (DHI) and DHI-2-carboxylic acid (DHICA) that undergo oxidative polymerization to yield linear, macrocyclic or polycyclic oligomers (see Fig. 1). However, the exact oligomer composition and the way the oligomers assemble into eumelanosomes are not known due to the heterogeneous, chemically intractable nature of the biopolymer, which prevents further characterization. It is thought that the heterogeneous nature of the oligomers contributes to the broad absorption spectrum,2 and the redox ability and the ensuing conductive properties are probably related to the presence of readily interconvertible quinone and catechol groups,8,12 but the structural basis of the different functions has not been determined precisely yet. The structural knowledge gap is therefore one of the main obstacles to realizing melanin's potential.
Fig. 1 Hierarchical model of the eumelanin structure: DHI/DHICA monomers and oligomers obtained by oxidative polymerization (only DHI oligomers shown) and further assembly to eumelanosomes. |
Theoretical models can help to address the urgent challenge of improving melanin's structural characterization, but generating them is a difficult task because there are hundreds of thousands if not millions of possible oligomers due to the manifold of possible connectivity and oxidation patterns, which leads to combinatorial explosion.13,14 Therefore it has been speculated that machine learning techniques can be applied to advance towards the elucidation of melanin's structure.3 Here we show that it is in fact possible to apply deep neural networks (NNs) to model the main properties of eumelanin DHI dimers: their thermodynamic stability, which is important in order to understand which fragments may be present in the biopolymer, and the lowest vertical electronic absorption, which is related to the broad absorption spectrum and the photoprotecting ability. The key to this application is the use of fingerprint-like representations15,16 that allow to codify the structure of melanin oligomers in a straightforward way. Our work therefore shows how machine learning can contribute to the solution of melanin's pressing structural challenge.
Fig. 2 Examples of a linear dimer, #71 (a), and a cyclic one, #339 (b), together with their dimer names,17 and codification of their connectivity on the basis of bonds, fragments and sites. F1 and F2 stand for fragment 1 and 2 of the dimer, while 2@F1 represents site 2 on fragment 1, and similar for the other codes. |
The structures of the dimers need to be codified into numeric molecular representation strings that provide the input values for the NN. The oxidation pattern can be codified in a straightforward way using six binary digits that represent the six heteroatoms of the dimer. This is illustrated in Fig. 2 with two dimers, a linear (#71) and a cyclic one (#339). We also present the dimer names following the nomenclature of ref. 17. In these digits, a reduced and an oxidized site are codified by ‘0’ and ‘1’, respectively. Codifying connectivity is possible in a number of different ways, and in Fig. 2, we propose three approaches: (a) a bond-based approach where the connectivity is characterized by the numbering of the atoms that form the interfragment bond (see Fig. 1 for the numbering); in the linear #71 dimer, there is one bond between atoms 3 and 4 signified as (3,4), and in the cyclic #339 dimer, there is a bond between atoms 2 and 4, and another one between atoms 3 and 3, signified as (2,4) and (3,3). (b) A fragment-based approach, where every fragment is signified by the atoms that form the interfragment bonds; #71 is composed of fragments (3) and (4), and #339 of fragments (2,3) and (3,4). (c) A site-based approach where the dimers are signified by the sites having an interfragment bond; in #71, this is site 3 on fragment 1 (3@F1) and site 4 on fragment 2 (4@F2), and in #339, they are sites 2 and 3 on fragment 1 (2@F1, 3@F1), and sites 3 and 4 on fragment 2 (3@F2, 4@F2).
In Fig. 3 we show how these approaches are translated into molecular strings that codify the structure in an unambiguous, unique way. Complete rules to generate the strings are given in ESI, Section ESI2.† The first three are quasi-binary (QB) representations where the digits have possible values of ‘0’ and ‘1’ with exceptions where values of ‘2’ are also possible. The quasi-binary bond-based (QBB) representation has 19 digits codifying the possible interfragment bonds in the dimer. This includes C–C bonds for the linear and cyclic dimers, and also N–C and O–C bonds in the cyclic dimers (O–C bonds are codified using the numbering of the carbon position carrying the corresponding oxygen, i.e. a bond to site 5 or 6 means a bond to the oxygen on C5 or C6). These digits have a value of ‘0’ except for the ones that represent the bonds of the dimer. For instance, dimer #71 has value ‘1’ at the digit corresponding to the (3,4) bond. The quasi-binary fragment based (QBF) representation has 10 connectivity digits that represent the possible fragments of the linear and cyclic dimers, and dimer #71 has value ‘1’ at the digits of the (3) and (4) fragments. Finally, the quasi-binary site-based (QBS) representation has 12 connectivity digits representing possible connectivity sites, and dimer #71 has value ‘1’ at the digits corresponding to site 3 in the first fragment and 4 in the second one. All representations also have 6 digits that codify oxidation in the way explained above. Dimer #71 has two ‘1’ values at the digits that code the oxygen atoms of the first fragment. Finally, there is one additional digit that codifies stereochemistry of the interfragment bond in the linear dimers (cis or trans) or resolves ambiguous regiochemistry in the connectivity of the cyclic dimers.
Fig. 3 Generation of molecular string representations for linear dimer #71 of Fig. 2. The representations of cyclic dimer #339 are given in Fig. S2.† |
We have also tested two alternative representations to the QB ones. The first one is a multiple-valued reduced (MVR) representation where we have three digits representing connectivity, oxidation and interfragment stereochemistry, and every possible connectivity, oxidation and stereochemistry pattern is codified by a different value. These digits have 28, 32 and 3 possible values, respectively. The second one is a multiple-valued site-based (MVS) representation where connectivity and oxidation are codified by 7 and 3 digits, respectively. Every single digit signifies connectivity or oxidation in equivalent positions of the two fragments, with possible values of ‘0’ to ‘3’ (e.g. a ‘0’ in connectivity site 2 means that there is no bond at site 2 of either fragment; a ‘1’ means a bond in fragment 1, a ‘2’ means a bond in fragment 2 and a ‘3’ means a bond in both fragments). This scheme is general and can be extended to larger oligomers without changing the number of digits (only the possible values of the digits increase).
Different NNs have been trained using the five representations as input and Grel and Eexc as endpoints. We have trained independent networks for Grel and Eexc. The data set is divided into 60% training, 20% validation and 20% test sets. The details are provided in ESI, Section ESI3,† together with the necessary data set and Python code18 to reproduce the results. The initial tests with a small number of 200 training epochs were carried out to compare the QBB and MVR approaches, using one or two hidden layers with up to 16 nodes per layer. With QBB, the test errors for Grel and Eexc are approximately 12 kcal mol−1 and 0.36–0.40 eV, respectively. With MVR, the errors are significantly higher, 24–34 kcal mol−1 and 0.39–0.55 eV. Therefore, the MVR representation was discarded, and further tests to improve the training conditions were carried out with QBB. The final conditions are two hidden layers of 7 nodes each, 5000 training epochs and a learning rate of 1 × 10−2. For Eexc, we use an additional transfer learning step (500 initial epochs with a single hidden layer). The details on the initial tests and prediction runs (see below) are given in ESI, Sections ESI4.1 and ESI4.2.†
The prediction capacity of the NN was tested with the QB representations and the MVS one, using the conditions described above. Ten independent runs were carried out for each representation, and in Table 1, we present the average training and prediction errors together with the standard deviation (SD) and the maximal absolute error (MAE). For Grel, the QB representations give training and prediction errors of 6–7 and approximately 9 kcal mol−1, i.e. approximately 5% of the total range. The MAE is 22–27 kcal mol−1. For Eexc, the training and prediction errors are 0.2 and 0.4 eV, i.e. 8–9% relative prediction error. With MVS, the prediction errors are slightly higher than those of the QB models, 6% for Grel and 9% for Eexc, and the MAE, too. The training curves (see the ESI†) show oscillations in the training and validation loss evolution that are probably due to the relatively small size of the training set, which leads to overtraining. Improving the model's performance may be achieved, for instance, with regularization techniques. Such fine-tuning is beyond the scope of the present proof-of-principle work, but it will certainly be necessary to apply the models in real predictions for larger oligomers.
Endpoint | Input | Training errora | Prediction errorb | MAEc |
---|---|---|---|---|
a Average ± standard deviation (SD) of training loss for 10 runs. b Average ± SD of validation and test losses for 10 runs, see Tables S5 and S6 (ESI) for details. c Maximum absolute error. | ||||
G rel | QBB | 5.8 ± 0.7 (3%) | 9.1 ± 0.9 (5%) | 25.0 (14%) |
QBF | 6.2 ± 0.7 (3%) | 8.5 ± 0.8 (5%) | 21.6 (12%) | |
QBS | 6.6 ± 0.8 (4%) | 9.1 ± 1.6 (5%) | 27.2 (15%) | |
MVS | 8.7 ± 1.0 (5%) | 11.3 ± 2.1 (6%) | 29.2 (16%) | |
E exc | QBB | 0.25 ± 0.05 (6%) | 0.38 ± 0.02 (9%) | 1.05 (25%) |
QBF | 0.24 ± 0.03 (6%) | 0.33 ± 0.03 (8%) | 0.83 (20%) | |
QBS | 0.24 ± 0.02 (6%) | 0.36 ± 0.03 (8%) | 1.04 (24%) | |
MVS | 0.30 ± 0.03 (7%) | 0.37 ± 0.02 (9%) | 1.29 (30%) |
Scatter plots of DFT against the data predicted with QBF and MVS are provided in Fig. 4. For Grel (Fig. 4a and c), the predicted data are evenly distributed around the blue identity line which marks an ideal prediction. MVS has a slightly broader spread than QBF. The molecule with the lowest energy (DFT Grel = 0 kcal mol−1) is highlighted with an arrow. It is the compound with the highest MAE with MVS and QBF because the Grel low energy region is underrepresented in the data set and is difficult to predict, see Fig. S4a.† For Eexc (Fig. 4b and d), the predictions are satisfactory for the group of compounds with Eexc 3.5–4.7 eV (upper right corner of Fig. 4b and d), but for the compounds with Eexc < 3 eV, the prediction is worse. The DFT values cover a range of approximately 2.3 eV (0.5–2.8 eV), but the predictions cover a narrower range (0.9–2.2 eV), and the MAE increases to almost 1 eV. This is consistent with the view that excited-state properties are more difficult to predict than ground-state ones.19
Overall, the NNs have good predictive capacity, with errors smaller than 10%. The performance is good because the structural set is rather homogeneous. Therefore, our fingerprints provide complete, unique molecular representations. This is different from the use of key-based fingerprints in drug design, for example, where they codify the presence of substructures, and different compounds can have the same fingerprint.15 The data set also has underlying trends that correlate the endpoints with the structure, e.g. cyclic and oxidized structures have lower Grel, and fully reduced compounds have higher Eexc. Our representations capture these structural features in a straightforward way, which contributes to their predictive power. Importantly, the fingerprint-type representations can be easily extended to larger structures, either by increasing the number of digits or, following the MVS scheme, increasing the possible values of every digit. The latter option is attractive because it provides the basis for a general representation covering oligomers of different sizes, and in future work, we will assess whether predictions for larger oligomers need new representations with more digits, or the general representation is preferable.
We also plan to extend the work to larger oligomers, where the number of compounds increases exponentially, and larger training sets will become available. In that case, NNs can be used for high-throughput screening of desired oligomer properties. Other possible applications are the use of NNs to correct semiempirical or density functional tight binding methods and develop a fast, eumelanin-specific hybrid quantum-mechanical machine-learning method similar to AIQM1;20,21 or using the fingerprints for clustering-based analyses that may be helpful to prune the vast chemical space of oligomers or generate structure–property relationships with interpretative power. In the long term, these applications shall enable the generation of reliable theoretical melanin models.
Footnote |
† Electronic supplementary information (ESI) available: Data set and neural network training details; the data set and code. See https://doi.org/10.1039/d2sc02461f |
This journal is © The Royal Society of Chemistry 2022 |