Georgia Melagraki* and
Antreas Afantitis*
Novamechanics Ltd, Nicosia, Cyprus. E-mail: melagraki@novamechanics.com; afantitis@novamechanics.com
First published on 23rd September 2014
Engineered nanoparticles (ENPs) are being used in a great variety of applications at an ever-increasing pace. Evaluating the biological effects of ENPs is of utmost importance, and both experimental and, more recently, computational methods have been suggested for this purpose. In an effort to computationally explore available datasets in a way that leads to ready-to-use applications, we have developed and validated a QNAR model for the prediction of the cellular uptake of nanoparticles in pancreatic cancer cells. Our in silico workflow was made available online through the Enalos InSilicoNano platform (http://enalos.insilicotox.com/QNAR_PaCa2/), a web service based solely on open source and freely available software, developed to make our model available to interested users wishing to generate evidence on potential biological effects within a decision making framework. This web service will facilitate computer aided nanoparticle design, as it can serve as a source of activity prediction for novel nano-structures. To demonstrate the usefulness of the web service we have exploited the whole PubChem database within a virtual screening framework and then used the Enalos InSilicoNano platform to identify novel potent nanoparticles from a prioritized list of compounds.
The evaluation of the biological activity and toxicity of NPs by in vitro and in vivo studies is costly and time consuming, and therefore alternative techniques that are fast, inexpensive and reduce animal testing are required.11–17 To date a great number of Quantitative Structure Activity Relationship (QSAR) models have been proposed in the literature. These models usually cover the biological profile of small organic molecules and have proven accurate in predicting the biological effect for a wide range of molecular scaffolds. This is not the case for NPs, which have recently emerged as important chemical structures with a wide range of significant properties and applications in different areas of interest. Although ‘classic’ QSAR models owe a great proportion of their success to the presence of organized databases, no such databases are available for NPs. Experimental data are scarce and produced by different groups of scientists following different protocols, and it is often difficult to select and combine the available information from different sources. On top of that, the structural characteristics of NPs cannot be encoded by the “conventional” widely used 2D and 3D molecular descriptors. NPs include organic as well as inorganic elements, sometimes with unknown composition and highly complex structures, that demand new approaches for developing molecular descriptors. These hurdles have already been recognized, and international efforts are now being organized towards the development of large datasets for NPs and the computational exploration of these results.
The potential of computational methods for advancing risk assessment of NPs is commonly accepted, and a few computational attempts to predict the toxicity of NPs have been reported in the literature in the last few years.18–24 As mentioned, although “classic” Quantitative Structure Activity Relationship (QSAR) models have long been proposed in the literature to assess different properties of compounds, Quantitative Nanostructure Activity Relationship (QNAR) models have not yet been extensively studied and only limited examples have been published.1,5,7,9 Many factors have contributed to this, including major hurdles such as the lack of organized datasets and inadequate descriptors for NPs. On top of that, the computational studies of NP activity and the resulting QNAR models are, as a rule, not made directly available to the community to be further used as tools for the risk assessment of novel NPs. Their utility is thus quite limited, whereas an online version of a model could spread the knowledge gained and generate more advancement in the field.
One of the few organized datasets on NPs that has been presented in literature includes the cellular uptake of 109 NPs in pancreatic cancer cells (PaCa2). Each NP within this dataset includes the same metal core (iron oxide/NH2 cores) but different surface modifiers which are organic small molecules conjugated to the NP surface.25 Different computational approaches have been proposed in literature for the exploitation of this dataset with interesting results in model development. Recent models presented in the literature are briefly discussed below.
In 2010 Fourches et al.26 presented a QNAR model based on MOE descriptors calculated for the organic molecules conjugated to the NP surface and k-nearest neighbors (kNN) methodology. The proposed model was proven robust and accurate as indicated by external predictions, cross validation and Y randomization. Winkler and coworkers27,28 also studied this dataset and generated quantitative, predictive and informative models of cellular uptake using a pool of molecular descriptors. In a recent publication, Y. T. Chau and C. W. Yap29 used four different modeling methods, namely Naive Bayes, logistic regression, k nearest neighbor and support vector machine, to develop candidate models. A consensus model was developed using the top 5 candidate models and validated by repeating the entire model development process five times using different combinations of training and validation sets. The final consensus model had a sensitivity of 86.7 to 98.2% and a specificity of 67.3 to 76.6%. In a different publication Toropov et al.30 used CORAL software to build a QSAR model for the prediction of cellular uptake of this dataset. The software gave satisfactory and stable predictions of the cellular uptake of NPs in PaCa2 cancer cells for five random splits. Another attempt was made by Ghorbanzadeh et al.31 who presented an artificial neural network that was built based on descriptors calculated with Hyperchem program and Dragon. The results revealed the accuracy and reliability of the proposed model and moreover a sensitivity analysis indicated that the number of hydrogen-bond donor sites in the organic coating of a NP is the predominant factor responsible for cellular uptake. Moreover, Liu et al.32 proposed a robust Relevance Vector Machine (RVM) model built with nine descriptors, which demonstrated prediction accuracy as quantified by a 5-fold cross-validated squared correlation coefficient. 
Ensemble learning based QNAR models for predicting the biological effects of this dataset were also constructed by Singh et al.33 based on simple structural descriptors, and various statistical parameters suggested robustness of the model. Finally, a recent attempt at modeling this dataset was reported by Kar et al.,34 where a statistically significant regression-based QNAR model was developed using a PLS method and a small number of interpretable descriptors.
In this work we present a fully validated and predictive QNAR model that was developed based on Mold2 descriptors and the kNN algorithm. Our model was made publicly available through Enalos InSilicoNano platform (http://enalos.insilicotox.com/QNAR_PaCa2/), a web service developed with the aim to facilitate NPs design and evaluation. The user can draw a new structure, enter a SMILES notation or upload many structures in an sdf file. By the click of a button a prediction is made available together with a value that indicates if the structure can be tolerated by the model in terms of its domain of applicability. We have used our web service in a virtual screening framework mining PubChem database. We have successfully retrieved several potent inhibitors with the aim to prioritize compounds for screening. This online tool could be a useful aid for the decision making of both research groups and regulatory bodies interested in NPs' design and screening.
Our overall strategy is targeting the development of a validated QNAR model and the release of this model to the wider community through a web service. For the model development a KNIME35 workflow was developed that executes the following procedures: (i) data preprocessing, (ii) descriptors calculation, (iii) variable selection and model development, (iv) model validation, (v) domain of applicability determination. In the proposed workflow all these computational steps were incorporated and this complete line of operations was made feasible with the invaluable help of our in house made Enalos KNIME nodes, namely Enalos Mold2 node, Enalos Model Acceptability Criteria node and Enalos Domain – Similarity node.36 These nodes have been developed by Novamechanics Ltd and are publicly available through the KNIME Community and the company's website.37
The CfsSubset variable selection with BestFirst evaluator method was then applied on the training data to select the most significant descriptors.40,41 Among the available descriptors, nine have emerged as the most critical in capturing the significant structural characteristics that affect the biological profile of the studied NPs as proposed by the variable selection algorithm. These descriptors include:
Geary topological structure autocorrelation length-7 weighted by atomic van der Waals volumes (D461), Geary topological structure autocorrelation length-5 weighted by atomic Sanderson electronegativities (D467), number of total quaternary C-sp3 (D599), number of group secondary amines (aliphatic) (D649), number of group donor atoms for H-bonds (with N and O) (D712), number of group CH3R and CH4 (D714), number of group phenol or enol or carboxyl OH (D753), number of group Al2–NH (D758) and hydrophilic factor index (D775). Their physical meaning is briefly described below.
Descriptors D461 and D467 encode information as described by Geary topological structure autocorrelation length-7 weighted by atomic van der Waals volumes and length-5 weighted by atomic Sanderson electronegativities, respectively. The Geary index is a general index of spatial autocorrelation and is a distance-type function varying from zero to infinity. In each descriptor the index is weighted either by atomic van der Waals volumes or by atomic Sanderson electronegativities.42 The hydrophilic factor index (D775) accounts for the hydrophilicity of each of the structures described. All other descriptors count the number of different important features present in the structure, such as total quaternary C-sp3 (D599), secondary amines (D649), donor atoms for H-bonds (D712), CH3R and CH4 groups (D714), phenol or enol or carboxyl OH (D753) and Al2–NH (D758).
The proposed KNIME workflow gave us the opportunity to test the performance of various algorithms included in the WEKA suite of programs and select the combination that best describes our data. The kNN algorithm was selected to describe the correlation between the selected descriptors and the cellular uptake in PaCa2, as it outperformed the other algorithms that were tested. The kNN methodology was applied on our training data with the number of neighbors k optimized to two.43 Euclidean distance was used with all nine descriptors, and the contributions of neighbors were weighted by the inverse of distance.
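The prediction rule described above can be sketched in a few lines. This is a minimal illustration, not the WEKA implementation used in the workflow: the function name `knn_predict`, the toy data and the handling of exact matches (zero distance) are assumptions, and any descriptor scaling is left out because the text does not specify it.

```python
import math


def knn_predict(train_X, train_y, query, k=2):
    """Inverse-distance-weighted kNN regression with Euclidean distance.

    train_X: list of descriptor vectors, train_y: list of measured values,
    query: descriptor vector of the compound to predict.
    """
    # Pair every training compound with its distance to the query,
    # then keep the k closest ones.
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]

    # Assumption: if a neighbour coincides with the query (distance 0),
    # return the mean of the coinciding values to avoid dividing by zero.
    exact = [y for d, y in nearest if d == 0.0]
    if exact:
        return sum(exact) / len(exact)

    weights = [1.0 / d for d, _ in nearest]
    weighted_sum = sum(w * y for w, (_, y) in zip(weights, nearest))
    return weighted_sum / sum(weights)


# Toy data (hypothetical, two descriptors instead of nine).
X = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
y = [1.0, 3.0, 100.0]
prediction = knn_predict(X, y, (0.25, 0.0), k=2)
```

With inverse-distance weighting, a query that coincides with a training compound reproduces that compound's measured value, which is why a weighted kNN model is normally judged on held-out compounds rather than on its fit to the training set.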
R2 is the coefficient of determination between experimental values and model prediction on the test set (R2pred). Mathematical calculations of R2o, R′2o, k, and k′ are based on regression of the observed activities against the predicted activities and vice versa using the equations described in Materials and methods section.
The model was also quite stable to the inclusion–exclusion of compounds measured by the ten-fold cross validation procedure. The R2L10O was calculated equal to 0.74. In addition the Y-randomization test was used as a method for testing the robustness and statistical significance of the model. Since low values of the correlation coefficient were measured we can eliminate the possibility of chance correlation.
The values of all the above statistical tests illustrate the accuracy, significance and robustness of the proposed model.
It is important that the limitations of the model are also described via the domain of applicability. This gives an important indication to users, who can freely and creatively design novel molecules but will be warned about the reliability of the prediction when the structural characteristics cannot be tolerated by the model. After model validation, the domain of applicability of our model was therefore defined to ascertain whether a given prediction can be considered reliable.47–50 The applicability domain limit value was found equal to 2.153 based on the equation provided in the Materials and methods section. All compounds in the test set had values in the range of 0.019–1.06 except for one, which falls slightly outside the limit with a value of 2.29. The predictions for all compounds that fell inside the domain of applicability of the model can be considered reliable.
Our proposed model requires only structural information from the small organic molecules involved and was proven accurate and reliable for given applicability limits. Thus our model could be used as a useful aid to the costly and time consuming experiments for determining cellular uptake of NPs and could further be used to screen existing databases or virtual chemical structures to identify NPs with desired properties. In this effort, the applicability domain will play an important role as it will filter out chemical structures that could not be tolerated by the model.
To make the model predictions available to interested users, our proposed model was made publicly available online through the Enalos InSilicoNano platform.51 The Enalos InSilicoNano platform is a web service that can host several validated and predictive models for use in the NP design process. Being published through this platform, our validated model can be of help to the wider community of end users interested in NP design. The web service requires no special computational skills and can easily be used by different groups of scientists, such as chemists and biologists, or even by non-experts involved or interested in the biological evaluation of NPs.
Enalos InSilicoNano platform has a user friendly interface with minimum steps required and no authentication and authorization procedure. To initiate a prediction the user must first select the model of interest from the drop down menu provided. When the model “QNAR_PaCa2” is selected the prediction can be initiated when a structure or a batch of structures is uploaded. For that the web service provides three different options described as follows: (i) the user draws a chemical structure of interest using the drawing tool. The user can easily select from the different panels the atoms, bonds or substructures of interest and construct the molecule. What is important is that the user can also open, save and convert files with a variety of chemical formats (i.e. SMILES, IUPAC chemical Identifier, MDL MOL file) using the drop down menu of the online sketcher; (ii) the user enters the SMILES notation of a structure or several structures separated by newlines. Even if the SMILES notation is not initially known it is important that the chemical sketcher included gives the users the opportunity to design the chemical structure and then copy the structure as SMILES from the Edit drop down menu. This is very significant as it facilitates the generation of several structures since the user can make several modifications using the sketcher and copy all structures as SMILES so that a prediction for the whole set of produced structures is generated. The user can thus visualize the modifications and make multiple predictions at once; (iii) the user can select and import an SDF file (.sdf) with several structures.
When structures are uploaded in either way a prediction can be generated by clicking the submit button. The output is then presented in a different html page. The results include the predicted value for each structure entered and an indication of whether the prediction could be considered reliable based on the domain of applicability of the model. A screen shot of the web service and the results page are presented in the following schemes.
Our developed KNIME workflow integrated with Enalos InSilicoNano web service made the online prediction of the biological effects of NPs feasible. In the web service presented in Scheme 2 and 3, the user can design or enter a chemical structure and get the prediction. The workflow behind the interface calculates the descriptors and generates the output. It is important that the output will appear on screen within seconds. The user can experiment with different scaffolds and substituents and study the structural characteristics that are responsible to induce a certain effect. The user can take advantage of the proposed QNAR model and immediately scan the structures of interest for a preliminary in silico testing. In this way we overcome a main point of controversy for QSAR models in general, that their results are not available for sharing and implementation. As recently highlighted52 the advantages of making models available for use as software tools will increase in the future and this will enable the re-use of knowledge and will boost further developments. Enalos InSilicoNano platform uses a pipeline tool, KNIME, to address exactly this need of using and testing the models directly available on the web. With this platform we aim to address the need to reduce the amount of time spent by scientists in referencing disparate sources of data to aid decision making related to NPs design and bioactivity profile. Enalos InSilicoNano is launched as an efficient port where models can be developed and published directly on the web using a user friendly interface.
Within this proposed strategy Enalos InSilicoNano platform emerges as a key component for evaluating novel nano-structures that have not been experimentally evaluated or even synthesized. It is also important to highlight that our proposed methodology and tools can also be expanded and applied to polymer–nanoparticle composites that are now gaining increasing attention.
We have succeeded in generating a novel computational activity assessment platform for nanoparticles by integrating two open science platforms: KNIME, which provides a rich graphical workflow environment for the integration of diverse analytics, and Enalos InSilicoNano, a platform for hosting and publishing models directly on the web that allows researchers to perform virtual screening and/or design of novel nanoparticles. Two milestones have been reached within this work: the first is the development of a validated QNAR model and the second is the development of a web service that immediately gives the opportunity of exploiting the model's results. To demonstrate the usefulness of the model we have also proposed a virtual screening framework that could be used to identify novel potent structures.
Compounds whose ID carries the suffix "a" (e.g. 6a) belong to the test set.

ID | SMILES | Observed PaCa2 cellular uptake (log10 [NP]/cell) | Predicted PaCa2 cellular uptake (log10 [NP]/cell) |
---|---|---|---|
1 | FC(F)(F)C(O)OC(O)C(F)(F)F | 4.17 | 4.17 |
2 | FC(F)(Cl)C(O)OC(O)C(F)(F)Cl | 3.95 | 3.95 |
3 | FC(F)(F)C(F)(F)C(O)OC(O)C(F)(F)C(F)(F)F | 4.08 | 4.08 |
4 | CC1(C)CC(O)OC1O | 4.11 | 3.80 |
5 | OC1OC(O)CC1 | 3.98 | 4.11 |
6a | CC1CC(O)OC1O | 3.58 | 3.65 |
7 | CC1C(C)C(O)OC1O | 3.48 | 3.80 |
8 | CCCCCC(O)OC(O)CCCCC | 3.65 | 3.65 |
9 | CC1CC(O)OC1O | 3.64 | 3.65 |
10 | OC1OC(O)c2cc(ccc12)C(O)c1ccc2C(O)OC(O)c2c1 | 3.51 | 3.53 |
11 | OC1OC(O)c2cc(ccc12)N(O)O | 3.27 | 3.29 |
12a | Brc1ccc2C(O)OC(O)c3cccc1c23 | 3.63 | 3.52 |
13 | OC1OC(O)c2ccc3C(O)OC(O)c4ccc1c2c34 | 3.67 | 3.68 |
14 | Fc1c(F)c(F)c2C(O)OC(O)c2c1F | 3.83 | 3.84 |
15 | OC1OC(O)c2cc(cc3cccc1c23)N(O)O | 4.11 | 4.09 |
16 | Oc1cccc2C(O)OC(O)c12 | 3.97 | 3.97 |
17 | OC1OC(O)C2C3CCC(CC3)C12 | 3.9 | 3.87 |
18 | Clc1ccc2NC(O)OC(O)c2c1 | 4.18 | 4.17 |
19 | OC1OS(O)(O)c2ccccc12 | 3.88 | 3.93 |
20 | ClC1C(Cl)C(O)OC1O | 3.84 | 3.87 |
21a | CC(O)SC1CC(O)OC1O | 3.59 | 3.85 |
22 | Clc1cc2C(O)OC(O)c2cc1Cl | 4.12 | 4.07 |
23 | OC1OC(O)C2C3OC(CC3)C12 | 3.82 | 3.80 |
24 | OC1OC(O)C2C3CCC(C12)C1C3C(O)OC1O | 3.63 | 3.65 |
25 | OC1OC(O)C2CCCCC12 | 3.89 | 3.86 |
26 | OC1OC(O)c2ccccc2-c2ccccc12 | 3.77 | 3.77 |
27 | OC1OC(O)c2ccc(c3cccc1c23)N(O)O | 3.93 | 3.92 |
28 | OC1OC(O)C2C1C1C2C(O)OC1O | 3.77 | 3.86 |
29 | CCCCCCCCCCCC(O)OC(O)CCCCCCCCCCC | 3.82 | 3.82 |
30a | OC(O)c1ccc2C(O)OC(O)c2c1 | 3.55 | 3.62 |
31 | Cc1ccc2C(O)OC(O)c2c1 | 3.98 | 3.97 |
32 | OC1OC(O)c2c1cccc2N(O)O | 3.5 | 3.54 |
33 | OC1Cc2ccccc2C(O)O1 | 3.78 | 3.81 |
34 | OC1CCCC(O)O1 | 4.07 | 4.06 |
35a | OC1CN(CCN2CC(O)OC(O)C2)CC(O)O1 | 3.93 | 3.76 |
36 | OC1Nc2ccccc2C(O)O1 | 4.44 | 4.43 |
37 | CN1C(O)OC(O)c2ccccc12 | 3.36 | 3.38 |
38a | CC1CC(O)OC(O)C1 | 3.91 | 3.68 |
39 | OC1OC(O)C2C1CCCC2 | 3.73 | 3.74 |
40 | CC(O)OC1C(OC(C)O)C(O)OC1O | 3.91 | 3.91 |
41a | Brc1c(Br)c(Br)c2C(O)OC(O)c2c1Br | 3.8 | 3.66 |
42a | OC1OC(O)C2CCCCC12 | 3.93 | 3.88 |
43 | OC1OC(O)C2C1CCC2 | 3.69 | 3.71 |
44 | ICC(O)OC(O)CI | 3.42 | 3.42 |
45 | ClCC(O)OC(O)CCl | 3.63 | 3.62 |
46 | ClC1C(Cl)C2(Cl)C3C(C(O)OC3O)C1(Cl)C2(Cl)Cl | 3.47 | 3.49 |
47a | CCCCCCCCCCCCCCCC(O)OC(O)CCCCCCCCCCCCCCC | 3.55 | 3.78 |
48 | Nc1ccc2C(O)OC(O)c3cccc1c23 | 3.64 | 3.63 |
49a | CCCCCCCCCC(O)OC(O)CCCCCCCCC | 4.03 | 3.78 |
50a | OC1CC2(CCCC2)CC(O)O1 | 4.06 | 3.88 |
51 | OC1OC(O)C2C3CCC(C3)C12 | 3.94 | 3.91 |
52 | OC1OC(O)c2cccc3cccc1c23 | 3.96 | 3.95 |
53 | OC1CCC(C(O)O1)c1ccccc1 | 4.02 | 4.00 |
54a | Clc1c(Cl)c(Cl)c2C(O)OC(O)c2c1Cl | 3.83 | 3.66 |
55 | Clc1ccc(Cl)c2C(O)OC(O)c12 | 3.9 | 3.88 |
56a | CC1(C)CCC(O)OC1O | 3.94 | 3.80 |
57 | CCCCCN | 3.78 | 3.78 |
58 | CC(C)CC(C)N | 3.85 | 3.85 |
59 | NC1C(O)CC(CO)C(O)C1O | 3.36 | 3.36 |
60a | CCCCCCN | 3.75 | 3.77 |
61 | CC(C)(C)N | 3.86 | 3.87 |
62 | CC(C)CN | 3.72 | 3.74 |
63 | CC(C)(C)CN | 3.75 | 3.91 |
64 | CC(C)CCN | 3.83 | 3.82 |
65 | CCC(N)CC | 3.81 | 3.82 |
66 | CCC(C)(C)N | 4.07 | 3.91 |
67 | NCCN | 3.46 | 3.46 |
68 | CCCCCCCCCCCCCCCN | 4.06 | 4.03 |
69 | NCCCN | 3.49 | 3.49 |
70 | NCCCCN | 3.48 | 3.48 |
71 | NCCCCCCN | 3.62 | 3.62 |
72 | CCCCC(CC)CN | 3.95 | 3.94 |
73 | CCCCCCCCCCCCCCCCN | 3.97 | 4.00 |
74 | CCCCCC(C)N | 3.63 | 3.64 |
75a | CCCCCCCCCCCCCCN | 4.27 | 4.02 |
76 | NCCNCCN | 3.77 | 3.77 |
77 | NCC12CC3CC(CC(C3)C1)C2 | 2.84 | 2.87 |
78 | NCCc1ccc(O)c(O)c1 | 2.53 | 2.53 |
79 | NCCc1ccc(O)cc1 | 2.77 | 2.77 |
80a | NCCCNCCCCNCCN | 2.41 | 2.37 |
81 | NCCNCCCNCCN | 2.23 | 2.24 |
82 | NCCNCCNCCNCCNCCN | 2.54 | 2.54 |
83 | NC12CC3CC(CC1C3)C2 | 3.12 | 3.14 |
84 | NC1C2CC3CC(C2)CC1C3 | 3.18 | 3.15 |
85 | NCC(O)O | 2.57 | 2.58 |
86 | COC(O)C(N)Cc1ccccc1 | 3.39 | 3.39 |
87 | NC(CO)C(O)O | 3.36 | 3.35 |
88 | CC(O)C(N)C(O)O | 3.21 | 3.21 |
89 | NC(Cc1c[nH]c2ccccc12)C(O)O | 3.19 | 3.19 |
90 | NC(Cc1ccc(O)cc1)C(O)O | 3.07 | 3.07 |
91a | CC(C)C(N)C(O)O | 3.27 | 3.04 |
92 | NCCCCC(N)C(O)O | 3.25 | 3.25 |
93 | NC(C(O)O)c1ccc(Cl)cc1 | 3.06 | 3.07 |
94 | CC(N)C(O)O | 2.9 | 2.90 |
95a | NC(CCCNC(N)N)C(O)O | 3.15 | 3.28 |
96a | NC(CC(O)O)C(O)O | 3.29 | 3.35 |
97 | NC(CCC(N)O)C(O)O | 3.32 | 3.32 |
98 | NC(CCC(O)O)C(O)O | 3.4 | 3.40 |
99 | NC(Cc1c[nH]cn1)C(O)O | 3.38 | 3.38 |
100 | CSCCC(N)C(O)O | 3.23 | 3.23 |
101 | NC(Cc1ccccc1)C(O)O | 3.29 | 3.29 |
102 | OC1CCC(O)O1 | 4.24 | 4.11 |
103a | CC(O)OC(C)O | 4.05 | 3.80 |
104 | CC1CC(O)OC1O | 4.04 | 4.06 |
105 | OC1COCC(O)O1 | 3.99 | 3.96 |
106 | OC1OC(O)c2ccccc12 | 3.9 | 3.92 |
107 | OC(O)CC1CC(O)OC1O | 4.03 | 4.03 |
108 | Fc1ccc(F)c2C(O)OC(O)c12 | 3.91 | 3.87 |
109 | OC(O)CN(CCN1CC(O)OC(O)C1)CCN1CC(O)OC(O)C1 | 4.1 | 4.09 |
Within our KNIME workflow we have included Enalos Mold2 KNIME node36 that is able to calculate a number of 777 descriptors that account for the topological, geometric and structural characteristics of the small molecules. From this original pool of descriptors a number was removed as some of the descriptors do not have any discrimination power (no variation) and for this a node called ‘Low Variance Filter’ was applied.39
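The filtering step above can be illustrated with a short sketch. It is not the KNIME ‘Low Variance Filter’ node itself: the function name `low_variance_filter`, the default threshold of zero (dropping only columns with no variation) and the toy descriptor table are assumptions made for illustration.

```python
from statistics import pvariance


def low_variance_filter(rows, names, threshold=0.0):
    """Drop descriptor columns whose variance does not exceed `threshold`.

    rows: list of descriptor vectors (one per compound),
    names: descriptor names in column order.
    Returns the surviving names and the reduced rows.
    """
    columns = list(zip(*rows))  # transpose: one tuple per descriptor
    keep = [j for j, col in enumerate(columns) if pvariance(col) > threshold]
    kept_names = [names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_names, kept_rows


# Hypothetical table: the second descriptor is constant and is removed.
table = [[1, 0, 5], [2, 0, 5], [3, 0, 6]]
kept_names, kept_rows = low_variance_filter(table, ["Da", "Db", "Dc"])
```

A constant column carries no information that could separate one compound from another, so removing it before variable selection changes nothing in the model and shrinks the search space.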
Subsequently a machine learning method that could best model the available dataset was applied. We have thus incorporated the k-nearest neighbors (kNN) methodology in our KNIME workflow.43 kNN belongs to the family of instance-based (or lazy) learning methods that make predictions based on the closest training examples in the feature space. In classification, an object is assigned by majority vote to the class most common amongst its k nearest neighbors (k being a positive integer, typically small); for a continuous endpoint such as the cellular uptake modeled here, the prediction is instead a weighted average of the values of the k nearest neighbors. For our dataset we have used an optimal k value and Euclidean distance with all descriptors, with contributions of neighbors weighted by the inverse of distance.
For external validation the partitioning KNIME node was applied and the dataset was separated into training and validation set leaving a number of 20 compounds for the external validation of the model. All compounds included in the test set were not involved by any means in the training procedure.
To evaluate the models performance the following statistical criteria were used: the coefficient of determination between experimental values and model predictions (R2), validation through an external test set, leave-many-out cross validation procedure and Quality of Fit and Predictive Ability of a continuous QSAR Model according to Tropsha's tests.45,46 The latter was made feasible by including Enalos Model Acceptability Criteria node in our workflow.
In particular the formulas for calculating Tropsha's tests45 are given below:
(1) |
(2) |
(3) |
In the above equations ntest is the number of compounds that constitute the validation data set, ȳtr is the averaged value of the dependent variable for the training set, yi and ỹi, i = 1, …, ntest, are the measured values and the QSAR model predictions of the dependent variable over the available validation set, and the mean predicted value is the average over all ỹi, i = 1, …, ntest.
Tropsha et al.45 considered a QSAR model predictive, if the following conditions are satisfied:
R2cvext > 0.5 | (4) |
R2pred > 0.6 | (5) |
(R2 − R2o)/R2 < 0.1 or (R2 − R′2o)/R2 < 0.1 | (6) |
0.85 ≤ k ≤ 1.15 | (7) |
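As an illustration of how such acceptability criteria can be evaluated, the sketch below computes the statistics from observed and predicted test-set values. It is not the Enalos Model Acceptability Criteria node: the function names are hypothetical, and the through-origin quantities follow one common formulation (regression of observed on predicted values and vice versa). Several variants of R2o exist in the literature, so the exact expressions used by the authors may differ.

```python
def tropsha_statistics(y_obs, y_pred, y_train_mean):
    """Return R2, R2cvext, k, k', R2o and R'2o for an external test set."""
    n = len(y_obs)
    mean_obs = sum(y_obs) / n
    mean_pred = sum(y_pred) / n

    s_oo = sum((o - mean_obs) ** 2 for o in y_obs)
    s_pp = sum((p - mean_pred) ** 2 for p in y_pred)
    s_op = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(y_obs, y_pred))
    r2 = s_op ** 2 / (s_oo * s_pp)  # squared Pearson correlation

    press = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    r2_cv_ext = 1 - press / sum((o - y_train_mean) ** 2 for o in y_obs)

    cross = sum(o * p for o, p in zip(y_obs, y_pred))
    k = cross / sum(p * p for p in y_pred)        # observed = k * predicted
    k_prime = cross / sum(o * o for o in y_obs)   # predicted = k' * observed

    # Assumed variant: coefficients of determination of the two
    # regressions through the origin.
    r2_o = 1 - sum((o - k * p) ** 2 for o, p in zip(y_obs, y_pred)) / s_oo
    r2_o_prime = 1 - sum(
        (p - k_prime * o) ** 2 for o, p in zip(y_obs, y_pred)
    ) / s_pp
    return {"r2": r2, "r2_cv_ext": r2_cv_ext, "k": k, "k_prime": k_prime,
            "r2_o": r2_o, "r2_o_prime": r2_o_prime}


def is_predictive(stats):
    """Apply the four conditions listed above to the computed statistics."""
    r2 = stats["r2"]
    origin_ok = ((r2 - stats["r2_o"]) / r2 < 0.1
                 or (r2 - stats["r2_o_prime"]) / r2 < 0.1)
    return (stats["r2_cv_ext"] > 0.5 and r2 > 0.6 and origin_ok
            and 0.85 <= stats["k"] <= 1.15)


# Hypothetical observed and predicted values for a four-compound test set.
stats = tropsha_statistics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.9], 2.5)
accepted = is_predictive(stats)
```

Checking the slope k alongside R2 guards against models whose predictions correlate well with the measurements but are systematically scaled away from them.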
APD = 〈d〉 + Zσ | (8) |
Calculation of 〈d〉 and σ was performed as follows: first, the average of Euclidean distances between all pairs of training compounds was calculated. Next, the set of distances that were lower than the average was formulated. 〈d〉 and σ were finally calculated as the average and standard deviation of all distances included in this set. Z was an empirical cutoff value and for this work, it was chosen equal to 0.5.47–50 Enalos Domain – Similarity node that executes the aforementioned procedure is included in our workflow and was used to assess domain of applicability of the proposed model.47–50
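The threshold calculation described above can be sketched as follows. This is not the Enalos Domain – Similarity node: the function names are hypothetical, the use of the population standard deviation is an assumption, and so is the rule in `in_domain`, which compares the distance from a query compound to its nearest training compound against the threshold.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def apd_threshold(train_X, z=0.5):
    """APD = <d> + Z*sigma over the below-average pairwise distances."""
    # Euclidean distances between all pairs of training compounds.
    dists = [math.dist(a, b) for a, b in combinations(train_X, 2)]
    overall_mean = mean(dists)
    # Keep only the distances lower than the overall average.
    below = [d for d in dists if d < overall_mean]
    return mean(below) + z * pstdev(below)


def in_domain(train_X, query, threshold):
    """Assumed rule: nearest training compound must lie within the APD."""
    return min(math.dist(x, query) for x in train_X) <= threshold


# Hypothetical two-descriptor training set with one distant compound.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
limit = apd_threshold(train, z=0.5)
inside = in_domain(train, (0.5, 0.5), limit)
outside = in_domain(train, (5.0, 5.0), limit)
```

Restricting 〈d〉 and σ to the below-average distances keeps a few remote training compounds from inflating the threshold.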
Through the Enalos InSilicoNano platform, toxicity, biological activity and property predictions can be obtained for chemical structures provided by the user. Structures can be designed, entered as SMILES or imported in SDF format. The QNAR model described in this work can be selected from the pull down menu of the available workflows already developed and provided by the Enalos InSilicoNano platform.
PubChem database was used to retrieve potent compounds in the described virtual screening framework using the most active compound in our original database, compound 36, which has a PaCa2 cellular uptake value equal to 4.44 expressed as decadic logarithm of the concentration (pM) of NP per cell (log10[NP]/cell). All compounds included in the PubChem database were compared to compound 36 in a similarity context on the basis of 42 integer value descriptors of molecular structure, called Molecular Quantum Numbers (MQNs). MQNs count elementary features that matter most for the properties of organic molecules: atoms, bonds, polar groups, and topological features.56 The MQN-space organises molecules by their global structural features, but also by their similarity in biological activity. Distances in MQN-space can be used to search for analogues of known drugs. The MQNs form a scalar fingerprint which can be used to measure the similarity between pairs of molecules and enable ligand-based virtual screening.
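A similarity search over integer fingerprints of this kind can be sketched as a nearest-neighbour ranking. The text does not name the distance metric, so the city-block (Manhattan) distance used here is an assumption. The function names are hypothetical, and the vectors are shortened made-up examples rather than real 42-value MQN fingerprints.

```python
def city_block(a, b):
    """City-block (Manhattan) distance between two integer fingerprints."""
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_by_similarity(query, library, top_n=3):
    """Return the top_n library entries closest to the query fingerprint.

    library: dict mapping a compound name to its fingerprint.
    Result: list of (name, distance) pairs, closest first.
    """
    scored = sorted(
        (city_block(query, vec), name) for name, vec in library.items()
    )
    return [(name, dist) for dist, name in scored[:top_n]]


# Hypothetical four-value fingerprints standing in for MQN vectors.
reference = (6, 1, 2, 0)
candidates = {
    "cand_A": (6, 1, 2, 0),
    "cand_B": (7, 1, 2, 1),
    "cand_C": (12, 0, 0, 3),
}
hits = rank_by_similarity(reference, candidates, top_n=2)
```

Because the fingerprint entries are simple counts, a small distance means the two molecules share nearly the same numbers of atoms, bonds, polar groups and topological features.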
This journal is © The Royal Society of Chemistry 2014 |