Alston Lo *a, Robert Pollice abc, AkshatKumar Nigam d, Andrew D. White e, Mario Krenn f and Alán Aspuru-Guzik abgh
aDepartment of Computer Science, University of Toronto, Canada. E-mail: alston.lo@mail.utoronto.ca; r.pollice@rug.nl; alan@aspuru.com
bChemical Physics Theory Group, Department of Chemistry, University of Toronto, Canada
cStratingh Institute for Chemistry, University of Groningen, The Netherlands
dDepartment of Computer Science, Stanford University, California, USA
eDepartment of Chemical Engineering, University of Rochester, USA
fMax Planck Institute for the Science of Light (MPL), Erlangen, Germany
gVector Institute for Artificial Intelligence, Toronto, Canada
hCanadian Institute for Advanced Research (CIFAR) Lebovic Fellow, Toronto, Canada
First published on 1st July 2023
String-based molecular representations play a crucial role in cheminformatics applications, and with the growing success of deep learning in chemistry, have been readily adopted into machine learning pipelines. However, traditional string-based representations such as SMILES are often prone to syntactic and semantic errors when produced by generative models. To address these problems, a novel representation, SELF-referencing embedded strings (SELFIES), was proposed that is inherently 100% robust, alongside an accompanying open-source implementation called selfies. Since then, we have generalized SELFIES to support a wider range of molecules and semantic constraints, and streamlined its underlying grammar. We have implemented this updated representation in subsequent versions of selfies, where we have also made major advances with respect to design, efficiency, and supported features. Hence, we present the current status of selfies (version 2.1.1) in this manuscript. Our library, selfies, is available at GitHub (https://github.com/aspuru-guzik-group/selfies).
However, in an ML setting, this grammar carries two intrinsic weaknesses. First, many strings constructed from SMILES symbols are syntactically invalid due to the rigidity of the SMILES grammar, i.e., the strings cannot be interpreted as molecular graphs.4,5 In particular, SMILES requires branch brackets and ring numbers to appear in matching pairs (e.g., C(CC and C1C are invalid), so a single misplaced or missing token can ruin the validity of a SMILES string. This is problematic because ML models that produce SMILES strings, especially generative models, are prone to these syntactic errors, rendering a significant fraction of their output meaningless. One strategy is to constrain the ML architecture to reduce the number of invalid structures, which has been demonstrated successfully in the literature.6–8 This approach, however, needs significant computational effort and cannot be transferred directly to other systems without model retraining, model architecture adjustments, or domain-specific design considerations. An alternative and more fundamental solution is to define representations that are inherently robust. A first step in this direction was taken by DeepSMILES,9 a string-based representation derived from SMILES that reworked some of its most syntactically susceptible rules. While DeepSMILES solves most of the syntactic errors, it does not address the second weakness of SMILES, namely, that even syntactically valid strings may not correspond to a physical molecule. Typically, this occurs when a string represents a molecular graph that exceeds normal chemical valences, in which case we call the string semantically invalid. For example, the SMILES string CO=C is semantically invalid because it erroneously specifies a trivalent oxygen atom, which is chemically unstable and reactive.
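To make the first weakness concrete, the following is a minimal Python sketch that checks only the two pairing rules mentioned above (branch brackets and ring numbers). It is not a SMILES parser: the function name `has_balanced_syntax` is hypothetical, two-digit `%nn` ring tags are ignored, and digits inside square brackets are assumed to be isotope, hydrogen or charge counts rather than ring tags.

```python
def has_balanced_syntax(smiles: str) -> bool:
    """Check two SMILES pairing rules only: parentheses and ring digits."""
    depth = 0            # current branch nesting depth
    open_rings = set()   # ring digits seen an odd number of times
    in_bracket = False   # inside [...], digits are not ring tags
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:            # a branch closed before it was opened
                return False
        elif ch.isdigit():
            # first occurrence opens the ring bond, second occurrence closes it
            open_rings.symmetric_difference_update({ch})
    return depth == 0 and not open_rings


print(has_balanced_syntax("C(CC"))   # unmatched branch bracket
print(has_balanced_syntax("C1C"))    # unmatched ring number
print(has_balanced_syntax("C1CC1"))  # both rules satisfied
```

A string that passes this check can still be semantically invalid, which is the second weakness discussed above.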
To eliminate both syntactic and semantic invalidities in string-based molecular representations on a fundamental level, an entirely new representation termed SELF-referencIng Embedded Strings (SELFIES) has been proposed by some of us.10 By construction, SELFIES is 100% robust to both syntactic and semantic errors. That is, any combination of SELFIES symbols specifies a molecular graph that obeys chemical valences. This is achieved through a small Chomsky type-2, context-free grammar11 that is augmented with self-referencing functions to handle the generation of branches and rings. Since its release, SELFIES has enabled or improved numerous applications, ranging from molecular design12–15 to interpretability16 to image-to-string and string-to-string translations,17,18 and has been extended to incorporate functional groups and other fragments.19 For an extensive summary of its applications and opportunities, we refer readers to the recent community paper on SELFIES.20
Herein, we introduce selfies 2.1.1, the latest version of the open-source Python implementation of SELFIES. In particular, we provide a detailed look into its history, developments, underlying algorithms, design, and performance. Together with the community, we have recently overviewed potential extensions and formulated 16 concrete future projects for SELFIES and other robust molecular string representations.20 We hope that this manuscript will also help in developing some of these extensions and ideas. Our software package selfies can be installed with “pip install selfies” and is available at GitHub (https://github.com/aspuru-guzik-group/selfies) under the Apache 2.0 license, along with comprehensive documentation and tutorials.
One major modification we made is that selfies now uses directed molecular graphs to internally represent SMILES and SELFIES strings. This has afforded selfies greater efficiency and flexibility, and enabled a number of additional extensions to be made. For example, we added support for aromatic molecules by kekulizing SMILES strings with aromatic symbols before they are translated into SELFIES. Furthermore, we handle species with partial charges, radicals, explicit hydrogens, non-standard isotopes, and stereochemical definitions in a fully syntactically and semantically robust way. Besides the standard constraints for the number of valences, users can now specify their own constraints and we provide built-in relaxed and stricter constraint presets that can be selected conveniently. Most recently, we introduced the ability to trace the connection between input and output tokens when translating between SELFIES and SMILES. Table 1 gives a brief changelog of the major releases of selfies and their associated advancements.
Version | Year(s) | Description |
---|---|---|
0.1.1 | (Jun) 2019 | • Initial release of selfies |
0.2.4 | (Oct) 2019 | • Release of selfies that implements the representation from Krenn et al.10 |
1.0.x | 2020–21 | • Expanded the support of selfies to a greater subset of SMILES strings, including strings with aromatic atoms, isotopes, charged species, and certain stereochemical specifications. To do so, the underlying grammar used by selfies was both streamlined and generalized |
 | | • Added support for the customization of the semantic constraints used by selfies |
 | | • Significantly improved the efficiency of translation between SELFIES and SMILES |
 | | • Added a variety of utility functions to make the handling of SELFIES strings convenient |
2.0.x | 2021 | • Updated the SELFIES alphabet to be more human-readable and standardized |
 | | • Improved handling of stereochemical specifications in SELFIES involving ring bonds |
2.1.x | 2022 | • Added support for explaining translations between SELFIES and SMILES through attributions |
While the ideas outlined in the initial publication10 that ensure the validity of the representation remain at the core of selfies, the manifold implementation improvements and extensions are the novelties that we detail in this paper. Hereafter, unless specified otherwise, we will use selfies to refer to selfies 2.1.1 in particular and SELFIES to refer to the representation that selfies 2.1.1 implements. We will provide a complete and formal description of the updated representation in Section 3 and describe the API of selfies in Section 4.
Within SELFIES, these chemical valences are encoded as a constraint function ν: 𝒜 → ℕ₀, where 𝒜 is a finite universe of the atom types of interest (e.g., 𝒜 = {C, N, O, F}) and ℕ₀ = {0, 1, 2, …}. The valences represented by ν dictate that an atom A must assume ν(type(A)) incident bonds in total. Note that if a SMILES string obeys the valences ν, each of its atoms A makes at most ν(type(A)) explicit bonds within the string. The inequality is possibly strict because SMILES automatically adds implicit hydrogens until chemical valences are satisfied. In practice, the mapping ν is rationally chosen to align with both physical considerations and established cheminformatics packages such as RDKit.21 For example, a plausible setting might map
ν(C) = 4, ν(N) = 3, ν(O) = 2, ν(F) = 1 | (1) |
We formulate chemical valences in this manner to emphasize that although SELFIES depends on ν, it is not fixed to any particular setting of ν. That is to say, SELFIES can enforce rule sets induced by any arbitrary mapping ν, even if they are not chemically meaningful. To highlight an absurd example, the uniform constraints ν(·) = 1000 can be used in principle, which corresponds to effectively having no semantic constraints at all. In this sense, SELFIES can be thought of as a general framework for an adjustable set of constraints ν. In the ensuing discussion, we will describe SELFIES under the assumption that some constraint function ν is fixed beforehand.
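As an illustration of how a constraint function ν can be applied, the sketch below stores ν as a plain dictionary and checks whether the explicit bonds of a small molecular graph stay within it. The graph encoding (a list of atom types plus `(i, j, order)` bond triples) and the function name `obeys_valences` are assumptions made for this example and are not part of selfies.

```python
from collections import Counter

# The example setting of eqn (1).
NU = {"C": 4, "N": 3, "O": 2, "F": 1}


def obeys_valences(atom_types, bonds, nu):
    """Return True if every atom makes at most nu[type] explicit bonds.

    atom_types: list of atom type labels, one per atom
    bonds:      list of (i, j, order) triples between atom indices
    """
    used = Counter()
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    return all(used[k] <= nu[t] for k, t in enumerate(atom_types))


# C-O=C: the oxygen carries a single and a double bond (3 > nu(O) = 2).
atoms = ["C", "O", "C"]
bonds = [(0, 1, 1), (1, 2, 2)]
print(obeys_valences(atoms, bonds, NU))

# The uniform constraints nu(.) = 1000 accept the same graph.
uniform = {t: 1000 for t in NU}
print(obeys_valences(atoms, bonds, uniform))
```

Swapping the dictionary changes which graphs are accepted without touching the checking logic, which mirrors the point that SELFIES is a framework parameterized by ν.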
[…][…][…]⋯[…][…] | (2) |
[βα] | (3) |
(4) |
collectively specify an atom type type(α) in a SMILES-like fashion (the atom's isotope number, atomic number, chirality, number of attached hydrogens, and charge, respectively, and sometimes optionally). Notably, each SELFIES atom symbol is semantically unique, i.e., different atom symbols are not interchangeable. This is not the case in SMILES due to shorthand abbreviations in how attached hydrogens and charge can be represented. For example, the SMILES atom symbol pairs ([Fe++], [Fe+2]) and ([CH], [CH1]) are interchangeable. To create a more standardized alphabet of symbols, we remove this redundancy in SELFIES.
[βBranchℓ] | (5) |
[βRingℓ], [β1β2Ringℓ] | (6) |
β1,β2 ∈ {−,/,\}, and not both β1 = β2 = − | (7) |
are SMILES-like bond symbols and ℓ ∈ {1, 2, 3}, similar to branch symbols. The second ring symbol type in eqn (6) is used to handle stereochemical specifications across ring bonds (see Section 3.5).
• The dot symbol, which can be used to express multiple disconnected fragments in a single SELFIES string, similar to its role in SMILES. The dot symbol is interpreted by treating it as a delimiter and splitting the SELFIES string at each occurrence. Then, each resulting token is treated as an independent SELFIES string.
• The [nop] (for “no-operation”) symbol, which is a special padding symbol ignored by selfies.
Table 2 provides examples of SELFIES atom, branch, and ring symbols.
Type | Examples |
---|---|
Atom | [#13C], [O], [C@@H1], [N+1] |
Branch | [Branch3], [#Branch1], [Branch2] |
Ring | [Ring1], [/\Ring3], [Ring2] |
Misc. | ., [nop] |
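The symbol structure above lends itself to a very small tokenizer. The following sketch, written in plain Python rather than with the selfies utilities, splits a SELFIES string into bracketed symbols and dots, then groups the symbols into fragments while discarding [nop]. The helper names `split_symbols` and `fragments` are hypothetical.

```python
import re

# A symbol is either one bracketed token or the dot delimiter.
SYMBOL = re.compile(r"\[[^\[\]]*\]|\.")


def split_symbols(selfies_string):
    """Tokenize a SELFIES string; reject anything outside [...] or '.'."""
    tokens = SYMBOL.findall(selfies_string)
    if "".join(tokens) != selfies_string:
        raise ValueError(f"malformed SELFIES string: {selfies_string!r}")
    return tokens


def fragments(selfies_string):
    """Split at dots and drop the padding symbol [nop]."""
    frags = [[]]
    for sym in split_symbols(selfies_string):
        if sym == ".":
            frags.append([])          # start the next independent fragment
        elif sym != "[nop]":
            frags[-1].append(sym)
    return frags


print(fragments("[C][=O].[nop][N+1][nop]"))
```

Each inner list can then be handled as an independent SELFIES string, in line with the description of the dot symbol.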
Specifically, the SELFIES grammar takes
(8) |
The derivation of a simple chain starts with the initial string σ0 = S. Recall that the SELFIES symbols dictate how production rules are applied. For simple chains, this is achieved by having each pair of SELFIES atom symbol and nonterminal symbol A ∈ V determine a production rule of the form A → α A′, where α ∈ Σ* is a terminal string and A′ ∈ V ∪ {ε}. Then, a sequence of replacements is iteratively performed by treating the SELFIES string as a queue of SELFIES symbols. At each step, the head of the queue is popped‡ and, together with a nonterminal symbol in the current string σt, is used to select and apply a production rule to get the next string σt+1. Note that σ0 = S is itself a single nonterminal symbol, and each rule induced by a SELFIES atom symbol replaces one nonterminal symbol by at most one other. Hence, throughout the derivation, the current string σt will always contain at most one nonterminal symbol and there is never any ambiguity as to how or which production rule is applied. Once the current string has only terminal symbols or the queue is empty, the process ends (since SELFIES strings are finite, termination necessarily occurs). The final derived SMILES string is read off by dropping all nonterminal symbols.
We now fully enumerate the SELFIES atom symbol to production rule mapping. Let [βα] be a generic atom symbol, as described in eqn (3). Based on this symbol, we first define the terminal string
(9) |
S → X | (10) |
(11) |
(12) |
(13) |
In eqn (10) and (11), the nonterminal symbols Xm are intuitively memorizing the maximum number of bonds that the most recently derived atom can adopt; the nonterminal symbol Xm can be understood as encoding that the last atom can make at most m bonds. When the next atom is derived, the bond connecting it to the preceding atom has its order decreased minimally such that the bond constraints are always satisfied.
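The behaviour described above (a state that remembers how many bonds the last atom can still make, and bond orders that are lowered just enough to respect ν) can be sketched in a few lines of Python. This is a simplified illustration, not the selfies implementation: it assumes atom symbols of the form [C], [=C] or [#C] over the atom types of eqn (1) only, with no hydrogens, charges, isotopes or stereochemistry, and the name `derive_chain` is hypothetical.

```python
NU = {"C": 4, "N": 3, "O": 2, "F": 1}      # constraints of eqn (1)
BOND_ORDER = {"": 1, "=": 2, "#": 3}
BOND_CHAR = {1: "", 2: "=", 3: "#"}


def derive_chain(symbols, nu=NU):
    """Derive a SMILES string from a list of simple SELFIES atom symbols."""
    smiles = []
    free = None  # None plays the role of S; an integer m plays the role of X_m
    for sym in symbols:
        body = sym[1:-1]
        bond = body[0] if body[0] in "=#" else ""
        atom = body[len(bond):]
        cap = nu[atom]
        if free is None:                 # first atom: the bond prefix is ignored
            smiles.append(atom)
            free = cap
            continue
        # lower the requested bond order just enough to satisfy both atoms
        order = min(BOND_ORDER[bond], free, cap)
        if order == 0:                   # the last atom is saturated: stop
            break
        smiles.append(BOND_CHAR[order] + atom)
        free = cap - order
    return "".join(smiles)


print(derive_chain(["[C]", "[=C]", "[#N]"]))
print(derive_chain(["[O]", "[=O]", "[C]"]))
```

In the first call the requested triple bond to nitrogen is reduced to a double bond because the preceding carbon has only two bonds left; in the second, the trailing [C] is never reached because the second oxygen is already saturated.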
Examples: To show these production rules in a concrete setting, we will translate the SELFIES string
(14) |
along with the constraints in eqn (1). The derivation of its corresponding SMILES string would proceed step-wise as follows:
(15) |
(16) |
under the same constraints. The derivation proceeds as
(17) |
producing the final SMILES string [CH3][13CH1]O. Note that isotopes are assumed to share the same valence, and when hydrogen atoms are specified in an atom type, its valence is decremented accordingly.
The derivation process extends that for simple chains (in Section 3.3), where we pop SELFIES symbols step-by-step off of a queue. We only add an additional rule for when we dequeue a branch symbol. Let this symbol be [βBranchℓ], as in eqn (5), where ℓ ∈ {1, 2, 3} is the number in the symbol name, and let A be a nonterminal symbol in the current string σt. If A ∈ {S, X1}, then this specifies the application of the production rule A → A. Effectively, the branch symbol is ignored in this case. If A = Xi for i ≥ 2, then we perform a replacement:
A → ρ X_{i−d0} | (18) |
where d0 = min(i − 1, d(β)) and ρ is the SMILES substring for the branch, obtained as follows. First, ℓ symbols are popped from the queue and converted into integer values by the mapping summarized in Table 3. Let c1, …, cℓ be the indices in first-to-last order of retrieval. In the event that the queue contains fewer than ℓ symbols, the missing indices are set to a default value of 0. Next, these indices are identified with a natural number N by treating them as hexadecimal digits:
$N = 1 + \sum_{k=1}^{\ell} 16^{\ell-k} c_k$ | (19) |
Index | Symbol | Index | Symbol |
---|---|---|---|
0 | [C] | 8 | [#Branch2] |
1 | [Ring1] | 9 | [O] |
2 | [Ring2] | 10 | [N] |
3 | [Branch1] | 11 | [=N] |
4 | [=Branch1] | 12 | [=C] |
5 | [#Branch1] | 13 | [#C] |
6 | [Branch2] | 14 | [S] |
7 | [=Branch2] | 15 | [P] |
All other symbols are assigned index 0 | | | |
Then, N symbols from the queue (or all remaining symbols if fewer exist) are consumed to form a new SELFIES string and, with start symbol X_{d0} (instead of S as before), this substring is recursively derived into a SMILES string ρ0. We take ρ = ε if ρ0 = ε, and ρ = (ρ0) otherwise.§
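The index computation of eqn (19) is straightforward to sketch in Python. The code below is an illustration under stated assumptions rather than the selfies source: the index alphabet is taken to be the 16 symbols of Table 3, with the bond-prefixed variants [=Branch1], [=Branch2], [=N] and [=C] at indices 4, 7, 11 and 12, and the function name `read_index` is hypothetical.

```python
from collections import deque

INDEX_ALPHABET = [
    "[C]", "[Ring1]", "[Ring2]", "[Branch1]",
    "[=Branch1]", "[#Branch1]", "[Branch2]", "[=Branch2]",
    "[#Branch2]", "[O]", "[N]", "[=N]",
    "[=C]", "[#C]", "[S]", "[P]",
]
INDEX_OF = {sym: i for i, sym in enumerate(INDEX_ALPHABET)}


def read_index(queue, n_symbols):
    """Pop up to n_symbols from the queue and return N as in eqn (19)."""
    digits = []
    for _ in range(n_symbols):
        sym = queue.popleft() if queue else None   # missing symbols count as 0
        digits.append(INDEX_OF.get(sym, 0))        # unknown symbols count as 0
    value = 0
    for d in digits:                               # base-16 positional value
        value = 16 * value + d
    return 1 + value


queue = deque(["[C]", "[Ring1]", "[=O]", "[F]"])
print(read_index(queue, 2))   # N = 1 + 16*0 + 1
print(list(queue))            # the symbols left over for the branch itself
```

Because the result is offset by one, a branch or ring symbol always refers to at least one symbol or atom, even when every index symbol maps to 0.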
Examples: To provide an overview of branch derivation, we translate a SELFIES string representing acetic acid:
(20) |
Processing the first two SELFIES symbols [O][C] results in the string OC X3, after which the symbol [=Branch1] is dequeued. Since ℓ = 1 for this branch symbol, we consume the next symbol [C] in the queue and identify it with N = 1. Hence, we create the SELFIES substring [=O] by popping the next symbol in the queue and, with start symbol X2, recursively derive it into the SMILES substring ρ = (=O). Then, performing the replacement in eqn (18) gives the string OC(=O) X1, and processing the last symbol [C] in the queue finally produces a SMILES string OC(=O)C for acetic acid. Another SELFIES string that corresponds to OC(=O)C is:
(21) |
The derivation is largely similar to that before. The major difference is that when the branch symbol is dequeued, the next ℓ = 2 symbols [C][Ring1] are identified with N = 1 + 16(0) + 1 = 2, and then the SELFIES substring [=O][F] is used to again derive ρ = (=O).
Per eqn (6), there are two forms of SELFIES ring symbols. To simplify the ensuing discussion, however, we will begin by only considering the first form. When a ring symbol [βRingℓ] is popped from the queue of SELFIES symbols, a nonterminal symbol A in the current derived string is used to specify a production rule. If A = S, then we apply the rule A → A, and the ring symbol is effectively skipped. If A = Xi, then we replace:
A → X_{i−min(i,d(β))} | (22) |
In addition, we consume the next ℓ symbols of the queue (or all remaining symbols if fewer exist) to specify a number N by eqn (19). Then, the ring symbol indicates that a ring closure should be formed between the ring-initiating atom and the N-th atom previously derived from it (or simply the first atom if fewer than N such atoms exist). Here, the derivation order is the order in which atoms are realized through the production rules in eqn (10) and (11). By ring-initiating atom, we mean the atom at which bonds would be made if the ring symbol were instead an atom symbol. Often, this coincides with the last-derived atom, as is the case in:
NC(C)COC*†X4 | (23) |
NC(C)COC*(C)(C†)X1 | (24) |
For brevity, we will refer to the ring-initiating atom as the right ring atom and its counterpart the left ring atom, as the latter precedes the former in a SMILES string under derivation order.
Although a ring symbol specifies a closure between the left and right ring atoms, such a bond cannot be naively added since it may cause valences to be violated for the left ring atom immediately (e.g., consider the case where this atom has already attained its maximum valence) or in the future. Hence, SELFIES postpones the creation of ring closures to a final post-processing step. Instead, the ring closure candidates are pushed to a temporary queue and, once all the SELFIES symbols have been processed, the items in this queue are revisited in first-to-last order. Based on the state of the ring atoms, a candidate may be rejected (and no ring bond is made) or executed.
Specifically, given a potential ring closure indicated by symbol [βRingℓ], let m1 and m2 be the number of additional bonds that the left and right ring atoms can make, respectively. If m1 = 0 or m2 = 0, we must reject the candidate since adding the ring closure would exceed the valence of one of the ring atoms. The candidate is also rejected if its left and right ring atoms are not distinct, to avoid unphysical self-loops. Otherwise, the candidate is accepted and, assuming there is no pre-existing bond between its two ring atoms, we form a new bond of order d0 = min(d(β), m1, m2) between them. If a prior bond does exist (e.g., if a duplicate ring closure was specified earlier in the queue), then we increment the order of this existing bond as necessary. That is, if the existing bond is of order d1, then we promote it to a bond of potentially higher order min(3, d1 + d0).
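The acceptance rule for a single ring closure candidate can be written as a small pure function. The sketch below follows the description above step by step; the function name `resolve_ring_closure` and its argument layout are assumptions made for illustration, and bookkeeping such as updating m1 and m2 after a bond is formed is left to the caller.

```python
def resolve_ring_closure(left, right, requested_order, m1, m2, existing_order=0):
    """Return the resulting bond order, or None if the candidate is rejected.

    left, right:      indices of the left and right ring atoms
    requested_order:  d(beta), the bond order asked for by the ring symbol
    m1, m2:           additional bonds the left / right atom can still make
    existing_order:   order of a bond already joining the two atoms (0 if none)
    """
    if left == right:            # no self-loops
        return None
    if m1 == 0 or m2 == 0:       # one ring atom is already saturated
        return None
    d0 = min(requested_order, m1, m2)
    if existing_order == 0:
        return d0                # form a new ring bond of order d0
    return min(3, existing_order + d0)   # promote the existing bond


print(resolve_ring_closure(1, 4, requested_order=1, m1=2, m2=3))
print(resolve_ring_closure(1, 4, requested_order=2, m1=1, m2=3))
print(resolve_ring_closure(1, 4, requested_order=1, m1=0, m2=3))
print(resolve_ring_closure(1, 4, requested_order=1, m1=2, m2=2, existing_order=1))
```

Deferring these decisions to a post-processing pass means every candidate is judged against the final state of its two atoms, which is what keeps the output within the valence constraints.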
Examples: We translate a SELFIES string representing methylcyclobutane:
(25) |
The first five symbols produce the string CCCCC X3, after which the ring symbol [Ring1] is dequeued. Since ℓ = 1 for this ring symbol, the next and final symbol [Ring2] specifies a single ring bond between the final C and its N = 3rd preceding atom. This produces the SMILES string CC1CCC1. Note that incrementing the indexing symbol:
(26) |
increments the distance of the ring closure, hence producing a SMILES string for cyclopentane C1CCCC1. Appending a copy of the ring and index symbols:
(27) |
The second ring symbol form [β1β2Ringℓ] in eqn (6) behaves nearly identically to [βRingℓ], and is used to support specification of stereochemistry across single ring bonds. The only difference occurs when a ring closure candidate produced by [β1β2Ringℓ] is accepted, and a new ring bond is added between the two ring atoms. In this case, if β1 ∈ {/, \}, then we add the bond character β1 before the numeric ring tag on the left ring atom, and similarly with β2 and the right ring atom. For instance, if the example in eqn (25) used the symbol [/−Ring1] instead of [Ring1], then the derived SMILES string would be CC/1CCC1.
Element | Maximum bonds (charge 0) | Maximum bonds (charge +1) | Maximum bonds (charge −1) |
---|---|---|---|
H, F, Cl, Br, I | 1 | — | — |
B | 3 | 2 | 4 |
C | 4 | 5 | 3 |
N | 3 | 4 | 2 |
O | 2 | 3 | 1 |
P | 5 | 6 | 4 |
S | 6 | 7 | 5 |
Line 5 translates the SMILES string for benzene into the SELFIES string in Line 11. Notably, SELFIES does not support aromatic atom symbols (e.g., c) in the same way as SMILES, so encoder( ) performs an internal kekulization if it is passed an aromatic SMILES string. Line 7 guards against errors raised by encoder( ) when it is passed SMILES strings that are syntactically invalid, semantically invalid (i.e., violate the constraints described in the next subsection), or unsupported. An unsupported SMILES string uses features of SMILES that are not implemented in SELFIES, such as the wildcard * and quadruple bond $ symbols; the API reference of selfies further details which SMILES strings are currently supported. Line 10 applies the roundtrip( ) function to a SMILES string c1ccccc1 for benzene. This round-trip translation recovers a SMILES string C1=CC=CC=C1 that is different from the original string, but still specifies the (kekulized) benzene molecule.
In greater detail, given an input SMILES string, encoder( ) first performs a kekulization if it contains any aromatic atom symbols, as in the example above. Next, the actual translation process begins. In the simplest case, if the input represents a simple atom chain, then a translation to SELFIES is performed by essentially grouping each atom symbol with its preceding bond symbol, if any. For example, the SMILES string O[13CH]C#N would be partitioned into O, [13CH], C, #N and turned respectively into SELFIES symbols [O][13CH1][C][#N]. Branches are recursively translated and the result is used to work backwards to find the appropriate branch and indexing symbols to prepend. If there are multiple plausible choices, we use the one in which the branch symbol [βBranchℓ] has minimized ℓ and β representing the bond connecting the branch to the parent chain. For instance, C(O)O is encoded as [C][Branch1][C][O][O] instead of [C][#Branch2][C][C][O][O], despite both SELFIES strings producing C(O)O under the derivation process. Finally, ring closures are handled similarly in that we work backwards to find the appropriate ring and indexing symbols. If there are multiple choices, we use the one in which the ring symbol [βRingℓ] (or [β1β2Ringℓ]) has minimized ℓ and β (or β1, β2) representing the bond of the ring closure.
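The simple-chain case described above can be sketched in plain Python. This is a toy encoder for unbranched, ring-free SMILES only, not the encoder( ) of selfies: the regular expressions, the restriction to a handful of organic-subset elements, and the names `normalize_bracket_atom` and `encode_chain` are all assumptions made for this illustration. Bracket atoms are normalized so that hydrogen counts and charges are always explicit (e.g., [13CH] becomes [13CH1] and [Fe++] becomes [Fe+2]).

```python
import re

# optional bond symbol, followed by a bracket atom or an organic-subset atom
TOKEN = re.compile(r"([=#]?)(\[[^\]]+\]|Br|Cl|[BCNOPSFI])")
# isotope, element, chirality, optional H count, optional charge
BRACKET = re.compile(r"(\d*)([A-Z][a-z]?)(@{0,2})(?:H(\d*))?([+-]*)(\d*)")


def normalize_bracket_atom(inner):
    """Make the hydrogen count and charge magnitude explicit."""
    m = BRACKET.fullmatch(inner)
    if m is None:
        raise ValueError(f"unsupported bracket atom: [{inner}]")
    isotope, element, chirality, h_count, sign, magnitude = m.groups()
    out = isotope + element + chirality
    if h_count is not None:
        out += "H" + (h_count or "1")
    if sign:
        out += sign[0] + (magnitude or str(len(sign)))
    return out


def encode_chain(smiles):
    """Group each atom with its preceding bond symbol, one SELFIES symbol each."""
    pos, symbols = 0, []
    while pos < len(smiles):
        m = TOKEN.match(smiles, pos)
        if m is None:
            raise ValueError(f"not a simple chain: {smiles!r}")
        bond, atom = m.groups()
        if atom.startswith("["):
            atom = normalize_bracket_atom(atom[1:-1])
        symbols.append(f"[{bond}{atom}]")
        pos = m.end()
    return "".join(symbols)


print(encode_chain("O[13CH]C#N"))
```

Branches and ring closures are deliberately rejected here, since handling them requires the backward search for branch, ring and indexing symbols described in the text.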
Note that by changing the pool of SELFIES symbols from which we sample, we can change the distribution of produced molecules.
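As a rough illustration of this point, the sketch below draws random SELFIES strings from a pool of symbols using only the Python standard library. The pool contents and the helper name `sample_selfies` are arbitrary choices for the example; translating the sampled strings into molecules would additionally require decoder( ), which is not called here.

```python
import random


def sample_selfies(pool, length, rng):
    """Concatenate `length` symbols drawn uniformly (with replacement) from pool."""
    return "".join(rng.choice(pool) for _ in range(length))


rng = random.Random(0)  # fixed seed so the draw is reproducible

# Repeating a symbol in the pool raises its sampling probability.
carbon_rich = ["[C]"] * 8 + ["[=C]", "[O]", "[N]"]
hetero_rich = ["[C]", "[O]", "[N]", "[F]", "[=O]", "[#N]"]

print(sample_selfies(carbon_rich, 10, rng))
print(sample_selfies(hetero_rich, 10, rng))
```

Since every combination of SELFIES symbols corresponds to a molecule that obeys the constraints, reweighting the pool is enough to bias the sampled chemistry without any validity filtering.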
The attributions are a list of AttributionMap objects, one for each output symbol. Each AttributionMap contains the output symbol, its index, and a list of Attribution objects, each of which holds an input symbol (and its index) that is responsible for the output symbol. Note that a single output symbol may be attributed to multiple input symbols because it may be determined by both atom symbols and branch or ring symbols. Tracing the relationship between symbols can enable alignment between SMILES and SELFIES so that per-atom properties can be connected on both sides of the translation.
However, a limitation of the default constraints is that SELFIES cannot represent existing molecules that violate them, such as perchloric acid (which features a hypervalent Cl making 7 bonds). Moreover, the catch-all constraint may be too relaxed to ensure the validity of SELFIES strings containing atom types outside those in Table 4 (e.g., Si, Se). Hence, users may wish to instead use custom constraints that are tailored to the SELFIES strings being worked with. To this end, selfies provides the key function set_semantic_constraints( ). The following provides a minimal example:
Here, the constraints dictionary encodes a set of custom constraints; specifically, explicit constraints on the neutral and ±1 charged variants of C (as in Table 4) and a catch-all constraint (of 4 maximum bonds). Line 8 then sets constraints as the underlying semantic constraints that selfies will operate under, which changes the subsequent behaviour of encoder( ) and decoder( ) appropriately. Note that the pre-existing constraints are fully replaced in Line 8; any constraint that is not explicitly specified in constraints would be thus removed.
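For orientation, the shape of such a constraints dictionary and the way a bond limit could be looked up from it are sketched below without calling selfies. The key format (element followed by a signed charge, with "?" as the catch-all) follows the convention we understand selfies to use; the helper `bond_limit` is hypothetical and only mimics the lookup.

```python
# Custom constraints as described in the text: neutral and +/-1 carbon,
# plus a catch-all of 4 maximum bonds for every other atom type.
constraints = {"C": 4, "C+1": 5, "C-1": 3, "?": 4}


def bond_limit(constraints, element, charge=0):
    """Look up the maximum number of bonds, falling back to the catch-all."""
    key = element if charge == 0 else f"{element}{charge:+d}"
    return constraints.get(key, constraints["?"])


print(bond_limit(constraints, "C"))        # explicit entry
print(bond_limit(constraints, "C", +1))    # explicit entry for the cation
print(bond_limit(constraints, "O"))        # no entry: catch-all applies
```

The last call illustrates the replacement semantics noted above: because oxygen is not listed in the custom dictionary, it is governed by the catch-all of 4 rather than by the value it has in Table 4.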
For convenience, selfies provides a couple of preset constraints to serve as templates that can be easily modified. These can be obtained as follows:
The currently-set constraints can also be viewed by:
Furthermore, selfies includes functions to extract a vocabulary of symbols from a dataset of SELFIES strings, and to convert SELFIES strings into label or one-hot encodings. Consider the following example:
Here, we are given a list dataset of SELFIES strings. Line 7 uses a utility function of selfies to extract the set alphabet of SELFIES symbols that appear in the dataset, which is used in Line 13 to create a symbol to index mapping termed stoi. Next, lines 17–22 use another utility function selfies_to_encoding( ) to create a label and one-hot encoding of the first SELFIES string in the dataset. Under the hood, this function first pads the input string to length pad_to_len by appending to it sufficiently many copies of the symbol [nop] (for “no-operation”), which is a special padding symbol in selfies that is automatically ignored by decoder( ). Then, the padded SELFIES string is tokenized, and stoi is used to convert each of its symbols into integer labels and one-hot vectors. Since the padded SELFIES string may now contain [nop], this symbol must be added to stoi, which is done through Line 8. Lastly, the reverse encoding can be performed using the encoding_to_selfies( ) utility:
Table 5 summarizes the various utility functions introduced within this section.
Function | Description |
---|---|
len_selfies( ) | Computes the symbol length of a SELFIES string |
split_selfies( ) | Tokenizes a SELFIES string into its constituent symbols |
get_alphabet_from_selfies( ) | Extracts a minimal vocabulary from a dataset of SELFIES strings |
selfies_to_encoding( ) | Converts a SELFIES string into a label and/or one-hot encoding |
encoding_to_selfies( ) | Recovers a SELFIES string from its label and/or one-hot encoding |
get_semantic_robust_alphabet( ) | Provides an alphabet of semantically-constrained SELFIES symbols |
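To show what the padding and encoding steps amount to, the following standard-library sketch mirrors the behaviour described above for label and one-hot encodings. It is not the selfies implementation; the helper names `to_label_and_one_hot` and `from_labels`, and the toy dataset, are assumptions made for this example.

```python
import re

SYMBOL = re.compile(r"\[[^\]]*\]")


def to_label_and_one_hot(selfies_string, stoi, pad_to_len):
    """Pad with [nop], then map symbols to integer labels and one-hot rows."""
    symbols = SYMBOL.findall(selfies_string)
    symbols += ["[nop]"] * (pad_to_len - len(symbols))
    labels = [stoi[s] for s in symbols]
    one_hot = [[int(i == lab) for i in range(len(stoi))] for lab in labels]
    return labels, one_hot


def from_labels(labels, itos):
    """Reverse the label encoding back into a (padded) SELFIES string."""
    return "".join(itos[i] for i in labels)


dataset = ["[C][O][C]", "[F][C][F]", "[O][=O]"]

# Vocabulary of all symbols in the dataset, plus the padding symbol.
alphabet = sorted({s for d in dataset for s in SYMBOL.findall(d)} | {"[nop]"})
stoi = {s: i for i, s in enumerate(alphabet)}
itos = {i: s for s, i in stoi.items()}

labels, one_hot = to_label_and_one_hot(dataset[0], stoi, pad_to_len=5)
print(labels)                     # [1, 3, 1, 4, 4]
print(from_labels(labels, itos))  # [C][O][C][nop][nop]
```

The recovered string still carries the [nop] padding, which is harmless because that symbol is ignored during decoding.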
Fig. 2 The roundtrip translation time of 1000 randomly-sampled SMILES strings from the DTP open compound collection as a function of size, measured in number of atoms. |
Overall, the SELFIES community has grown rapidly and we are actively engaging in constructive discussions about the current implementation and future improvements. While selfies 2.1.1 supports almost all important features of SMILES, there are still many new features on our agenda. We outlined many of them in a recent perspective,20 for example, extensions to polymers, crystals, molecules with non-covalent bonds, or reactions. Our vision is that SELFIES will become a standard computer representation for molecular matter. We encourage the community to implement it into their workflows, report errors in the current implementation, and propose changes and new features that will help them to succeed in their goals.
This journal is © The Royal Society of Chemistry 2023 |