Kyle Mills *a, Kevin Ryczko b, Iryna Luchak c, Adam Domurad d, Chris Beeler a and Isaac Tamblyn abe
a University of Ontario Institute of Technology, Oshawa, Ontario, Canada. E-mail: kyle.mills@uoit.net; isaac.tamblyn@nrc.ca
b University of Ottawa, Ottawa, Ontario, Canada
c University of British Columbia, Vancouver, British Columbia, Canada
d University of Waterloo, Waterloo, Ontario, Canada
e National Research Council Canada, Ottawa, Ontario, Canada
First published on 20th March 2019
We present a physically motivated topology of a deep neural network that can efficiently infer extensive parameters (such as energy, entropy, or number of particles) of arbitrarily large systems, doing so with O(N) scaling. We use a form of domain decomposition for training and inference, where each sub-domain (tile) is comprised of a non-overlapping focus region surrounded by an overlapping context region. The size of these regions is motivated by the physical interaction length scales of the problem. We demonstrate the application of EDNNs to three physical systems: the Ising model and two hexagonal/graphene-like datasets. In the latter, an EDNN was able to make total energy predictions of a 60-atom system, with accuracy comparable to density functional theory (DFT), in 57 milliseconds. Additionally, EDNNs are well suited for massively parallel evaluation, as no communication is necessary during neural network evaluation. We demonstrate that EDNNs can be used to make an energy prediction of a two-dimensional 35.2 million atom system, spanning over 1.0 μm² of material, at an accuracy comparable to DFT, in under 25 minutes. Such a system exists on a length scale visible with optical microscopy and larger than some living organisms.
Machine learning methods are rapidly being adopted by chemists, physicists, and materials scientists, and have performed well at making predictions in the fields of dynamical mean-field theory, many-body physics,6,7 strongly correlated materials,8–10 phase transitions and classification,11–15 and materials exploration and design.16–23 Machine learning models have been shown to be of sufficient accuracy to provide fast and accurate chemical insights.20,24–28
Convolutional deep neural networks have been used to predict the kinetic energy of hydrocarbons and were successful in reproducing the Kohn–Sham potential energy surfaces,29 and have been used to classify reciprocal-space diffraction patterns for crystal lattices.30 Deep neural networks have proven their classification power in astronomical applications31 and particle physics applications,32–34 but have yet to be widely adopted throughout the physics community for accurate numerical predictions. A growing number of methods have been proposed to capture the relevant chemistry within representations of atomic environments,35 and a consensus is forming around the need to incorporate physics into network design and to use the physics of the underlying problem to motivate specific network structures and techniques.36–38 The work of Brockherde et al.39 focuses on low-dimensional systems and small molecules. In addition, their architecture is specialized to work with either the external potential or the electron density using kernel ridge regression. These models are highly specialized and do not perform any form of energy partitioning. Their runtime is governed by the computational complexity of kernel ridge regression during training. Depending on the dimensionality of the input, we have found that kernel ridge regression requires large amounts of RAM, whereas the memory requirements of deep neural networks (DNNs) are lower.
DNNs have the ability to replace both classical40 and quantum mechanical operators.41,42 In comparison to other machine learning methods, convolutional deep neural networks prevailed as the most accurate and best-scaling method for all but the simplest cases.41,43 Deep neural networks were able to learn the mapping from spin configuration to energy for multiple cases of Ising-like classical spin models. For the case of a confined quantum particle, a deep neural network successfully learned the energies of the ground state and first excited state, as well as the kinetic energy.41 A similar approach was able to map the structures of two-dimensional hexagonal crystal lattices to total energies computed within the density functional theory framework.42 All of this was accomplished via "featureless" deep learning; the network was presented with raw spatial data, without any preliminary attempt at manual feature selection.
Traditionally, the training process employed for deep neural networks is not transferable to systems of arbitrary length scales. In practice, this means that for square, L × L lattice systems, a model trained on 4 × 4 configurations cannot be used on an 8 × 8 cell (or vice versa) without retraining at least part of the network. Since smaller or larger inputs simply cannot be processed, this size limitation is a major shortcoming of deep neural network techniques. Beyond the practical limitations, from a fundamental standpoint it is unsatisfying to use a model that has no concept of extensivity.
A physical property is extensive if it can be divided among subsystems. A common example is the number of particles in a system. When a sample is divided evenly into two subsystems, the number of particles in each subsystem is halved. This is in contrast to intensive quantities such as temperature that are unchanged by subdivision or addition of subsystems.
Maintaining extensivity has not been a focus of machine learning researchers, since extensivity is not a common requirement in traditional vision- and audio-based applications of deep learning. Most classification problems (e.g. identification of an animal or shape in an image) are invariant to the physical dimensions of an image (e.g. the number of pixels that comprise a cat). Indeed, absolute scale is not normally recorded in a photograph, and therefore scale-invariant models are necessary and commonplace. Furthermore, photographs, handwriting, and audio recordings too large to be processed by a deep neural network can be resized, cropped, and segmented without destroying the features necessary to make a prediction;44 there is no absolute spatial scale upon which the label of interest depends.
In a physical measurement or simulation, however, the physical scale of a pixel matters critically. Consider the case of an X-ray diffraction experiment, where the interference pattern recorded at the detector depends strongly on the wavelength of the scattered light and the physical length scale over which the signal is collected. It is not possible to reconstruct the signal properly unless such parameters are known and are consistent. Extensivity is also critical in describing chemical systems; configuration interaction with single and double excitations (CISD) is infamous for its lack of extensivity.
In this work, we propose a general method that preserves the extensivity of physical quantities and accommodates arbitrary input sizes. We propose a new deep neural network structure which, once trained, can operate on (effectively) arbitrary-sized inputs and length scales while maintaining the physical requirement of extensivity. Unlike atom-centred approaches, we avoid the problem of assigning or projecting energy onto specific atoms by forcing the neural network to learn the partitioning automatically while viewing the entire structure at once.
We call our approach Extensive Deep Neural Networks (EDNNs), employing domain decomposition to solve the problem of operator evaluation across length scales. Although domain decomposition techniques have a long history in computer simulation and modelling, here we have taken a new approach and allow the model itself to identify and self-optimize the overlap of tiles at domain boundaries.
Previous work45 has identified the necessity of extensivity.46,47 Our method is sufficiently general to allow for applications to any system in which extensivity holds, such as the spin and atomic systems we demonstrate in this work and even the charge density (e.g. a scalar field representing an extensive quantity). Furthermore, our method results in a model that can be evaluated on arbitrarily large system sizes, which we demonstrate.
Now, one might ask: “If the property we wish to predict is extensive, can we not just split the input into blocks and add up the individual answers?” Fig. 1 provides an example. If the goal of the neural network is to count the number of dots in the box (left), then division into non-overlapping subsystems will indeed yield the correct answer. If, however, the task is to count the number of multicolored pairs (right), the process is not so straightforward, and subdivision without accounting for the boundary yields erroneous answers.
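As a concrete illustration (a toy example of our own, not taken from the paper), the short NumPy sketch below counts the number of aligned nearest-neighbour pairs on a periodic spin lattice, first exactly and then by a naive block-by-block sum; the naive sum misses every pair that straddles a block boundary.

```python
# Toy illustration of the boundary problem: a naive decomposition into
# non-overlapping blocks undercounts a nearest-neighbour quantity.
import numpy as np

rng = np.random.default_rng(0)
L, f = 8, 4                               # lattice size and block (focus) size
s = rng.choice([-1, 1], size=(L, L))      # periodic square lattice of spins

# Exact count of aligned nearest-neighbour pairs (right + down bonds, periodic).
exact = int(np.sum(s == np.roll(s, -1, axis=0)) + np.sum(s == np.roll(s, -1, axis=1)))

# Naive decomposition: count only the bonds lying entirely inside each f x f block.
naive = 0
for i in range(0, L, f):
    for j in range(0, L, f):
        b = s[i:i + f, j:j + f]
        naive += int(np.sum(b[:-1, :] == b[1:, :]) + np.sum(b[:, :-1] == b[:, 1:]))

print(exact, naive)                       # naive < exact: boundary bonds are lost
```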
The spatial extent over which features in the configuration influence the value of an operator is known as locality. In general, operators (such as the number operator, Hamiltonian, magnetization, etc.) may be described as local, semi-local, or fully non-local. In the density functional theory framework,48,49 exchange–correlation functionals are commonly classified into these categories: the local density approximation is considered local, generalized gradient approximations such as PW9150 and PBE51 are considered semi-local, and some hybrid functionals (e.g. B3LYP52,53) and exact exchange are considered non-local.
We define l to be the length scale of an operator's locality. For example, in the counting example above, the number operator is fully local (l = 0), as computing ε requires only local knowledge. The nearest-neighbour example is not fully local, with l = 1, meaning knowledge of the immediately surrounding region is necessary to make a prediction. The gradient operator evaluated with a second-order finite difference method is an example of a semi-local operator with l = 2.
For many systems, such as the Coulomb (1/r) interaction, there is no hard cut off, but typically one expects the importance of a feature to diminish as the distance from the feature increases. For example, in a material, the screening environment (i.e. the importance of many-body effects) has a strong influence over how quickly this attenuation occurs. In metals it occurs quickly, but in large band-gap insulators the falloff is much more gradual. Even though quantum mechanics involves fully non-local operators, it has been noted that matter is, in practice, near-sighted.54
This idea of operator locality is the primary motivation for the subdivision technique used in EDNNs: an L × L training example is divided into N = L2/f2 non-overlapping regions of size f × f. We call these regions focus. Then to each focus region, we provide overlapping context of width c. Each of these N tiles (Fig. 2b) of size (f + 2c) × (f + 2c) is then fed into an identical neural network (Fig. 2a), and the N individual outputs are summed to impose the extensivity of the operator. The loss is computed with respect to this final, summed value, and backpropagation is used to update the weights. In a normal domain-decomposition technique, some method to compensate for the inherent double-counting of the overlapping context regions would be necessary, however with EDNNs, we leave the task of rectifying this double-counting to the deep neural network itself; it must somehow learn to partially ignore the overlapping context regions.
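A minimal sketch of this decomposition is given below, assuming periodic boundary conditions and with the neural network replaced by an arbitrary callable; the function ednn_predict and the hand-written pair_net are illustrative choices of our own, not the authors' code.

```python
# Minimal EDNN-style decomposition: periodically pad, slice into N = L^2/f^2
# tiles of size (f + 2c) x (f + 2c), apply the *same* model to every tile, and
# sum the outputs to impose extensivity.
import numpy as np

def ednn_predict(x, f, c, net):
    """x: (L, L) array; net: callable mapping one (f + 2c, f + 2c) tile to a scalar."""
    L = x.shape[0]
    assert L % f == 0, "the focus must tile the example evenly"
    padded = np.pad(x, c, mode="wrap")            # periodic boundary conditions
    total = 0.0
    for i in range(0, L, f):                       # loop over the (L/f)^2 tiles
        for j in range(0, L, f):
            total += net(padded[i:i + f + 2*c, j:j + f + 2*c])
    return total                                   # the predicted extensive quantity

# With c = l = 1 and a hand-written "network" that counts the aligned pairs whose
# first member lies in its focus region, the boundary error of the previous sketch
# disappears: every bond is counted exactly once, by exactly one tile.
f, c = 4, 1
s = np.random.default_rng(0).choice([-1, 1], size=(8, 8))
pair_net = lambda t: float(np.sum(t[c:c+f, c:c+f] == t[c+1:c+f+1, c:c+f]) +
                           np.sum(t[c:c+f, c:c+f] == t[c:c+f, c+1:c+f+1]))
exact = int(np.sum(s == np.roll(s, -1, axis=0)) + np.sum(s == np.roll(s, -1, axis=1)))
print(ednn_predict(s, f, c, pair_net), exact)      # identical values
```

In a trained EDNN, the hand-written pair_net above is replaced by a neural network that learns the appropriate per-tile contribution, and how to discount its context, directly from the data.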
As for the “brain” of the model, we chose to demonstrate the technique with a neural network; the neural network within the EDNN can be of arbitrary complexity. We have made use of both deep convolutional neural networks as well as fully-connected artificial neural networks, depending on the complexity of the underlying problem. Since the input to the neural network is much smaller than the overall example, network architectures that would normally be prohibitively expensive become tractable with EDNNs (e.g. within the EDNN framework it would be possible to use a fully-connected artificial neural network to process a many-megapixel input).
Details of the training hyperparameters are provided in Table 1. The fully-connected artificial neural network used to train the Ising model is comprised of a fully-connected layer of size 32(f + 2c)² (note that (f + 2c) is the size of a single tile), followed by two fully-connected layers with 64 outputs, which finally feed into a fully-connected layer of size 1. The convolutional deep neural network used to train the DFT models is constructed from 13 convolutional layers and two fully-connected layers. The first two layers are reducing and operate with filter sizes of 3 × 3 pixels; each of these reducing layers has 64 filters and a stride of 2 × 2. The next six layers are non-reducing, meaning they have unit stride and preserve the resolution of the image; each of these non-reducing layers has 16 filters of size 4 × 4. The ninth convolutional layer has 64 filters of size 3 × 3 and a stride of 2 × 2. The last four convolutional layers are non-reducing and have 32 filters of size 3 × 3. The final convolutional layer is fed into a fully-connected layer of size 1024, which feeds into a final fully-connected layer with a single output. The contribution from each tile is summed, and the loss is computed as the mean-squared error between this value and the true value.
Table 1 Training hyperparameters for the models investigated in this work

Model | Total training examples | Batch size | Network | Optimizer | Learning rate | Loss | Epochs
---|---|---|---|---|---|---|---
Ising model | 100000 | 2000 | DNN | Adam57 | 1 × 10−4 | MSE | 500
Hexagonal sheets | 18515 | 100 | CNN | Adam | 1 × 10−5 | MSE | 500
Porous graphene | 501473 | 64 | CNN | Adam | 1 × 10−6 | MSE | 500
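For concreteness, a sketch of the per-tile convolutional network described above could be written with tf.keras as follows. The layer counts, filter counts, filter sizes, and strides follow the description in the text; the activation functions, padding choices, and construction of the input are our assumptions, as these details are not specified here.

```python
# Sketch of the per-tile CNN: 2 reducing (64 filters, 3x3, stride 2), 6 non-reducing
# (16 filters, 4x4), 1 reducing (64 filters, 3x3, stride 2), 4 non-reducing
# (32 filters, 3x3), then fully-connected layers of size 1024 and 1.
import tensorflow as tf

def tile_network(tile_size):          # tile_size = f + 2c
    inp = tf.keras.Input(shape=(tile_size, tile_size, 1))
    h = inp
    for _ in range(2):
        h = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(h)
    for _ in range(6):
        h = tf.keras.layers.Conv2D(16, 4, strides=1, padding="same", activation="relu")(h)
    h = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(h)
    for _ in range(4):
        h = tf.keras.layers.Conv2D(32, 3, strides=1, padding="same", activation="relu")(h)
    h = tf.keras.layers.Flatten()(h)
    h = tf.keras.layers.Dense(1024, activation="relu")(h)
    out = tf.keras.layers.Dense(1)(h)  # per-tile contribution, later summed over tiles
    return tf.keras.Model(inp, out)
```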
We note that parallel branching within neural networks is a common technique, usually taking one of two forms. The first, used by e.g. GoogLeNet (a.k.a. Inception),55 uses repeating modules of parallel convolutional layers. Ref. 44 uses a similar approach, with multiple preprocessing techniques feeding a variety of data representations through many branches of a neural network architecture. Each branch of these neural networks has its own set of weights and learns different features. Ultimately, the output from each branch is concatenated to produce an ensemble of learned features that is subsequently fed into a decision layer.
The second approach employs a single set of weights, shared across the parallel branches such as in ref. 56. This technique facilitates efficient parallelization of the neural network training as each branch can be evaluated on separate hardware with little communication between devices. When training in parallel, the gradients from each separate branch are averaged, and the weights of each branch are updated synchronously. This effectively leads to a multiplicative speed-up in training and inference.
EDNNs are fundamentally different from these approaches, however, since in contrast to the former, the contribution from each branch is summed and not concatenated, and in contrast to the latter, the gradients used to update the weights are computed after the extensivity-imposing summation.
Several considerations guide the choice of the focus size f and the context size c:
• c ≈ l, in order to provide sufficient context to the network; however, c should not be much greater than l, so as to avoid introducing redundant computation into the network. Context that is too large is not only inefficient to evaluate, but also makes the weight-optimization process more challenging, as the network must learn to ignore a larger fraction of the input signal.
• Choosing f is a balancing act between parallelizability and overall computational cost. Minimizing f results in more tiles, which can be computed independently and thus parallelized efficiently. On the other hand, a small focus leads to a greater overall computational demand, since for every focus region there are overlapping context pixels that are computed multiple times (see the sketch following this list).
• A quantitative comparison of different focus and context pairs is difficult, as varying these parameters also changes the architecture of the neural network (e.g. the number of weights, or the number of layers required to reduce the image to a predetermined size through strided convolutions), modifying the fitting capability of the network.
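The sketch below (a helper of our own, not part of the published method) makes the second point quantitative: for a given example size L, focus f, and context c, it reports the number of independent tiles and the factor by which the redundant context inflates the computation.

```python
# Trade-off between parallelism (number of tiles) and redundant work per tile.
def decomposition_cost(L, f, c):
    n_tiles = (L // f) ** 2                  # independent work units
    overhead = (f + 2 * c) ** 2 / f ** 2     # pixels processed per pixel of focus
    return n_tiles, overhead

# e.g. for the hexagonal-sheet settings used later (256 x 256 grid, f = 64, c = 32):
print(decomposition_cost(256, 64, 32))       # (16, 4.0): 16 tiles, 4x redundant pixels
```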
We note that as the locality length scale of an operator grows, the optimal EDNN approaches a single tile processed by a normal deep neural network, with the context region acting simply as periodic padding. This is because for fully non-local operators it is physically impossible to divide the problem up in the style of EDNNs: the full volume is needed to make an inference. Such systems are, thankfully, rare in practice. EDNNs therefore represent a framework that can naturally handle the full continuum of possible screening environments, and can describe all phases of the electron gas. We note that when dealing with operators that have large values of l, it is often useful to recast the problem in reciprocal space, and such an approach could also be useful with EDNNs.
For these quantum mechanical systems (computed within the generalized gradient approximation of the density functional theory48,49 framework), it is not possible to determine a specific value of l. Within the hierarchy of approximations, the method we used to compute the total energy (PBE51) is considered a semi-local approximation, since the exchange–correlation potential includes only gradients of the charge density. Other terms within the total energy are fully non-local (although they are subject to screening by the electron gas). Nonetheless, the total energy is an extensive quantity, and again the EDNN performs favourably compared to previously reported (non-EDNN) results.
Using f = 64 and c = 32, we achieve an MAE of 1.122 meV Å−2 on our test set after 500 training epochs, notably better than the MAE of 2.529 meV Å−2 obtained with a traditional (non-EDNN) network (Fig. 5).42 For this choice of (f, c), our input representation is divided into 16 tiles, enabling an inference speed-up factor of 16 due to the ability to calculate the contribution from each tile in parallel. This is on top of the inherent speed-up of evaluating the EDNN compared to DFT; a conservative estimate for this latter speed-up is on the order of 1 million in terms of CPU-hours.
The ability of an EDNN to learn to ignore, or compensate for, redundant context can be explored by measuring the performance of the model when information is partially obfuscated through the application of a Gaussian blur to selected regions of the input. In Fig. 6, we plot the performance of the network when blur is applied within the context region (edge) during inference. As expected, when the blur is applied at the periphery of the context region, the network reports values for ε very similar to those obtained with no blur present. As the blurring encroaches on more of the context, the predictions become poorer; the neural network is evidently learning to ignore the context region. This is to be expected, as the same data appear in the focus region of another tile; this is how double-counting is avoided. When a small area in the centre of the focus region is blurred, the neural network is still able to make accurate predictions (likely because only a limited amount of information is lost), but as the extent of the blur within the focus region increases, the accuracy of inference suffers greatly.
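A sketch of this obfuscation probe is given below, assuming SciPy's Gaussian filter as the blur and a simple Boolean mask to select the blurred region; the exact protocol used to produce Fig. 6 may differ.

```python
# Probe what a trained tile network relies on by blurring selected regions.
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_region(tile, mask, sigma=2.0):
    """Replace the masked region of a tile with a Gaussian-blurred copy of itself."""
    blurred = gaussian_filter(tile.astype(float), sigma)
    return np.where(mask, blurred, tile)

# e.g. blur only an outer ring of width w within the context of an (f + 2c)-sized tile.
f, c, w = 64, 32, 8
tile = np.random.rand(f + 2 * c, f + 2 * c)
mask = np.ones_like(tile, dtype=bool)
mask[w:-w, w:-w] = False                     # True only on the outermost ring
probe = blur_region(tile, mask)
# Feeding `probe` through the trained tile network and comparing its output with the
# unblurred prediction measures how strongly the network relies on that region.
```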
We ran molecular dynamics at a temperature of 1000 K using forces obtained through the density functional theory framework (using VASP58–61 and the PBE51 functional), collecting configurations from the molecular dynamics trajectories as training data. In all, we collected 501473 training configurations and 60744 testing configurations.
We use the same Gaussian-based input representation and the same deep convolutional neural network architecture as in the "hexagonal sheets" investigation. The larger supercell size of the porous graphene sheets requires discretization on a larger grid; we chose a 384 × 384 grid. This requires more tiles, but the training procedure is otherwise identical. The results are shown in Fig. 7. The EDNN performs well, with a median absolute error of 1.685 meV Å−2.
To demonstrate this, we used the neural network trained on 8 × 8 Ising configurations to make energy predictions for 128 × 128 Ising configurations sampled near the critical temperature. Without any further training, the median absolute error on these much larger configurations was 2.055 J/L². This is substantially larger than the 8 × 8 error, but this is to be expected: since there is some error associated with the prediction for a given tile, this error scales with the extensivity of the system. In other words, the error relative to the absolute energy is still small. The predicted values and the prediction error are plotted against the true values in Fig. 8.
Fig. 8 An EDNN trained only on 8 × 8 Ising training examples is capable of making accurate predictions for the 128 × 128 Ising model near criticality. While the absolute error is higher, at 2.055 J/L², the relative error is very small. Although the EDNN appears to consistently overpredict the energy, this is not an effect of large-scale inference, but rather reflects the fact that the input configurations are drawn from an energy window where the original EDNN also slightly overpredicted the energy. This is evident when compared to the appropriate region of Fig. 3b.
Additionally, we test the DFT model trained on a 12.5 Å × 13 Å, 256 × 256 grid (N = 60 atoms) on several larger domains, up to 62.6 Å × 65.1 Å (1024 × 1024 grid, N = 1500 atoms). During inference, the EDNN performs similarly well when predicting energies of configurations larger than those in the training set, as seen in Fig. 9, and does so many orders of magnitude faster than conventional numerical methods. This is a powerful feature of EDNNs: one can generate a training set of many small examples at significantly lower computational expense, and then apply the model to larger systems without the O(N³) (or worse) scaling inherent in Kohn–Sham density functional theory. The evaluation of the EDNN scales as O(N). The fact that an EDNN can take an O(N³) problem and map it onto an O(N) one might seem suspicious at first. Recall, though, that Hohenberg–Kohn (HK) DFT does scale linearly, and that the polynomial scaling of Kohn–Sham DFT is due to the diagonalization of the Kohn–Sham Hamiltonian. We avoid such an evaluation during inference, and therefore achieve scaling consistent with orbital-free DFT.
As a proof of concept, and to further demonstrate the exceptional scaling of the EDNN approach, we generated a porous graphene sheet comprised of 35.2 million atoms, with a supercell size of 1.0 μm². EDNN inference is trivially parallel, so using a custom distributed TensorFlow implementation we were able to compute the total energy of the sheet using 448 cores across 16 nodes in 24.7 minutes. A "ground truth" DFT calculation at this scale is intractable, but based on the results of the smaller-scale tests (Fig. 8 and 9) we can confidently conclude that the relative error is comparable to that of a DFT method. These results are shown in Fig. 10 and 11.
Fig. 10 Using the model trained on many small porous graphene sheets, we used a multi-node, distributed TensorFlow implementation to make predictions on large sheets. The model evaluation time scales linearly with the cell area (and thus, under the assumption of homogeneous density, with the number of atoms). The annotations refer to the number of atoms in the configuration. The EDNN allows for total energy calculations, at DFT accuracy, of more than 1.0 μm² of material in 24.7 minutes (Fig. 12). Importantly, the model was trained on a dataset of configurations consisting of only around 500 atoms, and therefore the collection of training data does not require accurate simulation of large configurations. All EDNN evaluations were carried out on a 20-node cluster with 28 cores per node.
Since the neural network of the EDNN operates on only a single tile at a time, EDNNs permit the use of more computationally intensive network architectures (e.g. fully-connected networks) that would normally be infeasible. Additionally, EDNNs are well suited to Monte Carlo sampling, as local updates require the re-evaluation of only the nearby tiles, not the entire configuration.
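The following sketch (our own illustration, with a hypothetical tile-indexing convention) makes the Monte Carlo point concrete: after a local update at a single lattice site, only the tiles whose focus-plus-context window contains that site need to have their contributions re-evaluated.

```python
# Identify which tiles must be re-evaluated after a local update at site (r, s).
def affected_tiles(r, s, L, f, c):
    """Return the (i, j) indices of tiles whose periodic (f + 2c) window covers (r, s)."""
    tiles = []
    for i in range(L // f):
        for j in range(L // f):
            in_rows = (r - (i * f - c)) % L < f + 2 * c   # periodic window test
            in_cols = (s - (j * f - c)) % L < f + 2 * c
            if in_rows and in_cols:
                tiles.append((i, j))
    return tiles

# e.g. on a 128 x 128 lattice tiled with f = 8, c = 4, a single spin flip touches
# only 4 of the (128/8)^2 = 256 tiles:
print(len(affected_tiles(40, 71, L=128, f=8, c=4)))       # 4
```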
Under the EDNN framework, there is no requirement to assign energy to a particular region of space or atomic species, unlike atom-centred methods. Rather, the network is simply told that the extensive property applies to the entire system. This makes the approach extremely flexible; it allows for a seamless method that can learn quantum molecular mechanics, implicit solvation energies, entropy, etc. Furthermore, EDNNs can operate on any spatial field, such as the electron density.
An EDNN can be thought of as a generalization of a complex, many-body force-field: features are learned automatically by the EDNN, and domain decomposition is handled intrinsically. Traditional force-fields assume that extensive properties such as the total energy can be expressed as a many-body sum over interacting particles and, in some cases, an implicit solvation environment. While many different models exist, almost all share the common feature of expanding the total energy in terms of neighbour interactions, typically within a fixed cutoff radius. Bond lengths, angles, and partial charges are used as features, and the coefficients of the respective terms are trained against a fixed set of examples. Even methods designed for metallic environments (e.g. the embedded atom method) make use of structural "features" (e.g. the density of neighbours), which are then fed into a feature-based decision algorithm. When used for atomistic modelling, EDNNs accomplish the same task as a force-field without the requirement of hand-selecting features. They are also extremely straightforward to implement in parallel, and do not require complex calculations of angles, dihedrals, feature vectors, or neighbour lists for efficient parallelization.
EDNNs are built on the idea that interactions are screened at some length scale, but that a priori, the user does not know what it is. The training data itself encodes this length scale, and the network takes advantage of it. The generality of the concept and the implementation are why EDNNs are so useful; the physical property that permits the decomposition of the problem is actually revealed by the data itself.
A similar point (i.e. that the data should reveal the physics, rather than the physics being incorporated a priori) relates to the question of invariance; there are currently two schools of thought about how symmetries should be built into models. Within the chemical literature, there is a strong bias toward enforcing symmetries within models first, and then developing methods that are constrained by those symmetries. This "symmetry first" view is generally not the approach taken in the computer vision and artificial intelligence communities. We are of the belief that symmetries (e.g. rotations) can be learned and should not necessarily be enforced. This is both from a pragmatic point of view (too many constraints on a network during the learning process can restrict it to local minima) and from a somewhat philosophical point of view: in deep learning, features can and should be learned from the dataset itself. The way to teach a neural network a physical law is to provide it with data from which it can learn that law.
The (two-dimensional, single-channel) EDNN takes as input a batch of images x of shape N_B × L × L × d (with d = 1 for single-channel input), where N_B is the batch size. EDNNs are not limited to two-dimensional, single-channel data and can be used for three (or more) dimensions and multiple channels.
We construct the input tensor X by taking a periodically padded copy of x and performing a strided slice, starting from the origin, with stride equal to the focus. At each position we extract a patch of size (f + 2c) × (f + 2c):
X_{b,i,j,d} = x_{b, if−c:if+f+c, jf−c:jf+f+c, d},  i, j ∈ [1…L/f]
In words, X is comprised of L²/f² tiles of (f + 2c) × (f + 2c) d-channel pixels for each of the N_B images in a batch.
Each tile is passed individually through the approximation function (e.g. a neural network) 𝒩, which is parametrized by θ:

ε_{b,i,j} = 𝒩_θ(X_{b,i,j})
We call the tensor ε the "tile contributions", that is, how much of the final answer is contained in each tile. The tile contributions are reduced through a summation, preserving only the batch dimension:

ε̂_b = Σ_{i,j} ε_{b,i,j}
This vector, ε̂, is the predicted extensive quantity from the EDNN, with one entry for each example in the input batch.
The batch loss is computed as the mean-squared error between the predicted value ε̂_b and the true value ε_b (i.e. the "label"):

L(θ) = (1/N_B) Σ_b (ε̂_b − ε_b)²
This loss function is then minimized as in a normal neural network, using some form of gradient descent to tweak the parameters θ so that the prediction matches the true answer as closely as possible.
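Putting the pieces of this section together, a compact tf.keras sketch of the forward pass, the extensivity-imposing summation, and the mean-squared-error training step might look as follows; ednn_forward, train_step, and the padding details are illustrative assumptions rather than the authors' implementation (tile_network is the shared per-tile model, e.g. the CNN sketched earlier).

```python
# EDNN forward pass and training step: periodic padding, strided slicing into tiles,
# a shared per-tile network, summation over tiles, and an MSE loss on the sum.
import tensorflow as tf

def ednn_forward(x, f, c, tile_network):
    """x: (N_B, L, L, d) batch; returns the predicted extensive quantity, shape (N_B,)."""
    L = x.shape[1]
    assert c > 0 and L % f == 0
    # Periodic padding of width c along both spatial axes.
    xp = tf.concat([x[:, -c:], x, x[:, :c]], axis=1)
    xp = tf.concat([xp[:, :, -c:], xp, xp[:, :, :c]], axis=2)
    contributions = []
    for i in range(L // f):
        for j in range(L // f):
            tile = xp[:, i*f:i*f + f + 2*c, j*f:j*f + f + 2*c, :]
            contributions.append(tile_network(tile))       # shared weights for every tile
    return tf.squeeze(tf.add_n(contributions), axis=-1)    # summation imposes extensivity

def train_step(x, y, f, c, tile_network, optimizer):
    """One gradient-descent step on the batch loss (MSE of the summed prediction)."""
    with tf.GradientTape() as tape:
        y_hat = ednn_forward(x, f, c, tile_network)
        loss = tf.reduce_mean(tf.square(y_hat - y))        # gradients flow through the sum
    grads = tape.gradient(loss, tile_network.trainable_variables)
    optimizer.apply_gradients(zip(grads, tile_network.trainable_variables))
    return loss
```

Because the gradient is taken after the summation over tiles, the shared network is never told how much each individual tile contributes; it must learn a consistent partitioning of the extensive quantity on its own, as described above.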