Global geometry of chemical graph neural network representations in terms of chemical moieties
Abstract
Graph neural nets such as SchNet [Schütt et al., J. Chem. Phys., 2018, 148, 241722] and AIMNet [Zubatyuk et al., Sci. Adv., 2019, 5, 8] provide accurate predictions of chemical quantities without invoking any direct physical or chemical principles. These methods learn a hidden statistical representation of molecular systems in an end-to-end fashion, from xyz coordinates to molecular properties, with many hidden layers in between. This naturally leads to the interpretability question: what underlying chemical model drives the algorithm's accurate decision-making? By analyzing the hidden-layer activations of QM9-trained graph neural networks, known as “embedding vectors”, with dimensionality reduction, linear discriminant analysis, and Euclidean-distance measures, we shed light on an interpretation. The result is a quantifiable geometry of these models' decision-making that identifies chemical moieties and has a low effective dimensionality, with ∼5 important components out of the fully trained 128-dimensional embedding. The geometry of the embedding space organizes these moieties with sharp linear boundaries that classify each chemical environment with an error below 5 × 10⁻⁴. Euclidean distance between embedding vectors serves as a versatile molecular similarity measure, comparable to popular hand-crafted representations such as the Smooth Overlap of Atomic Positions (SOAP). We also show that the embedding vectors can be used to extract observables tied to chemical environments, such as pKa and NMR chemical shifts. While not presenting a fully comprehensive theory of interpretability, this work is in line with the recent push for explainable AI (XAI) and offers insight into the depth of modern statistical representations of chemistry, such as graph neural nets, as this technology rapidly evolves.
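To make the described workflow concrete, the following is a minimal, hypothetical Python sketch (not the authors' code) of the two central analyses: projecting per-atom embedding vectors onto a small number of discriminant axes with linear discriminant analysis, and comparing environments by Euclidean distance. The arrays embeddings and labels are synthetic placeholders standing in for hidden-layer activations and chemical-environment labels extracted from a QM9-trained network such as SchNet.

    # Minimal illustrative sketch under the assumptions stated above;
    # synthetic data stand in for real GNN embedding vectors.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)

    # Placeholder data: 1000 atoms, 6 chemical-environment classes,
    # 128-dimensional embeddings (the dimension cited in the abstract).
    n_atoms, n_dim, n_classes = 1000, 128, 6
    labels = rng.integers(0, n_classes, size=n_atoms)
    class_centers = rng.normal(size=(n_classes, n_dim))
    embeddings = class_centers[labels] + 0.1 * rng.normal(size=(n_atoms, n_dim))

    # LDA projects the 128-dimensional embedding onto at most
    # n_classes - 1 = 5 discriminant axes, mirroring the ~5 important
    # components identified in the full embedding.
    lda = LinearDiscriminantAnalysis(n_components=n_classes - 1)
    projected = lda.fit_transform(embeddings, labels)
    print("classification error:", 1.0 - lda.score(embeddings, labels))

    # Euclidean distance between embedding vectors as a similarity
    # measure: smaller distance = more similar chemical environments.
    dist = np.linalg.norm(embeddings[0] - embeddings[1])
    print("embedding distance between atoms 0 and 1:", dist)

With well-separated synthetic classes the linear boundaries classify essentially perfectly, echoing the sub-5 × 10⁻⁴ error reported for the real embeddings; on real data the embeddings, labels, and class count would of course come from the trained model and the chemistry rather than a random-number generator.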