Keisuke Takahashi*a and
Maeda Satoshi
*ab
aDepartment of Chemistry, Hokkaido University, North 10, West 8, Sapporo 060-8510, Japan. E-mail: keisuke.takahashi@sci.hokudai.ac.jp
bInstitute for Chemical Reaction Design and Discovery (WPI-ICReDD), Hokkaido University, Kita 21 Nishi 10, Kita-ku, Sapporo, Hokkaido 001-0021, Japan. E-mail: smaeda@eis.hokudai.ac.jp
First published on 1st July 2021
Data science is introduced to identify the reactant, product, and reaction path in the chemical reaction network. Cobalt catalyzed hydroformylation is investigated where the reaction network is built via first principles calculations. The closeness centrality and high frequency node are found to be the reactant cobalt tetracarbonyl hydride. In addition, betweenness centrality uncovers three reaction paths which have the products of aldehyde, CH2O, and CO2, respectively. The energy profile determines that the reaction path leading to aldehyde is energetically favored; thus, the reaction path for cobalt catalyzed hydroformylation is identified without kinetics. Hence, the proposed approach can act as a first step towards understanding the complex chemical reaction network and towards further kinetic understanding of the chemical reaction.
Hydroformylation is selected as the prototype reaction where the reaction involves the production of aldehydes from alkenes.12,13 In particular, cobalt catalyzed hydroformylation is investigated for considering homogeneous catalysis as the details of the reaction process are rather complex.8,14,15 The chemical reaction network of cobalt catalyzed hydroformylation is constructed via first principles calculations where the atomic interactions of CO dissociated cobalt tetracarbonyl hydride HCo(CO)3 with ethylene (C2H4), hydrogen H2, and carbon monooxide (CO) are considered. Reaction paths in the chemical reaction network for cobalt catalyzed hydroformylation are sought for via data driven analysis based on graph theory.
Network visualization is performed in order to represent the calculated hydroformylation reaction as a network. In particular, the Force Atlas 2 algorithm is used in order to visualize the network as shown in Fig. 1.20 Network visualization is informed by the continuous algorithm and is force-directed where nodes repel each other while edges attract their respective nodes, making node placement dependent on the other nodes present within the network. Fig. 1 shows the overall reaction network of hydroformylation created by the AFIR method where the reaction network traces how the AFIR method navigates the reaction. Fig. 1 demonstrates that some nodes form clusters at the center of the network while other groups form a branch-like structure that stems out from the center of the network. Note that node numbers 26, 145, 1856, and 1812 are isolated from the network as a result of SCF convergence failure that occurred on paths to these nodes.
Here, the question arises concerning how the reactant, product, and reaction paths connecting them will be identified within the reaction network as shown in Fig. 1. In order to find the key nodes within the reaction network, network analysis is performed in terms of graph theory. In particular, harmonic closeness centrality and betweenness centrality are explored where closeness and betweenness represent how close a node is to other nodes and which nodes control the network, respectively.22 In other words, one can consider that harmonic closeness centrality can help indicate energetically stable structures while betweenness centrality can help indicate key intermediate compounds in the reaction. Harmonic closeness demonstrates that node number 54 has the highest score, indicating that node 54 accesses many neighboring nodes and has good agreement with the observation that node 54 has the highest frequency within the network. Note that node 54 is colored in yellow in Fig. 1. HCo(CO)4, the compound represented by node 54, has been previously reported to be an active catalyst for hydroformylation where the dissociation of CO from HCo(CO)4 is considered as the initial step for hydroformylation.24 Hence, analyzing closeness centrality helps determine the reactant within the calculated hydroformylation reaction network.
Similarly, betweenness centrality is investigated where the top 40 betweenness centrality nodes are selected and colored in red as shown in Fig. 1. Please see the ESI† for the top 40 betweenness centrality nodes. It is surprising to find that three paths appear when connecting the top 40 betweenness centrality nodes as shown in Fig. 1. More importantly, each path contains key molecules where paths (1), (2), and (3) result in aldehydes, formaldehyde (CH2O), and carbon dioxide (CO2), respectively. Given that hydroformylation is a reaction that produces aldehydes, this result therefore shows that betweenness centrality can be used to help identify reaction paths for aldehyde formation from node 54 (which contains HCo(CO)4) without kinetics.
Further details of paths (1), (2), and (3) in Fig. 1 are investigated in terms of the energy profiles as shown in Fig. 2. The energy profile of path (1) indicates that aldehydes (node 436) are formed in the node order 54 → 6 → 523 → 436 where an activation barrier of 204.05 kJ mol−1 is required when attempting to move from nodes 523 to 436. In the same fashion, paths (2) and (3) are analyzed via an energy profile. However, these paths encounter multiple high activation energy barriers and endothermic reactions when attempting to arrive at CH2O (node 1254) and CO2 (node 1517) in path (2) and path (3), respectively. Note that node 212 is not in top 40 ranking nodes with high betweenness centrality. Therefore, paths (2) and (3) (illustrated in Fig. 1) can be considered to be unlikely to occur while path (1) is energetically favored. Although the actual reaction path could be more complex with kinetic analysis, data science has provided a near-instant method of providing potential candidates for reactants, products, and reaction paths encountered within a complex reaction network. This approach therefore accelerates the identification of reaction paths within a reaction network without additional kinetics analysis. It must be noted that the frequency of nodes as well as betweenness centrality analysis are able to detect key nodes based on the structure of the network shown in Fig. 1 created by a development version of the GRRM program (version on April 9th, 2020), therefore, different network analysis might be required depending on the structure of network. In other words, the proposed approach can act as the first step towards deeper kinetic analysis into a reaction network and provide insight into where further investigation can occur.
![]() | ||
Fig. 2 Energy profiles of paths (1), (2), and (3) found by betweenness centrality. Ea is the activation energy barrier. Structures of reactant and key products are also represented. Color code: pink: cobalt; gray: carbon; white: hydrogen; red: oxygen. Note that the energies path is searched and calculated by AFIR with Gaussian 16 and numbers represent the node number in Fig. 1. |
The chemical reaction network is then kinetically investigated in order to compare kinetics against the proposed reaction path created using graph theory. The kinetically most feasible path from node 54 to the most stable catalyst–product complex 880 is extracted from the network and depicted in Fig. 3. The path is obtained by combining the shortest path in terms of overall rate constants with local equilibration paths. The local equilibration paths can be seen in the processes from 54 to 1931 and from 527 to 544. The path consists of many steps that include bond reorganization steps and fast steps such as pseudo rotation and conformation change. As can be seen in Fig. 3, the energy profile has three regions: (a) an initial region including node 54 which represents HCo(CO)4 + C2H4 + H2, (b) the intermediate region including node 194 which represents CH3CH2Co(CO)3 + CO + H2, including node 1078 which represents CH3CH2Co(CO)4 + H2, CH3CH2C(O)Co(CO)3 + H2, and (c) the final region which includes node 446 which represents HCo(CO)3 + CH3CH2CHO. This path shows agreement with the well-known Heck–Breslow mechanism, which justifies the use of this network in this study.14 Additionally, aldehyde is found in both the graph network and the kinetic study, where aldehyde is found to be produced at node 436 within the network while aldehyde is found to be produced at node 880 during the kinetic investigation. Thus, these results show that the kinetic study and chemical reaction network are both capable of finding nodes where aldehyde is produced.
![]() | ||
Fig. 3 Energy profile of stable path by kinetic analysis. Color code: pink: cobalt; gray: carbon; white: hydrogen; red: oxygen. Note that the energies path is searched and calculated by AFIR with Gaussian 16 and numbers represent the node number in Fig. 1. |
Here, one can also understand that the closeness centrality and betweenness centrality of the network shown in Fig. 1 reflects the search procedure used in the SC-AFIR algorithm. First, when Λi is used as an index to rank local minimum structures, the preference in a group that reaches equilibrium in a shorter timescale than tMAX is dependent on the Boltzmann distribution at TR. At the start of the search, only complexes among HCo(CO)3 + CO + C2H4 + H2 are considered and paths are computed from these structures. Once the structures in the intermediate region are found, searches are done preferentially from structures in the intermediate region since the structures in the initial region can transition to the intermediate region within tMAX. Finally, searches are done preferentially from structures in the final region since the structures in the initial and intermediate regions can transition to the final region within tMAX. It should be noted that many other possibilities that originate from the initial region are searched as the other regions are unknown at the start. This can account for why structures having the highest closeness centrality are found within the initial region. Node 54 is quasi-symmetric, having CC and H–H almost on the plane of H, Co, and C in the axial CO, making it a possible transit point among other HCo(CO)4 + C2H4 + H2 complexes in the initial region. Therefore, the closeness centrality is useful for identifying the most important structure within the initial region. As noted, the SC-AFIR searched various possibilities originating from the initial region and created many local areas within the network. By definition, a node that has high betweenness centrality is a node that can be viewed as having more control in the network as many paths lead through it. Given this, structures that have high betweenness centrality should correspond to key intermediates within paths that connect to different areas of the network. Various different chemical transformations are able to be identified by tracing nodes having high betweenness centrality. Identifying possible intermediates, regardless of their kinetic importance, is very important for actual mechanism studies on chemical reactions. Hence, betweenness centrality can be powerful for identifying key intermediates. Chemical reaction network has become possible to create by combining the first principles calculation with data science. However, it has been a challenge to extract knowledge from the network due to the high complexity of the chemical reaction. Here, data science, graph theory in particular, is found to be a powerful approach for quickly extracting knowledge from the reaction network without any kinetics analysis. Thus, combining graph theory and the first principles calculations helps accelerate the determination of the reaction path in a chemical reaction network.
Footnote |
† Electronic supplementary information (ESI) available: All structures files and data used in this work. See DOI: 10.1039/d1ra03395f |
This journal is © The Royal Society of Chemistry 2021 |