Chris Beeler abc, Sriram Ganapathi Subramanian bd, Kyle Sprague b, Mark Baula b, Nouha Chatti b, Amanuel Dawit b, Xinkai Li b, Nicholas Paquin b, Mitchell Shahen b, Zihan Yang b, Colin Bellinger c, Mark Crowley *b and Isaac Tamblyn de
aDepartment of Mathematics and Statistics, University of Ottawa, Ottawa, ON, Canada. E-mail: christopher.beeler@uottawa.ca
bDepartment of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada. E-mail: mark.crowley@uwaterloo
cDigital Technologies, National Research Council of Canada, Ottawa, ON, Canada. E-mail: colin.bellinger@nrc-cnrc.gc.ca
dVector Institute for Artificial Intelligence, Toronto, ON, Canada
eDepartment of Physics, University of Ottawa, Ottawa, ON, Canada. E-mail: isaac.tamblyn@uottawa.ca
First published on 20th February 2024
This paper provides a simulated laboratory for making use of reinforcement learning (RL) for material design, synthesis, and discovery. Since RL is fairly data intensive, training agents ‘on-the-fly’ by taking actions in the real world is infeasible and possibly dangerous. Moreover, chemical processing and discovery involve challenges which are not commonly found in RL benchmarks and therefore offer a rich space to work in. We introduce a set of highly customizable and open-source RL environments, ChemGymRL, implementing the standard Gymnasium API. ChemGymRL supports a series of interconnected virtual chemical benches where RL agents can operate and train. The paper introduces and details each of these benches using well-known chemical reactions as illustrative examples, and trains a set of standard RL algorithms in each of them. Finally, we discuss and compare the performance of several standard RL methods and provide a list of directions for future work as a vision for the further development and usage of ChemGymRL.
ChemGymRL is a collection of interconnected environments (or chemistry benches) that enable the training of RL agents for the discovery and optimization of chemical synthesis. These environments are each a virtual variant of a chemistry “bench”, an experiment or process that would otherwise be performed in a real-world chemistry lab. As shown in Fig. 1, the ChemGymRL environment includes reaction, distillation, and extraction benches on which RL agents learn to perform actions and satisfy objectives. Although each bench could be a standalone environment (as the majority of RL environments are), this would defeat the purpose of their inter-connectivity. While inter-connectivity is not important for training a specific agent on a single bench, sharing an overarching framework ensures compatibility between the benches, allowing the results of one experiment to be easily used as input to another bench's task. This makes inter-connectivity important when performing a multi-task experiment that requires several benches to be used in differing orders.
The need for a simulated chemistry environment for designing, developing, and evaluating artificial intelligence algorithms is motivated by the recent growth in research on topics such as automated chemistry and self-driving laboratories,1–5 laboratory robots6–13 and digital chemistry for materials and drug discovery.14–22 Given RL's suitability for sequential decision making, and its ability to learn via online interactions with a physical or simulated environment without a supervised training signal, we see it as having great potential within digital chemistry and self-driving laboratories. Within this context, recent work has demonstrated some successful applications of RL23,24 or methods inspired by parts of RL25 to automated chemistry. Nonetheless, it remains an understudied area of research. Our work aims to partially address this problem by sharing an easy-to-use, extensible, open-source, simulated chemical laboratory. This serves to simplify the design and development of application-specific RL agents.
Although RL agents could be trained online in physical laboratories, this approach has many limitations, particularly in early stages of research before mature policies exist. Training agents in a robotic laboratory in real time would be costly (in both time and supplies) and restrictive (due to potential safety hazards). Our simulated ChemGymRL environment remedies this by allowing the early exploration phase to occur digitally, speeding up the process and reducing the waste of chemical materials. It provides a mechanism to design, develop, evaluate, and refine RL for chemistry applications and research, which cannot safely be achieved in a physical setting. We would also like to specifically highlight ChemGymRL as a unique testbed for RL research. Since ChemGymRL is open source and highly customizable, it can accelerate both chemical and RL research in several directions, in addition to providing a useful training environment with real-world applications (see Section A.1 for further discussion).
The software is developed according to the Gymnasium† standard, which facilitates easy experimentation and exploration with novel and off-the-shelf RL algorithms. Users who download it gain access to a standard Gymnasium-compatible environment that simulates chemical reactions using rate-law differential equations, the mixing/settling of soluble and non-soluble solutions for solute extractions, the distillation of solutions, and a digital format for storing the state of the vessels used. In addition to this article, further detailed information about this software package, documentation and tutorials, including code and videos, can be found at https://www.chemgymrl.com/.
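For readers unfamiliar with the Gymnasium API, a minimal sketch of loading a bench is shown below. The environment id and the assumption that importing the chemgymrl package registers the benches are illustrative only; the ChemGymRL documentation should be consulted for the exact ids registered by the installed version.

```python
import gymnasium as gym
import chemgymrl  # importing the package registers the bench environments (assumed)

env = gym.make("WurtzReact-v2")        # illustrative id; see the ChemGymRL docs for registered ids
observation, info = env.reset(seed=0)
print(env.observation_space, env.action_space)
```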
In our experimental results, we illustrate how to set up and use each bench with two distinct classes of reactions, along with how they can be extended to new reaction types. We evaluate the capabilities of a wide cross-section of off-the-shelf RL algorithms for goal-based policy learning in ChemGymRL, and compare these against hand-designed heuristic baseline agents. In our analysis, we find that only one off-the-shelf RL algorithm, proximal policy optimization (PPO), is able to consistently outperform these heuristics on each bench. This suggests that the heuristics are a challenging baseline to compare against, but also that they are far from optimal. Thus, there is space for an optimization approach such as RL to achieve optimal behavior on each bench. Near the end of the paper, we discuss some of the challenges, limitations, and potential improvements in RL algorithms required to learn better, more sample-efficient policies for the discovery and optimization of material design pathways.
The remainder of the paper is organized as follows. The next section describes the ChemGymRL environment, including the three primary benches: reaction, extraction and distillation. Section 3 provides an overview of reinforcement learning and Section 4 contains a case study of the Wurtz reaction and its use in each bench. Our experimental setup involves training off-the-shelf RL algorithms on each of the benches. The RL algorithms and hyper-parameters are discussed in Section 5 and the specific laboratory settings and objectives used in our experiments are described in Section 6. The results of the RL experiments are presented in Section 7 and the limitations of the simulations are discussed in Section 8 followed by our general conclusions and some ideas for future directions.
Benches recreate a simplified version of a task in a material design pipeline and vessels contain materials and track the hidden internal state of their contents, as shown in Fig. 1(b). A bench must be able to receive a set of initial experimental supplies, possibly including vessels, and return the results of the intended experiment, also including modified vessels. A vessel is used to transfer the chemical state contained in one bench to another, allowing for continuation of experiments.
We provide some core elements of a basic chemistry lab which enable the simulation of essential materials experiments. Critically, each bench, and the gym as a whole, is extensible. Therefore, there is no limit on the complexity and precision with which simulations can be implemented. In the following sections we describe each of these benches in turn and demonstrate an example workflow.
These benches each have three crucial components required for operating them. The observation space is the set of possible observations which the agent (or human user) can use to learn the status of the bench and take appropriate actions. These observations are a (usually only partial) representation of the internal state of the system. The action space for a bench is the set of actions the user can take on that bench. These actions are methods of modifying the state of the system or the observation. Lastly, the reward function is a measure of success based on the states the system has been in and the actions that have been taken. Generally in RL, the goal is to maximize the expected sum of these rewards over time, with rewards usually discounted over time. An episode refers to a single play of a bench from the starting materials until the agent stops. The total, cumulative expected reward over an entire episode is called the return.
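As a minimal illustration of these components, the sketch below runs one episode of a generic Gymnasium environment with an arbitrary policy and accumulates the discounted return; it is not specific to any bench.

```python
# Sketch of one episode on a bench, accumulating the discounted return.
# `policy` stands in for any agent (or human) mapping observations to actions.
import gymnasium as gym

def run_episode(env: gym.Env, policy, gamma: float = 0.99) -> float:
    observation, info = env.reset()
    discounted_return, discount, done = 0.0, 1.0, False
    while not done:
        action = policy(observation)              # choose an action from the action space
        observation, reward, terminated, truncated, info = env.step(action)
        discounted_return += discount * reward    # rewards are discounted over time
        discount *= gamma
        done = terminated or truncated
    return discounted_return
```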
The goal of the agent operating on this bench is to modify the reaction parameters, in order to increase, or decrease, the yield of certain desired, or undesired, materials. The key to the agent's success in this bench is learning how best to allow certain reactions to occur such that the yield of the desired material is maximized and the yield of the undesired material is minimized. Therefore the reward in this bench is zero at all steps except the final step, at which point it is the difference in the number of mols of the desired material and undesired material(s) produced.
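A minimal sketch of this terminal reward is given below, with a vessel represented simply as a dictionary of material names to molar amounts (an illustrative simplification, not the ChemGymRL vessel class).

```python
# Sketch of the RxN reward described above: zero at every step except the last,
# where it is the moles of the desired material minus the moles of undesired ones.
def reaction_reward(vessel: dict, desired: str, undesired: list, is_final_step: bool) -> float:
    if not is_final_step:
        return 0.0
    return vessel.get(desired, 0.0) - sum(vessel.get(m, 0.0) for m in undesired)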
A main feature of ChemGymRL is its modularity. If one wanted to make the results of RxN more accurate and generalizable, they could replace the current system of differential equations with a molecular dynamics simulation without needing to change how the agent interacts with the bench or how the bench interacts with the rest of ChemGymRL. In its current state, this bench takes approximately 0.73 ms to initialize and 0.87 ms to perform an action.
A simple extraction experiment example is extracting salt from an oil solvent using water. Suppose we have a vessel containing sodium chloride dissolved in hexane. Water is added to a vessel and the vessel is shaken to mix the two solvents. When the contents of the vessel settle, the water and hexane will have separated into two different layers. Sodium chloride is an ionic compound, therefore there is a distinct separation of charges when dissolved. Due to hexane being a non-polar solvent and water being a polar solvent, a large portion of the dissolved sodium chloride is pulled from the hexane into the water. Since water has a higher density than hexane, it is found at the bottom of the vessel and can be easily drained away, bringing the dissolved sodium chloride with it.
The mutually exclusive actions are: (1) mix the vessel or let it settle (i.e. wait), (2) add an amount of solvent to the vessel, (3) drain contents of the vessel into an auxiliary vessel bottom first, (4) pour contents of the vessel into a second auxiliary vessel, (5) pour contents of the first auxiliary vessel into the second, (6) pour contents of the second auxiliary vessel into the first, (7) pour contents back into the original vessel, (8) end the experiment.
The multiplier for each action corresponds to either the duration (mix, wait), the amount to pour, or the amount to drain, with 5 discrete non-zero values each. These actions are depicted in Fig. 3(d).
Note that, for practical implementation purposes, the two-part action described above is flattened into a single discrete value to reduce redundancy in the action space.
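The sketch below illustrates one possible flattening of such a two-part action into a single discrete index; the counts of action types and multipliers follow the description above, but the exact packing used by ChemGymRL may differ.

```python
# Illustrative flattening of a two-part (action type, multiplier) choice into a
# single discrete index, and the inverse mapping.  Shown with 8 action types and
# 5 multiplier values per type; the exact packing used by ChemGymRL may differ.
N_ACTIONS, N_MULTIPLIERS = 8, 5

def flatten(action_type: int, multiplier: int) -> int:
    return action_type * N_MULTIPLIERS + multiplier

def unflatten(index: int) -> tuple[int, int]:
    return divmod(index, N_MULTIPLIERS)

assert unflatten(flatten(3, 2)) == (3, 2)
```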
The goal of the agent in this bench is to use these processes in order to maximize the purity of a desired solute relative to other solutes in the vessel. This means the agent must isolate the desired solute in one vessel, while separating any other solutes into the other vessels. Note that the solute's relative purity is not affected by the presence of solvents, only the presence of other solutes. Therefore the reward in this bench is zero at all steps except the final step, at which point it is the difference in the relative purity of the desired solute at the first and final steps.
As with RxN, the realism of ExT could be improved by replacing the separation equations with a physics-based simulation without needing to change how the agent interacts with the bench or how the bench interacts with the rest of ChemGymRL. In its current state, this bench takes approximately 0.87 ms to initialize and 0.47 ms to perform an action.
A simple distillation example is extracting a solute dissolved in a single solvent. Suppose we have a vessel containing sodium chloride dissolved in water. If we heat the vessel to 100 °C, the water will begin to boil. With any added heat, more water will evaporate and be collected in an auxiliary vessel, leaving the dissolved sodium chloride behind to precipitate out as solid sodium chloride in the original vessel.
Similar to the ExT bench, these processes are mutually exclusive and each have a magnitude (temperature change, amount to pour). Thus, the same kind of (action, multiplier) definition is used for the DiT bench. The action can be one of the following four choices: (1) heat/cool by some amount, (2) pour from the distillation vessel into an auxiliary vessel, (3) pour from one auxiliary vessel into another, or (4) end the experiment. Actions (1–3) can each have one of 10 multiplier values specifying the magnitude. These actions are depicted in Fig. 4. Just as in ExT, the actions returned by the agent are flattened into a single discrete value to reduce redundancy in the action space.
Fig. 4 The icons representing each action and their multiplier values available in DiT. The distillation vessel (DV) is the primary vessel and B1/B2 are the auxiliary vessels in the experiment.
The goal of the agent in this bench is to use these processes to maximize the absolute purity of a desired material in the vessel. This means the agent must isolate the desired material in one vessel, while separating any other materials into other vessels. Note that unlike ExT, the material's absolute purity is affected by the presence of all materials. Therefore the reward in this bench is zero at all steps except the final step, at which point it is the difference in the absolute purity of the desired material at the first and final steps. In its current state, this bench takes approximately 0.87 ms to initialize and 0.86 ms to perform an action.
The characterization bench is the primary method to obtain insight as to what the vessel contains. The purpose of the characterization bench is not to manipulate the input vessel, but to subject it to analysis techniques that observe the state of the vessel, possibly including the materials inside it and their relative quantities. This does not mean that the contents of the input vessel cannot be modified by the characterization bench. This allows an agent or user to observe vessels, determine their contents, and allocate the vessel to the necessary bench for further experimentation.
The characterization bench is the only bench that is not “operated”. A vessel is provided to the bench along with a characterization method and the results of said method on that vessel are returned. Currently, the characterization bench consists of a UV-vis spectrometer that returns the UV-vis absorption spectrum of the provided vessel. Each material in ChemGymRL has a set of UV-vis absorption peaks defined and the UV-vis spectrum for a vessel is the combination of the peaks for all materials present, weighted proportionally by their concentrations. In future versions of ChemGymRL we will expand the characterization bench to include other forms of partial observation.
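The sketch below illustrates the idea of composing a spectrum from concentration-weighted peaks; the Gaussian peak shape and the example peak parameters are assumptions made purely for illustration.

```python
# Sketch of the idea behind the characterization bench's UV-vis output: each
# material contributes its absorption peaks, weighted by its concentration.
# Gaussian peak shapes and the peak parameters below are illustrative assumptions.
import numpy as np

def uv_vis_spectrum(concentrations: dict, peaks: dict, wavelengths: np.ndarray) -> np.ndarray:
    spectrum = np.zeros_like(wavelengths, dtype=float)
    for material, conc in concentrations.items():
        for center, height, width in peaks.get(material, []):
            spectrum += conc * height * np.exp(-0.5 * ((wavelengths - center) / width) ** 2)
    return spectrum

wavelengths = np.linspace(200.0, 800.0, 600)          # nm
peaks = {"dodecane": [(220.0, 1.0, 10.0)]}            # (center, height, width), illustrative
spectrum = uv_vis_spectrum({"dodecane": 0.5}, peaks, wavelengths)
```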
In the field of model-free RL, on which we focus here, a major distinction between solution algorithms is between value optimization approaches and direct policy optimization approaches. A popular example of value optimization is Q-learning,27,28 where a state–action value function Q(s,a): S × A → ℜ is learned iteratively using the Bellman optimality operator

Q(s,a) ← 𝔼[r(s,a) + γ max_{a′} Q(s′,a′)],

where s and s′ denote the current and next state, respectively, a and a′ denote the current and next action, respectively, r(s,a) denotes the reward and γ the discount factor. After this process converges, an exact or approximate maximization scheme is used to extract the greedy policy from the Q-function. These methods are often restricted to environments with discrete actions, although many generalizations exist to other formulations.29–31
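For concreteness, a minimal tabular Q-learning update implementing this Bellman optimality target is sketched below; it is generic and not tied to any particular bench.

```python
# Minimal tabular Q-learning update corresponding to the Bellman optimality
# target described above (generic sketch, not tied to any particular bench).
from collections import defaultdict

Q = defaultdict(float)   # maps (state, action) -> value estimate

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```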
Direct policy optimization32 approaches iteratively improve the target policy directly. They may do this as the only optimization approach, or in combination with value-function optimization. Actor-critic33–35 methods are a currently popular approach for doing just that. In actor-critic methods, the algorithm alternates between estimating a value function Qπ (the “critic”) of the current policy via a partial policy evaluation routine, using the Bellman operator on an initially random, stochastic policy π, and improving the current policy π (the “actor”) by biasing it towards selecting actions that maximize the current Q-value estimates; the value function for this improved policy is then re-estimated. This family of methods applies to both discrete and continuous action space environments, and thus may be used on any bench in the ChemGymRL environment.
The Wurtz reaction used as our case study couples two alkyl halides with sodium metal in a diethyl ether solvent:

R1–Cl + R2–Cl + 2Na → R1–R2 + 2NaCl,

where the halides R1–Cl and R2–Cl are each chosen from 1-, 2-, and 3-chlorohexane, giving the six possible products R1–R2 listed in the table below.
Reaction | R1 | R2 | R1–R2 |
---|---|---|---|
1 | 1-Chlorohexane | 1-Chlorohexane | Dodecane |
2 | 1-Chlorohexane | 2-Chlorohexane | 5-Methylundecane |
3 | 1-Chlorohexane | 3-Chlorohexane | 4-Ethyldecane |
4 | 2-Chlorohexane | 2-Chlorohexane | 5,6-Dimethyldecane |
5 | 2-Chlorohexane | 3-Chlorohexane | 4-Ethyl-5-methylnonane |
6 | 3-Chlorohexane | 3-Chlorohexane | 4,5-Diethyloctane |
Wurtz can be an interesting and challenging reaction because the yield varies greatly between each product, making it difficult to train an agent which can optimally make each of them.
The vessel can then be brought to the ExT to separate dodecane from sodium chloride. Dodecane is non-polar, so if we add water to the vessel and mix, most of the sodium chloride will be extracted into the water while most of the dodecane will be left in the diethyl ether. We can then drain the water out of the vessel while keeping the diethyl ether. While it is impossible to remove all of the sodium chloride with this method, we can repeat the process to increase the purity of dodecane.
The vessel can then be brought to the DiT to separate the dodecane from the diethyl ether. Diethyl ether has a much lower boiling point than dodecane so it will boil first. Heating the vessel enough will cause all of the diethyl ether to vaporize, leaving the dodecane in the vessel with trace amounts of sodium chloride.
Alternatively, because dodecane has a much lower boiling point than sodium chloride, we can skip the ExT and bring the vessel to the DiT right after the RxN. As before, heating the vessel enough will cause all of the diethyl ether to vaporize and condense into an auxiliary vessel. We can then pour the collected diethyl ether elsewhere so that the auxiliary vessel that collected the vaporized materials is empty again. If the vessel is heated even further, the dodecane will vaporize and be collected in the empty auxiliary vessel, leaving the sodium chloride behind. The auxiliary vessel now contains pure dodecane, concluding the experiment.
While this example workflow uses the benches in a specific order, more complicated experiments may use them in a completely different order or even use each bench multiple times. Given specific goals, below we will outline how RL can be used to emulate this behavior for various cases.
Unless otherwise specified, all RL agents were trained for 100 K time steps across 10 environments in parallel (for a total of 1 M time steps). Training was done by repeatedly gathering 256 time steps of experience (in each environment), then updating the policy and/or Q-function with this new experience. For the PPO and A2C algorithms, the updates were performed directly with the 2560 steps of new experience. In contrast, for the DQN, SAC, and TD3 algorithms, the standard training approach uses an experience replay buffer, in our case containing 1 M steps, which was sampled from for training. For the first 30 K steps of DQN training, a linear exploration schedule beginning at 1.0 and ending at 0.01 was used; exploration remained at 0.01 afterwards. All of these RL algorithms were run using the Stable-Baselines3 implementations.39
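A minimal sketch of this training setup using Stable-Baselines3 is shown below; the environment id is illustrative and should be replaced with the id registered by the installed ChemGymRL version.

```python
# Sketch of the training setup described above using Stable-Baselines3:
# 10 parallel copies of a bench, updates every 256 steps per environment,
# and 1 M total steps.  The environment id is illustrative.
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

vec_env = make_vec_env("WurtzReact-v2", n_envs=10)      # hypothetical registered id
model = PPO("MlpPolicy", vec_env, n_steps=256, verbose=1)
model.learn(total_timesteps=1_000_000)
model.save("ppo_wurtz_react")
```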
In this paper, we choose DQN as a representative of the Q-learning family of algorithms since it is a very standard benchmark in RL, and the central themes initially introduced in DQN by Mnih et al.28 led to an explosion of RL techniques that emerged subsequently. DQN is known to overestimate action values for many states, and double DQN (DDQN)40 was introduced to decouple action selection from action evaluation. However, DDQN is not necessarily better than DQN in all settings: it has been shown to suffer from an underestimation bias that is equally bad for performance,41 so we did not use it in these experiments.
In the second experiment, we introduce a new set of reaction dynamics involving four reactants (A, B, C, and D) and five possible target materials (E, F, G, H, and I), where I can only be formed through the intermediate products F, G, and H.
Again, the PPO agents consistently outperform the A2C, SAC, and TD3 agents for all 5 target materials, although the gap is not as significant as in the Wurtz RxN. Target materials with high returns across all algorithms (such as F, G, and H) appear to be easier tasks to learn, whereas target materials with less consistent returns across the algorithms (such as E and I) appear to be more difficult.
As an example, consider when the target material is dodecane. In this experiment, the 1 mol of dissolved sodium chloride becomes 1 mol each of Na+ and Cl−, so the initial solute purity of dodecane is 1/3. Suppose we end the experiment with 0.7 mols of dodecane with 0.2 mols each of Na+ and Cl− in one vessel, and the remaining molar amounts in a second vessel. Dodecane has a solute purity of 7/11 and 3/19 in each vessel respectively. The final solute purity of dodecane would be 0.7 × 7/11 + 0.3 × 3/19 ≈ 0.493. Thus the agent would receive a reward of 0.159.
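The short sketch below reproduces this arithmetic, assuming the final purity is the purity in each vessel weighted by the fraction of the target material that vessel contains, as in the example above.

```python
# Reproduces the worked example above: solute purity of dodecane in each vessel,
# the molar-weighted final purity, and the resulting reward.
def solute_purity(vessel: dict, target: str) -> float:
    return vessel.get(target, 0.0) / sum(vessel.values())

vessel_1 = {"dodecane": 0.7, "Na+": 0.2, "Cl-": 0.2}   # purity 7/11
vessel_2 = {"dodecane": 0.3, "Na+": 0.8, "Cl-": 0.8}   # purity 3/19
total_dodecane = vessel_1["dodecane"] + vessel_2["dodecane"]

final_purity = sum(
    v["dodecane"] / total_dodecane * solute_purity(v, "dodecane")
    for v in (vessel_1, vessel_2)
)
reward = final_purity - 1.0 / 3.0                       # initial purity was 1/3
print(round(final_purity, 3), round(reward, 3))         # ~0.493, ~0.159
```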
The average return as a function of training steps for each algorithm is shown in Fig. 6. On average, the agents trained with each algorithm are able to achieve a return of at least 0.4. This is expected as even an agent taking random actions can achieve an average return of approximately 0.44. The agents trained with A2C, SAC, and TD3 do not perform much better than a random agent in most cases, however the ones trained with PPO significantly outperform it. While on average, A2C, SAC, and TD3 have similar performance, we saw in Fig. 5 that the best performing SAC and TD3 agents outperformed the best A2C agents.
The second RxN experiment uses reaction dynamics more complicated than the Wurtz reaction. In the Wurtz reaction, the agent need only add all of the required reactants for the desired product together. In this new reaction, this is still true for some desired products, but not all of them. As in the previous experiment, we devised a heuristic agent for this experiment, which achieves an average return of approximately 0.83. For target materials E, F, G, and H, the required reaction is a single-step process as before. Therefore the optimal behavior is to exclusively add the required reactants at step 1 and increase the temperature to speed up the reaction, in order to produce as much of the target as possible before the experiment ends. For target material I, the required reaction is a two-step process, allowing for variation in how the material is produced. While all four reactants are required to produce I, adding them all at once would also produce E, wasting needed materials. Hence the optimal behavior is not necessarily to produce all intermediate products simultaneously. As any two of the three intermediates can be safely produced simultaneously, the heuristic policy is designed to add only the reactants required for two intermediate products (we arbitrarily choose F and G). Given the limited number of possibilities, it is easily determined by brute force that step 6 is the optimal step for the heuristic policy to add the reactants required to create the third intermediate product.
Using the performance of the heuristic agent as a reference again, the best and mean relative performances of the agents trained with each algorithm are shown in Fig. 7. Once again, PPO consistently produces agents that match the performance of the heuristic agent. The best-performing policies produced by A2C, SAC, and TD3 are able to nearly match the heuristic agent for all desired products excluding I. This is not unexpected, as producing I requires producing intermediate products at different times during the reaction. On average, however, the policies produced by SAC and TD3 are unable to match the heuristic agent when asked to produce E. This is also not unexpected, given that producing E is penalized for all other desired products.
Unlike PPO, the other algorithms appear to be less reliable at producing these best-performing agents. This could be due to PPO learning these policies much faster than the other algorithms, as seen in Fig. 8. Since PPO converges to optimal behavior so quickly, there is very little room for variation in the policy. The other algorithms, however, slowly converge to non-optimal behaviors, leaving much more room for variation in the policies (and returns) that they converge to.
For the best performing agents produced by each algorithm, the average action values for each target are shown in Fig. 9. Looking at the heuristic policy, a constant action can be used for each target product, excluding I. When the target is I, the desired action must change after several steps have passed, meaning the agent cannot just rely on what the specified target is. Note that if all of a material has been added by step t, then it does not matter what value is specified for adding that material at step t + 1.
The best-performing agents for each algorithm were all able to produce E when requested, and Fig. 9 shows that they each learned to add A, B, and C, and not D. It can also be seen that all four algorithms learned to add two of A, B, or C in addition to D, then add the third one several steps later when I is the target product, mimicking the behavior of the heuristic policy. Note that even though the heuristic waits to add C, waiting to add A or B instead would be equally optimal. While each algorithm does this, PPO and A2C do so better than the others. PPO is also the only one that succeeds in both of these cases, showing that an RL agent can learn the required behavior in this system.
Unlike the reaction bench, we do not have an analytic solution for this bench; we therefore devised a heuristic policy for this experiment based on what an undergraduate chemist would learn. These lessons involve adding a solvent of opposite polarity to the existing solution (i.e., adding a non-polar solvent to a vessel containing a polar solvent solution or vice versa), mixing everything, letting it settle until distinct layers are formed, and separating the two solvents into separate vessels. The vessel containing the solvent with a polarity similar to that of the target material is kept while the other vessel is discarded. Thus, our heuristic policy is designed to mimic this behavior. However, as the dynamics are more complex, we do not necessarily expect it to be optimal.
As seen in Fig. 10, the agents trained with A2C do not achieve a return above zero, while the agents trained with DQN ended up achieving a negative return. Not only do both PPO and PPO-XL produce agents that achieve significantly more reward than the other algorithms, they are able to outperform the heuristic policy as well. On average, the best performing agent trained with PPO-XL manages to achieve a return of approximately 0.1 higher than the heuristic (see Fig. 10), resulting in roughly a 10% higher solute purity. While there is a large variance in the final performance of the agents trained with PPO and PPO-XL, they consistently outperform the agents trained with the other algorithms.
As shown in Fig. 11, the action sequences of the policies learned from A2C, DQN, and PPO are quite different. The action sequences of the policies learned by PPO and PPO-XL are much more similar, as expected. The first half of these sequences are comparable to the heuristic, however the agents in both cases have learned a second component to the trajectory to achieve that extra return. Interestingly, both PPO and PPO-XL agents have learned to end the experiment when they achieve the desired results, whereas the A2C and DQN agents do not. PPO once again shows that an RL agent can learn the required behavior in this system.
Fig. 11 Wurtz ExT: the sequence of actions with the highest return when dodecane is the target material, seen during the rollout of the best-performing policy learned by each algorithm. Each picture represents an action and average value described by Fig. 3(d). The number beneath the image represents how many times that action was repeated. While these policies are more difficult to interpret than those of RxN, similarities can be seen between the PPO, PPO-XL, and heuristic policies, explaining their high performances. The A2C policy uses a similar action set but in a different order, highlighting the precision required of the agent. The DQN policy uses many actions that either undo previous actions or do nothing in that specific state.
As before, we devised a heuristic policy and, as with the RxN experiments, we expect it to be optimal. In distillation, the optimal behavior is to heat the vessel until everything with a boiling point lower than that of the target material has boiled off, discard the boiled-off contents, then continue to heat the vessel until just the target material has boiled off and condensed in a separate vessel. Our heuristic policy is designed to mimic this behavior.
In Fig. 12 we can see that, on average, the algorithms (excluding A2C) converge faster than in the other experiments; however, there is much less variation in return compared to before. For the case when no salt is present in the distillation vessel, the best-performing agents trained with each algorithm learn a policy very similar to the heuristic one, as seen in Fig. 13. They heat the vessel until the solvent has boiled away, then end the experiment. For the case when salt and additional material are present, the best-performing agents trained with PPO and PPO-XL modify their actions similarly to the heuristic policy, achieving the optimal return in both cases. The best-performing agent trained with A2C modifies its actions in a similar fashion, but does so in a way that achieves a much lower return. The best-performing agent trained with DQN makes much more significant changes to its policy, yet still achieves a return closer to optimal than A2C. This shows that the expected behavior in our final bench can also be learned by an RL agent.
Fig. 13 Wurtz DiT: the sequences of actions with the highest return produced by the best-performing policy learned with each algorithm for two cases: when salt is and is not initially present in the distillation vessel with another material. Each picture represents an action and average value described by Fig. 4. The number beneath the image represents how many times that action was repeated. The PPO, PPO-XL, and heuristic policies are nearly identical in both cases, with only minor differences. When no salt is present, the A2C and DQN policies are similar to the others; however, when salt is present they continue to behave as if it were not.
As previously mentioned, the ChemGymRL framework was designed in a modular fashion for ease of improvement. The differential equations used to model the reactions could be replaced with a molecular dynamics simulation. This would allow RxN to operate on a more generalizable rule-set. Without having to manually define the possible reactions, an RL agent could use RxN to discover new, more efficient reaction pathways. Currently, the reward metric used in RxN is the molar amount of desired material produced by the agent. If this metric were changed to reflect a desired property of the produced material, then RxN could be used for drug discovery. With similar improvements to ExT and DiT, the RL agent could then learn to purify these new discoveries.
In terms of the ChemGymRL library itself, the next step is to create a lab manager environment which will be constructed to allow an RL agent to operate the entire system. Using pre-trained agents for the individual benches, the lab manager agent would decide which vessel to give to which bench while also specifying the desired goal to each bench, in order to achieve the lab manager's own goal. The lab manager agent would make proper use of the agentless characterization bench introduced here as well, which will have characterization methods with associated costs. In addition to this, the implementation of new benches will be explored, allowing more complicated experiments to be conducted and new insights into the benefits and challenges of the integration of RL into automated chemistry and self-driving labs.
In the Wurtz RxN experiment, A2C, SAC, and TD3 were not able to perform better than an agent taking random actions, whereas PPO was able to achieve optimal returns on all targets. In the second RxN experiment, A2C, SAC, and TD3 achieved optimal returns on only one of the two more difficult tasks, whereas PPO achieved optimal returns on both.
In the Wurtz ExT experiment, A2C and DQN were not able to produce agents that perform better than doing nothing, whereas PPO was able to achieve higher returns than the devised heuristics. In the Wurtz DiT experiment each algorithm was able to produce an agent that performs better than doing nothing and much better than an agent taking random actions.
Finally, we included discussions on other RL algorithms that can be tried in ChemGymRL and how this environment will be extremely valuable to the broader RL research community. We hope to see a wide adoption of ChemGymRL within the set of test-beds commonly used by researchers in the RL community.
From the perspective of an RL researcher, ChemGymRL provides a very useful training environment for a problem with real-world impact. While the majority of prior RL environments focus on games, researchers now acknowledge that RL has become a mature technology that can be useful in a variety of real-world applications.54 RL is growing rapidly and many new sub-fields within RL have emerged over the last five years.26 In this section we highlight the specific RL areas where we think ChemGymRL will be helpful, with the objective of encouraging the RL community to adopt this library within their suite of RL training environments.
Table 2 captures a set of RL paradigms (and associated algorithms) where we believe ChemGymRL will be impactful. In the previous sections, we considered a small set of RL algorithms to benchmark performance in ChemGymRL. However, there is a much larger class of RL sub-fields for which ChemGymRL can be used as a testbed.
Paradigm | Algorithm | Other testbeds | ChemGymRL application
---|---|---|---
Model-based RL47 | I2A55 | Sokoban52 | Learn a model for chemical reactions and plan using the model
 | ME-TRPO56 | Mujoco57 |
Curriculum learning49 | OOCG58 | Half field offense59 | Generate a curriculum of tasks across benches
 | ScreenNet60 | Minecraft53 |
 | H-DRLN61 | StarCraft-II62 |
 | CM3 (ref. 63) | SUMO64 |
Reward shaping66 | RCPO65 | Mujoco57 | Provide shaping rewards for intermediate steps
 | DPBA67 | Cartpole68 |
Partial observability70 | DRQN69 | Atari28 | True state not observed, as in the extraction bench
 | DNC56 | POPGym71 |
Distributional RL48 | C51 (ref. 48) | Atari28 | Policy based on the distribution of reagents produced for every action
 | QR-DQN72 | Windy gridworld68 |
Experience replay methods74 | HER73 | Montezuma's revenge28 | Sparse rewards only obtained at the end of the episode
 | ERO75 | Mujoco57 |
Learning from demonstrations77 | DQfD76 | Atari28 | Demonstrations from heuristic/human policies
 | NAC78 | GTA-V78 |
Hierarchical RL80 | Option-critic79 | Atari28 | Hierarchical policy in terms of benches (options) and low-level actions
 | AOC81 | Four rooms domain80 |
Constrained RL83 | IPO82 | Safety gym83 | Safe handling of chemical reagents
 | CPO84 | Mujoco57 |
Traditionally most RL methods are model-free, where the transition dynamics is neither known nor learned by the algorithms,26 and most algorithms we considered in this paper fall into the model-free category. Alternatively, another class of algorithms combines planning85 and RL. These algorithms explicitly learn the transition dynamics model and then use this model to arrive at the optimal policy (by planning). Such algorithms are classified as model-based RL algorithms.47 Depending on the domain, model-based RL methods can be significantly more sample efficient than their widely used model-free counterparts.86 In ChemGymRL, model-based approaches that learn the transition dynamics can effectively plan over all the different possible ways of obtaining higher amounts of the target products from the input reactants.
Another class of methods applicable to ChemGymRL is curriculum learning in RL.49 In this paradigm, RL problems that are too hard to learn from scratch can be broken down into a curriculum of tasks which can then be tackled individually. Curriculum learning methods generally have three key elements: task generation, sequencing, and transfer learning. Task generation is the process of generating a good set of tasks that are neither trivial nor too hard to solve. Sequencing is the process of generating an ordering (in terms of difficulty, cost, etc.) of the available tasks. Transfer learning focuses on strategies to transfer knowledge from one task to another (so that the agent does not have to learn each task from scratch). Generating desired products from available reactants is a complex process that requires learning policies across multiple benches in ChemGymRL, which lends itself well to the curriculum learning framework. Closely related to curriculum learning is the hierarchical RL approach using the options framework.46 Transferring higher-level knowledge across tasks can take the form of partial policies or options. Options can be seen as temporally extended actions which allow learning/planning over a sequence of lower-level actions.
The default reward function in ChemGymRL is sparse. For example, in RxN the agent only receives a reward equal to the molar amount of the target material produced at the end of the episode. Reward shaping methods66 provide small rewards at intermediate steps to help the agent learn and converge faster. However, reward shaping can change the optimal behaviour and cause agents to learn unintended policies. Potential-based reward shaping methods are a special class of reward shaping methods that preserve the optimality order over policies and do not affect the converged optimal behaviour of the MDP.87,88 Such methods can be used in ChemGymRL.
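A minimal sketch of potential-based shaping is given below; the choice of potential function is a design decision, and using the current molar amount of the target material is only one hypothetical option for RxN.

```python
# Minimal sketch of potential-based reward shaping: the shaped reward adds
# F(s, s') = gamma * phi(s') - phi(s) to the environment reward.  The potential
# phi is a design choice; the current moles of the target material is one
# hypothetical option for the RxN bench.
def shaped_reward(reward: float, phi_s: float, phi_s_next: float, gamma: float = 0.99) -> float:
    return reward + gamma * phi_s_next - phi_s
```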
The benches in ChemGymRL are typically partially observable. For example, ExT has (partial) observations that show neither the amount of dissolved solutes present nor their distribution throughout the solvents. In this paper, we considered benchmark algorithms that treat the observation as the true state (such methods perform surprisingly well in many partially observable domains69). Alternatively, algorithms such as DRQN69 or DNC56 that explicitly reason over partial observations could provide better empirical performance.
RL algorithms traditionally aim to maximize the expected utility using the Q-function or the value function. However, in many environments considering the entire distribution of returns rather than just the expected value has been demonstrated to be helpful.48 This distributional perspective of RL has gained a lot of attention recently in the RL research community.89 Distributional algorithms can show good performances in the ChemGymRL environments since the returns are typically multi-modal (in the sense of distributions) with more than one product expected to be produced as a result of chemical reactions. Hence, these methods can also use ChemGymRL as a testbed.
Commonly used RL algorithms such as DQN28 and deep deterministic policy gradient (DDPG)90 use an experience replay technique for improving sample efficiency and removing correlations between successive data samples. Recently, there have been many advances in RL algorithms that aim to improve the efficiency of the experience replay technique, especially in sparse-reward settings. Several such algorithms, including the popular hindsight experience replay (HER)73 and experience replay optimization (ERO)75 methods, can be tested in ChemGymRL.
Another paradigm of interest is learning from demonstrations (LfD), which combines imitation learning from external demonstrations with the RL approach of learning from the environment.77 The objective is to make use of pre-existing policies or human knowledge to accelerate the training of RL agents, as opposed to learning from scratch, which is highly sample inefficient. In ChemGymRL, we have provided heuristic policies for all of the benches that can be used as the external demonstrator to train RL algorithms. Popular LfD techniques such as deep Q-learning from demonstrations (DQfD)76 and normalized actor-critic (NAC)78 are of interest in ChemGymRL.
Finally, when handling chemical reagents, it is not sufficient to only learn an optimal policy that provides higher quantities of the required products; the agent must also use safe techniques that do not cause harm or injury to the agents or humans using these reagents. Such constraints on learning the optimal behaviour can be incorporated in a principled way within the constrained RL framework.83 Constrained RL techniques such as interior-point policy optimization (IPO)82 and constrained policy optimization (CPO)84 impose constraints in the MDP that curtail an agent's ability to explore, so that it can perform safe exploration.91 Such methods are also expected to be successful in ChemGymRL.
We would like to highlight in Table 2 that many of the other testbeds used by prior works pertained to computer games, video games, or robotic simulations. In this context, ChemGymRL provides an excellent alternative testbed that pertains to a real-world problem and helps in evaluating the performance of a large class of RL algorithms.
Chemical reactions of this form can be considered as special cases of the initial value problem

dc/dt = f(c, t), c(t = 0) = c0, (9)

where c is the vector of material concentrations and f encodes the rate laws.
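As an illustration of how such an initial value problem can be integrated numerically, the sketch below solves a generic bimolecular rate law with SciPy; it is not the exact system used in ChemGymRL.

```python
# Illustrative numerical treatment of a rate-law initial value problem of the
# form above, using a generic bimolecular reaction A + B -> C with rate k[A][B].
# This is a minimal example, not the exact system used in ChemGymRL.
import numpy as np
from scipy.integrate import solve_ivp

k = 0.5  # rate constant (arbitrary units)

def rate_laws(t, c):
    a, b, prod = c
    rate = k * a * b
    return [-rate, -rate, rate]

solution = solve_ivp(rate_laws, t_span=(0.0, 10.0), y0=[1.0, 0.8, 0.0],
                     t_eval=np.linspace(0.0, 10.0, 50))
print(solution.y[:, -1])   # concentrations of A, B, C at the final time
```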
Footnotes
† The Gymnasium API https://github.com/Farama-Foundation/Gymnasium is the continuing development name of the original OpenAI Gym library for RL environments. |
‡ For example Sokoban-RL52 pertains to model-based RL, MineRL53 corresponds to curriculum learning etc., however ChemGymRL is a general testbed that can support a wide-variety of RL paradigms and algorithms. |