Mae V. Taylor a,
Zaid Muwaffak a,
Matthew R. Penny a,
Blanka R. Szulc b,
Steven Brown c,
Andy Merritt d and
Stephen T. Hilton* a
a UCL School of Pharmacy, 29-39 Brunswick Square, London, WC1N 1AX, UK. E-mail: s.hilton@ucl.ac.uk
b School of Biosciences, University of Kent, CT2 2NZ, UK
c Scott Bader, Wellingborough, NN29 7RJ, UK
d LifeArc, Centre for Therapeutics Discovery, Stevenage Bioscience Catalyst, Stevenage, SG1 2FX, UK
First published on 13th January 2025
Digital twin laboratories, accessible through low-cost, portable virtual reality (VR) headsets, have become a powerful tool in chemical education and research collaboration. These immersive digital environments replicate physical laboratories, offering unique platforms for planning experiments, conducting virtual lab tours, and training on specialist equipment. In this paper, we present the development of Lab427 VR, a digital twin model of our laboratory designed as a novel platform for global collaborative research and immersive training. A significant advancement in our approach is the integration of conversational artificial intelligence (AI) avatars, which address operational gaps in current digital twin systems. We designed and trained three specialised AI avatars to perform key laboratory functions, achieving up to 95% accuracy in their responses, assessed using human evaluation, set-based F1 scoring, and BERTScore. Our findings highlight the potential of combining digital twin technology with AI-driven solutions to enhance laboratory collaboration and training, and demonstrate the future potential of smart, interactive, connected laboratory environments.
Recent investigations into how AI, and more specifically conversational AI such as ChatGPT, a large language model (LLM) developed by OpenAI, can fill this gap have begun to emerge, with examples focusing on comprehension of undergraduate chemistry experiments and assistance with writing reports,3 and further research examining how well these models handle more complex chemical language.4–7 Since its public release in 2022, ChatGPT has gone on to instigate a huge wave of advancement in the field of related AI programs.
ConvAI is one example of a conversational AI application that has been developed primarily for the gaming industry, but that has potential for scientific use (Fig. 1). This web-based application allows users to design and integrate conversational avatars into their existing applications without needing to program the generative AI model themselves. The avatars can be trained using a ‘Knowledge Bank’ of information supplied by the user as plain text files, together with a character biography that shapes the apparent personality of the avatar and the responses it provides. Interested in how this application might support our research into digital twin environments, and building on recent investigations into the use of VR in laboratories, we decided to investigate the potential of incorporating these avatars into a digital twin of our own laboratory. We first designed and built a digital twin model of the laboratory and trained three avatars to perform specific roles beneficial to daily work, covering chemical inventory awareness, health and safety expertise, and the locations of general laboratory stock. These avatars were then tested through a series of scenario-based tests and assessed using human evaluation, computational metrics, and error analysis. Questions were also asked that pushed beyond the original scope of each avatar, to test how well the avatars combine external sources of information with their internal knowledge. To best contextualise how these avatars may be of benefit, we also introduce our Lab427 Digital Twin VR software for the first time in this paper.
Fig. 1 ConvAI avatars present the opportunity to customise features such as appearance, personality, training information and character backstory.
As recently featured in a Nature Spotlight,10 Lab427 Digital Twin is one such VR application that was developed within the Hilton group. Previous research in the group has focused on generating a VR application for undergraduate training in High-Performance Liquid Chromatography (HPLC), using highly detailed and interactive HPLC models in a virtual space. This application has been successfully integrated into the current Pharmacy curriculum at UCL School of Pharmacy and has received positive feedback on its use from students.11 Building upon this, the aim was to digitise the research lab and, in doing so, produce a platform suitable for training, wider collaborative opportunities, and further digital integration.
Fig. 2 Images from Lab427 Digital Twin VR software developed in house using Unreal Engine at the Hilton Lab, UCL. |
Fig. 3 Students train on continuous flow equipment in real life (left) versus digital world collaboration via multiuser capability in Lab427 Digital Twin software (right). |
The Lab427 Digital Twin software was designed to provide an exact model of the research laboratory, including details such as drawer locations and contents as well as interactive elements, cognisant of the fact that the degree of interactivity in a VR program can be a crucial factor for information retention.12 This included equipment such as HPLC digital twins in their correct locations, and interactable elements such as laboratory glassware and 3D-drawing pens. This enables users to practise using such equipment without the risks associated with inexperience, and VR experiences have been shown to reduce anxiety and confusion in later real-world scenarios.13
Whilst the real environment was digitised as a replica, the software was also designed with two additional rooms that have no real-world counterpart (Fig. 4 and ESI†). One was designed as a conference room, including a projector with changeable PowerPoint slides, chairs, and interactive whiteboards. The second was designed as an area for poster presentations and included a series of blank canvases for PhD students to share their work. As the software allows multiplayer sessions, individuals can meet in the same virtual space using either a VR headset or a PC (Fig. 5).
Fig. 4 (Top left, clockwise) Process of developing a digital twin model in VR, utilising existing floor plans to develop accurate CAD models with further optimisation in Revit and Unreal Engine. |
Fig. 5 The ConvAI avatars in the digital twin laboratory model of Lab427 (left to right): SAM, SUSAN, and InGRID.
(1) 18 risk assessments specific to chemical hazards.
(2) The local Lab427 health and safety rules of practice.
(3) A summary text generated in ChatGPT.
The methods used to evaluate the avatars depended heavily upon the intended use case of each avatar. For example, where an avatar was trained to provide specific information about the location of an item, in either the chemical inventory system or among general laboratory consumables, the model was evaluated on the percentage of information retrieved correctly. In the case of the avatar designed for chemical inventory navigation, 20 chemicals were selected at random and the avatar was asked to return each chemical's CAS number, local stock code, and location. For the avatar trained on laboratory consumables and glassware, only the location was requested; however, further questions were asked to see how well the avatars could draw on external information sources as well as the information in their knowledge bank.
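A minimal sketch of how such a retrieval check can be scored automatically is shown below; the file name, column headings, and normalisation rule are illustrative assumptions rather than the exact scripts used in this work.

```python
import csv

def normalise(text: str) -> str:
    """Lower-case and strip all whitespace so that 'B3 / Shelf 2' matches 'b3/shelf 2'."""
    return "".join(text.lower().split())

def retrieval_accuracy(results_path, fields=("cas_number", "stock_code", "location")):
    """Score the avatar's answers against the reference inventory, field by field.

    Assumes a CSV (hypothetical layout) with paired columns such as
    'cas_number_expected' and 'cas_number_returned' for each test chemical.
    """
    correct = {field: 0 for field in fields}
    total = 0
    with open(results_path, newline="") as handle:
        for row in csv.DictReader(handle):
            total += 1
            for field in fields:
                if normalise(row[f"{field}_returned"]) == normalise(row[f"{field}_expected"]):
                    correct[field] += 1
    # Fraction of test queries answered correctly for each field.
    return {field: correct[field] / total for field in fields} if total else {}

# Hypothetical usage with the 20-chemical test set:
# print(retrieval_accuracy("sam_inventory_test.csv"))
```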
The health and safety avatar, SUSAN, was evaluated by both human evaluation and computational metrics. A series of questions regarding health and safety was generated, covering both general information around chemical risk assessments and more specific questions pertinent to Lab427. In total, 35 general health and safety questions and 19 questions specific to Lab427 were written by hand, giving 54 questions overall. An example general question was “Explain the significance of ventilation systems in chemical labs?”, whilst a specific question was “I need to use some TFA in my experiment. Which risk assessment should I refer to, and can you summarise it?”, as this required the generative AI model to correctly recall the specific Lab427 ‘Chemistry Risk Assessment (CS6) Use of TFA and TFAA’. Each question was paired with a model answer: a well-researched answer drawn directly from reference documents or from external sources such as the Occupational Safety and Health Administration (OSHA). In most instances these model answers were written by a scientist fully trained in Lab427 health and safety processes, though ChatGPT was used to improve readability in some cases. The questions were then put to three AI models: ChatGPT 3.5, an untrained ConvAI avatar with no information loaded into its knowledge bank, and the fully trained ConvAI avatar. Each set of answers was evaluated by a scientist fully trained in health and safety for accuracy, meaning how correct and comprehensive the answer was, and for clarity, which measured readability. The Likert scales used to evaluate the answers are shown in Fig. 8.
Fig. 8 Likert scales used to evaluate the accuracy and clarity of ConvAI generated answers on Lab427 Health and Safety processes. |
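The human-evaluation figures reported later (Table 1) are averages of these per-question Likert ratings; a short sketch of that aggregation is given below, with the file name and column headings as illustrative assumptions.

```python
import pandas as pd

# Hypothetical layout: one row per question/model pair, with the 1-5 Likert ratings
# assigned by the trained assessor for accuracy and clarity.
ratings = pd.read_csv("susan_likert_ratings.csv")  # columns: question_id, model, accuracy, clarity

# Mean rating per model across all 54 questions.
summary = ratings.groupby("model")[["accuracy", "clarity"]].mean().round(3)
print(summary)
```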
Further to this, each set of answers was evaluated for its F1 score and BERTScore. The F1 score is a computational evaluation metric that measures the precision and recall of a generated text against a model text and reports their harmonic mean; it has historically been used for machine translation and question-answering tasks but is also used to evaluate text generated by AI models. The other metric used was BERTScore, which uses contextual embeddings from BERT (Bidirectional Encoder Representations from Transformers) to match tokens between the generated and reference texts and compute precision, recall, and an F1 score, rather than relying on exact n-gram overlap.14 BERTScore has been found to correlate more strongly with human evaluation and was therefore used in addition to the F1 score in this work. All code was written in Python using the PyCharm IDE (Integrated Development Environment).
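The two metrics can be computed in a few lines of Python; the snippet below is a minimal sketch using a SQuAD-style set-based token F1 and the open-source bert-score package, not a reproduction of the exact evaluation code used here, and the example answer pair is invented for illustration.

```python
from collections import Counter

from bert_score import score as bert_score  # pip install bert-score

def token_f1(candidate: str, reference: str) -> float:
    """Set-based (bag-of-tokens) F1 between a generated answer and the model answer."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Illustrative answer pair (not taken from the evaluation set).
candidates = ["Handle TFA in a fume hood and wear nitrile gloves."]
references = ["TFA must be handled in a fume hood while wearing nitrile gloves."]

f1_scores = [token_f1(c, r) for c, r in zip(candidates, references)]
P, R, F1 = bert_score(candidates, references, lang="en", verbose=False)

print(f"mean token F1:     {sum(f1_scores) / len(f1_scores):.3f}")
print(f"mean BERTScore F1: {F1.mean().item():.3f}")
```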
Fig. 9 Testing performed on SAM to determine the optimal format of data entry into the Knowledge Bank for average accuracy rates in recalling location, CAS number, and stock code of chemicals. |
Given that the primary use case of ConvAI is game development, and consequently the generation of non-playable characters (NPCs), ConvAI offers the ability to modify the personality of an avatar. This is achieved by adjusting five personality parameters provided by the platform, such as extraversion and agreeableness. Interested in how modifying an avatar's personality might affect accuracy rates, we performed a test in which two of these parameters, meticulousness and guardrail, were varied. According to ConvAI, meticulousness determines the attention the avatar pays to detail, whilst guardrail measures the degree to which the avatar relies only on internal knowledge (100%) versus both internal and external knowledge (0%). An accuracy heatmap was subsequently created (Fig. 7). No conclusive trend appeared for these two personality traits; however, overall accuracy improved to 95%, with 100% accuracy obtained for CAS number retrieval, demonstrating the value of exploring the parameter space widely when accuracy is required.
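A sketch of how such an accuracy heatmap can be assembled is shown below; the parameter grid is a placeholder and the accuracy matrix is deliberately left empty to be filled with measured values, so this is not the data behind Fig. 7.

```python
import numpy as np
import matplotlib.pyplot as plt

# Parameter settings tested (placeholder grid; substitute the values actually screened).
meticulousness = [0, 25, 50, 75, 100]
guardrail = [0, 25, 50, 75, 100]

# accuracy[i, j] = fraction of correct retrievals at guardrail[i] and meticulousness[j].
# Left as NaN here; populate from the measured test runs.
accuracy = np.full((len(guardrail), len(meticulousness)), np.nan)

fig, ax = plt.subplots()
im = ax.imshow(accuracy, vmin=0.0, vmax=1.0, cmap="viridis", origin="lower")
ax.set_xticks(range(len(meticulousness)))
ax.set_xticklabels(meticulousness)
ax.set_yticks(range(len(guardrail)))
ax.set_yticklabels(guardrail)
ax.set_xlabel("Meticulousness (%)")
ax.set_ylabel("Guardrail (%)")
ax.set_title("SAM retrieval accuracy vs. personality parameters")
fig.colorbar(im, ax=ax, label="Accuracy")
plt.show()
```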
SUSAN was designed to address recent reports on the shortcomings of health and safety training in laboratories and was tested using the Likert scales described above, with her answers compared to those of two other generative AI models.15 ChatGPT answers scored higher on average for clarity than either ConvAI model, which we attributed to ChatGPT being less conversationally constrained than the ConvAI avatars; nevertheless, the trained ConvAI model far exceeded the other two models in terms of accuracy. The trained ConvAI model scored on average 4.59 out of 5, indicating a high degree of accuracy and comprehensive answers. The other two AI models scored similarly, at 3.28 and 3.32, as expected where access to specific information is unavailable (Table 1).
| | ChatGPT | Blank ConvAI | Trained ConvAI |
| --- | --- | --- | --- |
| Human evaluation: accuracy | 3.315 | 3.278 | 4.593 |
| Human evaluation: clarity | 2.907 | 2.556 | 2.685 |
| F1 score | 0.345 | 0.351 | 0.695 |
| BERTScore | 0.895 | 0.892 | 0.946 |
In the computational analysis, the trained ConvAI model far exceeded ChatGPT and the untrained ConvAI model in both F1 score and BERTScore. The trained ConvAI model obtained an F1 score of 0.695, indicating a high degree of precision and recall between the model (reference) text and the generated text. Likewise, SUSAN obtained a BERTScore of 0.946, again indicating a high degree of both precision and recall. The two other AI models scored very similarly on both F1 and BERTScore. When the results were examined by question type, the trained ConvAI model could accurately provide both general health and safety information and information specific to Lab427.
Lastly, InGRID was tested with 20 questions concerning the location of items in the laboratory and gave the correct answer for 95% of the questions asked. The only negative response occurred when InGRID was asked to locate ‘tin foil’, which is listed as ‘aluminium foil’ in the equipment inventory. The language used to ask these questions was deliberately informal and conversational, and the model handled voice inputs exceptionally well. Follow-up questions that pushed beyond the original scope of InGRID's purpose were also asked, and in 78% of cases useful information was provided in response. For example, when asked, InGRID could locate the Dean-Stark water-removal apparatus and then proceed to suggest alternative methods for separating a component from a reaction mixture, such as distillation or solvent extraction. She could then tell the user what additional equipment would be needed for a specific experimental protocol and list its locations. Incomplete information occurred in only one instance, when, asked for guidelines on setting up equipment, InGRID directed the user to online sources or staff members. This is a good illustration of the limitations imposed by an incomplete knowledge bank (Fig. 10).
Fig. 10 Example conversation with SUSAN, trained Lab427 health and safety expert. Inputs were given via voice and text.
For tasks in which specific information is requested, experiments were conducted to examine both how to increase accuracy rates and how well the avatar responds to conversational enquiries and deeper questioning. It was found that the format of the information in a knowledge bank leads to variable accuracy scores, with prompting also playing a role in improving the quality of generated answers. For SAM, who was designed to help users navigate a chemical inventory, a high degree of accuracy was obtained. In the Lab427 Digital Twin laboratory, users were able to quickly check whether the chemicals required for their reaction were in the inventory and gain an idea of where to locate them. The ability of the avatar to suggest chemicals has yet to be explored.
Mirza et al. recently proposed an LLM evaluation framework called ‘ChemBench’, which compares the chemical reasoning abilities of humans with those of multiple LLMs and found that, whilst LLMs appear to be ‘superhuman chemists’ in some areas, there are critical gaps in their reasoning abilities, including around chemical safety profiles.16 Recognising these limitations, this proof-of-concept study sought to address them by designing an avatar specifically tailored to present local laboratory health and safety rules and provide information related to chemical risk assessments. To evaluate the effectiveness of this avatar, both human and computational assessments were conducted. The overall results indicated that a trained ConvAI avatar was able both to reference key information in a knowledge bank and to understand that information well enough to provide a contextually sound response. In VR, this translates to a virtual health and safety expert which can conversationally provide high-quality safety demonstrations and instruction to new starters, as well as talk through health and safety protocols specific to the reaction a chemist is running at any given time.
Recently, Boiko et al. demonstrated the potential of LLM-driven systems that autonomously design, plan, and execute complex chemical experiments, showcasing how LLMs can automate tasks such as reaction planning and experiment execution and greatly enhance research efficiency.17 Building on this premise, from a training perspective we anticipate that InGRID could be particularly beneficial when combined with an AI fully capable of instructing users on setting up experiments and common laboratory protocols. For instance, this integration would be ideal for developing virtual reality training programs in which an undergraduate is guided step-by-step through both the theory and the practical set-up of a procedure such as solvent extraction. The avatar would also show the student where to locate the necessary equipment within both the virtual digital twin and the physical laboratory, aid with reaction optimisation, and ultimately boost confidence and skill-building prior to or during real-world experimentation. Such a program would allow students to gain experience in techniques they may not otherwise have the opportunity to practise hands-on.
The avatars have yet to be tested on a wider user base to understand how different people interact with them, and this will be the focus of future studies.
Conversational avatars powered by generative AI models have the remarkable ability to transform training and collaboration by enhancing the interactive aspects of VR beyond previous limits. These avatars represent a unique way to retain information in a laboratory and to provide on-hand expertise for aspects crucial to effective scientific training and research. They could also be trained to specialise in operating laboratory equipment such as HPLC machines, to lead or assist undergraduate training workshops. Work in our group has also focused on creating avatars of group members, equipped with specialist knowledge about each individual's research, to facilitate the concept of virtual conferences and to retain this knowledge if staff members leave the laboratory. Overall, the exciting potential of these avatars promises to redefine the landscape of virtual reality training and collaboration, unlocking new levels of immersion and a multitude of educational possibilities; both the avatars and the VR environment continue to be refined to bring them closer to the real environments and people they represent.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00330f
This journal is © The Royal Society of Chemistry 2025 |