Yuang Shi‡ab, Nakul Rampal‡abc, Chengbin Zhaoabc, Dongrong Joe Fuc, Christian Borgscd, Jennifer T. Chayescdefg and Omar M. Yaghi*abch
aDepartment of Chemistry, University of California, Berkeley, California 94720, USA. E-mail: yaghi@berkeley.edu
bKavli Energy NanoScience Institute, University of California, Berkeley, California 94720, USA
cBakar Institute of Digital Materials for the Planet, College of Computing, Data Science, and Society, University of California, Berkeley, California 94720, USA
dDepartment of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720, USA
eDepartment of Mathematics, University of California, Berkeley, California 94720, USA
fDepartment of Statistics, University of California, Berkeley, California 94720, USA
gSchool of Information, University of California, Berkeley, California 94720, USA
hKACST-UC Berkeley Center of Excellence for Nanomaterials for Clean Energy Applications, King Abdulaziz City for Science and Technology, Riyadh 11442, Saudi Arabia
First published on 29th May 2025
Artificial intelligence, represented by large language models (LLMs), has demonstrated tremendous capabilities in natural language understanding and information extraction. To further evaluate the performance of various LLMs in extracting information from academic papers, this study explores the application of LLMs in reticular chemistry, focusing on their effectiveness in generating Q&A datasets and extracting synthesis conditions from scientific literature. The models evaluated include OpenAI's GPT-4 Turbo, Anthropic's Claude 3 Opus, and Google's Gemini 1.5 Pro. Key results indicate that Claude excelled in providing complete synthesis data, while Gemini outperformed the others in accuracy, characterization-free compliance (obedience), and proactive structuring of responses. Although GPT-4 was less effective in quantitative metrics, it demonstrated strong logical reasoning and contextual inference capabilities. Overall, Gemini and Claude achieved the highest scores in accuracy, groundedness, and adherence to prompt requirements, making them suitable benchmarks for future studies. The findings reveal the potential of LLMs to aid scientific research, particularly in the efficient construction of structured datasets, which can be used to train models that predict and assist in the synthesis of new metal–organic frameworks (MOFs).
Importantly, all the use cases described above require high-quality datasets, as these provide the foundation for training, testing, and benchmarking different LLMs. Most datasets use the published scientific literature as the primary input. This input is diverse and very often unstructured, which makes it difficult to work with directly. Given these challenges, it is critical to ensure that the generated datasets are both comprehensive, that is, representative of the broad scientific literature available, and accurate, meaning the information they contain is correct. Since LLMs are now also often used to generate these datasets,7,8 we believe that knowing which LLM is best suited to a particular task is helpful for the community.
In this study, we compare the performance of different LLMs—OpenAI's GPT-4 Turbo (hereafter abbreviated as GPT-4),9 Anthropic's Claude 3 Opus (Claude)10 and Google's Gemini 1.5 Pro (Gemini)11—in two tasks: (i) generating question–answer (Q&A) pairs and (ii) extracting synthesis conditions of MOFs from the scientific literature, with the outputs evaluated by subject matter experts. Building on the RetChemQA dataset,12 we provide a thorough quantitative and qualitative comparison of each model's performance, identifying the key strengths, limitations, and specific challenges encountered when generating the datasets. We find that for extracting synthesis conditions from the literature, Claude is the most comprehensive and accurate LLM, while for generating Q&A datasets, Gemini performs best. Furthermore, we report the particulars of where these LLMs fail, but also where they demonstrate “amazing” and unexpected results. These findings provide insights towards advancing the use of LLM-powered AI agents in reticular chemistry.
The datasets for synthesis conditions, single-hop Q&A, and multi-hop Q&A were generated using the same workflow as described in our previous work.12 Briefly, this workflow involved initializing the environment, parsing the relevant files, and tokenizing the combined text. Once tokenized, these texts were processed by three different LLMs: OpenAI's GPT-4 Turbo (gpt-4-0125-preview, data generated in March 2024), Anthropic's Claude 3 Opus (data generated from June to October 2024), and Google's Gemini 1.5 Pro (gemini-1.5-pro-001, data generated from May to October 2024).
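A minimal Python sketch of this workflow is given below for one representative backend (GPT-4 Turbo via the OpenAI client). The file names, prompt text, and helper functions are illustrative assumptions rather than the code used in this work; the same pattern applies to the Claude and Gemini APIs.

```python
# Hypothetical sketch of the extraction workflow: parse the paper + ESI text,
# count tokens, and send the combined text with an extraction prompt to one LLM.
from pathlib import Path

import tiktoken                # tokenizer used here only to check text length
from openai import OpenAI      # representative backend (GPT-4 Turbo)


def load_paper(doi_dir: str) -> str:
    """Combine the parsed main text and ESI for one DOI into a single string (assumed file names)."""
    parts = [Path(doi_dir, name).read_text(encoding="utf-8")
             for name in ("main_text.txt", "esi.txt")]
    return "\n\n".join(parts)


def count_tokens(text: str) -> int:
    """Tokenize the combined text to verify it fits within the model's context window."""
    enc = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by GPT-4 Turbo
    return len(enc.encode(text))


def extract_synthesis_conditions(text: str, prompt: str) -> str:
    """Send the paper text plus the extraction prompt to the model and return its reply."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4-0125-preview",
        messages=[{"role": "system", "content": prompt},
                  {"role": "user", "content": text}],
        temperature=0.0,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    paper = load_paper("papers/10.1039_example")              # illustrative path
    print(f"{count_tokens(paper)} tokens")
    prompt = "Extract the synthesis conditions for every material ..."  # abbreviated placeholder
    print(extract_synthesis_conditions(paper, prompt))
```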
For the single-hop and multi-hop Q&A tasks, LLMs were instructed to generate and answer questions based on both the main content of the paper and its ESI. Single-hop questions involved straightforward inquiries that could be addressed using a single section of the text, while multi-hop questions required synthesizing information from multiple sections, thereby assessing the models' capabilities in complex reasoning and information integration.
Furthermore, we specified the type and number of questions for the single-hop and multi-hop Q&A tasks: 6 factual, 7 reasoning, and 7 true-or-false questions for each DOI. Evaluating whether the LLMs adhered to these requirements was an important metric, allowing us to assess their ability to follow the prompt specifications accurately.
Completeness is a measure of whether all relevant parameters, such as temperature, concentration, and reagent amount for the synthesis of a specific product, are included.
Correctness refers to the accuracy of the extracted information, meaning that all extracted data had to be precise and without errors. This criterion ensured that every piece of information provided by the LLMs was factually correct and aligned with the original content.
Characterization-free compliance is a measure of how obedient the LLM is in following the instructions given in the prompt; in this case, whether or not the LLMs exclude characterization details, such as temperatures for adsorption isotherms or chemical shifts from nuclear magnetic resonance (NMR).
As Fig. S2† shows, for each of these criteria, the LLM output was marked as “Y” if it met the requirement and “N” if it did not. In cases where multiple products were reported for a single DOI, we collected the synthesis conditions for all products and human-evaluated the LLM-extracted information for each product individually based on the aforementioned criteria. Occasionally, the LLMs failed to extract synthesis condition information for some products; in such situations, we assigned a value of “\” for each of the three criteria.
After assigning “Y”, “N”, or “\” for each criterion of each product extracted by each LLM, we calculated the corresponding proportions of Y, N, and \. Additionally, we introduced a parameter called net-Y-ratio, which is defined as the ratio of “Y” to the total extracted information (i.e., Y + N). This parameter measures the accuracy of the extracted information, independent of the overall completeness of the dataset.
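To make these definitions concrete, a short Python sketch is given below; the label list in the example is placeholder data and the function name is our own.

```python
# Illustrative computation of the per-criterion proportions and the net-Y-ratio.
# `labels` holds the human-assigned marks ("Y", "N", or "\") for one criterion
# across all extracted products (placeholder data, not the actual evaluation).
from collections import Counter


def label_statistics(labels: list[str]) -> dict[str, float]:
    counts = Counter(labels)
    total = len(labels)
    y, n, missing = counts["Y"], counts["N"], counts["\\"]
    return {
        "Y_ratio": y / total,              # proportion meeting the criterion overall
        "N_ratio": n / total,
        "missing_ratio": missing / total,  # products the LLM failed to extract
        "net_Y_ratio": y / (y + n),        # accuracy over what was actually extracted
    }


# Example with made-up labels for one criterion:
print(label_statistics(["Y", "Y", "N", "\\", "Y"]))
# {'Y_ratio': 0.6, 'N_ratio': 0.2, 'missing_ratio': 0.2, 'net_Y_ratio': 0.75}
```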
This evaluation framework allowed us to thoroughly assess the performance of each LLM, not only in terms of the quality of the information extracted but also in their adherence to specific requirements and ability to exclude irrelevant details.
Accuracy is the measure of whether the answer provided is correct, meaning that the response must be factually accurate and directly address the question posed.
Groundedness is the measure of whether the questions are based on the provided article, rather than relying on common sense or hallucinated information. This criterion assesses whether the generated questions and answers are properly anchored in the context of the provided material.
If a question was relevant to the content and the corresponding answer was correct, we classified this instance as True Positive (TP). If the question was hallucinated but the answer was still correct, or if the LLM correctly identified the question as hallucinated—such as by responding with “this question is not related to the content” or “I don't know the answer”—we marked this as True Negative (TN). In cases where the question was well-formed but the answer was incorrect, we categorized the outcome as False Positive (FP). If the question was irrelevant to the main text and the LLM failed to identify it as hallucinated, we classified it as False Negative (FN).
After assigning TP, TN, FP, and FN labels to each question–answer pair, we applied a modified version of our previous evaluation framework.12 To provide a more intuitive assessment of LLM performance, we used the following four metrics:
Accuracy: this metric evaluates the ability of the LLMs to provide correct answers, regardless of the quality of the question. Accuracy is calculated as the ratio of all correctly handled instances (TP + TN) to the total number of Q&A pairs (TP + TN + FP + FN).
Groundedness: this metric is conceptually the inverse of the “Hallucination Rate” used in our previous framework.12 It measures the quality of the questions, specifically whether they are based on the context provided. Groundedness is calculated as the ratio of in-context questions (TP + FP) to the total number of Q&A pairs (TP + TN + FP + FN).
Precision: unlike accuracy and groundedness, precision considers both the quality of the question and the correctness of the answer. This means that hallucinated questions or incorrect answers are penalized. Precision is calculated as the ratio of accurately answered, in-context questions (TP) to the total number of Q&A pairs (TP + TN + FP + FN).
Hallucination capture rate: this metric evaluates the ability of LLMs to self-correct when faced with irrelevant or erroneous questions. Hallucination capture rate is calculated as the ratio of hallucinated questions correctly identified (TN) to the total number of hallucinated questions (TN + FN).
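A compact Python sketch of this scoring scheme, going from the TP/TN/FP/FN counts to the four metrics, is shown below; the counts in the example are placeholders and the function name is ours.

```python
# Illustrative computation of the four Q&A evaluation metrics from label counts.
# tp: in-context question answered correctly    fp: in-context question answered incorrectly
# tn: hallucinated question handled correctly   fn: hallucinated question not recognized
def qa_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,       # correct handling regardless of question quality
        "groundedness": (tp + fp) / total,   # fraction of questions grounded in the paper
        "precision": tp / total,             # grounded questions answered correctly
        "hallucination_capture_rate": tn / (tn + fn) if (tn + fn) else float("nan"),
    }


# Example with made-up counts:
print(qa_metrics(tp=180, tn=2, fp=10, fn=8))
# {'accuracy': 0.91, 'groundedness': 0.95, 'precision': 0.9, 'hallucination_capture_rate': 0.2}
```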
We observed that despite explicitly specifying in the prompt that each dataset entry should contain 6 factual, 7 reasoning, and 7 true-or-false questions, the LLM-generated Q&A datasets did not always adhere to this requirement. To address this, we introduced an alignment parameter for each LLM, calculated as the proportion of DOIs (out of 50) for which the generated responses met the prompt requirements. This allowed us to evaluate how consistently each LLM adhered to the required question distribution across the entire dataset; a simple sketch of this check is given below.
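The sketch below uses a hypothetical data layout (one list of question records per DOI, each carrying a `type` field) to illustrate how such an alignment check can be computed.

```python
# Illustrative alignment check: a DOI "aligns" if its generated Q&A set contains
# exactly 6 factual, 7 reasoning, and 7 true-or-false questions.
# The input format (a dict mapping DOI -> list of {"type": ...} records) is an assumption.
from collections import Counter

REQUIRED = {"factual": 6, "reasoning": 7, "true_or_false": 7}


def doi_is_aligned(questions: list[dict]) -> bool:
    counts = Counter(q["type"] for q in questions)
    return all(counts.get(qtype, 0) == n for qtype, n in REQUIRED.items())


def alignment_parameter(dataset: dict[str, list[dict]]) -> float:
    """Proportion of DOIs (e.g. out of 50) whose responses meet the requirement."""
    aligned = sum(doi_is_aligned(questions) for questions in dataset.values())
    return aligned / len(dataset)
```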
Fig. 1 shows the performance of GPT-4, Claude, and Gemini in the synthesis conditions extraction task. For the first criterion, completeness (Fig. 1a), Claude successfully extracted complete synthesis information for 83.5% of the products, outperforming Gemini's 72.2% and GPT-4's 67.8%. This suggests that Claude is more effective at covering all necessary details for synthesis across a wide range of products.
For the correctness criterion (Fig. 1b), Gemini demonstrated exceptional performance with an accuracy of 95.7%, significantly better than Claude's 86.1% and GPT-4's 83.5%. This highlights Gemini's ability to extract precise information without errors, which is critical for maintaining the reliability of the database.
Regarding characterization-free compliance (Fig. 1c), Gemini again excelled with a compliance rate of 95.7%, surpassing Claude's 84.4% and GPT-4's 80.9%. This indicates that Gemini was most successful in adhering to the requirement of excluding characterization data, which is essential for creating a synthesis conditions-focused dataset.
Notably, for the net-Y-ratio, which excludes the 12.2% of product data missed by Claude, Claude's performance in both correctness and characterization-free compliance was comparable to that of Gemini (>96%). This indicates that, when focusing solely on the data that was successfully extracted, both Claude and Gemini generated highly accurate and compliant datasets. The primary difference lay in the scope of coverage, with Claude's dataset ignoring some of the products. Considering that Claude achieved high net ratios (>95%) across all three criteria, we conclude that Claude exhibited the best overall performance in generating synthesis condition databases, balancing both accuracy and coverage.
Fig. 2a showcases an “amazing” response from GPT-4. In the original text, the authors provided complete synthesis conditions for compound 1 but only mentioned that “1@Y was prepared by the same synthetic procedure as 1 with ErCl3·6H2O and YCl3·6H2O in a 1:7 molar ratio” when describing the synthesis of compound 1@Y. While Claude and Gemini successfully extracted the synthesis conditions for compound 1, they did not capture the synthesis details for compound 1@Y. In contrast, GPT-4 understood the context and duplicated the synthesis conditions for compound 1 while updating the “chemicals used” entry to include “ErCl3·6H2O and YCl3·6H2O in a 1:7 molar ratio”. This demonstrates GPT-4's ability to infer and apply contextual information effectively, a key aspect of advanced data extraction.
Fig. 2 “Amazing” responses of (a) GPT-4, (b) Claude, and (c) Gemini in the synthesis condition extraction task.
Fig. 2b presents a remarkable response from Claude. When describing the washing process of a product, the original text stated: “before drying, the wet solid powder was immediately transferred to a 500 mL glass jar, and 400 mL of DMF was added. The jar was heated at 120 °C for 12 h, then cooled to room temperature. The DMF was decanted and replaced with 400 mL of fresh DMF. The jar was reheated at 120 °C, followed by decanting and replacing with fresh DMF. This was repeated four additional times”. The total number of DMF washings was therefore 1 + 1 + 4 = 6, which Claude correctly calculated. This response highlights Claude's capacity to perform arithmetic reasoning and accurately interpret iterative procedures, which is crucial for understanding experimental protocols.
Fig. 2c highlights Gemini's outstanding performance in providing comprehensive synthesis details. In our synthesis conditions prompt, we explicitly requested synthesis conditions for every material, which included both MOFs and organic linkers. While GPT-4 and Claude often focused on extracting synthesis conditions for MOFs only, frequently overlooking the synthesis of organic linkers, Gemini consistently searched both the main text and the ESI to ensure no synthesis detail was omitted. This completeness is reflected in the statistical data, where Gemini received the fewest “\” labels, indicating a more complete and exhaustive extraction process. Gemini's diligence in capturing all relevant synthesis details underscores its potential for tasks that require exhaustive data compilation.
These “amazing” responses illustrate that LLMs possess a considerable level of logical reasoning ability when handling complex tasks, showing their potential to significantly improve dataset construction. Their ability to infer information, perform arithmetic operations, and ensure thorough extraction highlights their versatility in dealing with nuanced scientific content. By refining prompts and applying targeted preprocessing to the literature, we believe LLM-based information extraction can become even more accurate and consistent. This progress will make LLMs valuable tools for building comprehensive scientific datasets that support efficient research and discovery.
For GPT-4, the primary cause of errors in completeness was missing key details, such as the amounts of chemicals used (Fig. 3a). Additionally, GPT-4 often missed gas reagents like H2S or HCl. Moreover, in terms of correctness, GPT-4 struggled with accurately reporting temperatures during multi-step synthesis, which suggests difficulties in effectively extracting intricate details throughout complex synthesis processes.
For Claude, the main issue affecting completeness was missing certain conditions or concentrations, and regarding correctness, Claude made mistakes related to misinterpreting methods or temperature values. For instance, as shown in Fig. 3b, Claude incorrectly identified the synthesis method, mistaking liquid–liquid diffusion (used for synthesizing single crystals) for a method applicable to polycrystals, indicating that LLMs can confuse similar locations or content, leading to ‘cross-contamination’ in their outputs. As a result, although Claude generally captured more synthesis details, it occasionally struggled to accurately interpret specific experimental procedures.
For Gemini, the primary issue in completeness was missing amounts of chemicals. Fig. 3c shows how Gemini incorrectly interpreted “DMF–H2O mixture (v/v 1/1, 1 mL)” as 1 mL each of DMF and H2O instead of a total volume of 1 mL. In terms of correctness, Gemini also faced challenges with incorrect quantities, often due to ambiguous measurements. These errors indicate that Gemini's thorough approach sometimes led to incorrect extrapolation of details.
For the characterization-free compliance evaluation, as shown in Fig. 3d, GPT-4 frequently included characterization data, while Claude and Gemini were more effective in excluding such information, which aligns with the earlier quantitative results (Fig. 1). Therefore, Claude and Gemini demonstrated a better ability to distinguish synthesis-specific data from characterization details, which is crucial for creating a focused and relevant dataset.
Overall, the error analysis highlights specific areas where each LLM excelled or struggled. We found that LLMs often lost or misinterpreted numerical information during extraction, such as reagent amounts, temperatures, and concentrations, whereas errors related to reagent or product names were rare. This discrepancy may be due to the inherent mechanisms of LLM generation. Reagent and product names typically appear only once in synthesis descriptions, enabling LLMs to store them accurately without confusion. In contrast, similar numerical data is presented in multiple contexts. This can lead to confusion or overwriting of previously stored information, resulting in cross-contamination of details. Understanding this pattern allows us to better determine which types of information should be prioritized in databases to enhance overall accuracy and reliability.
For the single-hop task, we evaluated the three LLMs—GPT-4, Claude, and Gemini—based on accuracy (to assess the quality of the answers), groundedness (to evaluate whether the questions were derived from the paper), precision (to determine whether questions were derived from the main text and answered correctly), and hallucination capture rate (to assess the self-correction ability of LLMs). As shown in Fig. 4a, both Claude and Gemini achieved scores exceeding 99% in accuracy, groundedness, and precision, and the hallucination capture rates of both were 50.0%. In contrast, GPT-4's accuracy was 90.7%, while groundedness was slightly higher at 92.6%, precision was 89.8%, and hallucination capture rate was only 11.4%. These results indicate that Claude and Gemini are highly reliable for generating single-hop Q&A datasets, making them suitable as benchmarks for future studies. The consistently high performance of these two models suggests a strong ability to extract relevant information accurately while maintaining a high standard of quality across diverse question types and categories.
Fig. 4b indicates that the results for the multi-hop task were similar to those for the single-hop task. Claude and Gemini continued to perform exceptionally well. Notably, GPT-4 showed slight improvements in both groundedness and precision, with significant increases in accuracy and hallucination capture rate, bringing them closer to the levels achieved by Claude and Gemini. As previously reported,12 this improvement may be attributed to GPT-4 “thinking” more thoroughly when responding to the revised prompt, leading to better self-correction when faced with hallucinated questions in the multi-hop task compared to the single-hop task. This highlights the importance of prompt engineering in leveraging the full capabilities of LLMs, especially when dealing with nuanced, multi-step reasoning tasks. Moreover, recent work12 has shown that prompt engineering can help encourage the generation of Q&A pairs with multi-hop reasoning, which helps better differentiate the capabilities of various LLMs and push the boundaries of their understanding.
During our evaluation, we found that although the prompt did not explicitly require it, Gemini proactively numbered the different types of questions (Fig. S3–S5†), resulting in more organized responses. This approach suggests a higher level of understanding of structured data generation. To quantify this behavior, we introduced an alignment parameter (Table 1) to compare how well different LLMs adhered to formatting requirements. As shown in Table 1, GPT-4's responses (Fig. S6†) rarely followed the expected format, approximately half of Claude's responses were organized, and nearly all of Gemini's responses met the formatting criteria. This difference in adherence reflects the models' varying abilities to interpret implicit organizational cues. We believe that adherence to proper formatting is crucial for creating structured datasets in the future, as it significantly enhances ease of use, reduces the need for manual adjustments, and facilitates subsequent analysis.
Table 1 Alignment of the LLM-generated Q&A datasets with the required question distribution (number of aligned DOIs out of 50)

| Model | Single-hop Q&A | Multi-hop Q&A |
|---|---|---|
| GPT-4 | 3/50 | 2/50 |
| Claude | 29/50 | 20/50 |
| Gemini | 44/50 | 45/50 |
Overall, based on these findings, we conclude that Gemini offers several advantages in generating Q&A datasets: high accuracy, well-structured responses, and lower cost ($0.18 per DOI, Table S2†) compared to other models. Moreover, Gemini's ability to independently impose structure on its outputs suggests that it is particularly well-suited for applications where consistency and ease of integration are paramount.
Overall, Claude strikes a balance between accuracy and coverage for generating synthesis conditions datasets, while Gemini's completeness and structured approach make it the best choice for creating comprehensive Q&A datasets. These findings show that LLMs can help build scientific databases, but improvements in prompt design and preprocessing are needed to make them truly effective.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5dd00081e
‡ Equal contribution.
This journal is © The Royal Society of Chemistry 2025