Mohd Zaki,a Jayadeva,bc Mausam cd and N. M. Anoop Krishnan *ac
aDepartment of Civil Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India. E-mail: krishnan@iitd.ac.in; cez198233@iitd.ac.in
bDepartment of Electrical Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India
cYardi School of Artificial Intelligence, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India
dDepartment of Computer Science & Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India
First published on 20th December 2023
Information extraction and textual comprehension from materials literature are vital for developing an exhaustive knowledge base that enables accelerated materials discovery. Language models have demonstrated their capability to answer domain-specific questions and retrieve information from knowledge bases. However, there are no benchmark datasets in the materials science domain that can be used to evaluate the understanding of key concepts by these language models. In this work, we curate a dataset of 650 challenging questions from the materials domain that require the knowledge and skills of a materials science student who has completed their undergraduate degree. We classify these questions based on their structure and the materials science domain-based subcategories. Further, we evaluate the performance of the LLaMA-2-70B, GPT-3.5, and GPT-4 models on solving these questions via zero-shot and chain of thought prompting. It is observed that GPT-4 gives the best performance (∼62% accuracy) compared to the other models. Interestingly, in contrast to the general observation, no significant improvement in accuracy is observed with chain of thought prompting. To evaluate the limitations, we performed an error analysis, which revealed conceptual errors (∼72%) as the major contributor, compared to computational errors (∼28%), to the reduced performance of the LLMs. We also compared GPT-4 with human performance and observed that GPT-4 is better than an average student and comes close to passing the exam. We also show applications of the best performing model (GPT-4) to composition extraction from tables of materials science research papers and to code writing tasks. While GPT-4 performs poorly on composition extraction, it outperforms all other models on the code writing task. We hope that the dataset, analysis, and applications discussed in this work will promote further research in developing better materials science domain-specific LLMs and strategies for information extraction.
There exist several large datasets, such as MMLU,22,23 HellaSwag,24 WinoGrande,25 HumanEval,10 and DROP,26 for evaluating the capabilities of LLMs. However, there are limited datasets in the materials science domain for assessing their question-answering abilities. Table 1 lists datasets related to mathematics, chemistry, and materials science suitable for evaluating LLMs. In addition to these datasets, Jablonka et al. (2023) demonstrated the application of LLMs on 14 chemistry- and materials-science-specific datasets.27 Further, based on the datasets listed in Table 1, researchers have attempted to finetune materials-domain LLMs, proposing DARWIN28 and HoneyBee,29 and have compared their performance with that of LLaMA, GPT-3.5, and GPT-4 on different tasks.
Dataset | Description |
---|---|
GSM8K30 | A set of 8.5 K linguistically diverse grade school mathematics word problems |
AI2RC (ARC)31 | A set of ∼7.7 K school-level science questions |
ChemistryQA32 | A dataset of 4.5 K chemistry question answers |
ScienceQA33 | ∼21 K multimodal multiple choice questions from natural, language, and social sciences |
SciQ34 | Crowdsourced dataset of ∼13.7 K science questions |
MatSci-Instruct29 | 52 K synthetically generated instructions dataset to finetune LLMs for materials science information extraction |
JEEBench35 | 450 questions on physics, chemistry, and mathematics from JEE advanced examination of India for admission to IITs |
Although these datasets are diverse, the complexity of questions asked in examinations testing students who have completed their undergraduate-level education is quite different from that of the existing datasets. Therefore, developing such a dataset is crucial for investigating the materials science domain knowledge of these LLMs so that they can be further used to address challenging problems related to materials discovery in areas such as manufacturing, energy, environment, and sustainability. This information is also essential for understanding the lacunae in the understanding of such LLMs, which have been proposed for use in several domains such as manufacturing, planning, material synthesis, and materials discovery.14,19
To this end, we collected questions that require students to have an undergraduate-level understanding of materials science topics to solve them. These questions and answers are carefully curated from the original questions in the graduate aptitude test in engineering (GATE) exam, a national-level examination for graduate admission in India. More than 800,000 students take this exam annually, with an average of 100,000 students in major disciplines, such as mechanical or civil engineering, seeking to enroll in master's/doctoral courses in the premier institutes in India. We classify these questions based on (a) their structure, which leads to 4 types of questions, and (b) the domain knowledge required to solve them, which divides the database into 14 categories. The questions in MaScQA are also diverse in length, ranging from 9 to 145 words per question, with an average of 50 words (see Fig. S1†). We then evaluate the performance of state-of-the-art proprietary models, GPT-3.5 and GPT-4, and an open-source model, LLaMA-2-70B, in solving these questions. The availability of MaScQA will allow researchers to benchmark existing models and prompting strategies. Specifically, the analysis from a domain-specific perspective will enable researchers to train better domain-specific LLMs and help them decide where these models can be used in the materials discovery pipeline. Note that MaScQA is an open database to which other researchers can also contribute questions to increase the diversity of topics on which LLMs can be evaluated.
Finally, we evaluate LLMs on domain-specific tasks and compare their performance with that of existing models suitable for such tasks. The first task, introduced by Gupta et al. (2023),17 is composition extraction from tables in materials-related research papers. The second task, related to code writing, employs the dataset released by White et al. (2022),36 a compilation of ∼100 Python functions comprising a docstring, a return statement, and an [insert] token that has to be replaced upon prompting the LLM. The performance of GPT-4 on these tasks further allows researchers to devise strategies for task-oriented finetuning of the LLMs. Overall, we try to answer the following questions in this paper:
1. How well do general-purpose LLMs perform in answering complex questions from the materials science domain?
2. Can we improve the performance of the LLMs by using the chain of thought prompting methods?
3. What are the factors limiting the performance of these LLMs?
4. Can LLMs be used for accelerated materials modelling and design through information extraction and code writing?
To understand the performance of LLMs from the materials science domain perspective, we classified the questions into 14 categories. The list of categories was prepared in consultation with domain experts who teach materials science subjects at the institute where this research was conducted. Two experts then assigned each question to one of the categories. Conflicts in the category assignments were resolved through discussion and mutual agreement. Fig. 2 shows the number of questions in each category. The color of the bars represents the broad category of materials science topics under which each subtopic is shown in the graphical abstract. The database can be accessed at https://github.com/M3RG-IITD/MaScQA.
The solutions to all the questions obtained using the two prompting approaches for the evaluated models can be accessed at https://github.com/M3RG-IITD/MaScQA. The official answer keys were obtained from the website of IIT Kharagpur, one of the organizing institutes of the GATE exam (https://gate.iitkgp.ac.in/old_question_papers.html). The LLMs' performance with the two prompting methods is discussed in detail in the following sections.
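For reference, the following minimal sketch illustrates how a question can be posed to a model under the two prompting strategies. It assumes the `openai` Python client (version ≥ 1.0); the exact prompt wording used in this work is available in the GitHub repository, and the templates below are illustrative only.

```python
# Minimal sketch of the two prompting strategies, assuming the `openai` Python
# client (>= 1.0). The exact prompts used in this work are in the GitHub
# repository; the templates below are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ZERO_SHOT = "Answer the following materials science question.\n\n{question}"
CHAIN_OF_THOUGHT = (
    "Answer the following materials science question. "
    "Let's think step by step.\n\n{question}"
)

def ask(question: str, template: str, model: str = "gpt-4") -> str:
    """Query the model with the chosen prompting strategy and return its reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": template.format(question=question)}],
        temperature=0,  # deterministic decoding for reproducible evaluation
    )
    return response.choices[0].message.content

# The same question evaluated with both strategies:
q = "What is the coordination number of the FCC crystal structure?"
print(ask(q, ZERO_SHOT))
print(ask(q, CHAIN_OF_THOUGHT))
```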
Fig. 2 shows the number of questions in different domain-specific categories. To visualize the frequently used words in each domain-specific category of questions, word clouds are shown in Fig. 3, and the top 10 most frequently occurring words are shown in Fig. 4. The maximum number of questions (114) is in the thermodynamics category, which deals with questions related to the enthalpy of formation, energy balance during chemical reactions, transition temperatures, activation energy, and heat transfer (Fig. 3(a)); this is also reflected by frequently occurring words like energy, k (kelvin), j (joule), and g (indicating the gaseous state of chemicals in reactions and units of material properties), as shown in Fig. 4(a). The atomic structure category comprises 100 questions, which are based on concepts such as dislocations, diffraction planes, and crystal structures (Fig. 3(b) and 4(b)). The mechanical behavior category is based on the concepts of the stress–strain behavior of materials, creep, fatigue, and fracture mechanics (Fig. 3(c)). Further, the presence of words like "mpa and gpa" (Fig. 4(c)), which are units of stress and strength (MPa and GPa), indicates the correct classification of these questions. In materials manufacturing (Fig. 3(d) and 4(d)) and material applications (Fig. 3(e) and 4(e)), the questions test knowledge of the extraction processes of materials from their respective ores and of why a particular material, e.g., oxides or alloys, is used for a specific application. Thus, these questions require logical understanding connecting multiple concepts: first, "recall" or "deduce" the properties of a material based on its composition, label, or processing conditions; second, "identify" the properties required for a particular application; and then connect these two concepts to "derive" a logical explanation and arrive at the correct answer. The questions on phase transition test knowledge of how phase transitions can be induced in materials, how to calculate the percentage of different phases in the materials, and the characteristics of different phases. This is also indicated by the high frequency of words related to different phases of materials (Fig. 3(f) and 4(f)). The questions on electrical properties include fuel cells, characteristics of materials used in batteries, and semiconductor devices (Fig. 3(g)). This is also seen in the top-10 words in this domain (Fig. 4(g)), which include electron, v (volt), and electrode. Next, questions on material processing cover topics such as welding, annealing, tempering, and recrystallization (Fig. 3(h) and 4(h)). The questions on transport phenomena test concepts related to the diffusion or transport of ions, corrosion, and the duration of these phenomena (Fig. 3(i) and 4(i)). The questions related to magnetic properties test knowledge about magnetization and the characteristics of different magnetic materials (Fig. 3(j) and 4(j)). The material characterization topic has questions related to methods like scanning electron microscopy, diffraction studies, and back-scattered electron microscopy (Fig. 3(k) and 4(k)). The fluid mechanics topic comprises questions on the viscosity of fluids and the movement of particles in a viscous medium (Fig. 3(l) and 4(l)). In the material testing topic, the questions are based primarily on non-destructive material testing methods (Fig. 3(m) and 4(m)).
The miscellaneous category deals with questions requiring a simultaneous understanding of multiple materials science domains like optical properties, piezoelectricity, and microscopy for their solution (Fig. 3(n) and 4(n)).
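The word clouds and top-10 word counts can be reproduced with standard text-processing tools; the sketch below is one possible way of doing so, assuming the `wordcloud` package and a simple frequency count (the exact tooling used in this work is not specified here).

```python
# Sketch of how per-category word clouds (Fig. 3) and top-10 word counts
# (Fig. 4) can be produced; the `wordcloud` package and a plain Counter are
# assumptions, not necessarily the tooling used in the paper.
from collections import Counter

from wordcloud import STOPWORDS, WordCloud

def top_words(questions: list[str], k: int = 10) -> list[tuple[str, int]]:
    """Return the k most frequent non-stopword tokens in one category."""
    tokens = [
        word
        for question in questions
        for word in question.lower().split()
        if word not in STOPWORDS
    ]
    return Counter(tokens).most_common(k)

def make_wordcloud(questions: list[str]) -> WordCloud:
    """Build a word cloud for one domain-specific category."""
    return WordCloud(stopwords=STOPWORDS, background_color="white").generate(
        " ".join(questions)
    )

# Usage (hypothetical): thermo_questions = [...]  # all thermodynamics questions
# top_words(thermo_questions) -> e.g. [("energy", 91), ("k", 64), ...]
```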
Evaluation method | MCQ (284) | Matching (MATCH) (70) | Numerical with MCQ (MCQN) (68) | Numerical (NUM) (228) | Overall accuracy |
---|---|---|---|---|---|
Baseline scores | 25 | 25 | 25 | 0 | |
LLaMA-70B-CoT | 41.20 | 22.86 | 20.59 | 3.95 | 24.0 |
GPT-3.5 | 56.69 | 40.00 | 35.29 | 15.79 | 38.31 |
GPT-3.5-CoT | 57.04 | 38.57 | 33.82 | 14.91 | 37.85 |
GPT-4 | 74.65 | 88.57 | 58.82 | 37.28 | 61.38 |
GPT-4-CoT | 77.11 | 92.86 | 50.00 | 39.04 | 62.62 |
Here, GPT-4-CoT gives an accuracy of 77.11% on MCQ questions, which is a high score considering the difficulty level of this exam. The performance of GPT-4-CoT is also ∼20 percentage points higher than that of GPT-3.5-CoT for MCQ-type questions. For MATCH questions, GPT-4-CoT exhibits the best performance with a score of 92.86%, a very high score considering the amount of knowledge required to connect the entities. In contrast, the variants of GPT-3.5 performed poorly on MATCH questions, with scores of 40% and 38.57% for the variants without and with CoT, respectively. In this case, GPT-4-CoT provides a ∼4 percentage point improvement over direct prompting. For MCQN questions, GPT-4 gives the best performance with a score of 58.82%, while CoT reduces the model's performance to 50.0%. The same trend of reduced performance on these questions is observed with the GPT-3.5 model. This implies that CoT prompting may not always lead to better performance. Now, we focus on the numerical questions. Among all the categories, the models exhibit the worst performance in the NUM category. Here, GPT-4 and GPT-4-CoT obtain the highest scores of 37.28% and 39.04%, respectively. Interestingly, we observe that CoT yields poorer results in the case of GPT-3.5, while it yields better accuracy in the case of GPT-4.
Finally, regarding overall performance, GPT-4-CoT gives the best score of 62.62%, with GPT-4 following closely at 61.38%. It should be noted that the MCQ set contains 13 questions where more than one option is correct, of which GPT-4 and GPT-4-CoT answered six and seven correctly, respectively. Interestingly, we observe that CoT does not always give improved results. In fact, for GPT-3.5, CoT gives poorer results in all cases except MCQs, and it gives marginally better results for GPT-4 in all cases except MCQN. Note that this observation contrasts with the general finding that CoT prompting improves the performance of LLMs on QA tasks. To identify whether the differences in overall performance of the LLM-based evaluation strategies on MaScQA are statistically significant, we perform a paired t-test on the performance of two evaluation strategies at a time and report the resulting p-values in Table 3. The null hypothesis tested is "there is no significant difference between the performance of two LLM-based evaluation strategies in solving the questions of MaScQA". Since the p-values when comparing GPT-3.5 with GPT-3.5-CoT and GPT-4 with GPT-4-CoT are well above 0.05, the null hypothesis cannot be rejected in these two cases, i.e., there is no significant difference between the performance of the two evaluation strategies. In all other cases, the p-values are lower than 0.05, implying a significant difference in the performance of the LLM-based evaluation strategies.
LLMs | GPT-3.5 | GPT-3.5-CoT | GPT-4 | GPT-4-CoT |
---|---|---|---|---|
GPT-3.5-CoT | 0.864 | |||
GPT-4 | 3.56 × 10⁻¹⁷ | 7.96 × 10⁻¹⁸ | | |
GPT-4-CoT | 6.12 × 10⁻¹⁹ | 1.26 × 10⁻¹⁹ | 0.648 | |
LLaMA-70B-CoT | 2.17 × 10⁻⁸ | 5.75 × 10⁻⁸ | 1.76 × 10⁻⁴² | 2.17 × 10⁻⁴⁸ |
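The paired t-test reported in Table 3 can be reproduced as sketched below, assuming the per-question correctness (1 = correct, 0 = incorrect) of each evaluation strategy has been recorded over the same 650 questions; the variable names are illustrative.

```python
# Sketch of the paired t-test behind Table 3, assuming per-question correctness
# vectors (1 = correct, 0 = incorrect) over the same set of questions.
from scipy.stats import ttest_rel

def compare_strategies(correct_a: list[int], correct_b: list[int]) -> float:
    """Return the p-value of a paired t-test between two evaluation strategies."""
    statistic, p_value = ttest_rel(correct_a, correct_b)
    return p_value

# Example with hypothetical score vectors:
# p = compare_strategies(gpt4_scores, gpt4_cot_scores)
# p > 0.05 -> no significant difference between the two strategies.
```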
In addition to evaluating the performance of LLMs in answering different types of questions (MCQ, MATCH, MCQN, and NUM), which test different abilities of the students, it is also essential to analyze the performance of the models from a domain perspective. To this end, we classify all the questions of our dataset into 14 broad categories. Fig. 5 shows the accuracy of the GPT-4-CoT prompting method on these questions. Since the number of questions differs in each category, we report the percentage of questions answered correctly and incorrectly to allow proper comparison. The number of questions for each case is written in white inside the respective bars. It is observed that questions related to the mechanical behavior and electrical properties of materials have the highest percentage of incorrectly answered questions (∼60%). The questions on thermodynamics, atomic structure, magnetic properties, transport phenomena, and phase transition have ∼40% incorrectly answered questions in the respective categories. Further, ∼30% of materials manufacturing and characterization questions are incorrectly answered. In the categories of fluid mechanics and materials applications, ∼15% of questions are incorrectly answered, with the lowest error rates for material processing and no mistakes made on material testing questions. To gain further insights into the factors limiting the LLMs' performance, we discuss the errors by classifying them into two categories, as explained in the Discussion section.
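The per-category accuracies plotted in Fig. 5 can be computed as sketched below, assuming the graded responses are stored in a pandas DataFrame with hypothetical columns "category" (one of the 14 domains) and "correct" (boolean).

```python
# Sketch of the per-category accuracy underlying Fig. 5; the column names
# "category" and "correct" are assumptions for illustration.
import pandas as pd

def category_accuracy(results: pd.DataFrame) -> pd.DataFrame:
    """Percentage of correctly and incorrectly answered questions per domain."""
    grouped = results.groupby("category")["correct"]
    summary = pd.DataFrame(
        {
            "n_questions": grouped.size(),
            "percent_correct": 100 * grouped.mean(),
        }
    )
    summary["percent_incorrect"] = 100 - summary["percent_correct"]
    return summary.sort_values("percent_incorrect", ascending=False)
```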
To analyze the different errors, we perform error analysis on the GPT-4-CoT responses because this strategy performed best on MaScQA. We take all the questions answered incorrectly by GPT-4-CoT, of which 139 are NUM, 65 are MCQ, 34 are MCQN, and 5 are matching-type (MATCH) questions (Fig. 6). The number of incorrectly answered questions across materials science sub-domains is shown in Fig. 7. Note that a question may contain both conceptual and numerical errors, but we have considered only the conceptual error in such questions since it is the first to be found. If the retrieved concept is incorrect, we deem the computational error secondary.
Fig. 7 Types of errors made by GPT-4-CoT on questions classified from a materials science perspective.
Fig. 6 shows the distribution of errors made by GPT-4-CoT across the different categories of questions based on their structure. The text inside the bars representing conceptual and computational errors shows the number of questions in the respective category. The analysis of all the incorrectly answered questions reveals that the majority of errors are conceptual. Further, in MCQ and MATCH type questions, the error is always conceptual because answering such questions requires retrieving the appropriate concepts and facts and then connecting them with the relevant options. For MCQN and NUM, the majority of the questions (∼65% and ∼59%, respectively) were answered incorrectly due to conceptual errors, implying the need for domain-specific models or better prompting and problem-solving approaches.
As mentioned earlier, we observed that GPT-4-CoT made no grounding errors. To evaluate whether this is due to the effectiveness of CoT, we investigate questions that were answered incorrectly by GPT-4 but correctly by GPT-4-CoT. Out of 66 such questions from the entire dataset, GPT-4's solutions had ∼70% conceptual errors, ∼30% computational errors, and no grounding errors. Further, we also analyzed the questions answered incorrectly by GPT-4-CoT but correctly by GPT-4. There were 58 such questions in the complete dataset. Of these, the solutions of 45 questions (∼78%) had conceptual errors; one question had a grounding error, and the remaining 12 questions had computational errors when solved using GPT-4-CoT. Since there are little to no grounding errors in either GPT-4 or GPT-4-CoT, both models are adept in this regard. CoT prompting does, however, help reduce some numerical errors.
Fig. 7 shows the domain-wise distribution of conceptual and computational errors on all the questions answered incorrectly by GPT-4-CoT. The number written in white over the colored bars represents the number of questions in each case. All categories have conceptual errors in more than 50% of the respective questions except for transport phenomena (∼45%) and fluid mechanics. We now discuss some conceptual errors in different domains. The list of all questions subjected to this analysis is provided in the GitHub repository of this work.
Fig. 8(a) shows an example of a conceptual error made on a question related to thermodynamics. In this question, instead of treating the coefficient of thermal expansion as the same in the two planar directions, GPT-4-CoT treated the coefficient in the perpendicular direction as equal to that in one of the planar directions. Mathematically, instead of obtaining the volumetric coefficient as 2 × parallel + perpendicular, GPT-4-CoT used parallel + 2 × perpendicular, leading to an incorrect answer. While solving a question on atomic structure, as shown in Fig. 8(b), GPT-4-CoT used an incorrect relation between the lattice parameter (a) and the atomic diameter (D). In a question on the electrical properties of materials (Fig. 8(c)), GPT-4-CoT answered that all the given statements were correct and hence could not choose from the four options shown as answers. According to materials science domain knowledge and the Wikipedia entry on Pourbaix diagrams, one of their major limitations is that these diagrams do not estimate actual corrosion rates; moreover, these diagrams cannot be used to study corrosion due to chloride ions. Hence, statement R is incorrect, making (C) the correct choice. While solving the question shown in Fig. 8(d), GPT-4-CoT did not convert the lattice parameter into the atomic diameter and treated them as the same when using the formula required for solving the problem. For a question on materials manufacturing (Fig. 8(e)), GPT-4-CoT retrieved the functions of (P) blast furnace slag and (R) torpedo car in reverse, thus leading to the wrong answer, C, when the correct option was A. The complete solutions of GPT-4-CoT can be found in the GitHub repository of this work. Some examples of correct answers given by GPT-4-CoT for the four types of questions according to structure (MCQ, MATCH, MCQN, and NUM) are shown in the ESI (Fig. S2–S5).†
Fig. 8 Visualizing some of the questions where GPT-4-CoT made conceptual errors in the solution. The correct answers to each question are marked in bold and underlined.
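To make the thermal-expansion error in Fig. 8(a) concrete, the short illustration below contrasts the correct and incorrect combinations of the coefficients; the coefficient values are hypothetical, not those of the original question.

```python
# Numerical illustration of the thermal-expansion mix-up in Fig. 8(a), using
# hypothetical coefficient values. For a material with two equivalent in-plane
# directions, the volumetric coefficient is 2*alpha_parallel + alpha_perpendicular.
alpha_parallel = 10e-6       # 1/K, in-plane coefficient (hypothetical)
alpha_perpendicular = 30e-6  # 1/K, out-of-plane coefficient (hypothetical)

correct = 2 * alpha_parallel + alpha_perpendicular    # 5.0e-5 1/K
incorrect = alpha_parallel + 2 * alpha_perpendicular  # 7.0e-5 1/K (the mix-up)
print(f"correct: {correct:.1e} 1/K, incorrect: {incorrect:.1e} 1/K")
```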
To summarise, CoT prompting cannot significantly improve the LLM performance because the mistakes are mainly conceptual. This makes a strong case for a domain-specific LLM for materials and potentially for domain-specific alternate prompting strategies. Further, for questions where the LLMs give an incorrect response due to computational errors, the solutions involved unit conversions, logarithms, and exponentials and contained numbers with multiplying factors (e.g., 10¹⁰). There have been recent works in the literature that suggest methods for improving calculations and for addressing concept-based mistakes.44 Introducing such heuristics while prompting can help researchers in two ways: (1) probe the existing LLMs more deeply and (2) generate datasets to train LLMs with fewer parameters, thus making the use of these models economical. This answers the third research question (limiting factors for LLMs) raised in this work.
Year | Maximum marks | Cut-off marks | Average marks | GPT-4-CoT marks |
---|---|---|---|---|
2020 | 83 | 49.2 | N.A. | 46.46 |
2021 | 87.67 | 48.5 | 28.7 | 42.86 |
2022 | 77.67 | 46.2 | 27.6 | 38.62 |
Fig. 9 (a) System message and (b) table given as the prompt for extracting material compositions from tables using the GPT-4 API.
In the composition extraction task, the extracted compositions must be expressed as a set of tuples containing the material ID (as defined in the paper), the constituent elements or compounds, the corresponding percentage contributions, and the corresponding units. To evaluate the performance on this task, two categories of metrics are used: tuple-level precision, recall, and F1-scores, and material-level precision, recall, and F1-scores. Tuple-level metrics assess whether the individual components of a material are extracted along with their values and units. In contrast, the material-level metrics additionally require the material ID to be present in the extracted tuple. For more details about the extraction task and metrics, readers are referred to the paper introducing this dataset and the associated models.17 Table 5 shows the performance of GPT-4 along with that of DiSCoMaT on the same dataset. Since GPT-4 is not specifically trained for this task, it produces extra text like "The extracted compositions are…", which is incompatible with the evaluation pipeline used in DiSCoMaT. Therefore, we analyse only the relevant part of the extractions from the GPT-4 output. The lower performance of GPT-4 compared to DiSCoMaT can be attributed to the fact that GPT-4 was not pre-trained/finetuned for such tasks. The mistakes made by GPT-4 include non-extraction of material IDs, failure to normalise the component values when the sum of all components is not 100, and failure to extract nominal compositions when both nominal and experimental compositions are reported in the table. These mistakes thus constitute both computational and factual errors, as investigated in the Discussion section of this paper. The dataset of 100 compositions, the prompts used for this study, and the responses of GPT-4 are provided in the GitHub repository of this work.
Model | Tuple level precision | Tuple level recall | Tuple level F1 | Material level precision | Material level recall | Material level F1 |
---|---|---|---|---|---|---|
GPT-4 | 76.39 | 76.0 | 76.2 | 57.45 | 51.92 | 54.55 |
DiSCoMaT | 83.24 | 66.33 | 73.80 | 88.18 | 62.50 | 73.11 |
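The tuple-level metrics in Table 5 can be computed as sketched below, assuming the gold and predicted compositions are represented as sets of (constituent, value, unit) tuples; the material-level variant additionally includes the material ID in each tuple. The composition in the example is hypothetical.

```python
# Sketch of the tuple-level precision/recall/F1 used in Table 5, assuming
# compositions are represented as sets of (constituent, value, unit) tuples.
def precision_recall_f1(gold: set, predicted: set) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 between gold and predicted tuple sets."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1

# Example with a hypothetical glass composition from one table row:
gold = {("SiO2", 70.0, "mol%"), ("Na2O", 15.0, "mol%"), ("CaO", 15.0, "mol%")}
pred = {("SiO2", 70.0, "mol%"), ("Na2O", 15.0, "mol%")}
print(precision_recall_f1(gold, pred))  # (1.0, 0.666..., 0.8)
```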
Fig. 10 (a) An example prompt provided to the GPT-4 model for generating the complete output and (b) the response of GPT-4.
The performance of GPT-4, compared with the output of other models, is reported in Table 6. It was observed that most of the mistakes made by GPT-4 occur in code related to molecular dynamics, spectroscopy, chemical informatics, and quantum mechanics, which is consistent with the performance of GPT-4-CoT on MaScQA. Another interesting observation is that GPT-4 correctly answered all code-related thermodynamics questions (a total of ten questions), which is consistent with the observation that GPT-4 has a reasonable understanding of thermodynamics concepts and that its poor performance on MaScQA was mainly due to computational errors. We have provided the output of GPT-4 on all the questions in the GitHub repository of this work. Altogether, we observe that GPT-4 achieves state-of-the-art performance on the code writing task.
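For illustration, the example below mimics the structure of the benchmark prompts: a docstring, an [insert] token, and a return statement are supplied, and the model must replace [insert] with working code. This particular function is hypothetical and is not taken from the benchmark itself.

```python
# Illustrative example in the style of the code-completion benchmark; this
# function is hypothetical, not from the benchmark itself.
#
# Prompt given to the model:
#
# def arrhenius_rate(A, Ea, T):
#     """Return the Arrhenius rate constant for prefactor A (1/s),
#     activation energy Ea (J/mol), and temperature T (K)."""
#     [insert]
#     return k
#
# A plausible completion replacing [insert]:
import math

def arrhenius_rate(A, Ea, T):
    """Return the Arrhenius rate constant for prefactor A (1/s),
    activation energy Ea (J/mol), and temperature T (K)."""
    R = 8.314  # J/(mol K), universal gas constant
    k = A * math.exp(-Ea / (R * T))
    return k
```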
Interestingly, the results suggest that domain adaptation and task-specific prompting strategies are necessary to elicit the desired output from the LLMs. Therefore, the language models must be finetuned on domain-specific and task-specific datasets to enable the use of LLMs in the materials discovery pipeline. Moreover, the performance of the LLMs on MaScQA can enable a deeper understanding of the lacunae in the materials science knowledge of the LLMs, thereby providing new research avenues. For instance, the LLMs' poor performance on NUM questions suggests that a pipeline connecting the LLM to a math calculator can potentially yield improved results (a minimal sketch of such a pipeline is shown below). Further, the conceptual mistakes made by the LLMs indicate that the development of an LLM trained on materials literature could provide improved results. Materials science is a field that derives concepts from physics, chemistry, and mechanics. Therefore, a benchmark like MaScQA will allow researchers to evaluate their domain-specific models and prompting strategies against a standard dataset. Further, the correct solutions can help researchers create new datasets for training lightweight or small language models, which are economical and, hence, can be easily deployed on low-memory industrial devices for materials discovery and for educational purposes.
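As a minimal sketch of the calculator idea (one possible realisation assumed here, not part of this work), the LLM can be asked to emit only the arithmetic expression, which is then evaluated programmatically to avoid numerical slips:

```python
# One possible realisation of an "LLM + calculator" pipeline (an assumption,
# not the approach of this work): the model returns an arithmetic expression,
# which is evaluated in a restricted namespace instead of by the model itself.
import math

def evaluate_expression(expression: str) -> float:
    """Evaluate an arithmetic expression returned by the LLM in a restricted namespace."""
    allowed = {"exp": math.exp, "log": math.log, "log10": math.log10, "sqrt": math.sqrt}
    return eval(expression, {"__builtins__": {}}, allowed)

# Example: the LLM returns the expression instead of the final number.
expression_from_llm = "1.2e-4 * exp(-80000 / (8.314 * 1000))"
print(evaluate_expression(expression_from_llm))
```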
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3dd00188a
This journal is © The Royal Society of Chemistry 2024 |