Auto-generating Question-Answering Datasets with Domain-Specific Knowledge for Language Models in Scientific Tasks

Abstract

Global software giants are relying on a ‘knowledge is power’ style of competitive advantage to back their leading foundational language models. However, many seek to democratize artificial intelligence (AI) so that it can be used by anyone as a force for good to help solve some of the world’s largest societal challenges, from reducing poverty to offsetting climate change. This work helps to democratize AI for the energy sector by fine-tuning language models with large domain-specific question-answering (QA) datasets about photovoltaic devices; these QA datasets have been auto-generated from custom materials databases that we have produced using the ‘chemistry-aware’ text-mining tool, ChemDataExtractor. By loading domain-specific knowledge into large QA datasets that fine-tune language models, rather than loading as much knowledge as possible into foundational language models, one can shift the power of knowledge out of language models and into QA datasets. This knowledge shifting decouples the performance of language models from their size; instead, performance becomes governed by the size and quality of the domain-specific QA datasets used for fine-tuning. Consequently, such QA datasets need only fine-tune modest-sized language models. This has enormous implications for the development and application of language models because it massively reduces the computational resources needed to tailor them to a given application, to the extent that the public will be able to create their own language models, i.e., we can fully democratize AI.
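As a concrete illustration of the auto-generation step described above, the following Python sketch converts structured property records, of the kind that ChemDataExtractor can text-mine from the literature, into QA pairs via a fixed question template. The record fields, example values, and template here are hypothetical illustrations for exposition, not the paper's actual database schema or question templates.

    # Minimal sketch of template-based QA-pair generation from a materials
    # database. Field names, example records, and the question template are
    # assumed for illustration, not taken from the paper.

    import json

    # Example records, as might be text-mined from the photovoltaics
    # literature with ChemDataExtractor (schema is hypothetical).
    records = [
        {"compound": "CH3NH3PbI3", "property": "open-circuit voltage",
         "value": 1.05, "unit": "V"},
        {"compound": "TiO2", "property": "band gap",
         "value": 3.2, "unit": "eV"},
    ]

    def generate_qa_pairs(records):
        """Convert structured property records into question-answer pairs."""
        qa_pairs = []
        for rec in records:
            question = f"What is the {rec['property']} of {rec['compound']}?"
            answer = f"{rec['value']} {rec['unit']}"
            qa_pairs.append({"question": question, "answer": answer})
        return qa_pairs

    if __name__ == "__main__":
        # Write the QA dataset in JSON Lines, a common fine-tuning format.
        with open("qa_dataset.jsonl", "w") as f:
            for pair in generate_qa_pairs(records):
                f.write(json.dumps(pair) + "\n")

Because each extracted record can be expanded into one or more QA pairs in this way, the size and quality of the fine-tuning dataset scale with the underlying materials database rather than with the language model itself.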

Article information

Article type: Paper
Submitted: 25 Sep 2024
Accepted: 19 Feb 2025
First published: 24 Feb 2025
This article is Open Access under a Creative Commons BY-NC license.

Z. Li and J. Cole, Digital Discovery, 2024, Accepted Manuscript, DOI: 10.1039/D4DD00307A

This article is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported Licence. You can use material from this article in other publications, without requesting further permission from the RSC, provided that the correct acknowledgement is given and it is not used for commercial purposes.
