Auto-generating Question-Answering Datasets with Domain-Specific Knowledge for Language Models in Scientific Tasks
Abstract
Global software giants are relying on a ‘knowledge is power’ style of competitive advantage to back their leading foundational language models. However, many seek to democratize artificial intelligence (AI) so that it can be used by anyone as a force for good to help solve some of the world’s largest societal challenges, from reducing poverty to offsetting climate change. This work helps to democratize AI for the energy sector by fine-tuning language models with large domain-specific question-answering (QA) datasets about photovoltaic devices; these QA datasets have been auto-generated from custom materials databases that we produced using the ‘chemistry-aware’ text-mining tool, ChemDataExtractor. By loading domain-specific knowledge into large QA datasets that fine-tune the language models, rather than loading as much knowledge as possible into the foundational language models themselves, one can shift the power of knowledge out of language models and into QA datasets. This knowledge shifting prevents the performance of language models from being reliant on their size; instead, performance becomes governed by the size and quality of the domain-specific QA datasets used for fine-tuning. This means that such QA datasets need only fine-tune modest-sized language models. This has enormous implications for the development and application of language models because it will massively reduce the computational resources needed to tailor them to a given application, to the extent that the public will be able to create their own language models, i.e., we can fully democratize AI.
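To make the pipeline concrete, below is a minimal sketch of how question-answer pairs might be auto-generated from records in a materials database produced by a text-mining tool such as ChemDataExtractor. The record fields, question templates, and helper function shown here are illustrative assumptions, not the exact schema or templates used in this work.

```python
# Illustrative sketch of template-based QA generation from a materials-property
# database record. The field names ("compound", "property", "value", "unit")
# and the question templates are assumptions for illustration only.

from typing import Dict, List

QUESTION_TEMPLATES = [
    "What is the {property} of {compound}?",
    "Which value of {property} has been reported for {compound}?",
]

def generate_qa_pairs(record: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn one database record into several question-answer pairs."""
    answer = f"{record['value']} {record['unit']}".strip()
    return [
        {"question": template.format(**record), "answer": answer}
        for template in QUESTION_TEMPLATES
    ]

if __name__ == "__main__":
    # Hypothetical record, e.g. extracted from the photovoltaics literature.
    example = {
        "compound": "CH3NH3PbI3",
        "property": "power conversion efficiency",
        "value": "19.3",
        "unit": "%",
    }
    for qa in generate_qa_pairs(example):
        print(qa)
```

Under these assumptions, each database record expands into several QA pairs, so even a modest extracted database can yield a large fine-tuning dataset.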