Specialising and analysing instruction-tuned and byte-level language models for organic reaction prediction

Jiayun Pang; Ivan Vulić

doi:10.1039/D4FD00104D

Jiayun Pang*^a and Ivan Vulić*^b

Author affiliations

* Corresponding authors

^a School of Science, Faculty of Engineering and Science, University of Greenwich, Medway Campus, Central Avenue, Chatham Maritime, UK
E-mail: j.pang@gre.ac.uk

^b Language Technology Lab, University of Cambridge, 9 West Road, Cambridge CB3 9DA, UK
E-mail: iv250@cam.ac.uk

Abstract

Transformer-based encoder–decoder models have demonstrated impressive results in chemical reaction prediction tasks. However, these models typically rely on pretraining on tens of millions of unlabelled molecules, which can be time-consuming and GPU-intensive. One of the central questions we aim to answer in this work is: can FlanT5 and ByT5, encoder–decoder models pretrained solely on language data, be effectively specialised for organic reaction prediction through task-specific fine-tuning? We conduct a systematic empirical study of several key aspects of the process, including tokenisation, the impact of (SMILES-oriented) pretraining, fine-tuning sample efficiency, and decoding algorithms at inference. Our key findings indicate that, despite being pretrained only on language tasks, FlanT5 and ByT5 provide a solid foundation for fine-tuning on reaction prediction, and thus become ‘chemistry domain compatible’ in the process. This suggests that GPU-intensive and expensive pretraining on a large dataset of unlabelled molecules may be useful, yet is not essential, for leveraging the power of language models for chemistry. All our models achieve comparable Top-1 and Top-5 accuracy, although some variation across models does exist. Notably, tokenisation and vocabulary trimming only slightly affect final performance but can speed up training and inference; the most efficient decoding strategy, greedy decoding, is very competitive, while more sophisticated decoding algorithms yield only marginal gains. In summary, we evaluate FlanT5 and ByT5 across several dimensions and benchmark their impact on organic reaction prediction, which may guide more effective use of these state-of-the-art language models for chemistry-related tasks in the future.
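
For readers unfamiliar with the Top-1 and Top-5 metrics mentioned in the abstract, the following is a minimal sketch of how Top-k accuracy is commonly computed for reaction prediction: an example counts as correct if the reference product appears among the model's first k ranked candidates. It is not the authors' evaluation code. The function name `top_k_accuracy` and the toy SMILES strings are hypothetical, and the sketch assumes that predictions and references have already been canonicalised upstream (for instance with a cheminformatics toolkit), so that plain string equality is a meaningful comparison.

```python
from typing import Sequence


def top_k_accuracy(
    candidates: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int,
) -> float:
    """Fraction of examples whose target is among the first k ranked candidates.

    Assumption: all strings are already canonical SMILES, so exact string
    matching is used as the correctness criterion.
    """
    if len(candidates) != len(targets):
        raise ValueError("candidates and targets must have the same length")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not targets:
        return 0.0
    hits = sum(
        1 for ranked, target in zip(candidates, targets) if target in ranked[:k]
    )
    return hits / len(targets)


# Hypothetical toy data: three reactions, three ranked candidates each.
predictions = [
    ["CCO", "CC=O", "CCN"],          # reference is ranked first
    ["c1ccccc1", "CC(=O)O", "CCO"],  # reference is ranked second
    ["CCN", "CCC", "CCCl"],          # reference is not among the candidates
]
references = ["CCO", "CC(=O)O", "CCBr"]

top1 = top_k_accuracy(predictions, references, k=1)
top3 = top_k_accuracy(predictions, references, k=3)
print(f"Top-1: {top1:.3f}, Top-3: {top3:.3f}")
```

With greedy decoding only one candidate is produced per reaction, so only Top-1 can be reported; Top-5 requires a decoding strategy that returns several ranked candidates, such as beam search.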

This article is part of the themed collection: Data-driven discovery in the chemical sciences

Article information

DOI: https://doi.org/10.1039/D4FD00104D
Article type: Paper
Submitted: 16 May 2024
Accepted: 19 Aug 2024
First published: 19 Aug 2024

Faraday Discuss., 2024, Advance Article

J. Pang and I. Vulić, Faraday Discuss., 2024, Advance Article, DOI: 10.1039/D4FD00104D
