Molecular property prediction using pretrained-BERT and Bayesian active learning: a data-efficient approach to drug design

Muhammad Arslan Masood*, Samuel Kaski, Tianyu Cui

*Corresponding author for this work

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

In drug discovery, prioritizing compounds for experimental testing is a critical task that can be optimized through active learning by strategically selecting informative molecules. Active learning typically trains models on labeled examples alone, while unlabeled data is only used for acquisition. This fully supervised approach neglects valuable information present in unlabeled molecular data, impairing both predictive performance and the molecule selection process. We address this limitation by integrating a transformer-based BERT model, pretrained on 1.26 million compounds, into the active learning pipeline. This effectively disentangles representation learning and uncertainty estimation, leading to more reliable molecule selection. Experiments on Tox21 and ClinTox datasets demonstrate that our approach achieves equivalent toxic compound identification with 50% fewer iterations compared to conventional active learning. Analysis reveals that pretrained BERT representations generate a structured embedding space enabling reliable uncertainty estimation despite limited labeled data, confirmed through Expected Calibration Error measurements. This work establishes that combining pretrained molecular representations with active learning significantly improves both model performance and acquisition efficiency in drug discovery, providing a scalable framework for compound prioritization.

Scientific Contribution: We demonstrate that high-quality molecular representations fundamentally determine active learning success in drug discovery, outweighing acquisition strategy selection. We provide a framework that integrates pretrained transformer models with Bayesian active learning to separate representation learning from uncertainty estimation, a critical distinction in low-data scenarios. This approach establishes a foundation for more efficient screening workflows across diverse pharmaceutical applications.
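As a rough illustration of two quantities the abstract mentions, the sketch below shows an uncertainty-based acquisition step and a binary Expected Calibration Error (ECE) computation. It is not the authors' implementation: the abstract does not state which acquisition function was used, so ranking pool molecules by predictive entropy is an assumption made here, and the function names `binary_entropy`, `select_batch`, and `expected_calibration_error`, along with the 10-bin default, are hypothetical. The inputs are assumed to be predicted toxicity probabilities from some classifier trained on pretrained embeddings.

```python
import numpy as np


def binary_entropy(p):
    """Predictive entropy (in nats) of Bernoulli probabilities p."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))


def select_batch(pool_probs, batch_size):
    """Return indices of the pool molecules with the highest predictive entropy.

    Assumption: maximum-entropy acquisition stands in for whatever Bayesian
    acquisition rule the paper actually uses.
    """
    scores = binary_entropy(pool_probs)
    return np.argsort(-scores, kind="stable")[:batch_size]


def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: bin-weighted average of |accuracy - confidence|."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    preds = (probs >= 0.5).astype(int)
    # Confidence is the probability assigned to the predicted class.
    conf = np.where(preds == 1, probs, 1.0 - probs)
    correct = (preds == labels).astype(float)

    # Equal-width bins over [0, 1]; each bin includes its right edge.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(conf, edges[1:-1], right=True), 0, n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # mask.mean() is the fraction of samples that fall in this bin.
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```

In an active learning loop under these assumptions, `select_batch` would be called on the model's predictions for the unlabeled pool at each iteration, and `expected_calibration_error` would be evaluated on a held-out labeled set to check whether the predicted probabilities can be trusted for selection.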

Original language: English
Article number: 58
Pages (from-to): 1-15
Number of pages: 15
Journal: Journal of Cheminformatics
Volume: 17
Issue number: 1
Publication status: Published - Dec 2025
MoE publication type: A1 Journal article-refereed

Keywords

  • Active learning
  • Bayesian
  • BERT
  • Drug discovery
