Abstract
In drug discovery, prioritizing compounds for experimental testing is a critical task that can be optimized through active learning by strategically selecting informative molecules. Active learning typically trains models on labeled examples alone, while unlabeled data is only used for acquisition. This fully supervised approach neglects valuable information present in unlabeled molecular data, impairing both predictive performance and the molecule selection process. We address this limitation by integrating a transformer-based BERT model, pretrained on 1.26 million compounds, into the active learning pipeline. This effectively disentangles representation learning and uncertainty estimation, leading to more reliable molecule selection. Experiments on Tox21 and ClinTox datasets demonstrate that our approach achieves equivalent toxic compound identification with 50% fewer iterations compared to conventional active learning. Analysis reveals that pretrained BERT representations generate a structured embedding space enabling reliable uncertainty estimation despite limited labeled data, confirmed through Expected Calibration Error measurements. This work establishes that combining pretrained molecular representations with active learning significantly improves both model performance and acquisition efficiency in drug discovery, providing a scalable framework for compound prioritization.
Scientific Contribution: We demonstrate that high-quality molecular representations fundamentally determine active learning success in drug discovery, outweighing acquisition strategy selection. We provide a framework that integrates pretrained transformer models with Bayesian active learning to separate representation learning from uncertainty estimation, a critical distinction in low-data scenarios. This approach establishes a foundation for more efficient screening workflows across diverse pharmaceutical applications.
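The abstract cites Expected Calibration Error (ECE) as the check on uncertainty quality. As a rough illustration only, the sketch below computes one common binned variant of ECE for a binary toxic/non-toxic classifier. The function name `expected_calibration_error`, the equal-width binning, and the default of 10 bins are assumptions made for this example, not details taken from the paper.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE for a binary classifier (illustrative variant, an assumption).

    probs  -- predicted probability of the positive (e.g. toxic) class, in [0, 1]
    labels -- ground-truth labels, 0 or 1
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if len(probs) == 0:
        raise ValueError("at least one prediction is required")

    # Per-bin accumulators: sum of predicted probabilities,
    # number of positive labels, and number of samples.
    prob_sums = [0.0] * n_bins
    pos_counts = [0] * n_bins
    sizes = [0] * n_bins

    for p, y in zip(probs, labels):
        # Equal-width bins over [0, 1]; p == 1.0 falls into the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        prob_sums[b] += p
        pos_counts[b] += y
        sizes[b] += 1

    n = len(probs)
    ece = 0.0
    for prob_sum, positives, size in zip(prob_sums, pos_counts, sizes):
        if size == 0:
            continue  # empty bins contribute nothing
        mean_predicted = prob_sum / size
        observed_rate = positives / size
        # Weight each bin's calibration gap by its share of the samples.
        ece += (size / n) * abs(observed_rate - mean_predicted)
    return ece
```

A lower value means predicted probabilities track observed toxicity rates more closely. In an active learning setting, this would typically be evaluated on a held-out set after each acquisition round.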
Field | Value |
---|---|
Original language | English |
Article number | 58 |
Pages (from-to) | 1-15 |
Number of pages | 15 |
Journal | Journal of Cheminformatics |
Volume | 17 |
Issue number | 1 |
DOIs | |
Publication status | Published - Dec 2025 |
MoE publication type | A1 Journal article-refereed |
Keywords
- Active learning
- Bayesian
- BERT
- Drug discovery
Fingerprint
Dive into the research topics of 'Molecular property prediction using pretrained-BERT and Bayesian active learning: a data-efficient approach to drug design'. Together they form a unique fingerprint.
Projects
2 finished projects:
- MSCA AIDD /Kaski S.: Advanced machine learning for Innovative Drug Discovery
  Kaski, S. (Principal investigator)
  01/01/2021 → 31/12/2024
  Project: EU H2020 MC
- Finnish Center for Artificial Intelligence
  Kaski, S. (Principal investigator)
  01/01/2019 → 31/12/2022
  Project: Academy of Finland: Other research funding