Abstract
In drug discovery, prioritizing compounds for experimental testing is a critical task that can be optimized through active learning by strategically selecting informative molecules. Active learning typically trains models on labeled examples alone, while unlabeled data is only used for acquisition. This fully supervised approach neglects valuable information present in unlabeled molecular data, impairing both predictive performance and the molecule selection process. We address this limitation by integrating a transformer-based BERT model, pretrained on 1.26 million compounds, into the active learning pipeline. This effectively disentangles representation learning and uncertainty estimation, leading to more reliable molecule selection. Experiments on Tox21 and ClinTox datasets demonstrate that our approach achieves equivalent toxic compound identification with 50% fewer iterations than conventional active learning. Analysis reveals that pretrained BERT representations generate a structured embedding space enabling reliable uncertainty estimation despite limited labeled data, confirmed through Expected Calibration Error measurements. This work establishes that combining pretrained molecular representations with active learning significantly improves both model performance and acquisition efficiency in drug discovery, providing a scalable framework for compound prioritization.

Scientific Contribution: We demonstrate that high-quality molecular representations fundamentally determine active learning success in drug discovery, outweighing acquisition strategy selection. We provide a framework that integrates pretrained transformer models with Bayesian active learning to separate representation learning from uncertainty estimation, a critical distinction in low-data scenarios. This approach establishes a foundation for more efficient screening workflows across diverse pharmaceutical applications.
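The abstract cites Expected Calibration Error (ECE) as the check on uncertainty quality but does not say how it was computed. The sketch below illustrates the standard binned definition for binary labels such as toxic versus non-toxic: predictions are grouped by confidence, and the gap between accuracy and mean confidence in each bin is averaged, weighted by bin size. The function name, the equal-width binning, and the default of 10 bins are assumptions for illustration, not details taken from the paper.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE for binary predictions (illustrative sketch).

    probs  -- predicted probability of the positive class, each in [0, 1]
    labels -- true labels, each 0 or 1
    n_bins -- number of equal-width confidence bins (assumed default)
    """
    if len(probs) != len(labels) or len(probs) == 0:
        raise ValueError("probs and labels must be non-empty and equal in length")

    conf_sum = [0.0] * n_bins   # summed confidence per bin
    correct = [0] * n_bins      # number of correct predictions per bin
    count = [0] * n_bins        # number of predictions per bin

    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        # Confidence is the probability assigned to the predicted class.
        conf = p if pred == 1 else 1.0 - p
        # Clamp so that conf == 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        correct[idx] += int(pred == y)
        count[idx] += 1

    total = len(probs)
    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue
        accuracy = correct[b] / count[b]
        mean_conf = conf_sum[b] / count[b]
        ece += (count[b] / total) * abs(accuracy - mean_conf)
    return ece
```

A value near 0 means the stated confidence matches the observed accuracy; larger values indicate over- or under-confidence. With few labeled molecules the bins are sparsely populated, so the estimate is noisy and the bin count is worth varying.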
| Original language | English |
|---|---|
| Article number | 58 |
| Pages | 1-15 |
| Number of pages | 15 |
| Journal | Journal of Cheminformatics |
| Volume | 17 |
| Issue number | 1 |
| DOIs | |
| Status | Published - Dec 2025 |
| MoEC publication type | A1 Original article in a scientific journal |
Funding
This study was partially funded by the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie Innovative Training Network European Industrial Doctorate grant agreement No. 956832 “Advanced Machine Learning for Innovative Drug Discovery”. Further, this work was supported by the Academy of Finland Flagship program: the Finnish Center for Artificial Intelligence FCAI. Samuel Kaski was supported by the UKRI Turing AI World-Leading Researcher Fellowship, [EP/W002973/1].
Fingerprint
Dive into the research topics of 'Molecular property prediction using pretrained-BERT and Bayesian active learning : a data-efficient approach to drug design'. Together they form a unique fingerprint.
Projects
- 2 Finished
-
MSCA AIDD /Kaski S.: Advanced machine learning for Innovative Drug Discovery
Kaski, S. (Principal investigator), Masood, A. (Project Member) & Nahal, Y. (Project Member)
01/01/2021 → 31/12/2024
Project: EU H2020 MC
-
-: Finnish Center for Artificial Intelligence
Kaski, S. (Principal investigator)
01/01/2019 → 31/12/2022
Project: Academy of Finland: Other research funding
Research output
- 3 Citations
- 1 Conference contribution
-
Multi-Modal Representation learning for molecules
Masood, A., Heinonen, M. & Kaski, S., 6 Mar 2025. 8 p.
Research output: Chapter in book/conference proceedings › Conference contribution › Scientific › peer-reviewed