Weak Supervision and Clustering-Based Sample Selection for Clinical Named Entity Recognition

Wei Sun*, Shaoxiong Ji, Tuulia Denti, Hans Moen, Oleg Kerro, Antti Rannikko, Pekka Marttinen, Miika Koskinen

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference article in proceedings; Scientific; peer-reviewed


One of the central tasks of medical text analysis is to extract and structure meaningful information from plain-text clinical documents. Named Entity Recognition (NER) is a sub-task of information extraction that involves identifying predefined entities in unstructured free text. NER models require large amounts of human-labeled training data, but human annotation is costly, laborious, and often requires medical training. Here, we aim to overcome the shortage of manually annotated data by introducing a training scheme for NER models that uses an existing medical ontology to assign weak labels to entities and adapts the model to the domain through in-domain continual pretraining. Because human annotation resources are limited, we develop a module that collects a more representative test dataset from the data lake than random selection would. To validate our framework, we invite clinicians to annotate the test set. In this way, we construct two Finnish medical NER datasets based on clinical records retrieved from a hospital’s data lake and evaluate the effectiveness of the proposed methods. The code is available at https://github.com/VRCMF/HAM-net.git.
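
To make the idea of ontology-based weak labeling concrete, the following is a minimal sketch, not the authors' implementation. It assumes that weak labels are produced by greedy longest-match lookup of token spans in a term lexicon and emitted as BIO tags; the abstract does not specify the matching strategy. The lexicon `LEXICON`, the entity types, the function `weak_label`, and the English example sentence are all hypothetical and chosen only for illustration (the datasets in the paper are Finnish).

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon standing in for a medical ontology.
# Terms and entity types are illustrative assumptions, not from the paper.
LEXICON: Dict[Tuple[str, ...], str] = {
    ("chest", "pain"): "SYMPTOM",
    ("pain",): "SYMPTOM",
    ("ibuprofen",): "DRUG",
}


def weak_label(tokens: List[str], lexicon: Dict[Tuple[str, ...], str]) -> List[str]:
    """Assign BIO tags by greedy longest-match lookup in the lexicon."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    max_len = max((len(term) for term in lexicon), default=0)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so "chest pain" wins over "pain".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(lowered[i:i + n])
            if span in lexicon:
                label = lexicon[span]
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


tokens = "Patient reports chest pain and takes ibuprofen".split()
weak_tags = weak_label(tokens, LEXICON)
for token, tag in zip(tokens, weak_tags):
    print(f"{token}\t{tag}")
```

Labels produced this way are noisy: they miss entities absent from the lexicon and cannot resolve ambiguous terms, which is one reason a clinician-annotated test set is still needed for evaluation.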

Original language: English
Title of host publication: Machine Learning and Knowledge Discovery in Databases
Subtitle of host publication: Applied Data Science and Demo Track - European Conference, ECML PKDD 2023, Proceedings
Editors: Gianmarco De Francisci Morales, Francesco Bonchi, Claudia Perlich, Natali Ruchansky, Nicolas Kourtellis, Elena Baralis
Number of pages: 16
ISBN (Print): 978-3-031-43426-6
Publication status: Published - 2023
MoE publication type: A4 Conference publication
Event: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases - Turin, Italy
Duration: 18 Sept 2023 to 22 Sept 2023

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14174 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
Abbreviated title: ECML PKDD


Keywords

  • Clinical Reports
  • Distant Supervision
  • Named Entity Recognition
  • Sample Selection

