Improving Medical Multi-modal Contrastive Learning with Expert Annotations

Yogesh Kumar, Pekka Marttinen

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review

Abstract

We introduce eCLIP, an enhanced version of the CLIP model that integrates expert annotations in the form of radiologist eye-gaze heatmaps. It tackles key challenges in contrastive multi-modal medical imaging analysis, notably data scarcity and the "modality gap" -- a significant disparity between image and text embeddings that diminishes the quality of representations and hampers cross-modal interoperability. eCLIP integrates a heatmap processor and leverages mixup augmentation to efficiently utilize the scarce expert annotations, thus boosting the model's learning effectiveness. eCLIP is designed to be generally applicable to any variant of CLIP without requiring any modifications of the core architecture. Through detailed evaluations across several tasks, including zero-shot inference, linear probing, cross-modal retrieval, and Retrieval Augmented Generation (RAG) of radiology reports using a frozen Large Language Model, eCLIP showcases consistent improvements in embedding quality. The outcomes reveal enhanced alignment and uniformity, affirming eCLIP's capability to harness high-quality annotations for enriched multi-modal analysis in the medical imaging domain.
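The abstract names the modality gap and reports enhanced alignment and uniformity without defining them in this record. As a rough illustration only, the sketch below computes commonly used definitions of these three quantities with NumPy on L2-normalized embeddings. The function names, the exponents `alpha` and `t`, and the random stand-in embeddings are assumptions made for this example; this is not the authors' evaluation code, and the paper's exact protocol may differ.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def alignment(img, txt, alpha=2):
    """Mean distance**alpha between matched image/text rows (lower = closer pairs)."""
    return float(np.mean(np.linalg.norm(img - txt, axis=1) ** alpha))


def uniformity(x, t=2.0):
    """Log of the mean Gaussian potential over all distinct row pairs.

    Values are <= 0; more negative means the rows are spread more evenly.
    """
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(x), k=1)  # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))


def modality_gap(img, txt):
    """Euclidean distance between the image centroid and the text centroid."""
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))


# Stand-in data (assumption): random vectors in place of real encoder outputs.
rng = np.random.default_rng(0)
img_emb = l2_normalize(rng.normal(size=(8, 4)))
txt_emb = l2_normalize(img_emb + 0.1 * rng.normal(size=(8, 4)))

print("alignment:", alignment(img_emb, txt_emb))
print("uniformity (image):", uniformity(img_emb))
print("modality gap:", modality_gap(img_emb, txt_emb))
```

In practice the random arrays would be replaced by the image and text embeddings of paired samples, with row i of each array belonging to the same image-report pair.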
Original language: English
Title of host publication: Computer Vision – ECCV 2024
Subtitle of host publication: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XX
Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol
Publisher: Springer
Pages: 468-486
ISBN (Electronic): 978-3-031-72661-3
ISBN (Print): 978-3-031-72660-6
Publication status: Published - 2025
MoE publication type: A4 Conference publication
Event: European Conference on Computer Vision - Milano, Italy
Duration: 29 Sept 2024 – 4 Oct 2024
Conference number: 18

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer
Volume: 15078
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: European Conference on Computer Vision
Abbreviated title: ECCV
Country/Territory: Italy
City: Milano
Period: 29/09/2024 – 04/10/2024

Keywords

  • Contrastive Learning
  • Deep Neural Networks
  • LLM Large Language Models
  • Medical Imaging
  • Zero-shot Inference
