Abstract
Text-to-video retrieval (TVR) has made significant progress with advances in vision and language representation learning. Most existing methods represent videos and text with real-valued or hash-based embeddings and perform retrieval by computing their similarities. However, these methods are often inefficient for large video collections and require significant storage and computing resources. In this work, we present a plug-and-play multi-modal multi-tagger-driven pre-screening framework, which pre-screens a substantial number of videos before any TVR algorithm is applied, thereby efficiently reducing the video search space. We predict discrete semantic tags for videos and text with our proposed multi-modal multi-tagger module, and then leverage an inverted index for space-efficient and fast tag matching to filter out irrelevant videos. To avoid filtering out videos relevant to a text query because of inconsistent tags, we use contrastive learning to align video and text embeddings, which are then fed into a shared multi-tag head. Extensive experimental results demonstrate that our method significantly accelerates the TVR process while maintaining high retrieval accuracy on various TVR datasets.
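To make the pre-screening idea concrete, below is a minimal, hypothetical sketch of tag-based filtering with an inverted index. The function names (`build_inverted_index`, `prescreen`), the toy tag vocabulary, and the overlap threshold are illustrative assumptions rather than the paper's released implementation; in the proposed framework the tags would come from the multi-modal multi-tagger module, and the surviving candidates would then be re-ranked by the full TVR model.

```python
# Illustrative sketch of tag-based pre-screening with an inverted index.
# All names and tags are hypothetical placeholders, not the authors' code.
from collections import defaultdict


def build_inverted_index(video_tags: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each discrete semantic tag to the set of video ids carrying it."""
    index: dict[str, set[str]] = defaultdict(set)
    for video_id, tags in video_tags.items():
        for tag in tags:
            index[tag].add(video_id)
    return index


def prescreen(query_tags: set[str],
              index: dict[str, set[str]],
              min_overlap: int = 1) -> set[str]:
    """Keep only videos sharing at least `min_overlap` tags with the query.

    Only these candidates would then be scored by the full retrieval model,
    so the expensive similarity computation runs on a much smaller set.
    """
    hits: dict[str, int] = defaultdict(int)
    for tag in query_tags:
        for video_id in index.get(tag, ()):  # fast per-tag lookup
            hits[video_id] += 1
    return {vid for vid, count in hits.items() if count >= min_overlap}


# Toy usage with placeholder tags.
video_tags = {
    "v1": {"dog", "park", "running"},
    "v2": {"cooking", "kitchen"},
    "v3": {"dog", "beach"},
}
index = build_inverted_index(video_tags)
candidates = prescreen({"dog", "running"}, index, min_overlap=1)
print(candidates)  # {'v1', 'v3'} -- 'v2' is filtered out before full retrieval
```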
| Original language | English |
|---|---|
| Article number | 2 |
| Pages (from-to) | 1-13 |
| Number of pages | 13 |
| Journal | Visual Intelligence |
| Volume | 3 |
| Issue number | 1 |
| DOIs | |
| Publication status | Published - Dec 2025 |
| MoE publication type | A1 Journal article-refereed |
Keywords
- Contrastive learning (CL)
- Inverted index
- Pre-screening
- Text-to-video retrieval (TVR)
Projects
- USSEE: Understanding speech and scene with ears and eyes
  Laaksonen, J. (Principal investigator)
  01/01/2022 → 31/12/2024
  Project: Academy of Finland: Other research funding