Efficient text-to-video retrieval via multi-modal multi-tagger derived pre-screening

Yingjia Xu, Mengxia Wu, Zixin Guo, Min Cao*, Mang Ye, Jorma Laaksonen

*Corresponding author for this work

Research output: Contribution to journal › Article › Scientific › peer-review


Abstract

Text-to-video retrieval (TVR) has made significant progress with advances in vision and language representation learning. Most existing methods represent videos and texts with real-valued or hash-based embeddings and perform retrieval by computing their similarities. However, these methods are often inefficient for large video collections and require significant storage and computing resources. In this work, we present a plug-and-play multi-modal multi-tagger-driven pre-screening framework, which pre-screens a substantial number of videos before any TVR algorithm is applied, thereby efficiently reducing the video search space. We predict discrete semantic tags for videos and texts with our proposed multi-modal multi-tagger module and then leverage an inverted index for space-efficient and fast tag matching to filter out irrelevant videos. To avoid filtering out videos relevant to a text query because of inconsistent tags, we use contrastive learning to align video and text embeddings, which are then fed into a shared multi-tag head. Extensive experimental results demonstrate that our proposed method significantly accelerates the TVR process while maintaining high retrieval accuracy on various TVR datasets.
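To illustrate how tag-based pre-screening with an inverted index can work, the minimal Python sketch below builds a tag-to-video index and keeps only videos that share at least one tag with the query tags. The example video ids, tags, and the any-shared-tag matching rule are illustrative assumptions, not the paper's exact implementation.

    from collections import defaultdict

    # Hypothetical tag predictions for a small video collection; in the paper
    # these would come from the multi-modal multi-tagger module.
    video_tags = {
        "vid_0": {"dog", "park", "running"},
        "vid_1": {"cooking", "kitchen"},
        "vid_2": {"dog", "beach"},
    }

    # Inverted index: tag -> set of video ids carrying that tag.
    inverted_index = defaultdict(set)
    for vid, tags in video_tags.items():
        for tag in tags:
            inverted_index[tag].add(vid)

    def pre_screen(query_tags):
        """Return candidate videos sharing at least one tag with the query."""
        candidates = set()
        for tag in query_tags:
            candidates |= inverted_index.get(tag, set())
        return candidates

    # A text query tagged {"dog", "running"} keeps only vid_0 and vid_2;
    # the full TVR model then ranks just these candidates.
    print(pre_screen({"dog", "running"}))  # {'vid_0', 'vid_2'}

Because the index stores only tag-to-id postings rather than dense embeddings for every video, lookups stay fast and storage stays small, which matches the space-efficiency motivation described in the abstract.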

Original language: English
Article number: 2
Pages (from-to): 1-13
Number of pages: 13
Journal: Visual Intelligence
Volume: 3
Issue number: 1
DOIs
Publication status: Published - Dec 2025
MoE publication type: A1 Journal article-refereed

Keywords

  • Contrastive learning (CL)
  • Inverted index
  • Pre-screening
  • Text-to-video retrieval (TVR)
