Efficient text-to-video retrieval via multi-modal multi-tagger derived pre-screening

Yingjia Xu, Mengxia Wu, Zixin Guo, Min Cao*, Mang Ye, Jorma Laaksonen

*Corresponding author of this work

Research output: Journal article › Article › Scientific › Peer-reviewed

12 Downloads (Pure)

Abstract

Text-to-video retrieval (TVR) has made significant progress with advances in vision and language representation learning. Most existing methods represent videos and texts with real-valued or hash-based embeddings and retrieve by computing their similarities. However, these methods are often inefficient for large video collections and demand significant storage and computing resources. In this work, we present a plug-and-play pre-screening framework driven by a multi-modal multi-tagger, which filters out a substantial number of videos before any TVR algorithm is applied, thereby efficiently reducing the video search space. We predict discrete semantic tags for videos and texts with the proposed multi-modal multi-tagger module, and then leverage an inverted index for space-efficient and fast tag matching to filter out irrelevant videos. To avoid discarding videos that are relevant to a text query but carry inconsistent tags, we use contrastive learning to align video and text embeddings, which are then fed into a shared multi-tag head. Extensive experiments demonstrate that the proposed method significantly accelerates TVR while maintaining high retrieval accuracy on various TVR datasets.
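The pre-screening idea in the abstract can be illustrated with a minimal sketch: discrete tags are predicted offline for each video, an inverted index maps each tag to the videos carrying it, and at query time only videos sharing at least one tag with the query's predicted tags are retained for full TVR scoring. The Python below is an illustrative sketch of that general mechanism, not the paper's implementation; the names `video_tags`, `build_inverted_index`, and `prescreen` are hypothetical, and tag prediction (the paper's multi-modal multi-tagger) is mocked as precomputed tag sets.

```python
from collections import defaultdict

def build_inverted_index(video_tags):
    """Map each discrete semantic tag to the set of video ids carrying it."""
    index = defaultdict(set)
    for video_id, tags in video_tags.items():
        for tag in tags:
            index[tag].add(video_id)
    return index

def prescreen(query_tags, index):
    """Keep only videos sharing at least one predicted tag with the query."""
    candidates = set()
    for tag in query_tags:
        candidates |= index.get(tag, set())
    return candidates

# Toy corpus: video id -> tags assumed to be predicted offline by a tagger.
video_tags = {
    "v1": {"dog", "park"},
    "v2": {"cooking", "kitchen"},
    "v3": {"dog", "beach"},
}
index = build_inverted_index(video_tags)

# Tags assumed to be predicted for the text query at search time.
query_tags = {"dog"}
candidates = prescreen(query_tags, index)
print(sorted(candidates))  # ['v1', 'v3']
```

In this sketch, any downstream TVR model would then rank only the surviving candidates rather than the whole corpus, which is where the speedup comes from; the paper's shared multi-tag head and contrastive alignment serve to make the query's and the relevant videos' predicted tags consistent so this filter rarely drops true matches.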

Original language: English
Article number: 2
Pages: 1-13
Number of pages: 13
Journal: Visual Intelligence
Volume: 3
Issue: 1
DOI - permanent links
Status: Published - Dec. 2025
OKM publication type: A1 Original article in a scientific journal
