Abstract
Text-to-video retrieval (TVR) has made significant progress with advances in vision and language representation learning. Most existing methods use real-valued or hash-based embeddings to represent videos and texts, allowing retrieval by computing their similarities. However, these methods are often inefficient for large volumes of video and require significant storage and computing resources. In this work, we present a plug-and-play multi-modal multi-tagger-driven pre-screening framework that pre-screens a substantial number of videos before any TVR algorithm is applied, thereby efficiently reducing the video search space. We predict discrete semantic tags for videos and texts with our proposed multi-modal multi-tagger module, and then leverage an inverted index for space-efficient and fast tag matching to filter out irrelevant videos. To avoid filtering out videos relevant to a text query because of inconsistent tags, we utilize contrastive learning to align video and text embeddings, which are then fed into a shared multi-tag head. Extensive experimental results demonstrate that our proposed method significantly accelerates the TVR process while maintaining high retrieval accuracy on various TVR datasets.
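The pre-screening step described in the abstract reduces, at its core, to an inverted-index lookup over predicted tags: only videos sharing at least one tag with the query are passed to the full TVR model. The sketch below is a minimal illustration under that assumption, not the authors' released code; names such as `build_inverted_index`, `prescreen`, and the stub scorer are hypothetical, and the real system predicts tags with the learned multi-modal multi-tagger rather than taking them as given.

```python
# Minimal sketch of tag-based pre-screening for text-to-video retrieval.
# Assumption: videos and queries have already been mapped to discrete tag sets
# (by a multi-tagger); an inverted index from tag -> video ids lets a query
# skip videos with no shared tags, so the expensive TVR scorer only sees
# the surviving candidates.

from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set


def build_inverted_index(video_tags: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each predicted tag to the set of video ids carrying that tag."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for video_id, tags in video_tags.items():
        for tag in tags:
            index[tag].add(video_id)
    return index


def prescreen(query_tags: Iterable[str], index: Dict[str, Set[str]]) -> Set[str]:
    """Keep only videos that share at least one tag with the query."""
    candidates: Set[str] = set()
    for tag in query_tags:
        candidates |= index.get(tag, set())
    return candidates


def retrieve(query: str,
             query_tags: Set[str],
             index: Dict[str, Set[str]],
             score: Callable[[str, str], float],
             top_k: int = 10) -> List[str]:
    """Run an arbitrary TVR scorer, but only on the pre-screened candidates."""
    candidates = prescreen(query_tags, index)
    ranked = sorted(candidates, key=lambda vid: score(query, vid), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    # Toy tag predictions standing in for the multi-modal multi-tagger output.
    video_tags = {
        "v1": {"dog", "beach"},
        "v2": {"cooking", "kitchen"},
        "v3": {"dog", "park"},
    }
    index = build_inverted_index(video_tags)
    # A query tagged {"dog"} never touches v2, shrinking the search space.
    print(prescreen({"dog"}, index))  # -> {'v1', 'v3'} (order may vary)
    # Plug in any TVR similarity function; a constant scorer is used as a stub.
    print(retrieve("a dog running on grass", {"dog"}, index, score=lambda q, v: 0.0))
```

Because the pre-screening is plug-and-play, the `score` callable can be any existing text-video similarity model; the index only decides which videos it is ever asked to score.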
| Original language | English |
| --- | --- |
| Article number | 2 |
| Pages | 1-13 |
| Number of pages | 13 |
| Journal | Visual Intelligence |
| Volume | 3 |
| Issue number | 1 |
| DOI - permanent links | |
| Status | Published - Dec. 2025 |
| OKM publication type | A1 Original research article in a scientific journal |
Fingerprint
Dive into the research topics of 'Efficient text-to-video retrieval via multi-modal multi-tagger derived pre-screening'. Together they form a unique fingerprint.
Projects
- 1 Finished
- USSEE: Understanding speech and scene with ears and eyes (USSEE)
  Laaksonen, J. (Principal Investigator)
  01/01/2022 → 31/12/2024
  Project: RCF Academy Project targeted call