Composed Video Retrieval via Enriched Context and Discriminative Embeddings

Omkar Thawakar, Muzammal Naseer, Rao Muhammad Anwer, Salman Khan, Michael Felsberg, Mubarak Shah, Fahad Shahbaz Khan

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review

2 Citations (Scopus)

Abstract

Composed video retrieval (CoVR) is a challenging problem in computer vision that has recently highlighted the integration of modification text with visual queries for more sophisticated video search in large databases. Existing works predominantly rely on visual queries combined with modification text to distinguish relevant videos. However, such a strategy struggles to fully preserve the rich query-specific context in retrieved target videos and represents the target video using only a visual embedding. We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information and learns discriminative vision-only, text-only, and vision-text embeddings for better alignment, so that matched target videos are retrieved accurately. Our proposed framework can be flexibly employed for both composed video retrieval (CoVR) and composed image retrieval (CoIR) tasks. Experiments on three datasets show that our approach obtains state-of-the-art performance for both CoVR and zero-shot CoIR tasks, achieving gains as high as around 7% in terms of recall@K=1 score. Our code and detailed language descriptions for the WebVid-CoVR dataset are available at https://github.com/OmkarThawakar/composed-video-retrieval.
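For readers unfamiliar with the recall@K metric quoted in the abstract, the following is a minimal sketch of how it is commonly computed for retrieval. It assumes a precomputed matrix of query-to-candidate similarity scores; the function name `recall_at_k` and the toy scores are illustrative assumptions and are not taken from the paper or its repository.

```python
def recall_at_k(similarity, target_idx, k=1):
    """Fraction of queries whose ground-truth target is among the k top-scoring candidates.

    similarity: list of rows, one per query, each holding a score per candidate.
    target_idx: ground-truth candidate index for each query.
    """
    hits = 0
    for scores, target in zip(similarity, target_idx):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(target_idx)


# Toy data (made up for illustration): 3 queries scored against 4 candidates.
similarity = [
    [0.9, 0.1, 0.3, 0.2],  # query 0: candidate 0 scores highest
    [0.2, 0.4, 0.8, 0.1],  # query 1: candidate 2 scores highest
    [0.5, 0.6, 0.1, 0.3],  # query 2: candidate 1 scores highest
]
target_idx = [0, 1, 1]  # ground-truth candidate per query

r1 = recall_at_k(similarity, target_idx, k=1)
r2 = recall_at_k(similarity, target_idx, k=2)
print(f"recall@1 = {r1:.3f}, recall@2 = {r2:.3f}")
```

In this toy case two of the three queries rank their target first, and all three rank it within the top two. How the similarity scores themselves are produced (for example, how the different target embeddings are combined) is specific to the paper and is not modeled here.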

Original language: English
Title of host publication: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Publisher: IEEE
Pages: 26886-26896
Number of pages: 11
ISBN (Electronic): 979-8-3503-5300-6
DOIs
Publication status: Published - 2024
MoE publication type: A4 Conference publication
Event: IEEE Conference on Computer Vision and Pattern Recognition - Seattle, United States
Duration: 16 Jun 2024 to 22 Jun 2024

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print): 1063-6919

Conference

Conference: IEEE Conference on Computer Vision and Pattern Recognition
Abbreviated title: CVPR
Country/Territory: United States
City: Seattle
Period: 16/06/2024 to 22/06/2024

Keywords

  • CoVR
  • multimodal conversational model
