
Text-to-Multimodal Retrieval with Bimodal Input Fusion in Shared Cross-Modal Transformer

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review

2 Citations (Scopus)
70 Downloads (Pure)

Abstract

The rapid proliferation of multimedia content has necessitated the development of effective multimodal video retrieval systems. Multimodal video retrieval is a non-trivial task involving retrieval of relevant information across different modalities, such as text, audio, and visual content. This work aims to improve multimodal retrieval by guiding the creation of a shared embedding space with task-specific contrastive loss functions. An important aspect of our work is to propose a model that learns retrieval cues for the textual query from multiple modalities both separately and jointly within a hierarchical architecture that can be flexibly extended and fine-tuned for any number of modalities. To this end, the loss functions and the architectural design of the model are developed with a strong focus on increasing the mutual information between the textual and cross-modal representations. The proposed approach is quantitatively evaluated on the MSR-VTT and YouCook2 text-to-video retrieval benchmark datasets. The results show that the approach not only holds its own against state-of-the-art methods, but also outperforms them in a number of scenarios, with notable relative improvements over the baseline in the R@1, R@5, and R@10 metrics.
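As a rough illustration of two ideas named in the abstract, the sketch below shows a generic symmetric InfoNCE contrastive loss (a common choice whose minimization is known to raise a lower bound on mutual information between paired representations) and the Recall@K metric for text-to-video retrieval. It is not the paper's task-specific loss or architecture, which the abstract does not spell out. The function names, the temperature value of 0.07, and the convention that text query i matches video i are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def symmetric_info_nce(text_emb, video_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's loss).

    Row i of text_emb and row i of video_emb are assumed to be a matched pair;
    every other row in the batch acts as a negative.
    """
    t = l2_normalize(text_emb)
    v = l2_normalize(video_emb)
    logits = t @ v.T / temperature  # shape (N, N)

    def cross_entropy_on_diagonal(lg):
        # Numerically stable log-softmax over each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the text-to-video and video-to-text directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def recall_at_k(similarity, k):
    """Fraction of text queries whose matching video is ranked in the top k.

    similarity[i, j] is the score of text query i against video j, and the
    correct video for query i is assumed to be video i.
    """
    true_scores = np.diag(similarity)[:, None]
    # Number of videos scored strictly higher than the correct one (0 = ranked first).
    ranks = (similarity > true_scores).sum(axis=1)
    return float(np.mean(ranks < k))
```

In use, `similarity` would be the text-by-video score matrix produced by whatever retrieval model is being evaluated, and R@1, R@5, and R@10 correspond to `recall_at_k(similarity, 1)`, `recall_at_k(similarity, 5)`, and `recall_at_k(similarity, 10)`.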

Original language: English
Title of host publication: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Publisher: European language resources distribution agency
Pages: 15823-15834
Number of pages: 12
ISBN (Electronic): 978-2-493814-10-4
Publication status: Published - 2024
MoE publication type: A4 Conference publication
Event: Joint International Conference on Computational Linguistics, Language Resources and Evaluation - Torino, Italy
Duration: 20 May 2024 to 25 May 2024
https://lrec-coling-2024.org/conference-program/
https://aclanthology.org/2024.lrec-main

Publication series

Name: LREC proceedings
Publisher: Language Resources Association (ELRA)
ISSN (Electronic): 2522-2686
Name: International conference on computational linguistics
Publisher: International Committee on Computational Linguistics
ISSN (Print): 2951-2093

Conference

Conference: Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Abbreviated title: LREC-COLING
Country/Territory: Italy
City: Torino
Period: 20/05/2024 to 25/05/2024

Keywords

  • contrastive learning
  • cross-modality
  • modality fusion
  • multimodal retrieval
  • multimodal transformers
  • text-to-video retrieval
  • transfer learning

Fingerprint

  • USSEE: Understanding speech and scene with ears and eyes (USSEE)

    Laaksonen, J. (Principal investigator), Kainulainen, J. (Project Member), Saif, A. (Project Member), Wang, T.-J. (Project Member), Guo, Z. (Project Member), Arora, P. (Project Member), Riahi, I. (Project Member), Tiwari, H. (Project Member) & Pehlivan Tort, S. (Project Member)

    01/01/2022 to 31/12/2024

    Project: RCF Academy Project targeted call

  • Science-IT

    Hakala, M. (Manager)

    School of Science

    Facility/equipment: Facility
