Text-to-Multimodal Retrieval with Bimodal Input Fusion in Shared Cross-Modal Transformer

Research output: Chapter in book/conference proceedings › Conference article in proceedings › Scientific › peer-reviewed

22 Downloads (Pure)

Abstract

The rapid proliferation of multimedia content has necessitated the development of effective multimodal video retrieval systems. Multimodal video retrieval is a non-trivial task that involves retrieving relevant information across different modalities, such as text, audio, and visual data. This work aims to improve multimodal retrieval by guiding the creation of a shared embedding space with task-specific contrastive loss functions. An important aspect of our work is to propose a model that learns retrieval cues for the textual query from multiple modalities, both separately and jointly, within a hierarchical architecture that can be flexibly extended and fine-tuned for any number of modalities. To this end, the loss functions and the architectural design of the model are developed with a strong focus on increasing the mutual information between the textual and cross-modal representations. The proposed approach is quantitatively evaluated on the MSR-VTT and YouCook2 text-to-video retrieval benchmark datasets. The results show that the approach not only holds its own against state-of-the-art methods, but also outperforms them in a number of scenarios, with notable relative improvements over the baseline in the R@1, R@5, and R@10 metrics.
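To make the evaluation terms in the abstract concrete, the following is a minimal sketch of how Recall@K (the R@1, R@5, and R@10 metrics) is commonly computed for text-to-video retrieval, together with a generic symmetric InfoNCE-style contrastive loss over a text-video similarity matrix. The abstract does not specify the form of the task-specific losses, so the loss here is a standard stand-in and not the authors' formulation. The function names, the toy similarity matrix, and the temperature value of 0.07 are all assumptions made for illustration.

```python
import math


def recall_at_k(similarity, k):
    """Fraction of text queries whose matching video ranks in the top k.

    similarity[i][j] is the score between text query i and video j;
    the correct video for query i is assumed to sit at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank = number of candidates scoring strictly higher than the match.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the target class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def symmetric_info_nce(similarity, temperature=0.07):
    """Generic symmetric contrastive loss (text-to-video plus video-to-text).

    This is a common stand-in, not the paper's task-specific loss.
    """
    scaled = [[s / temperature for s in row] for row in similarity]
    transposed = [list(col) for col in zip(*scaled)]
    return 0.5 * (_cross_entropy_diag(scaled) + _cross_entropy_diag(transposed))


# Hypothetical 3x3 similarity matrix: query 1 is mismatched on purpose.
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.4, 0.7],
]

print("R@1:", recall_at_k(sim, 1))
print("R@3:", recall_at_k(sim, 3))
print("loss:", symmetric_info_nce(sim))
```

Minimizing a loss of this kind pushes matching text-video pairs toward the diagonal of the similarity matrix, which is what raises Recall@K; InfoNCE-style objectives are also commonly interpreted as maximizing a lower bound on mutual information between the paired representations.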

Original language: English
Title of host publication: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Publisher: European language resources distribution agency
Pages: 15823-15834
Number of pages: 12
ISBN (electronic): 978-2-493814-10-4
Publication status: Published - 2024
OKM publication type: A4 Article in conference proceedings
Event: Joint International Conference on Computational Linguistics, Language Resources and Evaluation - Torino, Italy
Duration: 20 May 2024 to 25 May 2024
https://lrec-coling-2024.org/conference-program/
https://aclanthology.org/2024.lrec-main

Publication series

Name: International conference on computational linguistics
Publisher: International Committee on Computational Linguistics
ISSN (print): 2951-2093
Name: LREC proceedings
Publisher: Language Resources Association (ELRA)
ISSN (electronic): 2522-2686

Conference

Conference: Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Abbreviated title: LREC-COLING
Country/Territory: Italy
City: Torino
Period: 20/05/2024 to 25/05/2024
