Abstract
The rapid proliferation of multimedia content has necessitated the development of effective multimodal video retrieval systems. Multimodal video retrieval is a non-trivial task that involves retrieving relevant information across different modalities, such as text, audio, and visual data. This work aims to improve multimodal retrieval by guiding the creation of a shared embedding space with task-specific contrastive loss functions. An important aspect of our work is a model that learns retrieval cues for the textual query from multiple modalities, both separately and jointly, within a hierarchical architecture that can be flexibly extended and fine-tuned for any number of modalities. To this end, the loss functions and the architectural design of the model are developed with a strong focus on increasing the mutual information between the textual and cross-modal representations. The proposed approach is quantitatively evaluated on the MSR-VTT and YouCook2 text-to-video retrieval benchmark datasets. The results show that the approach not only holds its own against state-of-the-art methods but also outperforms them in a number of scenarios, with notable relative improvements over the baseline in the R@1, R@5, and R@10 metrics.
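The abstract names two ingredients without giving formulas on this page: a contrastive loss over a shared embedding space, and the R@K retrieval metrics. The sketch below illustrates both on a toy text-to-video similarity matrix. It is not the paper's method: the symmetric InfoNCE loss is a generic stand-in (minimizing it is commonly described as maximizing a lower bound on mutual information), and the function names, the temperature of 0.07, the pairing of query i with video i, and the similarity values are all assumptions made for illustration.

```python
import math


def recall_at_k(sim, k):
    """Fraction of text queries whose matching video is ranked in the top k.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the correct video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the correct video = number of videos scoring strictly higher
        # (ties are therefore resolved in favour of the correct video).
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


def _cross_entropy_diag(sim, temperature):
    """Mean cross-entropy of the row-wise softmax, with the diagonal as target."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(sim)


def symmetric_info_nce(sim, temperature=0.07):
    """Generic symmetric InfoNCE loss: text-to-video plus video-to-text."""
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (
        _cross_entropy_diag(sim, temperature)
        + _cross_entropy_diag(transposed, temperature)
    )


# Toy similarity matrix: 3 text queries x 3 videos (made-up numbers).
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.4, 0.7],
]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(sim, k):.3f}")
print(f"symmetric InfoNCE = {symmetric_info_nce(sim):.3f}")
```

In the toy matrix, queries 0 and 2 rank their own video first, while query 1 ranks it last, so R@1 is 2/3 and only R@3 reaches 1. A lower loss corresponds to matched text-video pairs scoring higher than mismatched ones in both retrieval directions.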
Original language | English |
---|---|
Title | 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings |
Editors | Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue |
Publisher | European language resources distribution agency |
Pages | 15823-15834 |
Number of pages | 12 |
ISBN (electronic) | 978-2-493814-10-4 |
Publication status | Published - 2024 |
MoE publication type | A4 Conference publication |
Event | Joint International Conference on Computational Linguistics, Language Resources and Evaluation - Torino, Italy Duration: 20 May 2024 → 25 May 2024 https://lrec-coling-2024.org/conference-program/ https://aclanthology.org/2024.lrec-main |
Publication series
Name | International conference on computational linguistics |
---|---|
Publisher | International Committee on Computational Linguistics |
ISSN (print) | 2951-2093 |
Name | LREC proceedings |
---|---|
Publisher | Language Resources Association (ELRA) |
ISSN (electronic) | 2522-2686 |
Conference
Conference | Joint International Conference on Computational Linguistics, Language Resources and Evaluation |
---|---|
Abbreviated title | LREC-COLING |
Country/Territory | Italy |
City | Torino |
Period | 20/05/2024 → 25/05/2024 |
Internet address |
Fingerprint
Dive into the research topics of 'Text-to-Multimodal Retrieval with Bimodal Input Fusion in Shared Cross-Modal Transformer'. Together they form a unique fingerprint.
Projects
- 1 Active
- USSEE: Understanding speech and scene with ears and eyes (USSEE)
  Laaksonen, J. (Principal investigator), Pehlivan Tort, S. (Project member), Wang, T.-J. (Project member), Guo, Z. (Project member), Saif, A. (Project member) & Riahi, I. (Project member)
  01/01/2022 → 31/12/2024
  Project: Academy of Finland: Other research funding