Abstract
Diffusion-based models have recently demonstrated notable success in various generative tasks involving continuous signals, such as image, video, and audio synthesis. However, their applicability to video captioning has not yet received widespread attention, primarily due to the discrete nature of captions and the complexities of conditional generation across multiple modalities. This paper investigates diffusion-based video captioning and experiments with various modality fusion methods and modality combinations to assess their impact on the quality of the generated captions. The novelty of our proposed MM-Diff-Net lies in the use of diffusion models for multimodal video captioning and in the introduction of several mid-fusion techniques for that purpose. Additionally, we propose a new input modality, a generated description, which the model attends to in order to enhance caption quality. Experiments are conducted on four well-established benchmark datasets, YouCook2, MSR-VTT, VATEX, and VALOR-32K, to evaluate the proposed model and fusion methods. The findings indicate that combining all modalities yields the best captions, but the effect of the fusion methods varies across datasets. The performance of our proposed model shows the potential of diffusion-based models in video captioning, paving the way for further research in the area.
Original language | English |
---|---|
Title | Computer Vision – ACCV 2024 : 17th Asian Conference on Computer Vision, Hanoi, Vietnam, December 8–12, 2024, Proceedings, Part III |
Publisher | Springer |
Pages | 148-165 |
ISBN (electronic) | 978-981-96-0885-0 |
ISBN (print) | 978-981-96-0884-3 |
Status | Published - 7 Dec 2024 |
MoE publication type | A4 Article in conference proceedings |
Event | Asian Conference on Computer Vision - Hanoi, Vietnam. Duration: 8 Dec 2024 → 12 Dec 2024. Conference number: 17 |
Publication series
Name | Lecture Notes in Computer Science |
---|---|
Publisher | Springer |
Volume | 15474 |
ISSN (print) | 0302-9743 |
ISSN (electronic) | 1611-3349 |
Conference
Conference | Asian Conference on Computer Vision |
---|---|
Abbreviated title | ACCV |
Country/Territory | Vietnam |
City | Hanoi |
Period | 08/12/2024 → 12/12/2024 |
Projects
USSEE: Understanding speech and scene with ears and eyes (USSEE)
Laaksonen, J. (Principal investigator)
01/01/2022 → 31/12/2024
Project: RCF Academy Project targeted call