Abstract
Diffusion-based models have recently demonstrated notable success in various generative tasks involving continuous signals, such as image, video, and audio synthesis. However, their applicability to video captioning has not yet received widespread attention, primarily due to the discrete nature of captions and the complexities of conditional generation across multiple modalities. This paper delves into diffusion-based video captioning and experiments with various modality fusion methods and modality combinations to assess their impact on the quality of the generated captions. The novelty of our proposed MM-Diff-Net lies in applying diffusion models to multimodal video captioning and in introducing a number of mid-fusion techniques for that purpose. Additionally, we propose a new input modality, a generated description, which the model attends to in order to enhance caption quality. Experiments are conducted on four well-established benchmark datasets, YouCook2, MSR-VTT, VATEX, and VALOR-32K, to evaluate the proposed model and fusion methods. The findings indicate that combining all modalities yields the best captions, although the effect of the fusion methods varies across datasets. The performance of our proposed model demonstrates the potential of diffusion-based models in video captioning, paving the way for further exploration and future research in the area.
| Original language | English |
|---|---|
| Title of host publication | Computer Vision – ACCV 2024: 17th Asian Conference on Computer Vision, Hanoi, Vietnam, December 8–12, 2024, Proceedings, Part III |
| Publisher | Springer |
| Pages | 148–165 |
| ISBN (Electronic) | 978-981-96-0885-0 |
| ISBN (Print) | 978-981-96-0884-3 |
| Publication status | Published, 7 Dec 2024 |
| MoE publication type | A4 Conference publication |
| Event | Asian Conference on Computer Vision (17th), Hanoi, Viet Nam, 8 Dec 2024 → 12 Dec 2024 |
Publication series
| Name | Lecture Notes in Computer Science |
|---|---|
| Publisher | Springer |
| Volume | 15474 |
| ISSN (Print) | 0302-9743 |
| ISSN (Electronic) | 1611-3349 |
Conference
| Conference | Asian Conference on Computer Vision |
|---|---|
| Abbreviated title | ACCV |
| Country/Territory | Viet Nam |
| City | Hanoi |
| Period | 08/12/2024 → 12/12/2024 |
Projects
- USSEE: Understanding speech and scene with ears and eyes (USSEE), Finished
  Laaksonen, J. (Principal investigator)
  01/01/2022 → 31/12/2024
  Project: Academy of Finland: Other research funding