Diffusion-Based Multimodal Video Captioning

Jaakko Kainulainen, Zixin Guo, Jorma Laaksonen

Research output: Article in book/conference proceedings, Conference article in proceedings, Scientific, peer-reviewed

Abstract

Diffusion-based models have recently demonstrated notable success in various generative tasks involving continuous signals, such as image, video, and audio synthesis. However, their applicability to video captioning has not yet received widespread attention, primarily due to the discrete nature of captions and the complexities of conditional generation across multiple modalities. This paper delves into diffusion-based video captioning and experiments with various modality fusion methods and different modality combinations to assess their impact on the quality of generated captions. The novelty of our proposed MM-Diff-Net is in the use of diffusion models in multimodal video captioning and in the introduction of a number of mid-fusion techniques for that purpose. Additionally, we propose a new input modality: generated description, which is attended to enhance caption quality. Experiments are conducted on four well-established benchmark datasets, YouCook2, MSR-VTT, VATEX, and VALOR-32K, to evaluate the proposed model and fusion methods. The findings indicate that combining all modalities yields the best captions, but the effect of fusion methods varies across datasets. The performance of our proposed model shows the potential of diffusion-based models in video captioning, paving the way for further exploration and future research in the area.
Original language: English
Title of host publication: Computer Vision – ACCV 2024 : 17th Asian Conference on Computer Vision, Hanoi, Vietnam, December 8–12, 2024, Proceedings, Part III
Publisher: Springer
Pages: 148-165
ISBN (electronic): 978-981-96-0885-0
ISBN (print): 978-981-96-0884-3
Publication status: Published - 7 Dec 2024
MoE publication type: A4 Article in conference proceedings
Event: Asian Conference on Computer Vision - Hanoi, Vietnam
Duration: 8 Dec 2024 to 12 Dec 2024
Conference number: 17

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer
Volume: 15474
ISSN (print): 0302-9743
ISSN (electronic): 1611-3349

Conference

Conference: Asian Conference on Computer Vision
Abbreviated title: ACCV
Country/Territory: Vietnam
City: Hanoi
Period: 08/12/2024 to 12/12/2024
