Abstract

Diffusion-based models have recently demonstrated notable success in various generative tasks involving continuous signals, such as image, video, and audio synthesis. However, their applicability to video captioning has not yet received widespread attention, primarily due to the discrete nature of captions and the complexities of conditional generation across multiple modalities. This paper delves into diffusion-based video captioning and experiments with various modality fusion methods and different modality combinations to assess their impact on the quality of generated captions. The novelty of our proposed MM-Diff-Net lies in its use of diffusion models for multimodal video captioning and in the introduction of several mid-fusion techniques for that purpose. Additionally, we propose a new input modality, a generated description, which the model attends to in order to enhance caption quality. Experiments are conducted on four well-established benchmark datasets, YouCook2, MSR-VTT, VATEX, and VALOR-32K, to evaluate the proposed model and fusion methods. The findings indicate that combining all modalities yields the best captions, but the effect of fusion methods varies across datasets. The performance of our proposed model shows the potential of diffusion-based models in video captioning, paving the way for further exploration and future research in the area.
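The record does not include the authors' code. As a rough, hypothetical illustration of what a mid-fusion step of the kind the abstract describes might look like, the sketch below fuses caption-token features with video and audio features via cross-attention and a residual sum. All function names, shapes, and dimensions here are assumptions for illustration, not the MM-Diff-Net implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    # queries: (Tq, d) caption-token features
    # keys_values: (Tk, d) features of one conditioning modality
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def mid_fusion(text_feats, modality_feats_list):
    # Attend the text stream to each modality and add the attended
    # contexts back into the text features (residual mid-fusion).
    d = text_feats.shape[-1]
    fused = text_feats.copy()
    for m in modality_feats_list:
        fused = fused + cross_attention(text_feats, m, d)
    return fused

rng = np.random.default_rng(0)
text = rng.standard_normal((8, 64))    # 8 caption tokens
video = rng.standard_normal((16, 64))  # 16 video frame features
audio = rng.standard_normal((10, 64))  # 10 audio segment features
out = mid_fusion(text, [video, audio])
print(out.shape)  # (8, 64): same shape as the text stream
```

In a real diffusion captioner this fusion would sit inside the denoising network, conditioning each denoising step on the visual and audio streams; the sketch only shows the shape-preserving attention mechanics.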
Original language: English
Title of host publication: Computer Vision – ACCV 2024: 17th Asian Conference on Computer Vision, Hanoi, Vietnam, December 8–12, 2024, Proceedings, Part III
Publisher: Springer
Pages: 148–165
ISBN (Electronic): 978-981-96-0885-0
ISBN (Print): 978-981-96-0884-3
Publication status: Published - 7 Dec 2024
MoE publication type: A4 Conference publication
Event: Asian Conference on Computer Vision - Hanoi, Viet Nam
Duration: 8 Dec 2024 – 12 Dec 2024
Conference number: 17

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer
Volume: 15474
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: Asian Conference on Computer Vision
Abbreviated title: ACCV
Country/Territory: Viet Nam
City: Hanoi
Period: 08/12/2024 – 12/12/2024
