AV-PEA: Parameter-Efficient Adapter for Audio-Visual Multimodal Learning

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review

Abstract

Fine-tuning has become a widely used transfer learning technique for adapting pre-trained vision transformers to downstream tasks. Its success, however, depends on tuning a large number of parameters, which incurs substantial training and storage costs. In audio-visual multimodal learning, a further challenge is effectively incorporating both audio and visual cues into the transfer learning process, especially when the original model was trained on unimodal samples only. This paper introduces a novel audio-visual parameter-efficient adapter (AV-PEA) designed to improve multimodal transfer learning for audio-visual tasks. By integrating AV-PEA into a frozen vision transformer, such as the Vision Transformer (ViT), the transformer becomes capable of processing audio inputs without any audio pre-training. The adapter also facilitates the exchange of essential cues between the audio and visual modalities, while introducing only a limited set of trainable parameters into each block of the frozen transformer. Experimental results demonstrate that AV-PEA consistently achieves superior or comparable performance to state-of-the-art methods across a range of audio-visual tasks, including audio-visual event localization (AVEL), audio-visual question answering (AVQA), audio-visual retrieval (AVR), and audio-visual captioning (AVC). Furthermore, it distinguishes itself from competitors by integrating seamlessly into these tasks while keeping the number of trainable parameters consistent, typically less than 3.7% of the total parameters per task.
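The core mechanism the abstract describes, a small trainable adapter inserted into each block of a frozen backbone so that visual tokens can pick up audio cues, can be illustrated with a minimal PyTorch sketch. This is an illustrative assumption of how such an adapter might look (bottleneck projections plus cross-modal attention); the module names, dimensions, and placement are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

class AVAdapter(nn.Module):
    """Bottleneck adapter: visual tokens attend to audio tokens in a low-rank space."""
    def __init__(self, dim=768, bottleneck=64, heads=4):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # down-project both modalities
        self.act = nn.GELU()
        self.cross = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)     # up-project back to the block width

    def forward(self, visual, audio):
        q = self.act(self.down(visual))          # (B, Nv, b) queries from visual tokens
        kv = self.act(self.down(audio))          # (B, Na, b) keys/values from audio tokens
        fused, _ = self.cross(q, kv, kv)         # inject audio cues into visual tokens
        return visual + self.up(fused)           # residual keeps the frozen path intact

class FrozenBlockWithAdapter(nn.Module):
    """One frozen transformer block wrapped with a trainable AV adapter (hypothetical)."""
    def __init__(self, dim=768):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        for p in self.block.parameters():        # freeze the pre-trained backbone block
            p.requires_grad = False
        self.adapter = AVAdapter(dim)            # only these parameters are trained

    def forward(self, visual, audio):
        return self.adapter(self.block(visual), audio)

if __name__ == "__main__":
    blk = FrozenBlockWithAdapter()
    v = torch.randn(2, 196, 768)                 # e.g. ViT patch tokens
    a = torch.randn(2, 50, 768)                  # e.g. audio spectrogram tokens
    out = blk(v, a)                              # (2, 196, 768)
    trainable = sum(p.numel() for p in blk.parameters() if p.requires_grad)
    total = sum(p.numel() for p in blk.parameters())
    print(out.shape, f"trainable share: {trainable / total:.1%}")
```

Because the backbone is frozen, only the adapter's down/up projections and cross-attention weights receive gradients, which is what keeps the trainable fraction small and roughly constant across tasks.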

Original language: English
Title of host publication: Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP
Publisher: SciTePress
Pages: 730-737
Number of pages: 8
ISBN (Electronic): 978-989-758-679-8
Publication status: Published - 2024
MoE publication type: A4 Conference publication
Event: International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Rome, Italy
Duration: 27 Feb 2024 - 29 Feb 2024
Conference number: 19

Publication series

Name: Proceedings of the International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications
ISSN (Print): 2184-5921

Conference

Conference: International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications
Abbreviated title: VISIGRAPP
Country/Territory: Italy
City: Rome
Period: 27/02/2024 - 29/02/2024

Keywords

  • Audio-Visual Adapter
  • Audio-Visual Fusion
  • Multimodal Learning
  • Parameter-Efficient
