AV-PEA: Parameter-Efficient Adapter for Audio-Visual Multimodal Learning
Abstract
Fine-tuning has emerged as a widely used transfer learning technique for adapting pre-trained vision transformers to various downstream tasks. However, its success relies on tuning a large number of trainable parameters, which can incur substantial costs in both model training and storage. In audio-visual multimodal learning, a further challenge lies in effectively incorporating both audio and visual cues into the transfer learning process, especially when the original model has been trained on unimodal samples only. This paper introduces a novel audio-visual parameter-efficient adapter (AV-PEA) designed to improve multimodal transfer learning for audio-visual tasks. By integrating AV-PEA into a frozen vision transformer, such as the Vision Transformer (ViT), the transformer becomes adept at processing audio inputs without any prior audio pre-training. AV-PEA also facilitates the exchange of essential cues between the audio and visual modalities, while introducing only a limited set of trainable parameters into each block of the frozen transformer. Experimental results demonstrate that AV-PEA consistently achieves superior or comparable performance to state-of-the-art methods across a range of audio-visual tasks, including audio-visual event localization (AVEL), audio-visual question answering (AVQA), audio-visual retrieval (AVR), and audio-visual captioning (AVC). Furthermore, it distinguishes itself from competitors by enabling seamless integration into these tasks while maintaining a consistent number of trainable parameters, typically less than 3.7% of the total parameters per task.
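The abstract describes inserting a small trainable adapter into each frozen transformer block so that visual tokens can exchange cues with audio tokens while the backbone stays untouched. Below is a minimal PyTorch sketch of one such adapter; the `AVAdapter` name, the bottleneck width, and the cross-attention-based fusion are illustrative assumptions, not the paper's actual AV-PEA design.

```python
# Hypothetical sketch of a bottleneck adapter for a frozen ViT block.
# All module names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn

class AVAdapter(nn.Module):
    """Lightweight adapter letting frozen visual tokens attend to audio tokens."""
    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project to a small bottleneck
        self.cross_attn = nn.MultiheadAttention(bottleneck, num_heads=4,
                                                batch_first=True)
        self.up = nn.Linear(bottleneck, dim)     # project back to the ViT width
        self.act = nn.GELU()

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, Nv, dim) tokens from the frozen ViT block
        # audio:  (B, Na, dim) audio tokens (e.g., spectrogram patch embeddings)
        q = self.act(self.down(visual))
        kv = self.act(self.down(audio))
        fused, _ = self.cross_attn(q, kv, kv)    # visual queries attend to audio
        return visual + self.up(fused)           # residual keeps the frozen path intact

# Only the adapter's parameters would be trained; the backbone stays frozen.
adapter = AVAdapter(dim=768)
visual_tokens = torch.randn(2, 197, 768)
audio_tokens = torch.randn(2, 64, 768)
out = adapter(visual_tokens, audio_tokens)
print(out.shape)  # torch.Size([2, 197, 768])
```

Because only the down/up projections and the small attention module are trainable, the added parameter count per block stays far below the backbone's, which is consistent with the under-3.7% figure reported in the abstract.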
Original language | English |
---|---|
Title of host publication | Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP |
Publisher | SciTePress |
Pages | 730-737 |
Number of pages | 8 |
ISBN (Electronic) | 978-989-758-679-8 |
DOIs | |
Publication status | Published - 2024 |
MoE publication type | A4 Conference publication |
Event | International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (Conference number: 19) - Rome, Italy; Duration: 27 Feb 2024 → 29 Feb 2024 |
Publication series
Name | Proceedings of the International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications |
---|---|
ISSN (Print) | 2184-5921 |
Conference
Conference | International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications |
---|---|
Abbreviated title | VISIGRAPP |
Country/Territory | Italy |
City | Rome |
Period | 27/02/2024 → 29/02/2024 |
Keywords
- Audio-Visual Adapter
- Audio-Visual Fusion
- Multimodal Learning
- Parameter-Efficient
Projects
- 1 Finished
- USSEE: Understanding speech and scene with ears and eyes (USSEE)
Laaksonen, J. (Principal investigator), Pehlivan Tort, S. (Project Member), Wang, T.-J. (Project Member), Guo, Z. (Project Member), Saif, A. (Project Member) & Riahi, I. (Project Member)
01/01/2022 → 31/12/2024
Project: Academy of Finland: Other research funding