Abstract
Active Speaker Detection (ASD) aims to identify the active speaker among multiple speakers in a video scene. Previous ASD models often extract audio and visual features from long video clips with a complex 3D Convolutional Neural Network (CNN) architecture. Although models based on 3D CNNs can generate discriminative spatial-temporal features, this comes at the expense of computational complexity, and they frequently face challenges in detecting active speakers in short video clips. This work proposes the Active Speaker Network (AS-Net) model, a simple yet effective ASD method tailored for detecting active speakers in relatively short video clips without relying on 3D CNNs. Instead, it incorporates the Temporal Shift Module (TSM) into 2D CNNs, facilitating the extraction of dense temporal visual features without additional computation. Moreover, self-attention and cross-attention schemes are introduced to enhance long-term temporal audio-visual synchronization, thereby improving ASD performance. Experimental results demonstrate that AS-Net outperforms state-of-the-art 2D CNN-based methods on the AVA-ActiveSpeaker dataset and remains competitive with methods that use more complex architectures.
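To illustrate why a temporal shift adds no arithmetic, the following is a minimal sketch of the general temporal-shift idea on a sequence of per-frame feature vectors. It is not the AS-Net implementation: the function name `temporal_shift`, the `fold_div` parameter, the zero padding at the clip boundaries, and the choice of which channel group moves in which direction are all assumptions made for illustration. The operation only copies values between neighbouring frames, so it introduces no multiplications and no learnable parameters.

```python
from typing import List


def temporal_shift(frames: List[List[float]], fold_div: int = 8) -> List[List[float]]:
    """Shift a fraction of channels along the time axis (illustrative sketch).

    frames:   T per-frame feature vectors, each with C channels.
    fold_div: 1/fold_div of the channels take their value from the next frame,
              another 1/fold_div from the previous frame; the rest stay put.
              (Hypothetical parameter name, chosen for this example.)
    """
    num_frames = len(frames)
    if num_frames == 0:
        return []
    num_channels = len(frames[0])
    fold = num_channels // fold_div

    # Start from zeros so positions with no temporal neighbour are zero-padded.
    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1  # first group: read from the next frame
            elif c < 2 * fold:
                src = t - 1  # second group: read from the previous frame
            else:
                src = t      # remaining channels are left unshifted
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
    return shifted


# Three frames with four channels each; fold_div=4 shifts one channel each way.
clip = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
print(temporal_shift(clip, fold_div=4))
# [[5, 0.0, 3, 4], [9, 2, 7, 8], [0.0, 6, 11, 12]]
```

After the shift, each frame's feature vector mixes information from its neighbours, so a subsequent 2D convolution applied frame by frame can pick up temporal context.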
| Original language | English |
|---|---|
| Pages (from-to) | 72027-72042 |
| Number of pages | 16 |
| Journal | Multimedia Tools and Applications |
| Volume | 83 |
| Issue number | 28 |
| Early online date | 2024 |
| DOIs | |
| Publication status | Published - Aug 2024 |
| MoE publication type | A1 Journal article-refereed |
Funding
This work is supported by the Academy of Finland in project 345791. We acknowledge the LUMI supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CSC and the LUMI consortium.
Keywords
- Active speaker detection
- Audio-visual attention
- Audio-visual features
- Convolutional Neural Networks (CNNs)
- Temporal shift module
Projects
- 1 Finished
- USSEE: Understanding speech and scene with ears and eyes (USSEE)
Laaksonen, J. (Principal investigator), Kainulainen, J. (Project Member), Saif, A. (Project Member), Wang, T.-J. (Project Member), Guo, Z. (Project Member), Arora, P. (Project Member), Riahi, I. (Project Member), Tiwari, H. (Project Member) & Pehlivan Tort, S. (Project Member)
01/01/2022 → 31/12/2024
Project: RCF Academy Project targeted call