Abstract
Speech embeddings, fixed-size representations derived from raw audio data, play a crucial role in diverse machine learning applications. Despite the abundance of speech embedding techniques, selecting the most suitable one remains challenging. Existing studies often focus on intrinsic or extrinsic aspects, seldom exploring both simultaneously. Furthermore, comparing the state-of-the-art pre-trained models with prior speech embedding solutions is notably scarce in the literature. To address these gaps, we undertake a comprehensive evaluation of both small and large-scale speech embedding models, which, in our opinion, needs to incorporate both intrinsic and extrinsic assessments. The intrinsic experiments delve into the models' ability to pick speaker-related characteristics and assess their discriminative capacities, providing insights into their inherent capabilities and internal workings. Concurrently, the extrinsic experiments evaluate whether the models learned semantic cues during pre-training. The findings underscore the superior performance of the large-scale pre-trained models, albeit at an elevated computational cost. The base self-supervised models show comparable results to their large counterparts, making them a better choice for many applications. Furthermore, we show that by selecting the most crucial dimensions, the models' performance often does not suffer drastically and even improves in some cases. This research contributes valuable insights into the nuanced landscape of speech embeddings, aiding researchers and practitioners in making informed choices for various applications.
Original language | English |
---|---|
Pages (from-to) | 3546-3560 |
Number of pages | 15 |
Journal | IEEE/ACM Transactions on Audio Speech and Language Processing |
Volume | 32 |
DOIs | |
Publication status | Published - 2024 |
MoE publication type | A1 Journal article-refereed |
Keywords
- Computational modeling
- Data models
- dimension contribution
- extrinsic evaluation
- Feature extraction
- intrinsic evaluation
- Speech embeddings
- Speech processing
- Task analysis
- Training
- Transformers
Fingerprint
Dive into the research topics of 'From Raw Speech to Fixed Representations: A Comprehensive Evaluation of Speech Embedding Techniques'. Together they form a unique fingerprint.
Projects
- 1 Active
LAREINA: LAREINA - Language Resource Infrastructure for AI
Kurimo, M. (Principal investigator), Moisio, A. (Project Member), Getman, Y. (Project Member), Porjazovski, D. (Project Member), Rouhe, A. (Project Member) & Virkkunen, A. (Project Member)
01/01/2023 → 31/12/2025
Project: Business Finland: Strategic centres for science, technology and innovation (SHOK)