From Raw Speech to Fixed Representations: A Comprehensive Evaluation of Speech Embedding Techniques

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Speech embeddings, fixed-size representations derived from raw audio data, play a crucial role in diverse machine learning applications. Despite the abundance of speech embedding techniques, selecting the most suitable one remains challenging. Existing studies often focus on either intrinsic or extrinsic aspects, seldom exploring both simultaneously. Furthermore, comparisons of state-of-the-art pre-trained models with prior speech embedding solutions are notably scarce in the literature. To address these gaps, we undertake a comprehensive evaluation of both small- and large-scale speech embedding models, which, in our opinion, needs to incorporate both intrinsic and extrinsic assessments. The intrinsic experiments delve into the models' ability to pick up speaker-related characteristics and assess their discriminative capacities, providing insights into their inherent capabilities and internal workings. Concurrently, the extrinsic experiments evaluate whether the models learned semantic cues during pre-training. The findings underscore the superior performance of the large-scale pre-trained models, albeit at an elevated computational cost. The base self-supervised models show results comparable to those of their large counterparts, making them a better choice for many applications. Furthermore, we show that when only the most crucial dimensions are selected, the models' performance often does not suffer drastically and even improves in some cases. This research contributes valuable insights into the nuanced landscape of speech embeddings, aiding researchers and practitioners in making informed choices for various applications.

Original language: English
Pages (from-to): 3546-3560
Number of pages: 15
Journal: IEEE/ACM Transactions on Audio Speech and Language Processing
Volume: 32
Publication status: Published - 2024
MoE publication type: A1 Journal article-refereed

Keywords

  • Computational modeling
  • Data models
  • dimension contribution
  • extrinsic evaluation
  • Feature extraction
  • intrinsic evaluation
  • Speech embeddings
  • Speech processing
  • Task analysis
  • Training
  • Transformers
