From Raw Speech to Fixed Representations: A Comprehensive Evaluation of Speech Embedding Techniques

Research output: Journal article, Article, Scientific, peer-reviewed


Abstract

Speech embeddings, fixed-size representations derived from raw audio data, play a crucial role in diverse machine learning applications. Despite the abundance of speech embedding techniques, selecting the most suitable one remains challenging. Existing studies often focus on intrinsic or extrinsic aspects, seldom exploring both simultaneously. Furthermore, comparisons of state-of-the-art pre-trained models with earlier speech embedding solutions are notably scarce in the literature. To address these gaps, we undertake a comprehensive evaluation of both small and large-scale speech embedding models, which, in our opinion, needs to incorporate both intrinsic and extrinsic assessments. The intrinsic experiments delve into the models' ability to pick up speaker-related characteristics and assess their discriminative capacities, providing insights into their inherent capabilities and internal workings. Concurrently, the extrinsic experiments evaluate whether the models learned semantic cues during pre-training. The findings underscore the superior performance of the large-scale pre-trained models, albeit at an elevated computational cost. The base self-supervised models show comparable results to their large counterparts, making them a better choice for many applications. Furthermore, we show that by selecting the most crucial dimensions, the models' performance often does not suffer drastically and even improves in some cases. This research contributes valuable insights into the nuanced landscape of speech embeddings, aiding researchers and practitioners in making informed choices for various applications.

Original language: English
Pages: 3546-3560
Number of pages: 15
Journal: IEEE/ACM Transactions on Audio Speech and Language Processing
Volume: 32
Status: Published - 2024
OKM publication type: A1 Original article in a scientific journal
