Investigating wav2vec2 context representations and the effects of fine-tuning, a case-study of a Finnish model

Tamas Grosz, Yaroslav Getman, Ragheb Al-Ghezi, Aku Rouhe, Mikko Kurimo

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review


Abstract

Self-supervised speech models, such as wav2vec2, have become extremely popular in the past few years. Their main appeal is that, after pre-training on a large amount of audio, they require only a small amount of supervised fine-tuning data to achieve outstanding results. Despite their immense success, very little is understood about the pre-trained models and how fine-tuning changes them. In this work, we take the first steps towards a better understanding of wav2vec2 systems using model interpretation tools such as visualization and latent embedding clustering. Through our analysis, we gain new insights into the abilities of the pre-trained networks and the effect that fine-tuning has on them. We demonstrate that the clusters learned by the pre-trained model are just as important a factor as the supervised training data distribution in determining the accuracy of the fine-tuned system, which could aid us in selecting the most suitable pre-trained model for the supervised data.
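The paper's interpretation toolkit is not reproduced here, but a minimal sketch of the general idea, extracting wav2vec2 context representations with the Hugging Face transformers library and clustering them with k-means, could look as follows. The checkpoint name, audio paths, and cluster count are illustrative placeholders, not values from the paper.

    # Minimal sketch (not the authors' exact pipeline): cluster frame-level
    # wav2vec2 context representations with k-means.
    import torch
    import numpy as np
    import soundfile as sf
    from sklearn.cluster import KMeans
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    MODEL_NAME = "facebook/wav2vec2-large-xlsr-53"  # placeholder checkpoint
    extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
    model = Wav2Vec2Model.from_pretrained(MODEL_NAME).eval()

    def context_embeddings(wav_path):
        """Return frame-level context representations (last hidden states)."""
        audio, sr = sf.read(wav_path)  # wav2vec2 expects 16 kHz mono audio
        inputs = extractor(audio, sampling_rate=sr, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (1, frames, dim)
        return hidden.squeeze(0).numpy()

    # Pool frames from a few utterances (paths are placeholders) and cluster.
    embeddings = np.concatenate(
        [context_embeddings(p) for p in ["utt1.wav", "utt2.wav"]]
    )
    cluster_ids = KMeans(n_clusters=50, random_state=0).fit_predict(embeddings)

Comparing the cluster assignments obtained from a pre-trained checkpoint with those from its fine-tuned counterpart is one simple way to visualize how fine-tuning reshapes the latent space.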
Original language: English
Title of host publication: Proceedings of Interspeech 2023
Publisher: International Speech Communication Association (ISCA)
Pages: 196-200
Number of pages: 5
DOIs
Publication status: Published - 20 Aug 2023
MoE publication type: A4 Conference publication
Event: Interspeech - Dublin, Ireland
Duration: 20 Aug 2023 - 24 Aug 2023

Publication series

Name: Interspeech
Publisher: International Speech Communication Association
ISSN (Electronic): 2958-1796

Conference

Conference: Interspeech
Country/Territory: Ireland
City: Dublin
Period: 20/08/2023 - 24/08/2023

