Investigating wav2vec2 context representations and the effects of fine-tuning, a case-study of a Finnish model

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Scientific; peer-review



Self-supervised speech models such as wav2vec2 have become extremely popular in the past few years. Their main appeal is that, after pre-training on a large amount of audio, they require only a small amount of supervised fine-tuning data to achieve outstanding results. Despite their immense success, very little is understood about the pre-trained models and how fine-tuning changes them. In this work, we take the first steps towards a better understanding of wav2vec2 systems using model interpretation tools such as visualization and latent embedding clustering. Through our analysis, we gain new insights into the abilities of the pre-trained networks and the effect that fine-tuning has on them. We demonstrate that the clusters learned by the pre-trained model are just as important a factor as the supervised training data distribution in determining the accuracy of the fine-tuned system, a finding that could help in selecting the most suitable pre-trained model for the supervised data.
Original language: English
Title of host publication: Proceedings of Interspeech 2023
Publisher: International Speech Communication Association (ISCA)
Number of pages: 5
Publication status: Published - 20 Aug 2023
MoE publication type: A4 Conference publication
Event: Interspeech - Dublin, Ireland
Duration: 20 Aug 2023 to 24 Aug 2023

Publication series

Publisher: International Speech Communication Association
ISSN (Print): 1990-9772
ISSN (Electronic): 2308-457X



