Investigating wav2vec2 context representations and the effects of fine-tuning, a case-study of a Finnish model

Research output: Article in a book/conference proceedings › Conference article in proceedings › Scientific › peer-reviewed



Self-supervised speech models such as wav2vec2 have become extremely popular in the past few years. Their main appeal is that, after pre-training on a large amount of audio, they require only a small amount of supervised fine-tuning data to achieve outstanding results. Despite their immense success, little is understood about the pre-trained models themselves or about how fine-tuning changes them. In this work, we take the first steps towards a better understanding of wav2vec2 systems using model-interpretation tools such as visualization and latent embedding clustering. Through our analysis, we gain new insights into the abilities of the pre-trained networks and the effect that fine-tuning has on them. We demonstrate that the clusters learned by the pre-trained model are as important a factor as the supervised training data distribution in determining the accuracy of the fine-tuned system, which could aid in selecting the most suitable pre-trained model for the supervised data.
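The latent embedding clustering mentioned in the abstract can be sketched as follows. This is a minimal, NumPy-only illustration, not the authors' actual pipeline: it runs plain k-means (with a farthest-point initialization) over synthetic stand-ins for frame-level wav2vec2 context vectors. The embedding dimension, cluster count, and synthetic data are all assumptions for the sake of a self-contained example.

```python
import numpy as np

def init_centers(X, k, rng):
    # Farthest-point (greedy, k-means++-style) initialization:
    # pick one random frame, then repeatedly pick the frame
    # farthest from all centers chosen so far.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers)

def kmeans(X, k, iters=50, seed=0):
    """Cluster frame-level embeddings X (n_frames x dim) into k groups."""
    rng = np.random.default_rng(seed)
    centers = init_centers(X, k, rng)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each frame embedding to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned frames.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Synthetic stand-in for wav2vec2 context vectors: three well-separated
# groups of 100 frames each in an (assumed) 8-dimensional space.
rng = np.random.default_rng(1)
X = np.concatenate(
    [rng.normal(loc=m, scale=0.1, size=(100, 8)) for m in (0.0, 1.0, 2.0)]
)
labels, centers = kmeans(X, k=3)
```

In practice the input `X` would be the hidden states extracted from a pre-trained (or fine-tuned) wav2vec2 model for a batch of utterances; comparing the resulting cluster assignments against phone or speaker labels is one common way to probe what the representations encode.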
Title: Proceedings of Interspeech 2023
Publisher: International Speech Communication Association (ISCA)
DOI - permanent links
Status: Published - 20 Aug 2023
OKM publication type: A4 Article in conference proceedings
Event: Interspeech - Dublin, Ireland
Duration: 20 Aug 2023 - 24 Aug 2023


ISSN (electronic): 2958-1796




