Low Resource Comparison of Attention-based and Hybrid ASR Exploiting wav2vec 2.0

Aku Rouhe*, Anja Virkkunen, Juho Leinonen, Mikko Kurimo

*Corresponding author for this work

Research output: Article in book/conference proceedings, Conference contribution, Scientific, peer-reviewed

Abstract

Low resource speech recognition can potentially benefit a lot from exploiting a pretrained model such as wav2vec 2.0. These pretrained models have learned useful representations in an unsupervised or self-supervised task, often leveraging a very large corpus of untranscribed speech. The pretrained models can then be used in various ways. In this work we compare two approaches which exploit wav2vec 2.0: an attention-based end-to-end model (AED), where the wav2vec 2.0 model is used in the model encoder, and a hybrid hidden Markov model (HMM/DNN) speech recognition system, where the wav2vec 2.0 model is used in the acoustic model. These approaches are compared in a very difficult Northern Sámi task, as well as an easier, simulated low resource task in Finnish. We find that the wav2vec 2.0 AED models can learn a working attention mechanism, but are still outperformed by wav2vec 2.0 HMM/DNN systems. Our best wav2vec 2.0 HMM/DNN recipe on 20 hours is competitive with an HMM/DNN system trained on 1600 hours.

Original language: English
Title of host publication: Proceedings of Interspeech'22
Publisher: International Speech Communication Association
Pages: 3543-3547
Number of pages: 5
Status: Published - 2022
OKM publication type: A4 Article in conference proceedings
Event: Interspeech - Incheon, South Korea
Duration: 18 Sept 2022 to 22 Sept 2022

Publication series

Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher: International Speech Communication Association
ISSN (print): 2308-457X
ISSN (electronic): 1990-9772

Conference

Conference: Interspeech
Country/Territory: South Korea
City: Incheon
Period: 18/09/2022 to 22/09/2022
