Automatic speech recognition (ASR) systems require large amounts of transcribed speech data to train state-of-the-art deep neural network (DNN) acoustic models. Transcribed speech is a scarce and expensive resource, and ASR systems tend to underperform in domains where little training data is available. In this work, we open up a vast and previously unused resource of transcribed speech for Finnish by retrieving and aligning all the recordings and meeting transcripts from the web portal of the Parliament of Finland. Short speech-text segment pairs are extracted from the audio and text material by using the Levenshtein algorithm to align first-pass ASR hypotheses with the corresponding meeting transcripts. DNN acoustic models are trained on the automatically constructed corpus, and their performance is compared to that of models trained on a commercially available speech corpus. Model performance is evaluated on Finnish parliament speech, with the test set divided into seen and unseen speakers. Performance is also evaluated on broadcast speech to test the general applicability of the parliament speech corpus. We also study the use of meeting transcripts in language model adaptation to achieve additional gains in speech recognition accuracy on Finnish parliament speech.
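The alignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows a word-level Levenshtein alignment between an ASR hypothesis and a transcript, followed by the selection of runs of consecutive matching words. The function names `align_words` and `matched_runs`, the `min_len` threshold, and the example sentences are all hypothetical. The sketch also assumes that, in a real pipeline, each hypothesis word would carry a timestamp, so that a matched run could be mapped back to an audio segment.

```python
from typing import List, Tuple

Op = Tuple[str, int, int]  # (kind, hypothesis index, transcript index); -1 if absent


def align_words(hyp: List[str], ref: List[str]) -> List[Op]:
    """Align hyp to ref with unit-cost Levenshtein distance and return the edit path."""
    n, m = len(hyp), len(ref)
    # dist[i][j] is the edit distance between hyp[:i] and ref[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # extra word in the hypothesis
                dist[i][j - 1] + 1,         # transcript word missing from the hypothesis
                dist[i - 1][j - 1] + cost,  # match or substitution
            )

    # Trace back from the bottom-right corner to recover one optimal path.
    ops: List[Op] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                ops.append(("match" if cost == 0 else "substitute", i - 1, j - 1))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("insert", i - 1, -1))
            i -= 1
        else:
            ops.append(("delete", -1, j - 1))
            j -= 1
    ops.reverse()
    return ops


def matched_runs(ops: List[Op], min_len: int = 3) -> List[Tuple[int, int]]:
    """Return inclusive hypothesis index ranges of at least min_len consecutive matches."""
    runs: List[Tuple[int, int]] = []
    start = None
    for k, (kind, _, _) in enumerate(ops):
        if kind == "match":
            if start is None:
                start = k
            continue
        if start is not None and k - start >= min_len:
            runs.append((ops[start][1], ops[k - 1][1]))
        start = None
    if start is not None and len(ops) - start >= min_len:
        runs.append((ops[start][1], ops[-1][1]))
    return runs


# Hypothetical example: the hypothesis contains one filler word absent from the transcript.
hypothesis = "arvoisa puhemies kiitos öö hallitus esittää lakia".split()
transcript = "arvoisa puhemies kiitos hallitus esittää lakia".split()
print(matched_runs(align_words(hypothesis, transcript)))  # [(0, 2), (4, 6)]
```

Keeping only runs of exact matches is one simple way to discard regions where the transcript and the spoken audio diverge. The threshold and selection rule here are illustrative choices, not values taken from the paper.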
|Title||Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH|
|Status||Published - August 2017|
|Ministry of Education and Culture (OKM) publication type||A4 Article in conference proceedings|
|Event||INTERSPEECH - |
Duration: 1 January 1900 → …
|Name||Interspeech: Annual Conference of the International Speech Communication Association|
|Period||01/01/1900 → …|