Automatic Construction of the Finnish Parliament Speech Corpus

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review



Automatic speech recognition (ASR) systems require large amounts of transcribed speech data for training state-of-the-art deep neural network (DNN) acoustic models. Transcribed speech is a scarce and expensive resource, and ASR systems tend to underperform in domains where little training data is available. In this work, we open up a vast and previously unused resource of transcribed speech for Finnish by retrieving and aligning all the recordings and meeting transcripts from the web portal of the Parliament of Finland. Short speech-text segment pairs are extracted from the audio and text material by using the Levenshtein algorithm to align first-pass ASR hypotheses with the corresponding meeting transcripts. DNN acoustic models are trained on the automatically constructed corpus, and their performance is compared to that of models trained on a commercially available speech corpus. Model performance is evaluated on Finnish parliament speech, with the test set divided into seen and unseen speakers. Performance is also evaluated on broadcast speech to test the general applicability of the parliament speech corpus. We also study the use of meeting transcripts in language model adaptation to achieve additional gains in recognition accuracy on Finnish parliament speech.
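The alignment step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: it assumes word-level tokens, unit edit costs, and a hypothetical `min_len` threshold for keeping a segment, and the function names `align_words` and `matched_runs` are invented for illustration. In a real system each hypothesis word would also carry a timestamp, so a run of matching words could be mapped back to an audio span.

```python
from typing import List, Tuple

Op = Tuple[str, int, int]  # (operation, hypothesis index, transcript index); -1 = absent


def align_words(hyp: List[str], ref: List[str]) -> List[Op]:
    """Align two word sequences with the Levenshtein algorithm (unit costs)."""
    n, m = len(hyp), len(ref)
    # dist[i][j] = edit distance between hyp[:i] and ref[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # hypothesis word missing from transcript
                dist[i][j - 1] + 1,         # transcript word missing from hypothesis
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    ops: List[Op] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = hyp[i - 1] == ref[j - 1]
            if dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
                ops.append(("match" if same else "sub", i - 1, j - 1))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", i - 1, -1))
            i -= 1
        else:
            ops.append(("ins", -1, j - 1))
            j -= 1
    ops.reverse()
    return ops


def matched_runs(ops: List[Op], min_len: int = 3) -> List[Tuple[int, int, int, int]]:
    """Return (hyp_start, hyp_end, ref_start, ref_end) for each run of at least
    min_len consecutive matches; end indices are exclusive."""
    runs = []
    run: List[Tuple[int, int]] = []
    for op, hi, rj in ops + [("end", -1, -1)]:  # sentinel flushes the last run
        if op == "match":
            run.append((hi, rj))
        else:
            if len(run) >= min_len:
                runs.append((run[0][0], run[-1][0] + 1, run[0][1], run[-1][1] + 1))
            run = []
    return runs
```

Calling `matched_runs(align_words(hypothesis_words, transcript_words))` yields the index ranges where the recognizer output and the transcript agree word for word. Keeping only such runs is one plausible way to select reliable speech-text pairs, since meeting transcripts are typically edited and do not follow the audio verbatim.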


Original language: English
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - Aug 2017
MoE publication type: A4 Article in a conference publication
Duration: 1 Jan 1900 → …

Publication series

Name: Interspeech: Annual Conference of the International Speech Communication Association
ISSN (Electronic): 1990-9772


Period: 01/01/1900 → …

    Research areas

  • automatic speech recognition, speech-to-text alignment, DNN acoustic models, parliament speech data, transcribed speech corpus

ID: 13147237