Speech recognition alignments for Finnish parliament data



This dataset contains speech from the plenary sessions of the Finnish Parliament in 2008-2020, segmented and aligned for speech recognition training. In total, the training set has:

  • 1.4 million samples
  • 3100 hours of audio
  • 460 speakers
  • over 19 million word tokens

Additionally, the upload contains a 5-hour development set and a 5-hour evaluation set, both described in publication 10.21437/Interspeech.2017-1115. Because the training set (~300 GB) exceeds the Zenodo upload limit (50 GB), only the development and evaluation sets are published on Zenodo. The rest of the data is available at: http://urn.fi/urn:nbn:fi:lb-2021051903

The training set comes in two parts:

  • A 2008-2016 set, originally described in publication 10.21437/Interspeech.2017-1115. This set includes a list of samples from sessions in 2008-2014 that can be combined with the 2015-2020 set to form the 3100-hour training set.
  • A new 2015-2020 set.

All audio samples are single-channel, 16 kHz, 16-bit wav files. Each wav file has a corresponding transcript in a .trn text file. The data is machine-extracted, so small inaccuracies remain in the training set transcripts, and the set may contain a few Swedish samples. The development and evaluation sets have been corrected by hand.
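
As a rough illustration of the format described above, the following Python sketch checks that one wav file is single-channel, 16 kHz and 16-bit, and reads its transcript. It is not part of the dataset tooling: the function name `check_sample` is hypothetical, and the sketch assumes that the .trn file sits next to the wav file with the same base name and is UTF-8 encoded, which this description does not state.

```python
import wave
from pathlib import Path


def check_sample(wav_path):
    """Return (format_ok, transcript) for one wav/.trn pair.

    Assumption: the transcript shares the wav file's base name and
    directory, and is UTF-8 encoded. Adjust if the layout differs.
    """
    wav_path = Path(wav_path)
    trn_path = wav_path.with_suffix(".trn")

    with wave.open(str(wav_path), "rb") as wav:
        format_ok = (
            wav.getnchannels() == 1         # single-channel
            and wav.getframerate() == 16000  # 16 kHz
            and wav.getsampwidth() == 2      # 16-bit = 2 bytes per sample
        )

    # A missing transcript is reported as None rather than raising.
    transcript = None
    if trn_path.exists():
        transcript = trn_path.read_text(encoding="utf-8").strip()

    return format_ok, transcript
```

Running such a check over a directory of samples is one way to catch files that were truncated or converted during download before they reach a training pipeline.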

The licenses can be viewed at:

  • http://urn.fi/urn:nbn:fi:lb-2019112822 (audio)
  • http://urn.fi/urn:nbn:fi:lb-2019112823 (text)

The code used in extraction is available at:

  • https://github.com/aalto-speech/finnish-parliament-scripts (2008-2014, dev and eval sets)
  • https://github.com/aalto-speech/fi-parliament-tools (2015-2020 set)