An Equal Data Setting for Attention-Based Encoder-Decoder and HMM/DNN Models: A Case Study in Finnish ASR

Aku Rouhe*, Astrid Van Camp, Mittul Singh, Hugo Van Hamme, Mikko Kurimo

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review



Standard end-to-end training of attention-based ASR models uses only transcribed speech. If these models are compared to HMM/DNN systems, which additionally leverage a large corpus of text-only data and expert-crafted lexica, the differences in modeling cannot be disentangled from differences in data. We propose an experimental setup in which only transcribed speech is used to train both model types. To highlight the difference that text-only data can make, we use Finnish, where an expert-crafted lexicon is not needed. With 1500 hours of equal data, we find that both ASR paradigms perform similarly, but adding text-only data quickly improves the HMM/DNN system. On a smaller 160-hour subset, we find that HMM/DNN models outperform AED models.

Original language: English
Title of host publication: Speech and Computer - 23rd International Conference, SPECOM 2021, Proceedings
Editors: Alexey Karpov, Rodmonga Potapova
Number of pages: 12
ISBN (Print): 9783030878016
Publication status: Published - 2021
MoE publication type: A4 Conference publication
Event: International Conference on Speech and Computer - Virtual, Online
Duration: 27 Sept 2021 – 30 Sept 2021
Conference number: 23

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12997 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: International Conference on Speech and Computer
Abbreviated title: SPECOM
City: Virtual, Online


Keywords

  • Attention-based Encoder-Decoder
  • Equal data


