Comparison and analysis of new curriculum criteria for end-to-end ASR

Georgios Karakasidis, Mikko Kurimo, Peter Bell, Tamás Grósz

Research output: Contribution to journal › Article › Scientific › peer-review


Abstract

Teaching a human and training a Machine Learning (ML) model are traditionally quite different processes, but
organized and structured learning can enable faster and better understanding of the underlying concepts. For
example, when humans learn to speak, they first learn how to utter basic phones and then slowly move towards
more complex structures such as words and sentences. Motivated by this observation, researchers have started
to adapt this approach for training ML models. Since the main concept, the gradual increase in difficulty,
resembles the notion of the curriculum in education, the methodology became known as Curriculum Learning
(CL). In this work, we design and test new CL approaches to train Automatic Speech Recognition systems,
specifically focusing on the so-called end-to-end models. These models consist of a single, large-scale neural
network that performs the recognition task, in contrast to the traditional way of having several specialized
components focusing on different subtasks (e.g., acoustic and language modeling). We demonstrate that end-
to-end models can achieve better performance if they are provided with an organized training set consisting
of examples that exhibit an increasing level of difficulty. To impose structure on the training set and to define
the notion of an easy example, we explored multiple solutions that use either external, static scoring methods
or incorporate feedback from the model itself. In addition, we examined the effect of pacing functions that
control how much data is presented to the network during each training epoch. Our proposed curriculum
learning strategies were tested on the task of speech recognition on two data sets, one containing spontaneous
Finnish speech where volunteers were asked to speak about a given topic, and one containing planned English
speech. Empirical results showed that a good curriculum strategy can yield performance improvements and
speed up convergence. After a given number of epochs, our best strategy achieved a 5.6% and 3.4% decrease
in terms of test set word error rate for the Finnish and English data sets, respectively.
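As a rough illustration of the mechanism described in the abstract (a minimal sketch, not the scoring criteria or pacing functions actually proposed in the paper), the Python snippet below sorts training examples from easy to hard with a static difficulty score and uses a linear pacing function to control how much of the sorted data is visible in each epoch. The length-based difficulty proxy, the linear pacing schedule, and all function names and toy data are assumptions made for this example.

    # Illustrative sketch only: static difficulty score + linear pacing function.
    # The length-based difficulty proxy and all names here are hypothetical.

    def difficulty_score(example):
        """Static difficulty proxy: longer utterances are treated as harder."""
        return example["duration_seconds"]

    def linear_pacing(epoch, total_epochs, full_size, start_fraction=0.2):
        """Number of easiest examples visible at a given epoch, growing
        linearly from a fraction of the data up to the full training set."""
        fraction = start_fraction + (1.0 - start_fraction) * min((epoch + 1) / total_epochs, 1.0)
        return max(1, int(fraction * full_size))

    def curriculum_subset(train_set, epoch, total_epochs):
        """Sort once from easy to hard, then expose a growing prefix each epoch."""
        ordered = sorted(train_set, key=difficulty_score)
        visible = linear_pacing(epoch, total_epochs, len(ordered))
        return ordered[:visible]

    # Toy usage: three utterances of different lengths.
    toy_train_set = [
        {"utterance_id": "a", "duration_seconds": 2.1},
        {"utterance_id": "b", "duration_seconds": 7.8},
        {"utterance_id": "c", "duration_seconds": 4.3},
    ]
    for epoch in range(3):
        subset = curriculum_subset(toy_train_set, epoch, total_epochs=3)
        print(epoch, [ex["utterance_id"] for ex in subset])

A model-feedback variant of this sketch would replace the static difficulty_score with a quantity recomputed during training, for example the model's current per-utterance loss, while keeping the same pacing-controlled selection.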
Original language: English
Article number: 103113
Number of pages: 16
Journal: Speech Communication
Volume: 163
DOIs
Publication status: Published - Sept 2024
MoE publication type: A1 Journal article-refereed

Keywords

  • ASR
  • Curriculum learning
  • Deep learning
  • End to end
  • Speech recognition
