Abstract
Traditionally, teaching a human and training a Machine Learning (ML) model are quite different, but organized and
structured learning can enable a faster and better understanding of the underlying concepts. For
example, when humans learn to speak, they first learn how to utter basic phones and then slowly move towards
more complex structures such as words and sentences. Motivated by this observation, researchers have started
to adapt this approach for training ML models. Since the main concept, the gradual increase in difficulty,
resembles the notion of the curriculum in education, the methodology became known as Curriculum Learning
(CL). In this work, we design and test new CL approaches to train Automatic Speech Recognition systems,
specifically focusing on the so-called end-to-end models. These models consist of a single, large-scale neural
network that performs the recognition task, in contrast to the traditional way of having several specialized
components focusing on different subtasks (e.g., acoustic and language modeling). We demonstrate that end-to-end
models can achieve better performance if they are provided with an organized training set consisting
of examples that exhibit an increasing level of difficulty. To impose structure on the training set and to define
the notion of an easy example, we explored multiple solutions that use either external, static scoring methods
or incorporate feedback from the model itself. In addition, we examined the effect of pacing functions that
control how much data is presented to the network during each training epoch. Our proposed curriculum
learning strategies were tested on the task of speech recognition on two data sets, one containing spontaneous
Finnish speech where volunteers were asked to speak about a given topic, and one containing planned English
speech. Empirical results showed that a good curriculum strategy can yield performance improvements and
speed up convergence. After a given number of epochs, our best strategy achieved a 5.6% and 3.4% decrease
in terms of test set word error rate for the Finnish and English data sets, respectively.
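To make the general idea concrete (this is an illustrative sketch, not the authors' implementation), curriculum learning combines a difficulty score that orders the training set with a pacing function that decides how much of the ordered data each epoch sees. In the sketch below, both the static scoring method (utterance length) and the linear pacing schedule with a `start_frac` parameter are hypothetical stand-ins for the criteria studied in the paper:

```python
# Illustrative curriculum-learning sketch: a static difficulty score plus a
# pacing function. The score (transcript length) and the linear schedule are
# assumptions for illustration, not the paper's actual criteria.

def difficulty(example):
    # Static scoring method: shorter utterances are assumed easier.
    return len(example["transcript"])

def pacing(epoch, total_epochs, dataset_size, start_frac=0.2):
    # Linear pacing: begin with the easiest start_frac of the data and
    # grow to the full training set by the final epoch.
    frac = start_frac + (1.0 - start_frac) * epoch / max(1, total_epochs - 1)
    return int(frac * dataset_size)

def curriculum_epochs(dataset, total_epochs):
    # Sort once by the static difficulty score (easy -> hard), then let the
    # pacing function decide how much of the sorted data each epoch sees.
    ordered = sorted(dataset, key=difficulty)
    for epoch in range(total_epochs):
        yield ordered[: pacing(epoch, total_epochs, len(ordered))]

# Toy dataset of five transcripts of differing lengths.
data = [{"transcript": "a" * n} for n in (5, 1, 9, 3, 7)]
subsets = list(curriculum_epochs(data, total_epochs=3))
# The first epoch trains on a small, easy subset; the last on everything.
```

A model-feedback variant, as described in the abstract, would replace the static `difficulty` function with a score recomputed from the model's current loss on each example between epochs.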
| Original language | English |
| --- | --- |
| Article number | 103113 |
| Number of pages | 16 |
| Journal | Speech Communication |
| Volume | 163 |
| DOIs | |
| Publication status | Published - Sept 2024 |
| MoE publication type | A1 Journal article-refereed |
Keywords
- ASR
- Curriculum learning
- Deep learning
- End to end
- Speech recognition
Fingerprint
Dive into the research topics of 'Comparison and analysis of new curriculum criteria for end-to-end ASR'. Together they form a unique fingerprint.

Projects (1 active)
- USSEE: Understanding speech and scene with ears and eyes (USSEE)
Laaksonen, J. (Principal investigator), Pehlivan Tort, S. (Project Member), Wang, T.-J. (Project Member), Guo, Z. (Project Member), Saif, A. (Project Member) & Riahi, I. (Project Member)
01/01/2022 → 31/12/2024
Project: Academy of Finland: Other research funding