Modeling under-resourced languages for speech recognition

Research output: Contribution to journal › Article › Scientific › peer-review

Standard

Modeling under-resourced languages for speech recognition. / Kurimo, Mikko; Enarvi, Seppo; Tilk, Ottokar; Varjokallio, Matti; Mansikkaniemi, André; Alumäe, Tanel.

In: Language Resources and Evaluation, Vol. 51, No. 4, 12.2017, p. 961-987.

BibTeX

@article{8e20ea9cbc654d6d8b92a014aa5af908,
title = "Modeling under-resourced languages for speech recognition",
abstract = "One particular problem in large vocabulary continuous speech recognition for low-resourced languages is finding relevant training data for the statistical language models. A large amount of data is required, because the models should estimate the probability of all possible word sequences. For Finnish, Estonian and the other Finno-Ugric languages, a special problem with the data is the huge number of different word forms that are common in normal speech. The same problem also exists in other language technology applications, such as machine translation and information retrieval, and to some extent also in other morphologically rich languages. In this paper we present methods and evaluations in four recent language modeling topics: selecting conversational data from the Internet, adapting models for foreign words, multi-domain and adapted neural network language modeling, and decoding with subword units. Our evaluations show that the same methods work in more than one language and that they scale down to smaller data resources.",
keywords = "Adaptation, Data filtering, Large vocabulary speech recognition, Statistical language modeling, Subword units",
author = "Mikko Kurimo and Seppo Enarvi and Ottokar Tilk and Matti Varjokallio and Andr{\'e} Mansikkaniemi and Tanel Alum{\"a}e",
year = "2017",
month = "12",
doi = "10.1007/s10579-016-9336-9",
language = "English",
volume = "51",
pages = "961--987",
journal = "Language Resources and Evaluation",
issn = "1574-020X",
publisher = "Springer Netherlands",
number = "4",

}

RIS

TY - JOUR

T1 - Modeling under-resourced languages for speech recognition

AU - Kurimo, Mikko

AU - Enarvi, Seppo

AU - Tilk, Ottokar

AU - Varjokallio, Matti

AU - Mansikkaniemi, André

AU - Alumäe, Tanel

PY - 2017/12

Y1 - 2017/12

N2 - One particular problem in large vocabulary continuous speech recognition for low-resourced languages is finding relevant training data for the statistical language models. A large amount of data is required, because the models should estimate the probability of all possible word sequences. For Finnish, Estonian and the other Finno-Ugric languages, a special problem with the data is the huge number of different word forms that are common in normal speech. The same problem also exists in other language technology applications, such as machine translation and information retrieval, and to some extent also in other morphologically rich languages. In this paper we present methods and evaluations in four recent language modeling topics: selecting conversational data from the Internet, adapting models for foreign words, multi-domain and adapted neural network language modeling, and decoding with subword units. Our evaluations show that the same methods work in more than one language and that they scale down to smaller data resources.

AB - One particular problem in large vocabulary continuous speech recognition for low-resourced languages is finding relevant training data for the statistical language models. A large amount of data is required, because the models should estimate the probability of all possible word sequences. For Finnish, Estonian and the other Finno-Ugric languages, a special problem with the data is the huge number of different word forms that are common in normal speech. The same problem also exists in other language technology applications, such as machine translation and information retrieval, and to some extent also in other morphologically rich languages. In this paper we present methods and evaluations in four recent language modeling topics: selecting conversational data from the Internet, adapting models for foreign words, multi-domain and adapted neural network language modeling, and decoding with subword units. Our evaluations show that the same methods work in more than one language and that they scale down to smaller data resources.

KW - Adaptation

KW - Data filtering

KW - Large vocabulary speech recognition

KW - Statistical language modeling

KW - Subword units

UR - http://www.scopus.com/inward/record.url?scp=84957656590&partnerID=8YFLogxK

U2 - 10.1007/s10579-016-9336-9

DO - 10.1007/s10579-016-9336-9

M3 - Article

VL - 51

SP - 961

EP - 987

JO - Language Resources and Evaluation

JF - Language Resources and Evaluation

SN - 1574-020X

IS - 4

ER -

ID: 1516608