Low-Resource Active Learning of Morphological Segmentation

Research output: Journal article, peer-reviewed

Standard

Low-Resource Active Learning of Morphological Segmentation. / Grönroos, Stig-Arne; Hiovain, Katri; Smit, Peter; Rauhala, Ilona; Jokinen, Kristiina; Kurimo, Mikko; Virpioja, Sami.

In: NORTHERN EUROPEAN JOURNAL OF LANGUAGE TECHNOLOGY, Vol. 4, 4, 2016, pp. 47-72.

Author

Grönroos, Stig-Arne ; Hiovain, Katri ; Smit, Peter ; Rauhala, Ilona ; Jokinen, Kristiina ; Kurimo, Mikko ; Virpioja, Sami. / Low-Resource Active Learning of Morphological Segmentation. In: NORTHERN EUROPEAN JOURNAL OF LANGUAGE TECHNOLOGY. 2016 ; Vol. 4. pp. 47-72.

BibTeX - Download

@article{de2319c1df1d46949fe970fc49598b57,
title = "Low-Resource Active Learning of Morphological Segmentation",
abstract = "Many Uralic languages have a rich morphological structure, but lack morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation with a large unannotated corpus and a small amount of annotated word forms selected using an active learning approach. We apply the procedure to two Finno-Ugric languages: Finnish and North S{\'a}mi. The semi-supervised Morfessor FlatCat method is used for statistical learning. For Finnish, we set up a simulated scenario to test various active learning query strategies. The best performance is provided by a coverage-based strategy on word initial and final substrings. For North S{\'a}mi we collect a set of human-annotated data. With 300 words annotated with our active learning setup, we see a relative improvement in morph boundary F1-score of 19{\%} compared to unsupervised learning and 7.8{\%} compared to random selection.",
author = "Stig-Arne Gr{\"o}nroos and Katri Hiovain and Peter Smit and Ilona Rauhala and Kristiina Jokinen and Mikko Kurimo and Sami Virpioja",
year = "2016",
doi = "10.3384/nejlt.2000-1533.1644",
language = "English",
volume = "4",
pages = "47--72",
journal = "NORTHERN EUROPEAN JOURNAL OF LANGUAGE TECHNOLOGY",
issn = "2000-1533",

}

RIS - Download

TY - JOUR

T1 - Low-Resource Active Learning of Morphological Segmentation

AU - Grönroos, Stig-Arne

AU - Hiovain, Katri

AU - Smit, Peter

AU - Rauhala, Ilona

AU - Jokinen, Kristiina

AU - Kurimo, Mikko

AU - Virpioja, Sami

PY - 2016

Y1 - 2016

N2 - Many Uralic languages have a rich morphological structure, but lack morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation with a large unannotated corpus and a small amount of annotated word forms selected using an active learning approach. We apply the procedure to two Finno-Ugric languages: Finnish and North Sámi. The semi-supervised Morfessor FlatCat method is used for statistical learning. For Finnish, we set up a simulated scenario to test various active learning query strategies. The best performance is provided by a coverage-based strategy on word initial and final substrings. For North Sámi we collect a set of human-annotated data. With 300 words annotated with our active learning setup, we see a relative improvement in morph boundary F1-score of 19% compared to unsupervised learning and 7.8% compared to random selection.

AB - Many Uralic languages have a rich morphological structure, but lack morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation with a large unannotated corpus and a small amount of annotated word forms selected using an active learning approach. We apply the procedure to two Finno-Ugric languages: Finnish and North Sámi. The semi-supervised Morfessor FlatCat method is used for statistical learning. For Finnish, we set up a simulated scenario to test various active learning query strategies. The best performance is provided by a coverage-based strategy on word initial and final substrings. For North Sámi we collect a set of human-annotated data. With 300 words annotated with our active learning setup, we see a relative improvement in morph boundary F1-score of 19% compared to unsupervised learning and 7.8% compared to random selection.

U2 - 10.3384/nejlt.2000-1533.1644

DO - 10.3384/nejlt.2000-1533.1644

M3 - Article

VL - 4

SP - 47

EP - 72

JO - NORTHERN EUROPEAN JOURNAL OF LANGUAGE TECHNOLOGY

JF - NORTHERN EUROPEAN JOURNAL OF LANGUAGE TECHNOLOGY

SN - 2000-1533

M1 - 4

ER -

ID: 11716607