Low-Resource Active Learning of Morphological Segmentation

Stig-Arne Grönroos, Katri Hiovain, Peter Smit, Ilona Rauhala, Kristiina Jokinen, Mikko Kurimo, Sami Virpioja

Research output: Journal article, scientific, peer-reviewed

Abstract

Many Uralic languages have a rich morphological structure but lack the morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation from a large unannotated corpus and a small number of annotated word forms selected using an active learning approach. We apply the procedure to two Finno-Ugric languages: Finnish and North Sámi. The semi-supervised Morfessor FlatCat method is used for statistical learning. For Finnish, we set up a simulated scenario to test various active learning query strategies. The best performance is provided by a coverage-based strategy on word-initial and word-final substrings. For North Sámi, we collect a set of human-annotated data. With 300 words annotated with our active learning setup, we see a relative improvement in morph boundary F1-score of 19% compared to unsupervised learning and 7.8% compared to random selection.
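
The abstract names a coverage-based query strategy over word-initial and word-final substrings but does not spell out how it is scored. The following is only a minimal sketch of one plausible greedy variant, not the authors' implementation: the function names `edge_substrings` and `select_by_coverage`, the `max_len` cutoff, and the frequency weighting are all assumptions made for illustration.

```python
from collections import Counter


def edge_substrings(word, max_len=4):
    """Return the word-initial and word-final substrings of up to max_len characters."""
    feats = set()
    for n in range(1, min(max_len, len(word)) + 1):
        feats.add(("initial", word[:n]))
        feats.add(("final", word[-n:]))
    return feats


def select_by_coverage(pool, budget, max_len=4):
    """Greedily pick words whose edge substrings add the most uncovered weight.

    Assumption: a substring's weight is the number of pool tokens it occurs in.
    """
    weights = Counter()
    for word in pool:
        weights.update(edge_substrings(word, max_len))

    covered = set()
    selected = []
    candidates = list(dict.fromkeys(pool))  # unique words, original order kept

    while candidates and len(selected) < budget:
        def gain(word):
            # Total weight of this word's substrings that are not yet covered.
            return sum(weights[f] for f in edge_substrings(word, max_len) - covered)

        best = max(candidates, key=gain)  # ties go to the earliest candidate
        selected.append(best)
        covered |= edge_substrings(best, max_len)
        candidates.remove(best)

    return selected


if __name__ == "__main__":
    pool = ["talossa", "talossa", "talo", "kissassa", "kissa", "autossa"]
    print(select_by_coverage(pool, budget=3))
```

Under this sketch, the selected words would be handed to an annotator for segmentation and then used as the annotated set for semi-supervised training. Once a frequent prefix or suffix has been covered by an earlier pick, it no longer contributes to a candidate's score.
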
Original language: English
Article number: 4
Pages: 47-72
Number of pages: 26
Journal: Northern European Journal of Language Technology
Volume: 4
Status: Published - 2016
OKM publication type: A1 Published article, refereed

  • Equipment

    Science-IT

    Mikko Hakala (Manager)

    School of Science

    Equipment/facilities: Facility
