Advances in Weakly Supervised Learning of Morphology


Oskar Kohonen

Research output: Doctoral Thesis, Collection of Articles

Abstract

Morphological analysis decomposes words into smaller constituents. It is an important problem in natural language processing (NLP), particularly for morphologically rich languages whose large vocabularies make statistical modeling difficult. Morphological analysis has traditionally been approached with rule-based methods, which yield accurate results but are expensive to produce. More recently, unsupervised machine learning methods have been shown to perform sufficiently well to benefit applications such as speech recognition and machine translation. Unsupervised methods, however, do not typically model allomorphy, that is, non-concatenative structure such as pretty/prettier. Moreover, the accuracy of unsupervised methods remains far behind that of rule-based methods, with the best unsupervised methods yielding F-scores between 50% and 66% in Morpho Challenge 2010. We examine these problems with two approaches that have not previously attracted much attention in the field. First, we propose a novel extension to the popular unsupervised morphological segmentation method Morfessor Baseline that models allomorphy through string transformations. Second, we examine the effect of weak supervision on accuracy by training on a small annotated data set in addition to a large unannotated data set. We propose two novel semi-supervised morphological segmentation methods, namely a semi-supervised extension of Morfessor Baseline and morphological segmentation with conditional random fields (CRFs). The methods are evaluated on several languages with different morphological characteristics, including English, Estonian, Finnish, German and Turkish. The proposed methods are compared empirically to recently proposed weakly supervised methods. For the non-concatenative extension, we find that, while the string transformations identified by the model have high precision, their recall is low.
In the overall evaluation, the non-concatenative extension improves accuracy on English, but not on the other languages. For weak supervision, we find that the semi-supervised extension of Morfessor Baseline improves segmentation accuracy markedly over the unsupervised baseline. We find, however, that the discriminatively trained CRFs perform even better. In the empirical comparison, the CRF approach outperforms all other approaches on all included languages. Error analysis reveals that the CRF excels especially in affix accuracy.
Original language: English
Qualification: Doctor's degree
Awarding institution
  • Aalto University
Supervisors/Advisors
  • Oja, Erkki, Supervising Professor
  • Lagus, Krista, Thesis Advisor
Publisher
Print ISBN: 978-952-60-6270-9
Electronic ISBN: 978-952-60-6271-6
Publication status: Published - 2015
MoE publication type: G5 Doctoral dissertation (article)