Natural language processing (NLP) is the study of systems that perform natural-language tasks automatically, that is, without human supervision or intervention. This thesis considers NLP problems related to morphological analysis, the description of the internal structure of words. Knowledge of morphology is necessary for applications such as search engines, machine translators, and speech recognizers to handle rare and previously unseen word forms. We focus on two widely applied morphological analysis tasks: morphological tagging and morphological segmentation. In morphological tagging, the aim is to assign each word in its sentence context a word class label describing its morphological properties. Morphological segmentation, in turn, describes the inner structure of words by splitting word forms into their smallest meaning-bearing units, morphemes. We approach both problems using statistical, data-driven machine learning methods, in which the processing systems are learned (estimated) from training data prepared manually by a human expert. In particular, we focus on the highly influential conditional random field (CRF) model, proposed for sequence tagging and segmentation in the early 2000s.
As the first main contribution, the thesis discusses data-driven morphological segmentation with the CRF model. Particular emphasis is placed on the semi-supervised learning setting, in which the available data consist of a small number of annotated segmentation examples and a large number of unannotated raw word forms. An empirical evaluation on six languages shows that the proposed semi-supervised CRF-based approach is highly successful in the segmentation task compared to earlier methods. The error analysis shows that closed-class phenomena, such as suffixation in English and Finnish, can be learned in a supervised manner from just a small number of annotated examples. Open morpheme class phenomena, such as compounding in Finnish, can in turn be learned by additionally exploiting the large unannotated word list with the semi-supervised approach.
As the second main contribution, the thesis presents FinnPos, the first open-source statistical morphological tagging and lemmatization toolkit designed specifically for Finnish. The CRF-based FinnPos system is readily applicable to tagging and lemmatization of running text, using models learned from the recently published Finnish Turku Dependency Treebank and FinnTreeBank.
|Translated title of the contribution|Kontribuutioita morfologian oppimiseen ehdollisilla satunnaiskentillä|
|Publication status|Published - 2016|
|MoE publication type|G5 Doctoral dissertation (article)|
- language technology
- natural language
- conditional random fields