Learning Constructions of Natural Language: Statistical Models and Evaluations

Sami Virpioja

    Research output: Thesis › Doctoral Thesis › Collection of Articles


    The modern, statistical approach to natural language processing relies on applying machine learning techniques to the growing amount of text and speech data available in electronic format. Typical applications for statistical methods include information retrieval, speech recognition, and machine translation. Many problems encountered in these applications can be solved without language-dependent resources, such as annotated data sets, by means of unsupervised learning. This thesis focuses on one such problem: the selection of lexical units. Unit selection is the first step in processing text data, preceding, for example, the estimation of language models or the extraction of vectorial representations. While lexical units are often selected using simple heuristics or grammatical rule-based methods, this thesis proposes the use of unsupervised and semi-supervised machine learning. Advantages of data-driven unit selection include greater flexibility and independence from the linguistic resources that exist for a particular language and domain. Statistically learned lexical units do not always fit the categories of traditional linguistic theories. In this thesis, they are called constructions, following construction grammars, a family of usage-based, cognitive theories of grammar. For learning the constructions of a language, the thesis builds on Morfessor, an unsupervised statistical method for morphological segmentation. Morfessor is successfully extended to the tasks of learning allomorphs, semi-supervised learning of morphological segmentation, and learning phrasal constructions of sentences. The results are competitive, especially for the morphology induction problems. The thesis also includes new techniques for using the sub-word constructions learned by Morfessor in statistical language modeling and machine translation.
    In addition to its usefulness in the applications, Morfessor is shown to have psycholinguistic competence: its probability estimates correlate highly with human reaction times in a lexical decision task. Furthermore, direct evaluation methods for unit selection and other learning problems are considered. Direct evaluations, such as comparing the output of the algorithm to existing linguistic annotations, are often quicker and simpler than indirect evaluation via end-user applications. However, with unsupervised algorithms, the comparison to the reference data is not always straightforward. In this thesis, direct evaluation methods are developed for two unsupervised tasks: morphology induction and learning semantic vector representations of documents. In both cases, the challenge is to find relationships between pairs of features in multidimensional data. The proposed methods are quick to use and can accurately predict performance in different applications.
    Translated title of the contribution: Luonnollisen kielen rakenteiden oppiminen: tilastollisia malleja ja evaluaatiomenetelmiä
    Original language: English
    Qualification: Doctor's degree
    Awarding Institution
    • Aalto University
    Supervisors/Advisors
    • Oja, Erkki, Supervisor
    • Kurimo, Mikko, Advisor
    • Lagus, Krista, Advisor
    Print ISBNs: 978-952-60-4882-6
    Electronic ISBNs: 978-952-60-4883-3
    Publication status: Published - 2012
    MoE publication type: G5 Doctoral dissertation (article)


    Keywords
    • morpheme segmentation
    • morphology induction
    • construction grammar
    • unsupervised learning
    • semi-supervised learning
    • probabilistic models
    • language models
    • vector space models
    • machine translation
    • speech recognition
