Language- and domain-independent text mining

Mari-Sanna Paukkeri

    Research output: Thesis › Doctoral Thesis › Collection of Articles


    The field of natural language processing (NLP) has developed enormously in recent decades. The availability of a constantly increasing amount of textual data in electronic form has also accelerated the development of statistical methods for NLP, in which characteristics of natural languages are learned from large corpora. Statistical methods have shown their applicability in information retrieval, in which documents of various languages and domains are returned according to user queries; in statistical machine translation, which is easily applicable to new languages; in document clustering, which groups semantically similar documents; and in many information extraction tasks, including keyphrase extraction, document summarization, and the discovery of linguistic features. However, the majority of NLP research, including many statistical methods, concentrates on the English language and uses various language-specific tools and resources, such as part-of-speech taggers and ontologies, which are not directly applicable to other languages. Furthermore, methods developed for English alone may not be suitable for languages with a different syntax or writing system. In this dissertation, language-independent methods for natural language processing are developed and discussed. Language-independent methods can be applied to a variety of languages without requiring additional language-specific resources. Dialects, historical forms of languages, languages with few speakers, and languages used in specific domains are also accessible with language-independent methods. As the main contribution of this thesis, Likey, a language-independent method for keyphrase extraction and feature selection, is developed. The method is applied to keyphrase extraction from encyclopedias and scientific articles in eleven languages, and further used as a feature selection method for automatic taxonomy learning and in a novel approach to user modelling in document difficulty assessment.
    Another major contribution is related to document representations: a set of dimensionality reduction methods and distance measures is compared in a document clustering task, a novel language-independent direct evaluation method for document representations is proposed, and linguistic features are used for document clustering in a lexical choice task.
    Translated title of the contribution: Kielestä ja aihealueesta riippumaton tekstinlouhinta
    Original language: English
    Qualification: Doctor's degree
    Awarding Institution
    • Aalto University
    • Oja, Erkki, Supervising Professor
    • Honkela, Timo, Thesis Advisor
    • Creutz, Mathias, Thesis Advisor
    Print ISBNs: 978-952-60-4833-8
    Electronic ISBNs: 978-952-60-4834-5
    Publication status: Published - 2012
    MoE publication type: G5 Doctoral dissertation (article)


    • natural language processing
    • computational linguistics
    • unsupervised machine learning
    • language independence
    • subjectivity of language use
    • keyphrase extraction
    • document clustering

