Randomization algorithms for assessing the significance of data mining results

Markus Ojala

    Research output: Thesis > Doctoral Thesis > Collection of Articles

    Abstract

    Data mining is an interdisciplinary research area that develops general methods for finding interesting and useful knowledge in large collections of data. This thesis addresses, from a computational point of view, the problem of assessing whether data mining results are merely random artefacts in the data or something more interesting. In randomization-based significance testing, a result is compared with the results obtained on randomized data. The randomized data are assumed to share some basic properties with the original data. To apply the randomization approach, the first step is to define these properties. The next step is to develop algorithms that can produce such randomizations. Results on the real data that clearly differ from the results on the randomized data are not directly explained by the studied properties of the data. In this thesis, new randomization methods are developed for four specific data mining scenarios. First, randomizing matrix data while preserving the distributions of values in rows and columns is studied. Next, a general randomization approach is introduced for iterative data mining. Randomization in multi-relational databases is also considered. Finally, a simple permutation method is given for assessing whether dependencies between features are exploited in classification. The properties of the new randomization methods are analyzed theoretically. Extensive experiments are performed on real and artificial datasets. The randomization methods introduced in this thesis are useful in various data mining applications. The methods work well on different types of data, are easy to use, and provide meaningful information for further improving and understanding the data mining results.
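    The general procedure described in the abstract (compute a result on the real data, recompute it on randomized data that share chosen properties, and compare) can be illustrated with a minimal sketch. This is not one of the thesis's algorithms: the function names, the toy two-column dataset, the absolute-covariance statistic, and the column-shuffling null model are all assumptions chosen for illustration only.

```python
import random

def empirical_p_value(data, statistic, randomize, n_samples=999, seed=0):
    """Fraction of randomized datasets whose statistic is at least as
    large as the observed one, with the usual +1 correction."""
    rng = random.Random(seed)
    observed = statistic(data)
    hits = 0
    for _ in range(n_samples):
        if statistic(randomize(data, rng)) >= observed:
            hits += 1
    return (hits + 1) / (n_samples + 1)

def shuffle_second_column(pairs, rng):
    """Toy null model (hypothetical): keep the values of each column,
    but break the pairing between the two columns."""
    ys = [y for _, y in pairs]
    rng.shuffle(ys)
    return [(x, y) for (x, _), y in zip(pairs, ys)]

def abs_covariance(pairs):
    """Toy 'data mining result': absolute covariance of the two columns."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return abs(sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n)

# Toy data with a strong dependency between the two columns.
pairs = [(i, 2 * i + 1) for i in range(20)]
p = empirical_p_value(pairs, abs_covariance, shuffle_second_column)
print(f"empirical p-value: {p}")
```

    A small p-value here would suggest that the observed statistic is not explained by the preserved property alone (in this sketch, the per-column value distributions). Choosing which properties the randomization preserves is the modelling decision the abstract refers to as the first step.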
    Translated title of the contribution: Randomization algorithms for assessing the significance of data mining results
    Original language: English
    Qualification: Doctor's degree
    Awarding Institution
    • Aalto University
    Supervisors/Advisors
    • Mannila, Heikki, Supervisor
    • Mannila, Heikki, Advisor
    Publisher
    Print ISBNs: 978-952-60-4322-7
    Electronic ISBNs: 978-952-60-4323-4
    Publication status: Published - 2011
    MoE publication type: G5 Doctoral dissertation (article)

    Keywords

    • data mining
    • randomization
    • significance testing
    • MCMC
    • matrix
    • relational database
    • clustering
    • classification
    • iterative analysis
