Multiple hypothesis testing in data mining

Sami Hanhijärvi

    Research output: Thesis > Doctoral Thesis > Collection of Articles

    Abstract

    Data mining methods seek to discover unexpected and interesting regularities, called patterns, in presented data sets. However, the methods often return a collection of patterns for any data set, even a random one. Statistical significance testing can be applied in these scenarios to select the surprising patterns that do not appear as clearly in random data. As each pattern is tested for significance, a set of statistical hypotheses is considered simultaneously. Testing several hypotheses simultaneously is called multiple hypothesis testing, and special treatment is required to adequately control the probability of falsely declaring a pattern statistically significant. However, the traditional methods for multiple hypothesis testing cannot be used in data mining scenarios, because these methods do not consider the problem of a varying set of hypotheses, which is inherent in data mining. This thesis provides an introduction to the problem and reviews some published work on the subject. The focus is on multiple hypothesis testing, specifically in data mining. The problems with traditional multiple hypothesis testing methods in data mining scenarios are discussed, and a solution to these problems is presented. The solution uses randomization, which involves drawing samples of random data sets and running the data mining algorithm on them. The results on the random data sets are then compared with the results on the original data set. Randomization is introduced and discussed in general, and possible randomization schemes in different data mining scenarios are presented. The solution is applied in iterative data mining and biclustering scenarios. Experiments are carried out to demonstrate its utility in these applications.
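The randomization idea described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's own algorithm or code: the toy miner `mine_pairs`, the column-shuffling scheme `shuffle_columns`, and the max-statistic adjustment in `adjusted_p_values` are all assumptions chosen for illustration. The sketch mines frequent item pairs from a binary data set, mines the same kind of patterns from randomized copies, and compares each observed pattern against the strongest pattern found in each random data set, which accommodates the fact that the set of returned patterns varies from one data set to the next.

```python
import itertools
import random


def mine_pairs(data, min_support):
    """Toy stand-in for a data mining algorithm (hypothetical).

    Returns {(col_a, col_b): support} for column pairs that co-occur
    in at least min_support rows of a 0/1 matrix.
    """
    n_cols = len(data[0])
    found = {}
    for a, b in itertools.combinations(range(n_cols), 2):
        support = sum(1 for row in data if row[a] and row[b])
        if support >= min_support:
            found[(a, b)] = support
    return found


def shuffle_columns(data, rng):
    """One possible randomization scheme (an assumption): permute each
    column independently, which keeps every column sum unchanged."""
    columns = [list(col) for col in zip(*data)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


def adjusted_p_values(data, min_support, n_random, seed=0):
    """Empirical p-values adjusted with the maximum statistic.

    For each random data set, only the largest support among the mined
    patterns is kept, so the comparison does not depend on which
    patterns the miner happened to return.
    """
    rng = random.Random(seed)
    observed = mine_pairs(data, min_support)
    max_stats = []
    for _ in range(n_random):
        random_patterns = mine_pairs(shuffle_columns(data, rng), min_support)
        max_stats.append(max(random_patterns.values(), default=0))
    return {
        pattern: (1 + sum(1 for m in max_stats if m >= support)) / (n_random + 1)
        for pattern, support in observed.items()
    }


# Example data: columns 0 and 1 always co-occur, column 2 is unrelated.
example = [[1 if i < 10 else 0, 1 if i < 10 else 0, i % 2] for i in range(20)]
p_values = adjusted_p_values(example, min_support=3, n_random=99)
```

A pattern would then be declared significant when its adjusted p-value falls below the chosen level. The choice of randomization scheme determines which properties of the data are treated as given, so it should be matched to the null hypothesis of interest.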
    Translated title of the contribution: Monen hypoteesin testaus tiedonlouhinnassa
    Original language: English
    Qualification: Doctor's degree
    Awarding Institution
    • Aalto University
    Supervisors/Advisors
    • Mannila, Heikki, Supervising Professor
    • Rousu, Juho, Supervising Professor
    • Puolamäki, Kai, Thesis Advisor
    Publisher
    Print ISBNs: 978-952-60-4604-4
    Electronic ISBNs: 978-952-60-4605-1
    Publication status: Published - 2012
    MoE publication type: G5 Doctoral dissertation (article)

    Keywords

    • data mining
    • multiple hypothesis testing
    • statistical significance testing
    • biclustering
