Extending data mining techniques for frequent pattern discovery: Trees, low-entropy sets, and crossmining

Hannes Heikinheimo

    Research output: ThesisDoctoral ThesisMonograph

    Abstract

    The idea of frequent pattern discovery is to find frequently occurring events in large databases. Such data mining techniques can be useful in various domains. For instance, in recommendation and e-commerce systems frequently occurring product purchase combinations are essential in user preference modeling. In the ecological domain, patterns of frequently occurring groups of species can be used to reveal insight into species interaction dynamics. Over the past few years, most frequent pattern mining research has concentrated on efficiency (speed) of mining algorithms. However, it has been argued within the community that while efficiency of the mining task is no longer a bottleneck, there is still an urgent need for methods that derive compact, yet high quality results with good application properties. The aim of this thesis is to address this need. The first part of the thesis discusses a new type of tree pattern class for expressing hierarchies of general and more specific attributes in unstructured binary data. The new pattern class is shown to have advantageous properties, and to discover relationships in data that cannot be expressed alone with the more traditional frequent itemset or association rule patterns. The second and third parts of the thesis discuss the use of entropy as a score measure for frequent pattern mining. A new pattern class is defined, low-entropy sets, which allow to express more general types of occurrence structure than with frequent itemsets. The concept can also be easily applied to tree types of pattern. Furthermore, by applying minimum description length in pattern selection for low-entropy sets it is shown experimentally that in most cases the collections of selected patterns are much smaller than by using frequent itemsets. The fourth part of the thesis examines the idea of crossmining itemsets, that is, relating itemsets to numerical variables in a database of mixed data types. The problem is formally defined and turns out to be NP-hard, although it is approximately solvable within a constant-factor of the optimum solution. Experiments show that the algorithm finds itemsets that convey structure in both the binary and the numerical part of the data.
    Translated title of the contributionExtending data mining techniques for frequent pattern discovery trees, low-entropy sets, and crossmining
    Original languageEnglish
    QualificationDoctor's degree
    Awarding Institution
    • Aalto University
    Supervisors/Advisors
    • Mannila, Heikki, Supervising Professor
    • Mannila, Heikki, Thesis Advisor
    Print ISBNs978-952-60-3003-6
    Electronic ISBNs978-952-60-3004-3
    Publication statusPublished - 2010
    MoE publication typeG4 Doctoral dissertation (monograph)

    Keywords

    • data analysis
    • frequent patterns
    • trees
    • entropy
    • minimum description length
    • pattern selection
    • clustering
    • mining mixed data types

    Fingerprint Dive into the research topics of 'Extending data mining techniques for frequent pattern discovery: Trees, low-entropy sets, and crossmining'. Together they form a unique fingerprint.

    Cite this