Sampling from scarcely defined distributions: Methods and applications in data mining

Aleksi Kallio

Research output: Thesis › Doctoral Thesis › Collection of Articles


The importance of data is widely acknowledged in modern society. Increasing volumes of information and growing interest in data-driven decision making are creating new demands for analytical methods. In data mining applications, users are often required to operate with limited background knowledge. Specifically, one needs to analyze data and derived statistics without exact information on the underlying statistical distributions. This work introduces the term scarcely defined distributions to describe such statistical distributions. In traditional statistical testing, one often makes assumptions about the source of data, such as those related to the normal distribution. If data are produced by a controlled experiment and originate from a well-known source, these assumptions can be justified. In data mining, strong presuppositions about the data source typically cannot be made, as the data source is not under the control of the analyst, is not well known, or is too complex to understand. The present research discusses methods and applications of data mining in which scarcely defined distributions emerge. Several strategies are put forth that allow a dataset to be analyzed even when its distributions are not well known, in both frequentist and information-theoretic statistical frameworks. A recurring theme is how to employ controls at the analysis phase when the data were not produced in a controlled experiment. In most cases presented, control is achieved by adopting randomization and other empirical sampling methods that rely on large data sizes and computational power. The data mining applications reviewed in this work come from several fields. Biomedical measurement data are explored in multiple cases, involving both microarray and high-throughput sequencing data types. In the ecological and paleontological domains, the analysis of presence-absence data of taxa is discussed.
A common factor in all of the application areas is the complexity of the underlying processes and the biased error sources of the measurement process. Finally, the study discusses the future trend of growing data volumes and the relevance of the proposed methods and solutions in that context. It is noted that the growing complexity and the need for quickly adaptable methods favor the general approach taken in the thesis, while increasing data volumes and computational power make it practically feasible.
Translated title of the contribution: Otanta niukasti määritellyistä jakaumista: menetelmät ja sovellukset tiedonlouhinnassa
Original language: English
Qualification: Doctor's degree
Awarding Institution
  • Aalto University
  • Gionis, Aristides, Supervising Professor
  • Mannila, Heikki, Thesis Advisor
  • Puolamäki, Kai, Thesis Advisor
Print ISBNs: 978-952-60-6653-0
Electronic ISBNs: 978-952-60-6654-7
Publication status: Published - 2016
MoE publication type: G5 Doctoral dissertation (article)


  • data mining
  • statistical significance
  • probability distribution
  • null model
  • algorithmic data analysis

