Abstract
In data mining, large amounts of data are searched for useful pieces of information, called patterns. Significance testing is an important part of this task, because the patterns found need to be assessed for relevance and significance before any further action is taken. Advances in science have created a need to evaluate the significance of complicated data patterns within complicated datasets. Historically, significance testing has been conducted with specialized methods that cannot be adapted to new applications, and many of these methods have problems with their theoretical justification. This thesis suggests using the framework of property-based randomization to build reliable and flexible significance testing tools that can be adapted and extended to a wide variety of applications. The concepts of representation-based randomization and iterative pattern mining are also discussed as ways to enlarge the scope of these tools. The final chapter of the thesis reviews the use of these general ideas in various applications, such as databases and time series collections. The publications of the thesis are discussed along with selected introductions to other randomization methods that have been proposed.
Translated title of the contribution | Monimutkaisten nollahypoteesien käyttö tietohahmojen merkitsevyyden arvioinnissa |
---|---|
Original language | English |
Qualification | Doctor's degree |
Awarding Institution | |
Supervisors/Advisors | |
Publisher | |
Print ISBNs | 978-952-60-4494-1 |
Electronic ISBNs | 978-952-60-4495-8 |
Publication status | Published - 2012 |
MoE publication type | G5 Doctoral dissertation (article) |
Keywords
- data mining
- significance testing
- randomization
- null hypothesis
- null model
- Markov chain Monte Carlo
- frequent pattern
- clustering
- classification
- time series