Machine learning methods for incomplete data and variable selection

Emil Eirola

    Research output: Thesis › Doctoral Thesis › Collection of Articles


    Machine learning is a rapidly advancing field. While increasingly sophisticated statistical methods are being developed, their use for concrete applications is not necessarily clear-cut. This thesis explores techniques to handle issues that arise when applying machine learning algorithms to practical data sets. The focus is on two particular problems: how to make effective use of incomplete data sets without having to discard samples with missing values, and how to select an appropriately representative set of variables for a given task. For tasks with missing values, distance estimation is presented as a new approach that directly enables a large class of machine learning methods to be used. It is shown that the distance can be estimated reliably and efficiently, and experimental results are provided to support the procedure. The idea is studied both on a general level and in the specific case where the estimation is conducted with a Gaussian mixture model. The issue of variable selection is considered from the perspective of finding suitable criteria which are feasible to calculate and effective at distinguishing the most useful variables, even for non-linear connections when limited data is available. Two alternatives are studied. The first is the Delta test, a noise variance estimator based on the nearest neighbour regression model. It is shown that the optimal selection of features uniquely minimises the expectation of the estimator. The second method is a mutual information estimator based on a mixture of Gaussians. The procedure is based on a single mixture model which can be used to derive estimates for any subset of variables. This leads to congruous estimates for the mutual information of different variable sets, which can then be compared to each other in a meaningful way to find the optimal set. The Gaussian mixture model proves to be a highly useful tool for several tasks, especially concerning data with missing values. In this thesis, it is used for distance estimation, time series modelling, and mutual information estimation for variable selection.
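As a rough illustration of the Delta test described in the abstract, the sketch below computes the commonly used form of the estimator: half the mean squared difference between each output and the output of its nearest neighbour in the input space. The function names `delta_test` and `select_variables`, the exhaustive search over subsets, and the synthetic data are assumptions made for this example and are not taken from the thesis, which may use a different search strategy.

```python
import itertools

import numpy as np


def delta_test(X, y):
    """Noise variance estimate: half the mean squared difference between
    each output and the output of its nearest neighbour in input space."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise squared Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    # Exclude each point from being its own nearest neighbour.
    np.fill_diagonal(d2, np.inf)
    nn = np.argmin(d2, axis=1)
    return 0.5 * np.mean((y - y[nn]) ** 2)


def select_variables(X, y):
    """Hypothetical helper: exhaustive search for the variable subset
    with the smallest Delta test value (feasible only for few variables)."""
    X = np.asarray(X, dtype=float)
    best_subset, best_delta = None, np.inf
    for k in range(1, X.shape[1] + 1):
        for subset in itertools.combinations(range(X.shape[1]), k):
            delta = delta_test(X[:, list(subset)], y)
            if delta < best_delta:
                best_subset, best_delta = subset, delta
    return best_subset, best_delta


if __name__ == "__main__":
    # Synthetic data (an assumption for illustration): only the first two
    # of four variables influence the output, in a non-linear way.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(400, 4))
    y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(400)
    subset, delta = select_variables(X, y)
    print("selected variables:", subset, "estimated noise variance:", delta)
```

Exhaustive search is used here only because it keeps the example short; with more than a handful of variables the number of subsets grows exponentially and a heuristic search would be needed.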
    Original language: English
    Qualification: Doctor's degree
    Awarding Institution
    • Aalto University
    Supervisors/Advisors
    • Karhunen, Juha, Supervisor
    • Lendasse, Amaury, Advisor
    Print ISBNs: 978-952-60-5870-2
    Electronic ISBNs: 978-952-60-5871-9
    Publication status: Published - 2014
    MoE publication type: G5 Doctoral dissertation (article)


    Keywords
    • machine learning
    • missing values
    • variable selection
    • Gaussian mixture model
    • mutual information
    • Delta test
