This thesis investigates knowledge inference from measurements of multiple data sources, motivated by technologies in a wide range of domains allowing effective measurement of several related, but heterogeneous data sources. In life sciences, examples of this kind of "multi-view" data are brain imaging data of multiple subjects along with description of the experimental stimuli, as well as drug response studies including measurements regarding the expression level, copy number variation and mutation of genes in cell lines. Data analyses have been typically related to analyzing the structure of a single data source, or the effect of one data source to another. The multi-view data inspected in this thesis results in a more complex problem: besides the structure of each of the data sources, the relations between the data sources are of high interest as well.
This thesis addresses modern multi-view data analysis problems using Bayesian latent variable models. They are a natural choice for developing models in order to gain knowledge about multiple data sources and their relations; they allow for missing values in the data, incorporating prior information to the modelling problem and estimating the uncertainty present in the inference. The key contributions of this thesis include formulating a low-rank data source relation model and presenting biclustering using sparse priors, as well as a relaxed formulation of tensor factorization. All the developed models have been published as open-source software, enabling wide-spread use and further development.
The presented machine learning tools are demonstrated using drug response and brain imaging studies, for both of which predictive performance above state-of-the-art level is achieved. In the drug response studies, the models were able to accurately relate similar drugs, as well as detect known cancer genes affecting the responsiveness of cells to certain drugs. In the brain response studies the benefits of the presented methods were shown via increased accuracy in predicting brain responses, whereas the relaxed tensor decomposition allowed for a novel way of utilizing measurements for multiple subjects. Finally, the advantage of using a low-dimensional latent space is illustrated in a genome-wide association study in an especially challenging domain: when there exist measurements for only two hundred subjects, yet there exist some thousands of features regarding the subjects, with the study discovering a relevant gene associated with components of brain activity.
|Publication status||Published - 2018|
|MoE publication type||G5 Doctoral dissertation (article)|
- bayesian modelling, bioinformatics, brain imaging, factor analysis, multi-view modelling, tensor factorization