Modelling-based experiment retrieval: A case study with gene expression clustering

Research output: Contribution to journal; Article; Scientific; peer-review

Standard

Modelling-based experiment retrieval: A case study with gene expression clustering. / Blomstedt, Paul; Dutta, Ritabrata; Seth, Sohan; Brazma, Alvis; Kaski, Samuel.

In: Bioinformatics, Vol. 32, No. 9, 01.05.2016, p. 1388-1394.

BibTeX

@article{62299bf16cd8428b97c2343643ceebd8,
title = "Modelling-based experiment retrieval: A case study with gene expression clustering",
abstract = "Motivation: Public and private repositories of experimental data are growing to sizes that require dedicated methods for finding relevant data. To improve on the state of the art of keyword searches from annotations, methods for content-based retrieval have been proposed. In the context of gene expression experiments, most methods retrieve gene expression profiles, requiring each experiment to be expressed as a single profile, typically of case versus control. A more general, recently suggested alternative is to retrieve experiments whose models are good for modelling the query dataset. However, for very noisy and high-dimensional query data, this retrieval criterion turns out to be very noisy as well. Results: We propose doing retrieval using a denoised model of the query dataset, instead of the original noisy dataset itself. To this end, we introduce a general probabilistic framework, where each experiment is modelled separately and the retrieval is done by finding related models. For retrieval of gene expression experiments, we use a probabilistic model called product partition model, which induces a clustering of genes that show similar expression patterns across a number of samples. The suggested metric for retrieval using clusterings is the normalized information distance. Empirical results finally suggest that inference for the full probabilistic model can be approximated with good performance using computationally faster heuristic clustering approaches (e.g. k-means). The method is highly scalable and straightforward to apply to construct a general-purpose gene expression experiment retrieval method.",
author = "Paul Blomstedt and Ritabrata Dutta and Sohan Seth and Alvis Brazma and Samuel Kaski",
year = "2016",
month = "5",
day = "1",
doi = "10.1093/bioinformatics/btv762",
language = "English",
volume = "32",
pages = "1388--1394",
journal = "Bioinformatics",
issn = "1367-4803",
number = "9",

}
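The abstract describes retrieval by comparing gene clusterings with the normalized information distance. The following is a minimal Python sketch of that idea, not the authors' implementation. It assumes a common definition of the distance for clusterings, NID(U, V) = 1 - I(U; V) / max(H(U), H(V)); the paper's exact formulation may differ. The function names, the `repository` dictionary and the experiment names are hypothetical. Each clustering is taken to be a list of cluster labels over the same ordered set of genes, obtained for example from k-means.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a clustering given as a list of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two clusterings of the same genes."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (n_ij / n) * log(n_ij * n / (count_a[i] * count_b[j]))
        for (i, j), n_ij in joint.items()
    )


def nid(a, b):
    """Assumed definition: 1 - I(a; b) / max(H(a), H(b)); 0 means identical partitions."""
    if len(a) != len(b):
        raise ValueError("clusterings must label the same genes")
    h_max = max(entropy(a), entropy(b))
    if h_max == 0.0:
        # Both clusterings put every gene in a single cluster.
        return 0.0
    return 1.0 - mutual_information(a, b) / h_max


def retrieve(query, repository):
    """Rank experiment names by increasing NID to the query clustering."""
    return sorted(repository, key=lambda name: nid(query, repository[name]))


# Hypothetical toy data: four genes, three stored experiments.
query = [0, 0, 1, 1]
repository = {
    "exp_a": [1, 1, 0, 0],  # same partition as the query, labels swapped
    "exp_b": [0, 1, 0, 1],  # partition independent of the query
    "exp_c": [0, 0, 1, 2],  # refinement of the query partition
}
ranking = retrieve(query, repository)
```

Because the distance depends only on the partitions and not on the label values, an experiment whose clustering matches the query up to relabelling is ranked first.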

RIS

TY - JOUR

T1 - Modelling-based experiment retrieval

T2 - A case study with gene expression clustering

AU - Blomstedt, Paul

AU - Dutta, Ritabrata

AU - Seth, Sohan

AU - Brazma, Alvis

AU - Kaski, Samuel

PY - 2016/5/1

Y1 - 2016/5/1

AB - Motivation: Public and private repositories of experimental data are growing to sizes that require dedicated methods for finding relevant data. To improve on the state of the art of keyword searches from annotations, methods for content-based retrieval have been proposed. In the context of gene expression experiments, most methods retrieve gene expression profiles, requiring each experiment to be expressed as a single profile, typically of case versus control. A more general, recently suggested alternative is to retrieve experiments whose models are good for modelling the query dataset. However, for very noisy and high-dimensional query data, this retrieval criterion turns out to be very noisy as well. Results: We propose doing retrieval using a denoised model of the query dataset, instead of the original noisy dataset itself. To this end, we introduce a general probabilistic framework, where each experiment is modelled separately and the retrieval is done by finding related models. For retrieval of gene expression experiments, we use a probabilistic model called product partition model, which induces a clustering of genes that show similar expression patterns across a number of samples. The suggested metric for retrieval using clusterings is the normalized information distance. Empirical results finally suggest that inference for the full probabilistic model can be approximated with good performance using computationally faster heuristic clustering approaches (e.g. k-means). The method is highly scalable and straightforward to apply to construct a general-purpose gene expression experiment retrieval method.

UR - http://www.scopus.com/inward/record.url?scp=84966339575&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btv762

DO - 10.1093/bioinformatics/btv762

M3 - Article

VL - 32

SP - 1388

EP - 1394

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 9

ER -

ID: 4300851