TY - JOUR
T1 - A Statistical Framework for Hypothesis Testing in Real Data Comparison Studies
AU - Boulesteix, Anne-Laure
AU - Hable, Robert
AU - Lauer, Sabine
AU - Eugster, Manuel J. A.
PY - 2015/7/3
Y1 - 2015/7/3
AB - In computational sciences, including computational statistics, machine learning, and bioinformatics, it is often claimed in articles presenting new supervised learning methods that the new method performs better than existing methods on real data, for instance in terms of error rate. However, these claims are often not based on proper statistical tests and, even if such tests are performed, the tested hypothesis is not clearly defined and poor attention is devoted to the Type I and Type II errors. In the present article, we aim to fill this gap by providing a proper statistical framework for hypothesis tests that compare the performances of supervised learning methods based on several real datasets with unknown underlying distributions. After giving a statistical interpretation of ad hoc tests commonly performed by computational researchers, we devote special attention to power issues and outline a simple method of determining the number of datasets to be included in a comparison study to reach an adequate power. These methods are illustrated through three comparison studies from the literature and an exemplary benchmarking study using gene expression microarray data. All our results can be reproduced using R codes and datasets available from the companion website http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/compstud2013.
KW - Benchmarking
KW - Comparison
KW - Performance
KW - Supervised learning
KW - Testing
UR - http://www.scopus.com/inward/record.url?scp=84940653205&partnerID=8YFLogxK
DO - 10.1080/00031305.2015.1005128
M3 - Article
AN - SCOPUS:84940653205
SN - 0003-1305
VL - 69
SP - 201
EP - 212
JO - The American Statistician
JF - The American Statistician
IS - 3
ER -