Evaluation of BIC and Cross Validation for model selection on sequence segmentations

Niina Haiminen*, Heikki Mannila

*Corresponding author for this work

    Research output: Contribution to journal › Article › Scientific › peer-review

    Abstract

    Segmentation is a general data mining technique for summarising and analysing sequential data. Segmentation can be applied, e.g., when studying large-scale genomic structures such as isochores. Choosing the number of segments remains a challenging question. We present extensive experimental studies on two model selection techniques, the Bayesian Information Criterion (BIC) and Cross Validation (CV). We successfully identify segments with different means or variances, and demonstrate the effect of linear trends and outliers, which occur frequently in real data. Results are given for real DNA sequences with respect to changes in their codon, G + C, and bigram frequencies, and for copy-number variation from CGH data.
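
    As a rough illustration of the question the abstract raises, choosing the number of segments, the sketch below fits piecewise-constant segmentations by dynamic programming and scores each segment count with a BIC-style criterion. It is not the authors' implementation: the Gaussian noise model with a single shared variance, the parameter count of 2k for k segments, and the names `segment_costs`, `bic`, and `choose_k` are all assumptions made for this example.

```python
import math


def segment_costs(x, max_k):
    """Return {k: (sse, boundaries)} for k = 1..max_k segments.

    Each segment is summarised by its mean; the cost of a segment is the
    sum of squared deviations from that mean. Dynamic programming finds
    the segmentation with minimal total cost for every k.
    """
    n = len(x)
    s1 = [0.0] * (n + 1)  # prefix sums of x
    s2 = [0.0] * (n + 1)  # prefix sums of x squared
    for i, v in enumerate(x):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def sse(i, j):
        # Squared error of x[i:j] around its own mean.
        total = s1[j] - s1[i]
        return max(0.0, (s2[j] - s2[i]) - total * total / (j - i))

    inf = float("inf")
    # cost[k][j]: minimal squared error for splitting x[:j] into k segments.
    cost = [[inf] * (n + 1) for _ in range(max_k + 1)]
    back = [[0] * (n + 1) for _ in range(max_k + 1)]
    cost[0][0] = 0.0
    for k in range(1, max_k + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[k - 1][i] + sse(i, j)
                if c < cost[k][j]:
                    cost[k][j] = c
                    back[k][j] = i

    result = {}
    for k in range(1, max_k + 1):
        starts, j = [], n
        for kk in range(k, 0, -1):
            j = back[kk][j]  # start index of the kk-th segment
            starts.append(j)
        result[k] = (cost[k][n], sorted(starts)[1:])  # drop the leading 0
    return result


def bic(sse_value, n, k):
    # Assumption: Gaussian noise with one shared variance; parameters are
    # counted as k means + (k - 1) boundaries + 1 variance = 2k.
    sigma2 = max(sse_value / n, 1e-12)
    return n * math.log(sigma2) + 2 * k * math.log(n)


def choose_k(x, max_k):
    """Pick the segment count with the lowest BIC-style score."""
    n = len(x)
    fits = segment_costs(x, max_k)
    scores = {k: bic(fits[k][0], n, k) for k in fits}
    best_k = min(scores, key=scores.get)
    return best_k, fits[best_k][1], scores


# Toy sequence: three segments with means 0, 5, 0 and a small deterministic wiggle.
toy = [(5.0 if 50 <= i < 100 else 0.0) + (0.5 if i % 2 == 0 else -0.5)
       for i in range(150)]
best_k, boundaries, scores = choose_k(toy, max_k=6)
print(best_k, boundaries)
```

    The penalty term grows with the number of segments, so an extra segment is accepted only when it reduces the residual error enough to pay for its added parameters. Cross validation would instead score each segment count on held-out data; that variant is not shown here.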

    Original language: English
    Pages (from-to): 675-700
    Number of pages: 26
    Journal: International Journal of Data Mining and Bioinformatics
    Volume: 4
    Issue number: 6
    Publication status: Published - 2010
    MoE publication type: A1 Journal article-refereed

    Keywords

    • segmentation
    • model selection
    • CV
    • cross validation
    • BIC
    • Bayesian information criterion
    • sequence
    • binary
    • categorical
    • genome
    • likelihood
    • array CGH data
    • DNA sequence
    • organization
    • isochores
    • criterion
    • dimension
