Abstract
DNA methylation is an epigenetic modification in which methyl groups attach to the DNA molecule. It regulates gene expression and enables the normal function of cells. Conversely, aberrant DNA methylation patterns have been associated with diseases such as cancer. Uncovering the mechanisms of gene regulation and using DNA methylation biomarkers in applications such as cancer screening require advanced analysis methods for high-throughput sequencing data.
The aim of this thesis is to improve the analysis of DNA methylation data with a probabilistic modeling approach. First, two methods for differential DNA methylation analysis designed for bisulfite sequencing data are proposed. In both methods, the spatial correlation of the methylation states is utilized in a binomial generalized linear mixed model to improve the accuracy of detecting differential methylation. The first method assumes that the methylation states of all cytosines in a genomic window have the same correlation characteristics, and it tests for differential methylation by computing one Bayes factor for each genomic window. In the second method, a sparsifying prior is used in the correlation structure to allow individual cytosines to deviate from the general correlation pattern. In the third publication, an analysis workflow for reduced representation bisulfite sequencing data is proposed. The workflow was applied to a cord blood data set, and differential DNA methylation analysis was performed to detect possible pregnancy- or delivery-related changes in cord blood DNA methylation. In the fourth publication, methods for cell-free DNA-based cancer classification were developed and compared. To demonstrate the feasibility of liquid biopsies in clinical use, lower sequencing depth was simulated by subsampling a cell-free methylated DNA immunoprecipitation sequencing data set. Different generalized linear model classifiers and feature extraction and selection methods were then applied, and the resulting classification performance was evaluated.
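As an illustration of the subsampling idea mentioned above, the following is a minimal sketch of how lower sequencing depth could be simulated from per-window read counts. It is not the procedure used in the thesis: the function name `subsample_counts`, the window labels, and the counts are all hypothetical, and reads are simply drawn uniformly at random without replacement.

```python
import random
from collections import Counter


def subsample_counts(counts, fraction, seed=0):
    """Downsample per-window read counts by drawing reads without replacement.

    Hypothetical helper for illustration only; `counts` maps a genomic
    window label to the number of reads observed in that window.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    # One list entry per read, labelled by the window it maps to.
    reads = [window for window, n in counts.items() for _ in range(n)]
    n_keep = round(len(reads) * fraction)
    kept = random.Random(seed).sample(reads, n_keep)
    tally = Counter(kept)
    # Keep every window in the output, including those that lost all reads.
    return {window: tally.get(window, 0) for window in counts}


# Made-up counts for three windows (100 reads in total).
full_depth = {"chr1:0-300": 40, "chr1:300-600": 10, "chr2:0-300": 50}
low_depth = subsample_counts(full_depth, fraction=0.1)
```

Drawing without replacement guarantees that no window ends up with more reads than it had at full depth, which mimics sequencing the same library less deeply.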
The results presented in this thesis show that probabilistic modeling and Bayesian methods perform well and can improve the accuracy of DNA methylation sequencing data analysis. Taking spatial correlation into account increased the accuracy of differential DNA methylation analysis. Allowing deviations from the correlation pattern made the analysis more flexible. Most of the differentially methylated cytosines and regions found in the cord blood data set were sex-associated, and only a few were associated with the other clinical covariates. Additionally, the cord blood data analysis revealed the problem of inflated p-values, and a permutation-based method for solving the issue was proposed. Finally, the methods that improved cell-free DNA methylation-based cancer classification included a logistic regression classifier, iterative supervised principal component analysis, and Fisher's exact test for feature selection.
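The general idea behind permutation-based p-values can be sketched as follows. This is a generic label-permutation test on a single site, not the specific method proposed in the thesis; the function names, the test statistic (absolute difference of group means), and the methylation values are assumptions made for illustration.

```python
import random


def mean_difference(values, labels):
    """Absolute difference between the means of group 1 and group 0."""
    a = [v for v, g in zip(values, labels) if g == 1]
    b = [v for v, g in zip(values, labels) if g == 0]
    return abs(sum(a) / len(a) - sum(b) / len(b))


def permutation_p_value(values, labels, n_perm=1000, seed=0):
    """Empirical p-value from shuffling group labels (illustrative sketch)."""
    rng = random.Random(seed)
    observed = mean_difference(values, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Count permutations at least as extreme as the observed statistic.
        if mean_difference(values, shuffled) >= observed:
            exceed += 1
    # The +1 terms count the observed labelling itself, so p is never zero.
    return (1 + exceed) / (1 + n_perm)


# Made-up methylation proportions for one cytosine in two groups of four.
methylation = [0.82, 0.79, 0.85, 0.80, 0.41, 0.45, 0.38, 0.44]
group = [1, 1, 1, 1, 0, 0, 0, 0]
p = permutation_p_value(methylation, group)
```

Because the null distribution is built from the data themselves, the resulting p-value does not rely on the distributional assumptions of the parametric test.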
Translated title of the contribution | DNA-metylaatiosekvensointidatan probabilistinen mallintaminen |
---|---|
Original language | English |
Qualification | Doctor's degree |
Awarding Institution | |
Supervisors/Advisors | |
Publisher | |
Print ISBNs | 978-952-64-0927-6 |
Electronic ISBNs | 978-952-64-0928-3 |
Publication status | Published - 2022 |
MoE publication type | G5 Doctoral dissertation (article) |
Keywords
- DNA methylation
- probabilistic modeling
- generalized linear models
- bisulfite sequencing
- cfMeDIP-seq