During the last decade, high-throughput sequencing (HTS) has become the mainstream technique for simultaneously studying an enormous number of genetic features present in the genome, transcriptome, or epigenome of an organism. Besides static experiments, which compare genetic features between two or more distinct biological conditions, time series experiments, which monitor genetic features over time, provide valuable information about the dynamics of complex mechanisms in various biological processes. However, analysing the currently available HTS time series data sets is challenging because these data sets often consist of short and irregularly sampled time series that lack sufficient biological replication. In addition, quantification of the genetic features from HTS data is inherently subject to uncertainty due to the limitations of HTS platforms, such as short read lengths and varying sequencing depths.

This thesis presents a Gaussian process (GP)-based approach for modelling and ranking HTS time series that takes the characteristics of the data sets into account. GPs are among the most suitable tools for modelling sparse and irregularly sampled time series, and they can capture the temporal correlations between observations at different time points via suitable covariance functions. However, a naive application of GP modelling may suffer from over-fitting, leading to an increased number of false positives if the characteristics of the data are not taken into account. In this thesis, this problem has been mitigated by regularizing the models with bounds on the hyperparameter values of the GP prior. Firstly, the range of the length-scale parameters has been restricted to values compatible with the spacing of the sampled time points. Secondly, application-dependent variance models have been developed to infer the uncertainty levels of the observations, which have then been incorporated into the GP models as lower bounds for the noise variance. Regularizing the GP models by setting realistic bounds on their hyperparameters makes them more robust against the uncertainty in the data without increasing the complexity of the models, and thus makes the method applicable to large genome-wide studies.

The publications included in this thesis suggest a number of techniques for modelling the variance in RNA-seq and Pool-seq applications, which are the HTS techniques specifically designed to sequence RNA transcripts and pooled DNA sequences, respectively. The variance models utilize information obtained through the pre-processing stages of the data, depending on, for example, the number of replicates or varying sequencing depth levels. Performance evaluation of the GP models under different experiment settings indicates that incorporating the variance information into the GP models can yield a higher average precision than the naive application of GP modelling. Motivated by these results, an open-source software package, GPrank, has been implemented in R to enable researchers to easily apply the proposed GP-based method to their own HTS time series data sets for detecting the most temporally active genetic features.
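The two regularization steps described in the abstract can be illustrated with a minimal sketch. It is written in plain Python and is not the GPrank API. Several choices here are assumptions made only for illustration: the helper names `length_scale_bounds`, `log_marginal`, and `rank_score` are hypothetical, the length-scale range is taken as the smallest gap between time points up to the full time span, the pre-computed variances are added to the noise diagonal so that they act as a lower bound on the noise variance, a coarse grid search stands in for proper hyperparameter optimisation, and a series is scored by the difference between the best log marginal likelihoods of a time-dependent and a time-independent model.

```python
import math

def length_scale_bounds(t):
    """Assumed rule: smallest gap between time points up to the full span."""
    ts = sorted(t)
    gaps = [b - a for a, b in zip(ts, ts[1:]) if b > a]
    return min(gaps), ts[-1] - ts[0]

def log_marginal(t, y, signal_var, length_scale, noise):
    """Log density of y under N(0, K_rbf + diag(noise)), via Cholesky."""
    n = len(t)
    K = [[signal_var * math.exp(-0.5 * ((a - b) / length_scale) ** 2)
          for b in t] for a in t]
    for i in range(n):
        K[i][i] += noise[i] + 1e-8  # small jitter for numerical stability
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / L[j][j]
    z = []  # forward substitution: solve L z = y, so y^T K^-1 y = z^T z
    for i in range(n):
        z.append((y[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (sum(v * v for v in z) + log_det + n * math.log(2.0 * math.pi))

def rank_score(t, y, fixed_var, n_grid=8):
    """Best time-dependent fit minus best time-independent fit (grid search)."""
    mean = sum(y) / len(y)
    yc = [v - mean for v in y]  # zero-mean GP, so centre the series
    var = max(sum(v * v for v in yc) / len(y), 1e-6)
    lo, hi = length_scale_bounds(t)
    # Length-scales are only tried inside the bounds implied by the sampling.
    length_scales = [lo * (hi / lo) ** (k / (n_grid - 1)) for k in range(n_grid)]
    extras = [f * var for f in (0.0, 0.05, 0.1, 0.25, 0.5, 1.0)]
    signals = [f * var for f in (0.25, 0.5, 1.0, 2.0, 4.0)]

    def noise(extra):
        # Extra noise is non-negative, so fixed_var is a lower bound.
        return [v + extra for v in fixed_var]

    null = max(log_marginal(t, yc, 0.0, 1.0, noise(e)) for e in extras)
    full = max(log_marginal(t, yc, s, l, noise(e))
               for s in signals for l in length_scales for e in extras)
    return full - null

# Toy data (made up for illustration): irregular sampling, known variances.
t = [0, 1, 2, 4, 8, 12, 24]
fixed_var = [0.05] * len(t)
trend = [0.0, 0.5, 1.0, 1.8, 2.5, 2.8, 3.0]
flat = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.0]
print(rank_score(t, trend, fixed_var), rank_score(t, flat, fixed_var))
```

In this sketch, a series with a smooth temporal trend receives a higher score than one whose fluctuations stay within the supplied variances, which is the behaviour the bounded model is intended to produce. A real analysis would rely on the variance models and optimisation procedure from the publications rather than the toy grid used here.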
|Translated title of the contribution|Gaussian Process Modelling of Genome-wide High-throughput Sequencing Time Series|
|Publication status|Published - 2018|
|MoE publication type|G5 Doctoral dissertation (article)|
- Gaussian process
- high-throughput sequencing
- time series
- probabilistic modelling