The reliability of the information extracted from large-scale data, as well as the validity of data-driven decisions, depends on the veracity of the data and on the data processing methods used. Quantifying the veracity of parameter estimates or data-driven decisions is required in order to choose appropriate estimators and to identify redundant or irrelevant variables in multivariate data settings. Moreover, quantifying the veracity allows efficient use of available resources, because only as much data needs to be processed as is required to achieve a desired level of accuracy or confidence. Statistical inference, such as finding the accuracy of certain parameter estimates and testing hypotheses on model parameters, can be used to quantify the veracity of large-scale data analytics results. In this thesis, versatile bootstrap procedures are developed for performing statistical inference on large-scale data. First, a computationally efficient and statistically robust bootstrap procedure is proposed, which scales by operating on smaller, distinct subsets of the data. Hence, the proposed method is compatible with distributed storage systems and parallel computing architectures. The statistical convergence and robustness properties of the method are analytically established. Then, two specific low-complexity bootstrap procedures are proposed for performing statistical inference on the mixing coefficients of the Independent Component Analysis (ICA) model. Such statistical inference is required to identify the contribution of a specific source signal of interest to the observed mixture variables. This thesis establishes significant analytical results on the structure of the FastICA estimator, which enable the computation of bootstrap replicas in closed form. This not only saves computational resources, but also avoids the convergence problems and the permutation and sign ambiguities of the FastICA algorithm.
The developed methods enable statistical inference in a variety of applications in which ICA is commonly applied, e.g., fMRI and EEG signal processing. The thesis also establishes an alternative derivation of the fixed-point FastICA algorithm. The derivation provides a better understanding of how the FastICA algorithm follows from the exact Newton-Raphson (NR) algorithm. In the original derivation, FastICA was obtained as an approximate NR algorithm using unjustified assumptions, which are not required in the alternative derivation presented in this thesis. It is well known that the fixed-point FastICA algorithm has severe convergence problems when the dimensionality of the data and the sample size are of the same order. To mitigate this problem, a power iteration algorithm for FastICA is proposed, which is remarkably more stable than the fixed-point FastICA algorithm. The proposed PowerICA algorithm can be run in parallel on two computing nodes, which makes it considerably faster to compute.
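For readers unfamiliar with bootstrap-based inference, the following is a minimal sketch of a plain percentile bootstrap confidence interval for a sample mean. It is not the fast and robust procedure or the closed-form ICA bootstrap developed in the thesis; it only illustrates the general idea of quantifying the accuracy of an estimate by re-estimating on resampled data. The function name `bootstrap_ci`, its parameters (`n_boot`, `alpha`, `seed`), and the synthetic Gaussian sample are assumptions made for illustration.

```python
import random
import statistics


def bootstrap_ci(data, estimator=statistics.fmean, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for estimator(data).

    Illustrative only: a standard nonparametric bootstrap, not the
    thesis's fast and robust or closed-form procedures.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Re-estimate on n_boot resamples drawn with replacement from the data.
    replicas = sorted(estimator(rng.choices(data, k=n)) for _ in range(n_boot))
    # Use the empirical alpha/2 and 1 - alpha/2 quantiles of the replicas.
    lower = replicas[int((alpha / 2) * n_boot)]
    upper = replicas[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Hypothetical data: 200 draws from a Gaussian with mean 5 and std 2.
data_rng = random.Random(42)
sample = [data_rng.gauss(5.0, 2.0) for _ in range(200)]

lo, hi = bootstrap_ci(sample)
print(f"95% bootstrap CI for the mean: [{lo:.3f}, {hi:.3f}]")
```

Each replica here requires re-running the estimator on a full-size resample, which is the cost that becomes prohibitive for large data sets or for iterative estimators such as FastICA, and which motivates the lower-complexity procedures summarized in the abstract.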
|Publication status|Published - 2018|
|MoE publication type|G5 Doctoral dissertation (article)|
- Big data analytics, Bootstrap, Fast and Robust Bootstrap, Distributed and parallel computation, Robust estimation, Independent Component Analysis, FastICA