Abstract
The reliability of the information extracted from large-scale data, as well as the validity of data-driven decisions, depends on the veracity of the data and the data processing methods used. Quantifying the veracity of parameter estimates or data-driven decisions is required in order to choose appropriate estimators and to identify redundant or irrelevant variables in multivariate data settings. Moreover, quantifying the veracity allows efficient use of available resources by processing only as much data as is needed to achieve a desired level of accuracy or confidence. Statistical inference, such as assessing the accuracy of parameter estimates and testing hypotheses on model parameters, can be used to quantify the veracity of large-scale data analytics results. In this thesis, versatile bootstrap procedures are developed for performing statistical inference on large-scale data. First, a computationally efficient and statistically robust bootstrap procedure is proposed, which is scalable to smaller distinct subsets of data. Hence, the proposed method is compatible with distributed storage systems and parallel computing architectures. The statistical convergence and robustness properties of the method are analytically established. Then, two specific low-complexity bootstrap procedures are proposed for performing statistical inference on the mixing coefficients of the Independent Component Analysis (ICA) model. Such statistical inference is required to identify the contribution of a specific source signal-of-interest to the observed mixture variables. This thesis establishes significant analytical results on the structure of the FastICA estimator, which enable the computation of bootstrap replicas in closed form. This not only saves computational resources but also avoids the convergence problems and the permutation and sign ambiguities of the FastICA algorithm.
The developed methods enable statistical inference in a variety of applications in which ICA is commonly applied, e.g., fMRI and EEG signal processing. The thesis also establishes an alternative derivation of the fixed-point FastICA algorithm. The derivation provides a better understanding of how the FastICA algorithm is derived from the exact Newton-Raphson (NR) algorithm. In the original derivation, FastICA was derived as an approximate NR algorithm using unjustified assumptions, which are not required in the alternative derivation presented in this thesis. It is well known that the fixed-point FastICA algorithm has severe convergence problems when the dimensionality of the data and the sample size are of the same order. To mitigate this problem, a power iteration algorithm for FastICA is proposed, which is remarkably more stable than the fixed-point FastICA algorithm. The proposed PowerICA algorithm can be run in parallel on two computing nodes, making it considerably faster to compute.
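The idea of bootstrapping on smaller distinct subsets of the data, mentioned in the abstract, can be illustrated with a minimal sketch. It is a generic subset-based bootstrap for the standard error of a sample mean and is not the fast and robust procedure developed in the thesis. The function name `subset_bootstrap_se` and its parameters are hypothetical, and the choice of the mean as the estimator is an assumption made only to keep the example short.

```python
import numpy as np


def subset_bootstrap_se(data, n_subsets=5, n_boot=100, seed=0):
    """Estimate the standard error of the sample mean from disjoint subsets.

    Hypothetical illustration: each subset is resampled up to the full
    sample size n via multinomial counts, so only the distinct points of
    one subset need to be held in memory at a time.
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    # Randomly partition the data into disjoint subsets.
    subsets = np.array_split(rng.permutation(data), n_subsets)

    se_per_subset = []
    for subset in subsets:
        b = subset.shape[0]
        # Each row holds how often each of the b points appears in a
        # resample of size n; shape is (n_boot, b).
        counts = rng.multinomial(n, np.full(b, 1.0 / b), size=n_boot)
        # Weighted mean of every bootstrap resample.
        replicas = counts @ subset / n
        se_per_subset.append(replicas.std(ddof=1))

    # Average the per-subset standard error estimates.
    return float(np.mean(se_per_subset))
```

Because each subset is processed independently, the loop body could in principle be dispatched to separate workers, which is the kind of compatibility with distributed storage and parallel computing that the abstract describes.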
Original language  English 

Qualification  Doctor's degree 
Awarding Institution 

Supervisors/Advisors 

Publisher  
Print ISBNs  9789526080239 
Publication status  Published  2018 
MoE publication type  G5 Doctoral dissertation (article) 
Keywords
 Big data analytics
 Bootstrap
 Fast and Robust Bootstrap
 Distributed and parallel computation
 Robust estimation
 Independent Component Analysis
 FastICA
Cite this
Basiri, S. (2018). Robust large-scale statistical inference and ICA using bootstrapping. Aalto University.