As datasets grow, processing them centrally on a single computer has become prohibitively inefficient or even impossible. Hence, there is an urgent demand for distributed and parallel approaches to data processing. To support the design of algorithms for processing big data, many large-scale computing frameworks, such as MapReduce, Hadoop, Dryad, and Spark, have emerged. In this project, we study the theoretical foundations of these frameworks in order to design more efficient algorithms and to understand their limitations.
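To make the computation pattern behind these frameworks concrete, the following is a minimal single-machine sketch of the map/shuffle/reduce paradigm on a word-count task; the function names and the sequential execution are illustrative only, since a real framework would run the map and reduce phases in parallel across many machines.

```python
from collections import defaultdict

def map_phase(docs):
    # Map: emit (word, 1) pairs from each document independently;
    # in a real framework each document could go to a separate worker.
    pairs = []
    for doc in docs:
        for word in doc.split():
            pairs.append((word, 1))
    return pairs

def shuffle_phase(pairs):
    # Shuffle: group values by key so each key's values land on one reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values (here, sum the counts).
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big", "data processing"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# counts == {"big": 2, "data": 2, "processing": 1}
```

Theoretical models of these frameworks abstract exactly this structure: local computation interleaved with global communication rounds, with the number of rounds and the per-machine memory as the key resources.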
The goal of the project is to understand what can be computed in these frameworks. Through a mathematical model abstracting them, we, on the one hand, show how to solve fundamental problems efficiently and, on the other hand, identify problems that are provably hard. Ideally, this understanding reveals how the practical frameworks need to be adjusted to support more efficient data processing.