This thesis studies fault-tolerant co-ordination and synchronisation in distributed systems. Such systems are prevalent in computing today: they range from large-scale communications networks to integrated hardware circuits. In essence, a distributed system consists of independent computational entities called nodes that solve some task together. In such a system, a common notion of time is critical for ensuring correct co-ordination and scheduling of actions.However, distributed systems are prone to scenarios in which the system may experience arbitrary transient failures, for example, due to various external disruptions. During a transient failure, the system ceases to operate correctly for some time which may put the system into an inconsistent state or even permanently damage parts of the system. In particular, the nodes in the system can lose synchrony. This thesis proposes synchronisation and scheduling algorithms that efficiently recover from such failures. The first part of this thesis considers self-stabilising Byzantine fault-tolerant algorithms. This means that we assume a strong adversarial fault model: the system starts from an arbitrary initial state and a large fraction of the nodes may be faulty. The faulty nodes may exhibit arbitrary misbehaviour, and in particular, give inconsistent information to other nodes. In this setting, we study the synchronous counting problem, which is a basic primitive for establishing a so-called digital clock. The task is to guarantee that eventually all correct nodes The main contribution of the first part is an array of techniques for designing fast and communication-efficient counting algorithms. In particular, we provide deterministic algorithms with asymptotically optimal stabilisation time and optimal resilience. That is, the algorithms quickly converge to correct behaviour and tolerate the largest possible number of faulty nodes. Compared to prior work, our solutions give an exponential improvement in the number of bits stored and transmitted by each node without resorting to randomisation or suboptimal resilience. The second part of this thesis investigates computational algorithm design and synthesis techniques. That is, we show how to automatically construct correct algorithms or prove their non-existence using computers. This is particularly attractive in the context of fault-tolerant systems, in which finding correct solutions is often a difficult task. We propose synthesis techniques based on propositional satisfiability solvers that yield novel state-optimal counting algorithms and tight bounds on the exact time complexity of distributed graph colouring.
|Translated title of the contribution||Laskureita, kellotusta ja väritystä: Vikasietoinen hajautettu koordinointi|
|Publication status||Published - 2016|
|MoE publication type||G5 Doctoral dissertation (article)|
- distributed algorithms
- fault tolerance