Large-Scale Byzantine Fault Tolerance: Safe but Not Always Live

Petr Kouznetsov
Max Planck Institute for Software Systems (Germany)

Abstract:

The overall correctness of large-scale systems composed of many groups of replicas executing BFT protocols scales poorly with the number of groups. This is because the probability of at least one group being compromised (more than 1/3 faulty replicas) increases rapidly as the number of groups increases. We address this problem with a simple modification to Castro and Liskovs BFT replication that allows for arbitrary choice of n (number of replicas) and f (failure threshold). The modified BFT protocol does not lose safety even when failures are many, performs very well when failures are few, but may lose liveness in the intermediate cases. This features seem to address realistic failure models which account for software errors or security attacks. The price to pay is a more restrictive liveness requirement, and we present the design of a large-scale BFT replicated system that obviates this problem.