[top]

Annotated Bibliography

More recent papers first

[Toggle All Abstracts] [Toggle All Notes]

Jump to


Complete info on individual entries


[Ezhilchelvan02]

P. Ezhilchelvan. A Middleware Architecture for Intrusion Tolerant Service Replication, Proceedings of DSN, June 2002. pages C-6.


[Stoica02]

I. Stoica, D. Adkins, S. Ratnasamy, S. Shenker, S. Surana, S. Zhuang. Internet Indirection Infrastructure, First International Workshop on Peer-to-Peer Systems, March 2002. pages .

online: citeseer home pdf

notes:

tuple-space with best-effort semantics where the set of IP addresses is a subset of the set of IDs...


[Rushby94]

J. Rushby. A Formally Verified Algorithm for Clock Synchronization Under a Hybrid Fault Model, 13th ACM Symposium on Principles of Distributed Computing (PODC 94), August 1994. pages 304-313.

online: citeseer ps.gz

abstract:

A small modification to the interactive convergence clock synchronization algorithm allows it to tolerate a larger number of simple faults than the standard algorithm, without reducing its ability to tolerate arbitrary or "Byzantine" faults. Because the extended caseanalysis required by the new fault model complicates the already intricate argument for correctness of the algorithm, it has been subjected to mechanically-checked formal verification.

The fault model examined is similar to the "hybrid" one previously used for the problem of distributed consensus: in addition to arbitrary faults, we also admit symmetric (i.e., consistent) and manifest (i.e., detectable) faults. With n processors, the modified algorithm can withstand a arbitrary, s symmetric, and m manifest faults simultaneously, provided n ? 3a + 2s + m. A further extension to the fault model includes link faults with bound n ? 3a + 2s + m + l where l is the maximum, over all pairs of processors, of the number of proces...

notes:

Improves on LAmport85


[Lincoln93]

P. Lincoln and J. Rushby. A Formally Verified Algorithm for Interactive Consistency Under a Hybrid Fault Model, Preprint of a paper for FTCS 23, June 1993. pages .

online: citeseer

abstract:

Thambidurai and Park [13] have proposed an algorithm for Interactive Consistency that retains resilience to the arbitrary (or Byzantine) fault mode, while tolerating more faults of simpler kinds than standard Byzantine-resilent algorithms. Unfortunately, and despite a published proof of correctness, their algorithm is flawed. We detected this while undertaking a formal verification of the algorithm.

We present a corrected algorithm that has been subjected to mechanically-checked formal verification. Because informal proofs seem unreliable in this domain, and the consequences of failure could be catastrophic, we believe formal verification should become standard for algorithms intended for safety-critical applications.

notes:

Presents a fixed version of an algorithm by Thambidurai and Park for Byzantine fault-tolerant "Interactive Consistency", and a mechanical verification of the new protocol. (Add more text here).


[Powell88]

D. Powell and G. Bonn and D. Seaton and P. Verissimo and F. Waeselynck. The Delta-4 approach to dependability in open distributed computing systems, FTCS-18, June 1988. pages 246--251.

online: home pdf (IEEE) citeseer (not)

abstract:

As part of the European Strategic Programme for Research in Information Technology (ESPRIT), the Delta-4 project is seeking to define an open, fault-tolerant, distributed computing architecture. The authors present the overall Delta-4 framework for open, fault-tolerant, distributed computing systems and sketch the current implementation, which is based on a local area network with specific atomic multicasting and error-processing protocols for communicating between replicated software components. The system is used to demonstrate the various fault-tolerance techniques by a replicated database application.


[Thambidurai88]

P. Thambidurai and YK. Park and K. S. Trivedi. interactive consistency with multiple failure modes, 9th International Conference on Distributed Computing Systems, June 1988. pages .

page: 136-142
ISBN: 0-8186-0875-7
online: IEEE pdf

abstract:

The authors address the problem of reaching Byzantine agreement in a distributed system in the presence of different types of faults and show that significant improvements in reliability and performance are possible if faults can be partitioned into disjoint classes. They show that, in a distributed system, to guarantee Byzantine agreement requires N<2a+2s+b+r where N is the total number of processors, a is the number of malicious asymmetric faults (a>or=r), s the number of malicious symmetric faults, b the number of nonmalicious or intercepted faults, and r an algorithm-dependent term. The practical value of this unified model in designing ultrareliable systems is demonstrated by examples.


[Lamport85]

L. Lamport and P.M. Melliar-Smith. Synchronizing clocks in the presence of faults , JACM, January 1985. pages .

online: acm pdf

abstract:

Algorithms are described for maintaining clock synchrony in a distributed multiprocess system where each process has its own clock. These algorithms work in the presence of arbitrary clock or process failures, including ?two-faced clocks? that present different values to different processes. Two of the algorithms require that fewer than one-third of the processes be faulty. A third algorithm works if fewer than half the processes are faulty, but requires digital signatures.

notes:

(These are Lamport's notes)

Practical implementation of Byzantine agreement requires synchronized clocks. For an implementation to tolerate Byzantine faults, it needs a clock synchronization algorithm that can tolerate those faults. When I arrived at SRI, there was a general feeling that we could synchronize clocks by just having each process use a Byzantine agreement protocol to broadcast its clock value. I was never convinced by that hand waving. So, at some point I tried to write down precise clock-synchronization algorithms and prove their correctness. The two basic Byzantine agreement algorithms from [40] did generalize to clock-synchronization algorithms. In addition, Melliar-Smith had devised the interactive convergence algorithm, which is also included in the paper. (As I recall, that algorithm was his major contribution to the paper, and I wrote all the proofs.)

Writing the proofs turned out to be much more difficult than I had expected (see [23]). I worked very hard to make them as short and easy to understand as I could. So, I was rather annoyed when a referee said that the proofs seemed to have been written quickly and could be simplified with a little effort. In my replies to the reviews, I referred to that referee as a "supercilious bastard". Some time later, Nancy Lynch confessed to being that referee. She had by then written her own proofs of clock synchronization and realized how hard they were.

Years later, John Rushby and his colleagues at SRI wrote mechanically verified versions of my proofs. They found only a couple of minor errors. I'm rather proud that, even before I knew how to write reliable, structured proofs (see [94]), I was sufficiently careful and disciplined to have gotten those proofs essentially correct.


[Pease80]

M. Pease and R. Shostak and L. Lamport. Reaching agreement in the presence of faults, Journal of the ACM, 27(2):228-234, April 1980. pages .

online: home pdf ACM pdf

notes:

(These are Dr Lamport's notes, from his web page)

Before this paper, it was generally assumed that a three-processor system could tolerate one faulty processor. This paper shows that "Byzantine" faults, in which a faulty processor sends inconsistent information to the other processors, can defeat any traditional three-processor algorithm. (The term Byzantine didn't appear until [40].) In general, 3n+1 processors are needed to tolerate n faults. However, if digital signatures are used, 2n+1 processors are enough. This paper introduced the problem of handling Byzantine faults. I think it also contains the first precise statement of the consensus problem.

I am often unfairly credited with inventing the Byzantine agreement problem. The problem was discovered in the work on SIFT (see [26]) before I arrived at SRI. The 4-processor solution and the general impossibility result were obtained by Shostak; Pease invented the 3n+1-processor solution. My contribution to the work in this paper was the solution using digital signatures. It was my work on digital signatures (see [30]) that led me to think in that direction. However, digital signatures, as used here, are a metaphor. Since the signatures need be secure only against random failure, not against an intelligent adversary, they are much easier to implement that true digital signatures. However, this point seems to have escaped most people, so they rule out the algorithms that use digital signatures because true digital signature algorithms are expensive. Thus, 3n+1-processor solutions are used even though there are satisfactory 2n+1-processor solutions,

My other contribution to this paper was getting it written. Writing is hard work, and without the threat of perishing, researchers outside academia generally do less publishing than their colleagues at universities. I wrote an initial draft, which displeased Shostak so much that he completely rewrote it to produce the final version.

Over the years, I often wondered whether the people who actually build airplanes know about the problem of Byzantine failures. In 1997, I received email from John Morgan who used to work at Boeing. He told me that he came across our work in 1986 and that, as a result, the people who build the passenger planes at Boeing are aware of the problem and design their systems accordingly. But, in the late 80s and early 90s, the people at Boeing working on military aircraft and on the space station, and the people at McDonnell-Douglas, did not understand the problem. I have no idea what Airbus knows or when they knew it.

abstract:

The problem addressed here concerns a set of isolated processors, some unknown subset of which may be faulty, that communicate only by means of two-party messages. Each nonfaulty processor has a private value of information that must be communicated to each other nonfaulty processor. Nonfaulty processors always communicate honestly, whereas faulty processors may lie. The problem is to devise an algorithm in which processors communicate their own values and relay values received from others that allows each nonfaulty processor to refer a value for each other processor. The value referred for a nonfaulty processor must be that processor's private value, and the value inferred for a faulty one must be consistent with the corresponding value inferred by each other nonfaulty processor.

It is shown that the problem is solvable for, and only for, n ≥ 3m + 1, where m is the number of faulty processors and n is the total number. It is also shown that if faulty processors can refuse to pass on information but cannot falsely relay information, the problem is solvable for arbitrary n ≥ m ≥ 0. This weaker assumption can be approximated in practice using cryptographic methods.


Generated from XML to JP's format