University of Texas at Austin Department of Computer Sciences Networking Research Laboratory
Department of Computer Sciences
The University of Texas at Austin

Director: Simon S. Lam (more publications)


Silk: A Resilient Routing Fabric for Peer-to-Peer Networks

Simon S. Lam and Huaiyu Liu
Technical Report TR-03-13, May 16, 2003 (revised, October 12, 2003).

Several proposed peer-to-peer networks use hypercube routing for scalability.  In a previous paper, we showed that consistency of neighbor tables in hypercube routing guarantees the existence of a path from any source node to any destination node. Consistency, however, can be broken by the failure of one node. To improve the robustness of hypercube routing, we generalize the concept of consistency to K-consistency, for K ≥ 1. We then show that a K-consistent hypercube routing network provides at least K disjoint paths from any source node to any destination node with a probability close to 1. The first objective of this report is the design and specification of a new join protocol together with a proof that it generates K-consistent neighbor tables for an arbitrary number of concurrent joins (under the assumption that there is no concurrent leave or failure). To do so, we construct a more general definition of C-set tree than our previous one as the conceptual foundation for protocol design and reasoning about K-consistency. Both the new protocol and proof require major extensions to the ones in our previous paper to generalize them from 1-consistency to K-consistency.

The second objective of this report is the design and evaluation of a failure recovery protocol for K-consistent networks. From simulation experiments in which up to 50% of the nodes in a K-consistent network failed (when a node fails, it becomes silent), we found that, for K ≥ 2, K-consistency was recovered in every experiment. The third objective of this report is to extend our join and failure recovery protocols such that they construct and maintain K-consistent neighbor tables for networks whose nodes join and fail concurrently and frequently. In particular, our join protocol is extended with rules to handle failures of not only existing nodes but also other joining nodes. These extended protocols, being implemented in our prototype system named Silk, will be referred to as Silk protocols. From simulation experiments in which the number of concurrent joins and failures was up to 50% of the initial network size, we found that, for K ≥ 2, Silk generated and maintained K-consistent neighbor tables after the concurrent joins and failures in every experiment. We also present an analysis of the communication and storage overheads of Silk protocols and show that Silk is scalable to a large number of network nodes