Lorenzo Alvisi
(with Calvin Lin),
Scalable Low-Overhead Fault-Tolerance, $147,000 (ARP).
This project will investigate techniques for providing fault tolerance to
some of the world's fastest computers, namely the ASCI computer clusters
found at Los Alamos, Sandia, and Livermore National Labs. Our approach will
concentrate on rollback recovery techniques, which require minimal dedicated
resources while imposing little performance degradation. These techniques
have received considerable attention in the literature for their attractive
theoretical properties, but have so far failed to provide practical
fault-tolerance solutions for real systems. Our project instead aims to
improve rollback recovery techniques by applying them to a real and
challenging problem, one that the scientific community urgently needs to
solve. At the same time,
we aim to shed new light on the fundamental properties of the various
rollback recovery protocols by stress-testing them against supercomputer
systems that are at least two orders of magnitude larger than any system that
has ever been used to study such protocols.
The main outcome of this research will be a prototype
toolkit that provides low-overhead fault tolerance for scientific
applications running on ASCI clusters. We expect that the insights we
gain in developing the toolkit will lead to novel fault-tolerance techniques
and algorithms that are both theoretically and
experimentally sound.
Lorenzo Alvisi and Harrick Vin,
Resource Management in Server Clusters, $150,000 (ATP).
This project will investigate techniques for building highly scalable
server clusters capable of efficiently co-hosting a large number of
services simultaneously. There are two aspects to this problem. First,
cluster resources must be allocated to services based on their current
demand. Second, the load across servers must be balanced such that the
performance of each service scales linearly with the cluster resources
allocated to the service. We address both problems.
The outcome of this research will be new algorithms, architectures,
and prototype implementations of highly scalable server clusters.
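
As a rough illustration of the first aspect above (allocating cluster resources to services according to their current demand), a minimal sketch follows. It is not the project's algorithm: the function name `allocate_servers`, the demand figures, and the largest-remainder rounding rule are all assumptions made for illustration.

```python
def allocate_servers(demand, total_servers):
    """Split total_servers among services in proportion to demand.

    Hypothetical helper (not from the project): uses largest-remainder
    rounding so the integer shares always sum to total_servers.
    """
    total_demand = sum(demand.values())
    if total_demand <= 0:
        # No measurable demand: allocate nothing.
        return {name: 0 for name in demand}
    # Ideal (fractional) share of each service.
    ideal = {name: total_servers * d / total_demand
             for name, d in demand.items()}
    # Start from the floor of each ideal share.
    shares = {name: int(x) for name, x in ideal.items()}
    leftover = total_servers - sum(shares.values())
    # Hand the remaining servers to the largest fractional remainders.
    by_remainder = sorted(demand, key=lambda n: ideal[n] - shares[n],
                          reverse=True)
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares
```

For example, `allocate_servers({"web": 50, "mail": 30, "search": 20}, 10)` returns `{"web": 5, "mail": 3, "search": 2}`. The second aspect, balancing load across the servers within each service's allocation, is not covered by this sketch.
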
Mike Dahlin, Resource Management for Safe Deployment
of Edge Services, $125,000 (ATP). Collaborative project with
Dan Wallach, Rice University (also awarded $125,000).
We propose to examine how to safely
support "dynamic edge services" in wide area networks (WANs) by
limiting the resources they consume.
Edge services have recently been popularized by companies such as
Akamai and Digital Island, which place caching servers throughout
the Internet and direct Web requests to servers close to users. These
systems provide high availability and improved performance relative to
traditional Web servers. However, these systems distribute only
"static" content, typically GIF or JPEG images to be embedded inside
Web pages. To support dynamic content generation, edge
services must solve the harder problem of managing the execution of
arbitrary computer programs. These programs may be buggy, they may
consume excessive resources, or they may even be hostile to one
another.
We propose to design and prototype a system that can efficiently
allocate resources across these programs to maximize system
throughput and to provide worst-case service guarantees. Our system
will be robust against denial-of-service attacks from malicious or
buggy programs and will scale to support thousands of concurrently
executing programs.