UTCS Colloquium: Saurabh Bagchi, ECE, Purdue University, "Failure Prediction in Cycle Sharing Distributed Systems," ACES 2.402, Friday, March 2, 2007 at 3:00 p.m.
There is a signup schedule for this event.
Speaker Name/Affiliation: Saurabh Bagchi/ECE, Purdue University
Date/Time: Friday, March 2, 2007, 3:00 p.m. - 4:15 p.m.
Refreshments: By AI - 2:45 p.m.
Location: ACES 2.402
Host: Lorenzo Alvisi
Talk Title: Failure Prediction in Cycle Sharing Distributed Systems
Talk Abstract:
Although difficult, prediction is nevertheless important for a class of distributed systems where resources are shared between multiple applications submitted by different users. The system model is that of computational hosts contributed by different individuals for the common good of running applications. Several widely publicized projects follow this model, such as SETI@home, BOINC, and Climateprediction.net. In this model, the contributor often requires that her own application (the host process) run without significant impact from the remote users' applications (the guest processes). Therefore, guest processes can be killed if there is significant resource contention. In addition, guest processes can fail due to natural causes: hardware or software failure on the machine. We will consider these two classes of failures and try to predict their occurrences and characteristics.
For the first class of failures, we collect data on machine usage from student computer laboratories at Purdue. We observe the impact of guest processes on the host process, which we characterize using a semi-Markov process model. Our contributions are in the design of a suitable model and in using domain-specific characteristics to optimize the runtime computation for machine reliability based on the model. For the second class of failures, we concentrate on software failures. We collect machine state information and use it to predict near-term failures with an artificial neural network. Our contributions are in identifying the state variables that are most indicative of failures, identifying the key parameters (such as amount of state) that impact the accuracy of prediction, and developing a scheme for refining the neural network based on runtime observations.
We implement the prediction techniques in a cycle sharing system called iShare and use the predictions in a proactive scheduler that can migrate applications from failure-prone machines. We evaluate the accuracy of the prediction for different kinds of jobs and different amounts of lookahead and historical information, and show that the proactive scheduler is able to reduce the number of guest jobs that fail due to resource unavailability.
Speaker Bio:
Saurabh Bagchi is an Assistant Professor in Electrical and Computer Engineering at Purdue University in West Lafayette, IN. He is a faculty fellow of the Cyber Center and has a courtesy appointment in the Department of Computer Science. He received his M.S. and Ph.D. degrees from the University of Illinois at Urbana-Champaign in 1998 and 2001, respectively. At Purdue, he leads the Dependable Computing Systems Lab, where he and a set of wildly enthusiastic students try to make and break distributed systems for the good of the world. His papers can be found at www.ece.purdue.edu/~dcsl.