UTCS Colloquium: Saurabh Bagchi, ECE, Purdue University, "Failure Prediction in Cycle Sharing Distributed Systems," ACES 2.402, Friday, March 2, 2007 at 3:00 p.m.

Contact Name: 
Jenna Whitney
Date: 
Mar 2, 2007 3:00pm - 4:15pm

There is a signup schedule for this event.

Speaker Name/Affiliation: Saurabh Bagchi/ECE, Purdue University

Date/Time: Friday, March 2, 2007, 3:00 p.m. - 4:15 p.m.

Refreshments: By AI - 2:45 p.m.

Location: ACES 2.402

Host: Lorenzo Alvisi

Talk Title: Failure Prediction in Cycle Sharing Distributed Systems

Talk Abstract:
Although difficult, prediction is nevertheless important for a class of distributed systems where resources are shared between multiple applications submitted by different users. The system model is that of computational hosts contributed by different individuals for the common good of running applications. There are several widely publicized projects that have this model, such as SETI@home, BOINC, and Climateprediction.net. In this model, the contributor often has requirements for running her application (the host process) without significant impact from the remote users’ applications (the guest processes). Therefore, guest processes can be killed if there is significant resource contention. In addition, guest processes can fail due to natural causes: hardware or software failure on the machine. We will consider these two classes of failures and try to predict their occurrences and characteristics.

For the first class of failures, we collect data on machine usage from student computer laboratories at Purdue. We observe the impact of guest processes on the host process, which we characterize using a semi-Markov process model. Our contributions are in the design of a suitable model and in using domain-specific characteristics to optimize the runtime computation for machine reliability based on the model. For the second class of failures, we concentrate on software failures. We collect machine state information and use it to predict near-term failures with an artificial neural network. Our contributions are in identifying the state variables that are most indicative of failures, identifying the key parameters (such as amount of state) that impact the accuracy of prediction, and developing a scheme for refining the neural network based on runtime observations.

We implement the prediction techniques in a cycle sharing system called iShare and use the predictions in a proactive scheduler that can migrate applications from failure-prone machines. We evaluate the accuracy of the prediction for different kinds of jobs and different amounts of lookahead and historical information, and show that the proactive scheduler is able to reduce the number of guest jobs that fail due to resource unavailability.

Speaker Bio:
Saurabh Bagchi is an Assistant Professor in Electrical and Computer Engineering at Purdue University in West Lafayette, IN. He is a faculty fellow of the Cyber Center and has a courtesy appointment in the Department of Computer Science. He received his M.S. and Ph.D. degrees from the University of Illinois at Urbana-Champaign in 1998 and 2001. At Purdue, he leads the Dependable Computing Systems Lab, where he and a set of wildly enthusiastic students try to make and break distributed systems for the good of the world. His papers can be found at www.ece.purdue.edu/%7Edcsl.