UTCS Colloquium-Greg Bronevetsky/Lawrence Livermore National Lab: "Statistical Fault Detection and Analysis," ACES 6.304, Friday, May 14, 2010, 11:00 a.m.

Contact Name: Jenna Whitney
May 14, 2010 11:00am - 12:00pm

There is a sign-up schedule for this event that can be found
at http://www.cs.utexas.edu/department/webeven


Type of Talk: UTCS Colloquium

Speaker/Affiliation: Greg Bronevetsky/Lawrence Livermore National Lab


Date/Time: Friday, May 14, 2010, 11:00 a.m.


Location: ACES 6.304

Host: Keshav Pingali

Talk Title: Statistical Fault Detection and Analysis


Talk Abstract:

As large-scale systems grow in size and capability, they also grow increasingly complex and less reliable. In particular, decreasing feature sizes and increasing component counts make systems more vulnerable to complex faults that (i) reduce system performance or even (ii) silently corrupt application output. Traditional approaches to system and application design do a poor job of describing the effects of faults on real software and hardware. This makes it difficult to design effective techniques to detect failures and data corruptions or trace detected failures to their root causes.

This talk will focus on our recent work in using statistical modeling to describe the effect of failures on applications and systems. Our basic approach is to observe the behavior of software or hardware to create a simple model of its behavior. Such models approximate the key behavior of the real system but are significantly simpler to analyze, which enables a much more detailed understanding of complex system behaviors. This enables powerful new techniques to make systems more productive and reliable:

By training a model on a system's normal behavior, we can detect behavioral abnormalities that may indicate software bugs or hardware failures;

We can connect observed abnormalities to their likely causes by injecting faults into the system and modeling its behavior when exposed to each type of fault;

By creating fine-grained models that track the propagation of faults through a system, we can (i) precisely characterize the system's critical failure modes and (ii) identify the root causes and propagation paths of observed abnormalities.

The fundamental goal of our work is to make complex systems more predictable and analyzable, which will ultimately make them more useful and productive.