UTCS Colloquia - Saurabh Bagchi/Purdue University, Large Scale Debugging of Parallel Tasks using “Triumph of Majority” Principle, ACES 6.304

Contact Name: Jenna Whitney
Date: Oct 28, 2011 12:00pm - 1:30pm

Type of Talk: UTCS Colloquia

Speaker/Affiliation: Saurabh Bagchi/Purdue University

Talk Audience: UTCS Faculty, Graduate Students, Undergraduate Students and Outside Interested Parties

Date/Time: Friday, October 28, 2011, 12:00 p.m.

Location: ACES 6.304

Host: Keshav Pingali

Talk Title: Large Scale Debugging of Parallel Tasks using “Triumph of Majority” Principle

Talk Abstract:
The number of cores used in large scale systems will exceed a million cores
in the near future, increasing the challenge of developing correct, high
performance applications. When an application fails or returns incorrect
results, the developer must identify the offending parallel task and then
the portion of the code in that task that caused the error. Traditional
parallel debugging tools scale poorly to large task counts and overwhelm
developers with information. We develop a detection tool, called AutomaDeD,
that identifies the offending task and, to a customizable granularity, the
relevant portion of code within the task. It performs runtime monitoring of
a parallel application to build a statistical model of the application's
typical timing and control flow behavior. By comparing the behavior of
clusters of parallel tasks temporally as well as spatially, AutomaDeD
identifies the period in time, the task(s), and the error site, i.e., the
region of code, where a fault first manifests itself.

An especially subtle class of bugs are those that are scale-dependent:
while small-scale test cases do not exhibit the bug, the bug arises in
large-scale production runs. The state-of-the-art statistical bug detection
techniques fail with such bugs, because they detect abnormal behavior
through comparison with bug-free behavior. Unfortunately, for
scale-dependent bugs, there may not be bug-free runs at large scales. In
this talk, we will describe a statistical approach to detecting and
localizing scale-dependent bugs. It detects bugs in large-scale programs by
building models of behavior based on bug-free behavior at small scales.
These models are constructed using kernel canonical correlation analysis
(KCCA) and exploit scale-determined properties, whose values are
predictably dependent on application scale.

We evaluate the tools on a parallel machine at Lawrence Livermore National
Lab and with real bug cases.

Speaker Bio:
Saurabh Bagchi is an Associate Professor in the School of Electrical and
Computer Engineering and the Department of Computer Science at Purdue
University in West Lafayette, Indiana. He is a senior member of IEEE and
ACM, a "Teaching for Tomorrow" faculty fellow at Purdue University and the
Assistant Director of the CERIAS security center at Purdue. He was the PC
chair for IEEE/IFIP International Symposium on Dependable Systems and
Networks (DSN) in 2011. He received the MS and PhD degrees from the
University of Illinois, Urbana-Champaign, in 1998 and 2001, respectively.
At Purdue, he leads the Dependable Computing Systems Laboratory (DCSL),
where he and a set of wildly enthusiastic students try to make and break
distributed systems for the good of the world.