UTCS Colloquia - Saurabh Bagchi/Purdue University, Large Scale Debugging of Parallel Tasks using “Triumph of Majority” Principle, ACES 6.304
Type of Talk: UTCS Colloquia
Speaker/Affiliation: Saurabh Bagchi/Purdue University
Talk Audience: UTCS Faculty, Graduate Students, Undergraduate Students and Outside Interested Parties
Date/Time: Friday, October 28, 2011, 12:00 p.m.
Location: ACES 6.304
Host: Keshav Pingali
Talk Title: Large Scale Debugging of Parallel Tasks using “Triumph of Majority” Principle
Talk Abstract:
The number of cores used in large scale systems will exceed a million cores
in the near future, increasing the challenge of developing correct, high
performance applications. When an application fails or returns incorrect
results, the developer must identify the offending parallel task and then
the portion of the code in that task that caused the error. Traditional
parallel debugging tools scale poorly to large task counts and overwhelm
developers with information. We develop a detection tool, called AutomaDeD,
that identifies the offending task and, to a customizable granularity, the
relevant portion of code within the task. It performs runtime monitoring of
a parallel application to build a statistical model of the application's
typical timing and control flow behavior. By comparing the behavior of
clusters of parallel tasks temporally as well as spatially, AutomaDeD
identifies the period in time, the task(s), and the error site, i.e., the
region of code, where a fault first manifests itself.
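
As a rough illustration of the "majority" idea in the paragraph above, the sketch below compares per-task timings for each code region against the median across all tasks and reports the task and region that deviate most. It is a simplified stand-in, not AutomaDeD's actual statistical model: the function name `find_outlier`, the input layout, and the use of a median/MAD score are all assumptions made for this example.

```python
from statistics import median


def find_outlier(task_times):
    """Return (task_id, region, score) for the most deviant measurement.

    task_times maps a task id to a dict of {code_region: seconds}.
    This layout is hypothetical and chosen only for illustration.
    """
    regions = sorted({r for times in task_times.values() for r in times})
    best = (None, None, 0.0)
    for region in regions:
        values = [times.get(region, 0.0) for times in task_times.values()]
        med = median(values)
        # Median absolute deviation: a spread estimate that the outlier
        # itself cannot easily distort, so the majority defines "normal".
        mad = median(abs(v - med) for v in values)
        spread = mad if mad > 0 else 1e-9
        for task_id, times in task_times.items():
            score = abs(times.get(region, 0.0) - med) / spread
            if score > best[2]:
                best = (task_id, region, score)
    return best


if __name__ == "__main__":
    # Eight tasks with nearly identical timings; task 5 stalls in "exchange".
    times = {
        rank: {"compute": 2.0 + 0.01 * rank, "exchange": 1.0 + 0.01 * rank}
        for rank in range(8)
    }
    times[5]["exchange"] = 9.0
    task, region, score = find_outlier(times)
    print(f"suspect task {task} in region {region!r}")
```

The median is used rather than the mean so that a single faulty task does not shift the baseline it is being compared against.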
An especially subtle class of bugs consists of those that are
scale-dependent: while small-scale test cases do not exhibit the bug, the
bug arises in large-scale production runs. State-of-the-art statistical bug
detection techniques fail with such bugs, because they detect abnormal
behavior through comparison with bug-free behavior. Unfortunately, for
scale-dependent bugs, there may not be bug-free runs at large scales. In
this talk, we will describe a statistical approach to detecting and
localizing scale-dependent bugs. It detects bugs in large-scale programs by
building models of behavior based on bug-free behavior at small scales.
These models are constructed using kernel canonical correlation analysis
(KCCA) and exploit scale-determined properties, whose values are predictably
dependent on application scale.
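
The idea of learning from small, bug-free runs and extrapolating to a large scale can be sketched as follows. This example deliberately swaps in an ordinary least-squares fit in log-log space in place of KCCA, so it shows only the general shape of the approach and not the method described in the talk. The feature (messages sent per run), the numbers, and all function names are hypothetical.

```python
import math


def fit_log_log(scales, values):
    """Fit log(value) = intercept + slope * log(scale) by least squares."""
    xs = [math.log(s) for s in scales]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, scale):
    """Extrapolate the fitted trend to a task count not seen in training."""
    slope, intercept = model
    return math.exp(intercept + slope * math.log(scale))


def relative_error(model, scale, observed):
    """How far an observed value is from the extrapolated expectation."""
    expected = predict(model, scale)
    return abs(observed - expected) / expected


if __name__ == "__main__":
    # Made-up training data: bug-free runs at 8 to 64 tasks.
    model = fit_log_log([8, 16, 32, 64], [80, 160, 320, 640])
    # A large run whose feature value is far above the extrapolated trend
    # would be flagged for closer inspection.
    print(relative_error(model, 1024, 40000) > 1.0)
```

A feature only suits this kind of check if its value depends predictably on scale, which is the role the abstract assigns to scale-determined properties.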
We evaluate the tools on a parallel machine at Lawrence Livermore National
Lab and with real bug cases.
Speaker Bio:
Saurabh Bagchi is an Associate Professor in the School of Electrical and
Computer Engineering and the Department of Computer Science at Purdue
University in West Lafayette, Indiana. He is a senior member of IEEE and
ACM, a "Teaching for Tomorrow" faculty fellow at Purdue University and the
Assistant Director of the CERIAS security center at Purdue. He was the PC
chair for the IEEE/IFIP International Symposium on Dependable Systems and
Networks (DSN) in 2011. He received the MS and PhD degrees from the
University of Illinois, Urbana-Champaign, in 1998 and 2001, respectively.
At Purdue, he leads the Dependable Computing Systems Laboratory (DCSL),
where he and a set of wildly enthusiastic students try to make and break
distributed systems for the good of the world.