PRISM: PRecision-Integrated Scalable Monitoring
Committee Members
- Dr. Mike Dahlin: (Chair) Professor, CS
- Dr. James Browne: Professor, CS, ECE, and Physics
- Dr. Joe Hellerstein: Professor, EECS, UC Berkeley
- Dr. Greg Plaxton: Professor, CS
- Dr. Yin Zhang: Assistant Professor, CS
Overview
Scalable system monitoring is a
fundamental abstraction for
large-scale networked systems.
It enables operators
and end-users to characterize system behavior,
from identifying normal conditions to detecting
any unexpected or undesirable events---attacks, configuration mistakes,
security vulnerabilities, CPU overload, or memory leaks
due to buggy applications---before
any serious harm is done.
The primary goal of system monitoring is to provide a global view
of the distributed network state, because making good decisions requires
knowledge of what is happening across remote sites.
To compute this global view,
monitoring services face two key challenges.
First, they must scale to
handle high volumes
of dynamically changing data (e.g., per-flow or per-object state)
spanning large numbers of nodes.
Second, they must be
robust to
node and network
failures, which
can significantly affect the accuracy of global results.
Since failures
are common in large-scale systems,
monitoring services should
identify inaccurate
results so as not to raise any false alarms (false positives)
or miss any anomalies (false negatives).
Unfortunately,
many monitoring systems
use ad-hoc approaches based on periodic logging at a centralized site.
Such approaches reduce communication cost but
abandon guarantees on the precision of global results,
miss anomalies and system compromises, or both.
To address these challenges,
we have developed PRISM, a new scalable monitoring service
that makes imprecision a first-class abstraction.
Exposing imprecision is essential for both
correctness in the face of network and node failures and scalability
to large systems.
PRISM introduces the notion of conditioned consistency, which
quantifies imprecision along a three-dimensional vector:
- Arithmetic Imprecision (AI) bounds numeric inaccuracy,
- Temporal Imprecision (TI) bounds
update delays, and
- Network Imprecision (NI) bounds uncertainty due to
network and node failures.
AI and TI balance precision against monitoring overhead for scalability,
while NI addresses the fundamental challenge of providing consistency
guarantees despite failures in a large-scale distributed system.
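To make the AI dimension concrete, the following is a minimal sketch of how an arithmetic imprecision budget can suppress updates at a single node. It is an illustration under stated assumptions, not PRISM's implementation: the class name AIFilter, the function count_messages, and the use of an absolute (rather than percentage) budget are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AIFilter:
    """Hypothetical per-node filter: suppress updates within an AI budget."""
    budget: float                 # assumed absolute error the node may hide
    last_reported: float = 0.0    # last value actually sent upstream
    has_reported: bool = False    # the first observation is always sent

    def observe(self, value: float) -> bool:
        """Return True if this value must be sent to the parent."""
        drifted = abs(value - self.last_reported) > self.budget
        if not self.has_reported or drifted:
            self.last_reported = value
            self.has_reported = True
            return True
        # The parent's cached value is still within the budget of the truth.
        return False


def count_messages(values, budget):
    """Count how many of the observed values trigger an upstream message."""
    f = AIFilter(budget)
    return sum(1 for v in values if f.observe(v))


if __name__ == "__main__":
    readings = [10, 10.5, 11, 12.5, 12, 20]
    print(count_messages(readings, budget=0.0))  # every change is reported
    print(count_messages(readings, budget=1.0))  # small drifts are absorbed
```

A larger budget trades accuracy for fewer messages, which is the balance between precision and monitoring overhead described above.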
Our DHT-based implementation provides these metrics in a scalable way
via
(1) self-tuning of AI budgets to shift imprecision to where it is
useful, (2) pipelining of TI delays to maximize batching of updates,
and (3) dual-tree prefix aggregation,
which exploits regularities in
our DHT topology to significantly reduce the cost of
the active probing needed to maintain NI.
PRISM's careful management of imprecision qualitatively improves its
capabilities. For example, by
introducing a 10% AI, PRISM's PlanetLab monitoring service
reduces network overheads by an order of magnitude compared to
a centralized service that performs periodic logging
of remote updates. In addition, by using NI metrics to automatically
select the best aggregation results, PRISM reduces the observed
worst-case inaccuracy by nearly a factor of five.
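The idea of using NI metrics to choose among aggregation results can be sketched as follows. This is a hedged illustration only: the record fields (n_all, n_reachable, n_dup), the scoring rule in ni_score, and the function select_best are assumptions made for this example and are not taken from the text above.

```python
from typing import NamedTuple, Sequence


class Result(NamedTuple):
    """Hypothetical aggregate computed over one aggregation tree."""
    value: float       # the aggregate itself (e.g., a SUM)
    n_all: int         # nodes believed to be in the system
    n_reachable: int   # nodes whose inputs are reflected in the value
    n_dup: int         # nodes whose inputs may be counted more than once


def ni_score(r: Result) -> float:
    """Assumed scoring rule: fraction of inputs that are missing or doubled."""
    if r.n_all == 0:
        return float("inf")
    missing = r.n_all - r.n_reachable
    return (missing + r.n_dup) / r.n_all


def select_best(results: Sequence[Result]) -> Result:
    """Pick the result with the least network imprecision under ni_score."""
    return min(results, key=ni_score)


if __name__ == "__main__":
    candidates = [
        Result(value=100.0, n_all=50, n_reachable=40, n_dup=5),
        Result(value=120.0, n_all=50, n_reachable=48, n_dup=1),
        Result(value=90.0, n_all=50, n_reachable=50, n_dup=10),
    ]
    print(select_best(candidates))
```

Other scoring rules are possible; the point of the sketch is only that exposing imprecision alongside each result lets a client prefer the least disrupted one instead of trusting any single answer.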
Papers, Presentations, and Posters
-
Self-Tuning, Bandwidth-Aware Monitoring of Dynamic Data Streams.
Navendu Jain, Praveen Yalagandula, Mike Dahlin, and Yin Zhang.
25th IEEE International Conference on Data Engineering (ICDE '09)
Shanghai, China, April 2009 (To Appear). Acceptance Ratio: 17% (93 out of 554).
[PDF]
[PS]
[Bibtex]
[Project Page]
-
Network Imprecision: A New Consistency Metric for Scalable Monitoring.
Navendu Jain, Dmitry Kit, Prince Mahajan, Praveen Yalagandula, Mike Dahlin, and Yin Zhang.
8th USENIX Symposium on Operating Systems Design and Implementation (OSDI '08)
San Diego, CA, December 2008 (To Appear). Acceptance Ratio: 13% (26 out of 193).
[PDF] [PS]
[Technical Report]
[Bibtex]
[Project Page]
-
WebView: Scalable Information Monitoring for Data-Intensive Web Applications.
Navendu Jain, Mike Dahlin, and Yin Zhang.
[PDF]
[PS]
[Bibtex]
-
STAR: Self-Tuning Aggregation for Scalable Monitoring.
Navendu Jain, Dmitry Kit, Prince Mahajan, Praveen Yalagandula, Mike Dahlin, and Yin Zhang.
33rd International Conference on Very Large Databases (VLDB '07)
Vienna, Austria, September 2007. Acceptance Ratio: 16.4% (45 out of 275)
[PDF]
[PS]
[Bibtex]
[Technical Report]
[Project Page]
-
PRISM: PRecision-Integrated Scalable Monitoring.
Navendu Jain, Dmitry Kit, Prince Mahajan, Praveen Yalagandula, Mike Dahlin, and Yin Zhang.
TR06-22, UTCS Technical Report.
Demo session. At WORLDS 2006 (in conjunction with OSDI 2006).
[PDF]
[PS]
[Bibtex]
[Project Page]
[PRISMon Demo]
-
SCUD: Scalable Counting of Unique Data.
Dmitry Kit, Prince Mahajan, Navendu Jain, Praveen Yalagandula, Mike Dahlin, and Yin Zhang.
Poster session. At NSDI 2006.
[PDF]
[PPT]
-
PRISM: Precision-aware Aggregation for Scalable Monitoring.
Navendu Jain, Dmitry Kit, Prince Mahajan, Praveen Yalagandula, Mike Dahlin, and Yin Zhang.
Poster session. At NSDI 2006.
[PDF]
[PPT]
-
INSIGHT: A Distributed Monitoring System for Tracking Continuous Queries.
Navendu Jain, Praveen Yalagandula, Mike Dahlin, and Yin Zhang.
20th ACM Symposium on Operating Systems Principles, Work-in-Progress (SOSP '05 WIP)
Brighton, United Kingdom, October 2005.
[PDF]
[PS]
[Bibtex]
[Slides]
-
A Scalable Distributed Information Management System.
Praveen Yalagandula and Mike Dahlin.
ACM SIGCOMM, August 2004.
[PDF]