PRISM: PRecision-Integrated Scalable Monitoring


Overview

Scalable system monitoring is a fundamental abstraction for large-scale networked systems. It enables operators and end-users to characterize system behavior, from identifying normal conditions to detecting unexpected or undesirable events (attacks, configuration mistakes, security vulnerabilities, CPU overload, or memory leaks due to buggy applications) before serious harm is done. The primary goal of system monitoring is to provide a global view of the distributed network state, because making good decisions requires knowing what is happening across remote sites.

To compute this global view, monitoring services face two key challenges. First, they must scale to handle high volumes of dynamically changing data (e.g., per-flow or per-object state) spanning large numbers of nodes. Second, they must be robust to node and network failures, which can significantly affect the accuracy of global results. Since failures are common in large-scale systems, monitoring services should identify inaccurate results so that they neither raise false alarms (false positives) nor miss anomalies (false negatives). Unfortunately, many monitoring systems use ad-hoc approaches based on periodic logging at a centralized site. Such approaches reduce communication cost but either abandon guarantees on the precision of global results, miss anomalies and system compromises, or both.

To address these challenges, we have developed PRISM, a new scalable monitoring service that makes imprecision a first-class abstraction. Exposing imprecision is essential both for correctness in the face of network and node failures and for scalability to large systems. PRISM introduces the notion of conditioned consistency, which quantifies imprecision along a three-dimensional vector: arithmetic imprecision (AI), temporal imprecision (TI), and network imprecision (NI). AI and TI balance precision against monitoring overhead for scalability, while NI addresses the fundamental challenge of providing consistency guarantees despite failures in a large-scale distributed system. Our DHT-based implementation provides these metrics in a scalable way via (1) self-tuning of AI budgets to shift imprecision to where it is useful, (2) pipelining of TI delays to maximize batching of updates, and (3) dual-tree prefix aggregation, which exploits regularities in our DHT topology to significantly reduce the cost of the active probing needed to maintain NI.
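To make the AI idea concrete, here is a minimal sketch of how an arithmetic-imprecision budget can suppress updates at a single node. It is an illustration under stated assumptions, not PRISM's implementation: the `AIFilter` class, the `observe` method, and the `run` helper are hypothetical names, and the budget is fixed here rather than self-tuned as PRISM does.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AIFilter:
    """Hypothetical leaf-side filter that suppresses updates within an AI budget."""

    budget: float                          # allowed absolute deviation from the last report
    last_reported: Optional[float] = None  # value the parent currently holds
    sent: int = 0                          # number of updates actually transmitted

    def observe(self, value: float) -> bool:
        """Return True if this reading must be sent to the parent."""
        within_budget = (
            self.last_reported is not None
            and abs(value - self.last_reported) <= self.budget
        )
        if within_budget:
            # The parent's copy is still accurate to within the budget: send nothing.
            return False
        self.last_reported = value
        self.sent += 1
        return True


def run(readings: List[float], budget: float) -> Tuple[int, List[float]]:
    """Feed readings through the filter; return (updates sent, parent's view over time)."""
    f = AIFilter(budget=budget)
    parent_view = []
    for r in readings:
        f.observe(r)
        parent_view.append(f.last_reported)
    return f.sent, parent_view


readings = [100, 104, 97, 111, 109, 120, 90]
sent, parent_view = run(readings, budget=10)
print(sent, "of", len(readings), "updates sent")
```

With a budget of 10, only three of the seven readings are transmitted, yet the parent's view never differs from the true reading by more than the budget. A larger budget trades more numeric error for fewer messages, which is the trade-off that AI exposes.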

PRISM's careful management of imprecision qualitatively improves its capabilities. For example, by introducing a 10% AI, PRISM's PlanetLab monitoring service reduces network overheads by an order of magnitude compared to a centralized service that performs periodic logging of remote updates. In addition, by using NI metrics to automatically select the best aggregation results, PRISM reduces the observed worst-case inaccuracy by nearly a factor of five.
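The selection step can be pictured as follows. This is a hedged sketch rather than PRISM's actual policy: it assumes that each redundant aggregation result carries three hypothetical counters (`n_all`, the number of nodes in the system; `n_reachable`, the number whose inputs are reflected in the result; and `n_dup`, the number whose inputs may be counted twice), and the scoring rule in `ni_score` is an assumption chosen for illustration.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    """One aggregation result plus hypothetical NI counters."""

    value: float
    n_all: int        # nodes in the system
    n_reachable: int  # nodes whose inputs are reflected in `value`
    n_dup: int        # nodes whose inputs may be double counted


def ni_score(r: Result) -> float:
    """Fraction of nodes that may be missing or double counted (lower is better)."""
    if r.n_all <= 0:
        raise ValueError("n_all must be positive")
    missing = r.n_all - r.n_reachable
    return (missing + r.n_dup) / r.n_all


def select_best(results: List[Result]) -> Result:
    """Pick the result whose NI counters indicate the least uncertainty."""
    return min(results, key=ni_score)


results = [
    Result(value=480.0, n_all=100, n_reachable=90, n_dup=0),
    Result(value=505.0, n_all=100, n_reachable=99, n_dup=1),
    Result(value=530.0, n_all=100, n_reachable=100, n_dup=12),
]
best = select_best(results)
print(best.value, ni_score(best))
```

Here the second result wins because only two of its 100 inputs are in doubt, whereas the first may be missing ten nodes and the third may double count twelve.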
