Ballista: Software Robustness Evaluation
Many of us accept computer crashes as an inevitable fact of life. But as our society becomes increasingly dependent on computers, it is essential that we find ways to make software less fragile. A key element in creating more robust software is the ability to test and quantify how gracefully a system handles exceptional conditions.
The Ballista robustness testing service is a scalable approach to testing application programming interfaces (APIs) for robust handling of exceptional parameter values. Ballista has automatically searched for and identified numerous problems with exception handling on several commercial systems, including one-line programs that crash Unix and Windows operating systems. Beyond identifying repeatable robustness problems, the Ballista testing methodology permits quantitative robustness comparisons among competing software packages. Recent results include a Windows vs. Linux head-to-head robustness experiment.
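The idea of probing an API with exceptional parameter values can be illustrated with a minimal Python sketch. This is not Ballista's implementation: the value pools in TEST_VALUES, the outcome categories, and the function toy_api are hypothetical names chosen for illustration, and an uncaught Python exception merely stands in for the crashes and aborts that a real test harness would observe.

```python
import itertools

# Hypothetical pools of exceptional values, one pool per parameter type.
TEST_VALUES = {
    "buffer": [None, b"", b"x" * 16],
    "int": [0, -1, 1, 2**31 - 1, -2**31],
}


def robustness_test(func, param_types):
    """Call func with every combination of test values and classify outcomes."""
    results = {"pass": 0, "graceful_error": 0, "robustness_failure": 0}
    pools = [TEST_VALUES[t] for t in param_types]
    for args in itertools.product(*pools):
        try:
            func(*args)
            results["pass"] += 1
        except ValueError:
            # In this sketch, an explicit ValueError counts as graceful handling.
            results["graceful_error"] += 1
        except Exception:
            # Any other uncaught exception stands in for a crash or abort.
            results["robustness_failure"] += 1
    return results


def toy_api(buf, n):
    """Toy function under test: validates buf but forgets to validate n."""
    if buf is None:
        raise ValueError("buf must not be None")
    return len(buf) // n  # unhandled ZeroDivisionError when n == 0


results = robustness_test(toy_api, ["buffer", "int"])
total = sum(results.values())
print(results)
print("failure rate:", results["robustness_failure"] / total)
```

The failure rate over a fixed set of test cases is the kind of number that allows two implementations of the same interface to be compared.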
One of the most persistent arguments against creating robust software is that doing so would cost too much performance. We have studied the robustness and performance of the SFIO (Safe Fast I/O) library developed by AT&T Research. Ballista testing pinpointed previously unidentified robustness problems in SFIO, enabling us to improve the robustness of SFIO by a factor of five with an average performance penalty of only 1% as measured by the original SFIO benchmarking scheme. Furthermore, we think that near-term CPU features may well make the performance cost of robustness improvement essentially free if hardware designers are aware of the issues involved.
Philip Koopman is an Assistant Professor of Electrical and Computer Engineering at Carnegie Mellon University, and is also a faculty member of the School of Computer Science Institute for Software Research and the Institute for Complex Engineered Systems. He has been a Navy submarine officer, a CPU designer, and an industrial researcher in application areas such as automobiles, elevators, and jet engines. His current research focus is on the creation of robust distributed embedded systems. He received his B.S. and M.Eng. from Rensselaer Polytechnic in 1982 and a Ph.D. in computer engineering from Carnegie Mellon in 1989.
University of California, Davis
Clocked Timing Elements for High-Performance and Low-Power VLSI Systems
Clocked storage elements are the single most analyzed and debated circuit structures in modern microprocessors. They matter because they form the boundary between ever-shrinking pipeline stages. The demand for high performance mandates a detailed understanding of timing issues and of the intricate inner workings of timing elements. The techniques known as "time borrowing", "slack passing", or "cycle stealing" rest on the fact that extra time needed in one cycle can be traded against the time allowed for the next. These techniques are increasingly used, and they are intimately tied to the inner workings of timing elements. This talk presents a number of clocked storage elements used in modern microprocessors and discusses the associated timing issues and design guidelines. It also sets out rules for consistent estimation of the real performance and power characteristics of flip-flop and master-slave latch structures. A new simulation and optimization approach is presented, targeting both high-performance and power-budget goals. The analysis reveals the sources of performance and power-consumption bottlenecks in different design styles. Certain misleading parameters have been modified and weighted to reflect the real properties of the compared structures. Finally, a comparison of representative master-slave latches and flip-flops illustrates the advantages of the presented approach and the suitability of different design styles for high-performance and low-power applications.
Prof. Vojin G. Oklobdzija obtained his Ph.D. in Computer Science from the University of California, Los Angeles in 1982, an MSc degree in 1978, and his Dipl. Ing. (MScEE) from the Electrical Engineering Department, University of Belgrade, Yugoslavia in 1971.
From 1982 to 1991 he was at the IBM T.J. Watson Research Center in New York, where he worked on the development of RISC architecture and processors, in particular the super-scalar IBM RS/6000 (PowerPC), for which he co-holds a patent on register renaming. This technique enabled an entire generation of super-scalar processors and is used in every high-performance processor today.
From 1988 to 1990 he was visiting faculty at the University of California Berkeley while on leave from IBM. Since 1991 Prof. Oklobdzija has held various consulting and academic positions. He was a consultant to Sun Microsystems Laboratories, AT&T Bell Laboratories, Hitachi Research Laboratories and Siemens Corp., where he was principal architect for the new generation of embedded logic and memory processors. Currently he is an advisor to SONY and Fujitsu Labs.
Prof. Oklobdzija holds an academic appointment with the University of California as well as various visiting academic appointments. In 1991 he spent time in Peru and Bolivia as a Fulbright professor, lecturing and assisting universities in South America. During 1996-98 he taught courses in Silicon Valley through the University of California Berkeley Extension and Hewlett-Packard.
Prof. Oklobdzija holds five U.S., five European, one Japanese and one Taiwanese patent, with eight other U.S. patents currently pending. He is a Fellow of the IEEE and a member of the American Association of University Professors. He serves on the editorial boards of the Journal of VLSI Signal Processing and IEEE Transactions on VLSI Systems and as a program committee member of the International Solid-State Circuits Conference. He was General Chair of the 13th Symposium on Computer Arithmetic, Vice Chair at the International Conference on Computer Design and a program committee member of the International Symposium on VLSI Technology. He has published over 100 papers and has given over 100 invited talks and short courses in the USA, Europe, Latin America, Australia, China and Japan.
(for further information please see: http://www.ece.ucdavis.edu/acsel)
Power Aware Page Allocation
One of the major challenges of post-PC computing is the need to reduce energy consumption, thereby extending the battery lifetime of mobile devices. Memory is an unexplored, and particularly important, target for efforts to improve energy efficiency. New memory technology offers power management features, including the ability to put individual DRAM chips in any one of several different power modes. This talk explores the interaction of page placement with static and dynamic hardware policies to exploit these emerging hardware features. In particular, we consider page allocation policies that can be employed by an informed operating system to complement the hardware power management strategies. We perform experiments using two complementary simulation environments: a trace-driven simulator with workload traces that are representative of mobile computing, and an execution-driven simulator with a detailed processor/memory model and a more memory-intensive set of benchmarks (SPEC2000). Our results make a compelling case for a cooperative hardware/software approach to exploiting power-aware memory: it achieves as little as 45% of the Energy*Delay of the best static policy, and 1% to 20% of the Energy*Delay of a traditional full-power memory.
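How page placement interacts with per-chip power modes can be sketched in a few lines of Python. The two policies below (packing pages into as few chips as possible versus interleaving them across all chips), along with every name and parameter value, are assumptions made for illustration; the abstract does not say which policies the talk evaluates. The point is only that the fewer chips hold live pages, the more chips become candidates for a low-power mode.

```python
def packed_placement(num_pages, num_chips, pages_per_chip):
    """Fill chip 0 completely before touching chip 1, and so on."""
    if num_pages > num_chips * pages_per_chip:
        raise ValueError("working set does not fit in memory")
    return [page // pages_per_chip for page in range(num_pages)]


def interleaved_placement(num_pages, num_chips, pages_per_chip):
    """Baseline: spread consecutive pages round-robin across all chips."""
    if num_pages > num_chips * pages_per_chip:
        raise ValueError("working set does not fit in memory")
    return [page % num_chips for page in range(num_pages)]


def chips_in_use(placement):
    """Number of distinct chips that hold at least one allocated page."""
    return len(set(placement))


# Hypothetical configuration: 8 chips of 64 pages each, 100-page working set.
NUM_CHIPS, PAGES_PER_CHIP, WORKING_SET = 8, 64, 100

packed = packed_placement(WORKING_SET, NUM_CHIPS, PAGES_PER_CHIP)
spread = interleaved_placement(WORKING_SET, NUM_CHIPS, PAGES_PER_CHIP)
print("chips in use, packed:     ", chips_in_use(packed))
print("chips in use, interleaved:", chips_in_use(spread))
```

In this configuration the packed placement touches two chips and the interleaved one touches all eight, so six chips could stay in a low-power mode under the packed policy.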
Alvin Lebeck is an Assistant Professor of Computer Science and of Electrical and Computer Engineering at Duke University. His research interests include high performance microarchitectures, hardware and software techniques for improved memory hierarchy performance, multiprocessor systems, and energy efficient computing. He received the BS in Electrical and Computer Engineering, and the MS and PhD in Computer Science, at the University of Wisconsin-Madison. He is the recipient of a 1997 NSF CAREER Award and a member of ACM and IEEE.
North Carolina State University
A slipstream processor harnesses an otherwise unused processing element in a chip multiprocessor (CMP) to speed up a single program. It does this by running two redundant copies of the program. Predicted-non-essential computation is speculatively removed from one of the programs, speeding it up. The second program checks the forward progress of the first and is also sped up in the process. Both program copies finish sooner than either can alone.
While unusual, there are important selling points of a redundant program arrangement:
1. The redundant programs are architecturally independent and this leads to a simple execution model. We do not advocate slipstream as a replacement for other multithreading models because they target different sources of performance. But it is notable that traditional speculative multithreading divides a single program into parallel tasks and this leads to elaborate inter-task register/memory dependence mechanisms.
2. In processor research, even slight performance gains are increasingly difficult to achieve. It may be fruitful to direct efforts not just towards improving performance, but improving performance as best we can while providing other value-add features at no extra complexity. For example, slipstream transparently provides some degree of fault tolerance due to redundant program execution. More generally, slipstream promotes comprehensive, flexible functionality without fundamentally changing the way processors execute programs.
This talk reviews the slipstream microarchitecture and key performance insights (15% average performance improvement), presents recent improvements to the previously-duplicated memory hierarchy, and summarizes future work.
Eric Rotenberg is an Assistant Professor of Electrical and Computer Engineering at NC State University. He received a BS degree in electrical engineering (1991) and MS and PhD degrees in computer sciences (1996, 1999) from the University of Wisconsin-Madison. From 1992 to 1994, he participated in the design of IBM's AS/400 computer in Rochester, MN. His main research interests are in high-performance computer architecture and microarchitecture-based fault tolerance.
Power-Aware Computer Systems: Some Thoughts on Modeling & Design
Power consumption, thermal issues, and battery lifetimes are becoming primary design issues in many computer systems. Increasingly, power must be considered side-by-side with performance when designing a computer system. For some systems, careful circuit design may be sufficient to meet the system's power goals. Often, however, the power and performance goals are aggressive enough that a range of circuit, architecture, and software techniques must be used together.
This talk will first describe my group's research efforts to model CPU power dissipation at the architecture level, and then discuss some of our more recent work on design techniques for reducing power consumption. An overall theme of these design techniques has been to use detailed and often dynamic characterizations of application behavior to drive power management. For example, I will particularly focus on our upcoming ISCA2001 paper which proposes ways to reduce data cache leakage power by exploiting the generational behavior of application references to cache lines.
Margaret Martonosi is currently an Associate Professor at Princeton University, where she has been on the faculty in the Department of Electrical Engineering since 1994. Her research interests are in computer architecture and the hardware-software interface, and her group's current research focus is on hardware and software strategies for power-efficient computing. Martonosi earned her Ph.D. from Stanford University in 1993, and also holds a Master's degree from Stanford and a bachelor's degree from Cornell University, all in Electrical Engineering.
Stream Architecture: Rethinking Media Processor Design
Today's media processing applications demand very high arithmetic rates, and compelling applications of the future promise to further increase this demand. Fortunately, these applications have large amounts of inherent parallelism. The challenge in media processor design, therefore, is to efficiently support the large number of arithmetic units needed to exploit this parallelism. In this talk, I will discuss the Imagine Stream Processor and the architectural motivation for its underlying stream architecture which is designed for high-performance media processing.
The storage structures of the stream architecture address modern VLSI constraints while providing the bandwidth necessary to support large numbers of arithmetic units. By partitioning the register file structure, its cost in terms of area, delay, and power can be greatly reduced. The partitioned stream register file organization utilizes a bandwidth hierarchy to amplify the data bandwidth of the memory system for the arithmetic units. The memory bandwidth at the base of this hierarchy must be utilized efficiently in order to meet the demands of the numerous arithmetic units. This can be accomplished by scheduling memory accesses to exploit the available parallelism in modern DRAMs. The Imagine Stream Processor incorporates these concepts, yielding a high-performance processor capable of sustaining in excess of 10GOPS on many media processing applications.
Scott Rixner is an Assistant Professor of Computer Science and Electrical and Computer Engineering at Rice University. He is the principal architect of the Imagine Stream Processor and his research interests include media processing, the interaction between VLSI and computer architectures, and techniques for managing and scheduling DRAM accesses to minimize latency and bandwidth demands. A "lifer", he received the SB, SM, and PhD degrees in Computer Science from the Massachusetts Institute of Technology.
Carnegie Mellon University
Memory Forwarding: Enabling Aggressive Layout Optimizations by Guaranteeing the Safety of Data Relocation*
By optimizing data layout at run-time, we can potentially enhance the performance of caches by actively creating spatial locality, facilitating prefetching, and avoiding cache conflicts and false sharing. Unfortunately, it is extremely difficult to guarantee that such optimizations are safe in practice on today's machines, since accurately updating all pointers to an object requires perfect alias information, which is well beyond the scope of the compiler for languages such as C. To overcome this limitation, we propose a technique called memory forwarding which effectively adds a new layer of indirection within the memory system whenever necessary to guarantee that data relocation is always safe. Because actual forwarding rarely occurs (it exists as a safety net), the mechanism can be implemented as an exception in modern superscalar processors. Our experimental results demonstrate that the aggressive layout optimizations enabled by memory forwarding can result in significant speedups - more than twofold in some cases - by reducing the number of cache misses, improving the effectiveness of prefetching, and conserving memory bandwidth.
* Joint work with Chi-Keung Luk.
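The forwarding mechanism described in the abstract can be modeled with a toy Python memory. This is a software sketch under stated assumptions, not the hardware design presented in the talk: the class ForwardingMemory, its per-word forwarded flag, and the relocate helper are hypothetical. It shows only why a stale pointer to a relocated word still reaches the right data.

```python
class ForwardingMemory:
    """Toy word-addressed memory in which each word carries a forwarding flag."""

    def __init__(self, size):
        self.words = [0] * size
        # Hypothetical per-word flag: True means the word holds a forwarding
        # address rather than data.
        self.forwarded = [False] * size

    def _resolve(self, addr):
        """Follow forwarding addresses until reaching the word's current home."""
        while self.forwarded[addr]:
            addr = self.words[addr]
        return addr

    def load(self, addr):
        return self.words[self._resolve(addr)]

    def store(self, addr, value):
        self.words[self._resolve(addr)] = value

    def relocate(self, old, new):
        """Move one word to `new`, leaving a forwarding address behind."""
        home = self._resolve(old)
        if home == new:
            return  # already there; avoid creating a self-referencing entry
        self.words[new] = self.words[home]
        self.forwarded[new] = False
        self.words[home] = new
        self.forwarded[home] = True


mem = ForwardingMemory(8)
mem.store(2, 42)
stale_pointer = 2          # a pointer the optimizer failed to update
mem.relocate(2, 5)         # layout optimization moves the word
print(mem.load(stale_pointer))  # still reaches the data through forwarding
```

Accesses through up-to-date pointers never touch a forwarded word, which matches the abstract's observation that actual forwarding is rare and serves as a safety net.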
Todd C. Mowry is an Associate Professor in the School of Computer Science at Carnegie Mellon University. He received an M.S.E.E. and Ph.D. from Stanford University in 1989 and 1994, respectively. From 1994 through 1997, he was an Assistant Professor in the ECE and CS departments at the University of Toronto prior to joining Carnegie Mellon University in July, 1997. The goal of Professor Mowry's research is to develop new techniques for designing computer systems (both hardware and software) such that they can achieve dramatic performance breakthroughs at low cost without placing any additional burden on the programmer. Specifically, he has been focusing on two areas: (i) automatically tolerating the ever-increasing relative latencies of accessing and communicating data (via DRAM, disks, and networks) which threaten to nullify any other improvements in processing efficiency; and (ii) automatically extracting thread-level parallelism from important classes of applications where this is currently not possible. In 1999, he received a Sloan Research Fellowship and the TR100 Award from MIT's Technology Review magazine.
Piranha: A Complexity-Effective Processor Design for Commercial Workloads
The microprocessor industry is struggling with escalating development costs and design times arising from exceedingly complex processors that push the limits of instruction-level parallelism. Meanwhile, such designs are yielding diminishing returns and are ill suited for commercial applications, such as database and Web workloads that constitute the most important market for high-performance servers.
In this talk I will review our group's research on understanding the behavior of commercial workloads, and describe the architecture that it inspired: Piranha. Piranha uses chip-multiprocessing as the basis for a scalable shared memory system that is optimized for database and web/internet workloads. Our estimates indicate that a Piranha system can outperform traditional complex core designs by over a factor of three in database workloads. I'll also describe how Piranha's architectural choices and methodology effectively address design cost and complexity challenges.
Luiz Andre Barroso is a Senior Member of the Technical Staff at the Compaq Western Research Lab. His current interests are in architecture and design of server-class systems, and commercial workload performance. His previous work includes cache-coherence protocol design, system-level hardware emulation, and analytic performance modeling. He holds a PhD in Computer Engineering from USC, and MS/BS degrees in Electrical Engineering from PUC University, Rio de Janeiro.
Multithreading Architectures: Are They Here to Stay?
Microprocessor architecture has evolved significantly since the introduction of processors in the early 1970's. One line of research focused on exploiting a single thread by issuing more than one instruction per cycle. Current superscalar microprocessors are capable of issuing more than six instructions per cycle, although their performance is limited by the instruction-level parallelism (ILP) of a single thread, which is well below this number. Different techniques, including out-of-order execution, branch prediction, and predicated execution, are used to increase the ILP within a single thread. Multithreading architectures provide an alternative approach whose origins can be traced to the CDC6600 in the 60's. These architectures increase the ILP by making use of thread parallelism. In the early 80's Denelcor introduced the HEP, a fine-grain multithreading supercomputer capable of tolerating memory latency and even functional unit latency. Six systems were delivered to customers during the years 1981-1985. In the late 80's Delco Electronics developed TIO, a multithreading real-time microprocessor. This processor is still used in GM cars. In the early 90's Dynamic Multithreading was proposed. While no other successful multithreading processor has been implemented yet, both Compaq and XStream Logic have announced multithreading processors. In this talk I will briefly review these architectures and put the future of multithreading in perspective.
Mario Nemirovsky has done research in many areas of computer architecture, including simultaneous multithreading, branch prediction, superscalar architecture and real-time processors. He was the founding CEO of XstreamLogic and was the chief architect of their very high performance network processor, which combines SMT, special I/O instructions, and a special register structure. Before that, he was a chief architect at National Semiconductor and at Weitek. He was one of the first to understand the power of multithreading, which he termed dynamic multistreaming, and first published on this subject in Micro-24 in 1991. He subsequently published results on SMT in HICSS (in 1993 and 1994) and in PACT (in 1995). This remains a core research interest of his. Mario received his PhD in ECE from UC Santa Barbara in 1990 and has been an Adjunct Professor there since 1991. Since 1998, he has also been an Adjunct Professor at UC Santa Cruz. He has produced three PhD graduates and a number of MS graduates. Two of his PhD students did their research in simultaneous multithreading. Today, Mario is an independent consultant in the field of processor implementation.