Measuring Benchmark Similarity and its Applications
Having a representative workload of the target domain of a microprocessor is extremely important throughout its design. Composing a workload involves two issues: (i) which benchmarks to select and (ii) which input data sets to select per benchmark. Unfortunately, it is impossible to select a huge number of benchmarks and respective input sets, due to the large instruction counts per benchmark and the limited simulation time available. In this talk, we will use statistical data analysis techniques such as principal components analysis (PCA) and cluster analysis (CA) to efficiently explore the workload space. Within this workload space, different input data sets for a given benchmark can be displayed, a distance can be measured between program-input pairs that quantifies their mutual behavioral differences, and representative input data sets can be selected for the given benchmark. The methodology is validated by showing that program-input pairs that lie close to each other in this workload space indeed exhibit similar behavior. The final goal is to select a limited set of representative benchmark-input pairs that span the complete workload space. Besides workload composition, we discuss four other possible applications, namely (i) gaining insight into the impact of input data sets on program behavior, (ii) evaluating the representativeness of sampled traces, (iii) evaluating the representativeness of reduced input data sets, and (iv) investigating the interaction between Java applications and their virtual machines.
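The workload-space construction described above can be sketched in a few lines of NumPy. Everything here is an illustrative assumption rather than the talk's actual implementation: the characteristics matrix, the standardization step, the number of retained components, and the use of plain Euclidean distance as the behavioral-difference measure.

```python
import numpy as np

def workload_space(features, n_components=2):
    """Project benchmark-input pairs (rows) described by program
    characteristics (columns) into a low-dimensional space via PCA."""
    X = np.asarray(features, dtype=float)
    # Standardize each characteristic so no single metric dominates.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Principal components: eigenvectors of the covariance matrix,
    # sorted by decreasing explained variance.
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_components]
    return X @ eigvecs[:, order]

def pairwise_distance(coords, i, j):
    """Distance in the workload space: a proxy for behavioral difference."""
    return float(np.linalg.norm(coords[i] - coords[j]))
```

Program-input pairs with identical characteristics land on the same point, so their distance is zero; cluster analysis can then pick, per cluster, the pair nearest the centroid as the representative.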
Lieven Eeckhout obtained his Master's and PhD degrees in Computer Science and Engineering from Ghent University, Belgium, in 1998 and 2002, respectively. He is currently working as a Postdoctoral Researcher at the same university through a grant from the Fund for Scientific Research-Flanders (FWO Vlaanderen). His research interests include computer architecture, performance evaluation, and workload characterization. He has published papers in various high-quality journals (IEEE Computer, IEEE Micro, and Journal of Instruction-Level Parallelism) and conferences (PACT, OOPSLA, etc.).
William J. Dally
Merrimac: A Streaming Supercomputer
The streaming supercomputer project aims to develop a scientific computer that offers an order of magnitude or more improvement in performance per unit cost compared to cluster-based scientific computers built from the same underlying semiconductor and packaging technology. We expect this efficiency to arise from two innovations: stream architecture and advanced interconnection networks. Organizing the computation into streams and exploiting the resulting locality using a register hierarchy enables a stream architecture to reduce the memory bandwidth required by representative computations by an order of magnitude or more. Hence a processing node with a fixed memory bandwidth (which is expensive) can support an order of magnitude more arithmetic units (which are inexpensive). Because each node has much greater performance (128 GFLOPS in our current design) than a conventional microprocessor, a streaming supercomputer can achieve a given level of performance with fewer nodes, simplifying system management and increasing reliability. A 1-PFLOPS machine, for example, can be realized with just 8,192 nodes. A streaming scientific computer can be scaled from a $20K 2-TFLOPS workstation to a $20M 2-PFLOPS supercomputer.
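A quick back-of-envelope check of the configurations quoted above. The 128 GFLOPS per-node rate comes from the abstract; rounding the node count up to a power of two is an assumption about how the quoted machine sizes are chosen.

```python
import math

NODE_FLOPS = 128e9  # 128 GFLOPS per node, as quoted in the abstract

def nodes_for(target_flops, node_flops=NODE_FLOPS):
    """Nodes needed to reach a target aggregate rate, rounded up to a
    power of two (an assumption about configuration sizing)."""
    n = math.ceil(target_flops / node_flops)
    return 1 << max(0, (n - 1).bit_length())

# 1 PFLOPS / 128 GFLOPS per node -> 8,192 nodes, matching the abstract.
```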
Bill Dally and his group have developed system architecture, network architecture, signaling, routing, and synchronization technology that can be found in most large parallel computers today. Highlights of his work include the MOSSIM Simulation Engine, the Torus Routing Chip which pioneered wormhole routing and virtual-channel flow control, and the J-Machine and the M-Machine experimental parallel computer systems. Bill is currently a Professor of Electrical Engineering and Computer Science at Stanford University where his group has developed the Imagine processor, which introduced the concepts of stream processing and partitioned register organizations. Bill has worked with Cray Research and Intel to incorporate many of these innovations into commercial parallel computers, with Avici Systems to incorporate this technology into Internet routers, and co-founded Velio Communications to commercialize high-speed signaling technology. He is a Fellow of the IEEE, a Fellow of the ACM and has received numerous honors including the ACM Maurice Wilkes award. He currently leads projects on high-speed signaling, computer architecture, and network architecture.
Thomas Puzak and Phil Emma
Exploring the Limits of Prefetching
In this talk, we formulate a new approach for evaluating any prefetching algorithm. We study the conditions under which prefetching can remove all the pipeline stalls due to cache misses. This approach involves an initial profiling of the application to identify all misses, as well as the corresponding locations in the program where prefetches for them can be initiated. Then we systematically control the number of misses that are prefetched, the timeliness of these prefetches, and the number of unused prefetches. We validate the accuracy of our method by comparing to a Markov prefetch algorithm. Hence, we can measure the potential benefit that any application can receive from prefetching, and we can analyze application behavior under conditions that cannot be explored with any known prefetching algorithm. Next, we analyze a system parameter that is vital to prefetching performance, the line transfer interval, which is the number of processor cycles required to transfer a cache line.
We show that under ideal conditions, prefetching can remove nearly all of the stalls associated with cache misses. Unfortunately, real processor implementations are far from ideal. In particular, the trend in processor frequency is outrunning the on-chip and off-chip bandwidths; today, it is not uncommon for the processor frequency to be three or four times the bus frequency. Under these conditions, we show that nearly all of the performance benefits derived from prefetching are eroded, and in many cases prefetching actually degrades performance. We present quantitative and qualitative analyses of these trade-offs, and show that there is a linear relationship between performance and three factors: the percentage of misses prefetched, the percentage of unused prefetches, and the available bandwidth. I'll close with a few ideas on related areas of research.
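The trade-offs described above can be illustrated with a toy analytic CPI model. The formula, parameter names, and numbers are illustrative assumptions, not the talk's actual methodology; the point is only that prefetch coverage helps while unused prefetches and a long line transfer interval consume bus bandwidth and can erase the gain.

```python
def cpi_with_prefetch(cpi_core, misses_per_inst, miss_penalty,
                      frac_prefetched, frac_unused, line_transfer_interval):
    """Toy CPI model: prefetched misses avoid the full miss penalty, but
    every prefetch issued (used or not) occupies the bus for one line
    transfer interval, measured in processor cycles."""
    # Stall cycles from the misses that were not prefetched.
    stall = misses_per_inst * (1 - frac_prefetched) * miss_penalty
    # Prefetches issued: useful prefetches inflated by the unused fraction.
    issued = misses_per_inst * frac_prefetched / max(1 - frac_unused, 1e-9)
    bus = issued * line_transfer_interval
    return cpi_core + stall + bus
```

With a short transfer interval, full coverage wins; when the interval grows to the order of the miss penalty (the frequency-vs-bus-speed scenario in the abstract), prefetching everything is worse than prefetching nothing.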
Thomas R. Puzak received a B.S. in Mathematics and an M.S. in Computer Science from the University of Pittsburgh, and a Ph.D. in Electrical and Computer Engineering from the University of Massachusetts. Since joining IBM in 1970, he has spent over twenty years working in IBM Research. While at IBM he has received Outstanding Achievement, Contribution, and Innovation Awards, served as Chairman of the Computer Architecture Special Interest Group at the T. J. Watson Research Center, and holds 24 patents in computer architecture, on topics concerning branch prediction, pipeline structure, and memory hierarchy design.
Kevin Skadron, University of Virginia
Architecture-Level Thermal Management and Feedback Control: Challenges and Opportunities
In the first part of this talk I will describe why computer architects can and should be playing a role in helping to manage the growing problem of heat dissipation in microprocessors. I will then briefly outline how temperature can be conveniently simulated in conjunction with conventional architecture simulations (without the need for detailed implementation and layout information), review and compare some techniques that have recently been proposed for architectural thermal management, and describe our most recent work to understand how thermal design interacts with aging processes, specifically electromigration.
In the second part of this talk I will discuss the pros and cons of closed-loop feedback control. I will describe our experiences with feedback control in the areas of thermal management, cache-leakage management, and energy management via dynamic voltage scaling for MPEG playback and for web servers with quality-of-service constraints. I will discuss why feedback control offers compelling benefits for managing adaptive architecture mechanisms but also presents difficult challenges to be solved before it can become widely applicable.
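As a minimal illustration of closed-loop feedback control in this setting, the sketch below throttles frequency against a thermal setpoint with a proportional controller. The gain, frequency limits, and temperatures are hypothetical, not taken from the talk, which also discusses the harder questions (stability, tuning, actuator granularity) that a one-line controller glosses over.

```python
def dvfs_step(temp_c, setpoint_c, freq, kp=0.05, fmin=0.5, fmax=1.0):
    """One step of a proportional controller for dynamic voltage/frequency
    scaling: lower the normalized frequency when temperature exceeds the
    setpoint, raise it when there is thermal headroom, clamped to [fmin, fmax]."""
    error = setpoint_c - temp_c          # positive error -> headroom
    freq = freq + kp * error             # proportional adjustment
    return min(fmax, max(fmin, freq))
```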
Kevin Skadron is an assistant professor in the Department of Computer Science at the University of Virginia. He received his PhD in Computer Science from Princeton University, and bachelor's degrees in Electrical and Computer Engineering as well as Economics from Rice University. At U.Va., he directs the Laboratory for Computer Architecture at Virginia (LAVA), studying power and thermal issues, branch prediction, and techniques for fast and accurate microprocessor simulation. Skadron's research group recently released "HotSpot", a tool for dynamically modeling localized on-chip temperatures in conjunction with architecture simulations; "HotLeakage", a tool for dynamically modeling leakage energy in memory-like structures; and "MRRL", a tool for calculating the minimum warmup period needed to avoid cold-start bias in sampled simulation. Skadron will be program co-chair of the 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT), is general co-chair for the 2004 International Symposium on Microarchitecture (MICRO), was general co-chair for PACT-2002, and with Yale Patt helped to launch Computer Architecture Letters, a new short-format, refereed journal published by the IEEE Computer Society TCCA.
David Patterson, University of California, Berkeley
Latency Lags Bandwidth
As I review performance trends, I am struck by a consistent theme across four different technologies (disks, networks, memories, and processors): bandwidth improves much more quickly than latency. A rule of thumb quantifies the imbalance: in the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4.
This talk lists a half-dozen performance milestones to document this observation, many reasons why it happens, a few ways to cope with it, and a small example of how you might design a system differently if you kept this rule of thumb in mind.
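The rule of thumb above can be turned into a one-line projection; 1.3 is used here as the midpoint of the quoted 1.2-1.4 range, which is an assumption, not a figure from the talk.

```python
import math

def latency_improvement(bw_factor, per_doubling=1.3):
    """Latency improvement implied by the rule of thumb: for every doubling
    of bandwidth, latency improves by only ~1.2-1.4x (midpoint 1.3)."""
    doublings = math.log2(bw_factor)
    return per_doubling ** doublings

# A 1000x bandwidth gain (~10 doublings) thus implies only a ~14x latency
# gain, which is why latency increasingly dominates system design.
```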
David Patterson joined the faculty at the University of California at Berkeley in 1977, where he now holds the Pardee Chair of Computer Science. He is a member of the National Academy of Engineering and is a fellow of both the ACM (Association for Computing Machinery) and the IEEE (Institute of Electrical and Electronics Engineers).
He led the design and implementation of RISC I, likely the first VLSI Reduced Instruction Set Computer. This research became the foundation of the SPARC architecture, used by Sun Microsystems and others. He was a leader, along with Randy Katz, of the Redundant Arrays of Inexpensive Disks project (or RAID), which led to dependable storage systems from many companies. He is co-author of five books, including two with John Hennessy, who is now President of Stanford University. Patterson has been chair of the CS division at Berkeley, the ACM SIG in computer architecture, and the Computing Research Association. He is currently running for President of ACM.
His teaching has been honored by the ACM, the IEEE, and the University of California. Patterson shared the 1999 IEEE Reynold Johnson Information Storage Award with Randy Katz for the development of RAID and shared the 2000 IEEE von Neumann medal with John Hennessy for "creating a revolution in computer architecture through their exploration, popularization, and commercialization of architectural innovations."
Olivier Temam, University of Paris Sud/11
Exploring Spatial Computing and Programming
Due to the increasing complexity of superscalar processors, various forms of CMPs are emerging as a popular alternative. Two of the issues raised by such architectures are the necessity to extract enough parallelism, and the management of program+data placement on the chip space. Current compilers already have difficulties coping with complex superscalar or VLIW processors, so it is not obvious they will be able to address these two additional issues. As a long-term solution, we propose to investigate combined language/architecture approaches. The basic principle is to help the programmer convey more program semantics to the compiler, hopefully in a natural way, thereby simplifying the compiler's analysis task. To further relieve the compiler of the task of managing the hardware resource space, without increasing the burden on the architecture itself, we want to let the architecture self-organize computations using simple local control rules. In the talk, we will explore "spatial" computing and programming using a computing model that relies on these principles. While such principles have yet to be applied to realistic CMPs, at the end we will illustrate how some of them could be applied to existing architectures using SMTs, by considering the set of threads as a virtual resource space.
Olivier Temam is a professor at the University of Paris Sud/11 and is in charge of the INRIA Alchemy group. He received his PhD from the University of Rennes in France, was an assistant professor at the University of Versailles until 1999, and then joined the University of Paris Sud/11. His research interests include program optimization and architectures for high-performance processors.
Mark D. Hill, University of Wisconsin, Madison
Designing Commercial Servers and Token Coherence
The WISCONSIN MULTIFACET PROJECT (http://www.cs.wisc.edu/multifacet/), which I co-lead with David Wood, seeks to improve the multiprocessor servers that form the computational infrastructure for Internet web servers, databases, and other demanding applications. Work focuses on using the transistor bounty provided by Moore's Law to improve multiprocessor performance, cost, and fault tolerance, while also making these systems easier to design and program. This talk quickly summarizes our work and then focuses on one result, Token Coherence.
Most multiprocessor servers enforce a coherence invariant that permits each memory block to have either multiple read-only copies or a single writable copy (but never both at the same time). Current coherence protocols enforce this invariant indirectly via a subtle combination of local actions and request ordering restrictions. Unfortunately, emerging workload and technology trends reduce the attractiveness of these existing solutions.
We propose the TOKEN COHERENCE framework that directly enforces the coherence invariant by counting tokens (all a block's tokens to write; at least one token to read). This token-counting approach enables more obviously-correct protocols that do not rely on request ordering and can operate with alternative policies that seek to improve the performance of future multiprocessors. This work was selected for IEEE Micro's 2003 "Top Picks in Computer Architecture" (http://www.cs.wisc.edu/multifacet/papers/ieeemicro03_token.pdf).
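The token-counting invariant lends itself to a direct sketch. The class below is a minimal illustration of the counting rules only, not the actual Token Coherence protocol: it omits transient and persistent requests, token recreation, and the performance policies the paper layers on top.

```python
class TokenBlock:
    """Token counting for one memory block: a fixed number of tokens exists;
    a node needs at least one token to read and all tokens to write."""

    def __init__(self, total_tokens, home=0):
        self.total = total_tokens
        self.tokens = {home: total_tokens}  # home node starts with all tokens

    def transfer(self, src, dst, n):
        """Move n tokens between nodes; the total is conserved."""
        assert self.tokens.get(src, 0) >= n, "cannot send tokens you lack"
        self.tokens[src] = self.tokens.get(src, 0) - n
        self.tokens[dst] = self.tokens.get(dst, 0) + n

    def can_read(self, node):
        return self.tokens.get(node, 0) >= 1

    def can_write(self, node):
        return self.tokens.get(node, 0) == self.total
```

Because writing requires every token, a writer can never coexist with a reader elsewhere: the invariant holds by arithmetic, with no reliance on request ordering.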
Mark D. Hill (http://www.cs.wisc.edu/~markhill) is Professor in the Computer Sciences Department and the Electrical and Computer Engineering Department at the University of Wisconsin - Madison. Dr. Hill's research targets the memory systems of shared-memory multiprocessors and high-performance uniprocessors. He is perhaps best known for developing the 3C model of cache misses and for memory consistency model work. In 2000, Hill was named an IEEE Fellow for contributions to cache memory design and analysis. Hill has been at Wisconsin since 1988, with sabbatical years at Sun Microsystems (1995-96) and Universidad Politecnica de Catalunya (2002-03). Hill earned a Ph.D. from Berkeley in 1987 and a B.S.E. from Michigan in 1981.
Michael Rosenfield, IBM Austin Research Laboratory
2004 Global Technology Outlook (GTO)
The Global Technology Outlook (GTO) is IBM’s projection of the future for information technology (IT). It includes forecasts of software, hardware, and services technology trends, and ways in which those trends will come together to enable new uses and capabilities for the Information Technology industry. These new technologies have the potential to radically transform the performance and characteristics of tomorrow’s information processing systems and provide new business value to our customers.
Michael Rosenfield is currently Director of the Austin Research Lab, focusing on high-performance VLSI design and tools, system-level power analysis, and new system architectures. Previously, he was Senior Manager of VLSI Design and Architecture at the IBM T.J. Watson Research Center in Yorktown Heights, NY, where he and his team worked on high-performance microprocessor VLSI design for IBM's Server Group and Microelectronics division; on tools, methodologies, and commonality; and on power-aware microarchitecture, circuits/technology co-design, performance analysis, exploratory microarchitectures, and advanced compiler design. He has also held management positions at Research in parallel communication architecture and in advanced lithography. In 1993, he was the technical assistant to the Research VP of Systems, Technology, and Science. He has a Ph.D. and an M.S. from the University of California, Berkeley, and a B.S. in Physics from the University of Vermont.
Greg Ganger, Carnegie Mellon University
Human administration of storage is too difficult, with industry reports indicating absurd ratios like "one human needed for each 1-10TB" and "4-7 dollars spent on managing each dollar of storage." This dilemma pushes us to pursue "self-*" storage infrastructures: self-organizing, self-configuring, self-tuning, self-healing, self-managing systems of storage servers. Borrowing organizational ideas from corporate structure and technologies from AI and control systems, self-* storage should simplify storage administration, reduce system cost, increase system robustness, and simplify system construction. This talk will overview our plans, approaches, and early progress in building a large-scale prototype for real deployment.
Greg Ganger is a professor in the ECE department at Carnegie Mellon University. He has broad research interests in computer systems, including storage systems, security, and operating systems. Some ongoing projects explore self-securing devices, storage survivability, more expressive storage interfaces, MEMS-based storage, and of course self-* storage. Greg is the Director of CMU's Parallel Data Lab, academia's premiere storage systems research center. His Ph.D. in Computer Science and Engineering is from The University of Michigan, and he spent 2.5 years as a postdoc at MIT.
Sanjay J. Patel, University of Illinois at Urbana-Champaign
Transient Errors in High-Performance Microprocessors
Due to a confluence of factors, transient (aka soft) error rates in future-generation digital systems (microprocessors, for example) are forecast to rise roughly in proportion to transistor count. As devices reach densities of 1 billion transistors per chip, soft errors due to high-energy neutrons alone might strike each chip as often as once every 1,000 hours. Other factors such as power supply noise, thermal noise, substrate noise, coupling, and other forms of external and internal noise combine additively to raise the mean soft error rate to dangerous levels, particularly for critical systems that demand high reliability.
The expected increases in error rate will drive architectural design, and have caused a flurry of renewed activity in error-tolerant microarchitecture. Sound approaches to error tolerance will balance performance and power overheads against error coverage. In this talk, I will present some results of our bottom-up analysis of how transient errors affect software running on a modern high-performance processor. For our experimentation, we performed statistical fault injection on a detailed, latch-level model of a contemporary superscalar processor, on par with the Alpha 21264 or AMD Athlon, and observed the behavior of the processor after fault injection. Few faults actually cause architecturally visible errors, and most of these harmful faults can be caught using simple, local detection mechanisms that flush corrupted state using the pipeline's misprediction recovery logic. I will also present some results on how errors that escape from the microarchitectural layer corrupt software state. We find some interesting and unintentional sources of redundancy, particularly involving control flow, that arise from the way we write and compile code.
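A minimal sketch of the statistical-fault-injection idea, assuming a flat space of latches and an oracle that classifies a flipped latch as architecturally visible. Both assumptions are drastic simplifications of the latch-level simulation infrastructure described above; the sketch only shows how the harmful-fault fraction is estimated by sampling.

```python
import random

def estimate_harmful_fraction(n_latches, is_harmful, trials, seed=0):
    """Flip one uniformly chosen latch per trial (modeling a transient
    particle strike) and record how often the oracle deems the resulting
    state architecturally visible. Returns the estimated harmful fraction."""
    rng = random.Random(seed)           # seeded for reproducible experiments
    harmful = 0
    for _ in range(trials):
        latch = rng.randrange(n_latches)
        if is_harmful(latch):
            harmful += 1
    return harmful / trials
```

In a real campaign the oracle is replaced by running the latch-level model forward after the flip and comparing architectural state against a golden run; the finding quoted above is that this fraction is small.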
Sanjay J. Patel is an Assistant Professor of Electrical and Computer Engineering and Willett Faculty Scholar at the University of Illinois at Urbana-Champaign. Sanjay is co-author (with Prof. Yale N. Patt of the University of Texas at Austin) of an introductory textbook for computer science and engineering students, titled "Introduction to Computing Systems: From bits and gates to C and beyond", now available in its second edition. His research interests include processor microarchitecture, computer architecture, and high-performance and reliable computer systems. In particular, his research group investigates high-performance and error-tolerant processor architectures for the 7 to 10 year time horizon. He received a BS, MS, and PhD from the University of Michigan, has done hardware verification, logic design, and performance modeling at Digital Equipment Corporation, Intel, and HAL Computer Systems, and has consulted for Transmeta, the Jet Propulsion Laboratory, HAL, Intel, and others.
Christos Kozyrakis
Transactional Coherence & Consistency
With uniprocessor systems running into ILP limits and fundamental VLSI constraints, parallel architectures provide a realistic path towards scalable performance. Nevertheless, shared memory multiprocessors are neither simple to design nor easy to program. Transactional Coherence and Consistency (TCC) is a new model for shared memory systems with the potential to address both issues. TCC relies on user-defined, lightweight transactions as the basic unit of parallel work, communication, memory coherence, memory consistency, and error recovery. TCC simplifies shared memory hardware design by eliminating the need for cache line ownership tracking in the cache coherence protocol. It also removes the need for small, low-latency cache coherence messages. TCC simplifies parallel programming by eliminating the need for manual orchestration of parallelism using locks. The use of a single abstraction for parallelism, communication, and synchronization makes it easy for the programmer to identify and remove performance bottlenecks. This talk will introduce the hardware and software aspects of TCC and provide an initial evaluation of its potential as a shared memory model.
Christos Kozyrakis is an assistant professor of Electrical Engineering and Computer Science at Stanford University. He holds a B.S. from the University of Crete in Greece and a Ph.D. from the University of California at Berkeley. Christos' research focuses on architecture, compilation, and programming models for parallel systems.