Building Buggy Chips - That Work!
Building a high-performance microprocessor presents many reliability challenges. Designers must verify the correctness of large complex systems and construct implementations that work reliably in varied (and occasionally adverse) operating conditions. Failure to meet these challenges can result in serious repercussions, ranging from disgruntled users to financial damage to loss of life. In this talk, I will describe a novel design strategy, called "dynamic verification", that works to reduce the burden of correctness on complex systems. The approach creates minimally correct systems that are capable of tolerating most permanent (e.g., design errors) and transient (e.g., particle strikes) faults. I will also detail ongoing work that suggests dynamic verification could render other valuable benefits such as reduced time-to-market, decreased power consumption, and improved performance.
Todd Austin is an Assistant Professor of Computer Science and Engineering at the University of Michigan in Ann Arbor. His research interests include computer architecture, compilers, computer system verification, and performance analysis tools and techniques. Prior to joining academia, Todd was a Senior Computer Architect in Intel's Microcomputer Research Labs, a product-oriented research laboratory in Hillsboro, Oregon. Todd is the first to take credit (but the last to accept blame) for creating the SimpleScalar tool set, a popular collection of computer system performance analysis tools. Todd received his Ph.D. in Computer Science from the University of Wisconsin in 1996.
Chalmers University of Technology
Limits on Speculative Module-level Parallelism in Imperative and Object-oriented Programs on CMP Platforms
In this work, we consider program modules, e.g. procedures, functions, and methods as the basic method to exploit speculative parallelism in existing codes. We analyze how much inherent and exploitable parallelism exist in a set of C and Java programs on a set of chip-multiprocessor architecture models, and identify what inherent program features, as well as architectural deficiencies, that limit the speedup. Our data complement previous limit studies by indicating that the programming style -- object-oriented versus imperative -- does not seem to have any noticeable impact on the achievable speedup. Further, we show that as few as eight processors are enough to exploit all of the inherent parallelism. However, memory-level data dependence resolution and thread management mechanisms of recent CMP proposals may impose overheads that severely limit the speedup obtained.
PER STENSTROM is a Professor of Computer Engineering and Vice-dean of the School of Electrical and Computer Engineering at Chalmers University of Technology, Sweden. His research interests are in computer architecture with a focus on multiprocessor systems. He has been engaged in parallel architecture projects as a visiting scientist at Carnegie-Mellon, at Stanford, and at the University of Southern California.
He has served on numerous program committees of computer architecture and parallel processing conferences and is on the editorial board of JPDC and IEEE Trans. on Computers. He was the General Chair of the 28th ACM/IEEE Int. Symp. on Computer Architecture.
For more information about Stenstrom's activities please refer to URL http://www.ce.chalmers.se/~pers.
Although data memory is logically a linear array of cells, its realization in hardware can be viewed as a multi-dimensional array. Given a dimension in this array, alignment analysis will identify memory locations that always resolve to a single value in that dimension. For example, if the dimension of interest is memory banks, alignment analysis will identify if a memory reference always accesses the same bank. In this talk, we will describe the compiler analyses for extracting alignment information. We also discuss program transformations that can drastically improve the effectiveness of the analysis. We will demonstrate the practicality and effectiveness of these analysis and optimizations by presenting results on SPEC95fp and MediaBench benchmark suites.
Alignment information is useful in a variety of compiler-controlled memory optimizations leading to improvements in programmability, performance, and energy consumption. We will show how alignment analysis is critical in compiling for clustered architectures in general and specifically for the Raw architecture.
Saman Amarasinghe is an Associate Professor in the Department of Electrical Engineering and Computer Science at Massachusetts Institute of Technology. He leads the commit group and is co-leader of the MIT Raw project. He is interested in discovering novel approaches to improving the performance of modern computer systems without unduly increasing the complexity faced by either application developers, compiler writers or computer architects. Saman received his BS in Electrical Engineering and Computer Science from Cornell University, and his MSEE and Ph.D from Stanford University.
Universitat Politecnica de Catalunya
The Post-Superscalar Era
Current superscalar organizations are approaching a point of diminishing returns. Some of the main hurdles for the scalability of such organizations are their increasing complexity, the limited parallelism of some applications, the increasing memory to processor gap, and problems related to deep sub-micron technologies such as power consumption, power density and wire delays.
This talk will outline our research on future microarchitectures and code generation techniques to address the above problems. Our work focuses on the design of clustered, multithreaded, power-aware microarchitectures and effective code generation techniques for them. After an overview of the different activities in this project, the talk will emphasize the task that focuses on the support for speculative multithreading.
Multithreaded/multicore processors will soon be a mainstream technology. Multithreaded/multicore architectures are effective for throuhput but single thread performance will continue to be important in many scenarios. On the other hand, we know that for some non-numeric applications (e.g., SpecInt95), performance is mainly constrained by the dependences among instructions. benefits of speculation are significantly limited by the sequentiality of the branch prediction process. As soon as a branch is mispredicted, all instructions after it become useless.
A speculative multithreaded execution model is much more effective at exploting speculative execution. The idea is to spawn speculative threads that start at future points of the dynamic instruction stream. Ideally, a spawned thread should start a computation that is sure to be required and is independent of the spawning thread. However, this is too restrictive for non-numeric applications. On the other hand, it may be feasible to identify future threads that are likely to be executed and have few dependences with the parent thread. Besides, if the values that flow through the data dependences among threads can be predicted, these threads could be executed as if they were independent.
In this work we investigate microarchitectural support for speculative multithreaded processors. Speculative state management, data and control speculation and effective thread-spawning schemes are the main focus of our research in this task.
Antonio Gonzalez has been a faculty member of the Computer Architecture Department of the Universitat Politecnica de Catalunya at Barcelona (Spain) since 1986. His research interests center on computer architecture, compilers and parallel processing, with a special emphasis on processor microarchitecture, memory hierarchy and instruction scheduling. He has published over 100 papers on these topics in international journals and symposia and has served in numerous program committees for international symposia. He has conducted research funded by national and international agencies and several private companies such as Analog Devices, IBM, Intel and ST Microelectronics.
For more information and recent publications, please refer to the Web page http://people.ac.upc.es/antonio
University of Wisconsin-Madison
Towards Critical-Path Instruction Processing
Although some instructions hurt performance more than others, processors apply scheduling and speculation as if each instruction was equally costly. The natural method for understanding the costs of parallel instructions is the critical path: if we could predict it, we could replace egalitarian policies with cost-sensitive strategies that will grow increasingly effective as processors become more sophisticated.
This talk will present our technology for critical-path instruction processing. I will start by demystifying the critical path of a microexecution, and follow by describing a predictor that computes the critical path very efficiently, without actually building a dependence graph. I will then focus on the multiple uses of the critical-path predictor in a modern processor, and show that it can improve instruction scheduling by up to 20%, without requiring any major changes to the existing processor designs.
(joint work with Brian Fields and Shai Rubin).
Ras Bodik teaches and explores computer science at the University of Wisconsin-Madison. While his prior training has focused strictly on compilers and program analysis, his current research is applying compiler-optimization tricks to microarchitecture, dynamic program optimization, operating systems, and debugging. www.cs.wisc.edu/~bodik
University of California, San Diego
Living in a Post-SMT World
Simultaneous Multithreading (SMT) architectures, in which multiple programs, or threads, compete for all execution resources every cycle, are poised to begin appearing in mainstream commodity processors. In this talk, we'll briefly recap the SMT research, then we'll examine how this is going to change the way compilers work, the way operating systems work, and the direction of future architectures. Most of our emphasis will be on the many new architectural opportunities that arise when you start with a multithreading processor as a foundation.
Dean Tullsen is an assistant professor at UC San Diego, where he co-directs the High-Performance Processor Architecture and Compilation Lab. He received his PhD from the University of Washington in 1996, where he began his research on simultaneous multithreading. He holds two patents on SMT, and has received an NSF Career award and a Charles Lee Powell faculty fellowship.
University of Rochester
Dynamic Tuning of On-Chip Regular Structures in General-Purpose Processor Architectures
Today's general-purpose dynamic superscalar processors use large on-chip regular structures, such as the register file and caches, in order to exploit the available instruction-level parallelism and improve performance. These structures are often designed in order to maximize the performance of the average application. There is, however, a trade-off between the size of these structures and their speed and power consumption. The best design may differ on a per application and even a per phase basis.
In this talk, I will present some of the techniques we have developed to improve the utilization of cycle time critical structures such as the register file, as well as to leverage technology trends in order to effect speed and power trade-offs dynamically. First, I will present a novel microarchitecture designed to overcome the limitations of a register file size dictated by cycle time constraints. Next, I will describe a tunable cache and TLB (translation lookaside buffer) hierarchy that leverages repeater insertion to provide dynamic low-cost configurability. This hierarchy is controlled by a novel configuration management algorithm that trades the sizes of the cache and TLB for their speed and power consumption on a per application phase basis. Finally, I will present a glimpse of our efforts to apply these techniques globally across the chip for dynamic scaling of voltage, frequency, and size for reduced energy consumption.
Sandhya Dwarkadas is an Associate Professor of Computer Science and of Electrical and Computer Engineering at the University of Rochester. She received her Bachelor's degree from Indian Institute of Technology, Madras, India, in 1986 and her Ph.D. in Electrical and Computer Engineering from Rice University in 1993, where she continued as a research scientist in the Computer Science department until 1996. Her research interests are in compiler/architecture/run-time interaction, computer architecture, parallel and distributed systems, and performance evaluation. Dr. Dwarkadas has been the recipient of an NSF CISE Experimental Science Postdoctoral Fellowship (1993-1995), and an NSF CAREER Award for junior faculty (1997-2000).
University of Utah
The Impulse Smart Memory Controller
Deep cache hierarchies and the latency-tolerating features of modern superscalar microprocessors hide the increasing relative latency of main memory for many applications. However, applications with poor spatial or temporal locality have low cache and TLB hit rates, and thus suffer significant performance degradation due to the cost of main memory accesses. Such applications often have nearly random access patterns, and thus cannot be easily optimized for locality using static compiler techniques.
The Impulse memory controller is designed to help overcome this problem by enabling applications to create spatial locality where none exists naturally by gathering non-contiguous data into dense cache lines. In essence, Impulse provides gather memory operations for conventional (non-vector) superscalar microprocessors. In addition to its fine-grained gather capabilities, the Impulse memory controller allows applications to remap discontiguous pages into contiguous regions of physical address space, thereby enabling inexpensive superpage creation and page coloring.
In this talk, we first describe the design and implementation of the Impulse memory controller, and then present the performance of a variety of benchmark programs that exploit Impulse. Typical performance improvements for applications with poor memory locality range from a 1.2X to over 5X. We have implemented a synthesizable Verilog verion of the Impulse memory controller, and are currently testing and tuning our design in a large Xilinx FPGA. We are concurrently mapping the design to a 0.18um TSMC ASIC. Our design will contain roughly 1M transistors (approximately 100kbits of dual-port SRAM, plus other datapath and control logic) and have approximately 650 pins.
John Carter is an Associate Professor of Computer Science in the School of Computing at the University of Utah. His research interests include computer architecture, operating systems, parallel and distributed computing, and networking. Of particular interest are memory system designs, both hardware and software, and distributed systems.
Hewlett Packard Laboratories
The Cambridge DELI: A New Control Point in the Computation Chain
Joint work with Giuseppe Desoli, Paolo Faraboschi, Josh Fisher and Nicolay Mateev
Caching, linking and then executing a private copy of executable code is a technique that has been used to improve performance in many different environments. For example, caching and linking has been employed to avoid the repetition of work done to produce the executable itself (in dynamic binary translators) or to permit performance-enhancing transformations to the private copy (in dynamic optimizers).
The Cambridge DELI is a result of recognizing the inherent value of this technique. Rather than just using it in "point implementations", we have extracted the underlying control functionality for caching and linking and have opened it up to the OS and to higher layers of system software and applications. To emphasize the novel aspect of this facility as a control point or service layer we call it a "Dynamic Execution Layer Interface", or DELI.
By providing a new control layer in the computation chain, the Cambridge DELI enables many new capabilities that are not practical at higher levels such as in the compiler or linker. Using the explicit interface of the DELI, we are able to transparently take over control of the running applications, enabling such services as translation, optimization, sandboxing, code patching, hardware virtualization, remote code and data streaming, and more. Despite the great variety of these capabilities, we leverage the investment of building a single well-designed caching and linking mechanism across these many different uses.
In this talk we discuss the design of the DELI infrastructure, illustrate several DELI applications, and report on early experimental results.
University of Wisconsin-Madison
Bandwidth Adaptive Snooping
Database and web server workloads are an increasingly important use of hardware shared memory multiprocessors. In these workloads, cache misses caused by data sharing between processors are frequent and can significantly reduce system performance. Some multiprocessor designs optimize for sharing misses by employing a class of cache coherence protocols known as snooping protocols. However, snooping protocols use excessive bandwidth in large systems. Currently, the only approach for building large systems is to sacrifice sharing-miss performance and employ a bandwidth-efficient scalable directory protocol.
I will present Bandwidth Adaptive Snooping, a hybrid protocol that provides the best characteristics of both snooping and directory protocols by dynamically adapting to available bandwidth. Our hybrid protocol behaves like snooping (by broadcasting requests) when bandwidth is plentiful and behaves like a directory protocol (by unicasting requests) when bandwidth is limited. Moreover, by selectively broadcasting requests, the hybrid protocol outperforms both snooping and directory protocols in the mid-range where available bandwidth makes a static choice of either protocol inefficient.
Milo Martin is a PhD candidate and member of the Wisconsin Multifacet Project (http://www.cs.wisc.edu/multifacet/) at the University of Wisconsin-Madison. His research interests include memory system performance of commercial workloads, techniques to improve multiprocessor cache coherence, and use of dynamic feedback to build adaptive and robust systems.
University of Michigan
A Scalable Instruction Queue Design Using Dependence Chains
Increasing the number of instruction queue (IQ) entries in a dynamically scheduled processor exposes more instruction-level parallelism, leading to higher performance. However, increasing a conventional IQ's physical size leads to larger latencies and slower clock speeds. We introduce a new IQ design that divides a large queue into small segments, which can be clocked at high frequencies. We use dynamic dependence-based scheduling to promote instructions from segment to segment until they reach a small issue buffer. Our segmented IQ is designed specifically to accommodate variable-latency instructions such as loads. Despite its roughly similar circuit complexity, simulation results indicate that our segmented instruction queue with 512 entries improves performance by up to 69% over a 32-entry conventional instruction queue for SpecINT 2000 benchmarks, and up to 398% for SpecFP 2000 benchmarks. The segmented IQ achieves from 55% to 98% of the performance of a monolithic 512-entry queue while providing the potential for much higher clock speeds.
Steven K. Reinhardt is an Assistant Professor of Electrical Engineering and Computer Science at the University of Michigan in Ann Arbor. His primary research interest is in computer system architecture, focusing on uniprocessor and multiprocessor memory systems, multithreaded systems, and operating system/architecture interactions. He received the B.S. degree in Electrical Engineering from Case Western Reserve University in 1987, the M.S. degree in Electrical Engineering from Stanford University in 1988, and the Ph.D. degree in Computer Science from the University of Wisconsin-Madison in 1996. He is the recipient of a 2001 Sloan Research Fellowship and a 1998 NSF CAREER award. Prior to receiving his Ph.D., he held positions at Bellcore and Data General.