Steven K. Reinhardt
University of Michigan, Ann Arbor
Prefetching and Caching Strategies for Modern Memory Systems
Thanks to technology advances, high-end microprocessors of the near future will include large on-chip caches and direct high-bandwidth memory interfaces. Our research concerns how best to exploit these features to counter the ever-increasing impact of memory latency on system performance. Current work focuses on two areas.
Steven K. Reinhardt is an Assistant Professor of Electrical Engineering and Computer Science at the University of Michigan in Ann Arbor. His primary research interest is in computer system architecture, focusing on uniprocessor and multiprocessor memory systems, operating system/architecture interactions, and system simulation techniques. He received the B.S. degree in Electrical Engineering from Case Western Reserve University in 1987, the M.S. degree in Electrical Engineering from Stanford University in 1988, and the Ph.D. degree in Computer Science from the University of Wisconsin-Madison in 1996. Prior to receiving his Ph.D., he held positions at Bellcore and Data General.
Intel Texas Development Center
VLSI: Is it all about integration and performance? Trends and Directions
Integration and performance are the driving forces of VLSI technology. The anticipated trends in performance and in integration (Moore's law) continue to drive the computing industry. The impact of the market, the performance requirements, and the integration cost forces will be described, with emphasis on the new microarchitecture opportunities. Two major integration spaces will be presented: the WAY we integrate (glue integration vs. cohesive integration techniques), and WHAT we integrate (fixed function vs. general-purpose function). The characteristics and ingredients of each option will be described, together with some examples. Microprocessor performance was, and probably will remain, one of the driving forces of the microprocessor industry. The presentation will cover some ideas that will enable reaching the required performance. The VLSI performance trends and directions, together with future integration and future VLSI architecture concepts, will conclude the talk.
Dr. Uri Weiser received his B.Sc. and M.Sc. EE degrees from the Technion, Israel (1970, 1975), and his Ph.D. in CS from the University of Utah (1981). Uri joined the Israeli Department of Defense in 1970 to work on super-fast analog feedback amplifiers. Later, at National Semiconductor, he led the design of the NS32532 microprocessor. Since 1988 Uri has been with Intel, leading various architecture activities such as Pentium feasibility studies, the definition of a new multimedia architecture (MMX technology), and x86 microarchitecture research; lately he is a Co-Director of Intel's Development Center in Austin, Texas. Uri has been an Intel Fellow since 1997; he holds an adjunct professor position at the Technion and is an Associate Editor of IEEE Micro magazine.
What computer architects should know about VLSI scaling
As architectural complexity grows and clock cycle times shrink, VLSI implementation constraints are becoming an increasingly important issue for processor research. Unfortunately, the capabilities and constraints of the underlying technology are often misunderstood, especially when it comes to the relative speed of gates and wires. At first glance, the future of wires in integrated circuit technologies looks grim. Even projections with copper interconnects and low-k dielectrics show that the delay of a fixed-length wire will increase at a rate that is greater than linear in the scaling factor. This has led to a number of papers that have predicted the demise of the world as we know it.
This talk examines historical processor performance scaling and relates the performance gains to changes in architecture, circuit design, and technology. It then examines gate and wire scaling to get a better idea of what will happen in the future. The results are a little surprising. If an existing circuit is scaled to a new technology, the relative change in the speed of wires versus the speed of gates is modest. Depending on the assumptions about transistor performance under scaling, low-k dielectrics, and higher-aspect-ratio wires, the ratio is close to one. That does not mean scaling is without problems. The two main challenges a designer faces are the decreasing number of gates allowed in each clock cycle, and the delay of the global wires in the machine. The latter are the wires that don't scale in length as the technology shrinks, because the machine has become more complicated. These wires are a problem.
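The distinction between scaled and fixed-length wires can be made concrete with a toy first-order RC model; the parameter values below are illustrative, not from any real process.

```python
# First-order RC model of wire delay under technology scaling.
# rho approximates copper resistivity; capacitance per unit length is
# a rough assumed constant. All numbers are illustrative.

def wire_delay(length, width, thickness, rho=1.7e-8, c_per_m=2e-10):
    """Elmore-style distributed-RC delay, 0.5 * R_total * C_total."""
    r = rho * length / (width * thickness)   # total wire resistance
    c = c_per_m * length                     # total wire capacitance
    return 0.5 * r * c

s = 0.7  # one process generation: drawn dimensions shrink by ~0.7x

d_old = wire_delay(100e-6, 0.5e-6, 0.5e-6)

# A *scaled* wire shrinks in length along with the circuit:
# R rises by 1/s but C falls by s, so the delay is unchanged.
d_local = wire_delay(100e-6 * s, 0.5e-6 * s, 0.5e-6 * s)

# A *global* wire keeps its length (the machine does not shrink):
# R rises by 1/s^2 while C is unchanged, so delay grows by 1/s^2.
d_global = wire_delay(100e-6, 0.5e-6 * s, 0.5e-6 * s)

print(d_local / d_old)    # ~1.0: scaled wires keep pace
print(d_global / d_old)   # ~2.04: fixed-length wires are the problem
```

Since gate delay also improves by roughly s per generation, the wire-to-gate ratio for scaled wires changes only modestly; it is the non-scaling global wires that lose ground.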
Mark Horowitz is the Yahoo Founder's Professor of Electrical Engineering and Computer Science at Stanford University. He received his BS and MS in Electrical Engineering from MIT in 1978, and his PhD from Stanford in 1984. In 1990 he took leave from Stanford to help start Rambus Inc, a company designing high-bandwidth memory interface technology. His current research includes multiprocessor design, low power circuits, memory design, and high-speed links.
The University of North Carolina at Chapel Hill
Fast Tree-Structured Computations and Memory Hierarchies
Fast tree-structured computations employing techniques such as fast multipole, multigrid, and wavelets represent a class of problems whose behavior in memory hierarchies is not completely understood. We present our work in the TUNE project in understanding the impact that cache-conscious algorithm design and non-traditional data layouts can have on the performance of such computations in modern memory hierarchies.
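One simple instance of a non-traditional, cache-conscious data layout is the implicit array layout of a complete binary tree; this is only a toy illustration in the same spirit, not one of the TUNE layouts themselves.

```python
# Implicit array layout of a complete binary tree: node i's children
# live at indices 2*i+1 and 2*i+2. There are no pointers to chase,
# and each tree level is contiguous in memory, which improves spatial
# locality over a heap of separately allocated nodes. Toy example only.

def build_implicit_tree(depth):
    """A complete tree of the given depth, values in level order."""
    return list(range(2 ** depth - 1))

def subtree_sum(tree, i=0):
    """Recursive sum over the subtree rooted at index i."""
    if i >= len(tree):
        return 0
    return tree[i] + subtree_sum(tree, 2 * i + 1) + subtree_sum(tree, 2 * i + 2)

t = build_implicit_tree(4)   # 15 nodes holding the values 0..14
print(subtree_sum(t))        # 105
```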
Siddhartha Chatterjee is an associate professor of computer science at the University of North Carolina at Chapel Hill. He received his B.Tech. (Honors) in electronics and electrical communications engineering in 1985 from the Indian Institute of Technology, Kharagpur, and his Ph.D. in computer science in 1991 from Carnegie Mellon University. He was a visiting scientist at the Research Institute for Advanced Computer Science (RIACS) in Mountain View, California, from 1991 through 1994. His research interests include the design and implementation of programming languages and systems for high-performance scientific computations, the interaction of such software systems with high-performance architectures, and parallel algorithms and applications. He is an associate editor of ACM Transactions on Programming Languages and Systems.
IBM Austin Research Lab
An Introduction to Formal Verification and the FM9801 Project
Formal verification is increasingly important in the design process of microprocessors. Current formal verification techniques can be categorized into three groups: equivalence checking, model checking, and theorem prover-based approaches. This talk first gives a brief introduction to these techniques.
Then we discuss a theorem prover-based verification project in detail. The FM9801 is a pipelined microprocessor with a number of performance-oriented features: out-of-order issue and completion of instructions using Tomasulo's algorithm, speculative execution with branch prediction, and so on. Using the ACL2 theorem prover, we formally verified the correctness of the FM9801. Correctness is defined as a commutative diagram relating sequential and pipelined execution. This talk is based on the speaker's dissertation research.
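The shape of that commutative-diagram statement can be illustrated on a made-up two-stage machine. The sketch below checks the diagram by brute force on one program in Python, whereas ACL2 proves it once and for all over all programs and states.

```python
# Toy illustration of a commutative-diagram correctness statement for a
# pipelined machine. The accumulator ISA here is invented for the
# example; it is not the FM9801.

def isa_step(state, prog):
    """Sequential (specification) semantics: one instruction per step."""
    pc, acc = state
    op, arg = prog[pc]
    if op == "add":
        return (pc + 1, acc + arg)
    if op == "mul":
        return (pc + 1, acc * arg)
    raise ValueError(op)

def pipe_step(state, prog):
    """A two-stage fetch/execute pipeline over the same program."""
    pc, acc, fetched = state
    # execute stage consumes the previously fetched instruction
    if fetched is not None:
        op, arg = fetched
        acc = acc + arg if op == "add" else acc * arg
    # fetch stage
    fetched = prog[pc] if pc < len(prog) else None
    return (pc + 1 if pc < len(prog) else pc, acc, fetched)

def project(pipe_state):
    """Map a *flushed* pipeline state down to an ISA state."""
    pc, acc, fetched = pipe_state
    assert fetched is None, "projection only defined on flushed states"
    return (pc, acc)

prog = [("add", 3), ("mul", 4), ("add", 5)]

# Run the pipeline to completion from a flushed initial state...
ps = (0, 0, None)
for _ in range(len(prog) + 1):       # one extra step drains the pipe
    ps = pipe_step(ps, prog)

# ...and the ISA machine over the same instructions.
ss = (0, 0)
for _ in range(len(prog)):
    ss = isa_step(ss, prog)

# The diagram commutes: projecting the pipelined run gives the ISA run.
print(project(ps) == ss)             # True
```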
Jun Sawada completed his doctoral program in the UTCS department in December 1999 and is currently working at the IBM Austin Research Laboratory.
Department of Computer Science & Engineering
The Pennsylvania State University
Communication and Scheduling for a Multiprogrammed Cluster
Clusters built with off-the-shelf workstations and networking hardware have become a popular platform for meeting the needs of demanding applications. The advent of high-speed networks such as Myrinet and Gigabit Ethernet, network interfaces (such as VIA), and user-level messaging layers has helped lower the cost of communicating between the nodes of a cluster. Previous research on clusters has primarily focused on systems support and performance improvements from the viewpoint of a single application/user. However, clusters are increasingly being considered and deployed in multiprogrammed environments, where it is important to provide high responsiveness, throughput, and other Quality-of-Service (QoS) guarantees to each individual application.
In this talk, I will cover a spectrum of closely intertwined communication and scheduling issues for a multiprogrammed cluster. This will include: (a) support for differentiated traffic types in cluster networks; (b) network interface and communication software support for scalability and QoS; and (c) responsive scheduling mechanisms to coordinate activities across the nodes of a cluster. These issues have been investigated using extensive simulations and/or implementations on a cluster of Sun Ultra Enterprise servers, Myrinet hardware, and the Solaris operating system. Finally, I will summarize our ongoing related projects on clusters.
Anand Sivasubramaniam received his B.Tech in Computer Science from the Indian Institute of Technology, Madras, in 1989, and the MS and PhD degrees in Computer Science from the Georgia Institute of Technology in 1991 and 1995, respectively. Since Fall 1995 he has been an Assistant Professor in the Department of Computer Science and Engineering at Penn State. Anand's research interests are in architecture, operating systems, and performance evaluation, and in applying expertise on these topics to parallel and distributed computing, multimedia, spatial databases and geographical information systems, and resource-constrained computing. He has over thirty publications in reputed journals and conferences on these topics. His research has been funded by four NSF grants (including the CAREER award), the EPA, and industry, including IBM and Unisys Corp.
Wave Pipelined "Elastic" Interface on the POWER4 Chip
Memory busses have not kept up with processor speed increases. The POWER4 I/O design team attacked this area, resulting in the new "Wave-Pipelined Elastic Interface". This interface allows the system designer to decouple latency from bandwidth and maintain cycle synchronization to a reference target cycle. The end result is that all data busses on the POWER4 system have a 500+ MHz transfer rate, even for nets as long as 55 cm. This applies to all backside busses, not simply a "close" frontside cache 3 cm away. The techniques developed to achieve this will be discussed, and the upward frequency scalability will be addressed.
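A back-of-the-envelope calculation shows why such a net must be wave-pipelined: at the quoted rate and length, the signal's flight time exceeds one bit time, so more than one bit is in flight on the wire at once. The propagation speed below is an assumed value of roughly c/2 for a board trace.

```python
# Why a 55 cm net at 500 MHz needs wave pipelining (illustrative
# numbers; propagation speed on the board is an assumed ~c/2).

bit_rate = 500e6          # transfers per second
net_length = 0.55         # metres
v = 1.5e8                 # assumed signal propagation speed, m/s

bit_time = 1.0 / bit_rate             # 2 ns per transfer
flight_time = net_length / v          # ~3.67 ns end to end
bits_in_flight = flight_time / bit_time

print(bit_time * 1e9)       # 2.0 (ns)
print(flight_time * 1e9)    # ~3.67 (ns): latency exceeds one cycle,
print(bits_in_flight)       # so ~1.8 bits ride the wire simultaneously
```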
Dan Dreps received his BS in Electrical Engineering from Michigan State University in 1983. Since then, he has designed many chips in the areas of PLLs, high-speed I/O, and fiber-optic transducers. His area of interest is analog and custom digital circuit design. At present he is a member of the POWER4 circuit design team, with system responsibility for meeting the signalling requirements of the 500 MHz interconnects for the processor/nest chips in the AS/RS systems that contain the POWER4 gorilla. He holds 11 US patents, has ~25 pending, and is an IBM Server Division Master Inventor.
Department of EECS
The University of Michigan
Until recently, mainstream research in computer architecture has ignored the impact that architectural choices have on power consumption, focusing instead on performance. This talk will argue that the focus on performance will need to become less exclusive. Low power consumption is an obvious requirement for portable systems, but it is of growing importance for high-performance systems too, because of cost and physical limits. To date, power consumption has been controlled through supply voltage reduction and process shrinks. However, there is a limit to how far supply voltages can be reduced, and, largely due to frequency and density increases, the power dissipated on chip continues to increase even as process technology improves. Solutions must be found that reduce power at all levels of the design process. In particular, it is becoming important to develop ways to control power consumption at the computer architecture level too. We will discuss architectural techniques that have been proposed to date.
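The tension between voltage reduction and frequency/density growth follows from the standard first-order dynamic-power model, P_dyn = a * C * Vdd^2 * f. The design points below are illustrative numbers, not measurements from any real chip.

```python
# First-order dynamic power: activity factor a, switched capacitance C,
# supply voltage Vdd, clock frequency f. Illustrative numbers only.

def dynamic_power(activity, cap, vdd, freq):
    return activity * cap * vdd ** 2 * freq

base = dynamic_power(0.2, 30e-9, 2.5, 500e6)    # a nominal design point

# Lowering the supply voltage alone helps quadratically...
low_v = dynamic_power(0.2, 30e-9, 1.25, 500e6)

# ...but the next generation doubles frequency and switched capacitance
# while Vdd drops only to 1.8 V, so total power still rises.
next_gen = dynamic_power(0.2, 60e-9, 1.8, 1e9)

print(low_v / base)      # 0.25: half the voltage, a quarter of the power
print(next_gen / base)   # ~2.07: power roughly doubles anyway
```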
The tools developed by computer architects have reflected their focus on performance and are typified by the cycle simulator. This models the implementation of an architecture at the clock-cycle level, allowing the user to obtain fairly accurate projections of execution times. We will conclude the talk with a discussion of a similar architecture-level tool, PowerAnalyzer, that is being developed by groups at Michigan and Colorado. It augments the traditional cycle simulator, allowing architects to obtain power and performance trade-offs early in the design process.
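The general structure of such a tool can be sketched as a cycle simulator that charges an assumed per-access energy to each unit active in a cycle; the unit names and energy numbers below are hypothetical, not PowerAnalyzer's actual models.

```python
# Sketch of a cycle simulator augmented with power accounting, in the
# style the talk describes. Units and per-access energies are invented
# for illustration.

ENERGY = {"fetch": 0.4e-9, "decode": 0.2e-9, "alu": 0.6e-9, "dcache": 0.9e-9}

def simulate(trace, freq=500e6):
    """Each trace entry lists the units active in that cycle."""
    cycles = 0
    energy = 0.0
    for active_units in trace:
        cycles += 1
        for unit in active_units:
            energy += ENERGY[unit]       # charge each active unit
    time = cycles / freq
    return {"cycles": cycles, "energy_J": energy, "avg_power_W": energy / time}

trace = [["fetch", "decode"], ["fetch", "alu"], ["dcache"]]
stats = simulate(trace)
print(stats["cycles"])    # 3
```

The same run thus yields both an execution-time projection (cycles) and a power projection, which is what lets architects weigh the two against each other early in the design process.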
Sanjay Jeram Patel
Department of Electrical and Computer Engineering
University of Illinois, Urbana-Champaign
rePLay: A Framework for Dynamic Optimization
The use of run-time information is beginning to play a larger role in boosting processor performance. Techniques such as branch prediction, trace caches, value prediction, and instruction reuse all attempt to capitalize on stable patterns in the dynamic behavior of a program in order to reduce its running time.
In the same spirit, the rePLay Framework uses a program's run-time behavior to dynamically optimize its instruction stream. The key to dynamic (and compiler-based) optimization is identifying atomic (i.e., single-entry, single-exit) blocks of instructions upon which to perform optimizations. Just as a trace cache creates physically sequential traces out of logically sequential code, the rePLay Framework dynamically creates atomic regions called frames out of logically atomic code. The key to frame construction is converting easily predictable branches into assertions. These assert instructions check whether the original branching conditions still hold; they flush the frame and redirect the instruction stream if the conditions have changed. Frames are optimized by a hardware optimizer and stored in a frame cache. Fetching of frames is orchestrated by an instruction sequencer that detects when to fetch from the original instruction stream versus when to fetch from the frame cache.
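Frame construction can be sketched as follows; the bias threshold, block representation, and instruction names are illustrative simplifications, not the actual rePLay hardware.

```python
# Simplified sketch of rePLay-style frame construction: fuse consecutive
# basic blocks into one atomic frame, turning highly biased branches
# between them into assert instructions. Illustrative model only.

BIAS_THRESHOLD = 0.95   # branches this predictable become assertions

def build_frame(blocks, branch_bias):
    """blocks: list of (instructions, terminating_branch or None)."""
    frame = []
    for block, branch in blocks:
        frame.extend(block)
        if branch is None:              # last block: nothing to convert
            break
        if branch_bias.get(branch, 0.0) >= BIAS_THRESHOLD:
            # assertion: checks the usual outcome still holds; if it
            # fires, the whole frame is flushed and fetch is redirected
            frame.append(("assert", branch))
        else:
            break                        # unpredictable branch ends the frame
    return frame

blocks = [(["i1", "i2"], "br_a"), (["i3"], "br_b"), (["i4", "i5"], None)]
bias = {"br_a": 0.99, "br_b": 0.60}
print(build_frame(blocks, bias))
# ['i1', 'i2', ('assert', 'br_a'), 'i3']; br_b is too unpredictable
```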
In this talk, I will present our preliminary work on the rePLay Framework. Our initial analysis indicates that the mechanism for frame construction creates frames spanning on average 88 instructions, or over 9 basic blocks, with a probability of complete execution (i.e., no assertions firing) of 98%. These results indicate the strong potential for the rePLay Framework to improve performance.
Director of Microprocessor Research Labs, Intel
New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies
Over the last 15 years, CMOS scaling simplified the task of the microprocessor architect. With each new process technology, frequency increased by ~50% and transistor density increased by 100%. Also, improvements in manufacturing technology (larger wafers and higher yields) allowed for increasing die sizes without increasing cost. Die sizes of 1 square inch or more were common.
However, the end of these easy times is in sight, and several new challenges face the architect. Die size is no longer going to be limited by equipment or manufacturing cost, but rather by power. To date the approach has been to lower voltage with each process generation. But as voltage is lowered, leakage current and energy increase, contributing to higher power. The problems extend beyond power dissipation to power delivery/distribution and increasing power density.
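The leakage trend follows from the exponential subthreshold characteristic of the transistor: lowering the supply voltage usually drags the threshold voltage Vth down with it, and off-current grows by roughly a decade for every subthreshold-swing's worth of Vth reduction. A toy calculation with assumed constants:

```python
# Toy subthreshold-leakage model: off-current per device grows
# exponentially as the threshold voltage Vth is reduced, by one decade
# per s_mv millivolts. s_mv ~ 85 mV/decade is an assumed, typical
# room-temperature value, not a specific process number.

def leakage_current(vth, s_mv=85.0):
    """Relative subthreshold off-current for threshold voltage vth (V)."""
    return 10 ** (-vth * 1000.0 / s_mv)

# Chasing a lower supply voltage by dropping Vth from 0.45 V to 0.30 V:
ratio = leakage_current(0.30) / leakage_current(0.45)
print(ratio)   # ~58x more leakage per device
```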
This talk will first look at the historical trends of CMOS process technology in the context of past microprocessors. It will then look at the implications of continued CMOS scaling, as described above, and the new challenges they pose. Microarchitecture techniques that have exacerbated the power problem will also be covered. Finally, the talk will describe some of the microarchitecture directions that may lead to more power-efficient and cost-efficient microprocessors.
Fred Pollack is the Director of the Microprocessor Research Labs (MRL), with labs in Santa Clara, Calif.; Hillsboro, Ore.; Haifa, Israel; and Beijing, China. MRL focuses on several areas, including computer architecture, compilers, circuits, graphics, video, security, speech recognition, and new computing models.
From mid-1992 to early 1999, Fred was director of the MAP group in MPG. This division is responsible for all Intel platform architecture and performance analysis. In this role, he was also responsible for directing the planning of Intel's future microprocessors. From mid-1990 to mid-1992, he was the Architecture Manager for the Pentium Pro microprocessor. He joined Intel in 1978.
Earlier assignments included manager of the i960 architecture and chief architect for an advanced object-oriented, distributed operating system. In January of 1993, he was named an Intel Fellow.
IBM Austin Research Lab
Scalability and Performance of a ccNUMA-based Wintel System
The talk describes a study of application performance in a 16-way ccNUMA Wintel environment. We built the system using a cache-coherent switch that connects four 4-processor SMPs, featuring sixteen 350 MHz Intel Xeon processors and a total of 4 GB of physical memory. Such an environment poses several performance challenges to Windows NT, which assumes that memory is equidistant from all processors. To overcome these problems, we have followed a combined software/hardware approach to support performance evaluation and tuning. On the hardware side, we have built a programmable performance monitor that measures the frequency of remote memory accesses. The monitor does not incur any performance overhead and can be deployed in production mode, opening the possibility of dynamic performance tuning as the application's workload changes over time. On the software side, we have implemented an abstraction called a Resource Set, which allows threads to specify their execution and memory affinity across the ccNUMA complex. We used WebBench and a suite of parallel applications from the SPLASH-2 benchmark suite to evaluate the scalability and performance of the system. Our results suggest that the scalability of this environment is limited by the poor performance of local memory access in the current generation of Intel-based systems, and by the performance limits of existing I/O subsystems.
Work done jointly with Bishop Brock, Gary Carpenter, Eli Chiprout, Mark Dean, Philippe De Backer, Hubertus Franke, Mark Giampapa, David Glasco, Jim Peterson, Ram Rajamony, Freeman Rawson, Ron Rockhold and Andrew Zimmerman.
J Strother Moore
Department of Computer Sciences
University of Texas at Austin
Proving Commercial Microprocessors Correct: Recent Results with ACL2
ACL2 is an interactive mechanical theorem prover designed for use in modeling hardware and software and proving properties about those models. It has been used to prove correct a variety of algorithms and designs, including floating-point division and square root on both the AMD K5 (where the operations are done in microcode) and K7 or Athlon (where they are done using different algorithms in hardware). ACL2 has also been used for microprocessor modeling on designs including the Motorola CAP DSP chip and the Rockwell Collins JEM1 microprocessor.