Building Demanding Hard Real-Time Systems with Software Thread Integration
Software thread integration (STI) is a compiler technique which statically interleaves instructions from functions of multiple threads into one while reconciling control- and data-flow conflicts. This gives generic microprocessors low- or no-penalty context switches, increasing concurrency and enabling hardware to software migration on much slower processors than previously possible.
This talk describes new methods that enable STI to tackle much more challenging applications than previously possible. In particular, interrupt response times are cut dramatically, long functions can be integrated, and threads can now make asynchronous progress. We demonstrate how these new methods are used on a network protocol controller and a video display controller. The latter will be demonstrated with working hardware designed around a 20 MHz 8-bit microcontroller.
Alex Dean has been an assistant professor at NC State University's Dept. of ECE since 2000 and is associate director of the Center for Embedded Systems Research. He has developed two courses on embedded systems and received an NSF Career award. He received his MS (1993) and PhD (2000) in ECE at Carnegie Mellon University developing thread integration concepts. Between the degrees, he worked for over two years at United Technologies Research Center, designing, analyzing and implementing embedded network architectures for jet engines, elevators, cars, and building climate control systems.
Intelligent Energy Management for systems-on-chip
A critical concern for mobile handheld devices is the need to deliver high levels of performance given ever-diminishing power budgets. The need for low power is evident in mobile phones, where battery life per gram of battery has improved by a factor of 50 to 100 over the past seven years. These devices are increasingly running sophisticated workloads with widely varying resource requirements, which puts pressure on designers to optimize for an increasing number of use-cases. One of the most effective ways of bridging the gap between the different operating requirements is the use of dynamic voltage scaling (DVS).
This talk will cover ARM's Intelligent Energy Management (IEM) technology, which is a set of software and hardware components that enable the use of DVS and include predictive algorithms for determining the minimum required performance level of the processor. Deploying DVS is a particular challenge on system-on-chip designs and requires support from EDA tools. Our ongoing research addresses the need for self-characterizing systems that simplify validation, and open the door to power optimizations based on die-specific and ambient conditions. Razor is a prototype microarchitecture under development with the University of Michigan, which supports failure recovery to push the operating parameters of a system to its physical limits and beyond instead of always relying on parameters that satisfy worst-case conditions and always-correct operation.
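As a rough illustration of what a predictive policy for the "minimum required performance level" might look like, the sketch below smooths recent demand with an exponentially weighted average and then picks the lowest available level that covers the estimate. This is not ARM's IEM algorithm: the level table, the smoothing factor, and the function names are all assumptions made for the example.

```python
# Hypothetical sketch only: none of these names or values come from ARM's IEM software.
LEVELS = [0.25, 0.5, 0.75, 1.0]  # assumed performance levels, as fractions of peak


def predict_demand(history, alpha=0.5):
    """Exponentially weighted average of past demand (each sample is a fraction of peak)."""
    estimate = history[0]
    for sample in history[1:]:
        estimate = alpha * sample + (1 - alpha) * estimate
    return estimate


def choose_level(history, levels=LEVELS):
    """Return the lowest performance level that covers the predicted demand."""
    demand = predict_demand(history)
    for level in sorted(levels):
        if level >= demand:
            return level
    return max(levels)  # demand exceeds every level: run at full speed
```

For example, choose_level([0.3, 0.3]) selects the 0.5 level rather than full speed. Running at the lowest sufficient level matters because a lower frequency permits a lower supply voltage, and dynamic energy scales roughly with the square of that voltage.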
Krisztian Flautner is a Principal Researcher at ARM Limited. He holds Ph.D., M.S.E., and B.S.E. degrees in Computer Science and Engineering from the University of Michigan. His thesis explored the relevance of multithreading for interactive desktop workloads and described the implementation of an automatic power-management algorithm for processors supporting dynamic voltage scaling. Dr. Flautner’s research interests are focused on simple ideas that enable high-performance low-power processing platforms to support advanced software environments. In the research group at ARM Limited, he is currently working on next generation ARM architectures.
Energy Consumption in Mobile Devices: Why Future Systems Need Requirements-Aware Energy Scale-Down
The current proliferation of mobile devices has resulted in a large diversity of designs, each optimized for a specific application, form-factor, battery life, and functionality (e.g., cell phone, pager, MP3 player, PDA, tablet, laptop). Recent trends, motivated by user preferences towards carrying less, have focused on integrating these different applications in a single general-purpose device, often resulting in much higher energy consumption and consequently much reduced battery life. Our research argues that in order to achieve longer battery life, such systems should be designed to include requirements-aware energy scale-down techniques. Such techniques would allow a general-purpose device to use hardware mechanisms and software policies to adapt energy use to the user's requirements for the task at hand, potentially approaching the low energy use of a special-purpose device.
We make two main contributions. We first provide a model for energy scale-down. We argue that one approach to designing for scale-down is to use special-purpose devices as examples of power-efficient design points, and to structure adaptivity using insights from these design points. To understand the magnitude of the potential benefits, we present an energy comparison of a wide spectrum of mobile devices (to the best of our knowledge, the first study to do so). Based on the insights from this study, we propose and evaluate three specific requirements-aware energy scale-down optimizations, in the context of the display, wireless, and CPU components of the system. Our optimizations reduce the energy consumption of their targeted subsystems by factors of 2 to 10, demonstrating the importance of energy scale-down in future designs.
In the talk, I will focus primarily on display scale-down, drawing on work we have done for laptops and handhelds. I will also briefly talk about our other work on processors, wireless, and servers.
Partha Ranganathan is currently a research scientist at Hewlett Packard Labs. His research interests are in low-power system design, system architecture, and performance evaluation. His recent research focuses on designing power- and energy-efficient systems for future computing environments (from small mobile devices to dense servers in data centers). This work has led to a class of "energy scale-down" optimizations that use adaptivity in resources to match system energy efficiency with desired user functionality to achieve significant energy savings. Partha is currently exploring the potential of energy scale-down optimizations in the context of the data center as part of the data center architecture team. Partha received his B.Tech degree from the Indian Institute of Technology, Madras and his M.S. and Ph.D. from Rice University, Houston. He is a primary developer of the publicly distributed Rice Simulator for ILP Multiprocessors (RSIM), and a recipient of the Lodieska Stockbridge Vaughan fellowship and an IIT Madras Alumni Award.
Physical Design of High Performance Microprocessors
The design methodology used for the Power4 and Power5 microprocessors will be described, along with how it dealt with the challenges of discipline, 170 to 250 million transistors, high frequency, tight schedules, etc. The barriers that wire resistance, leakage, and cost pose to future high-performance processors will also be discussed.
Carl J. Anderson received his BS in physics from the University of Missouri in 1974 and his PhD in physics from the University of Wisconsin in 1979. He joined IBM Research in 1979, where he did circuit design, package design and test on the Josephson Superconducting Computer program. From 1983 to 1992 he worked in GaAs and optoelectronics design and fabrication. In 1992 Carl became the Si circuit design manager and was responsible for the conversion of the S/390 high-end mainframe from bipolar technology to custom CMOS technology. In 1997 Carl started work on the Power4 microprocessor, where he was responsible for the physical design. Carl was appointed an IBM Fellow in 2000 and received an Honorary Doctor of Science degree from the University of Wisconsin in 2003. Presently Carl is working on the next generation of IBM processors and helping near-term programs meet their performance goals.
Painting Kernel Permissions with Mondriaan Memory Protection
Mondriaan memory protection (MMP) is a proposal for a hardware mechanism which finally allows practical, efficient fine-grained memory protection. The promise of fine-grained hardware protection is more reliable software, and reliable software starts with the operating system.
This talk describes the author's experience with modifying the Linux kernel (2.4.19) to use Mondriaan. The hope was that the MMP hardware would let the OS make its memory sharing explicit without having to change its current structure. A few implementation techniques sufficed for different kinds of kernel subsystems: drivers (e.g., EIDE disk and network), kernel modules that are not drivers (e.g., Unix domain sockets), and non-module, non-driver subsystems (e.g., printk and its variants). These techniques, which included building a new software abstraction unintended by the hardware designers, indicate that the MMP primitives are useful for software.
Preliminary performance results on hefty workloads will be presented, and a kernel bug exposed by MMP will be discussed.
Emmett Witchel will be an assistant professor at the University of Texas at Austin starting in January 2004. He is receiving his doctorate from MIT with a thesis on Mondriaan Memory Protection, an efficient, fine-grained memory protection system. While at MIT he has published work on reducing energy consumption in caches and on low-power instruction sets. In 1997 he co-founded Incert Software, which developed a multiple-platform static instrumentation technology that could efficiently monitor program control flow during program execution. Incert merged with Geodesic in 2002, which was recently acquired by Veritas. Before arriving at MIT he published several papers as part of Stanford's SimOS project, including Embra, which is still the fastest reported full machine simulator. He is interested in computer architecture, and how the architecture is used by operating systems and compilers.
Network Processing - Multi-threaded, multi-processing
"Network Processing" is a catch-all term for the sorts of computing tasks performed by various pieces of networking gear. In general, these tasks have characteristics that make for some interesting computer architecture. For instance, a large transport router can take advantage of massive task-level parallelism, but it is also required to meet very rigorous reliability and real-time processing demands.
Some of the issues and opportunities presented by networking tasks will be discussed in the context of describing the architecture of the iFlow Packet Processor from Silicon Access Networks. The iFlow Packet Processor supports full-duplex data rates of 20 Gigabits/sec. The chip integrates 32 multi-threaded packet processors, packet buffer memory, 512 KBytes of on-chip local memory, 20 KBytes of instruction RAM, and a 1K x 72-b general purpose ternary CAM. It sports two full-duplex 12.8-Gbps (each way) SPI 4.2 packet interfaces for connection to off-the-shelf framer and fabric devices, and a number of coprocessor and memory interfaces.
Mike O'Connor is a Senior Microprocessor Architect at Texas Instruments in Austin. Prior to joining TI, he was the Director of Architecture and Chief Processor Architect at Silicon Access Networks where he was intimately involved in the development and design of the iFlow product line. Before his foray into startup-land, he spent many years at Sun, where he was involved in the development of UltraSPARC-I and UltraSPARC-III, and was the lead architect of the picoJava cores. He has a B.S.E.E. from Rice University and a Masters in Computer Engineering from UT-Austin. He has received 27 patents to date.
IBM's POWER5 Micro Processor Design and Methodology
POWER4 introduced chip multiprocessing with two microprocessors sharing a second level cache integrated onto a single chip in 2001. POWER4 performance is enhanced by the synergy of static instruction scheduling modern compilers can offer with the dynamic instruction issuing capability inherent in the out-of-order execution of the system.
POWER5, scheduled for introduction in 2004, builds on POWER4 by adding Simultaneous Multi-Threading (SMT). To software, SMT makes each POWER5 processor appear as if it is two independent processors. By dynamically applying resources to each thread, system resources can be utilized more effectively.
The design is a natural extension of POWER4's micro-architecture. POWER4 uses register renaming to allow out-of-order program execution. With SMT, this capability is leveraged to support twice the number of architected registers, thus effectively maintaining thread independence. Additional rename registers are required to ensure a ready supply of rename registers. The dynamic instruction scheduling already present in the microprocessor now manages what appears to be a larger number of architected registers, increasing the instruction window from which to schedule instructions. This increased parallelism permits higher instruction throughput without increasing the number of execution engines.
In this talk Ron will describe POWER5's implementation with a focus on what was done to deliver high performance. Some of the unique features in the POWER5 SMT implementation are described. The talk will discuss enhancements made to the memory structure of POWER4 to allow scalability to a 64-way symmetric multiprocessor system, with 128 threads of execution. Design challenges such as power, lab debug, and serviceability will also be discussed.
Ron Kalla (email@example.com) is the lead engineer for IBM's POWER5 specializing in processor core development. He has designed processor cores for S/370, M68000, AS/400 and RS/6000 machines. He holds numerous patents on processor architecture. Ron also has an extensive background in post silicon hardware bring up and verification. He has 12 issued US Patents with 15 additional patents pending and has published 15 technical disclosures.
University of Illinois, Urbana-Champaign
rePLay: the Phenomena behind and Limits of Dynamic Optimization in Hardware
Optimization of a processor's dynamic instruction stream has gained popularity in the last few years as a way to further increase application performance. While optimizations at the time of compilation and linking often take advantage of information about the source code or the language semantics, our work has focused on the potential for further increasing performance by leveraging predictable control flow at the time of execution. This approach in fact complements compile-time optimizations, as we demonstrated in earlier work by comparing the relative benefits for unoptimized and highly-optimized executables.
In this talk, I explore the sources and limitations of our approach to dynamic optimization, in the process exposing several interesting dataflow and program structure phenomena. I first describe a methodology for characterizing the dataflow structure of dynamic instruction streams by examining instruction-level traces in reverse order of execution. Many properties of instruction streams are more amenable to processing in this direction, and the combined two-pass (creation and post-processing) analysis renders many measurements of dataflow structure straightforward. Using this tool, I characterize the streams of the SPEC2000 integer benchmarks compiled for the Alpha ISA, highlighting the major sources of performance gain for trace-based dynamic optimization on the Alpha.
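To illustrate why some dataflow properties are easier to measure when a trace is walked backwards, the sketch below finds dynamically dead instructions (those whose result is never read) in a single reverse pass. It is not the speaker's tool: the trace format, the function name, and the rule for side effects are assumptions made for the example.

```python
# Hypothetical trace format, assumed for this sketch:
#   (dest_register_or_None, [source_registers], has_side_effect)
def find_dead_instructions(trace, live_at_end):
    """Return indices of instructions whose results are never used.

    Walking the trace in reverse, `live` holds the registers that some
    later instruction (or the trace exit) still needs.
    """
    live = set(live_at_end)
    dead = []
    for index in range(len(trace) - 1, -1, -1):
        dest, sources, side_effect = trace[index]
        if dest is not None and dest not in live and not side_effect:
            dead.append(index)
            continue  # a dead instruction does not keep its sources live
        if dest is not None:
            live.discard(dest)  # this write satisfies the later readers
        live.update(sources)    # its inputs are needed by earlier code
    dead.reverse()
    return dead
```

In forward order the same question requires remembering every pending write until a later read or overwrite resolves it; in reverse order, one set of live registers is enough, and chains of dead instructions fall out in the same pass.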
I next assume an ideally efficient optimizer implementation and explore the performance potential of trace-based dynamic optimization. Using a trace-driven simulation framework, I examine the relationship between the length of optimized traces and the resulting performance benefits. With no overhead and completely accurate control prediction, this study explores the limits of the approach and exposes the phenomena behind those limits.
Steven S. Lumetta is an Assistant Professor of Electrical and Computer Engineering and a Research Assistant Professor in the Coordinated Science Laboratory at the University of Illinois at Urbana-Champaign. He received an A.B. in Physics from U. C. Berkeley in 1991, an M.S. in Computer Science from U. C. Berkeley in 1994, and a Ph.D. in Computer Science from U. C. Berkeley in 1998. Lumetta's research interests are in optical networking, computer architecture, and high-performance networking and computing. In the past, he also worked on a number of problems in scalable parallel computing, including languages, tools, algorithms, and runtime systems.
Throughput Oriented Computing
The optimization center for traditional microprocessors, often driven by desktop requirements, has proven to be a less-than-perfect fit for server-side computing. For instance, many of the ideas proposed in academia and in the industry are analyzed and justified using single-threaded specint/fp-style benchmarks, a very limited representation of server workloads. This has led manufacturers to mostly optimize "latency computing" as opposed to "bandwidth computing" or what Sun Microsystems calls "Throughput Computing".
For server workloads, one can architect a processor so that the available memory bandwidth is used wisely and fully. Given aggressive memory bandwidth technology, if applications can run at memory bandwidth speed instead of memory latency speed, on a throughput-oriented microprocessor, very high levels of performance can be achieved.
A set of novel techniques needed for optimizing throughput-computing microprocessors will be described, and will be matched to processors on our roadmap.
Marc Tremblay is a Sun Fellow, Vice President, and Chief Architect for Sun's Processor and Network Products group. In his role Tremblay sets future directions for Sun's processor roadmap. His mission is to move Sun asap to the Throughput Computing paradigm. The new generation of processors that incorporate techniques he has helped develop over the past several years, namely, Chip Multiprocessing, Chip Multithreading, speculative multithreading, and assist threading, will gradually form the cornerstone of Sun's processor portfolio.
Prior to his current position, he was co-architect for Sun's UltraSPARC I, the MDR Microprocessor of the Year in 1995, and chief architect for the UltraSPARC II microprocessor. He was also the chief architect for the MAJC Program, which was nominated for best emerging technology in 1999 and best media processor in 2000 by MDR Analysts. He also started and architected the picoJava processor core, a Java bytecode engine.
Tremblay holds an M.S. and Ph.D. in Computer Science from UCLA and a B.S. in Physics Engineering from Laval University in Canada. He holds 72 US patents in various areas of computer architecture. Tremblay was nominated for Innovator of the Year by EDN Magazine in 1999. He was the Co-Chair of the Hot Chips 2000 Conference and is a member of IEEE and ACM.