Advanced Micro Devices
Performance Analysis and Workload Characterization of PC systems at AMD
Performance of a PC system from the end user's point of view depends not only on the processor's speed but also on the performance of its memory subsystem, bus protocols, operating system, device drivers, and the nature of the application running on the system. System modeling and analysis are used by microarchitects and design engineers to address these issues. In this talk, I will present the PC system performance modeling and system trace generation technology that has been developed at AMD in Austin and is used by the microprocessor and chipset design teams.
Biography: Ali Poursepanj works at AMD in the CPU and System Architecture Group of the AMD Architecture Lab, where he concentrates on performance analysis of AMD-based systems. Before that, he led the PowerPC processor performance team at IBM, where he developed tools and methodologies for performance analysis of the 601, 603, 604, and 620.
Dr. Poursepanj received his PhD from UT Austin in 1995. His dissertation was in the area of trace sampling. His areas of interest include processor and system performance analysis, workload characterization, and software engineering.
Department of Electrical and Computer Engineering
Architectural Techniques for Improving Fine-grain Multiprocessor Performance
Recent processor architecture advances have greatly improved the available parallelism in each instruction thread. However, certain multiprocessor operations, such as device interactions or critical sections, suffer from higher overhead and longer latencies than general code. Amdahl's law shows that, without due attention, these sections will become increasingly critical to performance, especially as our programming style shifts from single-threaded to fine-grained multithreading.
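Amdahl's law, invoked above, bounds the overall speedup when only part of a workload scales. A minimal sketch (the function name and example numbers are illustrative, not from the talk):

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Overall speedup when only the parallel fraction scales with n."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

# Even a small serial fraction caps speedup: with 5% of time in
# unscalable sections (e.g., critical sections), 64 processors
# deliver well under 64x.
print(round(amdahl_speedup(0.05, 64), 1))   # about 15.4
```

This is why shaving overhead from critical sections and device interactions matters more as processor counts grow.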
A close examination of these sections of code shows that most current architectures enforce more dependences or restrictions than necessary. My talk will discuss architectural techniques to remove these extraneous dependences, and present some simulation-based performance results from the application of these changes.
University of Wisconsin-Madison
Multicast Snooping: A New Coherence Method Using a Multicast Address Network
Large applications, such as simulators and database servers, require cost-effective computation power beyond that of a single microprocessor. Shared-memory multiprocessor servers have emerged as a popular solution, because the system appears like a multi-tasking uniprocessor to many applications. Most shared memory multiprocessors use per-processor cache hierarchies that are kept transparent with a coherence algorithm.
The two classic classes of coherence algorithms are snooping and directories. Snooping keeps caches coherent using a totally ordered network to broadcast coherence transactions directly to all processors and memory. In contrast, directory protocols transmit a coherence transaction over an arbitrary point-to-point network to a directory entry (usually at memory), which, in turn, re-directs the transaction to a superset of processors caching the block.
This talk discusses a new coherence method called MULTICAST SNOOPING that dynamically adapts between broadcast snooping and a directory protocol. Multicast snooping is unique because processors predict which caches should snoop each coherence transaction by specifying a multicast "mask." Transactions are delivered with an ordered multicast network, such as an Isotach network. Processors handle transactions as they would with a snooping protocol, while a simplified directory operates in parallel to check masks and gracefully handle incorrect ones (e.g., previous owner missing).
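The mask-prediction idea can be caricatured in a few lines. This is a toy sketch under assumed mechanics (last-sharers history, broadcast fallback, a directory ownership check), not the protocol from the paper:

```python
# Toy sketch: each requester predicts a multicast mask from the sharers
# it last observed for a block; a simplified directory checks the mask
# and can correct it (e.g., when the previous owner is missing).
class MaskPredictor:
    def __init__(self, num_procs):
        self.num_procs = num_procs
        self.last_sharers = {}              # block address -> set of processor ids

    def predict(self, block, requester):
        # Guess the last-known sharers plus the requester; with no
        # history, fall back to a full broadcast.
        guess = self.last_sharers.get(block)
        if guess is None:
            return set(range(self.num_procs))
        return guess | {requester}

    def train(self, block, actual_sharers):    # feedback after the transaction
        self.last_sharers[block] = set(actual_sharers)

def directory_check(mask, owner):
    """The mask is usable only if it covers the block's current owner."""
    return owner in mask

p = MaskPredictor(num_procs=32)
p.train(0x40, {3, 7})                  # block 0x40 was last shared by P3 and P7
mask = p.predict(0x40, requester=12)   # {3, 7, 12}: far fewer than 32 targets
print(sorted(mask), directory_check(mask, owner=7))
```

The payoff is exactly the trade the abstract describes: masks much smaller than a broadcast when predictions are right, with the directory as a safety net when they are wrong.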
Preliminary performance numbers provide encouragement that multicast snooping can obtain data directly (like broadcast snooping) but apply to larger systems (like directories). For SPLASH-2 benchmarks running on 32 processors, we can limit multicasts to an average of 2-6 destinations (<< 32) and deliver 2-5 multicasts per network cycle (>> broadcast snooping's 1 per cycle). A paper based on this work appears in the 1999 International Symposium on Computer Architecture.
How are we going to design a 400 Million transistor chip?
Sometime in the next few years, IC technology will provide the capability to pattern and yield 100M transistors/cm^2, resulting in silicon systems of 400+ million transistors. The current buzz in the industry is that reuse and SoCs will solve the implementation problem, but in reality there will always be a need to create ever more complex designs. Given current productivity metrics, it is painfully obvious that a breakthrough in design productivity will be needed to implement these silicon systems. Is it possible, or is there a train wreck up ahead?
Biography: Mark McDermott graduated from the University of New Mexico in 1977 with a BSEE. He joined Motorola on the Engineering Rotational Program, with assignments in process engineering, product engineering, systems engineering, and IC design engineering. From 1978 to 1981 he worked as an IC designer in the CMOS Design Group on microprocessors and speech synthesizers. From 1981 to 1984 he worked at TEGAS Systems, Inc., on a special-purpose attached processor for digital logic simulation. In August 1984, he co-founded Accelerated Solution Corp., whose primary product was a high-performance EDA hardware accelerator. After ASC closed its doors, Mark joined Motorola's Computer-X group as a system designer working on factory automation computer design. In 1986 he joined the 68332 MCU design team, where he worked on system integration design. From 1988 to 1990 he was Project Leader for the 88110 microprocessor. During this time he received his MSEE degree from the University of Texas, majoring in Computer Architecture and Engineering Design Automation. In 1991 Mark joined Cyrix Corp., where he was Director of the Austin Design Center and managed the design of a high-performance x86 microprocessor. In 1995 Motorola hired him as Director of the Somerset Design Center, where he managed the joint IBM/Motorola PowerPC design effort. In 1997 he was promoted to Director, Networking and Computing Core Technology, and then to Senior Director, SoC Design Technology. In 1998, Mark joined Intel Corporation in Austin, TX to start a new design center focusing on next-generation IA32 processors. Mark currently holds 17 patents in the field of microprocessor design and is a member of the IEEE, ACM, NSPE and TSPE.
Managing Thread-Shared Hardware Resources on Simultaneous Multithreaded Processors
Prof. Susan Eggers
University of Washington
Simultaneous multithreading is a high-performance processor design that executes instructions from multiple threads every cycle. By dynamically sharing processor resources among threads, SMT increases functional unit utilization, thereby boosting instruction throughput. The result is greater speedups for multiprogramming, parallel and commercial database workloads.
Over the past few years we have done SMT-related research in several different areas, including architectural studies as well as compiler and operating system support for SMT. In this talk I will cover two of them.
Barriers & Solutions to High-performance x86 Processing
G. Glenn Henry
President and CEO, Centaur Technologies
The x86 (Intel-compatible) architecture has many baroque, ornate and mysterious features that present significant barriers to high-performance execution. The general design of a modern x86 processor is described along with the key problematic elements of the architecture. The design "solutions" of a particular x86 processor are described. Challenges in the verification process and our solution will also be described.
G. Glenn Henry is the founder and President of Centaur Technologies, a subsidiary of IDT that designs and sells the WinChip family of Intel-compatible processors. Prior to the startup of Centaur in April 1995, Henry was Senior Vice President of the Product Group and CTO of Dell Computer. Prior to joining Dell in July 1988, Mr. Henry spent 21 years at IBM, where he was named an IBM Fellow for his contributions to the architecture, design, and management of high-technology products such as the IBM System/32, IBM System/38, and the early IBM RISC processors.
Intelligent DRAM Memories
Prof. Jack Lipovski
University of Texas at Austin
Processor-in-memory chips get around the bottleneck between a processor and its memory. They may be able to implement vectorized graphics algorithms and full-text database retrieval much better than conventional processor-memory systems.
This talk introduces a design of a processor-in-DRAM, called the Dynamic Associative Access Memory (DAAM). In DAAM chips, a large number of small processing elements (PEs) are put in a DRAM's sense amps. The PE is a one-bit ALU that is throughput-matched to a DRAM, and that is also made into an associative processor with the addition of a gate. One or two of these chips can perform three-dimensional graphics front-end processing. Thousands of these chips will be mounted on "memory boards" in "TONY" full-text database servers.
This talk shows how this chip technology differs from conventional technology, and gives glowing estimates of its cost-performance. This processor-in-memory is more than two orders of magnitude more cost-effective than conventional processors for vector processing. This talk also shows how a TONY server system using this chip is used in internet database machines, which can generate a market for the chip. A $2,500,000 server farm can hold a 200 gigabyte database, at a cost of 5 cents per printed page, and execute a 20 term Boolean or inner product query in about 50 microseconds, to support over a million on-line users. This processor-in-memory is more than two orders of magnitude more cost-effective than conventional processors for full-text retrieval.
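The quoted 5 cents per printed page can be sanity-checked with back-of-the-envelope arithmetic, assuming a printed page averages roughly 4 KB of text (an assumed figure; the talk does not state a page size):

```python
# Back-of-the-envelope check of the abstract's cost figure, under an
# ASSUMED average page size of about 4 KB of text.
server_cost_dollars = 2_500_000
database_bytes = 200e9                 # 200 gigabyte database
bytes_per_page = 4_000                 # assumption, not from the talk

pages = database_bytes / bytes_per_page
cost_per_page = server_cost_dollars / pages
print(f"{pages:.0f} pages, ${cost_per_page:.2f}/page")
# With these assumptions: 50 million pages at $0.05 per page,
# consistent with the quoted 5 cents.
```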
Finally, this talk will explore speculative design ideas using very cheap parallel SIMD processors and associative memories for pattern matching and brute force artificial intelligence.
Biography: G. Jack Lipovski is a full professor in electrical engineering and in computer science at The University of Texas. He is a computer architect internationally recognized for his design of the pioneering data-base computer, CASSM, and the parallel computer, TRAC. He received his Ph.D. degree from the University of Illinois in 1969, and has taught at the University of Florida and at the Naval Postgraduate School, where he held the Grace Hopper chair in Computer Science. He has consulted for Harris Semiconductor, designing a microcomputer, and for the Microelectronics and Computer Corporation, studying parallel computers. He founded the company Linden Technology Ltd., and is the chairman of its board. His current interests include parallel computing, data-base computer architectures, artificial intelligence computer architectures, and microcomputers.
A Language for Describing Predictors and Its Use to Automatically Synthesize Them
Guessing with Darwin's Help
Dr. Joel Emer
Compaq Computer Corporation
As processor architectures have increased their reliance on speculative execution to improve performance, the importance of accurate prediction of what to execute speculatively has increased. Furthermore, the types of values predicted have expanded from the ubiquitous branch and call/return targets to the prediction of indirect jump targets, cache ways and data values. In general, the prediction process is one of identifying the current state of the system, and making a prediction for some as yet uncomputed value based on that state. Prediction accuracy is improved by learning what is a good prediction for that state using a feedback process at the time the predicted value is actually computed.

While there have been a number of efforts to formally characterize this process, we have taken the approach of providing a simple algebraic-style notation that allows one to express this state identification and feedback process. This notation allows one to describe a wide variety of predictors in a uniform way.

It also facilitates the use of an efficient search technique called genetic programming, which is loosely modeled on the natural evolutionary process, to explore the design space. In this talk we describe our notation and the results of the application of genetic programming to the design of branch and indirect jump predictors.
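The "state identification plus feedback" view can be illustrated loosely in code. This is not the talk's notation; it is a sketch in which a predictor is just an index function over current state plus a table trained by feedback, instantiated here as a gshare-style branch predictor:

```python
# Loose illustration (not the talk's actual notation): a predictor as
# an index function over current state plus a feedback-trained table
# of saturating counters.
class TablePredictor:
    def __init__(self, index_fn, size, counter_max=3):
        self.index_fn = index_fn
        self.counter_max = counter_max
        self.table = [counter_max // 2 + 1] * size   # init weakly taken

    def predict(self, state):
        return self.table[self.index_fn(state)] > self.counter_max // 2

    def update(self, state, outcome):                # the feedback step
        i = self.index_fn(state)
        if outcome:
            self.table[i] = min(self.table[i] + 1, self.counter_max)
        else:
            self.table[i] = max(self.table[i] - 1, 0)

# A gshare-style index: XOR the branch PC with global history bits.
SIZE = 1024
gshare = TablePredictor(lambda s: (s["pc"] ^ s["history"]) % SIZE, SIZE)

state = {"pc": 0x400123, "history": 0b1011}
for _ in range(3):                 # train: this branch is always taken
    gshare.update(state, True)
print(gshare.predict(state))
```

Because both the index function and the update rule are ordinary expressions, a search procedure such as genetic programming can mutate and recombine them to explore the predictor design space.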
Biography: Dr. Joel S. Emer is a Senior Consulting Engineer in Compaq's Alpha Development Group. He holds a Ph.D. in Electrical Engineering from the University of Illinois, and M.S.E.E. and B.S.E.E. degrees from Purdue University. He is a 20-year Digital/Compaq employee, where he has worked on processor architecture and performance modeling methodologies for a number of VAX and Alpha CPUs, as well as researched heterogeneous distributed systems and networked file systems. His current research interests include multithreaded processor organizations, techniques for increased instruction level parallelism, instruction and data cache organizations, branch prediction schemes and data prefetch strategies.
SimOS: A Full System Simulator for Computer Architects
Dr. Tom Keller
IBM Austin Research Lab
Full system simulators, such as Stanford's SimOS, allow computer hardware and software architects to faithfully model the behavior of today's systems by fully emulating the processor, memory hierarchy, and I/O devices. SimOS supports user-developed packages for data and instruction cache simulation and for execution profiling of all code, and provides a practical performance and functional debugging environment for operating systems.
Keller's group at IBM's Austin Research Lab has adapted Stanford's SimOS to model existing and proposed PowerPC-based machines in sufficient detail that IBM's AIX operating system boots on SimOS and, 29 billion emulated instructions later, emits the login prompt. The emulated system behaves exactly like a real system, running roughly 500 times slower. It allows the capture of complete traces of both kernel and user activity, as well as cache, disk, and memory activity, while allowing complete user and system code profiling. SimOS will be used within IBM as a debugging environment for new operating systems, for generating traditional traces, and as a framework for plugging in detailed models of proposed processors and memory systems.
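To put the 29-billion-instruction boot and the roughly 500x slowdown in perspective, a quick calculation, assuming (hypothetically) a native machine retiring 200 million instructions per second, since the talk does not state a native rate:

```python
# What a ~500x slowdown means for a 29-billion-instruction AIX boot,
# under an ASSUMED native rate of 200M instructions/second.
instructions = 29e9
native_ips = 200e6          # assumed native instructions per second
slowdown = 500

native_seconds = instructions / native_ips
simulated_hours = native_seconds * slowdown / 3600
print(f"{native_seconds:.0f} s native, {simulated_hours:.1f} h simulated")
# Under these assumptions: a ~2.4-minute native boot becomes
# roughly 20 hours of simulation.
```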
By directly executing the kernel, significant workloads, such as the TPC-C database workload, can be captured and analyzed. The workloads of high-end server systems, such as Unix Symmetric Multiprocessing Systems, behave significantly differently from workstation-based benchmarks such as SPEC, and SimOS lets designers exploit knowledge of these differences.
In contrast to usual industry practice, the SimOS-PPC source is available to any researcher through Stanford University.
Biography: Dr. Tom W. Keller is a Senior Technical Staff Member in IBM Research Division's Austin Research Lab. He holds a Ph.D. in Computer Sciences from the University of Texas at Austin. He has been with IBM since 1989, where he has worked in Unix performance tools development and processor and system architecture development. Before joining IBM, he worked in MCC's "shared nothing" database machine project, where his performance team developed what is now known as the TPC-C database benchmark. Dr. Keller has also served as Associate Director of the U.T. Austin Computation Center and as a staff member at Los Alamos National Laboratory, where he helped inaugurate the service of the first Cray computers.
Replenishing the Microarchitecture Treasure Chest
Prof. John P. Shen
The 1960s were a golden decade of computer architecture that saw the emergence of ideas such as pipelining, cache memory, multiple-instruction issue, out-of-order execution, and virtual memory. The past three decades have leveraged these ideas and relentless advances in CMOS VLSI technology to achieve phenomenal performance increases for single-chip microprocessors. To sustain the same rate of performance improvement, radically new ideas for microarchitecture are needed. It is time to replenish the treasure chest.
This talk will present two recent new ideas, namely value prediction and trace caching, that are currently being researched by the Carnegie Mellon Microarchitecture Research Team (CMuART). Critical issues that must be addressed by future research and possible new microarchitecture paradigms will also be highlighted.
Biography: John P. Shen received his degrees from the University of Michigan and the University of Southern California, and spent a number of years working in the aerospace industry at Hughes and TRW. He joined the ECE Department of CMU in 1981 and has published a bunch of papers on VLSI testing, fault-tolerant computing, and processor design. He currently heads the Carnegie Mellon Microarchitecture Research Team (CMuART).