Computer Architecture Seminar Abstracts

Spring 2003

Paolo Faraboschi
HP Labs

Embedded processors and customization: lessons from the design of the Lx family


The custom-fit processor project started at HP Labs in 1994 with the goal of producing customized VLIW engines for high-performance embedded applications. The project is now a reality: it has been productized as a family of cores (Lx/ST200) by STMicroelectronics and is used in a variety of digital video products. This talk describes the history of the project from its conception to products, and presents some of the lessons we learned along the way. In particular, it discusses the aspects of customization, the importance of the compiler strategy, the role of physical processor design, and the impact of legacy code and tools.


Paolo Faraboschi is a principal research scientist at HP Labs, where he has been leading the "Custom-Fit Processor" project, whose goal is to produce customized high-performance VLIW processors, compilers, and tools for embedded systems. Together with STMicroelectronics, the project co-developed the Lx/ST200 family of VLIW embedded cores, now deployed in a variety of digital video consumer products. Paolo received his PhD degree in Electrical Engineering from the University of Genoa (Italy) in 1993 and joined HP Labs in 1994. Paolo is actively involved in the computer architecture community (MICRO-34 program co-chair) and teaches a class on VLIW for the ALARI master at USI (Lugano, Switzerland). His main interests include computer architecture, VLIW processors, compilers and tools for high-performance instruction-level parallelism in embedded systems, and, more recently, the computing aspects of content-delivery and digital publishing systems.

Michael Flynn
Stanford University

Computer Architecture & Technology: The Road Ahead


The road ahead in computer systems is determined by continuing progress in the scaling of silicon technology. But rather than a simple extension of current products, several shifts in design emphasis are occurring. In the first shift, greater importance is placed on lowering power rather than increasing performance; power can potentially be reduced to as little as one millionth of current levels. In a second shift, increased circuit density enables entire systems (processor plus memory and communications support) to fit on a single chip. These shifts may enable wearable (watch-type) and similar computing devices. As chip costs fall, interconnections come to dominate cost, and wireless interconnection becomes all the more important, further enabling the move to small or wearable devices. Still, to exploit these technology advances, there is a good deal of computer engineering research and development to be done.


Mike Flynn is Emeritus Professor of Electrical Engineering at Stanford University, where he has taught for almost 30 years. Before that, he taught at Johns Hopkins and Northwestern. Before that, he was a project manager at IBM on the 360/91, one of the most interesting of IBM machines in the history of that company. He has had a long and illustrious career in teaching, research, and consulting. He founded American Supercomputing in the 1980s. He has received a number of awards, most notably the Eckert-Mauchly Award, the highest honor in the field of computer architecture, in 1992.

Burton Smith

The Evolution of Multithreaded Architecture


Multithreading as an idea for processor architecture has evolved significantly since the days of the CDC 6600. The motivations for its use have varied widely over the years, and many novel ideas have been tried. This talk will focus on the genealogy of the "conventional" branch of the multithreaded processor family tree, one which generally sprang from the need to improve von Neumann processor architecture rather than from the evolution of dataflow ideas.


Burton J. Smith is one of our co-founders and has been our Chief Scientist and a Director since early 1988. He served as our Chairman from 1988 to June 1999. He is a recognized authority on high performance computer architecture and programming languages for parallel computers. He is the principal architect of the MTA system and heads our Cascade project. Mr. Smith was a Fellow of the Supercomputing Research Center (now Center for Computing Sciences), a division of the Institute for Defense Analyses, from 1985 to 1988. He was honored in 1990 with the Eckert-Mauchly Award, given jointly by the Institute of Electrical and Electronics Engineers and the Association for Computing Machinery, and was elected a Fellow of both organizations in 1994. In February 2002 he was elected a member of the National Academy of Engineering. Mr. Smith received S.M., E.E., and Sc.D. degrees from the Massachusetts Institute of Technology.

Per Stenström
Chalmers University of Technology

Module-Level Speculative Execution Techniques on Chip Multiprocessors


Chip multiprocessors (CMPs) are an interesting architectural style for addressing the diminishing returns of pushing the superscalar paradigm further. While CMPs can potentially exploit the thread-level parallelism that is inherent in many technical/numerical and commercial workloads, and across independent jobs, they may fall short when it comes to boosting the performance of single-threaded applications.

We have studied the opportunities and challenges of dynamically extracting threads by simply spawning a speculative thread to execute the code after a procedure, function, or method invocation. This execution model is called module-level speculative execution. This talk will report on the main experiences gathered using such a simple execution model.
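The control flow of this execution model can be sketched in a few lines. The following Python fragment is purely illustrative (the actual mechanism is implemented in hardware on a CMP, not with software threads): on a procedure invocation, a speculative thread runs the post-call continuation using a predicted return value, and the speculation commits only if the prediction turns out to be correct. The function names and the use of a thread pool are my own illustrative choices, not part of the work described.

```python
import concurrent.futures

def module_level_speculation(func, args, continuation, predicted_return):
    """Illustrative sketch of module-level speculative execution:
    run `continuation` speculatively with a predicted return value
    while `func` executes; commit only if the prediction holds.
    Assumes `continuation` is side-effect free, so squashed work
    can simply be discarded and re-executed."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        main = pool.submit(func, *args)                     # non-speculative thread
        spec = pool.submit(continuation, predicted_return)  # speculative thread
        actual = main.result()
        if actual == predicted_return:
            return spec.result()        # value prediction held: commit speculation
        spec.cancel()                   # misprediction: squash the speculative thread
        return continuation(actual)     # ...and re-execute non-speculatively
```

In hardware, the "squash and re-execute" step corresponds to discarding the speculative thread's buffered state; the quality of the return-value predictor determines how often that penalty is paid.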

We have focused on the inherent limitations of this execution model, both with respect to programming style (imperative versus object-oriented) and with respect to architectural limitations, especially how effectively speculative threads are managed and how data dependences can be resolved by different value prediction techniques. We have found that it is possible to gain a speedup of a factor of two on CMPs with four to eight cores. To achieve this, however, it is critical to keep thread-management overhead at a reasonable level, and we show techniques for how to accomplish that.


Per Stenstrom is Professor of Computer Architecture at Chalmers University of Technology, Sweden, and acts as vice-dean of the School of Computer Science and Engineering, where he is responsible for graduate education.

His main research interests are in high-performance computer architecture. He has worked on topics including compiler optimizations and VLSI design principles, but his main contributions are to scalable cache coherence solutions and latency-tolerating techniques. He has held faculty and visiting positions at Lund University, Carnegie Mellon University, Stanford University, the University of Southern California, and recently at Sun Microsystems. He has served as program chair and as a member of more than thirty program committees of IEEE/ACM conferences, and is an associate editor of IEEE Transactions on Computers, IEEE Computer Architecture Letters, and the Journal of Parallel and Distributed Computing. He was general chair of the 28th IEEE/ACM International Symposium on Computer Architecture, held in Goteborg in 2001. Dr. Stenstrom is a Senior Member of the IEEE and a Member of ACM/SIGARCH.

Guri Sohi
University of Wisconsin

Speculative Multithreading: from Multiscalar to MSSP


Single-chip processors currently have microarchitectures capable of supporting multiple threads of execution (either via multithreading or via chip multiprocessing), a capability whose use is likely to continue to increase. Speculative multithreading refers to a broad class of recently proposed techniques to speculatively "parallelize" the execution of a sequential program. My research group at Wisconsin has been working on speculative multithreading techniques for over a decade. This talk will overview some of what we have learned over the years. We will start with our early work on multiscalar, continue with data-driven multithreading and speculative slices (a.k.a. prefetch threads, helper threads, or scout threads), and then move on to our most recent work on master-slave speculative parallelization (MSSP).


Guri Sohi received a Ph.D. in Electrical and Computer Engineering from the University of Illinois, and has been a faculty member at the University of Wisconsin-Madison since 1985. He is currently a Professor in the Computer Sciences department.

Sohi's research has been in the design of high-performance computer systems. He has co-authored several papers and patents that have influenced both researchers and commercial microprocessors. In the mid 1980s, while most computer architects were investigating in-order processors, he investigated out-of-order processors. His paper "Instruction Issue Logic for High-Performance, Interruptible Pipelined Processors" (in ISCA 1987) articulated a model for a dynamically-scheduled processor supporting precise exceptions, a model that was widely adopted by several microprocessor manufacturers. (This paper, and the journal version in IEEE Trans. on Computers, March 1990, have been referenced by over 120 U.S. patents.)

He received the 1999 ACM SIGARCH Maurice Wilkes award "for seminal contributions in the areas of high issue rate processors and instruction level parallelism". At the University of Wisconsin he was selected as a Vilas Associate in 1997 and won the WARF Kellett Mid-Career Faculty Researcher award in 2000.

Charles Webb

Reliability, Availability and Serviceability Architecture of IBM S/390 Systems


IBM's zSeries processors continue a long line of "mainframe" products noted for their reliability, availability, and serviceability (RAS) as well as enterprise-scale commercial performance. This seminar will describe the evolution of the S/390 and zSeries processors over the last few years. It will then focus on the industry-leading fault detection and recovery capabilities of these processors. These capabilities include Dynamic CPU Sparing, which allows a spare CPU in the configuration to take over the work of a failing CPU in a manner completely transparent to all software, including the operating system.


Mr. Webb is an IBM Distinguished Engineer in Server Group development. He received his B.S. degree in 1982 and his M.Eng. in 1983, both from Rensselaer Polytechnic Institute. He joined IBM in 1983 at the Product Development Laboratory in Poughkeepsie, where he has remained since. Mr. Webb has worked on the ES/9000 processor, the S/390 G4 and G5 CMOS processors, and the eServer z900 processor in the areas of performance analysis, architecture, and design. He is currently responsible for zSeries processor design and is co-chief engineer for a future family of IBM eServer products.

Brian O'Krafka
Sun Microsystems

The Sun Initiative on Computer System Performance Analysis and the sysmodel Analytic/Simulation Toolset


This talk will describe an ongoing project in Sun Labs to improve the quality, consistency and timeliness of commercial system performance projections.

The project has two components. The first is a methodology and toolset for the thorough characterization of commercial workloads. Data from this component drives the second, which is a modeling toolset for building queueing models of multiprocessor systems. This toolset complements the cycle-accurate timers that are routinely developed for microprocessors.

Workload characterization is the process of estimating software pathlengths, cache miss rates, and processor abstraction data for broad ranges of multiprocessor system configurations. The largest piece of this is the projection of miss rates using cache simulations driven by system bus traces. We will describe the miss rate estimation process, with a focus on some novel techniques for creating a miss rate "surface" from a sparse set of cache simulation points.

The results of workload characterization drive the "sysmodel" toolset, a queueing network solver/simulator environment that was developed to simplify analytic modeling and make it accessible to a broad set of users. With sysmodel, a multiprocessor is described with a timing-centric set of C-language macros. These are automatically converted into a simulation model or into a set of equations that can be solved analytically using mean value analysis. We will review the basic mean value analysis algorithm and some of the approximations we use to model non-product-form behavior.
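The basic mean value analysis recursion mentioned above is standard and compact enough to show. This is a minimal sketch of exact MVA for a closed, single-class, product-form network, not sysmodel itself, and it omits the non-product-form approximations the talk covers: the arrival theorem says a customer arriving at a queue sees the steady-state queue length of the network with one fewer customer, which yields a simple recursion over the population size.

```python
def mean_value_analysis(demands, n_customers):
    """Exact MVA for a closed, single-class, product-form queueing
    network. `demands` are per-queue service demands (visit ratio
    times service time); returns (throughput, per-queue mean queue
    lengths) at population `n_customers`."""
    q = [0.0] * len(demands)              # queue lengths with 0 customers
    for n in range(1, n_customers + 1):
        # Residence time at each center: service demand inflated by the
        # customers already queued there (arrival theorem).
        r = [d * (1.0 + qk) for d, qk in zip(demands, q)]
        x = n / sum(r)                    # system throughput (Little's law)
        q = [x * rk for rk in r]          # new mean queue lengths
    return x, q
```

With a single queue of demand 1.0 and three customers, throughput saturates at 1/demand = 1.0 and all three customers queue at the bottleneck, which is the expected behavior.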

This methodology has been validated against Starfire, Sunfire and Serengeti servers. In each of these cases the modeling results were within 10% of machine measurements.


Brian O'Krafka is a member of the Sun Labs Architecture and Performance Group. Brian received a Ph.D. in electrical engineering and computer science from the University of California at Berkeley in 1992, after which he joined the IBM Austin Laboratory. From 1992 to 1997, he worked on multiprocessor verification for RS/6000 servers. In 1997 he joined the RS/6000 performance group, where he worked on multiprocessor performance modeling and analysis. Since the fall of 2000, Brian has been with Sun Labs working on the performance analysis of Sun systems.