Computer Architecture Seminar Abstracts

Spring 2006

Glenn Henry
Centaur Technology

How to make a highly secure x86 processor


The talk will cover two topics. The first is the design & build methodology and tools that allow Centaur to design a very small 2-GHz Pentium 4-compatible processor with only 30 designers. The second is a description of Centaur's embedded high-performance security features (such as AES encryption). The physical design of these security components will be used as an example to explore the overall design & build methodology.


Glenn Henry is the founder and president of Centaur Technology. Throughout his career, he has played an integral role in the development of the computer industry in the U.S.

Prior to founding Centaur in April 1995, Henry served as a consultant to MIPS Technology (SGI) for one year. From 1988 to 1994 he was Chief Technology Officer and Senior Vice President of the Product Group at Dell Computer Corporation. In that position, he was responsible for all product development activities and, at various times, also responsible for product marketing, manufacturing, procurement, information systems and technical support.

Before his tenure at Dell, Henry served 21 years with IBM. He was the instigator, lead architect and development manager responsible for the IBM System/32, System/38 (forerunner of AS/400), and RT/PC (forerunner of Power systems). In 1985, he was appointed an IBM Fellow.

Jean-Yves Bouguet

PIRO: Benchmarking a Personal Image Retrieval System


It is now common to have accumulated tens of thousands of personal pictures. Efficient access to that many pictures can only be done with a robust image retrieval system. This application is of high interest to processor architects. It is highly compute intensive, and could motivate end users to upgrade their personal computers to the next generations of processors.

A key question is how to assess the robustness of a personal image retrieval system. Personal image databases are very different from the digital libraries that have been used by many Content Based Image Retrieval Systems. For example, a personal image database has a lot of pictures of people, but of a small set of different people, typically family, relatives, and friends. Pictures are taken in a limited set of places such as home, work, school, and vacation destinations. The most frequent queries are searches for people and for places. These attributes, and many others, affect how a personal image retrieval system should be benchmarked, and the benchmarks need to be different from existing ones based on, for example, art images or medical images.

The attributes of the data set do not change the list of components needed to benchmark such systems: data sets, query tasks, ground truth, and evaluation measures.

This talk proposes a way to build these components to be representative of personal image databases, and of the corresponding usage models.
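
As a hedged illustration of how the four components fit together, the sketch below pairs a tiny data set with one query task and its ground truth, then scores a ranked result list with precision and recall at k. The file names, the query, and the choice of precision and recall are assumptions made for illustration; the abstract does not say which evaluation measures the talk uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryTask:
    """A query plus the set of images judged relevant to it (the ground truth)."""
    description: str
    relevant: frozenset


def precision_recall_at_k(ranked, task, k):
    """Score the top-k entries of a ranked result list against the ground truth."""
    top_k = ranked[:k]
    hits = sum(1 for image in top_k if image in task.relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(task.relevant) if task.relevant else 0.0
    return precision, recall


# Hypothetical data set and query task.
data_set = ["img01.jpg", "img02.jpg", "img03.jpg", "img04.jpg", "img05.jpg"]
task = QueryTask("pictures of Grandma at home",
                 frozenset({"img01.jpg", "img04.jpg"}))

# Ranked output of some retrieval system under test (made up here).
ranked_results = ["img04.jpg", "img02.jpg", "img01.jpg", "img05.jpg", "img03.jpg"]

precision, recall = precision_recall_at_k(ranked_results, task, k=3)
```

With these made-up inputs, two of the top three results are relevant, so precision at 3 is 2/3 and recall at 3 is 1.0.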


Jean-Yves Bouguet has been a Senior Researcher at Intel's Microprocessor Research Labs since 1999. He received his diplome d'ingenieur from the Ecole Superieure d'Ingenieurs en Electrotechnique et Electronique (ESIEE, Paris) in 1994 and the M.S. and Ph.D. degrees in Electrical Engineering from the California Institute of Technology (Caltech) in 1994 and 1999, respectively. His main research interests are computer vision and computer graphics.

During his thesis work, he developed and patented a simple and inexpensive method for scanning objects using shadows. Subsequently, he developed modeling techniques, for which he also holds a patent, that combine 3D geometry capture and scene reflectance acquisition for realistic rendering of real and synthetic scenes with complex shape and surface characteristics. Jean-Yves has received a number of distinctive awards, including the J. Walker von Brimer award for "extraordinary accomplishments in the field of 3D photography" in 1999. Recently, his research focus has moved to applying computational vision techniques to image and video mining applications, with a special emphasis on search and retrieval in personal image and video collections.

Matthew Arnold
IBM T.J. Watson Research Center

The Future of Virtual Machine Performance


Users of virtual machines care most about two aspects of performance: startup and throughput. In this talk, I will give a brief overview of the techniques commercial VMs use to improve these aspects of performance, and discuss the challenges that still remain. I will then present two new, nontraditional approaches for making progress in these areas.

1) Improving startup performance using a cross-run profile repository (OOPSLA'05). Despite the important role that profiling plays in achieving high performance, current virtual machines discard a program's profile data at the end of execution. Our work presents a fully automated architecture for exploiting cross-run profile data in virtual machines. This work addresses a number of challenges that previously limited the practicality of such an approach.

2) Throughput performance: "Online Performance Auditing" (PLDI'06). This work describes an online framework for evaluating the effectiveness of optimizations, enabling an online system to automatically identify and correct performance anomalies that occur at runtime. This work encourages a shift in the way optimizations are developed and tuned for online systems, and may allow much of the work in offline empirical optimization search to be applied automatically at runtime.

All of this work is implemented and evaluated using IBM's product J9 Java Virtual Machine.
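
The first idea, keeping profile data across runs instead of discarding it, can be sketched in a few lines. This is a minimal illustration and not the architecture from the OOPSLA'05 work: the JSON file, the function names, the method names, and the invocation-count threshold are all hypothetical.

```python
import json
import os
import tempfile
from collections import Counter


def load_profile(path):
    """Return invocation counts saved by earlier runs (empty on the first run)."""
    if not os.path.exists(path):
        return Counter()
    with open(path) as f:
        return Counter(json.load(f))


def save_profile(path, previous, current):
    """Merge this run's counts into the repository instead of discarding them."""
    merged = previous + current
    with open(path, "w") as f:
        json.dump(merged, f)
    return merged


def methods_to_compile_early(profile, threshold):
    """Pick methods that were hot in earlier runs so they can be optimized at startup."""
    return sorted(name for name, count in profile.items() if count >= threshold)


repo_path = os.path.join(tempfile.mkdtemp(), "profile.json")

# Run 1: no history yet, so nothing is selected for early optimization.
history = load_profile(repo_path)
first_run_plan = methods_to_compile_early(history, threshold=1000)
save_profile(repo_path, history, Counter({"parse": 5000, "render": 40}))

# Run 2: the saved profile flags "parse" as hot before execution starts.
history = load_profile(repo_path)
second_run_plan = methods_to_compile_early(history, threshold=1000)
```

The first run has nothing to act on, while the second run can select `parse` immediately. A real virtual machine would also have to cope with stale or unrepresentative profiles, which this sketch ignores.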


Matthew Arnold received his Ph.D. from Rutgers University in 2002, and is now a Research Staff Member at the IBM T.J. Watson Research Center in Hawthorne, NY. For his thesis work he developed low-overhead profiling techniques and showed how they can be used to drive feedback-directed optimization in a virtual machine; this work is currently used in IBM's product JVM. He has worked with the Jikes Research Virtual Machine and IBM's production JVM, and continues to use both for his research. His current interests include virtual machine performance, low overhead profiling, and dynamic analysis of software.

Rodric Rabbah

Toward Introspective and Adaptive System Architectures


The performance gap between processor and memory has widened continuously over the last decade. As emerging multicore architectures are packing even more computational power onto a single chip, the memory bottleneck is becoming a central obstacle to achieving scalability. Such architectures generally magnify long memory access latencies, and require locality aware and latency hiding techniques to prevent the memory system from becoming a severe performance bottleneck.

This talk will describe a simple and effective methodology for mitigating the memory bottleneck. The strategy leverages speculative and predicated execution, and is readily applicable to commercial processors available today. In this work, the compiler uses cache-miss profiling to focus on a relatively small set of delinquent program references that suffer expensive cache misses. The compiler then automatically embeds new instructions into the host program to orchestrate runtime data management. The new instructions execute as part of the same instruction stream as their host, but effectively run ahead to carry out various optimizations that improve the overall performance. This talk will focus on data prefetching as one such optimization. In an implementation for the Itanium Processor Family, the optimization led to 30% faster execution, with an average 45% reduction in memory stalls. A significant aspect of this work is its ability to dynamically adapt to runtime information and dynamic behavior. For example, the compiler-embedded instructions self-nullify when they are likely to increase the burden on the memory system. The ability to dynamically change execution behavior marks a significant step toward autonomous, introspective, and adaptive applications.
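
The effect of run-ahead prefetching can be illustrated with a toy cache model. This is a sketch under strong assumptions, not the compiler technique itself: it counts demand misses only, treats prefetch latency as fully hidden, and does not model predication or self-nullification. The cache organization, trace, and prefetch distance are all hypothetical.

```python
from collections import OrderedDict


class TinyCache:
    """A small fully associative LRU cache that only tracks which lines are resident."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def touch(self, line):
        """Bring a line into the cache; return True if it was already resident."""
        hit = line in self.lines
        if hit:
            self.lines.move_to_end(line)
        else:
            self.lines[line] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict the least recently used line
        return hit


def demand_misses(addresses, capacity, prefetch_distance=0):
    """Count demand misses, optionally prefetching `prefetch_distance` accesses ahead."""
    cache = TinyCache(capacity)
    misses = 0
    for i, line in enumerate(addresses):
        ahead = i + prefetch_distance
        if prefetch_distance and ahead < len(addresses):
            cache.touch(addresses[ahead])  # embedded run-ahead prefetch
        if not cache.touch(line):
            misses += 1
    return misses


# A hypothetical delinquent reference: every access touches a new cache line.
trace = list(range(100))
baseline = demand_misses(trace, capacity=16)
with_prefetch = demand_misses(trace, capacity=16, prefetch_distance=4)

# With a cache too small for the prefetch distance, prefetched lines are
# evicted before they are used, so prefetching mostly adds memory traffic.
polluted = demand_misses(trace, capacity=8, prefetch_distance=4)
```

In this model the prefetching run misses only on the first few accesses, while the undersized cache shows the kind of case where prefetches add to the burden on the memory system, which is the situation the self-nullifying instructions are meant to avoid.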

This talk will also describe a transparent, lightweight, and online profiling scheme that identifies long latency memory references. The technique is part of a general-purpose dynamic instrumentation and code manipulation framework. The combination affords the possibility of performing memory-centric optimizations dynamically, as the application executes. The technique does not require modifications to the program source code, and works on general-purpose programs as well as legacy and third-party binaries. The transparency is especially important since these applications must run efficiently on emerging architectures for which they were not originally designed.


Rodric Rabbah is involved in several projects as a Research Scientist at MIT. He is a leading contributor to StreamIt, a domain specific language and compiler for stream programming. He also leads the development of Reptile, an explicitly parallel compiler for tiled architectures. Currently, he is developing metrics to systematically categorize applications based on their runtime characteristics. This work culminates in VersaBench, a new benchmark suite intended to aid architects in the design of future microprocessors. Since 1999, he has led the development of the Trimaran VLIW processor simulator. Trimaran is an open-source compilation and simulation infrastructure for EPIC and VLIW research, and is used at more than thirty universities worldwide.

Mattan Erez
Stanford University

Merrimac -- High-Performance, Highly-Efficient Scientific Computing with Streams


Advances in VLSI technology have made the raw ingredients for computation plentiful. Large amounts of fast functional units, memory, and bandwidth can be made efficient in terms of chip area, cost, and energy; however, high-performance computers realize only a small fraction of VLSI's potential. In this talk I will describe the Merrimac streaming supercomputer, which is being developed with an integrated view of the applications, software system, compiler, and architecture. I will show how this approach leads to an order of magnitude gain in performance per unit cost, unit power, and unit floor-space for scientific applications compared to common scientific computers designed around clusters of conventional CPUs. The talk will cover Merrimac's stream architecture, mapping scientific applications to effectively run on the stream architecture, and system issues in the Merrimac supercomputer.

The stream architecture is designed to take advantage of the properties of modern semiconductor technology --- very high bandwidth over short distances and very high transistor counts, but limited global on-chip and off-chip bandwidths --- and match them with the characteristics of scientific codes --- large amounts of parallelism and access locality. Organizing the computation into streams and exploiting the resulting locality using a register hierarchy enables a stream architecture to reduce the memory bandwidth required by representative computations by an order of magnitude or more. Hence a processing node with a fixed memory bandwidth (which is expensive) can support an order of magnitude more arithmetic units (which are inexpensive). Because each node has much greater performance (128 double-precision GFLOPs in our current design) than a conventional microprocessor, a streaming supercomputer can achieve a given level of performance with fewer nodes, reducing costs, simplifying system management, and increasing reliability.
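
The bandwidth argument above comes down to simple arithmetic, sketched here. Only the 128 GFLOPs node figure comes from the abstract; the two arithmetic-intensity values (floating-point operations per word fetched from memory) are hypothetical numbers chosen to differ by an order of magnitude.

```python
def required_bandwidth(target_flops, flops_per_memory_word):
    """Memory words per second needed to keep the arithmetic units busy."""
    return target_flops / flops_per_memory_word


def supported_flops(bandwidth_words_per_s, flops_per_memory_word):
    """Arithmetic rate that a fixed memory bandwidth can sustain."""
    return bandwidth_words_per_s * flops_per_memory_word


NODE_FLOPS = 128e9  # per-node peak quoted in the abstract

# Hypothetical intensities: without and with stream-level locality.
conventional_intensity = 4.0
stream_intensity = 40.0

bw_conventional = required_bandwidth(NODE_FLOPS, conventional_intensity)
bw_stream = required_bandwidth(NODE_FLOPS, stream_intensity)
bandwidth_reduction = bw_conventional / bw_stream

# Equivalently, a fixed memory bandwidth feeds ten times more arithmetic.
flops_gain = (supported_flops(bw_stream, stream_intensity)
              / supported_flops(bw_stream, conventional_intensity))
```

Under these assumed intensities, both `bandwidth_reduction` and `flops_gain` come out to a factor of ten, the same relationship the abstract states in words.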


Mattan Erez received a B.Sc. in Electrical Engineering and a B.A. in Physics from the Technion, Israel Institute of Technology in 1999. He subsequently received his M.S. in Electrical Engineering from Stanford University in 2002. His previous work experience includes army service at a technical research branch of the Israeli Defense Force, and working as a computer architect in the Israeli Processor Architecture Research team, Intel Corporation. As a Ph.D. candidate at Stanford University he participated in the Smart Memories project and is currently leading the Merrimac Stanford Streaming Supercomputer project, where his main areas of interest are architecture and its interaction with the compilation system and the programmer.

Edward Suh
Pufco, Inc.

AEGIS: Architectural EnGine for Information Security


The Internet is expanding into the physical world, connecting billions of devices. In this expanded network, two contradictory trends are appearing. On the one hand, the cost of security breaches is increasing as we place more responsibilities on the devices that surround us. On the other hand, computing elements are becoming small, disseminated, unsupervised, and physically exposed. Unfortunately, existing computing systems do not address physical threats, presenting a significant vulnerability in future embedded systems.

We have built a tamper-resistant platform using a single-chip secure processor called AEGIS. Our platform protects applications from physical attacks as well as software attacks. This enables several applications such as secure sensor networks, certified execution, and copy protection of media and software. This talk will describe the architecture of the AEGIS secure processor and its key primitives, namely, physical random functions, memory encryption and integrity verification.

Physical Unclonable Functions (or PUFs) are a tamper resistant way of establishing shared secrets with a physical device. They rely on the inevitable manufacturing variations between devices to produce an identity for a device. This identity is arguably unclonable.
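
The challenge-response idea can be mimicked in software as a rough illustration. This is a toy model, not the AEGIS circuit: a real PUF is a physical structure, its responses are noisy, and it needs error correction, none of which is modeled here. The class, the delay model, and the seeds standing in for manufacturing variation are all hypothetical.

```python
import random


class SimulatedPUF:
    """Toy software stand-in for a delay-based PUF."""

    def __init__(self, manufacturing_seed, stages=64):
        # The seed stands in for uncontrollable process variation.
        rng = random.Random(manufacturing_seed)
        self.stage_delays = [rng.gauss(0.0, 1.0) for _ in range(stages)]

    def response(self, challenge):
        """Each challenge bit selects the sign of one stage's delay difference."""
        total = sum(d if bit else -d
                    for d, bit in zip(self.stage_delays, challenge))
        return 1 if total > 0 else 0


def random_challenges(count, stages=64, seed=0):
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(stages)] for _ in range(count)]


def enroll(puf, challenges):
    """Record challenge-response pairs while the device is in trusted hands."""
    return [(c, puf.response(c)) for c in challenges]


def authenticate(puf, crps):
    """A device is accepted only if it reproduces every recorded response."""
    return all(puf.response(c) == r for c, r in crps)


genuine = SimulatedPUF(manufacturing_seed=1)
other = SimulatedPUF(manufacturing_seed=2)
crps = enroll(genuine, random_challenges(128))

genuine_ok = authenticate(genuine, crps)
other_ok = authenticate(other, crps)
```

The enrolled device reproduces its own responses, while a second device with different simulated variation does not match the recorded pairs.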

Memory encryption and integrity verification protect content stored in external memory, and are essential to build a secure computing system that is powerful enough to run applications requiring large memory. The talk will discuss memory encryption and integrity verification schemes that are secure, yet efficient and practical.
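
One common way to verify the integrity of external memory is a hash tree whose root is kept in trusted on-chip storage. The sketch below assumes that approach for illustration only; the abstract does not say which scheme the talk presents. It omits encryption entirely, and the block contents and function names are hypothetical.

```python
import hashlib


def h(data):
    return hashlib.sha256(data).digest()


def build_tree(blocks):
    """Return the tree levels, leaves first; assumes a power-of-two block count."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def verify_block(block, index, levels, trusted_root):
    """Recompute the path from one block to the root using stored sibling hashes."""
    node = h(block)
    for level in levels[:-1]:
        sibling = level[index ^ 1]
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == trusted_root


memory = [b"block0", b"block1", b"block2", b"block3"]
levels = build_tree(memory)       # intermediate hashes may live in untrusted memory
trusted_root = levels[-1][0]      # only the root needs to stay on-chip

intact = verify_block(memory[2], 2, levels, trusted_root)
tampered = verify_block(b"evil", 2, levels, trusted_root)
```

Reading an unmodified block verifies against the root, while a modified block produces a different root and is rejected. Checking one block touches only a logarithmic number of hashes, which is what makes the approach practical for large memories.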

We have fabricated and tested Physical Unclonable Function chips in TSMC 0.18 µm technology, and implemented the AEGIS processor on an FPGA.


Edward Suh recently received a Ph.D. degree in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology (MIT). Currently, he is leading an effort to develop secure embedded processors at Pufco, Inc. He has worked in the areas of high performance memory systems, embedded processors, and secure hardware architecture, and has co-authored over a dozen papers in these areas. His current research focuses on secure computing systems, in particular, secure processors and their applications.


Arvind

Is Hardware Innovation Over?


Does the spread of multicore architectures mean the demise of Application Specific Integrated Circuits (ASICs)? Power-constrained handheld devices may be one of the most important economic drivers for the semiconductor industry in the coming decades. Will future cell phone functionality be delivered primarily through multi-core processors? Or will it be through reconfigurable FPGAs or a system composed of heterogeneous blocks? We will describe how it is possible to synthesize, quickly and efficiently, large and complex SoCs from a library of microarchitectural IP blocks, including embedded PowerPC models, DSPs and a variety of specialized hardware blocks (radios, MPEG4 decoders, ...). Our project will provide, among other things, PowerPC "gateware" for others to use, and will shed light on how IP blocks should be written to be easily modifiable and reusable.


Arvind is the Johnson Professor of Computer Science and Engineering at MIT, where he has been since 1979. In 1992, his group, in collaboration with Motorola, built the Monsoon dataflow machines and their associated software. A dozen of these machines were built and installed at Los Alamos National Labs and at universities before Monsoon was retired to the Computer Museum in California.

In 2000, Arvind took a two-year leave of absence to start Sandburst, a fabless semiconductor company, to produce a chip set for 10G-bit Ethernet routers. In 2003, Arvind co-founded Bluespec Inc., an EDA company, to produce a set of tools for high-level synthesis. He currently serves on the boards of both Sandburst and Bluespec.

In 2001, Dr. R. S. Nikhil and Arvind published the book "Implicit Parallel Programming in pH". Arvind's current research interests are synthesis and verification of large digital systems described using Guarded Atomic Actions, and Memory Models and Cache Coherence Protocols for parallel architectures and languages.

Dileep Bhandarkar

Multi-Core Microprocessor Chips: Motivation & Challenges


Advances in semiconductor process technology allow hundreds of millions of transistors to be integrated on a single chip. In 2005, Intel's Montecito chip, built in 90 nm technology with dual cores and a large cache, was the first billion-transistor chip. Nanotechnology that continues to drive Moore's Law provides a doubling of the transistor density every two years. Multi-core chips will become common not only in high-end servers but also in desktop and mobile PCs.

Multi-core processors present several challenges related to on-chip system architecture, power management, reliability, and software scaling. This talk will touch on some of these challenges and discuss some possible solutions.


Dr. Bhandarkar is an IEEE Fellow and a Distinguished Alumnus of the Indian Institute of Technology, Bombay, where he received his B.Tech. in Electrical Engineering in 1970. He also has an M.S. and Ph.D. in Electrical Engineering from Carnegie Mellon University, and has done graduate work in Business Administration at the University of Dallas. He is currently Director of the Enterprise Architecture Lab in processors and chipsets. He has been with Intel since 1995 and has managed system architecture and performance analysis activities.

Prior to joining Intel, he spent almost 18 years at Digital Equipment Corporation, where he managed processor and system architecture and performance analysis work related to the VAX, Prism, MIPS, and Alpha architectures. He also worked at Texas Instruments for 4 years in their research labs in a variety of areas including magnetic bubble memories, charge coupled devices, fault tolerant memories, and computer architecture. Dr. Bhandarkar holds 15 U.S. patents and has published more than 30 technical papers in various journals and conference proceedings. He is also the author of a book titled Alpha Architecture and Implementations.