Microarchitectural Support for Automatic Parallelization
This talk will describe my group's ongoing research on effectively and practically parallelizing general-purpose programs for small-scale parallel systems (on the order of about 8 single-threaded superscalar cores). Our approach to this problem has been "careful" speculative parallelization. Data dependences are learned dynamically, to avoid relying on "brittle" compiler analyses and transformations, but enforced conservatively, to avoid the low success probabilities inherent in techniques like value speculation. I will describe the compiler techniques we use to find thread boundaries that allow complete control independence of threads, the dynamic slicing technique we use to implement an efficient dynamic dataflow engine, and the dependence prediction mechanism we use to perform accurate pointer analysis. Together these mechanisms allow us to effectively parallelize general-purpose programs without dramatically increasing the number of instructions speculatively fetched or executed.
Matthew Frank is an Assistant Professor of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. His degrees include a B.S. in Computer Science from the University of Wisconsin-Madison (1994), an M.S. in Computer Science from the Massachusetts Institute of Technology (1997), and a Ph.D. in Computer Science from the Massachusetts Institute of Technology (2003). His research interests include Computer System Architecture and Compilers. He and his students are currently designing PolyFlow, an implicitly parallel architecture that automatically parallelizes programs as they run.
DNA Self-assembly and Computer System Fabrication
The migration of circuit fabrication technology from the microscale to the nanoscale has generated a great deal of interest in how the fundamental physical limitations of materials will change the way we engineer computer systems. The changing relationships between performance, defects, and cost have motivated research into so-called disruptive or exotic technologies. This talk will present the theory, design, and methods of fabrication for DNA self-assembled nanostructures within the context of circuit fabrication. The advantages of this technology go beyond the simple scaling of device feature sizes (sub-20nm) to enable new modes of computation that are impractical under the constraints of conventional fabrication methods. A brief survey of several computer architectures that can take advantage of this new technology will also be presented.
Dwyer received his B.S. in computer engineering from the Pennsylvania State University in 1998, and his M.S. and Ph.D. in computer science from the University of North Carolina at Chapel Hill in 2000 and 2003, respectively. From 2003 to 2004 he worked as a Postdoctoral Fellow in the Department of Physics & Astronomy at UNC and as a Visiting Assistant Professor in the Department of Computer Science at Duke. Dwyer joined the Department of Electrical & Computer Engineering at Duke University as an Assistant Professor in 2004.
Core Fusion: Accommodating Software Diversity in Chip Multiprocessors
Chip multiprocessors (CMPs) hold the prospect of delivering long-term performance growth by integrating more cores on the die with each new technology generation. In the short term, on-chip integration of a few relatively large cores may yield sufficient throughput when running multiprogrammed workloads. However, harnessing the full potential of CMPs in the long term makes a broad adoption of parallel programming inevitable.
We envision a CMP-dominated future where a diverse landscape of software in different stages of parallelization exists at all times. Unfortunately, in this future, the inherent rigidity in current proposals for CMP designs makes it hard to come up with a "universal" CMP that can accommodate this software diversity.
In this talk I will discuss Core Fusion, a CMP architecture in which cores can "fuse" into larger cores on demand to execute sequential code very fast, while still retaining the ability to operate independently to run highly parallel code efficiently. Core Fusion builds upon a substrate of fundamentally independent cores and conventional memory coherence/consistency support, and enables the CMP to dynamically morph into different configurations to adapt to the changing needs of software at run-time. Core Fusion does not require specialized software support; it leverages mature microarchitecture technology, and it can interface with the application through small extensions encapsulated in ordinary parallelization libraries, macros, or directives.
José Martínez (Ph.D.'02 Computer Science, UIUC) is an assistant professor of electrical and computer engineering and a graduate field member of computer science at Cornell University. He leads the M3 Architecture Research Group at Cornell, whose interests include multicore architectures, reconfigurable and self-optimizing hardware, and hardware-software interaction. Martínez's work has been selected for IEEE Micro Top Picks twice (2003 and 2007). In 2005, he and his students received the Best Paper Award at HPCA-11 for their work on checkpointed early load retirement. Martínez is also the recipient of an NSF CAREER Award and, more recently, an IBM Faculty Award. His teaching responsibilities at Cornell include computer architecture at both undergraduate and graduate levels. He also organizes the AMD Computer Engineering Lecture Series.
IBM T.J. Watson Research Center
IBM eDRAM: What's all the Fuss?
As the "nano" era of microelectronics approaches, technologists have begun to predict the "end of scaling" for six-transistor SRAM. Though easily 50% of modern microprocessor silicon area is occupied by caches, a particularly long-standing debate has surrounded one dense storage alternative: embedded DRAM. This talk will shed light on the technology causes of the infamous memory wall, provide a tutorial on the technology behind eDRAM, and look at the architectural fundamentals of latency, density, and availability pertaining to use of SRAM replacements in future systems.
Hillery Hunter is a Research Staff Member in the Exploratory Systems Architecture Department of IBM's T.J. Watson Research Center in Yorktown Heights, NY. Her current research focuses on next-generation cache design, leveraging advances in storage cells, arrays, and microarchitecture. She is interested in cross-disciplinary research, spanning circuits, microarchitecture, and compilers to achieve new solutions to traditional problems. She received the Ph.D. degree in Electrical Engineering from the University of Illinois, Urbana-Champaign in 2004.
Comprehensive Detection of Errors in Multithreaded Memory Systems
Multithreaded architectures, including multicore processors and multithreaded uniprocessors, are becoming ubiquitous. Our goal is to detect all possible errors in the memory systems of these machines, without resorting to large amounts of expensive and power-hungry redundancy. Because correct operation of the memory system is defined by the memory consistency model, we can detect errors by checking if the observed memory system behavior deviates from the specified consistency model. We have designed a framework for dynamic verification of memory consistency (DVMC), and this framework applies to all existing commercial consistency models. Our DVMC framework consists of mechanisms to dynamically verify three invariants that we have proven to be equivalent to memory consistency. We have developed an implementation of the framework for the SPARCv9 architecture, and we have experimentally evaluated its performance using full-system simulation of commercial workloads.
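The central idea, that an error is detectable as a deviation of observed memory behavior from the consistency model, can be sketched with a toy checker. This is an illustration only (not the talk's DVMC invariants or its SPARCv9 implementation): given per-thread traces of reads and writes, it searches for a sequentially consistent interleaving that explains the observed read values, and reports an error when none exists.

```python
def is_seq_consistent(threads):
    """Toy dynamic verification: is there an interleaving that respects
    each thread's program order in which every read returns the most
    recent write to its address (memory initially all zeros)?"""
    n = len(threads)
    pos = [0] * n              # next operation index per thread
    mem = {}                   # memory state along the current search path

    def dfs():
        if all(pos[i] == len(threads[i]) for i in range(n)):
            return True        # every observed operation explained
        for i in range(n):
            if pos[i] == len(threads[i]):
                continue
            op, addr, val = threads[i][pos[i]]
            if op == "R" and mem.get(addr, 0) != val:
                continue       # this read could not have executed next
            saved = mem.get(addr, 0)
            if op == "W":
                mem[addr] = val
            pos[i] += 1
            if dfs():
                return True
            pos[i] -= 1        # backtrack
            if op == "W":
                mem[addr] = saved
        return False

    return dfs()

# A Dekker-style outcome forbidden under sequential consistency:
# both threads write 1, yet both reads observe 0.
violation = [
    [("W", "x", 1), ("R", "y", 0)],
    [("W", "y", 1), ("R", "x", 0)],
]
# is_seq_consistent(violation) is False: no legal interleaving explains
# the observed values, so a checker in this style would flag an error.
```

Real DVMC verifies equivalent invariants with hardware mechanisms rather than search, but the pass/fail criterion is the same: observed behavior must be explainable under the specified model.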
Daniel J. Sorin is an assistant professor of Electrical and Computer Engineering and of Computer Science at Duke University. His research interests include dependable computer architecture and system design. He received a Ph.D. and an M.S. in electrical and computer engineering from the University of Wisconsin, and a B.S.E. in electrical engineering from Duke University. He is the recipient of an NSF CAREER Award and a Warren Faculty Scholarship at Duke.
IBM T.J. Watson Research Center
An Analysis of the Effects of Miss Clustering on the Cost of a Cache Miss
A new technique, called Pipeline Spectroscopy, is described that allows pipeline delays to be monitored and analyzed in detail. We use this technique to measure the cost of each cache miss. The cost of a miss is displayed (graphed) as a histogram, which represents a precise readout showing a detailed visualization of the cost of each cache miss throughout all levels of the memory hierarchy. We call these graphs 'spectrograms' because they reveal certain signature characteristics of the processor's memory hierarchy, the pipeline, and the miss pattern itself. Cache miss spectrograms are produced by analyzing misses according to miss cluster size: the instruction sequences and execution times near each miss cluster in a 'finite cache' simulation run are compared to the same set of instructions and execution times in an 'infinite cache' run, and the difference in run times is calculated. We show that in a memory hierarchy with N cache levels (L1, L2, ..., LN, and memory) and a miss cluster of size C, there are (C+N) choose C possible combinations of miss penalties. These represent all possible sums from all possible combinations of the miss latencies from each level of the memory hierarchy (L2, L3, ..., memory) for a given cluster size. Additionally, a theory is presented that describes the shape of a spectrogram, and we use this theory to predict the shape of spectrograms for larger miss clusters. Detailed analysis of a spectrogram leads to much greater insight into pipeline dynamics, including effects due to prefetching and miss queueing delays.
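The counting argument can be checked with a small enumeration. The latencies below are illustrative assumptions, not values from the talk; with K distinct latency sources, a cluster of C misses is a size-C multiset of latencies, of which there are (C+K-1) choose C. The talk's (C+N) choose C figure is this same formula with K = N+1 sources, since (C+(N+1)-1) choose C = (C+N) choose C.

```python
from itertools import combinations_with_replacement
from math import comb

# Hypothetical per-source miss latencies (cycles); real values
# depend on the machine being measured.
latencies = {"L2": 10, "L3": 40, "memory": 200}
C = 3                   # miss-cluster size
K = len(latencies)      # number of distinct latency sources

# Each cluster is a size-C multiset of latencies, so the number of
# possible penalty combinations is C multichoose K = (C+K-1) choose C.
combos = list(combinations_with_replacement(latencies.values(), C))
assert len(combos) == comb(C + K - 1, C)   # 10 combinations for K=3, C=3

# The resulting penalty sums are the discrete "lines" a spectrogram
# would show for this cluster size, from (10,10,10)=30 to
# (200,200,200)=600.
penalty_sums = sorted(sum(c) for c in combos)
```

Distinct latency sources whose sums coincide would merge lines in practice; the enumeration here counts combinations, as the talk does.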
Thomas R. Puzak received a B.S. in Mathematics and an M.S. in Computer Science from the University of Pittsburgh and a Ph.D. in Electrical and Computer Engineering from the University of Massachusetts. Since joining IBM he has spent over thirty years working in IBM Research. While at IBM he has received Technical Achievement, Outstanding Contribution, and Innovation Awards, served as Chairman of the Computer Architecture Special Interest Group at the T. J. Watson Research Center, and holds more than 30 patents on processor and memory design.
Carnegie Mellon University
Temporal Memory Streaming
While semiconductor scaling has steadily improved processor performance, scaling trends in memory technology have favored improving density over access latency. Because of this processor/memory performance gap (often called the memory wall), modern server processors spend over half of execution time stalled on long-latency memory accesses. To improve average memory response time for existing software, architects must design mechanisms that issue memory requests earlier and with greater parallelism. Commercial server applications present a particular challenge for memory system design because their large footprints, complex access patterns, and frequent chains of dependent misses are not amenable to existing approaches for hiding memory latency. Despite their complexity, these applications nonetheless execute repetitive code sequences, which give rise to recurring access sequences, a phenomenon I call temporal correlation. In this talk, I present Temporal Memory Streaming, a memory system design paradigm where hardware mechanisms observe repetitive access sequences at runtime and use recorded sequences to stream data from memory in parallel and in advance of explicit requests.
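The record-and-replay idea behind temporal correlation can be sketched in a few lines. This is a toy model, not the hardware design from the talk: it logs the global sequence of miss addresses and, when an address misses again, streams the addresses that followed it last time.

```python
class TemporalStreamer:
    """Toy temporal-correlation streamer: record the global miss-address
    sequence; on a repeat miss, replay the addresses that followed the
    previous occurrence, ahead of explicit requests."""

    def __init__(self, depth=4):
        self.log = []          # global miss-address history
        self.last_seen = {}    # address -> most recent position in log
        self.depth = depth     # how far ahead to stream

    def on_miss(self, addr):
        prefetches = []
        if addr in self.last_seen:
            start = self.last_seen[addr] + 1
            prefetches = self.log[start:start + self.depth]
        self.last_seen[addr] = len(self.log)
        self.log.append(addr)
        return prefetches

# First traversal of a dependent-miss chain: nothing to predict yet.
s = TemporalStreamer(depth=2)
for addr in [0x10, 0x20, 0x30, 0x40]:
    s.on_miss(addr)

# Second traversal: the recorded sequence is streamed in advance,
# hiding the latency of the dependent misses that follow.
assert s.on_miss(0x10) == [0x20, 0x30]
```

A hardware realization must bound the history and tolerate imperfect repetition; this sketch shows only why recurring sequences make dependent misses predictable at all.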
Tom Wenisch is completing his Ph.D. in Electrical and Computer Engineering at Carnegie Mellon University, specializing in computer architecture. Tom's current research includes memory streaming, multiprocessor memory system design and computer system performance evaluation. His future research will focus on multi-core/multiprocessor systems, with particular emphasis on improving system programmability and debuggability.
University of Michigan
Designing Efficient Processors through Acceleration and Virtualization
Consumers will always demand more performance from their computer systems. Real-time ray tracing and speech recognition are just two of the many compelling applications that remain outside of the computational capabilities of current computers. Recently, however, the traditional method of attaining performance through higher processor clock frequencies has driven power consumption to a point where it is too expensive to cool the processors. This trend, known as The Power Wall, has led to a focus on computational efficiency (getting the most work out of each joule) as a key metric in computer system designs.
This talk will describe some of my work on making computation more efficient in processors. First, I will present the design of a novel accelerator specifically targeted to execute the most common acyclic computation patterns from a wide range of applications. Next, I will describe a virtualization technique to make integrating the accelerator into computer systems as cost-effective as possible. Finally, I will demonstrate how the accelerator design and virtualization ideas generalize to a much broader set of accelerators, and discuss some preliminary results and future directions in this area.
Nate Clark is currently a member of the Compilers Creating Custom Processors research group at the University of Michigan, where he received his B.S.E. and M.S.E. and will shortly receive his Ph.D. Nate's research interests lie broadly in computer architecture and compilers, and more specifically in customizing computer architectures for particular application domains and in the compilation/virtualization challenges that inevitably arise from such customization. His dissertation work has led to twelve publications, five patent filings, and a few industry prototypes that might show up in your cell phone one day. Nate is also one of the primary developers of Trimaran, a research compiler used by more than 50 universities worldwide.
IBM T.J. Watson Research Center
Compiler Optimizations for Highly Constrained Multithreaded Multicore Processors
As processor performance for general applications starts to plateau due to limiting factors like power and temperature, multicore processors designed for domain-specific applications have recently emerged as a promising alternative. They can achieve much higher performance due to simplified and specialized architectural designs. Meanwhile, by pushing part of the complexity to the compiler and adding extra hardware constraints, cores can be clocked much faster and made much smaller. With ample cores on die, these processors are often heavily multithreaded. However, to reap the full benefits of their processing power, it is critical for the compiler to generate efficient code under the additional hardware constraints.
In this presentation, I will talk about my research on Intel's IXP processor, which is specially designed for network applications. To take advantage of packet-level parallelism, this processor incorporates many multithreaded cores. The hardware imposes extra constraints on operand fetching, which must be properly handled by the compiler. Many threads are kept simultaneously active so that long-latency operations can be overlapped through fast context switches with little overhead. However, this mechanism greatly increases register pressure. Moreover, an operating system is considered too expensive to install, although some OS services are still badly needed. We proposed a number of compiler techniques to address the hardware constraints, increase resource sharing across threads, and manage thread execution intelligently. Through careful compiler optimizations, we were able to achieve up to 50% performance improvement and eliminate most of the unnecessary stalls (another 20-30% speedup). Some of the optimizations were subsequently implemented by Intel in their research compiler.
Dr. Xiaotong Zhuang is currently a postdoctoral researcher at IBM T.J. Watson Research Center. He received his Ph.D. from Georgia Tech's College of Computing, with a minor in ECE. Xiaotong also holds a B.E. in EE and an M.S. in CS from Shanghai Jiaotong University. His main areas of interest are compilers (especially backend compiler optimizations), secure computer architecture, and embedded systems. He has published about 30 papers in conferences and journals. In addition, he received the Outstanding Graduate Research Assistant Award from Georgia Tech upon graduation and the Invention Achievement Award from IBM Research.
Barcelona: AMD's Next-Generation Quad-core Microprocessor
This talk introduces Barcelona, an upcoming native quad-core microprocessor from AMD. The presentation explains many of the performance features of the processor and the integrated northbridge, including the wider "SSE128" media and floating-point units, details of the core microarchitecture, the more efficient DRAM controller, and the shared L3 cache. Additionally, we discuss virtualization and virtualization performance, and some of the new power-saving technologies introduced on Barcelona.
Ben Sander is a Principal Member of Technical Staff and manager of the performance modeling group at AMD in Austin, TX. He joined AMD in 1995 and has worked on several generations of AMD microprocessors, including the AMD K5, AMD Athlon(tm), and AMD Opteron(tm) processors. His group is responsible for the performance modeling and micro-architecture development for next-generation AMD microprocessors. Ben holds 14 patents and has numerous pending applications, and is an IEEE member.
He studied in Professor Wen-mei Hwu's IMPACT group and graduated from the University of Illinois with a B.S. in 1993 and an M.S. in 1995.
Hardware and Software Support for Parallel Network Services
Although multicore processors are now pervasive, the performance of such systems depends entirely on the ability of the target applications to exploit parallelism. This talk first presents Aspen, a parallel programming language and runtime system that currently targets network service applications. Aspen programs resemble task flowcharts, with the nodes being instances of computational modules and the edges being unidirectional communication channels. Aspen automatically and transparently supports task-level parallelism among module instances and data-level parallelism across different flows in an application or, in some cases, across different work items within a flow. Aspen adaptively allocates threads to modules according to the dynamic workload seen at those modules. Experimental results indicate performance competitive with (and sometimes better than) current server programming models while using 54-96% fewer lines of user code.
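The flowchart-of-modules model can be sketched with threads and queues. This is a hedged illustration of the programming model only, not Aspen itself: the module names and two-stage flow are hypothetical, and Aspen's adaptive thread allocation is not modeled (each module here gets a fixed worker count).

```python
import queue
import threading

def module(in_q, out_q, fn, workers=2):
    """An Aspen-style module instance: `workers` threads pull work items
    from an input channel, apply fn, and push results downstream.
    Multiple workers per module give data-level parallelism across
    independent flows; distinct modules run concurrently (task-level
    parallelism)."""
    def loop():
        while True:
            out_q.put(fn(in_q.get()))
    for _ in range(workers):
        threading.Thread(target=loop, daemon=True).start()

# A two-module flowchart: parse -> respond, linked by unidirectional
# channels (queues). Module bodies are placeholders for real work.
requests, parsed, replies = queue.Queue(), queue.Queue(), queue.Queue()
module(requests, parsed, str.upper)
module(parsed, replies, lambda s: s + " ok")

for r in ("get", "put"):
    requests.put(r)
results = sorted(replies.get() for _ in range(2))
# results == ["GET ok", "PUT ok"]
```

Because channels are the only coupling between modules, a runtime is free to vary the number of workers per module as load shifts, which is the knob Aspen's adaptive allocation turns.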
This talk also presents LineSnort, a self-securing programmable Ethernet controller. LineSnort parallelizes the Snort network intrusion detection system (NIDS) using concurrency across TCP sessions and executes those parallel tasks on multiple low-frequency/low-power RISC cores. LineSnort additionally exploits opportunities for intra-session concurrency based on domain-specific characteristics of NIDS. The system includes dedicated hardware for high-bandwidth data transfers and for high-performance string matching. Detailed simulation results show that LineSnort can achieve intrusion detection throughputs in excess of 1 Gbps for fairly large rule sets, thus offloading the computationally difficult task of intrusion detection from a server's host CPU and enabling protection against both external and LAN-based attacks.
This talk includes research performed jointly with Derek Schuff, Gautam Upadhyaya, and Sam Midkiff.
Vijay S. Pai received a BSEE degree in 1994, an MS degree in electrical and computer engineering in 1997, and a Ph.D. degree in Electrical and Computer Engineering in 2000, all from Rice University. He joined the faculty of Purdue University in August 2004. Prior to that, he had served as an assistant professor at Rice University (2001-2004) and as a senior developer at iMimic Networking (1999-2001). He received the NSF CAREER award in 2003 and Purdue Eta Kappa Nu's William H. Hayt Outstanding Instructor award in 2006.