Graphics Hardware & GPU Computing: Past, Present, and Future
Modern GPUs have emerged as the world's most successful parallel architecture. GPUs provide a level of massively parallel computation that was once the preserve of supercomputers like the MasPar and Connection Machine. For example, NVIDIA's GeForce GTX 280 is a fully programmable, massively multithreaded chip with 240 cores and up to 30,720 concurrent threads, capable of performing up to a trillion operations per second. The raw computational horsepower of these chips has expanded their reach well beyond graphics. Today's GPUs not only render video game frames, they also accelerate physics computations, video transcoding, image processing, astrophysics, protein folding, seismic exploration, computational finance, radio astronomy - the list goes on. Enabled by platforms like the CUDA architecture, which provides a scalable programming model, researchers across science and engineering are accelerating applications in their disciplines by up to two orders of magnitude. These success stories, and the tremendous scientific and market opportunities they open up, imply a new and diverse set of workloads that in turn carry implications for the evolution of future GPU architectures.
In this talk I will discuss the evolution of GPUs from fixed-function graphics accelerators to general-purpose massively parallel processors. I will briefly motivate GPU computing and explore the transition it represents in massively parallel computing: from the domain of supercomputers to that of commodity "manycore" hardware available to all. I will discuss the goals, implications, and key abstractions of the CUDA architecture. Finally I will close with a discussion of future workloads in games, high-performance computing, and consumer applications, and their implications for future GPUs including the newly announced "Fermi" architecture.
Dr. David Luebke
Senior Manager, NVIDIA Research
David Luebke helped found NVIDIA Research in 2006 after eight years on the faculty of the University of Virginia. Luebke received his Ph.D. under Fred Brooks at the University of North Carolina in 1998. His principal research interests are GPU computing and real-time computer graphics. Luebke's honors include the NVIDIA Distinguished Inventor award, the NSF CAREER and DOE Early Career PI awards, and the ACM Symposium on Interactive 3D Graphics "Test of Time Award". Dr. Luebke has co-authored a book, a SIGGRAPH Electronic Theater piece, a major museum exhibit visited by over 110,000 people, and dozens of papers, articles, chapters, and patents.
University of Pennsylvania
Secure Low-Level Programming via Hardware-Assisted Memory-Safe C
Many security vulnerabilities and memory corruption bugs stem from a design flaw in the C programming language: its lack of memory bounds checking. Although modern languages such as Java avoid such problems by enforcing memory safety, most low-level systems code that exists today is written in C or C++. In this talk I will describe both the significant obstacles to efficiently retrofitting legacy C code with complete bounds checking and the solutions my group has been developing to meet these challenges. Specifically, we have proposed a hardware-assisted approach (HardBound) and a software-only compiler-based implementation (SoftBound), both of which use disjoint storage of pointer metadata to provide efficient and highly compatible bounds checking for legacy C source code.
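The disjoint-metadata idea at the heart of HardBound and SoftBound can be illustrated with a toy model. The real systems operate on C code, via hardware tags or a compiler transformation; everything below, including the class and exception names, is a hypothetical Python sketch of the concept only: bounds live in a shadow table keyed by pointer identity rather than inside the pointed-to object, so data layout (and hence compatibility with legacy code) is preserved.

```python
# Toy model of disjoint pointer metadata, in the spirit of SoftBound.
# Bounds live in a shadow table separate from the pointed-to data, so
# the layout of objects themselves is unchanged.

class MemorySafetyError(Exception):
    pass

class Pointer:
    _bounds = {}  # disjoint metadata: pointer identity -> (base, bound)

    def __init__(self, buf, base, bound, offset=0):
        self.buf, self.offset = buf, offset
        Pointer._bounds[id(self)] = (base, bound)  # set at allocation time

    def __add__(self, n):
        # Pointer arithmetic propagates the original object's bounds.
        base, bound = Pointer._bounds[id(self)]
        return Pointer(self.buf, base, bound, self.offset + n)

    def load(self):
        # Every dereference is checked against the shadow metadata.
        base, bound = Pointer._bounds[id(self)]
        if not (base <= self.offset < bound):
            raise MemorySafetyError("out-of-bounds load")
        return self.buf[self.offset]

def malloc(n):
    # Allocation registers (base, bound) for the new pointer.
    return Pointer([0] * n, 0, n)
```

In this model, `malloc(4)` yields a pointer whose arithmetic results stay checkable: dereferencing `p + 3` succeeds while `p + 4` faults, mirroring how the real systems detect buffer overflows at the point of access.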
Milo Martin is an Assistant Professor in the Computer and Information Science Department at the University of Pennsylvania. His research focuses on making computers easier to design, verify, and program. Specific projects include transactional memory, adaptive cache coherence protocols, hardware-aware verification of concurrent software, and hardware-assisted memory-safe implementations of the C programming language. Dr. Martin is a recipient of the NSF CAREER award and received a PhD from the University of Wisconsin-Madison.
University of Michigan
A Case Against Unbridled Parallelism
The fundamental problem with the shared-memory multi-threaded programming model is that it exposes an unbounded number of thread interleavings to the parallel runtime system. Current testing methods focus on stress testing, which tries to expose as many different thread interleavings as possible. But it remains impractical for programmers to test and ensure the correctness of all possible thread interleavings.
I will first argue that instead of investing more and more effort in stress testing, we should develop runtime mechanisms that constrain thread interleavings during production runs to avoid untested interleavings, which I will show can significantly reduce the chance of triggering a concurrency bug. I will discuss techniques for encoding tested interleavings in a program's binary, and hardware support for efficiently enforcing those constraints in production runs.
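The idea of admitting only tested interleavings can be sketched with a toy model: each static shared-memory access site records which remote sites were observed to immediately precede it during testing, and a production run flags any access whose actual predecessor is outside that set. The site names, table layout, and fallback policy below are invented for illustration; the real proposal encodes this information in the binary and enforces it with hardware support.

```python
# Toy model of constraining production runs to tested interleavings.
# During testing, we record for each access site the remote sites that
# immediately preceded it on the same location; production runs admit
# only those predecessors.

tested_predecessors = {
    # static access site -> remote sites allowed to immediately precede it
    "B:read_x": {"A:write_x"},  # in testing, B read x only after A's write
}

last_access = {}  # location -> static site of the most recent access

def access(site, location, allowed=tested_predecessors):
    prev = last_access.get(location)
    if prev is not None and site in allowed and prev not in allowed[site]:
        # Untested interleaving: a real system would stall this thread
        # (or fall back to serialized execution) rather than proceed.
        return False
    last_access[location] = site
    return True
```

Running the tested order (A's write, then B's read) succeeds, while an unseen interleaving (say, an untested writer slipping in before B's read) is detected and would be delayed rather than silently executed.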
I will also talk about deterministic replay, which could help programmers understand and debug a multi-threaded program execution by allowing them to reproduce the thread interleaving seen during an execution. Prior software techniques incur more than 10x runtime performance overhead, but I will discuss a speculative recording technique that enabled us to build a software record-and-replay system that incurs only about 30-50% overhead.
Satish Narayanasamy is an Assistant Professor in the EECS Department at the University of Michigan. He has a Ph.D. in Computer Science from the University of California, San Diego. His research interests include computer architecture, hardware mechanisms and software tools for programming many-cores, and system reliability. He has received two IEEE Top Picks awards.
Why Design Must Change: Rethinking Digital Design
In the mid-1980s, the power growth that accompanied scaling forced the industry to focus on CMOS technology, leaving nMOS and bipolar technologies for niche applications. Twenty years later, CMOS technology is facing power issues of its own. After first reviewing the "cause" of the problem, it will become clear that there are no easy solutions this time: no new technology or simple system/circuit change will rescue us. Power, and not the number of devices, is now the primary limiter of chip performance, and the need to create power-efficient designs is changing how we do design. In the past we would turn to specialized computation (ASICs) to create the needed efficiency, but the rising NRE costs for chip design (now over $10M/chip) have caused the number of ASIC design starts to fall, not rise.
To get out of this paradox, we need to change the way we think about chip design. For many reasons I don't believe that either the current SoC approach or high-level language efforts will solve this problem. Instead, we should acknowledge that working out the interactions in a complex design is complex, and will cost a lot of money even when we do it well. So once we have worked it out, we want to leverage this solution over a broader class of chips. We can accomplish this by creating a "fixed" system architecture built out of very flexible components. That is, instead of building a programmable chip to meet a broad class of application needs, you create a virtual programmable chip that is MUCH more flexible than any real chip. The application designer (the new chip designer) then configures this substrate to optimize for their application and creates that chip. To demonstrate how this might work, we use a multiprocessor generator to create a customized CMP that executes H.264 encode with energy efficiency comparable to an ASIC. As this example shows, for very low-energy computation DRAM energy can be an issue, and we will end the talk by describing how to address this final energy frontier.
Mark Horowitz is the Chair of the Electrical Engineering Department and the Yahoo! Founders Professor of the School of Engineering at Stanford University. In addition he is Chief Scientist at Rambus Inc. He received his BS and MS in Electrical Engineering from MIT in 1978, and his PhD from Stanford in 1984. Dr. Horowitz has received many awards including a 1985 Presidential Young Investigator Award, the 1993 ISSCC Best Paper Award, the ISCA 2004 Most Influential Paper of 1989, and the 2006 Don Pederson IEEE Technical Field Award. He is a fellow of IEEE and ACM and is a member of the National Academy of Engineering and the American Academy of Arts and Science.
Dr. Horowitz's research interests are quite broad, spanning the application of EE and CS analysis methods to problems in molecular biology as well as new design methodologies for analog and digital VLSI circuits. He has worked on many processor designs, from early RISC chips to some of the first distributed shared-memory multiprocessors, and is currently working on on-chip multiprocessor designs. Recently he has worked on a number of problems in computational photography. In 1990, he took leave from Stanford to help start Rambus Inc., a company designing high-bandwidth memory interface technology, and has continued work in high-speed I/O at Stanford. His current research includes multiprocessor design, low-power circuits, high-speed links, computational photography, and applying engineering to biology.
IBM T.J. Watson Research Center
Scaling the Memory Wall with Phase Change Memories
DRAM has been the building block for main memory systems for several decades. However, with each technology generation, a significant portion of the total system power and total system cost is spent in the DRAM memory system, and this trend continues to grow, making DRAM a less desirable choice for future, larger system memories. Therefore, architects and system designers must look at alternative technologies for growing memory capacity. Phase-Change Memory (PCM) is an emerging technology that is denser than DRAM and can boost memory capacity in a scalable and power-efficient manner. However, PCM has its own unique challenges, such as higher read latency (than DRAM), much higher write latency, and limited lifetime due to write endurance.
In this talk I will focus on architectural solutions that can leverage the density and power-efficiency advantages of PCM while addressing its challenges. I will propose a "Hybrid Memory" system that combines PCM-based main memory with a DRAM buffer, thereby obtaining the capacity benefits of PCM and the latency benefits of DRAM. I will then describe a simple, novel, and efficient wear-leveling technique for PCM memories that obtains near-perfect lifetime while incurring a storage overhead of less than 13 bytes. Finally, I will describe extensions to PCM memories that can adaptively "cancel" or "pause" write requests to reduce the latency of read requests when there is significant contention from the (slow) write requests.
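The tiny storage overhead is possible because the wear-leveling mapping can be purely algebraic, in the spirit of the Start-Gap scheme (MICRO'09): N logical lines map onto N+1 physical lines, a roving "gap" line migrates one position at a time, and only two registers (Start and Gap) are needed to translate addresses. The sketch below is a simplified Python model of that idea; the register names and the move schedule are assumptions for illustration, not the exact published design.

```python
# Sketch of Start-Gap-style wear leveling: N logical lines on N+1
# physical lines, with a roving gap that slowly shifts every logical
# line across all physical locations. Only two small registers are
# needed, so the metadata overhead is a handful of bytes.

class StartGap:
    def __init__(self, n):
        self.n = n
        self.start = 0      # rotation count
        self.gap = n        # gap begins at physical line n

    def translate(self, la):
        # Algebraic logical-to-physical mapping; no per-line table.
        pa = (la + self.start) % self.n
        if pa >= self.gap:  # lines at or past the gap are shifted by one
            pa += 1
        return pa

    def gap_move(self):
        # In hardware, each move copies one line's data next to the gap
        # (physical [gap-1] into [gap]; on wrap, line n into line 0).
        if self.gap == 0:   # gap wrapped around: advance start, reset gap
            self.start = (self.start + 1) % self.n
            self.gap = self.n
        else:
            self.gap -= 1
```

At every step the mapping remains a bijection from the N logical lines into the N+1 physical lines, and over time each logical line visits many physical locations, spreading writes evenly.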
Dr. Moinuddin Qureshi is a research staff member at IBM Research. His research interests include computer architecture, scalable memory systems, fault-tolerant systems, and analytical modeling of computer systems. His recent research effort is focused on exploiting emerging technologies for scalable and power-efficient memories and has led to the following contributions: a hybrid memory system using PCM (ISCA'09), efficient wear leveling for PCM (MICRO'09), and write pausing in PCM (HPCA'10). He holds three US patents and has more than a dozen publications in flagship architecture conferences. He contributed to the development of efficient caching algorithms for the Power7 processor. He received his PhD from the University of Texas at Austin in 2007.
End-to-End Critical Path Analysis
Many important workloads today, such as web-hosted services, are limited not by processor core performance but by interactions among the cores, the memory system, I/O devices, and the complex software layers that tie these components together. Identifying performance bottlenecks is difficult because, as in any concurrent system, overheads in one component may be hidden due to overlapping with other operations.
Critical path analysis is a well-known approach to identifying bottlenecks in highly concurrent systems. However, building dependence graphs for this analysis typically requires detailed domain knowledge, making it difficult to apply across all the hardware and software components in a system. We address this problem by developing a straightforward methodology for identifying end-to-end critical paths across software and simulated hardware in complex networked systems. By modeling systems as collections of state machines interacting via queues, we can trace critical paths through multiplexed processing engines, identify when resources create bottlenecks (including abstract resources such as flow-control credits), and predict the benefit of eliminating bottlenecks by increasing hardware speeds or expanding available resources.
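As a toy illustration of critical-path extraction over such a dependence graph: once events are nodes with durations and "must happen before" edges, the critical path to any event is simply the longest path reaching it. The event names, durations, and dependences below are made up for illustration; the real system derives the graph automatically from state-machine and queue interactions in the simulator.

```python
# Toy critical-path extraction over a dependence graph.
# Nodes are processing steps with durations; edges point from a step
# to the steps it depends on (e.g. a copy depends on protocol
# processing, which depends on packet receipt).

from functools import lru_cache

durations = {"nic_rx": 2, "kernel_tcp": 5, "copy_to_user": 3, "app": 4}
deps = {  # node -> nodes it depends on (hypothetical example graph)
    "kernel_tcp": ["nic_rx"],
    "copy_to_user": ["kernel_tcp"],
    "app": ["copy_to_user", "nic_rx"],
}

def critical_path(sink):
    # Longest path to `sink`: the chain of events that actually bounds
    # end-to-end latency; everything off it is overlapped (hidden).
    @lru_cache(None)
    def longest(node):
        preds = deps.get(node, [])
        if not preds:
            return durations[node], [node]
        t, path = max(longest(p) for p in preds)
        return t + durations[node], path + [node]
    return longest(sink)
```

Here the path through the TCP stack dominates (total 14 time units), while the direct nic_rx-to-app dependence is overlapped and thus hidden, which is exactly why speeding up an off-path component would not help end-to-end performance.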
We implement our technique in a full-system simulator, instrumenting a TCP microbenchmark, a web server, the Linux TCP/IP stack, and a simulated Ethernet controller. From a single run of the microbenchmark, our tool correctly identifies, within minutes, a series of bottlenecks, and predicts the performance of hypothetical systems in which these bottlenecks are successively eliminated, potentially saving hours or days of head scratching and repeated simulations.
This is joint work with Ali Saidi (ARM), Nate Binkert (HP Labs), and Trevor Mudge (Michigan).
Steven K. Reinhardt is a Fellow in AMD's Research and Advanced Development Labs, where he is investigating future system architectures. Steve is also an Adjunct Associate Professor at the University of Michigan. From 2006 to 2008, Steve was at Reservoir Labs, Inc., where he managed the development of a DOE-sponsored, network-processor-based high-speed intrusion-prevention system. From 1997 through 2006 he was a full-time faculty member at the University of Michigan, researching system architectures for high-performance TCP/IP networking, cache and memory system design, multithreading, and processor reliability.