The Future of Multi-core: Intel's Tera-scale Computing Research
The Intel® Tera-scale Computing Research Program is a worldwide research effort to create the platforms of the next decade, with capabilities only dreamed of today. This requires embracing a shift to massive parallelism: scalable multi-core architectures, platforms, and software that use tens to hundreds of cores to efficiently process hundreds of threads and terabytes of data. This presentation covers Intel's research vision and how the 80-core Teraflops Research Processor advances that vision with a tiled design, an on-chip interconnect fabric, and innovations in energy management.
Jim Held is an Intel Fellow who leads a virtual team of architects conducting Tera-Scale Computing Research in Intel's Corporate Technology Group. Since joining Intel in 1990, he has led research and development in a variety of Intel's architecture labs concerned with media and interconnect technology, systems software, multi-core processor architecture and virtualization. Before coming to Intel, Jim worked in research and teaching capacities in the Medical School and Department of Computer Science at the University of Minnesota where he earned a Ph.D. (1988) in Computer and Information Science.
Technical University of Catalonia
Supercomputing for the Future, Supercomputing from the Past
Supercomputers, once built on technology developed from scratch, are nowadays built from commodity components. On one hand, this means that designers of such systems must closely monitor the evolution of mass-market developments. On the other hand, supercomputing becomes a driving force for future technology and systems, providing requirements for their performance and design. As fundamental limits of the single-processor-per-chip approach, in terms of performance/power ratio, are already on the table, multicore chips and massive parallelism have become the way to achieve the required performance levels. A hierarchical structure, in both hardware and software, is the unavoidable approach to building future supercomputing systems. The talk will first address how these systems have been built in the past and how we envisage their design in the near future.
The gap between peak and real performance of current systems will become worse unless designers adopt a vertical approach spanning processor, node, and system design (including the interconnect); parallel programming models; dynamic resource management to improve load balancing; tools for performance analysis, prediction, and optimization; and new numerical methods, algorithms, and applications. Research and proposals in this direction will be presented in the framework of the future 10/100-Petaflops architecture for the MareNostrum site at the Barcelona Supercomputing Center.
Professor Valero was born in 1952 in a small town in Aragon, not far from Zaragoza. He obtained his Telecommunication Engineering degree from the Technical University of Madrid (UPM) in 1974 and his Ph.D. in telecommunications from the Technical University of Catalonia (UPC) in 1980. He has been teaching at UPC since 1974. In 1983, he became a full professor in the Computer Architecture Department. He has served as Chair of the Computer Architecture Department (1983-84, 1986-87, 1989-90, and 2001-2005) and Dean of the Computer Engineering School (1984-85). Today, he is a Full Professor in the Department and has been the founding Director (since its inception in 2004) of the Barcelona Supercomputing Center, a 25-million-dollar-a-year enterprise funded by the governments of Spain and Catalunya.
His research is in the area of computer architecture, with special emphasis on high-performance computers: processor organization, memory hierarchy, systolic array processors, interconnection networks, numerical algorithms, compilers, and performance evaluation. He has co-authored over 400 publications: more than 250 in conference proceedings and the rest in journals and book chapters. He has graduated more than 30 Ph.D. students in Computer Architecture, 12 of whom are today full professors at leading engineering departments in Spain.
Intelligent Speculation for Pipelined Multithreading
In recent years, microprocessor manufacturers have shifted their focus from single-core to multicore processors. To avoid burdening programmers with the responsibility of parallelizing their applications, some researchers have advocated automatic thread extraction. Within the scientific computing domain, automatic parallelization techniques have been successful, but in the general-purpose computing domain few, if any, techniques have achieved comparable success.
Despite this, recent progress hints at mechanisms to unlock parallelism from general-purpose applications. In particular, two promising proposals exist in the literature. The first, a group of techniques loosely classified as thread-level speculation (TLS), attempts to adapt techniques successful in the scientific domain, such as DOALL and DOACROSS parallelization, to the general-purpose domain by using speculation to overcome complex control flow and data access patterns not easily analyzed statically. The second, a non-speculative technique called Decoupled Software Pipelining (DSWP), partitions loops into long-running, fine-grained threads organized into a pipeline (pipelined multithreading, or PMT). DSWP effectively extends the reach of conventional software pipelining to codes with complex control flow and variable-latency operations.
Unfortunately, both techniques suffer key limitations. TLS techniques either suffer from over-speculation, in an attempt to speculatively transform a loop into a DOALL loop, or realize little parallelism in practice because DOACROSS parallelization puts core-to-core communication latency on the critical path. DSWP avoids these pitfalls with its pipeline organization and decoupled execution using inter-core communication queues. However, its non-speculative nature and the restrictions needed to ensure a pipeline organization prevent DSWP from achieving balanced parallelism on many key application loops.
In this talk, I present two key contributions that advance the state of automatic parallelization of general-purpose applications. First, I propose extending pipelined multithreaded execution with intelligent speculation. Rather than speculating all loop-carried dependences to transform loops into DOALL loops, I propose speculating only the key predictable dependences that inhibit balanced, pipelined execution. I will present results from our automatic compiler transformation, Speculative DSWP, demonstrating the efficacy of this technique. Second, to support decoupled speculative execution, I will describe an extension to a multi-core architecture's memory subsystem allowing it to support memory versioning. The proposed memory system resembles those present in TLS architectures, but provides efficient execution in the presence of large transactions, many simultaneous outstanding transactions, and eager data forwarding between uncommitted transactions. In addition to supporting usage patterns exhibited by speculative pipelined multithreading, the proposed memory system facilitates existing and future speculative threading techniques.
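The pipeline organization at the heart of PMT can be illustrated in a few lines. The sketch below is a toy Python analogue (not the compiler-generated code the talk describes): a loop is split into two long-running stages connected by a bounded queue, so the producer stage streams values to the consumer without waiting on its latency.

```python
import queue
import threading

def pmt_pipeline(items):
    """Toy pipelined-multithreading sketch: stage 1 walks the loop and
    computes a value per iteration; stage 2 consumes it.  The queue
    decouples the stages, keeping inter-stage communication latency
    off the critical path."""
    q = queue.Queue(maxsize=8)          # bounded inter-stage queue
    results = []

    def stage1():                       # e.g., the control-flow-heavy part
        for x in items:
            q.put(x * x)                # forward value to the next stage
        q.put(None)                     # end-of-stream token

    def stage2():                       # e.g., the reduction / side effects
        while True:
            v = q.get()
            if v is None:
                break
            results.append(v + 1)

    t1 = threading.Thread(target=stage1)
    t2 = threading.Thread(target=stage2)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results
```

The FIFO queue preserves iteration order, and each stage runs for the whole loop rather than per iteration, which is what distinguishes PMT from DOACROSS-style parallelization.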
Neil Vachharajani is a Ph.D. student in the Department of Computer Science at Princeton University. His research interests include compilers, computer architecture, and programming languages, particularly focused on concurrency and multi-core architectures. Neil is an NSF graduate fellow and holds a BSE in Electrical Engineering and an MA in Computer Science, both from Princeton University.
With each technology generation, we are experiencing an increased rate of cosmically induced soft errors in our chips. We are starting to see a dark side to Moore's Law, in which the increased functionality we get from our exponentially increasing number of transistors is countered by an exponentially increasing soft error rate, which will take growing effort and cost to cope with. Architectural solutions to this problem are inherently expensive and often not cost-effective for the commodity processor market.
In this talk, I will present a new breed of cost-effective fault-identification techniques called fault screeners. Fault screeners probabilistically detect whether a transient fault has affected the state of a processor. I will demonstrate that fault screeners work because of two key characteristics. First, much of the intermediate data generated by a program inherently falls within certain consistent bounds. Second, these bounds are often violated by the introduction of a fault. Thus, fault screeners can identify faults by directly watching for data inconsistencies arising in an application's behavior.
I will present an idealized algorithm capable of identifying over 85% of injected faults on the SpecInt suite and over 75% on average overall. Further, in a realistic implementation on a simulated Pentium-III-like processor, about half of the errors due to injected faults are identified while still in speculative state. Errors detected this early can be eliminated by a pipeline flush. A hardware-based version of this screening algorithm reduces overall performance by less than 1%.
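The two characteristics above suggest a very simple screening loop. The sketch below is a deliberately simplified, hypothetical rendering of the idea (not the algorithm evaluated in the talk): learn per-location value bounds during normal execution, then flag any value that falls outside the learned range as a suspected transient fault.

```python
class FaultScreener:
    """Toy fault screener: learn per-location [min, max] bounds from
    observed values, then flag out-of-bounds values as suspected
    transient faults.  A real screener would bound its tables and
    tolerate learning noise; this sketch ignores both issues."""

    def __init__(self):
        self.bounds = {}  # location (e.g., register/PC) -> (min, max)

    def train(self, loc, value):
        """Widen the bounds for loc to include value."""
        lo, hi = self.bounds.get(loc, (value, value))
        self.bounds[loc] = (min(lo, value), max(hi, value))

    def check(self, loc, value):
        """Return True if value is inconsistent with loc's history."""
        if loc not in self.bounds:
            return False                  # no history: cannot screen
        lo, hi = self.bounds[loc]
        return not (lo <= value <= hi)    # True => suspected fault
```

For example, after training on loop-counter values in the range 10..20, a single high-order bit flip produces a value far outside the bounds and is screened, while in-range values pass; this also shows why the technique is probabilistic, since a fault that lands inside the bounds goes undetected.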
Shubu Mukherjee is a Principal Engineer and Director of Intel's SPEARS Group (Simulation and Pathfinding of Efficient and Reliable Systems). The SPEARS Group is responsible for spearheading architectural change and innovation in the delivery of enterprise processors and chipsets by building and supporting simulation and analytical models of performance, power, and reliability. Dr. Mukherjee is widely recognized both within and outside Intel as one of the experts on architecture design for soft errors. He has made pioneering contributions towards the design of Redundant Multithreading (RMT) techniques, architectural vulnerability modeling for soft errors, creation of performance modeling infrastructure called Asim (jointly with Dr. Joel Emer), design of the Alpha 21364 interconnection network, and the creation of the first shared memory prediction scheme.
Prior to joining Intel, Shubu worked at Compaq for 3 years and Digital Equipment Corporation for 10 days. Dr. Mukherjee received his B.Tech. from the Indian Institute of Technology, Kanpur, and his M.S. and Ph.D. from the University of Wisconsin-Madison. He was the General Chair of ASPLOS (Architectural Support for Programming Languages and Operating Systems) in 2004. He has co-authored over 40 external papers, holds 8 patents, and has filed over 30 more at Intel. Dr. Mukherjee's book, "Architecture Design for Soft Errors," has just been published.
University of Illinois, Urbana-Champaign
Multiprocessor Architectures for Speculative Multithreading
One of the biggest challenges facing computer architecture today is the design of parallel architectures that make it easy for programmers to write parallel codes. One of the architectural technologies that is showing great versatility and potential in this direction is Speculative Multithreading. In this talk, I will discuss the many uses of this technology in multiprocessors, and its remarkable potential for performance and programmability (Thread-Level Speculation, Speculative Synchronization, Transactional Memory, and BulkSC), hardware reliability (Paceline), and software dependability (ReEnact and Iwatcher).
Josep Torrellas (http://iacoma.cs.uiuc.edu) is a Professor and Willett Faculty Scholar at the University of Illinois. Prior to joining Illinois, Torrellas received a PhD from Stanford University. He also spent a year at IBM's T.J. Watson Research Center. Torrellas's research area is multiprocessor computer architecture. He has been involved in the Stanford DASH and the Illinois Cedar multiprocessor projects, and led the Illinois Aggressive COMA and FlexRAM Intelligent Memory projects.
Exploiting Multicore Parallelism with Dynamic Instrumentation and Compilation
The emerging multicore era has brought many opportunities and challenges to systems research. Two of the challenges I have been focusing on are (i) how to provide detailed analysis of parallel programs and (ii) how to map the computations in a parallel program onto the underlying hardware to achieve optimal performance.
For (i), we have developed the Pin dynamic instrumentation system, which has become very popular for writing architectural and program analysis tools. By inserting instrumentation code on the fly, Pin can perform fine-grain monitoring of the architectural state of a program. As an example, I will discuss a parallel programming tool called Thread Checker, which we built with Pin, for detecting common parallel programming bugs such as data races and deadlocks. I will also discuss the dynamic compilation techniques behind Pin. In addition, I will present an extension of Pin called PinOS, which performs whole-system instrumentation (i.e., including both the OS and applications) using virtualization techniques.
For (ii), I have developed the Qilin parallel programming system, which exploits the hardware parallelism available on machines with a multicore CPU and a GPU. Qilin provides a C++ API for writing data-parallel operations so that the compiler is alleviated from the difficult job of extracting parallelism from serial code. At runtime, the Qilin compiler automatically partitions these API calls into tasks and maps these tasks to the underlying hardware using an adaptive algorithm. Preliminary results show that our parallel system can achieve significant speedups (above 10x) over the serial case for some important computation kernels.
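The adaptive mapping idea can be sketched compactly. The function below is a hypothetical, simplified stand-in for Qilin's runtime decision (the real system profiles and refines its model across runs): given throughput estimates for each device from a small profiling run, split the work so that both devices finish at roughly the same time.

```python
def adaptive_split(total_work, cpu_rate, gpu_rate):
    """Toy Qilin-style adaptive mapping: cpu_rate and gpu_rate are
    measured throughputs (elements/sec) from a small profiling run.
    Splitting work in proportion to throughput equalizes the two
    devices' finish times, minimizing the overall execution time."""
    gpu_share = gpu_rate / (cpu_rate + gpu_rate)
    gpu_work = round(total_work * gpu_share)
    return total_work - gpu_work, gpu_work   # (cpu_work, gpu_work)
```

For instance, with a GPU measured at 4x the CPU's throughput on a kernel, 1000 elements would be split 200/800, so both devices take the same wall-clock time instead of the GPU idling while the CPU finishes its half.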
Finally, I will outline my future work in parallel programming, compilation, and virtualization.
Chi-Keung (CK) Luk is currently a Senior Staff Researcher in the Software Pathfinding and Innovation Group at Intel, where he conducts research and advanced development in parallel programming, dynamic compilation, computer architecture, program analysis tools, and virtualization. Most recently, he founded the Qilin parallel programming system project and the PinOS whole-system instrumentation project. He was also a core developer of both the Pin dynamic instrumentation system and the Ispike Itanium binary optimizer.
CK obtained his Ph.D. from the University of Toronto, under the supervision of Todd Mowry. He also spent two years as a visiting scholar at Carnegie Mellon University. He has over 20 publications and one issued patent with another five pending. He has served on the program committees of WBIA'05, MSP'02, and MICRO'01.
Among the honors CK has received, he is most proud of the Intel Achievement Award (the most prestigious award at Intel), which he received in 2008 for his contributions to Pin, and his nomination for the ACM Doctoral Dissertation Award in 2000.
IBM TJ Watson Research Center
Datacenters of the Future
New workloads are creating opportunities for novel optimized computing platforms in the datacenter. Furthermore, modern datacenters are growing due to economies of scale and face significant challenges around power, underutilization, and high management cost.
The first part of this presentation will focus on how the requirements of workload consolidation, real-world-aware computing, and network-optimized computing will result in a diversity of platforms optimized for power and cost. I will discuss optimal SMP design points, stream processing, and the role of massive multicore and hybrid architectures.
The second part of the presentation will focus on the simplification of systems management. A new "datacenter architecture" is emerging to support massive application growth. This trend, coupled with key technology trends such as virtualization and autonomic, homogeneous server ensembles, will lead to fundamental changes in traditional enterprise datacenters. I will describe an exciting "living lab" we have created at IBM Research to explore this new datacenter architecture.
Dr. Tilak Agerwala is Vice President of Systems at IBM Research. He is responsible for all IBM's Systems research activities worldwide in Deep Computing (for example Blue Gene, Cell and the DARPA HPCS project) and commercial systems (for example, BladeCenter, System p, and mainframes). This research spans the space from microprocessors and tools to operating systems and systems management, and also includes novel algorithms and computational biology. Tilak received the W. Wallace McDowell Award from the IEEE in 1998 for "outstanding contributions to the development of high performance computers." He is a founding member of the IBM Academy of Technology and a Fellow of the Institute of Electrical and Electronics Engineers. He received his B.Tech. in electrical engineering from the Indian Institute of Technology, Kanpur, India and his Ph.D. in electrical engineering from the Johns Hopkins University, Baltimore, Maryland.
Stream Programming: Luring Programmers into the Multicore Era
As the computer industry has moved to multicore processors, the historic trend of exponential performance improvements will now depend on ordinary programmers and their ability to parallelize their code. However, most programmers are already overwhelmed by the complexity of modern software and are unwilling to expend extra effort on parallelization. Hence, for programmers to embrace a parallel abstraction, we believe that it must come with new capabilities--unrelated to parallelism--that simplify application development and lure programmers into changing their ways.
In this talk, I will describe stream programming: an inherently parallel model that also offers powerful new capabilities for the domain of multimedia, graphics, and digital signal processing (DSP). Programs with a streaming structure are naturally parallelized on a multicore target. At the same time, streaming language abstractions enable the compiler to automate tasks that are typically performed by a DSP expert, including whole-program algebraic simplification and translation from the time domain to the frequency domain. By automating such transformations, stream programming reduces the overall burden on programmers and enables them to transition to the multicore era.
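The structure being described, programs built as graphs of filters over streams, can be suggested with a small sketch. The Python below is an illustrative analogue using generators (StreamIt-style languages express these filters directly, and a compiler maps them to cores), showing a source, a stateless filter, and a sliding-window filter of the kind common in DSP:

```python
def source(n):
    """Stream source filter: emits the integers 0..n-1."""
    yield from range(n)

def scale(stream, k):
    """Stateless filter: multiplies each element by k.  Because it
    keeps no state, a stream compiler can replicate it across cores
    for data parallelism."""
    for x in stream:
        yield x * k

def moving_sum(stream, w):
    """Peeking filter: sliding-window sum of width w, the kind of
    windowed (FIR-like) computation streaming languages make
    explicit, enabling whole-program DSP optimizations."""
    window = []
    for x in stream:
        window.append(x)
        if len(window) == w:
            yield sum(window)
            window.pop(0)

# Compose the filters into a pipeline: source -> scale -> moving_sum.
pipeline = moving_sum(scale(source(6), 2), 3)
```

Because the producer/consumer relationships are explicit in the graph, each filter can be placed on its own core with queues between them, which is exactly the parallelism the talk argues comes "for free" from the streaming structure.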
Bill Thies is a Ph.D. candidate at the Massachusetts Institute of Technology, where he is a member of the Computer Science and Artificial Intelligence Laboratory. His research focuses on programming languages and compilers for emerging technologies, from multicore architectures to microfluidic chips. He is also interested in creating appropriate information technologies for use in developing countries. Bill earned a B.S. in computer science, a B.S. in mathematics, and an M.Eng. in computer science, all from MIT.
University of Illinois, Urbana-Champaign
Hardware-Software Co-Design for General-Purpose Processors
The shift toward multi-core processors is the most obvious implication of a greater trend toward efficient computing. In the past, hardware designers were willing to spend superlinear area and power for incremental performance improvements, but that era has come to an end. With the low-hanging fruit of processor microarchitecture having largely been picked, it is my belief that we will increasingly see a trend toward co-designing hardware with the software that runs on it. Processor designers will ask, "What minimal features and interfaces must be placed in hardware to achieve our performance goals?"
In this talk, I will discuss our recent work exploring a collection of hardware primitives for: 1) making trivial the implementation of speculative compiler optimizations (which both increase performance and reduce power consumption), 2) implementing a strongly atomic Transactional Memory where common-case transactions execute in hardware with no overhead, but the semantics are defined by software, and 3) instrumenting code.
Craig Zilles is an Assistant Professor in the Computer Science department at the University of Illinois at Urbana-Champaign. His current research focuses on the interaction between compilers and computer architecture, especially in the context of managed and dynamic languages. He received B.S. and M.S. degrees from MIT and his Ph.D. from the University of Wisconsin-Madison. Prior to his work on computer architecture and compilers, he developed the first algorithm that allowed rendering arbitrary three-dimensional polygonal shapes for haptic interfaces (force-feedback human-computer interfaces). His work has been selected for the 2008 IEEE Micro "Top Picks from Computer Architecture Conferences", he holds 5 patents, and he was awarded the NSF CAREER award, the UIUC Rose Award for Teaching Excellence and the Everitt Award for Teaching Excellence.
University of Toronto
Developments in FPGA Technology
The talk will address the present state of Field Programmable Gate Array (FPGA) technology. This technology has advanced to the point where FPGA chips are now used to implement entire high-performance digital systems. Most advanced FPGA devices can implement millions of equivalent logic gates and may contain megabytes of memory cells. FPGAs are widely used in applications such as automotive, video processing, communications, computers, medical and industrial test equipment.
Huge advances have been made during the past few years in both the implementation of FPGA devices and the CAD tools needed to use these devices in practical applications. The continuing quest for increased performance is now accompanied by efforts to reduce the power consumption of FPGAs. The most important factors are the reduction in feature size, the architecture of the FPGA devices, and the ability of CAD tools to realize efficient designs. These issues will be discussed, as seen by an academic who is now working in the FPGA industry.
Zvonko Vranesic received his B.A.Sc., M.A.Sc., and Ph.D. degrees, all in Electrical Engineering, from the University of Toronto. From 1963 to 1965 he worked as a design engineer with the Northern Electric Co. Ltd. in Bramalea, Ontario. In 1968 he joined the University of Toronto, where he is now a Professor Emeritus in the Department of Electrical and Computer Engineering. During the 1978-79 academic year, he was a Senior Visitor at the University of Cambridge, England, and during 1984-85 he was at the University of Paris 6. From 1995 to 2000 he served as Chair of the Division of Engineering Science at the University of Toronto. Presently, he is working at the Altera Toronto Technology Center as a member of Altera's University Program group. His research interests have included computer architecture and field-programmable VLSI technology.
He is a coauthor of five books: Fundamentals of Digital Logic with VHDL Design, 3rd ed.; Fundamentals of Digital Logic with Verilog Design, 2nd ed.; Computer Organization, 5th ed.; Microcomputer Structures; and Field-Programmable Gate Arrays. He has represented Canada in numerous chess competitions. He holds the title of International Master.