A Search for an Efficient Reconfigurable Computing Substrate
Computing substrates such as multicore processor chips or Field Programmable Gate Arrays (FPGAs) share the characteristic of having two-dimensional arrays of processing elements interconnected by a routing fabric. At one end of the spectrum, FPGAs have a computing element that is a single-output programmable logic function and a statically-configurable network of wires. At the other end, the computing element in a multicore is a complex 32-bit processor, and processors are interconnected using a packet-switched network.
We are designing a reconfigurable substrate that shares characteristics of both FPGAs and multicores. Our substrate is configured to run one application at a time, as with FPGAs. The computing element is a processor, and processors are connected using an interconnection network with virtual channel routers that use table-based routing. Bandwidth-sensitive oblivious routing methods that statically allocate virtual channels to application flows utilize the network efficiently. To accommodate bursty flows, the network contains adaptive bidirectional links that increase bandwidth in one direction at the expense of another. We are in the process of building a compiler that compiles applications onto this architecture so as to maximize average throughput of the applications. Our plan is to use the compiler to refine the architecture and then to build a reconfigurable processor chip.
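As a rough illustration of the adaptive bidirectional links mentioned above, the sketch below splits a fixed number of physical lanes between the two directions of a link in proportion to observed demand. The abstract does not describe the actual arbitration mechanism; the function name `allocate_lanes`, the lane model, and the `min_lanes` floor are all assumptions made for this example.

```python
from typing import Tuple


def allocate_lanes(total_lanes: int,
                   demand_fwd: float,
                   demand_rev: float,
                   min_lanes: int = 1) -> Tuple[int, int]:
    """Split the lanes of one bidirectional link between its two directions.

    Hypothetical model: a link has `total_lanes` wires that can each be
    driven in either direction, and each direction always keeps at least
    `min_lanes` so that neither side is starved.
    """
    if total_lanes < 2 * min_lanes:
        raise ValueError("not enough lanes to guarantee the minimum per direction")
    if demand_fwd < 0 or demand_rev < 0:
        raise ValueError("demand must be non-negative")

    total_demand = demand_fwd + demand_rev
    if total_demand == 0:
        # No traffic in either direction: split evenly.
        fwd = total_lanes // 2
    else:
        # Proportional share for the forward direction, rounded to whole lanes.
        fwd = round(total_lanes * demand_fwd / total_demand)

    # Clamp so both directions keep their guaranteed minimum.
    fwd = max(min_lanes, min(total_lanes - min_lanes, fwd))
    return fwd, total_lanes - fwd


if __name__ == "__main__":
    # A burst in the forward direction takes lanes from the reverse direction.
    print(allocate_lanes(8, demand_fwd=3.0, demand_rev=1.0))
```

The minimum-lane floor is one possible design choice: it trades some peak bandwidth during a burst for a guarantee that traffic in the opposite direction can still make progress.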
Srini Devadas is a Professor of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology (MIT), and has been on the faculty of MIT since 1988. He currently serves as the Associate Head of Computer Science. Devadas has worked in the areas of Computer-Aided Design, testing, formal verification, compilers for embedded processors, computer architecture, computer security, and computational biology, and has co-authored numerous papers and books in these areas. Devadas was elected a Fellow of the IEEE in 1998.
Petascale Computing with Accelerators
A trend is developing in high performance computing in which commodity processors are coupled to various types of computational accelerators. Such systems are commonly called hybrid systems. In this talk, I will describe our experience developing an implementation of the Linpack benchmark for a petascale hybrid system, the LANL Roadrunner cluster built by IBM for Los Alamos National Laboratory. This system combines traditional x86-64 host processors with IBM PowerXCell™ 8i accelerator processors. The implementation of Linpack we developed was the first to achieve a performance result in excess of 1.0 PFLOPS, and made Roadrunner the #1 system on the Top500 list in June 2008. I will describe the design and implementation of hybrid Linpack, including the special optimizations we developed for this hybrid architecture. I will show actual results for single node and multi-node executions. From this work, we conclude that it is possible to achieve high performance for certain applications on hybrid architectures when careful attention is given to efficient use of memory bandwidth, scheduling of data movement between the host and accelerator memories, and proper distribution of work between the host and accelerator processors.
Mike Kistler is a Senior Technical Staff Member in the IBM Austin Research Laboratory. He received his BA in Math and Computer Science from Susquehanna University in 1982, an MS in Computer Science from Syracuse University in 1990, and an MBA from the Stern School of Business (NYU) in 1991. He joined IBM in 1982 and has held technical and management positions in MVS, OS/2, and Lotus Notes development. He joined the IBM Austin Research Laboratory in May 2000 and is currently working on design and performance analysis for IBM's PowerPC and Cell/B.E. processors and systems. His research interests are parallel and cluster computing, fault tolerance, and full system simulation of high-performance computing systems.
Ghent University, Belgium
Per-Thread Cycle Accounting in SMT Processors
Simultaneous Multi-threading (SMT) processors run multiple hardware threads simultaneously on a single processor core. While this improves hardware utilization substantially, co-executing threads affect each other's performance in often unpredictable ways. System software, however, is unaware of these performance interactions at the micro-architecture level, which may lead to unfair scheduling at the system level.
Starting from a mechanistic performance model, we derive a cycle accounting architecture for SMT processors that estimates, while the threads run simultaneously on the SMT processor, the execution time each thread would have had if executed alone. This is done by accounting each cycle to either a base, miss event, or waiting cycle component. Single-threaded alone execution time is then estimated as the sum of the base and miss event components; the waiting cycle component represents the lost cycle count due to SMT execution. The cycle accounting architecture incurs reasonable hardware cost (around 1KB of storage) and estimates single-threaded performance accurately with average prediction errors around 7.2% for two-program workloads and 11.7% for four-program workloads.
The cycle accounting architecture has several important applications to system software and its interaction with SMT hardware. For one, the estimated single-thread alone execution time provides an accurate picture to system software of the actually consumed processor cycles per thread. Using the alone execution time instead of the total execution time (timeslice) may make system software scheduling policies more effective. Second, a new class of thread-progress aware SMT fetch policies based on per-thread progress indicators enables system software level priorities to be enforced at the hardware level. Third, per-thread cycle accounting enables substantially more effective symbiotic job scheduling.
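The accounting rule described above can be sketched in software as three per-thread counters. This is only a minimal illustration, not the hardware design from the talk: the class name `ThreadCycles`, the `account` method, and the `progress` ratio are assumptions introduced for this example, and deciding which component a given cycle belongs to is the hard part that the sketch leaves out.

```python
from dataclasses import dataclass


@dataclass
class ThreadCycles:
    """Per-thread cycle counters (hypothetical software model)."""
    base: int = 0        # cycles in which the thread makes normal progress
    miss_event: int = 0  # cycles lost to the thread's own miss events
    waiting: int = 0     # cycles lost because of co-running threads

    def account(self, component: str, cycles: int = 1) -> None:
        """Attribute `cycles` to exactly one of the three components."""
        if component not in ("base", "miss_event", "waiting"):
            raise ValueError(f"unknown component: {component}")
        setattr(self, component, getattr(self, component) + cycles)

    @property
    def estimated_alone_cycles(self) -> int:
        # Alone execution time is estimated as base + miss event cycles.
        return self.base + self.miss_event

    @property
    def smt_cycles(self) -> int:
        # Every cycle observed under SMT falls into one of the components.
        return self.base + self.miss_event + self.waiting

    @property
    def progress(self) -> float:
        # Assumed indicator: fraction of SMT time that would also have
        # been spent when running alone (1.0 means no SMT slowdown).
        if self.smt_cycles == 0:
            return 1.0
        return self.estimated_alone_cycles / self.smt_cycles


if __name__ == "__main__":
    t = ThreadCycles()
    t.account("base", 600)
    t.account("miss_event", 250)
    t.account("waiting", 150)
    print(t.estimated_alone_cycles, t.smt_cycles, t.progress)
```

With these made-up counts, the thread is charged 850 of the 1000 cycles it spent on the core; the remaining 150 waiting cycles are the cost of sharing the core with other threads.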
Lieven Eeckhout is an assistant professor at Ghent University, Belgium, and is a postdoctoral fellow with the Fund for Scientific Research -- Flanders (FWO). He received his PhD degree in computer science and engineering from Ghent University in 2002. His main research interests include computer architecture, virtual machines, performance modeling and analysis, simulation methodology, and workload characterization. He has published papers in top conferences such as ISCA, ASPLOS, HPCA, OOPSLA, PACT, CGO, DAC and DATE; he has served on multiple program committees including ISCA, PLDI, HPCA and IEEE Micro Top Picks; and he is the program chair for ISPASS 2009. His work on hardware performance counter architectures was selected by IEEE Micro Top Picks from 2006 Computer Architecture Conferences as one of the "most significant research publications in computer architecture based on novelty and industry relevance". He has graduated 5 PhD students, and currently supervises one postdoctoral researcher, 4 PhD students and 3 MSc students.