Scalability Challenges for Future Chip Multiprocessor Architectures
Technology forecasts predict a biennial doubling of the number of processor cores over the next ten years. But the road to unleashing their computational power for applications will be increasingly challenging. Some of these challenges are: i) How can architects provide a more productive interface to the software? ii) How can the memory system be architected so that memory bandwidth can scale up exponentially? iii) How can the chip resources be used to make the whole chip infrastructure scalable to a large number of processor cores? As for high-productivity hardware/software interfaces, I will talk about our recent contributions to realizing hardware transactional memory and the direction we are taking beyond that towards a high-productivity hardware/software interface. Concerning memory systems, there is considerable room for improvement in the utilization of on-chip memory resources. I will present work in progress on our value-centric approach to designing memory systems. Eventually, the serial bottleneck, so cleverly framed by Amdahl decades ago, will hit us. If there is time, I will also talk about a design-space exploration exercise in which we found that in some data-mining applications that potentially scale to hundreds of cores, reduction operations can limit scalability substantially. Through validated analytical models, we find that while asymmetric chip multiprocessors intuitively could mitigate these serial bottlenecks, symmetric chip multiprocessors with more powerful cores appear to be a better pursuit.
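The asymmetric-versus-symmetric trade-off can be illustrated with an Amdahl's-law-style back-of-the-envelope model. The sketch below follows the well-known Hill-Marty formulation and is not the validated analytical model from the talk: it assumes a chip with n base-core-equivalents of resources, and that a core built from r resources delivers sqrt(r) performance (a Pollack's-rule-style assumption). The parallel fraction f = 0.95 is purely illustrative.

```python
import math

def perf(r):
    # Assumed Pollack's-rule-style model: a core built from r
    # base-core resources delivers sqrt(r) performance.
    return math.sqrt(r)

def symmetric_speedup(f, n, r):
    # Symmetric CMP: n resources split into n/r identical cores of
    # size r; f is the parallel fraction of the workload.
    cores = n // r
    return 1.0 / ((1 - f) / perf(r) + f / (perf(r) * cores))

def asymmetric_speedup(f, n, r):
    # Asymmetric CMP: one big core of size r runs the serial fraction,
    # and n - r base cores join it for the parallel fraction.
    return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + (n - r)))

# Compare design points on a 256-resource chip with a 95% parallel
# (e.g. reduction-limited) workload:
for r in (1, 4, 16, 64):
    print(r, symmetric_speedup(0.95, 256, r), asymmetric_speedup(0.95, 256, r))
```

Under these simplified assumptions the big-core designs dominate the many-small-cores points, which is the intuition behind revisiting both asymmetric designs and symmetric designs with more powerful cores.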
Per Stenstrom has been a professor of computer engineering at Chalmers University of Technology since 1995. His research is devoted to design principles for high-performance computer systems, and he has made multiple contributions, especially to high-performance memory systems. He has authored or co-authored three textbooks and more than a hundred publications in international journals and conferences. He regularly serves on the program committees of major conferences in the computer architecture field. He is also an associate editor of IEEE Transactions on Parallel and Distributed Systems and associate editor-in-chief of the Journal of Parallel and Distributed Computing. He co-founded the HiPEAC Network of Excellence funded by the European Commission. He has acted as general and program chair for a large number of conferences, including the ACM/IEEE Int. Symposium on Computer Architecture, the IEEE High-Performance Computer Architecture Symposium, and the IEEE Int. Parallel and Distributed Processing Symposium. He is a Fellow of the ACM and the IEEE and a member of Academia Europaea and the Royal Swedish Academy of Engineering Sciences.
University of Southern California
WearMon: Ultra Low Cost Wearout Monitoring for Detecting Processor Aging
As process technology shrinks, circuits experience accelerated wearout. Monitoring wearout will be critical to improve the efficiency of error detection and correction approaches. In this talk I will describe WearMon an adaptive critical path monitoring architecture which provides accurate and real-time measure of the processor's timing margin degradation. Special test patterns check a set of critical paths in the circuit-under-test. By activating the actual devices and signal paths used in normal operation of the chip, each test will capture up-to-date timing margin of these paths. While WearMon is highly efficient in the common case there are two factors that cause WearMon costs to rise. First, industrial designs may have steep critical path walls. Second, circuit wearout depends on process variations, circuit operation conditions, and runtime path utilization. Dynamic nature of wearout coupled with steep critical path walls make selection of a group of paths to be monitored a challenging task. In the second part of the talk we will describe a novel cross-layer framework that combines application layer information with design time modifications to provide near zero cost monitoring.
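The core idea of margin monitoring can be shown with a toy sketch. This is not the WearMon implementation; the path names, clock period, and guard band below are invented for illustration. It captures only the decision step: given fresh per-path delay measurements from the test patterns, flag paths whose remaining timing margin has degraded below a guard band.

```python
# Toy sketch of the monitoring decision, not the WearMon hardware:
# compare the latest measured delay of each monitored critical path
# against the clock period, and flag paths inside the guard band.

CLOCK_PERIOD_PS = 500.0  # hypothetical cycle time
GUARD_BAND_PS = 25.0     # hypothetical safety margin

def degraded_paths(measured_delays_ps):
    # measured_delays_ps: {path_name: latest delay in picoseconds},
    # as captured by applying test patterns to the circuit-under-test.
    flagged = {}
    for path, delay in measured_delays_ps.items():
        margin = CLOCK_PERIOD_PS - delay
        if margin < GUARD_BAND_PS:
            flagged[path] = margin
    return flagged

# Example: the hypothetical "alu_carry" path has aged to within
# 10 ps of the clock period, so it is flagged; "agen" still has slack.
print(degraded_paths({"alu_carry": 490.0, "agen": 430.0}))
```

An adaptive monitor would re-run this check periodically and re-select which paths to test as their relative margins shift with operating conditions and utilization.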
Murali Annavaram is an Assistant Professor in the Electrical Engineering department at USC. Prior to his appointment at USC, he gained significant industrial experience, working first at Intel research on hardware design and then at Nokia research on mobile phone technologies, for a total of 6 years. His work on Energy-Per-Instruction throttling at Intel is implemented in the Intel Core i7 processor to turbo-boost performance at a fixed power budget. His work on Virtual-Trip-Lines at Nokia formed the foundation for the Nokia Traffic Works product, which provides real-time traffic sensing using mobile phones. Murali's research at USC spans energy efficiency in mobile devices and reliability in server architectures (more info on his website http://www.usc.edu/dept/ee/scip/). Murali is a recipient of a 2010 NSF CAREER award. His passion is a combination of teaching, working with his graduate students, and traveling. Murali received his Ph.D. from the University of Michigan.
Arizona State University
Multi-core Challenge: Missing Memory Virtualization
The multi-core era is irreversibly here. The transition from single-core to a few cores has been relatively smooth. However, the unending need for higher performance will soon bring processors with hundreds and thousands of cores to market. But what are the implications for engineering and the software industry in general, and for computer science in particular? How is industry embracing this change? Are we ready?
One of the challenges that we have been working on is the absence of memory virtualization in many-core architectures. Caches were the most important pillar of computer architecture in the single-core era. Caches provided the illusion of a single large unified memory, and kept programming simple and unchanged. However, caches do not scale well with the number of cores, and they also consume a lot of power. Therefore, to improve power efficiency and enable a large number of cores in a processor, computer architects are in search of alternative memory hierarchies.
The Limited Local Memory (LLM) multi-core architecture is a scalable memory design in which each core has access to only its small local memory, and explicit DMA instructions have to be inserted in the program to transfer data between memories. The IBM Cell processor, found in the Sony PlayStation 3, is a popular example of this architecture. The Roadrunner supercomputer, which broke the petascale computation record, is one of the most power-efficient supercomputers and is built from IBM Cell processors. Such high power efficiency comes partly at the cost of simplicity of programming. Programming the LLM architecture is not simple, as it requires application changes: developers have to be cognizant of the small size of the local memory, and have to insert instructions to perform the data transfers between the memories. My talk will summarize our efforts at automating this memory management.
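The programming burden can be sketched schematically. The snippet below is written in Python for readability, not in the Cell SDK; `dma_get`/`dma_put` stand in for the real DMA primitives, and the tiny local-store size is invented. It shows the pattern an LLM programmer must write by hand: tile the data, DMA each tile into the local store, compute, and DMA the result back.

```python
# Schematic illustration of explicit memory management on an
# LLM-style architecture: the core may only touch its small local
# store, so data is moved in and out with explicit transfers.

LOCAL_STORE_WORDS = 4  # deliberately tiny "on-chip" local memory

def dma_get(main_mem, offset, n):
    # Stand-in for a DMA transfer: main memory -> local store.
    return main_mem[offset:offset + n]

def dma_put(main_mem, offset, tile):
    # Stand-in for a DMA transfer: local store -> main memory.
    main_mem[offset:offset + len(tile)] = tile

def scale_in_place(main_mem, factor):
    # Process a large array tile by tile, never holding more than
    # LOCAL_STORE_WORDS of data in the local store at once.
    for off in range(0, len(main_mem), LOCAL_STORE_WORDS):
        tile = dma_get(main_mem, off, LOCAL_STORE_WORDS)
        tile = [x * factor for x in tile]
        dma_put(main_mem, off, tile)

data = list(range(10))
scale_in_place(data, 2)
print(data)  # [0, 2, 4, ..., 18]
```

Automating this management means having the compiler or runtime insert the tiling and transfer logic, so the programmer can write the loop as if the whole array were directly addressable.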
Aviral Shrivastava is an Assistant Professor in the School of Computing, Informatics, and Decision Systems Engineering at Arizona State University, where he has established and heads the Compiler and Microarchitecture Labs (CML). He received his Ph.D. and Master's in Information and Computer Science from the University of California, Irvine, and his bachelor's in Computer Science and Engineering from the Indian Institute of Technology, Delhi. He is a 2011 NSF CAREER Award recipient and is credited with over $1.5 million in research funding. His research lies at the intersection of compilers and architectures of embedded and multi-core systems, with the goal of improving power, performance, temperature, energy, reliability, and robustness. His research is funded by NSF and several companies, including Intel, Nvidia, Microsoft, and Raytheon Missile Systems. He serves on the organizing and program committees of several premier embedded-systems conferences, including ISLPED, CODES+ISSS, CASES, and LCTES, and regularly serves on NSF and DOE review panels.
Improving Permeability in System Architecture
Hardware complexity has outpaced software development by a wide margin. Long gone are the days when well-written applications and compilers could extract every drop of performance from a computing platform. Software developers are faced with the daunting task of parallelizing their applications using archaic tools that have not kept pace with hardware. Further, programmers attempting to exploit the performance of specialized hardware must become proficient in hybrid environments like OpenCL.
The principle of hardware/software codesign is often cited as the panacea for closing the complexity gap and improving programmer productivity. While conceptually simple, codesign is not well defined and does not necessarily lead to higher programmer productivity. The concept of improving permeability through the hardware/software barrier is introduced as a technique to reduce overall system architecture complexity, with the eventual goal of improving both efficiency and productivity.
In this talk I will explore tradeoffs that can move specific functions from software to hardware, both for productivity and for efficiency. I will also look at examples where moving a function from hardware to software improved flexibility without compromising efficiency. Using recent product experience, we will discuss the software interfaces to hardware functions and attempt to make sense of hardware/software codesign with heterogeneous hardware.
Doug Carmean is an Intel Fellow and Researcher At Large at Intel Labs.
He is responsible for creating the vision and concept for a fully programmable graphics pipeline based on IA processors that supports highly visual and parallel workloads. Carmean led the team that founded a new group at Intel to define, build and productize products from an architecture that targets the high-end discrete graphics business. He is responsible for growing the development of Larrabee from an early concept to a core piece of Intel's graphics strategy. Carmean enlisted and included key industry software developers in Larrabee's definition to ensure a compelling product.
Since joining Intel in 1989, he has held several key roles and provided leadership in Intel's microprocessor architecture development and product roadmap. As the first chief architect of Nehalem, a next-generation x86 flagship processor, he led the team during the early phases of architecture definition. Prior to this position, he was a principal architect for the Pentium 4 processor, where he completed the memory cluster and power architecture definition, including algorithms, structures, and overall functionality.
Carmean holds more than 25 patents, with many more pending, in processor architecture and implementation, memory subsystems, and low-power design. He has published more than a dozen technical papers. Doug enjoys fast cars, Canadian bicycles, and scary Italian motorcycles.
Intel's "Cool" New Microarchitecture: Sandy Bridge
From doubling FLOPs to leading edge branch prediction, from sophisticated power management to integrated graphics, Intel's Sandy Bridge microarchitecture provides break-through performance in a highly efficient package. This talk will discuss many of the primary architectural enhancements that contribute to Sandy Bridge's success, including vector widening, scalable ring interconnect, turbo boost power budgeting, and more.
Beeman Strong, Senior Performance Architect, Intel Corp
BSEE, UT-Austin, 1996
Beeman began his career at Intel in 1996, validating ISA features on Willamette (P4). He spent 9 years in validation before moving to the architecture team in 2005, to work on SMT and livelock prevention. In 2007, Beeman moved to focus on branch prediction, and is currently working on enhancements for future generation CPUs.
Power-Aware Definition of Reliable, Multi-Core Processors
The power and reliability “walls” are two key impediments to scalable performance growth in next-generation multi-core chips and systems. The problem becomes increasingly acute as we scale to ever larger systems, in order to enter the so-called extreme-scale computing regime. These systems target orders-of-magnitude improvement in performance over current large-scale server or supercomputing systems. The targets must be achieved at the historically established reliability metrics at the system level, while adhering to hard limits on power delivery and dissipation, driven by overall cost. In this talk, I will try to highlight some key architecture and modeling challenges at the processor-chip level, while relating them to the full system view. I will also point to some promising solution approaches to known difficult problems, based on ongoing work at R&D groups across universities and industry.
Pradip Bose is a Research Staff Member and Manager of the Reliability- and Power- Aware Microarchitectures Department at IBM T. J. Watson Research Center. He has been with IBM for over twenty-five years, and has been involved in the definition and pre-silicon modeling of virtually all IBM POWER-series microprocessors. Dr. Bose is a member of the IBM Academy of Technology and is an IBM Master Inventor. He is a Fellow of IEEE.
University of Alberta
Don't Hide Your Program's Behavior Behind Averages
Multiple training runs of a computer program with varied data inputs can be used to characterize the behavior of the program. This information can then be used either to generate a version of the program that is better suited to a given architecture --- a process known as Feedback-Directed Optimization (FDO) compilation --- or to inform design decisions in future architectures. But how should the multiple profiles be combined? Is it sufficient to simply average the multiple measurements? Is it necessary to compute the parameters for an assumed statistical distribution of the measurements? Or is there a simple technique to combine the measurements and provide useful statistics to FDO? In this talk I argue that, even in the most commonly used benchmarks such as SPEC, there are significant variations of behavior due to data inputs. I also argue that these variations should be taken into account in FDO and architecture-design decisions. I present our methodology, which uses a non-parametric empirical distribution to build a Combined Profile, and a query system that allows a compiler or designer to ask questions about program behavior variations. I discuss the implementation of this Combined Profiling methodology in LLVM. This work was done jointly with my Ph.D. student Paul Berube, and with Adam Preuss, a B.Sc. research assistant.
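The flavor of such queries can be conveyed with a toy sketch. This is not the LLVM implementation: the `CombinedProfile` class, the `loop_trip` metric, and the `prob_at_least` query are hypothetical names. The point is that keeping a non-parametric empirical distribution of a metric across training runs lets the compiler ask questions a single averaged profile cannot answer.

```python
# Toy sketch of the Combined Profiling idea: merge per-input profiles
# into an empirical distribution per metric, then answer distributional
# queries instead of reporting a single average.

import bisect

class CombinedProfile:
    def __init__(self):
        self.samples = {}  # metric name -> sorted list of observed values

    def add_run(self, profile):
        # Fold one training run's profile into the combined profile.
        for metric, value in profile.items():
            bisect.insort(self.samples.setdefault(metric, []), value)

    def prob_at_least(self, metric, threshold):
        # Fraction of training runs in which `metric` >= threshold:
        # the kind of question an FDO compiler might ask before
        # committing to an optimization.
        xs = self.samples[metric]
        return (len(xs) - bisect.bisect_left(xs, threshold)) / len(xs)

cp = CombinedProfile()
for run in ({"loop_trip": 120}, {"loop_trip": 8},
            {"loop_trip": 95}, {"loop_trip": 110}):
    cp.add_run(run)

print(cp.prob_at_least("loop_trip", 64))  # 0.75
```

An averaged trip count of about 83 would hide the fact that one input in four has a trip count of 8, where an aggressive unrolling decision would hurt; the empirical distribution exposes exactly that variation.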
José Nelson Amaral is a professor of Computing Science at the University of Alberta, Canada. He received the Ph.D. in Electrical and Computer Engineering from the University of Texas at Austin, in 1994, the M.E. from the Instituto Tecnológico de Aeronáutica, São José dos Campos, SP, Brazil, and the B.E. from the Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), RS, Brazil. His current research interests include Compiler Design, Static Analysis, Feedback-Directed Compilation, Computer Architecture, High-Performance Computer Systems, and the application of learning methods to the design of compilers. His previous research includes Cache-Conscious Algorithms, Internet Protocol Routing Caches, Artificial Neural Networks, Combinatorial Optimization Problems, Parallel Architectures for Symbolic Processing, Multi-Threaded Architectures, and Programming Models. Dr. Amaral is a Senior Member of the IEEE and a Distinguished Speaker for ACM.