OceanStore: An Architecture for Global-Scale Persistent Storage
In the past decade we have seen astounding growth in the performance of computing devices. Even more significant has been the rapid pace of miniaturization and related reduction in power consumption of these devices. Based on these trends, many envision a world of ubiquitous computing devices that add intelligence and adaptability to ordinary objects such as cars, clothing, books, and houses. Given this vision, however, one question immediately comes to mind: where does persistent information reside? One possible answer to this question is OceanStore, a utility infrastructure being developed at Berkeley. OceanStore is designed to span the globe and provide continuous access to persistent information. OceanStore service is provided by a confederation of cooperating service providers, in much the same way that electricity is provided to California customers. Since OceanStore is composed of untrusted servers, data is protected through redundancy and cryptographic techniques. To improve performance, data is allowed to be cached anywhere, anytime. Additionally, monitoring of usage patterns allows adaptation to regional outages and denial-of-service attacks; monitoring also enhances performance through proactive movement of data. This talk will describe the mechanisms of OceanStore and discuss the status of its implementation.
Department of Electrical and Computer Engineering
North Carolina State University
Extensions to EPIC Architectures
VLIW architectures grew out of the horizontal microcode and compiler research of the mid-1980s. They currently lead the market for high-performance DSPs, which in turn are the supercomputing platforms of today. But because of code compatibility problems, the technology is not used in general-purpose computing. EPIC is a relaxation of the VLIW paradigm that hides some pipeline details. Although Intel and HP are promoting its use for server computing, some limitations of EPIC reduce its power when compared to extensions to the industry-standard superscalar paradigm.
This talk will review extensions that seek to address the limitations of EPIC. The first extension, developed by my student Chao-ying Fu, applies value speculation to an otherwise statically-scheduled machine. The second extension, developed by my students Emre Ozer and Tripura Ramesh, looks at techniques to hide unpredictable latencies and under-utilized pipeline resources through single-program, multiple-PC operation welding. If time permits, other research projects in flight in my research group will also be presented.
Tom Conte is currently an associate professor of ECE and director of the Center for Embedded Systems Research at NC State University. He received his M.S. and Ph.D. degrees from the University of Illinois at Urbana-Champaign in 1988 and 1992, respectively. Conte directs 7 Ph.D. students in the TINKER project (www.tinker.ncsu.edu), which is supported by HP, Intel, IBM, TI, Sun, Compaq, DARPA and NSF. He is currently on leave for one year from NC State to serve as the chief microarchitect of VLIW DSP core vendor BOPS, Inc. (www.bops.com).
University of Washington
On the Anatomy of Cache Predictors
Caches were introduced over 30 years ago. They have evolved from a single level sectored cache, whose presence was oblivious to the Instruction Set Architecture (ISA), to a multi-level hierarchy of caches, of various sizes and associativities, that are exposed to the ISA, and that are accompanied by a variety of hardware and software assists.
Caches have been a great success for enhancing the performance of computer systems and in this talk, we briefly review some of the progress made on cache design and performance. However, in spite of the abundance of literature on the subject, caches are not as efficient as they could be and they will remain an active area of research as long as the challenge of the "memory wall" is still present. We will describe our current methodology for the design of cache assists, a methodology that borrows from paradigms used in branch and value prediction, and will show its application to enhancing the performance of some features of on-chip and off-chip caches.
Jean-Loup Baer received a Diplome d'Ingenieur in Electrical Engineering and a Doctorat 3e cycle in Computer Science from the Universite de Grenoble (France) and a Ph.D. from UCLA in 1968.
He is Professor of Computer Science and Engineering and Adjunct Professor of Electrical Engineering at the University of Washington, where he has been since 1969. He was Chair of the Department of Computer Science and Engineering from 1988 to 1993. His present research interests are in computer systems architecture with a concentration on the design and evaluation of memory hierarchies, and in parallel and distributed processing. He is a Guggenheim Fellow, an ACM Fellow, and an IEEE Fellow.
Compaq Western Research Labs
General-Purpose Architectures for Media Processing Applications
Workloads on general-purpose computing systems have changed dramatically over the past few years, with greater emphasis on applications such as databases, media processing, networking, and communications. Several of these workloads require orders-of-magnitude higher computing performance than what is available in current systems. However, until recently, most high performance computing studies have primarily focused on scientific and engineering workloads, potentially leading to design decisions not suitable for emerging workloads. Our work has been the first to use detailed simulation to study several emerging workloads on state-of-the-art general-purpose systems. In this talk, I will discuss some of my recent work on understanding and improving the performance of media processing workloads on general-purpose architectures.
An analysis of the effectiveness of state-of-the-art features (techniques to extract instruction-level parallelism, media instruction-set extensions, software prefetching, and large caches) identifies two key trends: (1) media workloads on current general-purpose systems are primarily compute-bound and (2) current trends towards devoting a large fraction of on-chip transistors (up to 80%) to caches can often be ineffective for media workloads. In response to these results, I will discuss a new cache organization, called a reconfigurable cache, that allows the on-chip cache transistors to be dynamically divided into partitions that can be used for other processor activities (e.g., instruction memoization, application-controlled memory, and prefetching buffers). Our design of the reconfigurable cache requires very few modifications to existing caches and has a small impact on cache access times. I will discuss how, for media applications, reconfigurable caches can be effectively used for instruction memoization to reuse memory for computation.
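To make the idea of instruction memoization concrete, the following is a minimal software sketch, not the hardware design presented in the talk. It assumes a direct-mapped table indexed by a hash of the opcode and operand values; the class name `MemoTable`, its `execute` method, and the table size are all hypothetical names chosen for illustration.

```python
import operator


class MemoTable:
    """Direct-mapped table that caches results of (opcode, operands) pairs.

    Illustrative only: the indexing scheme and replacement policy are
    assumptions, not details taken from the reconfigurable cache design.
    """

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.entries = [None] * num_entries  # each slot holds (key, result)
        self.hits = 0
        self.misses = 0

    def execute(self, opcode, a, b, compute):
        """Return compute(a, b), reusing a stored result when one matches."""
        key = (opcode, a, b)
        index = hash(key) % self.num_entries
        entry = self.entries[index]
        if entry is not None and entry[0] == key:
            # Same operation on the same operands: skip the computation.
            self.hits += 1
            return entry[1]
        # Miss: compute the result and overwrite whatever was in the slot.
        self.misses += 1
        result = compute(a, b)
        self.entries[index] = (key, result)
        return result


# Usage: a long-latency operation repeated on the same operand values.
table = MemoTable(num_entries=64)
DIV = 0  # hypothetical opcode number
results = [table.execute(DIV, 255, 3, operator.floordiv) for _ in range(3)]
```

The sketch pays off only when operand values repeat, which is why the approach is pitched at media workloads, where neighbouring pixels or samples often carry identical values.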
Parthasarathy Ranganathan is currently at Compaq Western Research Labs at Palo Alto, California. He received his B.Tech degree from the Indian Institute of Technology, Madras, and his M.S. and Ph.D. degrees from Rice University. His research interests are in high-performance computer architecture, parallel computing, and performance evaluation, with specific focus on architectures for emerging applications and techniques to use instruction-level parallelism. He is a primary developer and maintainer of the publicly distributed Rice Simulator for ILP Multiprocessors (RSIM) and a recipient of the Lodieska Stockbridge Vaughan fellowship. He is also the secretary of the IIT Madras alumni association overseeing all North American activities. More details can be found at www.ece.rice.edu/~parthas
The InfiniBand Architecture
The InfiniBand(tm) Architecture is a new industry-standard architecture for server I/O and inter-server communication. It was developed by the InfiniBand(sm) Trade Association, a collaboration of over 180 companies including all the major server vendors (e.g., Dell, Compaq, HP, IBM, Intel, Microsoft, Sun, Cisco, 3Com, Fujitsu, Hitachi, and Nortel), to provide greater reliability, availability, performance, and scalability than can be achieved with bus-oriented I/O systems. This talk will provide an overview of InfiniBand: the problems it solves, how it was defined, what it specifies (and doesn't), and the elements that make up an InfiniBand subsystem.
Dr. Gregory Pfister is a Senior Technical Staff Member in the IBM Server Technology & Architecture group in Austin, Texas, working on InfiniBand, clustered systems and technical strategy. Dr. Pfister received his Ph.D. from MIT, has taught at MIT and UC Berkeley, and has worked on parallel computing for over 20 years. His numerous published papers include two that received awards at major international conferences. He has been a Distinguished Visitor of the IEEE Computer Society, has been elected to several honor societies and to the IBM Academy of Technology, and serves on several industry and conference advisory committees related to clusters. He is the author of "In Search of Clusters," published by Prentice Hall, which is currently in its second edition and widely referred to as "The Bible of Clusters."
Pentium (tm) 4 Processor Micro-architecture and design tradeoffs
This talk will discuss some of the key micro-architectural features of the Pentium (tm) 4 Processor and the motivation behind them. It will also discuss some of the tradeoffs made in this new processor design by comparing the Pentium 4 design with the P6 micro-architecture.
Glenn Hinton is an Intel Fellow. He has been at Intel for 17 1/2 years and worked on the design of 7 different state-of-the-art processors from Intel. These designs included 3 processors from the i960 line -- including the world's first superscalar microprocessor. He was one of the 3 senior architects of the P6 micro-architecture. He led the micro-architecture development of the Pentium (tm) 4 Processor (Willamette processor). He graduated with an MSEE from Brigham Young University in 1983. Glenn holds many patents on all aspects of processor micro-architectures.
Self-Managing Storage Systems
Storage systems are becoming larger, more complex, and more difficult to manage each year. An enterprise-scale computer installation can contain dozens of hosts and tens or even hundreds of disk arrays, connected via storage area network (SAN) fabrics and easily encompassing thousands of disks and logical volumes. Total capacities of tens of terabytes are becoming commonplace. Even smaller installations may include storage devices from multiple vendors and different performance levels that must be managed together. In addition, storage systems must provide guaranteed minimum performance and dependability, even in the presence of device failures, upgrades and changes in application requirements.
Designing and maintaining large storage systems to meet such requirements is difficult because of the enormous number of configuration choices available in storage hardware and the complexity and variety of application workloads. Current approaches for managing this complexity are more art than science, requiring human intervention and depending on decisions based on experience, intuition and guesswork. This frequently leads to solutions that are grossly over-provisioned or substantially under-performing or both.
We believe that the solution is a self-managing system, in which objects (such as files, database tables, or file systems) are stored in a shared storage pool in which the details of low-level object-to-device assignment, management, and load balancing are managed invisibly by the system. Our work in automating storage management draws on knowledge of several key areas: characterization of I/O workloads, techniques for modeling storage device behavior, optimization techniques for assigning objects to devices, and strategies for reorganizing data in the running system, while maintaining access from multiple hosts accessing the distributed storage pool. This talk will give an overview of our work and our recent results.
Biography: Dr. Kimberly Keeton is a researcher in the Storage Systems Program at Hewlett-Packard Laboratories. Her research focuses on I/O workload characterization to support the group's goal of automatic storage system management. She received a BS degree in Computer Engineering and Engineering and Public Policy from Carnegie Mellon University, and MS and PhD degrees in Computer Science from the University of California at Berkeley. She is co-chair of the Fourth Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW-01), to be held before the Symposium on High-Performance Computer Architecture (HPCA-7), in January 2001.
For more information on HPL's Storage Systems Program, see: http://www.hpl.hp.com/SSP/
David A. Wood
University of Wisconsin-Madison
Timestamp Snooping: An Approach for Extending SMPs
Symmetric multiprocessor (SMP) servers provide superior performance for the commercial workloads that dominate the Internet. Our simulation results show that over one-third of cache misses by these applications result in cache-to-cache transfers, where the data is found in another processor's cache rather than in memory. SMPs are optimized for this case by using snooping protocols that broadcast address transactions to all processors. Conversely, directory-based shared-memory systems must indirectly locate the owner and sharers through a directory, resulting in larger average miss latencies.
This paper proposes timestamp snooping, a technique that allows SMPs to i) utilize high-speed switched interconnection networks and ii) exploit physical locality by delivering address transactions to processors and memories without regard to order. Traditional snooping requires physical ordering of transactions. Timestamp snooping works by processing address transactions in a logical order. Logical time is maintained by adding a few bits per address transaction and having network switches perform a handshake to ensure on-time delivery. Processors and memories then reorder transactions based on their timestamps to establish a total order.
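The reordering step can be sketched in software as follows. This is a simplified illustration under stated assumptions, not the protocol itself: it assumes the network tells each endpoint a "guaranteed time" before which no further transactions can arrive, and it breaks ties between equal timestamps by source id. The names `Transaction`, `ReorderBuffer`, `receive`, and `advance` are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Transaction:
    timestamp: int  # logical time carried by the address transaction
    source: int     # sender id; used only as a tie-breaker (assumption)
    address: int = field(compare=False)


class ReorderBuffer:
    """Holds transactions that arrive out of order and releases them
    in (timestamp, source) order once they can no longer be preceded."""

    def __init__(self):
        self._heap = []
        self.guaranteed_time = 0

    def receive(self, txn):
        # The network is assumed to deliver every transaction on time.
        if txn.timestamp < self.guaranteed_time:
            raise ValueError("transaction arrived after its logical time")
        heapq.heappush(self._heap, txn)

    def advance(self, guaranteed_time):
        """Release every buffered transaction older than guaranteed_time."""
        self.guaranteed_time = max(self.guaranteed_time, guaranteed_time)
        ready = []
        while self._heap and self._heap[0].timestamp < self.guaranteed_time:
            ready.append(heapq.heappop(self._heap))
        return ready


# Usage: two endpoints see the same transactions in different arrival orders.
txns = [
    Transaction(3, 1, 0x40),
    Transaction(1, 2, 0x80),
    Transaction(3, 0, 0xC0),
    Transaction(5, 1, 0x40),
]
node_a, node_b = ReorderBuffer(), ReorderBuffer()
for t in txns:
    node_a.receive(t)
for t in reversed(txns):
    node_b.receive(t)
order_a = node_a.advance(4)
order_b = node_b.advance(4)
```

Both endpoints release the same sequence even though arrival order differed, which is the property a snooping protocol needs. The transaction stamped 5 stays buffered until the guaranteed time moves past it.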
We evaluate timestamp snooping with commercial workloads on a 16-processor SPARC system using the Simics full-system simulator. We simulate both an indirect (butterfly) and a direct (torus) network design. For OLTP, DSS, web serving, web searching, and one scientific application, timestamp snooping with the butterfly network runs 6-28% faster than directories, at a cost of 13-43% more link traffic. Similarly, with the torus network, timestamp snooping runs 6-29% faster for 17-37% more link traffic. Thus, timestamp snooping is worth considering when buying more interconnect bandwidth is easier than reducing interconnect latency.