Guang R. Gao,
University of Delaware
Vivek Sarkar
MIT Laboratory for Computer Science
1. Background
A memory consistency model specifies the semantics of concurrent memory operations (load/store or synchronization operations). The bulk of past work on memory consistency models has been pursued from the viewpoint of uniprocessor and multiprocessor hardware in which a memory consistency model is used to specify certain behavioral properties that are guaranteed by the hardware. This has led to definitions of memory consistency that are described in terms of the lowest level of program execution in hardware where a program is viewed as a fixed dynamic ordering (``program order'') of primitive operations on each processor. Memory consistency is then defined with respect to the relative ordering of when memory operations are ``performed'' or ``globally performed'' in this low-level view of hardware.
While this low-level view of memory consistency has been helpful in the design of more efficient and scalable cache consistency protocols for shared-memory multiprocessors, it is too narrow to be useful in an end-to-end view of computer systems. For example, these memory consistency models cannot be used to specify the semantics of a parallel (multithreaded) program with accesses to shared data because the notions of a fixed program order of all primitive operations and of memory operations being ``performed by a processor'' make no sense in a high-level programming language.
The hardware memory consistency model that has been most commonly used as a basis for past work is {\it sequential consistency} (SC)~\cite{Lamport79}. It has been observed that sequential consistency limits performance by preventing the use of common uniprocessor hardware optimizations such as store buffers and out-of-order memory operations~\cite{GharachorlooEtAl90,AdveHil90}. The main approach taken in recent work on memory consistency models is to allow performance optimizations to be applied, while guaranteeing that sequential consistency is retained for a restricted class of programs --- mainly programs that do not exhibit data races~\cite{AdveHil93}. Therefore, we refer to these weaker memory consistency models as {\it SC-derived} models. Recently proposed SC-derived models include {\it weak ordering} (WO)~\cite{DuboisSchBri86} , {\it release consistency} (RC)~\cite{GharachorlooEtAl90}, {\it data-race-free-0} (DRF0)~\cite{AdveHil90}, and {\it data-race-free-1} (DRF1)~\cite{AdveHil93}.
A central assumption in the definitions of all SC-derived memory consistency models is the {\em memory coherence} assumption, which can be stated as follows~\cite{GharachorlooEtAl90}: ``all writes to the same location are serialized in some order and are performed in that order with respect to any processor''. Memory coherence is a less restrictive form of serializability --- it enforces serializability of operations performed on the same location, rather than serializability of all memory operations.
The memory coherence assumption in past hardware memory consistency models poses another obstacle to obtaining an end-to-end definition of memory consistency in computer systems. The semantic models for parallel programs typically assumed by software have an underlying partial order for program and memory operations defined by the concurrency and synchronization constructs used by the program. These partial orders can be nested and composed in accordance with the structure of the program --- the relative ordering constraints of operations in one subprogram are independent of the ordering constraints of operations in another subprogram.
However, the memory coherence assumption places an additional restriction on the ordering of memory operations that goes beyond the partial order. This restriction couples the ordering of memory operations in different subprograms by requiring that they all observe writes to a memory location in the same order. This coupling makes it difficult for a programmer to work at the source program level with a memory consistency model that contains the memory coherence assumption. Currently, programmers are forced to address this coupling by very messy program modifications (e.g., insertion of ``volatile'' variable declarations) to increase the likelihood that their parallel programs will run correctly. These modifications are tedious and also make the programs run more slowly. Even after making the modifications, the correctness of the modified parallel program remains dependent on the versions of the system software (compiler, operating system) used. An end-to-end memory consistency model would eliminate the need for making program modifications (such as the declaration of volatile variables) for correctness, and also avoid the inefficiencies that accompany these modifications.
In summary, current memory consistency models are defined at a very primitive level of program execution and include a memory coherence assumption that enforces serializability of memory operations performed on the same memory location. Since the primary purpose of a memory consistency model is to serve as a contract between hardware and software, there is a need for end-to-end definitions of memory consistency that can be understood uniformly at different levels of software and hardware in a computer system.
2. Our Position
The thesis of this position paper is that it is essential to provide an end-to-end definition of memory consistency that can be understood at all software and hardware levels. We believe that this is possible with a memory consistency model based on partial order execution semantics rather than on the memory coherence assumption. The Location Consistency (LC) model~\cite{GaoSarkar94,SarkarGao95} is one example of a memory consistency model based on partial order semantics; the Dag Consistency model~\cite{BFJLR96} used in the MIT Cilk project is another such example.
A partial order execution semantics is fundamental to many layers of software and hardware in a computer system. For example:
(1) The semantics of most high-level programming languages declares the order of operand evaluation in an expression as being unspecified, and allows the implementation to choose any ordering that it desires. The programmer can easily understand this underspecification as a partial order that represents a set of possible execution sequences. But it would be very difficult for the programmer to work with a definition of memory consistency that assumes a fixed ordering for operations whose ordering are left unspecified in the semantics of the programming language.
(2) Many compiler optimizations work with program representations that reflect a partial order. For example, the Sethi-Ullman instruction reordering algorithm for improved register allocation assumes that the partial order is represented by an instruction tree. Similarly, the instruction scheduling phase of an optimizing back-end assumes that the partial order is represented by an instruction dag. It is very difficult for such compiler optimizations to work with a memory consistency model that assumes a fixed instruction ordering, since they would either need to conservatively disable all instruction reordering or go to great lengths to prove that a change in instruction order (e.g., an interchange of two independent load instructions) does not violate the memory consistency semantics of the program.
(3) Most definitions of memory consistency view hardware as the lowest (primitive) level of instruction execution. However, there may be multiple layers of interpretation even in hardware execution of instructions. Just as in software, one level of an instruction set architecture may leave unspecified the execution ordering of the memory operands of a (compound) instruction. As before, it would be difficult to work with a memory consistency model that assumes a fixed ordering of memory operations when defining the semantics of such an instruction set architecture.
(4) Even though compound instructions with memory operands have become less prevalent with the advent of RISC architectures, this same phenomenon can be observed in on-the-fly binary translation that is being proposed for modern computer systems. On-the-fly translation is frequently performed as a pattern matching of instruction trees in a source instruction stream. Once again, this raises the possibility of reordering operations and memory accesses during the translation which would make it difficult to work with a memory consistency model that assumes a fixed ordering.
All these examples reinforce the fact that program execution is fundamentally a partial order, and thus motivate the need for a memory consistency semantics that is based on viewing program execution as a partial order.
3. Location Consistency: An Alternative to Memory Coherence ---------------------------------
In this section, we give a brief summary of the Location Consistency (LC) model, as an example of a memory consistency model that is based on partial order execution semantics and does not use the memory coherence assumption. We expect that other memory consistency models based on partial order semantics will be defined in the future, such as the Dag Consistency model~\cite{BFJLR96} used in the MIT Cilk system.
Location Consistency models the state of a memory location as a partially ordered multiset (pomset). Each element of the pomset corresponds to either a write operation to the location or a synchronization operation. The partial orders in the pomsets are determined entirely by the ordering constraints imposed by the concurrency and synchronization constructs in the program being executed --- no extra program modifications (such as the insertion of volatile declarations) need to be made. Thus, the LC model provides a uniform view of memory consistency at all levels of software and hardware in a computer system. Only the partial order defined by the program needs to be obeyed; there is no additional coherence requirement that all writes to the same location be observed in the same order by all read operations.
An ordering in the pomset is created between two write operations to the same location if they originate from the same sequential thread. A synchronization operation inserts an element in the pomsets of all locations affected by the synchronization. The state of a location is observed by read operations: a read operation of location L returns one of a set of possible ``most recent write'' values from the pomset corresponding to L. Informally, we say that a software/hardware level of a computer system is Location Consistent if each read operation that is executed at that level returns a value that belongs to the most recent write set for the target location. This definition allows for the possibility of different partial orders at different levels of system software and hardware. (Note that the pomset model for the state of a location is solely an abstraction used for defining the LC model, and need not be implemented in any software/hardware layer of the computer system.)
There are several benefits that follow from the Location Consistency model due to the fact that it is based on a partial order execution semantics:
(1) The same memory consistency model can be applied in an end-to-end view of all levels of software and hardware in a computer system.
(2) The compiler optimizations and transformations mentioned earlier (e.g., instruction scheduling) can be performed safely by obeying the partial order constraints as usual, and without requiring that shared data be treated as volatile variables.
(3) Interpretation, emulation, or on-the-fly translation of an instruction set architecture or virtual machine can also be performed safely by only obeying the partial order constraints.
(4) The LC model lends itself to using ``multiple-owner'' cache consistency protocols that can be more efficient and scalable than ``single-owner'' protocols. Lazy Release Consistency and Data Merging are examples of two multiple-owner protocols from past work. We have recently gained some initial experience with a more relaxed multiple-owner cache consistency protocol that is also location consistent~\cite{Merali96,Petry97}. Preliminary simulation results show the potential for 1.5x - 10x reductions in # cache misses using this more relaxed LC protocol may be possible for certain programs.
4. Workshop Issues
In this section, we provide a brief outline of how our position statement addresses the workshop issues.
(1) Research directions:
We believe that the research directions that will offer a quantum leap in providing an end-to-end memory consistency model for the next generation of computer systems are those that break away from the shackles of memory coherence. The fruit of these research directions will be seen in robust compiler optimization techniques for explicitly parallel programs, new (multiple-owner) cache consistency protocols that provide more efficient and scalable support for shared-memory multiprocessing, and an end-to-end view of memory consistency that is as simple to articulate as the semantics of an assignment statement.
(2) Grand challenge problems:
We believe that two important grand challenge problems for today's research in the area of memory consistency are: a) How to build a shared-memory multiprocessor that is effective at a scale of a thousand processors or more? b) How to compile and optimize an explicitly parallel program so that its single-processor performance is comparable to that of highly optimized uniprocessor program?
The major obstacle in solving these problems is the memory coherence assumption in today's memory consistency model that must be obeyed by all levels (software and hardware) of a computer system. Memory coherence leads to the use of single-writer cache consistency protocols which (by their very nature) have limited scalability. Memory coherence also restricts reordering of instructions and severely limits the optimizations (e.g., register allocation, instruction scheduling) that can safely be performed on an explicitly parallel program.
(3) New paradigms:
We believe that defining memory consistency in the context of a partial order program execution model will lead to a new and simple paradigm for thinking about parallel programs and their behavior on shared data.
(4) Synergy among research areas in computer systems:
There is a natural synergy between the areas of compilers and computer architecture in implementing a new end-to-end view of memory consistency. However, a synergistic relationship will also be built with other aspects of computer systems that need a semantics for sharing data with applications (e.g., operating systems, graphics, networks, etc.)
(5) Infrastructure requirements:
Availability of high-quality infrastructure is crucial for being able to conduct research on an end-to-end view of memory consistency. Such an infrastructure should include (at a minimum) a good research compiler, a powerful architecture/system simulation platform, a representative set of benchmarks, a performance instrumentation and evaluation tool set, and a program development and performance debugging and tuning environment. The lack of such infrastructure support often poses a serious barrier to entry for research in this field.
(6) Prioritization of research areas: In our opinion, all the following research areas are important but we list them in an estimated order of diminishing pay-offs:
(a) Sound but simple specification of memory consistency models based on partial order execution semantics.
(b) Compiler optimization of explicitly parallel programs with shared memory for the memory consistency models in (a). The biggest source of explicitly parallel programs today are multithreaded programs that use popular thread libraries such as pthreads and Java threads. These optimization techniques may be employed either in static compilation or dynamic code generation contexts.
(c) Design and implementation of (more) scalable cache consistency protocols for shared-memory multiprocessors.
(d) Comparative studies of the performance obtained by (b) and (c) with the corresponding performance for traditional memory consistency models.
(e) Special studies of (b) and (c) for multithreaded architectures with hardware support for fine-grain context switching.
5. Conclusions
In this position paper, we argued that it is important to take an end-to-end view of memory consistency that is applicable at all levels of software and hardware in a computer system. In particular, we believe that the semantics of the memory coherence assumption in current memory consistency models is not meaningful in an end-to-end view of computer systems, and that the memory coherence assumption also imposes serious limitations on compiler optimization of parallel programs and scalability of shared-memory multiprocessors
We proposed that memory consistency models instead be based on the partial order execution semantics of parallel programs without relying on the memory coherence assumption. The Location Consistency and Dag Consistency models are two specific examples of the kind of such memory consistency models.
Finally, we included a brief outline of how our position statement addresses the six workshop issues listed in the call for participation.