# Memory Consistency Models

# **Outline**

- Review of multi-threaded program execution on uniprocessor
- Need for memory consistency models
- Sequential consistency model
- Relaxed memory models

   weak consistency model
  - release consistency model
- Conclusions



#### Multi-threaded programs on multiprocessor • Each processor executes one thread • Derations in each thread are executed in order • One processor at time can access global memory to perform load/store/atomic operations • no caching of global data

 no caching of global data
 You can show that running multi-threaded program on multiple processors does not change possible output(s) of program from uniprocessor case

# More realistic architecture

• Two key assumptions so far:

- 1. processors do not cache global data
  - improving execution efficiency:
  - allow processors to cache global data
  - » leads to cache coherence problem, which can be solved using coherent caches as explained before
- 2. instructions within each thread are executed in order
   improving execution efficiency:
  - allow processors to execute instructions out of order subject to data/control dependences
    - » surprisingly, this can change the semantics of the program
    - » preventing this requires attention to memory consistency model of processor

#### Recall: uniprocessor execution

- Processors reorder operations to improve performance
- Constraint on reordering: must respect dependences
  - data dependences must be respected: in particular, loads/stores to a given memory address must be executed in program order
  - control dependences must be respected
- Reorderings can be performed either by compiler or processor

# Permitted memory-op reorderings

| Stores t     program                     |                                     | ocations can $\leftrightarrow$   | be performed out of<br>store b1, flag<br>store v1, data |    |
|------------------------------------------|-------------------------------------|----------------------------------|---------------------------------------------------------|----|
| <ul> <li>Loads fi<br/>program</li> </ul> |                                     | y locations ca $\leftrightarrow$ | an be performed out of<br>load data,r2<br>load flag, r1 |    |
|                                          | id store to different n<br>am order | nemory locat                     | ions can be performed ou                                | Jt |



### Problem in multiprocessor context

- Canonical model
  - operations from given processor are executed in program order
  - memory operations from different processors appear to be interleaved in some order at the memory
- Question:
  - If a processor is allowed to reorder independent operations in its own instruction stream, will the execution always produce the same results as the canonical model?
  - Answer: no. Let us look at some examples.

# Example (I)

Code: Initially A = Flag = 0

P1 A = 23; Flag = 1;

P2 while (Flag != 1) {;} ... = A;

#### Idea:

- P1 writes data into A and sets Flag to tell P2 that data value can be read from A.
- P2 waits till Flag is set and then reads data from A.

| Execution Sequence for (I)                                                                                                                                          |                                     |                    |  |  |  |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------|--------------------|--|--|--|
| Code:<br>Initially $A = Flag = 0$<br>P1<br>A = 23;<br>Flag = 1;                                                                                                     | P2<br>while (Flag != 1) {;}<br>= A; |                    |  |  |  |
| Possible execution sequence on each processor:                                                                                                                      |                                     |                    |  |  |  |
| P1                                                                                                                                                                  | P1 P2                               |                    |  |  |  |
| Write A 23                                                                                                                                                          | Read Flag //get 0                   |                    |  |  |  |
| Write Flag 1                                                                                                                                                        |                                     |                    |  |  |  |
|                                                                                                                                                                     | Read Flag                           |                    |  |  |  |
|                                                                                                                                                                     | Read A                              | //what do you get? |  |  |  |
| Problem: If the two writes on processor P1 can be reordered, it is<br>possible for processor P2 to read 0 from variable A.<br>Can happen on most modern processors. |                                     |                    |  |  |  |

|                           | xample II                     |  |  |  |
|---------------------------|-------------------------------|--|--|--|
| Code: (like Dekker's al   | 5 /                           |  |  |  |
| Initially Flag1 = Flag2 = | = 0                           |  |  |  |
| P1                        | P2                            |  |  |  |
| Flag1 = 1;                | Flag2 = 1;                    |  |  |  |
| If $(Flag2 == 0)$         | If (Flag1 == 0)               |  |  |  |
| critical section          | critical section              |  |  |  |
| Possible execution seq    | uence on each processor:      |  |  |  |
| P1                        | P2                            |  |  |  |
| Write Flag1, 1            | Write Flag2, 1                |  |  |  |
| Read Flag2 //get 0        | Read Flag1 //what do you get? |  |  |  |

# Execution sequence for (II)

| Code: (like Dekker's algorithm)<br>Initially Flag1 = Flag2 = 0                                                                                                                               |                        |  |  |  |  |  |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------|--|--|--|--|--|
| P1                                                                                                                                                                                           | P2                     |  |  |  |  |  |
| Flag1 = 1;                                                                                                                                                                                   | Flag2 = 1;             |  |  |  |  |  |
| If (Flag2 == 0)                                                                                                                                                                              | If $(Flag1 == 0)$      |  |  |  |  |  |
| critical section                                                                                                                                                                             | critical section       |  |  |  |  |  |
| childal section                                                                                                                                                                              | chucal section         |  |  |  |  |  |
| Possible execution seque                                                                                                                                                                     | nce on each processor: |  |  |  |  |  |
| P1 .                                                                                                                                                                                         | P2                     |  |  |  |  |  |
| Write Flag1, 1                                                                                                                                                                               | - Write Flag2, 1       |  |  |  |  |  |
| Read Flag2 //get 0                                                                                                                                                                           | Read Flag1, ??         |  |  |  |  |  |
| riodd i idge i/gor o                                                                                                                                                                         |                        |  |  |  |  |  |
| Most people would say that P2 will read 1 as the value of Flag1.                                                                                                                             |                        |  |  |  |  |  |
| Since P1 reads 0 as the value of Flag2, P1's read of Flag2 must happen before P2<br>writes to Flag2. Intuitively, we would expect P1's write of Flag to happen before P2's<br>read of Flag1. |                        |  |  |  |  |  |
| However, this is true only if reads and writes on the same processor to different<br>locations are not reordered by the compiler or the hardware.                                            |                        |  |  |  |  |  |
| Unfortunately, this is very common on most processors (store-buffers with load-<br>bypassing).                                                                                               |                        |  |  |  |  |  |

### <u>Lessons</u>

- Uniprocessors can reorder instructions subject only to control and data dependence constraints
- These constraints are not sufficient in shared-memory context
  - simple parallel programs may produce counterintuitive results
- Question: what constraints must we put on uniprocessor instruction reordering so that
  - shared-memory programming is intuitive
- but we do not lose uniprocessor performance?Many answers to this question
  - answer's to this question
     answer is called memory consistency model supported by the processor

# **Consistency models**

- Consistency models are not about memory operations from different processors.
- Consistency models are not about dependent memory operations in a single processor's instruction stream (these are respected even by processors that reorder instructions).
- Consistency models are all about ordering constraints on independent memory operations in a single processor's instruction stream that have some high-level dependence (such as flags guarding data) that should be respected to obtain intuitively reasonable results.

# Simplest Memory Consistency

#### <u>Model</u>

 Sequential consistency (SC) [Lamport]

 our canonical model: processor is not allowed to reorder reads and writes to global memory



# Sequential Consistency

- SC constrains all memory operations:
  - Write → Read
  - Write  $\rightarrow$  Write
  - Read  $\rightarrow$  Read, Write
- Simple model for reasoning about parallel programs
  - You can verify that the examples considered earlier work correctly under sequential consistency.
- However, this simplicity comes at the cost of uniprocessor performance.
- Question: how do we reconcile sequential consistency model with the demands of performance?

### Relaxed consistency model: Weak consistency

- Programmer specifies regions within which global memory operations can be reordered
   Processor has fence instruction:
  - all data operations before fence in program order must complete before fence is executed
  - all data operations after fence in program order must wait for fence to complete
     fences are performed in program order
- Implementation of fence:
- processor has counter that is incremented when data op is issued, and decremented when data op is completed
- Example: PowerPC has SYNC instruction
- Language constructs:
- OpenMP: flush
- All synchronization operations like lock and unlock act like a fence



# Example (I) revisited Code: Initially A = Flag = 0 P1 P2 A = 23; flush; Flag = 1; P2 State flush; flush; flush; flush; flush; flush; P1 P3 P1 P4 P3 P4 P4 P4 P4 P4 P4 P4 P5 P4 P4 P4 P4 P4 P5 flush; flush; ... = Å; P4 P4 P4 P5 P4 P4 P4 P5 P4 P4 P4 P4 P4 P5 P4 P5 P4 P4 P4 P5 P4 P5 P4 P5 P4 P4 P4 P5 P4 P4 P4 P5 P4 P4 P4 P4 P4 P4 P4

# Another relaxed model: release consistency

- Further relaxation of weak consistency
- Synchronization accesses are divided into
  - Acquires: operations like lock
  - Release: operations like unlock
- Semantics of acquire:
- Acquire must complete before all following memory accesses Semantics of release:
- all memory operations before release are complete
- However,

#### - acquire does not wait for accesses preceding it

accesses after release in program order do not have to wait for release
 operations which follow release and which need to wait must be protected by an acquire



| Implementations on Current<br>Processors |                              |                               |                                |                               |                                           |                                            |                            |                                        |
|------------------------------------------|------------------------------|-------------------------------|--------------------------------|-------------------------------|-------------------------------------------|--------------------------------------------|----------------------------|----------------------------------------|
|                                          | Loads Reordered After Loads? | Loads Reordered After Stores? | Stores Reordered After Stores? | Stores Reordered After Loads? | Atomic Instructions Reordered With Loads? | Atomic Instructions Reordered With Stores? | Dependent Loads Reordered? | Incoherent Instruction Cache Pipeline? |
| Alpha                                    | Y                            | Y                             | Y                              | Y                             | Y                                         | Y                                          | Y                          | Y                                      |
| AMD64                                    | Y                            |                               |                                | Y                             |                                           |                                            |                            |                                        |
| IA64                                     | Y                            | Y                             | Y                              | Y                             | Y                                         | Y                                          |                            | Y                                      |
| (PA-RISC)                                | Y                            | Y                             | Y                              | Y                             |                                           |                                            |                            |                                        |
| PA-RISC CPUs                             |                              |                               |                                |                               |                                           |                                            |                            |                                        |
| POWER                                    | Y                            | Y                             | Y                              | Y                             | Y                                         | Y                                          |                            | Y                                      |
| SPARC RMO                                | Y                            | Y                             | Y                              | Y                             | Y                                         | Y                                          |                            | Y                                      |
| (SPARC PSO)                              |                              |                               | Y                              | Y                             |                                           | Y                                          |                            | Y                                      |
| SPARC TSO                                |                              |                               |                                | Y                             |                                           |                                            |                            | Y                                      |
| ×86                                      | Y                            | Y                             |                                | Y                             |                                           |                                            |                            | Y                                      |
| (x86 OOStore)                            | Y                            | Y                             | Y                              | Y                             |                                           |                                            |                            | Y                                      |
| zSeries                                  |                              |                               |                                | Y                             |                                           |                                            |                            | Y                                      |

# **Comments**

- In the literature, there are a large number of other consistency models
  - processor consistency
  - total store order (TSO)
  - ....
- It is important to remember that these are concerned with reordering of independent memory operations within a processor.
- Easy to come up with shared-memory programs that behave differently for each consistency model.
- Emerging consensus that weak/release consistency is adequate.

# **Summary**

- Two problems: memory consistency and memory coherence
   Memory consistency model

   what instructions is compiler or hardware allowed to reorder?
   nothing really to do with memory operations from different processors/threads
   sequential consistency: perform global memory operations in program order
   releved consistency: medals; all of them roly operations operations
  - \_
  - relaxed consistency models: all of them rely on some notion of a fence operation that demarcates regions within which reordering is permissible
- Memory coherence
  - Preserve the illusion that there is a single logical memory location corresponding to each program variable even though there may be lots of physical memory locations where the variable is stored