# Foundations: Concurrency Concerns Synchronization Basics

Chris Rossbach

**CS378** 

# Today

- Questions?
- Administrivia
  - Office hours changes
  - You've started Lab 1 right?
- Foundations
  - Parallelism
  - Basic Synchronization
  - Threads/Processes/Fibers, Oh my!
  - Cache coherence (maybe)

### Multithreaded programming





- Acknowledgments: some materials in this lecture borrowed from
  - Emmett Witchel (who borrowed them from: Kathryn McKinley, Ron Rockhold, Tom Anderson, John Carter, Mike Dahlin, Jim Kurose, Hank Levy, Harrick Vin, Thomas Narten, and Emery Berger)
  - Mark Silberstein (who borrowed them from: Blaise Barney, Kunle Olukoton, Gupta)
  - Andy Tannenbaum
  - · Don Porter
  - me..
  - Photo source: https://img.devrant.com/devrant/rant/r\_10875\_uRYQF.jpg

### Faux Quiz (answer any 2, 5 min)

- Who was Flynn? Why is her/his taxonomy important?
- How does domain decomposition differ from functional decomposition? Give examples of each.
- Can a SIMD parallel program use functional decomposition?
   Why/why not?
- What is an RMW instruction? How can they be used to construct synchronization primitives? How can sync primitives be constructed without them?

# Review: Problem Partitioning

- Domain Decomposition
  - SPMD
  - Input domain
  - Output Domain
  - Both
- Functional Decomposition
  - MPMD
  - Independent Tasks
  - Pipelining



# Exercise: Functional Decomposition

- Each CPU gets part of:
- $v_t(i,j) = F(v_{t-1}(of \ all \ its \ neighbors))$



Issues?

### Review: Amdahl's law

$$Speedup = \frac{\text{serial run time}}{\text{parallel run time}}$$

$$Speedup(\#CPUs) = \frac{T_{serial}}{T_{parallel}} = \frac{1}{\frac{A}{\#CPUs} + (1 - A)}$$

### Amdahl Action Zone



# Strong Scaling vs Weak Scaling Amdahl vs. Gustafson





- N = #CPUs,  $S = serial\ portion = 1 A$
- Amdahl's law:  $Speedup(N) = \frac{1}{\frac{A}{N} + S}$ 
  - Strong scaling: Speedup (N) calculated given total amount of work is fixed
  - Solve same problems faster when problem size is fixed and #CPU grows
  - · Assuming parallel portion is fixed, speedup soon seizes to increase
- Gustafson's law: Speedup(N) = S + (S-1)\*N
  - Weak scaling: Speedup(N) calculated given work per CPU is fixed
  - Work/CPU fixed when adding more CPUs keeps granularity fixed
  - Problem size grows: solve larger problems
  - Consequence: speedup upper bound is much higher
  - Given work W on n CPUs, with  $\alpha$  serial
    - Incremental work W' on (n+1) CPUs:  $W'=\alpha W+(1-\alpha)nW$
    - Speedup based on case where  $(1-\alpha)$  scales perfectly:

$$S(n) = rac{lpha W + (1-lpha)nW}{lpha W + rac{(1-lpha)nW}{n}}$$
S(n)= $lpha$ +(1- $lpha$ )n



When is Gustavson's law a better metric? When is Amdahl's law a better metric?

### Super-linear speedup

- Possible due to cache
- But usually just poor methodology
- Baseline: \*best\* serial algorithm
- Example:

#### Efficient **bubble sort**

- *Serial: 150s*
- Parallel 40s
- Speedup:  $\frac{150}{40} = 3.75$ ?

#### NO NO NO!

- Serial quicksort: 30s
- *Speedup = 30/40 = 0.75X*



Why insist on best serial algorithm as baseline?

# Programming and Machines: a mental model



```
struct machine_state{
  uint64 pc;
  uint64 Registers[16];
  uint64 cr[6]; // control registers cr0-cr4 and EFER on AMD
} machine;
while(1) {
  fetch_instruction(machine.pc);
  decode_instruction(machine.pc);
  execute_instruction(machine.pc);
void execute_instruction(i) {
  switch(opcode) {
  case add_rr:
   machine.Registers[i.dst] += machine.Registers[i.src];
   break;
```

### Parallel Machines: a mental model



```
program.exe
Source code
                     Compiler
                                                      instruction1
                                                                     instruction1
                                                                                   instruction1
                                       instruction1
                                                     instruction2
                                                                     instruction2
                                                                                   instruction2
                                       instruction2
a = "hello";
                                                     instruction3
                                                                     instruction3
                                                                                   instruction3
                                       instruction3
b = a + "!";
                                                                                   instruction4
                                                     instruction4
                                                                     instruction4
                                       instruction4
```

```
struct machine_state{
  uint64 pc;
  uint64 Registers[16];
  uint64 cr[6]; // control registers cr0-cr4 and EFER on AMD
...
} machine;
while(1) {
  fetch_instruction(machine.pc);
  decode_instruction(machine.pc);
  execute_instruction(machine.pc);
}
void execute_instruction(i) {
  switch(opcode) {
   case add_rr:
    machine.Registers[i.dst] += machine.Registers[i.src];
    break;
}
```

```
struct machine_state{
  uint64 pc;
  uint64 Registers[16];
  uint64 cr[6]; // control registers cr0-cr4 and EFER on AMD
...
} machine;
while(1) {
  fetch_instruction(machine.pc);
  decode_instruction(machine.pc);
  execute_instruction(machine.pc);
}
void execute_instruction(i) {
  switch(opcode) {
  case add_rr:
   machine.Registers[i.dst] += machine.Registers[i.src];
   break;
}
```

### Processes and Threads and Fibers...

- Abstractions
- Containers
- State
  - Where is shared state?
  - How is it accessed?
  - Is it mutable?







### Processes

#### Model



- Multiprogramming of four programs
- Conceptual model of 4 independent, sequential processes
- Only one program active at any instant

#### Implementation

| Process management        | Memory management        | File management   |
|---------------------------|--------------------------|-------------------|
| Registers                 | Pointer to text segment  | Root directory    |
| Program counter           | Pointer to data segment  | Working directory |
| Program status word       | Pointer to stack segment | File descriptors  |
| Stack pointer             |                          | User ID           |
| Process state             |                          | Group ID          |
| Priority                  |                          |                   |
| Scheduling parameters     |                          |                   |
| Process ID                |                          |                   |
| Parent process            |                          |                   |
| Process group             |                          |                   |
| Signals                   |                          |                   |
| Time when process started |                          |                   |
| CPU time used             |                          |                   |
| Children's CPU time       |                          |                   |
| Time of next alarm        |                          |                   |



### The Thread Model

#### Per process items

Address space

Global variables

Open files

Child processes

Pending alarms

Signals and signal handlers

Accounting information

- Items shared by all threads in a process
- Items private to each thread

#### Per thread items

Program counter

Registers

Stack

State

| Process management        | Memory management        | File management   |
|---------------------------|--------------------------|-------------------|
| Registers                 | Pointer to text segment  | Root directory    |
| Program counter           | Pointer to data segment  | Working directory |
| Program status word       | Pointer to stack segment | File descriptors  |
| Stack pointer             |                          | User ID           |
| Process state             |                          | Group ID          |
| Priority                  |                          |                   |
| Scheduling parameters     |                          |                   |
| Process ID                |                          |                   |
| Parent process            |                          |                   |
| Process group             |                          |                   |
| Signals                   |                          |                   |
| Time when process started |                          |                   |
| CPU time used             |                          |                   |
| Children's CPU time       |                          |                   |
| Time of next alarm        |                          |                   |

# Using threads



Parent Process: launchd (1)

Microsoft Word (446)

User: rossbach (501)

### Where to Implement Threads:

#### **User Space**



A user-level threads package

Kernel Space



A threads package managed by the kernel

### Threads vs Fibers

- Like threads, just an abstraction for flow of control
- Lighter weight than threads
  - In Windows, just a stack, subset of arch. registers, non-preemptive
  - \*Not\* just threads without exception support
  - stack management/impl has interplay with exceptions
  - Can be completely exception safe
- *Takeaway*: diversity of abstractions/containers for execution flows

# x86\_64 Architectural Registers



• Register map diagram courtesy of: By Immae - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=32745525

```
switch_to(x,y) should switch_tasks_toom_x to y.
                    * This could still be optimized:
                    * - fold all the options into a flag word and test it with a single test
                    * - could test fs/gs bitsliced
                    * Kprobes not supported here. Set the probe on schedule inst
                    * Function graph tracer not supported too.
                   __visible __notrace_funcgraph struct task_struct *
                   __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
                          struct thread_struct *prev = &prev p->thread;
                          struct thread_struct *next = &next p->thread;
                          struct fpu *prev_fpu = &prev->fpu;
                          struct fpu *next_fpu = &next->fpu;
                          int cpu = smp_processor_id()
                          struct tss_struct *tss = &per_cpu(cpu_tss_rw, cpu);
                          WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) &&
                                       this_cpu_read(irq_count) != -1);
                           switch_fpu_prepare(prev_fpu, cpu);
                           /* We must save %fs and %gs before load_TLS() because
                           * %fs and %gs may be cleared by load_TLS().
                           * (e.g. xen load tls())
                          save_fsgs(prev_p);
                           * Load TLS before restoring any segments so that segment loads
                           * reference the correct GDT entries.
                                                                                                    CW
                          load_TLS(next, cpu);
                                                                                                    SW
                           * Leave lazy mode, flushing any hypercalls made here. This
                                                                                                    TW
                           * must be done after loading TLS entries in the GDT but before
                            * loading segments that might reference them, and and it must
                           * be done before fpu_restore(), so the TS bit is up to
                                                                                                   P_DS
ZMM16|| ZMN
ZMM24|| ZMM
                          arch_end_context_switch(next_p);
                          /* Switch DS and ES.
                           * Reading them only returns the selectors, but writing them (if
                           * nonzero) loads the full descriptor from the GDT or LDT. The
                           * LDT for next is loaded in switch mm, and the GDT is loaded
                           * above.
                           * We therefore need to write new values to the seament
                           * reaisters on every context switch unless both the new and old
                           * values are zero.
                           * Note that we don't need to do anything for CS and SS, as
                           * those are saved and restored as part of pt_regs.
                          savesegment(es, prev->es);
                          if (unlikely(next->es | prev->es))
                                  loadsegment(es, next->es);
                          savesegment(ds. prev->ds);
                          if (unlikely(next->ds | prev->ds))
                                  loadsegment(ds, next->ds);
                          load_seg_legacy(prev->fsindex, prev->fsbase
                                          next->fsindex, next->fsbase, FS);
                          load_seg_legacy(prev->gsindex, prev->gsbase,
```

next->asindex. next->asbase. GS

Complete fiber context switch on **Unix and Windows** CR<sub>0</sub> CR4 CR1 CR5 CR<sub>2</sub> CR6



Linux x86 64 context

switch *excerpt* 

wn work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=32745525

ZMM0

ZMM2

ZMM4

ZMM6

ZMM8

ZMM10

ZMM12

ZMM14

# x86\_64 Registers and Threads



 $switch_to(x,y)$  should switch tasks from x to y. Is could still be optimized: fold all the options into a flag word and test it with a single test could test f(x) a bits[lead

truct tas struct "tas = Sper cou(cou tas rw. cou)

switch\_fpu\_prepare(prev\_fpu, cpu);

save fses(prey p):

• Register map diagram courtesy of: By Immae - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=32745525

# x86\_64 Registers and Fibers



\* 128-bit SSE registers, overlapping with 8 legacy 80-bit x87 floating point

Third argument Stack pointer, must be preserved Frame pointer, must be preserved

xmm6-15 fpcsr Non volatile

• Register map diagram courtesy of: By Immae - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=32745525

### Pthreads

- POSIX standard thread model,
- Specifies the API and call semantics.
- Popular most thread libraries are Pthreads-compatible

### Preliminaries

- Include pthread.h in the main file
- Compile program with -lpthread
  - gcc -o test test.c -lpthread
  - may not report compilation errors otherwise but calls will fail
- Good idea to check return values on common functions

### Thread creation

- Types: pthread t type of a thread
- Some calls:

- No explicit parent/child model, except main thread holds process info
- Call pthread exit in main, don't just fall through;
- When do you need pthread join ?
  - status = exit value returned by joinable thread
- Detached threads are those which cannot be joined (can also set this at creation)

### Creating multiple threads

```
#include <stdio.h>
#include <pthread.h>
#define NUM THREADS 4
void *hello (void *arg) {
      printf("Hello Thread\n");
main() {
  pthread t tid[NUM THREADS];
  for (int i = 0; i < NUM THREADS; i++)
    pthread create(&tid[i], NULL, hello, NULL);
  for (int i = 0; i < NUM THREADS; i++)
    pthread join(tid[i], NULL);
```

# Can you find the bug here?

```
What is printed for myNum?
 void *threadFunc(void *pArg) {
   int* p = (int*)pArg;
   int myNum = *p;
   printf( "Thread number %d\n", myNum);
    from main():
 for (int i = 0; i < numThreads; i++) {
    pthread_create(&tid[i], NULL, threadFunc, &i);
```

### Pthread Mutexes

• Type: pthread mutex t

- Attributes: for shared mutexes/condition vars among processes, for priority inheritance, etc.
  - use defaults
- Important: Mutex scope must be visible to all threads!

### Pthread Spinlock

• Type: pthread spinlock t

```
int pthread_spinlock_init(pthread_spinlock_t *lock);
int pthread_spinlock_destroy(pthread_spinlock_t *lock);
int pthread_spin_lock(pthread_spinlock_t *lock);
int pthread_spin_unlock(pthread_spinlock_t *lock);
int pthread_spin_trylock(pthread_spinlock_t *lock);
```

Wait...what's the difference?

```
int pthread_mutex_init(pthread_mutex_t *mutex,...);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
```

### Concurrency and Correctness

If two threads execute this program concurrently, how many different final values of X are there?

#### Initially, X == 0.

#### Thread 1

```
void increment() {
   int temp = X;
   temp = temp + 1;
   X = temp;
}
```

#### Thread 2

```
void increment() {
  int temp = X;
  temp = temp + 1;
  X = temp;
}
```

#### **Answer:**

- **A.** (
- **B.**
- C. 2
- D. More than 2

# Schedules/Interleavings

#### Model of concurrent execution

- Interleave statements from each thread into a single thread
- If any interleaving yields incorrect results, synchronization is needed



### Locks fix this with Mutual Exclusion

```
void increment() {
   lock.acquire();
   int temp = X;
   temp = temp + 1;
   X = temp;
   lock.release();
}
```

Mutual exclusion ensures only safe interleavings

• But it limits concurrency, and hence scalability/performance

Is mutual exclusion a good abstraction?

# Why are Locks "Hard?"

```
Coarse-grain locksSimple to develop
```

- Easy to avoid deadlock
- Few data races
- Limited concurrency

```
// WITH FINE-GRAIN LOCKS
void move(T s, T d, Obj key){
  LOCK(s);
  LOCK(d);
  tmp = s.remove(key);
  d.insert(key, tmp);
  UNLOCK(d);
  UNLOCK(s);
}
```

- Fine-grain locks
  - Greater concurrency
  - Greater code complexity
  - Potential deadlocks
    - Not composable
  - Potential data races
    - Which lock to lock?

```
Thread 0 Thread 1
move(a, b, key1);
move(b, a, key2);

DEADLOCK!
```

### Correctness conditions

- Safety
  - Only one thread in the critical region
- Liveness
  - Some thread that enters the entry section eventually enters the critical region
  - Even if other thread takes forever in non-critical region
- Bounded waiting
  - A thread that enters the entry section enters the critical section within some bounded number of operations.
  - If a thread i is in entry section, then there is a bound on the number of times that other threads are allowed to enter the critical section before thread i's request is granted

Mutex, spinlock, etc. are ways to implement

Did we get all the important conditions? Why is correctness defined in terms of locks? Theorem: Every property is a combination of a safety property and a liveness property.

-Bowen Alpern & Fred Schneider https://www.cs.cornell.edu/fbs/publications/defliveness.pdf

while (1)Entry section Critical section Exit section Non-critical section

# Implementing Locks

```
int lock_value = 0;
int* lock = &lock_value;
```

Completely and utterly broken. How can we fix it?

```
Lock::Release() {
    *lock = 0;
}
```

#### What are the problem(s) with this?

- > A. CPU usage
- ➤ B. Memory usage
- > C. Lock::Acquire() latency
- > D. Memory bus usage
- > E. Does not work

# HW Support for Read-Modify-Write (RMW)

```
IDEA: hardware implements something like:
```

```
bool rmw(addr, value) {
  atomic {
    tmp = *addr;
    newval = modify(tmp);
    *addr = newval;
  }
}
```

Why is that hard? How can we do it?

#### Preview of Techniques:

- Bus locking
- Single Instruction ISA extensions
  - Test&Set
  - CAS: Compare & swap
  - Exchange, locked increment, locked decrement (x86)
- Multi-instruction ISA extensions:
  - LLSC: (PowerPC, Alpha, MIPS)
  - Transactional Memory (x86, PowerPC)

# Implementing Locks with Test&set

```
int lock_value = 0;
int* lock = &lock_value;
```

```
Lock::Acquire() {
while (test&set(lock) == 1)
  ; //spin
}
```



```
(test & set ~= CAS ~= LLSC)
TST: Test&set
```

- Reads a value from memory
- Write "1" back to memory location

```
Lock::Release() {
    *lock = 0;
}
```

#### What are the problem(s) with this?

- > A. CPU usage
- > B. Memory usage
- ➤ C. Lock::Acquire() latency
- > D. Memory bus usage
- > E. Does not work

More on this later...

#### Review: mutual exclusion model

- Safety
  - Only one thread in the critical region
- Liveness
  - Some thread that enters the entry section eventually enters the critical region
  - Even if other thread takes forever in non-critical region

## Multiprocessor Cache Coherence

Physics Concurrency

F = ma ~ coherence

# Multiprocessor Cache Coherence



• P1: read X

• P2: read X

• P2: X++

• P3: read X

## Multiprocessor Cache Coherence



Each cache line has a state (M, E, S, I)

- Processors "snoop" bus to maintain states
- Initially → 'I' → Invalid
- Read one  $\rightarrow$  'E'  $\rightarrow$  exclusive
- Reads  $\rightarrow$  'S'  $\rightarrow$  multiple copies possible
- Write → 'M' → single copy → lots of cache coherence traffic



# Cache Coherence: single-thread



bnz try

store lock, 1



### Cache Coherence Action Zone

P1



```
// (straw-person lock impl)
// Initially, lock == 0 (unheld)
lock() {
try: load lock, R0
    test R0
    bnz try
    store lock, 1
}
// (straw-person lock impl)
// Initially, lock == 0 (unheld)
lock() {
try: load lock, R0
    test R0
    bnz try
    store lock, 1
}
```

### Cache Coherence Action Zone II

cache



 $P_2$ 

cache

cache

PrWr/ BusRdX

> PrRd/-BusRd(S)

# Read-Modify-Write (RMW)

- Implementing locks requires read-modify-write operations
- Required effect is:
  - An atomic and isolated action
    - 1. read memory location AND
    - 2. write a new value to the location
  - RMW is *very tricky* in multi-processors
  - Cache coherence alone doesn't solve it

```
// (straw-person lock impl)
// Initially, lock == 0 (unheld)
lock() {
try: load lock, R0
          test R0
          bnz try
          store lock, 1
}
```



# Essence of HW-supported RMW

```
// (straw-person lock impl)
// Initially, lock == 0 (unheld)
lock() {
try: load lock, R0
test R0
bnz try
store lock, 1
}
Make this into a single
(atomic hardware instruction)
```

# HW Support for Read-Modify-Write (RMW)

| Test & Set                                                                                                    | CAS                                                                                                                                             | Exchange, locked increment/decrement,                                                                | LLSC: load-linked store-conditional                                                                                                              |
|---------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
| Most architectures                                                                                            | Many architectures                                                                                                                              | x86                                                                                                  | PPC, Alpha, MIPS                                                                                                                                 |
| <pre>int TST(addr) {    atomic {     ret = *addr;     if(!*addr)        *addr = 1;    return ret;   } }</pre> | <pre>bool cas(addr, old, new) {    atomic {     if(*addr == old) {        *addr = new;        return true;     }     return false;    } }</pre> | <pre>int XCHG(addr, val) {    atomic {     ret = *addr;     *addr = val;     return ret;   } }</pre> | <pre>bool LLSC(addr, val) {   ret = *addr;   atomic {     if(*addr == ret) {       *addr = val;       return true;     }   return false; }</pre> |

```
void CAS_lock(lock) {
  while(CAS(&lock, 0, 1) != true);
}
```

## HW Support for RMW: LL-SC

#### **LLSC:** load-linked store-conditional

```
PPC, Alpha, MIPS
bool LLSC(addr, val) {
  ret = *addr;
  atomic {
   if(*addr == ret) {
     *addr = val;
     return true;
   }
  return false;
}
```

```
void LLSC_lock(lock) {
  while(1) {
    old = load-linked(lock);
    if(old == 0 && store-cond(lock, 1))
      return;
  }
}
```

- load-linked is a load that is "linked" to a subsequent store-conditional
- Store-conditional only succeeds if value from linked-load is unchanged

### LLSC Lock Action Zone



```
P1_
lock(lock) {
  while(1) {
    old = ll(lock);
    if(old == 0)
       if(sc(lock, 1))
       return;
  }
}
```

```
P2
lock(lock) {
  while(1) {
    old = ll(lock);
    if(old == 0)
       if(sc(lock, 1))
       return;
  }
}
```

#### LLSC Lock Action Zone II



```
Store
       conditional
          fails
lock(lock) {
 while(1) {
    old = 11(lock);
    if(old == 0)
      if(sc(lock, 1))
        return;
```

# Implementing Locks with Test&set

```
int lock_value = 0;
int* lock = &lock_value;

Lock::Acquire() {
  while (test&set(lock) == 1)
    ; //spin
}

(test & set ~ CAS ~ LLSC)
```

```
Lock::Release() {
    *lock = 0;
}
```

- What is the problem with this?
  - ➤ A. CPU usage B. Memory usage C. Lock::Acquire() latency
  - > D. Memory bus usage E. Does not work

# Test & Set with Memory Hierarchies

Initially, lock already held by some other CPU—A, B busy-waiting What happens to lock variable's cache line when different cpu's contend?



### TTS: Reducing busy wait contention

#### Test&Set

```
Lock::Acquire() {
while (test&set(lock) == 1);
}
```

Busy-wait on in-memory copy

```
Lock::Release() {
    *lock = 0;
}
```

#### Test&Test&Set

```
Lock::Acquire() {
  while(1) {
    while (*lock == 1); // spin just reading
    if (test&set(lock) == 0) break;
}
```

Busy-wait on cached copy

```
Lock::Release() {
*lock = 0;
}
```

- What is the problem with this?
  - A. CPU usage B. Memory usage C. Lock::Acquire() latency
  - D. Memory bus usage E. Does not work

# Test & Test & Set with Memory Hierarchies

What happens to lock variable's cache line when different cpu's contend for the same lock?



# Test & Test & Set with Memory Hierarchies

What happens to lock variable's cache line when different cpu's contend for the same lock?



# How can we improve over busy-wait?

```
Lock::Acquire() {
while(1) {
  while (*lock == 1); // spin just reading
  if (test&set(lock) == 0) break;
}
```

#### Mutex

- Same abstraction as spinlock
- But is a "blocking" primitive
  - Lock available → same behavior
  - Lock held → yield/block
- Many ways to yield
- Simplest case of semaphore

```
void cm3_lock(u8_t* M) {
  u8_t LockedIn = 0;
  do {
   if (__LDREXB(Mutex) == 0) {
     // unlocked: try to obtain lock
     if (__STREXB(1, Mutex)) { // got lock
       __CLREX(); // remove __LDREXB() lock
       LockedIn = 1;
     else task_yield(); // give away cpu
   else task_yield(); // give away cpu
} while(!LockedIn);
```

- Is it better to use a spinlock or mutex on a uni-processor?
- Is it better to use a spinlock or mutex on a multi-processor?
- How do you choose between spinlock/mutex on a multiprocessor?

# Priority Inversion

```
A(prio-0) \rightarrow enter(I);
B(prio-100) \rightarrow enter(I); \rightarrow must wait.
```

Solution?

**Priority inheritance:** A runs at B's priority

MARS pathfinder failure:

http://wiki.csie.ncku.edu.tw/embedded/priority-inversion-on-Mars.pdf

Other ideas?

# Dekker's Algorithm

```
variables
    wants_to_enter : array of 2 booleans
    turn : integer

wants_to_enter[0] ← false
wants_to_enter[1] ← false
turn ← 0 // or 1
```

```
wants to enter[0] ← true
                                                          wants to enter[1] ← true
while wants_to_enter[1] {
                                                          while wants_to_enter[0] {
  if turn ≠ 0 {
                                                             if turn ≠ 1 {
      wants_to_enter[0] \leftarrow false
                                                                wants_to_enter[1] ← false
      while turn ≠ 0 {
                                                                while turn ≠ 1 {
       // busy wait
                                                                  // busy wait
      wants to enter[0] ← true
                                                                wants to enter[1] ← true
// critical section
                                                          // critical section
                                                          turn ← 0
wants to enter[0] ← false
                                                          wants to enter[1] ← false
// remainder section
                                                          // remainder section
```



#### Lab #1

- Basic synchronization
- <a href="http://www.cs.utexas.edu/~rossbach/cs378/lab/lab0.html">http://www.cs.utexas.edu/~rossbach/cs378/lab/lab0.html</a>

• Start early!!!

# Questions?