GPUs going once... GPUs going twice... you get the idea

Chris Rossbach

cs378

|                                | INT<br>INT | INT<br>INT |           | FP32<br>FP32 |           |           |     |  |  |  |
|--------------------------------|------------|------------|-----------|--------------|-----------|-----------|-----|--|--|--|
|                                | INT        | INT        | FP32      | FP32         | Ħ         |           |     |  |  |  |
|                                | INT        | INT        | FP32      | FP32         | $\vdash$  |           |     |  |  |  |
| D/<br>ST                       | LD/<br>ST  | LD/<br>ST  | LD/<br>ST | LD/<br>ST    | LD/<br>ST | LD/<br>ST | SFU |  |  |  |
| L0 Instruction Cache           |            |            |           |              |           |           |     |  |  |  |
| Warp Scheduler (32 thread/clk) |            |            |           |              |           |           |     |  |  |  |

Dispatch Unit (32 thread/clk)

### Register File (16,384 x 32-bit)

| INT | INT | FP32 FP32 |  |
|-----|-----|-----------|--|
| INT | INT | FP32 FP32 |  |
| INT | INT | FP32 FP32 |  |

### FPG Outline for Today

Questions?

**FP64** 

FF

LD

ST

- Administrivia
  - Start thinking about Projects!
  - Exam not quite done...Tuesday for sure!
- Agenda
  - GPU performance
  - GPU advanced topics
    - Divergence
    - Device APIs vs Dataflow
    - Coherence

#### Acknowledgements:

- http://developer.download.nvidia.com/compute/developertrainingmaterials/presentatio ns/cuda\_language/Introduction\_to\_CUDA\_C.pptx
- http://www.seas.upenn.edu/~cis565/LECTURES/CUDA%20Tricks.pptx
- http://www.cs.utexas.edu/~pingali/CS378/2015sp/lectures/GPU%20Programming.pptx
- Tor Aamodt's 2013 paper

| FP64 | INT INT | FP32 FP32 |   |
|------|---------|-----------|---|
| FP64 | INT INT | FP32 FP32 |   |
| FP64 | INT INT | FP32 FP32 | 2 |

# Faux Quiz Questions

- How is occupancy defined (in CUDA nomenclature)?
- What's the difference between a block scheduler (e.g. Giga-Thread Engine) and a warp scheduler?
- Modern CUDA supports UVM to eliminate the need for cudaMalloc and cudaMemcpy\*. Under what conditions might you want to use or not use it and why?
- What is control flow divergence? How does it impact performance?
- What is a bank conflict?
- What is work efficiency?
- What is the difference between a thread block scheduler and a warp scheduler?
- How are atomics implemented in modern GPU hardware?
- How is \_\_\_\_\_shared\_\_\_ memory implemented by modern GPU hardware?
- Why is <u>shared</u> memory necessary if GPUs have an L1 cache? When will an L1 cache provide all the benefit of <u>shared</u> memory and when will it not?
- Is cudaDeviceSynchronize still necessary after copyback if I have just one CUDA stream?

## How many threads/blocks?

#### // Copy inputs to device

```
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d b, b, size, cudaMemcpyHostToDevice);
```

// Launch add() kernel on GPU
add<<<N/THREADS\_PER\_BLOCK, THREADS\_PER\_BLOCK>>>(d\_a, d\_b, d\_c);

#### // Copy result back to host

```
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
```

#### // Cleanup

}

```
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
```

4

# How many threads/blocks?

### // Copy inputs to device

```
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d b, b, size, cudaMemcpyHostToDevice);
```



### // Copy result back to host

cudaMemcpy(c, d\_c, size, cudaMemcpyDeviceToHost);

### // Cleanup

```
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
```



# How many threads/blocks?

### // Copy inputs to device

```
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d b, b, size, cudaMemcpyHostToDevice);
```



### // Copy result back to host

cudaMemcpy(c, d\_c, size, cudaMemcpyDeviceToHost);

### // Cleanup

```
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
```

Usually things are correct if grid\*block dims >= input size

• Getting good performance is another matter



### Review: Internals

### **Review:** Internals



### Review: Internals







Suppose one TB (threadblock) has 64 threads (2 warps)

### **Thread Blocks**



.....



### <u>SMs</u>



 $SM_0$ 



SM\_1

....



SM\_12

- SMs split blocks into warps
- Unit of HW scheduling for SM
- 32 threads each

Suppose one TB (threadblock) has 64 threads (2 warps)

### **Thread Blocks**

.....



### SMs



Cache/Shared Memory

SM\_1

SM\_0



.....



### SM\_12

- SMs split blocks into warps
- Unit of HW scheduling for SM
- 32 threads each •



- Occupancy = (#Active Warps) /(#MaximumActive Warps)
  - Measures how well concurrency/parallelism is utilized

- Occupancy = (#Active Warps) /(#MaximumActive Warps)
  - Measures how well concurrency/parallelism is utilized
- Occupancy captures
  - which resources can be dynamically shared
  - how to reason about resource demands of a CUDA kernel
  - Enables device-specific online tuning of kernel parameters

- Occupancy = (#Active Warps) /(#MaximumActive Warps)
  - Measures how well concurrency/parallelism is utilized
- Occupancy captures
  - which resources can be dynamically shared
  - how to reason about resource demands of a CUDA kernel
  - Enables device-specific online tuning of kernel parameters



- Occupancy = (#Active Warps) /(#MaximumActive Warps)
  - Measures how well concurrency/parallelism is utilized
- Occupancy captures
  - which resources can be dynamically shared
  - how to reason about resource demands of a CUDA kernel Shouldn't we just create as many
  - Enables device-specific online tuning of kernel parameter. threads as possible?



### A Taco Bar



### A Taco Bar



### A Taco Bar



• Where is the parallelism here?

















































### • Where is the parallelism here?













































• Where is the parallelism here?













- There's none!
- This only works if you can keep every lane full at every step
- Throughput == Performance
- Goal: Increase Occupancy!



































- There's none!
- This only works if you can keep every lane full at every step
- Throughput == Performance
- Goal: Increase Occupancy!









- There's none!
- This only works if you can keep every lane full at every step
- Throughput == Performance
- Goal: Increase Occupancy!







- There's none!
- This only works if you can keep every lane full at every step
- Throughput == Performance
- Goal: Increase Occupancy!





• Where is the parallelism here?



• Where is the parallelism here?



- Occupancy = (#Active Warps) /(#MaximumActive Warps)
  - Measures how well concurrency/parallelism is utilized

- Occupancy = (#Active Warps) /(#MaximumActive Warps)
  - Measures how well concurrency/parallelism is utilized
- Occupancy captures
  - which resources can be dynamically shared
  - how to reason about resource demands of a CUDA kernel
  - Enables device-specific online tuning of kernel parameters

- Occupancy = (#Active Warps) /(#MaximumActive Warps)
  - Measures how well concurrency/parallelism is utilized
- Occupancy captures
  - which resources can be dynamically shared
  - how to reason about resource demands of a CUDA kernel
  - Enables device-specific online tuning of kernel parameters



- Occupancy = (#Active Warps) /(#MaximumActive Warps)
  - Measures how well concurrency/parallelism is utilized
- Occupancy captures
  - which resources can be dynamically shared
  - how to reason about resource demands of a CUDA kernel
  - Enables device-specific online tuning of kernel parameters





- Occupancy = (#Active Warps) /(#MaximumActive Warps)
  - Measures how well concurrency/parallelism is utilized
- Occupancy captures
  - which resources can be dynamically shared
  - how to reason about resource demands of a CUDA kernel
  - Enables device-specific online tuning of kernel parameters





- Occupancy = (#Active Warps) /(#MaximumActive Warps)
  - Measures how well concurrency/parallelism is utilized
- Occupancy captures
  - which resources can be dynamically shared
  - how to reason about resource demands of a CUDA kernel
  - Enables device-specific online tuning of kernel parameters





- Occupancy = (#Active Warps) /(#MaximumActive Warps)
  - Measures how well concurrency/parallelism is utilized
- Occupancy captures
  - which resources can be dynamically shared
  - how to reason about resource demands of a CUDA kernel Shouldn't we just create as many
  - Enables device-specific online tuning of kernel parameter. threads as possible?

















• Scheduler slots



• Scheduler slots

What is the performance impact of varying kernel resource demands?





Example: v100:

max active warps/SM == 64 (limit: warp context)



- max active warps/SM == 64 (limit: warp context)
- max active blocks/SM == 32 (limit: block control)



- max active warps/SM == 64 (limit: warp context)
- max active blocks/SM == 32 (limit: block control)
  - With 512 threads/block how many blocks can execute (per SM) concurrently?
  - Max active warps \* threads/warp = 64\*32 = 2048 threads  $\rightarrow$



- max active warps/SM == 64 (limit: warp context)
- max active blocks/SM == 32 (limit: block control)
  - With 512 threads/block how many blocks can execute (per SM) concurrently?
  - Max active warps \* threads/warp = 64\*32 = 2048 threads  $\rightarrow 4$



- max active warps/SM == 64 (limit: warp context)
- max active blocks/SM == 32 (limit: block control)
  - With 512 threads/block how many blocks can execute (per SM) concurrently?
  - Max active warps \* threads/warp = 64\*32 = 2048 threads  $\rightarrow 4$
  - With 128 threads/block?  $\rightarrow$



- max active warps/SM == 64 (limit: warp context)
- max active blocks/SM == 32 (limit: block control)
  - With 512 threads/block how many blocks can execute (per SM) concurrently?
  - Max active warps \* threads/warp = 64\*32 = 2048 threads  $\rightarrow 4$
  - With 128 threads/block?  $\rightarrow$  16



- max active warps/SM == 64 (limit: warp context)
- max active blocks/SM == 32 (limit: block control)
  - With 512 threads/block how many blocks can execute (per SM) concurrently?
  - Max active warps \* threads/warp = 64\*32 = 2048 threads  $\rightarrow 4$
  - With 128 threads/block?  $\rightarrow$  16
- Consider HW limit of 32 thread blocks/SM @ 32 threads/block:
  - Blocks are maxed out, but max active threads = 32\*32 = 1024
  - Occupancy = .5 (1024/2048)



- max active warps/SM == 64 (limit: warp context)
- max active blocks/SM == 32 (limit: block control)
  - With 512 threads/block how many blocks can execute (per SM) concurrently?
  - Max active warps \* threads/warp = 64\*32 = 2048 threads  $\rightarrow 4$
  - With 128 threads/block?  $\rightarrow$  16
- Consider HW limit of 32 thread blocks/SM @ 32 threads/block:
  - Blocks are maxed out, but max active threads = 32\*32 = 1024
  - Occupancy = .5 (1024/2048)
- To maximize utilization, thread block size should balance
  - Limits on active thread blocks vs.
  - Limits on active warps



#### Kernel Distributor Limits the #thread blocks TB 0 SM Scheduler Warp Schedulers Impact of #Registers Per Thread Limits the #threads Warp Context SM SM SM SM SP DRAM SP SP SP SP

Limits the #threads

Limits the #thread blocks

Thread Block Control

Register File

L1/Shared Memory

SM - Stream Multiprocessor

SP - Stream Processor

Registers/thread can limit number of active threads!



Registers/thread can limit number of active threads! V100:



Registers/thread can limit number of active threads! V100:

• Registers per thread max: 255



Registers/thread can limit number of active threads! V100:

- Registers per thread max: 255
- 64K registers per SM



Registers/thread can limit number of active threads! V100:

- Registers per thread max: 255
- 64K registers per SM

Assume a kernel uses 32 registers/thread, thread block size of 256



Registers/thread can limit number of active threads! V100:

- Registers per thread max: 255
- 64K registers per SM

Assume a kernel uses 32 registers/thread, thread block size of 256

- Thus, A TB requires 8192 registers for a maximum of 8 thread blocks per SM
  - Uses all 2048 thread slots (8 blocks \* 256 threads/block)
  - 8192 regs/block \* 8 block/SM = 64k registers
  - FULLY Occupied!



## Impact of #Registers Per Thread

Registers/thread can limit number of active threads! V100:

- Registers per thread max: 255
- 64K registers per SM

Assume a kernel uses 32 registers/thread, thread block size of 256

- Thus, A TB requires 8192 registers for a maximum of 8 thread blocks per SM
  - Uses all 2048 thread slots (8 blocks \* 256 threads/block)
  - 8192 regs/block \* 8 block/SM = 64k registers
  - FULLY Occupied!
- What is the impact of increasing number of registers by 2?



## Impact of #Registers Per Thread

Registers/thread can limit number of active threads! V100:

- Registers per thread max: 255
- 64K registers per SM

Assume a kernel uses 32 registers/thread, thread block size of 256

- Thus, A TB requires 8192 registers for a maximum of 8 thread blocks per SM
  - Uses all 2048 thread slots (8 blocks \* 256 threads/block)
  - 8192 regs/block \* 8 block/SM = 64k registers
  - FULLY Occupied!
- What is the impact of increasing number of registers by 2?
  - Recall: granularity of management is a thread block!



# Impact of #Registers Per Thread

Registers/thread can limit number of active threads! V100:

- Registers per thread max: 255
- 64K registers per SM

Assume a kernel uses 32 registers/thread, thread block size of 256

- Thus, A TB requires 8192 registers for a maximum of 8 thread blocks per SM
  - Uses all 2048 thread slots (8 blocks \* 256 threads/block)
  - 8192 regs/block \* 8 block/SM = 64k registers
  - FULLY Occupied!
- What is the impact of increasing number of registers by 2?
  - Recall: granularity of management is a thread block!
  - Loss of concurrency of 256 threads!
  - 34 regs/thread \* 256 threads/block \* 7 blocks/SM = 60k registers,
  - 8 blocks would over-subscribe register file
  - Occupancy drops to .875!



# Impact of Shared Memory

- Shared memory is allocated per thread block
  - Can limit the number of thread blocks executing concurrently per SM
  - Shared mem/block \* # blocks <= total shared mem per SM</li>
- gridDim and blockDim parameters impact demand for
  - shared memory
  - number of thread slots
  - number of thread block slots

## Balance



- Navigate the tradeoffs
  - maximize core utilization and memory bandwidth utilization
  - Device-specific
- Goal: Increase occupancy until one or the other is saturated

## Balance

template < class T >

\_\_host\_\_<u>cudaError\_t</u> cudaOccupancyMaxActiveBlocksPerMultiprocessor ( int\* numBlocks, T func, int\_blockSize, size\_t dynamicSMemSize ) [inline]

Returns occupancy for a device function.

#### Parameters

#### numBlocks

- Returned occupancy

#### func

- Kernel function for which occupancy is calulated

#### blockSize

- Block size the kernel is intended to be launched with

dynamicSMemSize

- Per-block dynamic shared memory usage intended, in bytes

- Navigate the tradeoffs
  - maximize core utilization and memory bandwidth utilization
  - Device-specific
- Goal: Increase occupancy until one or the other is saturated

## Parallel Memory Accesses

- Coalesced main memory access (16/32x faster)
  - HW combines multiple warp memory accesses into a single coalesced access
- Bank-conflict-free shared memory access (16/32)
  - No alignment or contiguity requirements
    - CC 2.x+3.0 : 32 different banks + 1-word broadcast each

## Parallel Memory Architecture

- In a parallel machine, many threads access memory
  - Therefore, memory is divided into banks
  - Essential to achieve high bandwidth
- Each bank can service one address per cycle
  - A memory can service as many simultaneous accesses as it has banks
- Multiple simultaneous accesses to a bank result in a bank conflict
  - Conflicting accesses are serialized



## Coalesced Main Memory Accesses

NVIDIA

#### single coalesced access



one and two coalesced accesses\*



NVIDIA

20

# Bank Addressing Examples



# Bank Addressing Examples



# Linear Addressing

• Given:

\_\_shared\_\_ float shared[256];
float foo =
 shared[baseIndex + s \*
 threadIdx.x];

- This is only bank-conflict-free if s shares no common factors with the number of banks
  - 16 on G80, so s must be odd





#### Race conditions -

- Traditional locks: avoid!
- How do we synchronize?

#### Read-Modify-Write – atomic

| atomicAdd() | atomicInc()  |
|-------------|--------------|
| atomicSub() | atomicDec()  |
| atomicMin() | atomicExch() |
| atomicMax() | atomicCAS()  |

Implemented as write-through to L2

Race conditions -

- Traditional locks: avoid!
- How do we synchronize?

```
// Add "val" to "*data". Return old value.
double atomicAdd(double *data, double val)
{
    while(atomicExch(&locked, 1) != 0)
        ; // Retry lock
    double old = *data;
    *data = old + val;
```

locked = 0;

return old;

## Read-Modify-Write – atomic

| atomicInc()  |
|--------------|
| atomicDec()  |
| atomicExch() |
| atomicCAS()  |
|              |

Implemented as write-through to L2

#### Race conditions -

- Traditional locks: avoid!
- How do we synchronize?

```
// Add "val" to "*data". Return old value.
double atomicAdd(double *data, double val)
{
    while(atomicExch(&locked, 1) != 0)
        ; // Retry lock
    double old = *data;
    *data = old + val;
    locked = 0;
    return old;
}
```

Is this a good idea?

## Read-Modify-Write – atomic

| atomicAdd() | atomicInc()             |
|-------------|-------------------------|
| atomicSub() | atomicDec()             |
| atomicMin() | <pre>atomicExch()</pre> |
| atomicMax() | atomicCAS()             |

#### Implemented as write-through to L2

#### Race conditions -

atomicAdd()

atomicSub()

All active

Some active

Others active

All active

а

а

•

- Traditional locks: avoid!
- How do we synchronize?

## Read-Modify-Write – atomic

Warp of Threads



#### Race conditions -

- Traditional locks: avoid!
- How do we synchronize?

```
// Add "val" to "*data". Return old value.
double atomicAdd(double *data, double val)
{
    while(atomicExch(&locked, 1) != 0)
        ; // Retry lock
    double old = *data;
    *data = old + val;
    locked = 0;
    return old;
}
```

Is this a good idea?

## Read-Modify-Write – atomic

| atomicAdd() | atomicInc()             |
|-------------|-------------------------|
| atomicSub() | atomicDec()             |
| atomicMin() | <pre>atomicExch()</pre> |
| atomicMax() | atomicCAS()             |

#### Implemented as write-through to L2

#### Race conditions -

- Traditional locks: avoid!
- How do we synchronize?

# // Add "val" to "\*data". Return old value. double atomicAdd(double \*data, double val) { while(atomicExch(&locked, 1) != 0) ; // Retry lock double old = \*data; \*data = old + val; locked = 0; return old; }

Is this a good idea?

## Read-Modify-Write – atomic



#### Race conditions -

- Traditional locks: avoid!
- How do we synchronize?

#### Read-Modify-Write – atomic

atomicAdd()
atomicSub()
atomicMin()
atomicMax()

Implemented as write-throug

• "Fire-and-forget"



// Add "val" to "\*data". Return old value.
double atomicAdd(double \*data, double val)
{

```
while(atomicExch(&locked, 1) != 0)
  ; // Retry lock
```

Is this a good

idea?

double old = \*data; \*data = old + val; locked = 0;

return old;

## Advanced Topic: GPU Programming Models



10/11/23









\* 1:1 correspondence between OS-level and user-level abstractions \* Diverse HW support enabled HAL













## No OS support $\rightarrow$ No isolation

GPU benchmark throughput



- Image-convolution in CUDA
- Windows 7 x64 8GB RAM
- Intel Core 2 Quad 2.66GHz
- nVidia GeForce GT230

## No OS support $\rightarrow$ No isolation

GPU benchmark throughput



CPU+GPU schedulers not integrated! ...other pathologies abundant ge-convolution in CUDA dows 7 x64 8GB RAM I Core 2 Quad 2.66GHz dia GeForce GT230

## Composition: Gestural Interface





10/11/23

## Composition: Gestural Interface



















- Requires OS mediation
- High data rates
- Abundant data parallelism ...use GPUs!



### What We'd Like To Do

- Modular design
  - flexibility, reuse
- Utilize heterogeneous hardware
  - ► Data-parallel components → GPU
  - Sequential components  $\rightarrow$  CPU
- Using OS provided tools
  - processes, pipes

### What We'd Like To Do

#> capture | xform | filter | detect &
 CPU GPU GPU CPU

- Modular design
  - flexibility, reuse
- Utilize heterogeneous hardware
  - ► Data-parallel components → GPU
  - Sequential components  $\rightarrow$  CPU
- Using OS provided tools
  - processes, pipes

# GPU Execution model

- GPUs cannot run OS:
  - different ISA
  - Memories have different coherence guarantees
    - (disjoint, or require fence instructions)

#### Host CPU must "manage" GPU execution

- Program inputs explicitly transferred/bound at runtime
- Device buffers pre-allocated



# GPU Execution model

- GPUs cannot run OS:
  - different ISA
  - Memories have different coherence guarantees
    - (disjoint, or require fence instructions)

#### Host CPU must "manage" GPU execution

- Program inputs explicitly transferred/bound at runtime
- Device buffers pre-allocated







#> capture | xform | filter | detect &





















#> capture | xform | filter | detect &



#> capture | xform | filter | detect &



#> capture | xform | filter | detect &



#> capture | xform | filter | detect &





#### Device-centric APIs considered harmful

```
Matrix
gemm(Matrix A, Matrix B) {
    copyToGPU(A);
    copyToGPU(B);
    invokeGPU();
    Matrix C = new Matrix();
    copyFromGPU(C);
    return C;
}
```

#### Device-centric APIs considered harmful

```
Matrix
gemm(Matrix A, Matrix B) {
    copyToGPU(A);
    copyToGPU(B);
    invokeGPU();
    Matrix C = new Matrix();
    copyFromGPU(C);
    return C;
}
```

What happens if I want the following? Matrix D = A x B x C

```
Matrix
AxBxC(Matrix A, B, C) {
    Matrix AxB = gemm(A,B);
    Matrix AxBxC = gemm(AxB,C);
    return AxBxC;
}
```

```
Matrix
AxBxC(Matrix A, B, C) {
   Matrix AxB = gemm(A,B);
   Matrix AxBxC = gemm(AxB,C);
   return AxBxC;
}
```

Matrix gemm(Matrix A, Matrix B) { copyToGPU(A); copyToGPU(B); invokeGPU(); Matrix C = new Matrix(); copyFromGPU(C); return C;

}

```
gemm(Matrix A, Matrix B) {
               AxB copied from
                                       copyToGPU(A);
                                       copyToGPU(B);
               GPU memory...
Matrix
                                       return C;
AxBxC(Matrix A, B, C) {
   Matrix(AXB) = gemm(A,B);
   Matrix AxBxC = gemm(AxB,C);
   return AxBxC;
}
```

invokeGPU(); Matrix C = new Matrix(); copyFromGPU(C);

Matrix

```
Matrix
                                         gemm(Matrix A, Matrix B) {
                                           copyToGPU(A);
                                           copyToGPU(B);
                                           invoke_SPU();
                                           Matrix C = new Matrix();
                                           copyFromGPU(C);
Matrix
                                                С;
                                           returr
AxBxC(Matrix A, B, C) {
                                         }
    Matrix AxB = gemm(A,B);
    Matrix AxBxC = gemm(AXB,C);
    return AxBxC;
}
                                        ...only to be copied
                                        right back!
```

### What if I have many GPUs?

```
Matrix
gemm(Matrix A, Matrix B) {
    copyToGPU(A);
    copyToGPU(B);
    invokeGPU();
    Matrix C = new Matrix();
    copyFromGPU(C);
    return C;
}
```

#### What if I have many GPUs?

Matrix
gemm(GPU dev, Matrix A, Matrix B) {
 copyToGPU(dev, A);
 copyToGPU(dev, B);
 invokeGPU(dev);
 Matrix C = new Matrix();
 copyFromGPU(dev, C);
 return C;
}

#### What if I have many GPUs?

Matrix
gemm(GPU dev, Matrix A, Matrix B) {
 copyToGPU(dev, A);
 copyToGPU(dev, B);
 invokeGPU(dev);
 Matrix C = new Matrix();
 copyFromGPU(dev, C);
 return C;
}

What happens if I want the following? Matrix D = A x B x C

#### Composition with many GPUs

```
Matrix
gemm(GPU dev, Matrix A, Matrix B)
{
     copyToGPU(A);
     copyToGPU(B);
     invokeGPU();
     Matrix C = new Matrix();
     copyFromGPU(C);
     return C;
}
```

```
Matrix
AxBxC(Matrix A,B,C) {
    Matrix AxB = gemm(???, A,B);
    Matrix AxBxC = gemm(???, AxB,C);
    return AxBxC;
}
```

```
Matrix
gemm(GPU dev, Matrix A, Matrix B)
{
     copyToGPU(A);
     copyToGPU(B);
     invokeGPU();
     Matrix C = new Matrix();
     copyFromGPU(C);
     return C;
}
```

```
Matrix
AxBxC(GPU dev, Matrix A,B,C) {
    Matrix AxB = gemm(dev, A,B);
    Matrix AxBxC = gemm(dev, AxB,C);
    return AxBxC;
}
```



```
Matrix
AxBxC(GPU dev, Matrix A,B,C) {
    Matrix AxB = gemm(dev, A,B);
    Matrix AxBxC = gemm(dev, AxB,C);
    return AxBxC;
```

```
Matrix
gemm(GPU dev, Matrix A, Matrix B)
{
    copyToGPU(A);
    copyToGPU(B);
    invokeGPU();
    Matrix C = new Matrix();
    copyFromGPU(C);
    return C;
}
```

```
Matrix
AxBxC(GPU devA, GPU devB, Matrix A,B,C) {
    Matrix AxB = gemm(devA, A,B);
    Matrix AxBxC = gemm(devB, AxB,C);
    return AxBxC;
}
```



```
Matrix
AxBxC(GPU devA, GPU devB, Matrix A,B,C) {
    Matrix AxB = gemm(devA, A,B);
    Matrix AxBxC = gemm(devB, AxB,C);
    return AxBxC;
```



# Matrix AxBxC(GPU devA, GPU devB, Matrix A,B,C) { Matrix AxB = gemm(devA, A,B); Matrix AxBxC = gemm(devB, AxB,C); return AxBxC;

Why don't we have this problem with CPUs?

#### Dataflow: a better abstraction



Minimal specification of data movement: runtime does it.

- asynchrony is a runtime concern (not programmer concern)
- No specification of compute  $\rightarrow$  device mapping: like threads!

### Advanced Topic: GPU Coherence

43











Each cache line has a state (M, E, S, I)

Processors "snoop" bus to maintain states





- Processors "snoop" bus to maintain states
- Initially  $\rightarrow$  'I'  $\rightarrow$  Invalid





- Processors "snoop" bus to maintain states
- Initially  $\rightarrow$  'I'  $\rightarrow$  Invalid
- Read one  $\rightarrow$  'E'  $\rightarrow$  exclusive





- Processors "snoop" bus to maintain states
- Initially  $\rightarrow$  'I'  $\rightarrow$  Invalid
- Read one  $\rightarrow$  'E'  $\rightarrow$  exclusive
- Reads  $\rightarrow$  'S'  $\rightarrow$  multiple copies possible





- Processors "snoop" bus to maintain states
- Initially  $\rightarrow$  'I'  $\rightarrow$  Invalid
- Read one  $\rightarrow$  'E'  $\rightarrow$  exclusive
- Reads  $\rightarrow$  'S'  $\rightarrow$  multiple copies possible
- Write  $\rightarrow$  'M'  $\rightarrow$  single copy  $\rightarrow$  lots of cache coherence traffic





- Processors "snoop" bus to maintain states
- Initially  $\rightarrow$  'I'  $\rightarrow$  Invalid
- Read one  $\rightarrow$  'E'  $\rightarrow$  exclusive
- Reads  $\rightarrow$  'S'  $\rightarrow$  multiple copies possible
- Write  $\rightarrow$  'M'  $\rightarrow$  single copy  $\rightarrow$  lots of cache coherence traffic



## GPU Cache Coherence Challenges

Challenge 1: Coherence traffic



### GPU Cache Coherence Challenges

- Challenge 2: Tracking in-flight requests
  - Significant % of L2



## Background: Directory Protocol

- For each block: centralized
   "directory" for state in caches
- Directory is co-located with some global view of memory
- Requests are no longer seen by everyone
  - Writes are serialized through directory



### GPU-VI

- Directory-Based
  - Different from snoop-model
  - Global directory metadata at L2
- Two states
  - Valid
  - Invalid
- Writes invalidate other copies



## Temporal Coherence (TC)



## TC-Strong vs TC-Weak



