# Parallel Architectures

Chris Rossbach

## Outline for Today

- Questions?
- Administrivia
  - Exam soon
- Agenda
  - Parallel Architectures (GPU background)

### Faux Quiz questions

- What is hardware multi-threading; what problem does it solve?
- What is the difference between a vector processor and a scalar processor?
- Implement a parallel scan or reduction
- How are GPU workloads different from GPGPU workloads?
- How does SIMD differ from SIMT?
- List and describe some pros and cons of vector/SIMD architectures.
- GPUs historically have elided cache coherence. Why? What impact does it have on the programmer?
- List some ways that GPUs use concurrency but not necessarily parallelism.











- 80 SMs
  - Streaming Multiprocessor (also: CU or ACE)
  - 64 cores/SM
  - 5120 threads!
  - 15.7 TFLOPS
- 640 Tensor cores
- HBM2 memory
  - 4096-bit bus
  - No cache coherence!
- 16 GB memory
  - PCIe-attached

Roughly: all of pfxsum, 1,000s of times per second
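
As a back-of-the-envelope check on these figures (a sketch only, assuming each core retires one fused multiply-add, counted as 2 floating-point operations, per cycle, and a clock of roughly 1.53 GHz; neither assumption appears on the slide):

$$80 \text{ SMs} \times 64 \text{ cores/SM} = 5120 \text{ cores}$$

$$5120 \text{ cores} \times 2 \,\tfrac{\text{FLOP}}{\text{cycle}} \times 1.53 \times 10^{9} \,\tfrac{\text{cycles}}{\text{s}} \approx 15.7 \times 10^{12} \,\tfrac{\text{FLOP}}{\text{s}}$$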



How do you program a machine like this? pthread\_create()?

#### GPUs: Outline

- Background from many areas
  - Architecture
    - Vector processors
    - Hardware multi-threading
  - Graphics
    - Graphics pipeline
    - Graphics programming models
  - Algorithms
    - parallel architectures → parallel algorithms
- Programming GPUs
  - CUDA
  - Basics: getting something working
  - Advanced: making it perform

#### Processor algorithm:

```
main() {
  while(true)
    do_next_instruction();
}

do_next_instruction() {
  instruction = fetch();
  ops, regs = decode(instruction);
  execute_calc_addrs(ops, regs);
  access_memory(ops, regs);
  write_back(regs);
}
```

Pipelined version: one thread per stage, with queues between stages.

```
main() {
  pthread_create(do_instructions);
  pthread_create(do_decode);
  pthread_create(do_execute);
  pthread_join(...);
}

do_instructions() {
  while(true) {
    instruction = fetch();
    enqueue(DECODE, instruction);
}}
do_decode() {
  while(true) {
    instruction = dequeue();
    ops, regs = decode(instruction);
    enqueue(EX, instruction);
}}
do_execute() {
  while(true) {
    instruction = dequeue();
    execute_calc_addrs(ops, regs);
    enqueue(MEM, instruction);
}}
```

Pipeline stage occupied by each instruction in each clock cycle:

| Instr. No. | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 |
|------------|---------|---------|---------|---------|---------|---------|---------|
| 1          | IF      | ID      | EX      | MEM     | WB      |         |         |
| 2          |         | IF      | ID      | EX      | MEM     | WB      |         |
| 3          |         |         | IF      | ID      | EX      | MEM     | WB      |
| 4          |         |         |         | IF      | ID      | EX      | MEM     |
| 5          |         |         |         |         | IF      | ID      | EX      |
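
The benefit of overlapping stages can be put in numbers. The following is a minimal sketch, assuming an ideal pipeline with no stalls; the function names are hypothetical and not from the slides.

```c
/* Cycles needed to run n instructions on an ideal k-stage pipeline
 * with no stalls: k cycles for the first instruction to pass through
 * every stage, then one further instruction completes per cycle. */
unsigned long pipelined_cycles(unsigned long n, unsigned long k) {
    return n == 0 ? 0 : k + (n - 1);
}

/* Without pipelining, each instruction occupies all k stages in turn
 * before the next one starts. */
unsigned long unpipelined_cycles(unsigned long n, unsigned long k) {
    return n * k;
}
```

For the five-stage pipeline in the table, `pipelined_cycles(3, 5)` gives 7, matching instruction 3 reaching WB in cycle 7, whereas `unpipelined_cycles(3, 5)` gives 15.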

Works well if the pipeline is kept full. What kinds of things cause "bubbles"/stalls?

How can we get \*more\* parallelism?








## Multi-core/SMPs

- Pros: Simple
- Cons: programmer has to find the parallelism!

```
main() {
  for(i=0; i<CORES; i++) {
    pthread_create(do_instructions);
  }
}
do_instructions() {
  while(true) {
    instruction = fetch();
    ops, regs = decode(instruction);
    execute_calc_addrs(ops, regs);
    access_memory(ops, regs);
    write_back(regs);
  }
}
```

Other techniques extract parallelism within the instruction stream itself: they try to let the machine find the parallelism.







```
main() {
  for(i=0; i<CORES; i++)
    pthread_create(decode_exec);
  while(true) {
    instruction = fetch();
    enqueue(instruction);
  }
}
decode_exec() {
  instruction = dequeue();
  ops, regs = decode(instruction);
  execute_calc_addrs(ops, regs);
  access_memory(ops, regs);
  write_back(regs);
}
```

Doesn't look that different, does it? Why do it?

**Enables independent instruction parallelism.**

```
lw   $t0,20($s2)
addu $t1,$t0,$t2
sub  $s4,$s4,$t3
slti $t5,$s4,20
```
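
For the four-instruction MIPS example in this section (`lw`, `addu`, `sub`, `slti`), the hardware has to work out which instructions are independent before it can issue them together. The following is a rough sketch of that check, assuming a simplified, hypothetical instruction record (`instr_t`) and looking only at read-after-write dependences; real issue logic also handles write-after-write and write-after-read hazards, typically via register renaming.

```c
#include <stdbool.h>

/* Hypothetical, simplified instruction record: one destination
 * register and up to two source registers (-1 means "unused"). */
typedef struct {
    int dst;
    int src1;
    int src2;
} instr_t;

/* True if `later` reads a register that `earlier` writes
 * (a read-after-write dependence), so the two cannot issue together. */
bool raw_depends(instr_t earlier, instr_t later) {
    return earlier.dst >= 0 &&
           (later.src1 == earlier.dst || later.src2 == earlier.dst);
}
```

With MIPS register numbers ($t0 = 8, $s4 = 20, and so on), `addu` depends on `lw` through $t0 and `slti` depends on `sub` through $s4, but the `lw`/`addu` pair and the `sub`/`slti` pair are independent of each other, so the two pairs can proceed in parallel.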

```
# C code
for (i=0; i<64; i++)
  C[i] = A[i] + B[i];

# Scalar Code
  LI R4, 64
loop:
  L.D F0, 0(R1)
  L.D F2, 0(R2)
  ADD.D F4, F2, F0
  S.D F4, 0(R3)
  DADDIU R1, 8
  DADDIU R2, 8
  DADDIU R3, 8
  DSUBIU R4, 1
  BNEZ R4, loop
```
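
For contrast with the scalar loop, here is a sketch of how the same computation might be organized for a vector unit. It assumes a hypothetical vector length `VLEN` of 8 lanes; the inner function stands in for a single vector instruction, and plain C only models its lanes sequentially.

```c
#include <stddef.h>

#define VLEN 8  /* hypothetical number of vector lanes */

/* One "vector instruction": the same add applied to up to VLEN
 * elements. On real vector hardware the lanes run in parallel;
 * this loop only models that behavior sequentially. */
static void vector_add(const double *a, const double *b, double *c, size_t len) {
    for (size_t lane = 0; lane < len; lane++)
        c[lane] = a[lane] + b[lane];
}

/* Strip-mine an n-element add into chunks of at most VLEN elements. */
void add_arrays(const double *a, const double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i += VLEN) {
        size_t len = (n - i < VLEN) ? (n - i) : VLEN;
        vector_add(a + i, b + i, c + i, len);
    }
}
```

With 64 elements and 8 lanes, the outer loop issues 8 vector adds instead of 64 scalar iterations, and the loop-control overhead shrinks by the same factor.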









```
main() {
  for(i=0; i<CORES; i++)
    pthread_create(exec);
  while(true) {
    ops, regs = fetch_decode();
    enqueue(ops, regs);
  }
}
exec() {
  ops, regs = dequeue();
  execute_calc_addrs(ops, regs);
  access_memory(ops, regs);
  write_back(regs);
}
```

Single instruction stream, multiple computations.
But now all my instructions need multiple operands!

- Process multiple data elements simultaneously.
- Common in supercomputers of the 1970s, 80s, and 90s.
- Modern CPUs support some vector processing instructions
  - Usually called SIMD
- Can operate on a few vector elements per clock cycle in a pipeline, or
  - SIMD: operate on all elements per clock cycle
- (1962) University of Illinois ILLIAC IV, completed 1972 → 64 ALUs, 100-150 MFlops
- (1973) TI's Advanced Scientific Computer (ASC), 20-80 MFlops
- (1975) Cray-1: first to have vector registers instead of keeping data in memory



Single instruction stream, multiple data → programming model has to change



### Vector Processors

- Instruction fetch control logic shared
- Same instruction stream executed on
  - Multiple pipelines
  - Multiple different operands in parallel

GPUs: same basic idea



## When does vector processing help?

What are the potential bottlenecks here? When can it improve throughput?

Only helps if memory can keep the pipeline busy!
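
One way to make the memory constraint concrete is a simple bandwidth bound. This is a sketch, not something taken from the slides; the function name and the idea of expressing work as "operations per byte moved" are assumptions introduced for illustration.

```c
/* Attainable throughput (FLOP/s) under a simple bandwidth bound:
 * the smaller of the compute peak and what memory can feed.
 * intensity = floating-point operations per byte moved. */
double attainable_flops(double peak_flops, double mem_bytes_per_sec,
                        double intensity) {
    double memory_bound = mem_bytes_per_sec * intensity;
    return memory_bound < peak_flops ? memory_bound : peak_flops;
}
```

For a loop like `C[i] = A[i] + B[i]` on doubles, each add moves 24 bytes (two 8-byte loads and one 8-byte store), so the intensity is 1/24 FLOP per byte. At that intensity the memory term is almost always the smaller one, and adding more vector lanes does not help.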

### Hardware multi-threading

- Address memory bottleneck
- Share exec unit across
  - Instruction streams
  - Switch on stalls
- Looks like multiple cores to the OS
- Three variants:
  - Coarse
  - Fine-grain
  - Simultaneous



### Running example



- Colors → pipeline full
- White → stall

#### Coarse-grain multithreading

- Single thread runs until a costly stall
  - E.g. 2nd level cache miss
- Another thread starts during stall
  - Pipeline fill time requires several cycles!
- Does not cover short stalls
- Hardware support required
  - PC and register file for each thread
  - Little other hardware
  - Looks like another physical CPU to OS/software



#### Fine-grain multithreading

- Threads interleave instructions
  - Round-robin
  - Skip stalled threads
- Hardware support required
  - Separate PC and register file per thread
  - Hardware to control alternating pattern
- Naturally hides delays
  - Data hazards, cache misses
  - Pipeline runs with rare stalls
- Doesn't make full use of multi-issue
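
The round-robin, skip-stalled-threads policy can be sketched as a small selection function. This is an illustration only: the function name and the boolean `stalled` array are assumptions, and real hardware makes this choice with dedicated logic every cycle.

```c
#include <stdbool.h>

/* Pick the next thread to issue from, in round-robin order starting
 * after `last`, skipping threads that are currently stalled.
 * Returns -1 if every thread is stalled (the pipeline gets a bubble). */
int next_ready_thread(const bool *stalled, int nthreads, int last) {
    for (int step = 1; step <= nthreads; step++) {
        int candidate = (last + step) % nthreads;
        if (!stalled[candidate])
            return candidate;
    }
    return -1;
}
```

With four threads where thread 1 is stalled and thread 0 issued last, the function returns 2: the stalled thread is skipped and the pipeline stays busy.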



#### Simultaneous multithreading

- Instructions from multiple threads issued on same cycle
  - Uses register renaming
  - Dynamic scheduling facility of multi-issue architecture
- Hardware support:
  - Register files, PCs per thread
  - Temporary result registers pre-commit
  - Support to sort out which threads get results from which instructions
- Maximal utilization of execution units





## Why Vector and Multithreading Background?

#### GPU:

- A very wide vector machine
- Massively multi-threaded to hide memory latency
- Originally designed for graphics pipelines...

#### Inputs

- 3D world model (objects, materials)
  - Geometry modeled with triangle meshes, surface normals
  - GPUs subdivide triangles into "fragments" (rasterization)
  - Materials modeled with "textures"
  - Texture coordinates, sampling "map" textures → geometry
- Light locations and properties
  - Attempt to model surface/light interactions with modeled objects/materials
- View point

#### Output

2D projection seen from the view-point



```
foreach(vertex v in model)
       map v_model -> v_view
fragment[] frags = {};
foreach triangle t (v_0, v_1, v_2)
       frags.add(rasterize(t));
foreach fragment f in frags
       choose_color(f);
display(visible_fragments(frags));
```

Each step maps onto a stage of the OpenGL pipeline.
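
The `map v_model → v_view` step is typically a matrix-vector product applied independently to every vertex. A minimal sketch, assuming vertices are homogeneous 4-vectors and the transform is a row-major 4x4 matrix (the slides do not specify a representation, and `transform_vertex` is a hypothetical name):

```c
/* Map one vertex from model space to view space by multiplying a
 * 4x4 row-major matrix with a homogeneous (x, y, z, w) vector.
 * `out` must not alias `v`. */
void transform_vertex(const float m[16], const float v[4], float out[4]) {
    for (int row = 0; row < 4; row++) {
        out[row] = 0.0f;
        for (int col = 0; col < 4; col++)
            out[row] += m[row * 4 + col] * v[col];
    }
}
```

Because each vertex is transformed without reference to any other, this stage is a natural fit for data-parallel hardware.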



Limited "programmability" of shaders:

- Minimal/no control flow
- Maximum instruction count

#### Late Modernity: unified shaders



- Mapping to graphics pipeline no longer apparent
- Processing elements no longer specialized to a particular role
- Model supports *real* control flow, larger instruction count

#### Mostly Modern: Pascal

#### Definitely Modern: Turing

#### Modern Enough: Pascal SM





#### Cross-generational observations

GPUs designed for parallelism in graphics pipeline:

- Data
  - Per-vertex
  - Per-fragment
  - Per-pixel
- Task
  - Vertex processing
  - Fragment processing
  - Rasterization
  - Hidden-surface elimination
- MLP (memory-level parallelism)
  - HW multi-threading for hiding memory latency

Even as GPU architectures become more general, certain assumptions persist:

1. Data parallelism is *trivially* exposed
2. **All** problems look like painting a box with colored dots

But what if my problem isn't painting a box?!!?!

### The big ideas still present in GPUs

- Simple cores
- Single instruction stream
  - Vector instructions (SIMD) OR
  - Implicit HW-managed sharing (SIMT)
- Hide memory latency with HW multi-threading

### Programming Model

- GPUs are I/O devices, managed by user-code
- "kernels" == "shader programs"
- 1000s of HW-scheduled threads per kernel
- Threads grouped into independent blocks.
  - Threads in a block can synchronize (barrier)
  - This is the \*only\* synchronization
- "Grid" == "launch" == "invocation" of a kernel
  - a group of blocks (or warps)
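
To make the grid/block/thread hierarchy concrete, here is a sequential C emulation of a kernel launch. It is a sketch under stated assumptions: `block_idx`, `block_dim`, and `thread_idx` play the roles of CUDA's `blockIdx.x`, `blockDim.x`, and `threadIdx.x`, the kernel and launcher names are hypothetical, and the nested loops stand in for threads that a GPU would schedule in hardware.

```c
#include <stddef.h>

/* Hypothetical kernel body: one logical thread handles one element. */
static void scale_kernel(int block_idx, int block_dim, int thread_idx,
                         float *data, size_t n, float factor) {
    size_t i = (size_t)block_idx * (size_t)block_dim + (size_t)thread_idx;
    if (i < n)              /* the last block may be only partly full */
        data[i] *= factor;
}

/* Sequential emulation of a kernel launch: a grid of blocks, each
 * containing block_dim threads. A GPU would run these concurrently. */
void launch_scale(float *data, size_t n, float factor, int block_dim) {
    int num_blocks = (int)((n + (size_t)block_dim - 1) / (size_t)block_dim);
    for (int b = 0; b < num_blocks; b++)
        for (int t = 0; t < block_dim; t++)
            scale_kernel(b, block_dim, t, data, n, factor);
}
```

Note that the kernel is written from the point of view of a single thread; the launch decides how many threads exist. The bounds check matters because the grid is rounded up to a whole number of blocks.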

### Parallel Algorithms

- Sequential algorithms often do not permit easy parallelization
  - Does not mean the work has no parallelism
  - A different approach can yield parallelism
  - but often changes the algorithm
  - Parallelizing != just adding locks to a sequential algorithm
- Parallel Patterns
  - Map
  - Scatter, Gather
  - Reduction
  - Scan
  - Search, Sort

If you can express your algorithm using these patterns, an apparently fundamentally sequential algorithm can be made parallel.

#### Map

- Inputs
  - Array A
  - Function f(x)
- map(A, f)  $\rightarrow$  apply f(x) on all elements in A
- Parallelism trivially exposed
  - f(x) can be applied in parallel to all elements, in principle

```
// Sequential loop
for(i=0; i<numPoints; i++) {
    labels[i] = findNearestCenter(points[i]);
}

// The same computation expressed as a map
map(points, findNearestCenter)
```
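
A minimal C sketch of the map pattern, assuming integer elements and a function pointer for f; `map_int` and `square` are hypothetical names chosen for illustration.

```c
#include <stddef.h>

/* Apply f to every element of in[], writing results to out[].
 * Iterations are independent, so they could run in any order
 * or in parallel. */
void map_int(const int *in, int *out, size_t n, int (*f)(int)) {
    for (size_t i = 0; i < n; i++)
        out[i] = f(in[i]);
}

/* Example element function. */
int square(int x) { return x * x; }
```

The key property is that `f` touches only its own element: no iteration reads what another iteration writes.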

#### Scatter and Gather

- Gather:
  - Read multiple items to single location
- Scatter:
  - Write single data item to multiple locations
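
One common array-level formulation of these two patterns uses an index array. This is an assumption for illustration (the slide defines the patterns only informally), and the function names are hypothetical.

```c
#include <stddef.h>

/* Gather: output slot i reads from an arbitrary source index idx[i]. */
void gather(const int *src, const size_t *idx, int *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[idx[i]];
}

/* Scatter: input element i is written to an arbitrary destination
 * index idx[i]. If two entries of idx collide, the result depends
 * on the order in which the writes happen. */
void scatter(const int *src, const size_t *idx, int *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[idx[i]] = src[i];
}
```

Gather parallelizes cleanly because every output slot has exactly one writer. Scatter is only safe to parallelize without extra care when the indices are distinct.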

#### Reduction

- Input
  - Associative operator op
  - Ordered set s = [a, b, c, ... z]
- Reduce(op, s) returns a op b op c ... op z

```
for(i=0; i<N; ++i) {
  accum += sqr(point[i]);
}
accum = reduce(+, map(sqr, point))
```

Why must op be associative?
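
A tree-shaped reduction suggests the answer: it keeps the elements in their original left-to-right order but regroups them, computing for example (a op b) op (c op d) instead of ((a op b) op c) op d. The two agree only if op is associative. The following is a sequential sketch of that tree, with hypothetical function names; the pairs combined within one round are independent and could run in parallel.

```c
#include <stddef.h>

/* In-place tree reduction with an associative operator.
 * In each round, element i absorbs element i + stride; all pairs
 * within a round are independent of one another.
 * Returns the reduced value (left in data[0]); n must be >= 1. */
int tree_reduce(int *data, size_t n, int (*op)(int, int)) {
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            data[i] = op(data[i], data[i + stride]);
    return data[0];
}

/* Example associative operator. */
int add_int(int a, int b) { return a + b; }
```

The sequential loop needs N - 1 dependent steps, while the tree needs only about log2(N) rounds, each of which can use many execution units at once.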

#### Scan (prefix sum)

- Input
  - Associative operator op
  - Ordered set s = [a, b, c, ... z]
  - Identity I
- scan(op, s) = [I, a, (a op b), (a op b op c) ...]
- Scan is the workhorse of parallel algorithms:
  - Sort, histograms, sparse matrix, string compare, ...
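
Scan looks inherently sequential, since each output depends on all earlier inputs, but it can be restructured as two tree passes. The sketch below is a sequential emulation of the work-efficient up-sweep/down-sweep scheme usually attributed to Blelloch, specialized to integer addition with identity 0. It is a swapped-in illustration rather than the method from these slides, the function name is hypothetical, and it assumes the length is a power of two.

```c
#include <stddef.h>

/* Exclusive prefix sum, in place, using an up-sweep and a down-sweep.
 * n must be a power of two. The iterations of each inner loop are
 * independent, so a parallel machine could run them concurrently. */
void exclusive_scan_sum(int *data, size_t n) {
    if (n == 0) return;
    /* Up-sweep: build partial sums in a balanced tree. */
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t i = 2 * stride - 1; i < n; i += 2 * stride)
            data[i] += data[i - stride];
    /* Down-sweep: start from the identity and push prefixes down. */
    data[n - 1] = 0;
    for (size_t stride = n / 2; stride >= 1; stride /= 2)
        for (size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            int left = data[i - stride];
            data[i - stride] = data[i];
            data[i] += left;
        }
}
```

For the input [1, 2, 3, 4] this produces [0, 1, 3, 6], matching the definition scan(op, s) = [I, a, (a op b), (a op b op c) ...] with op = + and I = 0.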



#### Summary

Re-expressing apparently sequential algorithms as combinations of parallel patterns is a common technique when targeting GPUs.