### **GPU Programming**

### Keshav Pingali

Some slides borrowed from David Kirk and Wen-Mei Hwu, and from Ruetsch and Oster

### **Terminology**

- Graphics Processing Unit (GPU)
  - special processors (accelerators) designed to speed up graphics applications
- General-purpose GPUs (GPGPU)
  - GPUs that have been massaged so that they can be used for both graphics and general-purpose applications
  - we will just refer to them as GPU's
- Compute Unified Device Architecture (CUDA)
  - NVIDIA programming model for their GPU's
- Open Computing Language (OpenCL)
  - emerging standard for programming heterogeneous processors: multicores + GPUs + other accelerators
- Kernel
  - a function/loop that is executed on GPU
  - a program will usually consist of a sequence of kernels interspersed with code that is executed on the host device (CPU)

### **Key features of GPUs**

- Lots of threads
  - (eg) NVIDIA Fermi streaming processor has
    - 512 cores
  - lightweight threads: managed by hardware, start-up cost is small
- SIMT execution
  - groups of threads (warp) operate in SIMD
  - Siamese twins: 32 threads joined at hip
  - threads in warp are co-scheduled for execution
- Latency-tolerant architecture
  - processor time-slices between warps to mask memory and synchronization latencies
  - cf: time-sharing, dataflow







### **Exposed Memory Hierarchy** Global memory: Read/written by host Read/written by all GPU threads Used to transfer data back and forth between host and GPU Relatively slow: 400-800 cycles **Constant memory:** Read/written by host Read by GPU threads Used to transfer read-only information **Shared memory:** Read/written by groups of threads called thread blocks or just blocks Like a software managed L1 cache Faster than global memory:1-4 cycles Registers: In principle, global memory + registers are enough. Shared-memory: intermediate level of memory hierarchy Read/written by thread

Private to each thread

### 



### First CUDA example

- Let us write a program to compute
  - -C = A+B
  - where A,B,C are arrays
- Thread i will compute
  - -C[i] = A[i] + B[i]

# int main (int argc, char \*\*argv ) { 1. Allocate memory space in device (GPU) for input and output data 2. Create input data in host (CPU) 3. Copy input data to GPU 4. Call "kernel" routine to execute on GPU (with CUDA syntax that defines no of threads and their logical organization) 5. Transfer output data from GPU to CPU 6. Free memory space in device (GPU) 7. Free memory space in host (CPU) return value; } Kernel routine: marching orders for one thread (cf. pThreads) use threadld and blockld to give different work to different threads Call to a kernel function is asynchronous from CUDA 1.0 on. Synchronization for blocking host till all previous CUDA calls complete: cudaThreadSynchronize()

### **CUDA Function Declarations**

|                             | Executed on the: | Only callable from the: |
|-----------------------------|------------------|-------------------------|
| _device_ float DeviceFunc() | device           | device                  |
| _global_ void KernelFunc()  | device           | host                    |
| _host_ float HostFunc()     | host             | host                    |

- Executed on host, callable from device: not supported
- \_global\_ defines a kernel function, must return void
- \_device\_ and \_host\_ can be used together

```
__global__ void vecAdd(int *A, int *B, int *C) {
    int i = threadldx.x;
    C[i] = A[i] + B[i];
}
int main (int argc, char **argv ) {
   int size = N *sizeof( int);
int *a, *b, *c, *devA, *devB, *devC;
   cudaMalloc( (void**)&devA, size) );
cudaMalloc( (void**)&devB, size );
cudaMalloc( (void**)&devC, size );
    a = (int*)malloc(size); b = (int*)malloc(size);c = (int*)malloc(size);
    cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice); cudaMemcpy( devB, b size, cudaMemcpyHostToDevice);
    vecAdd<<<1, N>>>(devA, devB, devC);
    cudaMemcpy( &c, devC size, cudaMemcpyDeviceToHost);
    cudaFree( dev_a);
cudaFree( dev_b);
cudaFree( dev_c);
free( a ); free( b ); free( c );
   return (0);
```

### 1. Allocating memory in **GPU** global memory for data

### Use CUDA malloc routines:

```
int size = N *sizeof( int); // space for N integers
```

```
int *devA, *devB, *devC; // devA, devB, devC ptrs
cudaMalloc((void**)&devA, size);
cudaMalloc((void**)&devB, size );
cudaMalloc((void**)&devC, size );
    11
```

### 2. Creating input data in "host" (CPU)

Use regular C malloc routines or static variables and compute:

```
int *a, *b, *c;
a = (int*)malloc(size);
b = (int*)malloc(size);
c = (int*)malloc(size);
```



### 4. Calling "kernel" routine to execute on device (GPU)

CUDA introduces a syntax addition to C:

Triple angle brackets mark call from host code to device code. Contains organization and number of threads in two parameters:

myKernel<<< n, m >>>(arg1, ...);

n and m will define organization of thread blocks, and threads in a block.

For now, we will set n=1, which say one block and m=N, which says N threads in this block.

**arg1**, ... , -- arguments to routine **myKernel** typically pointers to device memory obtained previously from **cudaMalloc**.

14

### **Declaring a Kernel Routine** Example - Adding to vectors A and B #define N 256 \_\_global\_\_ void vecAdd(int \*A, int \*B, int \*C) { // Kernel definition int i = threadIdx.x; C[i] = A[i] + B[i];Each of the N threads performs one pairwise addition: int main() { Thread 0: devC[0] = devA[0] + devB[0]; Thread 1: devC[1] = devA[1] + devB[1]; // allocate device memory & // copy data to device Thread N-1: devC[N-1] = devA[N-1]+devB[N-1]; // device mem. ptrs devA,devB,devC vecAdd<<<1, N>>>(devA,devB,devC); Grid of one block, block has N threads 15

### 5. Transferring data from device (GPU) to host (CPU)

Use CUDA routine cudaMemcpy

cudaMemcpy( &C, devC, size, cudaMemcpyDeviceToHost);

where devC is a pointer in device and C is a pointer in host.

## G. Free memory in device and host Use CUDA cudaFree routine: cudaFree( devA); cudaFree( devB); cudaFree( devC); free( a ); free( b ); free( c );

```
#define N 256

__global__ void vecAdd(int *A, int *B, int *C) {
    int i = threadidx.x;
        C(i) = A(i) + B(i);
    }

int main (int argc, char **argv ) {
    int size = N *sizeo(i int);
    int *a,*b,*c, *devA, *devB, *devC;

    cudaMalloct (void**)&devA, size) };
    cudaMalloct (void**)&devC, size );
    cudaMalloct (void**)&devC, size );
    cudaMalloct (void**)&devC, size);
    cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice);
    vecAdd<<1, N>>>(devA, devB, devC);

    cudaMemcpy( &c, devC size, cudaMemcpyHostToDevice);
    vecAdd<<1, N>>>(devA, devB, devC);

    cudaFree (devA;
    cudaFree (devA;
    cudaFree (devB);
    cudaFree (devB);
    return (0);
    }

18
```

### Note about cudaMemcpy

- cudaMemcpy is synchronous
  - begins after all previous CUDA calls are complete
  - return control to host after copy is complete
- cudaMemcpyAsync: asynchronous version

```
// copy data from host to device cudaMemcpy(a_d, a_h, numBytes, cudaMemcpyHostToDevice);

// execute the kernel inc_gpu<<<ceil(N/(float)blocksize), blocksize>>>{a_d, N};

// run independent CPU code run_cpu_stuff();

// copy data from device back to host cudaMemcpy(a_h, a_d, numBytes, cudaMemcpyDeviceToHost);
```

### General grid/block specification





### **CUDA Variable Type Qualifiers** Lifetime Variable declaration Memory Scope int LocalVar; thread thread local int SharedVar; \_device\_\_ \_shared\_\_ shared block block \_device\_ int GlobalVar; global grid application \_\_constant\_\_ int ConstantVar; \_device\_\_ constant grid application \_device\_\_ is optional when used with \_\_local\_ \_\_shared\_\_, or \_\_constant\_ Automatic variables without any qualifier reside in a register Except arrays that reside in local memory - Thread-local memory and spilled automatic variables is allocated in global memory

### Synchronization

- void \_\_syncthreads();
- Synchronizes all threads in a block
- Once all threads have reached this point, execution resumes normally
- Used to avoid RAW / WAR / WAW hazards when accessing shared or global memory
- Allowed in conditional constructs only if the conditional is uniform across the entire thread block

### **GPU Atomic Integer Operations**

- · Atomic operations on integers in global and shared memory:
  - Associative operations on signed/unsigned ints
    - add, sub, min, max, ...
    - and, or, xor
    - increment, decrement
  - Exchange, compare and swap
- Requires hardware with compute capability 1.1 and above.

**Performance Programming** 

25

### Summary of C extensions

- global, device, shared, local, constant
- Keywords
  - threadidx, blockidx
- gridDim, blockDim Intrinsics
  - \_\_syncthreads
- Runtime API
  - Memory, symbol, execution management
- **Function launch**

```
__device__ float filter[N];
__global__ void convolve (float *image) {
  __shared__ float region[M];
  region[threadIdx] = image[i];
  __syncthreads()
  image[j] = result;
// Allocate GPU memory
void *myimage = cudaMalloc(bytes)
// 100 blocks, 10 threads per block convolve<<<100, 10>>> (myimage);
```

### Key ideas for performance

- Usual requirements
  - algorithm must have lot of parallelism
- Locality
  - use tiling to exploit shared-memory: copy blocks of data from global memory to shared-memory and operate on them
- New issue: exploiting SIMT
  - thread divergence: what happens when threads in a warp follow different control paths?
  - memory access optimization:
    - · global memory: memory coalescing
    - shared memory: avoiding bank conflicts



### Thread divergence

- Nothing in the programming model requires all threads within a warp to execute the same sequence of instructions!
- Conflict with SIMT: what happens if two threads within a warp want to take different execution paths?
  - common case: conditional branch which evaluates to different values in different threads
- Hardware: different execution paths are serialized
  - different control paths taken by the threads in a warp are traversed one at a time until they all converge again
- Thread divergence: performance problem at warp level

### Example with divergence: If (threadIdx.x > 2) { } This creates two different control paths for threads in a block Branch granularity Branch granularity Warp size; threads 0 and 1 follow different path than the rest of the threads in the first warp Example without divergence: If (threadIdx.x / WARP\_SIZE > 2) { } Also creates two different control paths for threads in a block Branch granularity is a whole multiple of warp size; all threads in any given warp follow the same path If possible, sort the work so that all the work done by a warp is similar (see example later)

Avoiding thread divergence

# Global memory (on-chip): warp can send 32 distinct addresses in parallel Global memory (off-chip): one memory address per transaction

### **Optimizing memory accesses**

- Optimizing accesses from a warp to shared and global memory is critical
- Different techniques needed
  - shared: avoid bank conflicts
  - global: memory coalescing
- Difference arises from architectural features
  - shared memory (on-chip): 32 different addresses can be presented in parallel from warp to memory
  - global memory (off-chip DRAM): only one address per memory bus transaction

### **Bank conflicts**

- In a parallel machine, many threads access memory
  - Therefore, memory is divided into banks
  - Essential to achieve high bandwidth
- Each bank can service one address per cycle
  - A memory can service as many simultaneous accesses as it has banks
- Multiple simultaneous accesses to a bank result in a bank conflict
  - Conflicting accesses are serialized

Bank 0
Bank 1
Bank 2
Bank 3
Bank 4
Bank 5
Bank 6
Bank 7

Bank 15

3.4

### Pank Addressing Examples No Bank Conflicts - Linear addressing stride == 1 Thread 0 Thread 1 Thread 2 Thread 3 Thread 3 Thread 4 Thread 5 Thread 6 Thread 6 Thread 7 Bank 1 Thread 6 Thread 7 Bank 1 Thread 6 Thread 7 Bank 1 Thread 6 Thread 6 Thread 7 Bank 2 Thread 3 Thread 4 Thread 5 Thread 6 Thread 6 Thread 7 Bank 1 Thread 1 Bank 2 Thread 3 Bank 3 Thread 4 Thread 5 Thread 6 Thread 7 Bank 1 Thread 1 Bank 1 Thread 5 Thread 6 Thread 7 Bank 1 Bank 2 Thread 3 Bank 3 Thread 4 Thread 5 Thread 6 Thread 6 Thread 7 Bank 1 Bank 1 Thread 1 Bank 1 Bank 2 Thread 3 Bank 3 Thread 4 Thread 5 Thread 6 Bank 1 Thread 1 Bank 1 Bank 1 Bank 1 Bank 1 Bank 2 Bank 3 Thread 4 Thread 5 Bank 1 Bank 2 Bank 3 Bank 3 Bank 4 Bank 1 Bank 2 Bank 1 Bank 1



### How addresses map to banks on G80

- Each bank has a bandwidth of 32 bits per clock cycle
- Successive 32-bit words are assigned to successive banks
- G80 has 16 banks
  - So bank = address % 16
  - Same as the size of a half-warp
    - No bank conflicts between different half-warps, only within a single half-warp

37

### Shared memory bank conflicts

- Shared memory is as fast as registers if there are no bank conflicts
- The fast case:
  - If all threads of a half-warp access different banks, there is no bank conflict
  - If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
- The slow case:
  - Bank Conflict: multiple threads in the same half-warp access the same bank
  - Must serialize the accesses
  - Cost = max # of simultaneous accesses to a single bank

38

### • Given: \_\_shared\_\_ float shared[256]; float foo = shared[baseIndex + s \* threadIdx.x]; • This is only bank-conflict-free if s shares no common factors with the number of banks - 16 on G80, so s must be odd



### **Coalescing: detail**

- Global memory requests from warp can be loads, stores or atomics
- Two or more threads in warp can specify the same address: what should happen?
- Load requests to same address: multicast
- Stores to same address: one thread will win
- Atomics to same address: serialize

### **CUDA** programming examples

### **Parallel Reduction**

- Given an array of values, "reduce" them to a single value in parallel
- Examples
  - Sum reduction: sum of all values in the array
  - Max reduction: maximum of all values in the array
- Typically parallel implementation:
  - Recursively halve # threads, add two values per thread
  - Takes log(n) steps for n elements, requires n/2 threads

43

### **A Vector Reduction Example**

- Assume an in-place reduction using shared memory
  - The original vector is in device global memory
  - The shared memory used to hold a partial sum vector
  - Each iteration brings the partial sum vector closer to the final sum
  - The final solution will be in element 0

### A simple implementation

• Assume we have already loaded array into

- \_\_shared\_\_ float partialSum[]
unsigned int t = threadIdx.x;
for (unsigned int stride = 1;
 stride < blockDim.x; stride \*= 2)</pre>

```
stride < blockDim.x; stride *= 2)
{
   __syncthreads();
   if (t % (2*stride) == 0)
      partialSum[t] += partialSum[t+stride];
}</pre>
```





### **Some Observations**

- In each iteration, two control flow paths will be sequentially traversed for each warp
  - Threads that perform addition and threads that do not
  - Threads that do not perform addition may cost extra cycles depending on the implementation of divergence
- No more than half of threads will be executing at any time
  - All odd index threads are disabled right from the beginning!
  - On average, less than ¼ of the threads will be activated for all warps over time
  - After the S<sup>th</sup> iteration, entire warps in each block will be disabled, poor resource utilization but no divergence
    - This can go on for a while, up to 4 more iterations (512/32=16= 2<sup>4</sup>), where each iteration only has one thread activated until all warps retire

### Shortcomings of the implementation

Assume we have already loaded array into
 \_\_shared\_\_ float partialSum[]

```
unsigned int t = threadIdx.x;
for (unsigned int stride = 1;
    stride < blockDim.x; str

{
    __syncthreads();
    if (t % (2*stride) == 0)
        partialSum[t] += partialSum[t+stride];
}</pre>
```

### A better implementation

Assume we have already loaded array into

```
- __shared__ float partialSum[]
unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x / 2;
    stride > 1; stride /= 2)
{
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t+stride];
}</pre>
```

# No Divergence until < 16 sub-sums Thread 0 0 1 2 3 ... 13 14 15 16 17 18 19 1 0-16 15+31 ... 15+31 ... 15 4 51

### Some Observations About the New Implementation

- Only the last 5 iterations will have divergence
- Entire warps will be shut down as iterations progress
  - For a 512-thread block, 4 iterations to shut down all but one warps in each block
  - Better resource utilization, will likely retire warps and thus blocks faster
- Recall, no bank conflicts either

### **Matrix Multiplication**

- A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs
  - Local, register and shared memory usage
  - Thread ID usage
  - Memory data transfer API between host and device
  - Assume square matrix for simplicity

53



## Memory Layout of a Matrix in C | Mago | Mag



### Step 2: Input Matrix Data Transfer (Host-side Code) void MatrixMulOnDevice(float\* M, float\* N, float\* P, int Width) { int size = Width \* Width \* sizeof(float); float\* Md, Nd, Pd; ... 1. // Allocate and Load M, N to device memory cudaMalloc(&Md, size); cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice); cudaMalloc(&Nd, size); cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice); // Allocate P on the device

cudaMalloc(&Pd, size);

# Step 3: Output Matrix Data Transfer (Host-side Code) 2. // Kernel invocation code – to be shown later ... 3. // Read P from the device cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost); // Free device matrices cudaFree(Md); cudaFree(Pd); }

### Step 4: Kernel Function // Matrix multiplication kernel – per thread code \_\_global\_\_ void MatrixMulKernel(float\* Md, float\* Nd, float\* Pd, int Width) { // Pvalue is used to store the element of the matrix // that is computed by the thread float Pvalue = 0;



# Step 5: Kernel Invocation (Host-side Code) // Setup the execution configuration dim3 dimGrid(1, 1); dim3 dimBlock(Width, Width); // Launch the device computation threads! MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);







### Revised Matrix Multiplication Kernel using Multiple Blocks

```
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
// Calculate the row index of the Pd element and M
int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
// Calculate the column idenx of Pd and N
int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

float Pvalue = 0;
// each thread computes one element of the block sub-matrix
for (int k = 0; k < Width; ++k)
    Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];

Pd[Row*Width+Col] = Pvalue;
}
```

### **G80 Block Granularity Considerations**

- For Matrix Multiplication using multiple blocks, should I use 8X8, 16X16 or 32X32 blocks?
  - For 8X8, we have 64 threads per Block. Since each SM can take up to 768 threads, there are 12 Blocks. However, because each SM can only take up to 8 Blocks, only 512 threads will go into each SM!
  - For 16X16, we have 256 threads per Block. Since each SM can take up to 768 threads, it can take up to 3 Blocks and achieve full capacity unless other resource considerations overrule.
  - For 32X32, we have 1024 threads per Block. Not even one can fit into an SMI.

66

### How about performance on G80? All threads access global memory for their input matrix elements Two memory accesses (8 bytes) per floating point multiply-ado 4B/s of memory bandwidth/FLOS 4B/s of memory bandwidth/FLOS 4B/s of memory bandwidth/FLOS A\*346.5 = 1386 GB/s required to achieve peak FLOP rating 86.4 GB/s limits the code at 21.6 GFLOPS The actual code runs at about 15 GFLOPS Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS Host Host Constant Memory Constant Memory

### Matrix Multiplication using Shared Memory

### 



### **G80 Shared Memory and Threading**

- Each SM in G80 has 16KB shared memory
  - SM size is implementation dependent!
  - For TILE\_WIDTH = 16, each thread block uses 2\*256\*4B = 2KB of shared memory.
  - Can potentially have up to 8 Thread Blocks actively executing
    - This allows up to 8\*512 = 4,096 pending loads. (2 per thread, 256 threads per block)
  - The next TILE\_WIDTH 32 would lead to 2\*32\*32\*4B= 8KB shared memory usage per thread block, allowing only up to two thread blocks active at the same time
- Using 16x16 tiling, we reduce the accesses to the global memory by a factor of 16
  - The 86.4B/s bandwidth can now support (86.4/4)\*16 = 347.6 GFLOPS!



### **GPU Applications**

- Graphics
- Regular algorithms
  - CUBLAS, CUFFT, ....
- Barnes-Hut
  - Burtscher et al: NVIDIA GPU Gem
- Graph algorithms
  - Nasre et al
- See NVIDIA website for longer list

### **Summary**

- Key features of GPUs
  - lots of threads: lightweight startup/shutdown
    - algorithms: should have lots of parallelism
  - warp-based execution: SIMT
    - not part of programming model
  - latency-tolerance: run lots of warps simultaneously to mask memory latency
  - exposed memory hierarchy managed by software
    - programming model: thread blocks and explicit data movement
    - performance:
      - memory coalescing: global memory
         bank conflicts: shared-memory