CS 377P: Programming for Performance

Assignment 6: Matrix Multiplication with CUDA

Due date: April 27th, 2026, 9:00 PM

The goal of this assignment is to implement several CUDA kernels for N × N matrix multiplication and to use NVIDIA Nsight Compute to understand GPU performance behavior. You will write a different kernel for each task and compare the execution times of the different matrix multiplication implementations. We provide the skeleton code matmul_sample.cu here. You will use a TACC Lonestar GPU node; the access guidelines are here.

Use several matrix sizes for evaluation, including at least one large size of up to 10,000 × 10,000. You do not need to present results for every matrix size you test; a few representative small sizes and one large size are sufficient.

Task 1

Start by running matrix multiplication with a single thread. To launch the kernel with only one thread, use the launch parameters <<<1, 1>>>. Measure the execution time of this single-thread kernel. Limit the matrix size to at most 1K × 1K, since single-thread execution is very slow. The skeleton code shows how to measure the execution time of CUDA programs.
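A single-thread version can be sketched as below. This is only an illustration, assuming row-major float matrices; the names d_A, d_B, d_C, and n are placeholders, so adapt them to the variables in the provided skeleton.

```cuda
// Sketch of a single-thread kernel: the one launched thread
// computes the entire n x n product (row-major layout assumed).
__global__ void matmul_single_thread(const float *A, const float *B,
                                     float *C, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

// Launch with one block containing one thread:
// matmul_single_thread<<<1, 1>>>(d_A, d_B, d_C, n);
```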

Task 2

Write a kernel in which each thread computes one element of the output matrix. We call this version naive matrix multiplication because it includes no memory optimizations. Launch the kernel with parameters <<<num_blocks, num_threads>>>. Measure the execution time of this kernel.
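One way to structure the naive kernel is sketched below: each thread derives its output row and column from its block and thread indices. The 16 × 16 block shape in the launch comment is illustrative, not required, and the variable names are again placeholders.

```cuda
// Sketch of the naive kernel: each thread computes one element C[row][col].
__global__ void matmul_naive(const float *A, const float *B,
                             float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {  // guard threads past the matrix edge
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

// A possible launch configuration:
// dim3 threads(16, 16);
// dim3 blocks((n + threads.x - 1) / threads.x,
//             (n + threads.y - 1) / threads.y);
// matmul_naive<<<blocks, threads>>>(d_A, d_B, d_C, n);
```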

Task 3

Implement tiled matrix multiplication using shared memory. Refer to slide 52 of the course slides. Run the tiled kernel with several tile sizes, such as 8, 16, and 32, and measure the execution time of each version separately. Draw a graph comparing the performance of the different tile sizes, and explain the observed behavior. Your plot should have a separate line for each matrix size. What is the best speedup you obtain for the 10K × 10K matrix?
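A common shape for the tiled kernel is sketched below, assuming a square tile whose side equals the block dimension; follow the version on the course slides where they differ. The TILE constant and variable names are illustrative.

```cuda
#define TILE 16  // tile side length; try 8, 16, and 32 as the task asks

// Sketch of tiled matrix multiplication: each block stages one tile of A
// and one tile of B in shared memory, then all threads reuse them.
__global__ void matmul_tiled(const float *A, const float *B,
                             float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    // Sweep over tiles along the shared (k) dimension.
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Pad out-of-range loads with zero so edge tiles stay correct.
        As[threadIdx.y][threadIdx.x] =
            (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // tile fully loaded before any thread reads it

        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // all reads done before the next tile overwrites
    }

    if (row < n && col < n)
        C[row * n + col] = sum;
}
```

Launch this with dim3 blocks of TILE × TILE threads, analogous to the naive version; to compare tile sizes, rebuild (or template the kernel on TILE) for each size.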

Task 4

Use the profiling tool NVIDIA Nsight Compute (NCU) to study the performance behavior of your matrix multiplication kernels. Use the following command to profile your program:

module load cuda
ncu --section SpeedOfLight ./program

This command collects a summary of compute-related and memory-related throughput. The Nsight Compute CLI supports profiling from the command line using sections such as SpeedOfLight.

Report the following metrics:

- Compute (SM) Throughput [%]
- Memory Throughput [%]

Compare these metrics across:

- the naive kernel (Task 2)
- the tiled kernels with different tile sizes (Task 3)

Explain why memory throughput and SM throughput change as tile size increases. Does Memory [%] continue to increase as tile size grows? If not, explain why.

You may also consult the official Nsight Compute documentation and explore additional metrics, such as shared-memory behavior, L2 cache behavior, and other relevant profiling results. Nsight Compute's profiling documentation includes a memory workload analysis section and other detailed sections for this purpose.
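For the deeper analysis, additional section names can be passed to ncu. The invocations below are illustrative sketches; ./program stands in for your own binary, and you should check `ncu --list-sections` on the machine for the sections available in your Nsight Compute version.

```shell
# Memory workload analysis: shared-memory and DRAM traffic breakdown.
ncu --section MemoryWorkloadAnalysis ./program

# Launch statistics: grid/block configuration and occupancy limits.
ncu --section LaunchStats ./program

# Collect a full report to a file, viewable later in the Nsight Compute GUI.
ncu --set full -o profile ./program
```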

Report

Write a report (PDF) summarizing your implementation and the conclusions from your experiments. Submit both your code and your report on Canvas. Please state how many hours you spent on this assignment.