CS 378: Programming for Performance

Assignment 2: Cache Measurement

Due date: October 1

You can do this assignment alone or with someone else from class. Each group can have a maximum of two students. Each group should turn in one submission.

1. Miss ratio measurement in simulator (50 points)

Consider the different permutations of the matrix multiplication pseudocode shown below:

for I = 1, N
  for J = 1, N
    for K = 1, N
      C[I,J] = C[I,J] + A[I,K]*B[K,J]
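
As a point of reference, two of the six loop orders might look like the following in C++. This is only a minimal sketch: it assumes 0-based indices and square N x N matrices stored row-major in flat arrays, and the names Matrix, matmul_ijk and matmul_ikj are illustrative, not something the assignment prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: an N x N matrix is stored row-major in a flat vector,
// so element (i, j) lives at index i * n + j.
using Matrix = std::vector<double>;

// I-J-K order: the loop nest from the pseudocode, with 0-based indices.
void matmul_ijk(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

// I-K-J order: the same statement with the two inner loops interchanged.
void matmul_ikj(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

Every permutation computes the same product; only the order in which A, B and C are touched changes, and that order is what the miss ratio depends on.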

2. Miss ratio measurement on real hardware (50 points)

In part 1, you wrote six variations of matrix-matrix multiply, generated address traces, and ran those traces through a cache simulator to plot miss ratios as a function of N. Your results from that part should validate the model presented in class. All models are approximations, however. In this part, you will perform the same experiment on real hardware. Real hardware has more going on than either the model or the cache simulator accounts for, so we expect some divergence from the model-predicted behaviour. The implementation notes give code for using PAPI and some critical information for collecting valid numbers.

Implementation notes:


You only need to model the L1 data cache in Dinero. We don't care about the instruction cache and it would be a lot more work to generate an address trace for that cache anyway.

Don't forget that *C += ... is a read of C followed by a write of C, so it contributes two accesses to the trace.

For both parts, run the experiments with N = 1 .. 512.

The L1 cache on lonestar is not 12 MB; /proc/cpuinfo shows only the L3 cache. That file has other information that you can use to find the L1 cache parameters, and there are other ways to find them.
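
One of those other ways, on Linux systems that expose it, is the sysfs cache directory under /sys/devices/system/cpu/cpu0/cache/. The sketch below reads one field from it and converts a size string such as "32K" to bytes. Whether lonestar exposes this directory is an assumption, and the helper names parse_cache_size and read_cache_field are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Convert a sysfs size string such as "32K" or "12M" to bytes.
// Returns 0 if the string does not start with a number.
unsigned long parse_cache_size(const std::string& text) {
    std::size_t pos = 0;
    unsigned long value = 0;
    try {
        value = std::stoul(text, &pos);
    } catch (...) {
        return 0;
    }
    if (pos < text.size()) {
        if (text[pos] == 'K') value *= 1024UL;
        else if (text[pos] == 'M') value *= 1024UL * 1024UL;
    }
    return value;
}

// Read the first line of one field (e.g. "level", "type", "size") for
// cache index `index` of CPU 0. Returns an empty string if the file is
// missing, which is the case on systems without this sysfs layout.
std::string read_cache_field(int index, const std::string& field) {
    assert(index >= 0);
    std::ifstream in("/sys/devices/system/cpu/cpu0/cache/index" +
                     std::to_string(index) + "/" + field);
    std::string line;
    if (in) std::getline(in, line);
    return line;
}
```

The L1 data cache is the index whose level field reads 1 and whose type field reads Data; its size, coherency_line_size and ways_of_associativity fields give the parameters the simulator needs.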

There is no requirement that you wrap the matrix in a class. The amount of syntactic sugar you want to apply is up to you, as long as it doesn't negatively impact performance (say, by requiring more pointer dereferences than necessary). The memory representation is up to you as long as it is sensible for a dense matrix and is row-major. One can think of two representations (with a couple of variations) that are sensible under these constraints, and they should both give you about the same answer.
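
One plausible reading of the two representations is (a) a single contiguous block indexed as i*N + j and (b) an array of row pointers into such a block. That reading is an interpretation, not something the assignment states, and the type names FlatMatrix and RowPtrMatrix below are hypothetical. A minimal sketch of both:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// (a) Flat: one contiguous block, element (i, j) at data[i * n + j].
struct FlatMatrix {
    std::size_t n;
    std::vector<double> data;
    explicit FlatMatrix(std::size_t size) : n(size), data(size * size, 0.0) {}
    double& at(std::size_t i, std::size_t j) {
        assert(i < n && j < n);
        return data[i * n + j];
    }
};

// (b) Row pointers: rows[i] points at the start of row i inside one
// contiguous block, so element (i, j) is rows[i][j].
struct RowPtrMatrix {
    std::size_t n;
    std::vector<double> data;
    std::vector<double*> rows;
    explicit RowPtrMatrix(std::size_t size)
        : n(size), data(size * size, 0.0), rows(size) {
        for (std::size_t i = 0; i < n; ++i) rows[i] = data.data() + i * n;
    }
    // Copying would leave rows pointing into the original's storage.
    RowPtrMatrix(const RowPtrMatrix&) = delete;
    RowPtrMatrix& operator=(const RowPtrMatrix&) = delete;
    double& at(std::size_t i, std::size_t j) {
        assert(i < n && j < n);
        return rows[i][j];
    }
};
```

Both layouts place consecutive rows N elements apart in memory, so the data addresses they generate follow the same pattern. Variant (b) adds one load of the row pointer per access unless the compiler hoists it out of the inner loop.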