CS 378: Programming for Performance

Assignment 2: Cache optimizations

Due date: March 4th, 2008

You can do this assignment alone or with someone else from class.
Each group can have a maximum of two students.
Each group should turn in one submission.

1. Miss ratio measurement (20 points)

Consider the six permutations of the three loops in the matrix multiplication pseudocode shown below:

for I = 1, N
  for J = 1, N
    for K = 1, N
      C[I,J] = C[I,J] + A[I,K]*B[K,J]

(a) Write C code for implementing these six versions of matrix multiplication. All three arrays should contain floats. Use the code shown in the implementation notes below to allocate storage for the arrays.
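
For concreteness, here is a minimal sketch of one permutation (the IJK order) in C, assuming the arrays were allocated with the Allocate2DArray_Offloat routine from the implementation notes; the other five versions differ only in the order of the three loops.

/* IJK permutation of MMM; A, B, C are NxN arrays allocated with
   Allocate2DArray_Offloat. The loop body accumulates into C,
   matching the pseudocode above, so C should be initialized first. */
void mmm_ijk(int N, float **A, float **B, float **C)
{
    int i, j, k;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
}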

(b) Using mambo, determine the miss ratios of your programs for different problem sizes. Make sure you read the measurement protocol given at the end of this assignment before doing your experiments. Warning: for some of these permutations, each matrix multiplication can take an hour or more.

(c) Plot the miss ratio as a function of N for all six permutations.

2. Computing miss ratios (20 points)

Consider a two-level memory hierarchy consisting of a cache and main memory. Assume that the cache has capacity C, and that it is fully associative, so there are no conflict misses. Assume that arrays are stored in row-major order in memory. At the start of program execution, all array elements are stored in memory, and are brought into the cache as needed when a cache miss occurs. Each cache line holds b array elements. Finally, assume that the code was compiled with a simple compiler that does not register-allocate array elements, so the processor always issues loads and stores whenever it needs to read and write array locations.

(a) Write down the cache miss ratio as a function of N when the problem size is relatively small compared to the cache (so there are no capacity misses). At what value of N do you expect to start seeing capacity misses? Explain briefly why your answers are independent of which permutation of MMM is used. Are your answers consistent with your experimental results?

(b) As explained in class, a program that accesses mostly consecutive elements of an array (in the order in which they are stored in memory) is said to use unit-stride access to that array. Explain briefly why unit-stride access is good for the cache performance of a program.

(c) In the same spirit, let us say that a program uses zero-stride access to the elements of an array if it accesses each element of that array a large number of times consecutively before accessing a different array element. Explain briefly why zero-stride access is good for the cache performance of a program.

(d) Consider each of the six permutations of this loop nest. Considering only the innermost loop in each case, write down which arrays are accessed with (i) zero-stride, (ii) unit stride, and (iii) neither zero-stride nor unit-stride. Based on this analysis, which permutation(s) will have the smallest cache miss ratio for large values of the problem size N? Which permutation(s) will have the largest cache miss ratio? Do your answers agree with your experimental results?

(e) Here is a simple heuristic for computing the miss ratio for different permutations of MMM in the large problem size domain: consider only the innermost loop, and assume that a zero-stride access never misses, a unit-stride access misses once every b accesses, and an access that is neither zero-stride nor unit-stride misses on every access.

Explain briefly why this heuristic is reasonable, and use it to calculate the asymptotic miss ratios for the different versions of MMM. Do your answers agree with your experimental results?
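
For example, applying the heuristic to the IJK permutation: in the innermost (K) loop, the two accesses to C[I,J] (one load, one store) are zero-stride, the access to A[I,K] is unit-stride, and the access to B[K,J] has stride N. Each iteration therefore makes 4 memory accesses and suffers roughly 0 + 1/b + 1 misses, giving an asymptotic miss ratio of (1 + 1/b)/4.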

3. Optimizing MMM for memory hierarchies (60 points)

In class, we described the structure of the optimized MMM code produced by ATLAS. The "computational heart" of this code is the mini-kernel that multiplies an NBxNB block of matrix A with an NBxNB block of matrix B into an NBxNB block of matrix C, where NB is chosen so that the working set of this computation fits in the L1 cache. The mini-kernel itself is performed by repeatedly calling a micro-kernel that multiplies an MUx1 column vector of matrix A with a 1xNU row vector of matrix B into an MUxNU block of matrix C. The values of MU and NU are chosen so that the micro-kernel can be performed out of the registers. Pseudocode for the mini-kernel is shown below (note that this code assumes that NB is a multiple of MU and NU).

//mini-kernel
for (int j = 0; j < NB; j += NU)
  for (int i = 0; i < NB; i += MU)
    load C[i..i+MU-1, j..j+NU-1] into registers
    for (int k = 0; k < NB; k++)
       //micro-kernel
       load A[i..i+MU-1,k] into registers
       load B[k,j..j+NU-1] into registers
       multiply A's and B's and add to C's
    store C[i..i+MU-1, j..j+NU-1]

(a) (10 points) Write a function that takes three NxN matrices as input and uses the straightforward three-nested-loop ikj version of MMM to perform matrix multiplication. Use the allocation routine given in the implementation notes to allocate storage for the three arrays. Run this code on mambo to multiply matrices of size 512x512 (for this size, the working set will not fit in the L2 cache), and estimate the MFlops you obtain. What fraction of peak performance do you get? Note: this may require many hours of simulation.
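
Recall that multiplying two NxN matrices performs 2*N^3 floating-point operations (N^3 multiplies and N^3 additions), so MFlops = 2*N^3 / (execution time in microseconds); with mambo's cycle counts, the execution time follows from the simulated clock frequency.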

(b) (25 points) To measure the impact of register-blocking without cache-blocking, implement register-blocking by writing a function for performing MMM, using the mini-kernel code with NB = N (you should verify that this implements MMM correctly). In our code, we used MU = NU = 4 since the processor has 32 floating-point registers. Run your code on mambo to multiply 32x32 matrices (these are small enough that all three matrices will fit in the L1 data cache, barring conflict misses). If you use different values of MU and NU than we did, use a value of N that is a multiple of your values for MU and NU; otherwise, you will have to write some clean-up code to handle leftover pieces of the matrices. Estimate the MFlops and fraction of peak performance you get (in our experiments, we got close to 90% of peak). Since the processor issues instructions in-order, you may be able to improve performance by hand-scheduling the statements in the loop body.
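
As a rough guide, here is a minimal sketch of the register-blocked code, written with MU = NU = 2 to keep it short (the 4x4 version has the same structure, with sixteen C scalars); it assumes N is a multiple of 2 and relies on the compiler keeping the scalar temporaries in floating-point registers.

/* Register-blocked MMM with NB = N and MU = NU = 2. */
void mmm_regblock(int N, float **A, float **B, float **C)
{
    int i, j, k;
    for (j = 0; j < N; j += 2)
        for (i = 0; i < N; i += 2)
        {
            /* load the 2x2 block of C into scalars (registers) */
            float c00 = C[i][j],   c01 = C[i][j+1];
            float c10 = C[i+1][j], c11 = C[i+1][j+1];
            for (k = 0; k < N; k++)
            {
                /* micro-kernel: 2x1 column of A times 1x2 row of B */
                float a0 = A[i][k],  a1 = A[i+1][k];
                float b0 = B[k][j],  b1 = B[k][j+1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            /* store the 2x2 block of C back */
            C[i][j]   = c00;  C[i][j+1]   = c01;
            C[i+1][j] = c10;  C[i+1][j+1] = c11;
        }
}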

Measure the MFlops you obtain for a range of matrix sizes, and plot a graph of MFlops vs. matrix size (make sure you only use matrices whose size is a multiple of MU and NU). Explain your results briefly.

(c) (25 points) Repeat (b), but this time, implement both register-blocking and L1 cache-blocking. You will have to wrap three loops around the mini-kernel to get a full MMM, as sketched below. In our experiments, we used NB = 88 (which is divisible by MU = NU = 4). Measure the MFlops you obtain for a range of matrix sizes, and plot a graph of MFlops vs. matrix size (make sure you only use matrices whose size is a multiple of your choice of NB). Explain your results briefly.
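
A minimal sketch of these outer loops, assuming N is a multiple of NB and a hypothetical helper mini_kernel(NB, ii, jj, kk, A, B, C) that runs the register-blocked mini-kernel of part (b) on the NBxNB blocks starting at row/column offsets ii, jj, kk:

/* L1 cache blocking: walk over the matrices block by block and
   invoke the mini-kernel on each block triple. mini_kernel is a
   hypothetical helper implementing the pseudocode shown earlier. */
void mmm_blocked(int N, int NB, float **A, float **B, float **C)
{
    int ii, jj, kk;
    for (jj = 0; jj < N; jj += NB)
        for (ii = 0; ii < N; ii += NB)
            for (kk = 0; kk < N; kk += NB)
                /* C[ii.., jj..] += A[ii.., kk..] * B[kk.., jj..] */
                mini_kernel(NB, ii, jj, kk, A, B, C);
}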

Implementation notes:

0) Make sure you run mambo in "cycle mode". Otherwise you will not get accurate cycle counts.

1) In the C programming language, a 2-dimensional array of floats is usually implemented as a 1-dimensional array of 1-dimensional arrays of floats. For our computations, it is better to allocate one big contiguous block of storage to hold all the floats, and then create a 1-dimensional row vector that holds pointers to the start of each row. You can use the following code for this purpose: it takes the number of rows (x) and columns (y) as parameters, and it returns a pointer to the row vector.

#include <stdlib.h>

float **Allocate2DArray_Offloat(int x, int y)
{
    int TypeSize = sizeof(float);
    float **ppi = malloc(x * sizeof(float *));      /* row vector */
    float *pool = malloc(x * y * TypeSize);         /* contiguous storage */
    unsigned char *curPtr = (unsigned char *) pool;
    int i;
    if (!ppi || !pool)
    {   /* Quit if either allocation failed */
        if (ppi) free(ppi);
        if (pool) free(pool);
        return NULL;
    }
    /* Create row vector: point each entry at the start of a row */
    for (i = 0; i < x; i++)
    {
        *(ppi + i) = (float *) curPtr;
        curPtr += y * TypeSize;
    }
    return ppi;
}
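
For example, a matrix allocated this way is indexed as A[i][j]; since the contiguous pool is reachable through the first row pointer, it can be released as follows:

float **A = Allocate2DArray_Offloat(N, N);
if (A)
{
    A[0][0] = 1.0f;   /* element access */
    free(A[0]);       /* free the contiguous pool of floats */
    free(A);          /* free the row vector */
}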

2) Cache miss measurement protocol: Since mambo gives you the total number of cache misses that occur during the execution of a program, the experiments in Problem 1 are relatively straightforward to perform. However, there is a common mistake that you should watch out for. Your code probably looks something like this:

main(..)
 {
 1) allocate arrays A,B,C
 2) initialize arrays A,B,C
 3) perform MMM
 }

The natural thing to do is to measure the total number of cache misses that this program suffers with and without step 3, and attribute the difference to the MMM computation. However, notice that both steps 1 and 2 touch the three arrays, so elements of these arrays may already be in the L1 and L2 caches when step 3 begins. In the extreme case when all three arrays are small enough to fit in the L1 cache, step 3 may appear to suffer no cache misses at all! If we want to measure cache misses for MMM starting with all the arrays in memory, we must "flush" all the caches before we begin the MMM. This can be accomplished by allocating a large 1-dimensional array (distinct from A,B,C), and walking over that array once before starting step 3. Because of LRU replacement, this ensures that all elements of A,B,C are flushed from the cache before MMM starts. The size of this array must be larger than the capacity of the L2 cache, so for our processor, you can use a 128K array of floats.
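
Here is a minimal sketch of the flush step, assuming Flush was allocated to hold 128*1024 floats; the volatile accumulator keeps an optimizing compiler from deleting the walk:

/* Walk the flush array once. With LRU replacement this evicts
   all elements of A, B, and C from the caches before MMM starts. */
volatile float sink = 0.0f;
int i;
for (i = 0; i < 128 * 1024; i++)
    sink += Flush[i];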

Your code therefore should look like the following:

main(..)
 {
 0) allocate 1-d array Flush
 1) allocate 2-d arrays A,B,C
 2) initialize 2-d arrays A,B,C
 3) touch all elements of Flush
 4) perform MMM
 }

Now, the difference in the number of cache misses with and without step 4 gives you the number of cache misses suffered by MMM when all three arrays are in memory to begin with.

3) Paper on the Cell engine (contains a description of the Power processor)

Here are some relevant facts about the Power processor simulated in mambo: