CS 377P: Programming for Performance
Assignment 3: Cache-optimized MMM
Due: March 8, 2018, 11:59 PM
You can work in groups of two. Each group should do the assignment independently.
Measure your performance on the orcrists.
Late policy: Submissions can be at most 1 day late with 10% penalty.
In class, we described the structure of the optimized MMM code
produced by ATLAS. The "computational heart" of this code is the mini-kernel that multiplies an
NBxNB block of matrix A with an NBxNB block of matrix B into an
NBxNB block of matrix C, where NB is chosen so that the working set
of this computation fits in the L1 cache. The mini-kernel itself is
performed by repeatedly calling a micro-kernel
that multiplies an MUx1 column vector of matrix A with a 1xNU
row vector of matrix B into an MUxNU block of matrix C. The values
of MU and NU are chosen so that the micro-kernel can be performed
entirely out of registers. Pseudocode for the mini-kernel is shown below
(note that this code assumes that NB is a multiple of MU and NU).
//mini-kernel
for (int j = 0; j < NB; j += NU)
    for (int i = 0; i < NB; i += MU) {
        load C[i..i+MU-1, j..j+NU-1] into registers
        for (int k = 0; k < NB; k++) {
            //micro-kernel
            load A[i..i+MU-1, k] into registers
            load B[k, j..j+NU-1] into registers
            multiply A's and B's and add to C's
        }
        store C[i..i+MU-1, j..j+NU-1]
    }
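For concreteness, here is a minimal C sketch of this mini-kernel for the illustrative choice MU = 2, NU = 2 (these values are for exposition only; parts (b) and (c) ask you to choose your own). The local scalar variables play the role of registers, and the matrices use the row-pointer representation from the implementation notes.

/* Illustrative mini-kernel sketch with MU = NU = 2; assumes NB is a
   multiple of 2. Local scalars stand in for registers. */
void mini_kernel(int NB, double **A, double **B, double **C)
{
    for (int j = 0; j < NB; j += 2)
        for (int i = 0; i < NB; i += 2) {
            /* load the 2x2 block of C into "registers" */
            double c00 = C[i][j],   c01 = C[i][j+1];
            double c10 = C[i+1][j], c11 = C[i+1][j+1];
            for (int k = 0; k < NB; k++) {
                /* micro-kernel: 2x1 column of A times 1x2 row of B */
                double a0 = A[i][k], a1 = A[i+1][k];
                double b0 = B[k][j], b1 = B[k][j+1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            /* store the 2x2 block of C back */
            C[i][j]   = c00;  C[i][j+1]   = c01;
            C[i+1][j] = c10;  C[i+1][j+1] = c11;
        }
}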
(a) (10 points) Write a function that takes three NxN matrices as input and uses the straightforward ikj 3-nested-loop version of MMM to perform matrix multiplication. Use the allocation routine given in the implementation notes to allocate storage for the three arrays. Run this code to multiply matrices of various sizes (we recommend not going above 200x200), and draw a graph of GFlops vs. matrix size. Using PAPI counters, measure the L1 data/inst-cache misses for the largest matrix size. What fraction of peak performance do you get for large matrices? Explain your results briefly.
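For reference, the ikj loop nest might look like the sketch below. Recall that one NxN MMM performs 2N^3 floating-point operations, so GFlops is 2N^3 divided by the running time in nanoseconds. The PAPI calls shown are from the classic high-level counter interface (PAPI_start_counters/PAPI_stop_counters); if the PAPI version installed on the orcrists differs, substitute the equivalent calls.

#include <stdio.h>
#include <papi.h>

/* Straightforward ikj triple loop: C += A * B, all matrices N x N. */
void mmm_ikj(int N, double **A, double **B, double **C)
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double a = A[i][k];              /* invariant in the j loop */
            for (int j = 0; j < N; j++)
                C[i][j] += a * B[k][j];
        }
}

/* One possible measurement wrapper (assumes the classic PAPI
   high-level counter interface; adapt to your installed version). */
void measure(int N, double **A, double **B, double **C)
{
    int events[2] = { PAPI_L1_DCM, PAPI_L1_ICM };
    long long counts[2];
    PAPI_start_counters(events, 2);
    mmm_ikj(N, A, B, C);
    PAPI_stop_counters(counts, 2);
    printf("L1 D-cache misses: %lld, L1 I-cache misses: %lld\n",
           counts[0], counts[1]);
}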
(b) (20 points) To measure the impact of register-blocking without cache-blocking, implement register-blocking by writing a function that performs MMM using the mini-kernel code with NB = N (you should verify that this implements MMM correctly). You can use the results in the Yotov et al. study of the ATLAS system to determine good values for MU and NU, or you can experiment to find good values yourself. In this part, do not use vector operations in your code. Run the code to multiply matrices of various sizes, and draw a graph of GFlops vs. matrix size (the graph can be superimposed on the graph from part (a)). Using PAPI counters, measure the L1 data/inst-cache misses for the largest matrix size. Explain your results briefly. Note for experiments: use values of N that are multiples of your values for MU and NU; otherwise, you will have to write some clean-up code to handle leftover pieces of the matrices.
(c) (20 points) Repeat part (b), but this time use vector intrinsics to implement the MMM code. Measure the GFlops you obtain for a range of matrix sizes, and plot a graph of GFlops vs. matrix size (as before, superimpose this graph on the graphs from (a) and (b)). Using PAPI counters, measure the L1 data/inst-cache misses for the largest matrix size. Explain your results briefly.
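One possible shape for the vectorized code is sketched below, using AVX2/FMA intrinsics with the illustrative values MU = 2, NU = 4 and vectorizing along a (contiguous, row-major) row of C. This assumes the orcrists support AVX2 and FMA (compile with -mavx2 -mfma under gcc); your own MU/NU choices may differ. Without FMA, replace _mm256_fmadd_pd(a, b, c) with _mm256_add_pd(_mm256_mul_pd(a, b), c).

#include <immintrin.h>

/* Vectorized mini-kernel sketch with MU = 2, NU = 4 (illustrative values);
   one __m256d holds 4 contiguous doubles of a row of C or B. */
void mini_kernel_avx(int NB, double **A, double **B, double **C)
{
    for (int j = 0; j < NB; j += 4)
        for (int i = 0; i < NB; i += 2) {
            __m256d c0 = _mm256_loadu_pd(&C[i][j]);      /* C[i, j..j+3]   */
            __m256d c1 = _mm256_loadu_pd(&C[i+1][j]);    /* C[i+1, j..j+3] */
            for (int k = 0; k < NB; k++) {
                __m256d b  = _mm256_loadu_pd(&B[k][j]);  /* B[k, j..j+3]   */
                __m256d a0 = _mm256_broadcast_sd(&A[i][k]);
                __m256d a1 = _mm256_broadcast_sd(&A[i+1][k]);
                c0 = _mm256_fmadd_pd(a0, b, c0);
                c1 = _mm256_fmadd_pd(a1, b, c1);
            }
            _mm256_storeu_pd(&C[i][j], c0);
            _mm256_storeu_pd(&C[i+1][j], c1);
        }
}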
(d) (30 points) Repeat (c), but this time implement both register-blocking and L1 cache-blocking. You will have to wrap three loops around the mini-kernel to get a full MMM. Use any method you wish to determine a good value for NB. Measure the GFlops you obtain for a range of matrix sizes, and plot a graph of GFlops vs. matrix size (as before, superimpose this graph on the graphs from the previous parts). Using PAPI counters, measure the L1 data/inst-cache misses for the largest matrix size. Make sure you only use matrices whose size is a multiple of your choice of NB so you do not need to write clean-up code to handle fractional tiles. Explain your results briefly.
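The outer structure might look like the sketch below, where mini_kernel_block is a hypothetical variant of your mini-kernel from parts (b)/(c) that takes the block offsets as extra parameters; the sketch assumes N is a multiple of NB.

/* Sketch: L1 cache blocking around the mini-kernel; assumes N is a
   multiple of NB. mini_kernel_block (hypothetical) multiplies the
   NB x NB block of A at (ii, kk) by the block of B at (kk, jj) and
   accumulates into the block of C at (ii, jj). */
void mmm_blocked(int N, int NB, double **A, double **B, double **C)
{
    for (int jj = 0; jj < N; jj += NB)
        for (int ii = 0; ii < N; ii += NB)
            for (int kk = 0; kk < N; kk += NB)
                mini_kernel_block(NB, A, B, C, ii, jj, kk);
}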
(e) (20 points) You can improve the performance of your kernel by copying blocks of data into contiguous storage, as explained in the Yotov et al. paper. Implement data copying and produce the performance graph for this optimized code, superimposing it on the graphs from all the previous parts. Using PAPI counters, measure the L1 data/inst-cache misses for the largest matrix size.
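A minimal sketch of the copy step is shown below: each NB x NB block is copied once into a contiguous row-major buffer, an O(NB^2) cost that is amortized over the O(NB^3) arithmetic the mini-kernel performs on that block.

/* Sketch: copy the NB x NB block of src whose top-left corner is
   (r0, c0) into a contiguous row-major buffer buf (NB*NB doubles),
   so the mini-kernel can walk it with unit stride. */
void copy_block(int NB, double **src, int r0, int c0, double *buf)
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            buf[i * NB + j] = src[r0 + i][c0 + j];
}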
Submission
Submit a .tar.gz file containing your code for all subproblems, the write-up, and the performance graphs. Include a README.txt that explains how to run your program and what the outputs will be.
You can submit a single graph with all the plots.
Implementation notes:
1) In the C programming language, a 2-dimensional array of doubles is usually implemented as a 1-dimensional array of 1-dimensional arrays of doubles. For our computations, it is better to allocate one big contiguous block of storage to hold all the doubles, and then create a 1-dimensional row vector that holds pointers to the start of each row. You can use the following code for this purpose: it takes the number of rows (x) and columns (y) as parameters, and it returns a pointer to the row vector.
#include <stdlib.h>   /* malloc, free */

double **Allocate2DArray_Ofdouble(int x, int y)
{
    int TypeSize = sizeof(double);
    double **ppi = malloc(x * sizeof(double *));
    double *pool = malloc(x * y * TypeSize);
    unsigned char *curPtr = (unsigned char *) pool;
    int i;
    if (!ppi || !pool)
    {
        /* Quit if either allocation failed */
        if (ppi) free(ppi);
        if (pool) free(pool);
        return NULL;
    }
    /* Create row vector */
    for (i = 0; i < x; i++)
    {
        *(ppi + i) = (double *) curPtr;
        curPtr += y * TypeSize;
    }
    return ppi;
}
TA comments:
For an N*M matrix, I think "auto mat = new double[N*M]", "delete[] mat", and "mat[i * M + j]" are actually the normal way to go (or the malloc/free equivalents if you're using C).
Don't worry about the multiplication overhead, and don't bother with pointer arithmetic for "speedup": modern compilers can automatically detect patterns like "[i * M + j]" and replace them with something faster.
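In C, the TA's suggestion amounts to something like the following flat, row-major allocation (a sketch; the helper name alloc_matrix is just for illustration):

#include <stdlib.h>

/* Flat row-major allocation: element (i, j) of an n x m matrix
   lives at mat[i * m + j]. Release the storage with free(mat). */
double *alloc_matrix(int n, int m)
{
    return malloc((size_t)n * m * sizeof(double));
}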
2) Here is a nice paper on coding guidelines for writing "portable assembly language programs" in the C language.
3) You can also read an article about the GotoBLAS by Goto and van de Geijn.