You may work in groups of at most 2 students. Each group must do the assignment independently.

// mini-kernel
for (int j = 0; j < NB; j += NU)
  for (int i = 0; i < NB; i += MU)
    load C[i..i+MU-1, j..j+NU-1] into registers
    for (int k = 0; k < NB; k++)
      // micro-kernel
      load A[i..i+MU-1, k] into registers
      load B[k, j..j+NU-1] into registers
      multiply A's and B's and add to C's
    store C[i..i+MU-1, j..j+NU-1]

(a) (10 points) Write a function that takes three NxN matrices as input and uses the straightforward ikj 3-nested loop version of MMM to perform matrix multiplication. Use the allocation routine given in the implementation notes to allocate storage for the three arrays. Run this code to multiply matrices of various sizes (we recommend not going above 200x200), and draw a graph of GFlops vs. matrix size. Using PAPI counters, measure the L1 data/inst-cache misses for the largest matrix size. What fraction of peak performance do you get for large matrices? Explain your results briefly.
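The ikj loop order for part (a) might be sketched as follows. This is a minimal illustration of the multiply itself; the timing harness and PAPI instrumentation are left out, and the matrices are assumed to come from the allocation routine described in the implementation notes.

```c
/* Straightforward ikj matrix multiply: C += A * B, all N x N.
   Hoisting A[i][k] out of the inner loop is the only tweak here;
   everything else is the plain 3-nested loop. */
void mmm_ikj(int N, float **A, float **B, float **C) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            float a = A[i][k];          /* invariant across the j loop */
            for (int j = 0; j < N; j++)
                C[i][j] += a * B[k][j];
        }
}
```

For the graph, one matrix multiply performs 2·N^3 floating-point operations, so GFlops = 2·N^3 / (elapsed seconds × 10^9).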

(b) (20 points) To measure the impact of register-blocking without cache-blocking, implement register-blocking by writing a function for performing MMM, using the mini-kernel code with NB = N (you should verify that this implements MMM correctly). You can use the results in the Yotov et al. study of the ATLAS system to determine good values for MU and NU, or you can experiment with different values to find good ones. In this part, do not use vector operations in your code. Run the code to multiply matrices of various sizes, and draw a graph of GFlops vs. matrix size (the graph can be superimposed on the graph from part (a)). Using PAPI counters, measure the L1 data/inst-cache misses for the largest matrix size. Explain your results briefly. Note for experiments: use values of N that are multiples of your values for MU and NU; otherwise, you will have to write some clean-up code to handle leftover pieces of the matrices.
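As an illustration of register blocking with NB = N, here is a sketch with MU = NU = 2; the actual values should come from your own tuning or from the Yotov et al. study, and N is assumed to be a multiple of MU and NU.

```c
#define MU 2
#define NU 2

/* Register-blocked MMM with NB = N: a MU x NU tile of C is kept in
   scalar variables (which the compiler can keep in registers) across
   the entire k loop, so each C element is loaded and stored once. */
void mmm_regblock(int N, float **A, float **B, float **C) {
    for (int j = 0; j < N; j += NU)
        for (int i = 0; i < N; i += MU) {
            /* load the MU x NU tile of C */
            float c00 = C[i][j],   c01 = C[i][j+1];
            float c10 = C[i+1][j], c11 = C[i+1][j+1];
            for (int k = 0; k < N; k++) {          /* micro-kernel */
                float a0 = A[i][k], a1 = A[i+1][k];
                float b0 = B[k][j], b1 = B[k][j+1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            /* store the tile back */
            C[i][j]   = c00;  C[i][j+1]   = c01;
            C[i+1][j] = c10;  C[i+1][j+1] = c11;
        }
}
```

Larger MU and NU expose more independent multiply-adds per iteration but use more registers; spilling past the register file undoes the benefit.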

(c) (20 points) Repeat part (b), but this time use vector intrinsics to implement the MMM code. Measure the GFlops you obtain for a range of matrix sizes, and plot a graph of GFlops vs. matrix size (as before, superimpose this graph on the graphs from (a) and (b)). Using PAPI counters, measure the L1 data/inst-cache misses for the largest matrix size. Explain your results briefly.
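One possible shape for the vectorized kernel, assuming x86 SSE intrinsics with MU = 2 rows and NU = 4 columns (so each row of the C tile is one 4-wide vector); your target ISA and tile sizes may well differ, and N is assumed to be a multiple of 4 and of MU.

```c
#include <xmmintrin.h>   /* SSE intrinsics; baseline on x86-64 */

/* Vectorized register-blocked MMM with NB = N.  A scalar of A is
   broadcast across a vector and multiplied against a 4-wide slice
   of a row of B, accumulating into a vector register holding a row
   of the C tile. */
void mmm_vec(int N, float **A, float **B, float **C) {
    for (int j = 0; j < N; j += 4)
        for (int i = 0; i < N; i += 2) {
            __m128 c0 = _mm_loadu_ps(&C[i][j]);
            __m128 c1 = _mm_loadu_ps(&C[i+1][j]);
            for (int k = 0; k < N; k++) {          /* micro-kernel */
                __m128 b  = _mm_loadu_ps(&B[k][j]);
                __m128 a0 = _mm_set1_ps(A[i][k]);  /* broadcast */
                __m128 a1 = _mm_set1_ps(A[i+1][k]);
                c0 = _mm_add_ps(c0, _mm_mul_ps(a0, b));
                c1 = _mm_add_ps(c1, _mm_mul_ps(a1, b));
            }
            _mm_storeu_ps(&C[i][j], c0);
            _mm_storeu_ps(&C[i+1][j], c1);
        }
}
```

On a machine with wider vectors (AVX) or fused multiply-add, the same structure applies with the corresponding intrinsics and a wider NU.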

(d) (30 points) Repeat (c), but this time implement both register-blocking and L1 cache-blocking. You will have to wrap three loops around the mini-kernel to get a full MMM. Use any method you wish to determine a good value for NB. Measure the GFlops you obtain for a range of matrix sizes, and plot a graph of GFlops vs. matrix size (as before, superimpose this graph on the graphs from the previous parts). Using PAPI counters, measure the L1 data/inst-cache misses for the largest matrix size. Make sure you only use matrices whose size is a multiple of your choice of NB so you do not need to write clean-up code to handle fractional tiles. Explain your results briefly.
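The three blocking loops around the mini-kernel might look like the sketch below. The mini_kernel shown here is a plain scalar stand-in for your register-blocked kernel from parts (b)/(c), and the jt/kt/it loop order is illustrative; N is assumed to be a multiple of NB.

```c
/* Scalar stand-in for the register-blocked mini-kernel: updates one
   NB x NB tile, C[it.., jt..] += A[it.., kt..] * B[kt.., jt..]. */
static void mini_kernel(int NB, int it, int jt, int kt,
                        float **A, float **B, float **C) {
    for (int i = it; i < it + NB; i++)
        for (int k = kt; k < kt + NB; k++)
            for (int j = jt; j < jt + NB; j++)
                C[i][j] += A[i][k] * B[k][j];
}

/* L1 cache blocking: three loops over NB x NB tiles.  NB is chosen
   so that one tile each of A, B, and C fits in the L1 data cache. */
void mmm_blocked(int N, int NB, float **A, float **B, float **C) {
    for (int jt = 0; jt < N; jt += NB)
        for (int kt = 0; kt < N; kt += NB)
            for (int it = 0; it < N; it += NB)
                mini_kernel(NB, it, jt, kt, A, B, C);
}
```

A rough starting point for NB is the largest value with 3 * NB * NB * sizeof(float) comfortably below the L1 data cache size, refined empirically.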

(e) (20 points) You can improve the performance of your kernel by copying blocks of data into contiguous storage, as explained in the Yotov et al. paper. Implement data copying and produce the performance graph for this optimized code, superimposing it on the graphs from all the previous parts. Using PAPI counters, measure the L1 data/inst-cache misses for the largest matrix size.
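Packing a tile into contiguous storage can be done with one memcpy per row, for example as below (pack_tile and its arguments are illustrative names, not from the handout). The mini-kernel then reads the packed buffer with unit stride, so consecutive accesses within a tile never jump between distant rows.

```c
#include <string.h>

/* Copy the NB x NB tile of A whose top-left corner is (i0, k0) into a
   contiguous NB*NB buffer, row by row. */
void pack_tile(int NB, int i0, int k0, float **A, float *buf) {
    for (int i = 0; i < NB; i++)
        memcpy(buf + i * NB, &A[i0 + i][k0], NB * sizeof(float));
}
```

The copy cost is O(NB^2) per tile while the kernel does O(NB^3) work on it, so the overhead is amortized for reasonable NB.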

Implementation notes:

1) In the C programming language, a 2-dimensional array of floats is usually implemented as a 1-dimensional array of 1-dimensional arrays of floats. For our computations, it is better to allocate one big contiguous block of storage to hold all the floats, and then create a 1-dimensional row vector that holds pointers to the start of each row. You can use the following code for this purpose: it takes the number of rows (x) and columns (y) as parameters, and it returns a pointer to the row vector.

float **Allocate2DArray_Offloat(int x, int y)
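The body of this routine is not reproduced here; a minimal version consistent with the description above (one contiguous block of floats plus a vector of row pointers) might look like:

```c
#include <stdlib.h>

/* Sketch of the allocator described above: all x*y floats live in one
   contiguous block, and rows[i] points at the start of row i. */
float **Allocate2DArray_Offloat(int x, int y) {
    float  *data = malloc((size_t)x * y * sizeof(float));
    float **rows = malloc((size_t)x * sizeof(float *));
    if (!data || !rows) { free(data); free(rows); return NULL; }
    for (int i = 0; i < x; i++)
        rows[i] = data + (size_t)i * y;
    return rows;
}
```

To free such an array, free the data block (reachable as A[0]) and then the row vector itself.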

Alternatively, for an N*M matrix you can use a single flat array and index it as A[i * M + j]. Don't worry about the multiplication overhead, and don't bother with pointer arithmetic for "speedup": modern compilers automatically detect patterns like "[i * M + j]" and replace them with something faster.

2) Here is a nice paper on coding guidelines for writing "portable assembly language programs" in the C language.

3) You can also read an article about the GotoBLAS by Goto and van de Geijn.

4) To prevent the compiler from doing things that are "too smart", please use gcc or clang.

Do not use icc for this assignment.

You can use "-O2" but don't go up to "-O3".

5) When you copy a large chunk of memory, use memcpy() rather than a hand-written loop.