You may work in groups of at most 2 students. Each group must do the assignment independently.

(10 points) Consider the following loop nest, which is referred to as "wavefront" in the compiler literature. Assume A stores doubles.

    for I = 2, N
        for J = 1, N-1
            A(I,J) = A(I,J) + A(I-1,J+1)

- If the array is stored in column-major order, what are the miss ratios in the large and small cache models, assuming the line size is b?
- If the capacity of the cache is C, how large is N when the transition out of the large cache model takes place (i.e., when capacity misses start to occur)?
- In class, we saw how loop transformations can improve locality. In some programs, it is possible to perform data transformations (i.e., change the layout of data in memory) to improve locality. Repeat parts 1 and 2 for the case when the array is stored in row-major order.

    //mini-kernel
    for (int j = 0; j < NB; j += NU)
        for (int i = 0; i < NB; i += MU)
            load C[i..i+MU-1, j..j+NU-1] into registers
            for (int k = 0; k < NB; k++)
                //micro-kernel
                load A[i..i+MU-1, k] into registers
                load B[k, j..j+NU-1] into registers
                multiply A's and B's and add to C's
            store C[i..i+MU-1, j..j+NU-1]

(a) (10 points) Write a function that takes three NxN matrices of doubles as input and uses the straightforward ikj 3-nested loop version of MMM to perform matrix multiplication. Use the allocation routine given in the implementation notes to allocate storage for the three arrays. Run this code to multiply matrices of various sizes, and draw a graph of GFlops vs. matrix size. What fraction of peak performance do you get for large matrices? Explain your results briefly.

(b) (30 points) To measure the impact of register-blocking without cache-blocking, implement register-blocking by writing a function for performing MMM, using the mini-kernel code with NB = N (you should verify that this implements MMM correctly). You can use the results in the Yotov et al. study of the ATLAS system to determine good values for MU and NU, or you can experiment to find good values. Run the code to multiply matrices of various sizes, and draw a graph of GFlops vs. matrix size. Explain your results briefly. Note: use values of N that are multiples of your values for MU and NU; otherwise, you will have to write some clean-up code to handle leftover pieces of the matrices.

(c) (30 points) Repeat (b), but this time, implement both register-blocking and L1 cache-blocking. You will have to wrap three loops around the mini-kernel to get a full MMM. Use any method you wish to determine a good value for NB. Measure the GFlops you obtain for a range of matrix sizes, and plot a graph of GFlops vs. matrix size (make sure you only use matrices whose size is a multiple of your choice of NB). Explain your results briefly.

(d) (20 points) You can improve the performance of your kernel by copying blocks of data into contiguous storage as explained in the Yotov et al paper. Implement data copying and produce the performance graph for this optimized code.

Competition notes:

We will care about matrix multiply for problem sizes ranging from those that do not fit in L2 up to those that do not fit in L3. The exact sizes will not be given, but do expect that some tests will be on odd-sized matrices.

All matrices will be square.

For the competition, we will have a test harness which will call your code. Thus you MUST provide a function called matmult with the following signature:

    void matmult(double* A, double* B, double* C, unsigned N)

which multiplies A by B and puts the result in C. A, B, and C are square matrices of size N.

We will test on lonestar.

You are free to use any vector intrinsics available. You are free to prefetch.

We will compile with gcc -O3.

Implementation notes:

1) In the C programming language, a 2-dimensional array of doubles is implemented as a flattened 1-dimensional array of doubles. When the compiler knows the sizes of the dimensions, it generates row-major indexing expressions. Thus for A[anything][8], it will generate Aflat[r*8 + c] for A[r][c]. Note that you can always elide the leftmost dimension of a type in C since the compiler doesn't need it. Unfortunately, C does not pass variable-sized multidimensional arrays to functions well, since the number of columns is part of the type. Note that the type A[][] denotes an array of arrays and not a multidimensional array. The simple solution is to deal with an explicitly flattened array and generate the index expressions yourself. This will make the column-major part of the assignment easy, as you will just change the indexing function. For example, assuming square matrices:

    double* alloc(int N) {
        return malloc(sizeof(double) * N * N);
    }

    double* index(double* matrix, int row, int column, int N) {
        return &matrix[row*N + column];
    }

2) Here is a nice paper on coding guidelines for writing "portable assembly language programs" in the C language.

3) You can also read an article about the GotoBLAS by Goto and van de Geijn.

4) You can try using vectors in gcc to achieve peak performance.

GCC vector type extensions

Links to many resources for vector intrinsics.

5) You might try unrolling by 2. This may allow gcc to use vector instructions without any more effort on your part.

Below is a program and reference runtimes generated using Intel's MKL (Math Kernel Library).

These numbers are for your reference.

    //compile using:
    //module load intel
    //module load boost
    //module load mkl
    //icpc -I $TACC_BOOST_INC -L $TACC_BOOST_LIB -O3 -mkl -lboost_timer -DNDEBUG mkl_test.cpp

    #include <iostream>
    #include <boost/numeric/ublas/matrix.hpp>
    #include <boost/timer/timer.hpp>
    #include <mkl.h>

    using namespace boost::numeric::ublas;

    void benchmark(int size) {
        matrix<double> m(size, size);
        matrix<double> n(size, size);
        for (int i = 0; i < size; ++i) {
            for (int j = 0; j < size; ++j) {
                m(i,j) = 2*i-j;
            }
        }
        matrix<double> r(size, size);
        std::cout << "Trying " << size << " ";
        {
            boost::timer::auto_cpu_timer t;
            r = prod(m,n);
        }
    }

    int main(int argc, char *argv[]) {
        mkl_set_num_threads(1);
        for (int x = 100; x < 2000; x += 100)
            benchmark(x);
        return 0;
    }

======================
results:

    size    time (s)
    100     0.0008
    200     0.0085
    300     0.027987
    400     0.06655
    500     0.13358
    600     1.1599
    700     2.68
    800     4.015
    900     5.87
    1000    8.15
    1100    10.97
    1200    14.18
    1300    18.632
    1400    23.088
    1500    29.432
    1600    36.54
    1700    43.67