CSE 392/CS 378 Assignment 3

Due: October 24th, 2012

Introduction

In this assignment, you will use the cache optimization techniques covered in class to optimize a dense matrix-multiplication kernel.

You may work in groups of at most two students. Each group must do the assignment independently.

Problem 1 -- Miss Ratio of Wavefront Code

(10 points) Consider the following loop-nest, which is referred to as "wavefront" in the compiler literature. Assume A stores doubles.

for I = 2,N
  for J = 1, N-1
   A(I,J) = A(I,J) + A(I-1,J+1)

  1. If the array is stored in column-major order, what are the miss ratios in the large and small cache models, assuming the line size is b?
  2. If the capacity of the cache is C, how large is N when the transition out of the large cache model takes place (i.e., when capacity misses start to occur)?
  3. In class, we saw how loop transformations can improve locality. In some programs, it is possible to perform data transformations (i.e., change the layout of data in memory) to improve locality. Repeat parts 1 and 2 for the case when the array is stored in row-major order.
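To make the access pattern concrete, here is a minimal C sketch of the wavefront loop nest (the `wavefront` and `at` helper names are illustrative, not part of the assignment), translated to 0-based indexing with an explicit column-major indexing helper:

```c
/* Column-major index: element (i,j) of an N x N array of doubles
 * lives at flat offset i + j*N (0-based). */
static double *at(double *A, int i, int j, int N) {
    return &A[i + j * N];
}

/* 0-based translation of the wavefront loop nest above:
 * A(I,J) += A(I-1,J+1) for I = 2..N, J = 1..N-1 becomes
 * i = 1..N-1, j = 0..N-2. */
static void wavefront(double *A, int N) {
    for (int i = 1; i < N; i++)
        for (int j = 0; j < N - 1; j++)
            *at(A, i, j, N) += *at(A, i - 1, j + 1, N);
}
```

Swapping the body of `at` to `row*N + column` gives the row-major layout asked about in part 3; the loop nest itself is unchanged.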

Problem 2 -- Optimization of MMM for the Memory Hierarchy

In class, we described the structure of the optimized MMM code produced by ATLAS. The "computational heart" of this code is the mini-kernel that multiplies an NBxNB block of matrix A with an NBxNB block of matrix B into an NBxNB block of matrix C, where NB is chosen so that the working set of this computation fits in the L1 cache. The mini-kernel itself is performed by repeatedly calling a micro-kernel that multiplies an MUx1 column vector of matrix A with a 1xNU row vector of matrix B into an MUxNU block of matrix C. The values of MU and NU are chosen so that the micro-kernel can be performed out of the registers. Pseudocode for the mini-kernel is shown below (note that this code assumes that NB is a multiple of MU and NU).
//mini-kernel
for (int j = 0; j < NB; j += NU)
  for (int i = 0; i < NB; i += MU)
    load C[i..i+MU-1, j..j+NU-1] into registers
    for (int k = 0; k < NB; k++)
      //micro-kernel
      load A[i..i+MU-1, k] into registers
      load B[k, j..j+NU-1] into registers
      multiply A's and B's and add to C's
    store C[i..i+MU-1, j..j+NU-1]
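As a concrete illustration of the pseudocode, here is a minimal C sketch of the mini-kernel for the (arbitrarily chosen) values MU = NU = 2, assuming row-major NB x NB blocks; the scalar temporaries c00..c11 are what "load C into registers" amounts to in practice:

```c
enum { MU = 2, NU = 2 };  /* illustrative register-block sizes */

/* C += A * B for row-major NB x NB blocks; NB must be a multiple of MU and NU. */
static void mini_kernel(const double *A, const double *B, double *C, int NB) {
    for (int j = 0; j < NB; j += NU)
        for (int i = 0; i < NB; i += MU) {
            /* load the MU x NU tile of C into scalars (intended to stay in registers) */
            double c00 = C[i*NB + j],     c01 = C[i*NB + j + 1];
            double c10 = C[(i+1)*NB + j], c11 = C[(i+1)*NB + j + 1];
            for (int k = 0; k < NB; k++) {
                /* micro-kernel: MU x 1 column of A times 1 x NU row of B */
                double a0 = A[i*NB + k],  a1 = A[(i+1)*NB + k];
                double b0 = B[k*NB + j],  b1 = B[k*NB + j + 1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            /* store the tile back only after the k loop finishes */
            C[i*NB + j] = c00;      C[i*NB + j + 1] = c01;
            C[(i+1)*NB + j] = c10;  C[(i+1)*NB + j + 1] = c11;
        }
}
```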

(a) (10 points) Write a function that takes three NxN matrices of doubles as input and uses the straightforward ikj 3-nested-loop version of MMM to perform matrix multiplication. Use the allocation routine given in the implementation notes to allocate storage for the three arrays. Run this code to multiply matrices of various sizes, and draw a graph of GFlops vs. matrix size. What fraction of peak performance do you get for large matrices? Explain your results briefly.
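One possible shape for this function (the name is illustrative; row-major flattened arrays assumed, with C zeroed by the caller) is:

```c
/* Straightforward ikj triple loop: C += A*B for row-major N x N matrices.
 * Hoisting A[i][k] out of the j loop is what distinguishes ikj from ijk. */
static void mmm_ikj(const double *A, const double *B, double *C, int N) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double a = A[i*N + k];  /* invariant across the j loop */
            for (int j = 0; j < N; j++)
                C[i*N + j] += a * B[k*N + j];
        }
}
```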

(b) (30 points) To measure the impact of register-blocking without cache-blocking, implement register-blocking by writing a function for performing MMM, using the mini-kernel code with NB = N (you should verify that this implements MMM correctly). You can use the results in the Yotov et al. study of the ATLAS system to determine good values for MU and NU, or you can experiment with different values to find good ones. Run the code to multiply matrices of various sizes, and draw a graph of GFlops vs. matrix size. Explain your results briefly. Note: use values of N that are multiples of your values for MU and NU; otherwise, you will have to write some clean-up code to handle leftover pieces of the matrices.

(c) (30 points) Repeat (b), but this time, implement both register-blocking and L1 cache-blocking. You will have to wrap three loops around the mini-kernel to get a full MMM. Use any method you wish to determine a good value for NB. Measure the GFlops you obtain for a range of matrix sizes, and plot a graph of GFlops vs. matrix size (make sure you only use matrices whose size is a multiple of your choice of NB). Explain your results briefly.

(d) (20 points) You can improve the performance of your kernel by copying blocks of data into contiguous storage, as explained in the Yotov et al. paper. Implement data copying and produce the performance graph for this optimized code.

Competition notes:

We will care about matrix multiply for problem sizes ranging from those that do not fit in L2 to those that do not fit in L3. The exact sizes will not be given, but do expect that some tests will use odd-sized matrices.

All matrices will be square.

For the competition, we will have a test harness which will call your code. Thus you MUST provide a function called matmult with the following signature:

void matmult(double* A, double* B, double* C, unsigned N)
which multiplies A by B and puts the result in C. A, B, and C are square NxN matrices.

We will test on lonestar.

You are free to use any vector intrinsics available. You are free to prefetch.

We will compile with gcc -O3.
Implementation notes:

1) In the C programming language, a 2-dimensional array of doubles is stored as a flattened 1-dimensional array of doubles. When the compiler knows the sizes of the dimensions, it generates row-major indexing expressions for it. Thus for A[anything][8], it will generate Aflat[r*8 + c] for A[r][c]. Note that you can always elide the leftmost dimension of an array type in C, since the compiler doesn't need it. Unfortunately, C does not handle passing variable-sized multidimensional arrays to functions well, since the number of columns is part of the type. Note that the type A[][] denotes an array of arrays and not a multidimensional array. The simplest solution is to deal with an explicitly flattened array and generate the index expressions yourself. This will also make the part of the assignment dealing with column-major order easy, as you will just change the indexing function. For example, assuming square matrices:

      #include <stdlib.h>

      double* alloc(int N) {
        return malloc(sizeof(double)*N*N);
      }
      double* index(double* matrix, int row, int column, int N) {
        return &matrix[row*N + column];
      }
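For the column-major experiments, only the indexing function needs to change; a sketch of both variants (the `index_rm`/`index_cm` names are illustrative):

```c
/* Row-major: element (row, column) lives at flat offset row*N + column. */
static double *index_rm(double *m, int row, int column, int N) {
    return &m[row*N + column];
}

/* Column-major: the same element lives at flat offset column*N + row. */
static double *index_cm(double *m, int row, int column, int N) {
    return &m[column*N + row];
}
```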
    

2) Here is a nice paper on coding guidelines for writing "portable assembly language programs" in the C language.

3) You can also read an article about the GotoBLAS by Goto and van de Geijn.
4) You can try using vectors in gcc to achieve peak performance.
GCC vector type extensions
Links to many resources for vector intrinsics.
5) You might try unrolling by 2. This may allow gcc to use vector instructions without any more effort on your part.
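For example, unrolling the inner j loop of the ikj version by 2 (a sketch; assumes N is even and row-major flattened arrays with C pre-zeroed) exposes two independent multiply-adds per iteration, a pattern gcc's auto-vectorizer can often turn into vector instructions at -O3:

```c
/* ikj MMM with the inner loop unrolled by 2 (N assumed even). */
static void mmm_ikj_unroll2(const double *A, const double *B, double *C, int N) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double a = A[i*N + k];  /* invariant across the j loop */
            for (int j = 0; j < N; j += 2) {
                C[i*N + j]     += a * B[k*N + j];
                C[i*N + j + 1] += a * B[k*N + j + 1];
            }
        }
}
```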

Submission

Submit a writeup for both problems, including your discussions, miss-ratio answers, and performance plots.

Reference Data - Intel MKL

Baseline performance from Intel's MKL: below is a program and reference runtimes generated with Intel's MKL (Math Kernel Library). These numbers are for your reference.
//compile using:
//module load intel
//module load boost
//module load mkl
//icpc -I $TACC_BOOST_INC -L $TACC_BOOST_LIB -O3 -mkl -lboost_timer -DNDEBUG mkl_test.cpp


#include <boost/numeric/ublas/matrix.hpp>
#include <boost/timer/timer.hpp>
#include <iostream>
#include <mkl.h>

using namespace boost::numeric::ublas;

void benchmark(int size) {
    matrix<double> m(size, size);
    matrix<double> n(size, size);
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            m(i,j) = 2*i - j;
            n(i,j) = i - 2*j;
        }
    }
    matrix<double> r(size, size);

    std::cout << "Trying " << size << " ";
    {
      boost::timer::auto_cpu_timer t;
      r = prod(m,n);
    }
}

int main(int argc, char *argv[]) {
    mkl_set_num_threads(1);
    for (int x = 100; x < 2000; x += 100)
        benchmark(x);
    return 0;
}


======================

results:
size    time (s)
100    0.0008
200    0.0085
300    0.027987
400    0.06655
500    0.13358
600    1.1599
700    2.68
800    4.015
900    5.87
1000    8.15
1100    10.97
1200    14.18
1300    18.632
1400    23.088
1500    29.432
1600    36.54
1700    43.67