You may work in groups of at most 2 students. Each group must do the assignment independently.

(10 points) Consider the following loop nest, which is referred to as "wavefront" in the compiler literature. Assume A stores doubles.

    for I = 2, N
        for J = 1, N-1
            A(I,J) = A(I,J) + A(I-1,J+1)

- If the array is stored in column-major order, what are the miss ratios in the large and small cache models, assuming the line size is b?
- If the capacity of the cache is C, how large is N when the transition out of the large cache model takes place (i.e., when capacity misses start to occur)?
- In class, we saw how loop transformations can improve locality. In some programs, it is possible to perform data transformations (i.e., change the layout of data in memory) to improve locality. Repeat parts 1 and 2 for the case when the array is stored in row-major order.

    //mini-kernel
    for (int j = 0; j < NB; j += NU)
        for (int i = 0; i < NB; i += MU)
            load C[i..i+MU-1, j..j+NU-1] into registers
            for (int k = 0; k < NB; k++)
                //micro-kernel
                load A[i..i+MU-1, k] into registers
                load B[k, j..j+NU-1] into registers
                multiply A's and B's and add to C's
            store C[i..i+MU-1, j..j+NU-1]

(a) (10 points) Write a function that takes three NxN matrices of doubles as input and uses the straightforward ikj 3-nested loop version of MMM to perform matrix multiplication. Use the allocation routine given in the implementation notes to allocate storage for the three arrays. Run this code to multiply matrices of various sizes, and draw a graph of GFlops vs. matrix size. What fraction of peak performance do you get for large matrices? Explain your results briefly.

(b) (30 points) To measure the impact of register-blocking without cache-blocking, implement register-blocking by writing a function for performing MMM, using the mini-kernel code with NB = N (you should verify that this implements MMM correctly). You can use the results in the Yotov et al. study of the ATLAS system to determine good values for MU and NU, or you can experiment to find good values. Run the code to multiply matrices of various sizes, and draw a graph of GFlops vs. matrix size. Explain your results briefly. Note: use values of N that are multiples of your values for MU and NU; otherwise, you will have to write some clean-up code to handle leftover pieces of the matrices.

(c) (30 points) Repeat (b), but this time, implement both register-blocking and L1 cache-blocking. You will have to wrap three loops around the mini-kernel to get a full MMM. Use any method you wish to determine a good value for NB. Measure the GFlops you obtain for a range of matrix sizes, and plot a graph of GFlops vs. matrix size (make sure you only use matrices whose size is a multiple of your choice of NB). Explain your results briefly.

(d) (20 points) You can improve the performance of your kernel by copying blocks of data into contiguous storage as explained in the Yotov et al paper. Implement data copying and produce the performance graph for this optimized code.

Competition notes:

We will care about matrix multiply for problem sizes ranging from those that do not fit in L2 up to those that do not fit in L3. The exact sizes will not be given, but do expect that some tests will be on odd-sized matrices.

All matrices will be square.

For the competition, we will have a test harness which will call your code. Thus you MUST provide a function called matmult with the following signature:

    void matmult(double* A, double* B, double* C, unsigned N)

which multiplies A by B and puts the result in C. A, B, and C are square matrices of size N.

We will test on lonestar.

You are free to use any vector intrinsics available. You are free to prefetch.

We will compile with gcc -O3.

Implementation notes:

1) In the C programming language, a 2-dimensional array of doubles is implemented as a flattened 1-dimensional array of doubles. When the compiler knows the sizes of the dimensions, it generates row-major indexing expressions. Thus for A[anything][8], it will generate Aflat[r*8 + c] for A[r][c]. Note that you can always elide the leftmost dimension of a type in C since the compiler doesn't need it. Unfortunately, C does not pass variable-sized multidimensional arrays to functions well, since the number of columns is part of the type. Note that the type A[][] denotes an array of arrays and not a multidimensional array. The simple solution is to deal with an explicitly flattened array and generate the index expressions yourself. This will make the column-major part of the assignment easy, as you will just change the indexing function. For example, assuming square matrices:

    double* alloc(int N) {
        return malloc(sizeof(double) * N * N);
    }

    double* index(double* matrix, int row, int column, int N) {
        return &matrix[row*N + column];
    }

2) Here is a nice paper on coding guidelines for writing "portable assembly language programs" in the C language.

3) You can also read an article about the GotoBLAS by Goto and van de Geijn.

4) You can try using vectors in gcc to achieve peak performance.

GCC vector type extensions

Links to many resources for vector intrinsics.

5) You might try unrolling by 2. This may allow gcc to use vector instructions without any more effort on your part.

Below is a program and reference runtimes generated using Intel's MKL (Math Kernel Library).

These numbers are for your reference.

    //compile using:
    //module load intel
    //module load boost
    //module load mkl
    //icpc -I $TACC_BOOST_INC -L $TACC_BOOST_LIB -O3 -mkl -lboost_timer -DNDEBUG mkl_test.cpp

    #include <iostream>
    #include <boost/numeric/ublas/matrix.hpp>
    #include <boost/timer/timer.hpp>
    #include <mkl.h>

    using namespace boost::numeric::ublas;

    void benchmark(int size) {
        matrix<double> m(size, size);
        matrix<double> n(size, size);
        for (int i = 0; i < size; ++i) {
            for (int j = 0; j < size; ++j) {
                m(i,j) = 2*i-j;
            }
        }
        matrix<double> r(size, size);
        std::cout << "Trying " << size << " ";
        {
            boost::timer::auto_cpu_timer t;
            r = prod(m,n);
        }
    }

    int main(int argc, char *argv[]) {
        mkl_set_num_threads(1);
        for (int x = 100; x < 2000; x += 100)
            benchmark(x);
        return 0;
    }

======================
results:

    size    time (s)
    100     0.0008
    200     0.0085
    300     0.027987
    400     0.06655
    500     0.13358
    600     1.1599
    700     2.68
    800     4.015
    900     5.87
    1000    8.15
    1100    10.97
    1200    14.18
    1300    18.632
    1400    23.088
    1500    29.432
    1600    36.54
    1700    43.67