CSE 392/CS 378 Assignment 3
Due: October 30th, 2011
Introduction
In this assignment, you will implement Matrix-Matrix Multiplication (MMM) and try
to optimize it using the techniques seen in class, mentioned in the references,
or any other tricks you can come up with! You may work in pairs.
Instructions
- Visit http://z.cs.utexas.edu/wiki/LA.wiki/OptimizingGemm/
- Follow the instructions under "Setup" on your favorite x86 (Intel) machine running Linux. Ideally, use a machine that has MATLAB installed.
- Try the optimizations that are suggested, following the directions.
- Next, try to optimize beyond what is on that wiki page. Target just one core (no OpenMP, pthreads, etc.)
- Some possibilities:
- play with the block sizes
- try different levels of unrolling
- learn about how to call vector intrinsic operations from C
- see what the Intel C compiler (icc) does
- Finally, turn in your best implementation.
- Turn in one file named _MMult.c containing a routine named _MMult with a calling sequence identical to that of MY_MMult in MMult0.c. So, I would turn in RVDG_MMult.c containing the routine "void RVDG_MMult( etc )".
- Your implementation only needs to get the right answer for matrix sizes that are multiples of 8, so you don't need to worry about handling matrices whose dimensions are not nice multiples of the block size for which you optimized.
- Everyone's best implementation will be run by Sarat on a single core of longhorn (the machine at TACC on which you have an account), and a plot will be made showing how people did.
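As a starting point, the blocking suggested above can be sketched as a single C routine. This is a hedged sketch, not a tuned solution: it assumes the wiki's column-major storage convention (element (i,j) of A lives at a[ (j)*lda + (i) ]) and a calling sequence matching MY_MMult in MMult0.c; the name RVDG_MMult and the choice BLOCK = 8 are purely illustrative.

```c
/* Illustrative sketch of a blocked MMM, assuming the wiki's
   column-major convention and calling sequence.  BLOCK = 8 divides
   m, n, and k for the matrix sizes the assignment tests. */

#define BLOCK 8
#define A(i,j) a[ (j)*lda + (i) ]
#define B(i,j) b[ (j)*ldb + (i) ]
#define C(i,j) c[ (j)*ldc + (i) ]

void RVDG_MMult( int m, int n, int k,
                 double *a, int lda,
                 double *b, int ldb,
                 double *c, int ldc )
{
  /* Visit C block by block; each block of C is updated using a small
     panel of A and a small panel of B that can stay in cache. */
  for ( int jb = 0; jb < n; jb += BLOCK )
    for ( int pb = 0; pb < k; pb += BLOCK )
      for ( int ib = 0; ib < m; ib += BLOCK )
        /* Plain triple loop inside the block; a tuned kernel would
           unroll these loops and use registers or vector intrinsics. */
        for ( int j = jb; j < jb + BLOCK; j++ )
          for ( int p = pb; p < pb + BLOCK; p++ )
            for ( int i = ib; i < ib + BLOCK; i++ )
              C( i, j ) += A( i, p ) * B( p, j );
}
```

From here, varying BLOCK and the loop order, and unrolling the innermost loops, are the first experiments worth trying.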
Note
- An implementation optimized for one machine may not be the best on another machine.
So, at some point you should do your development on longhorn.
- Your code will be compiled using the gcc compiler with optimization level -O3.
So you cannot rely on any other compiler-specific flags or optimizations to give you good performance.
- This makes graphing results a bit difficult.
Deal with it! If you find a slick way of presenting graphs, share it with the class to make the lives of others easier.
- If you are timing your implementations on a machine that other people are also using, your timings may not be reliable.
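One common way to reduce that noise (sketched below; not required by the assignment) is to time the kernel several times and report the minimum, since interference from other users only ever makes a run slower. The names `time_best` and `kernel` are illustrative, not part of the wiki's code.

```c
#include <time.h>

/* Illustrative helper: run fn() reps times and return the fastest
   CPU time as measured by clock().  Taking the minimum over several
   runs filters out slowdowns caused by other users' processes. */
static double time_best( void (*fn)( void ), int reps )
{
    double best = 1e30;
    for ( int r = 0; r < reps; r++ ) {
        clock_t start = clock();
        fn();
        double elapsed = (double)( clock() - start ) / CLOCKS_PER_SEC;
        if ( elapsed < best )
            best = elapsed;          /* keep the fastest run */
    }
    return best;
}

/* Placeholder workload; replace with a call to your _MMult routine. */
static void kernel( void )
{
    volatile double s = 0.0;
    for ( int i = 0; i < 1000000; i++ )
        s += i * 1e-6;
}
```

A driver would then call, say, `time_best( kernel, 5 )` and report that number rather than a single noisy measurement.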
References
- Goto, K., and van de Geijn, R. A. 2008. Anatomy of High-Performance Matrix Multiplication, ACM Transactions on Mathematical Software 34, 3, Article 12.
- Chellappa, S., Franchetti, F., and Püschel, M. 2008. How To Write Fast Numerical Code: A Small Introduction, Lecture Notes in Computer Science 5235, 196-259.
- Bilmes, et al. The PHiPAC (Portable High Performance ANSI C) Page for BLAS3 Compatible Fast Matrix Matrix Multiply.
- Lam, M. S., Rothberg, E. E., and Wolf, M. E. 1991. The Cache Performance and Optimization of Blocked Algorithms, ASPLOS '91, 63-74.
- Yotov, K., Li, X., Ren, G., Garzarán, M. J., Padua, D., Pingali, K., and Stodghill, P. 2005. Is Search Really Necessary to Generate High-Performance BLAS? Proceedings of the IEEE 93, 2.
Submission
You should submit your code (a single .c file) as described in the Instructions section and a PDF document (about half a page in length) describing the optimizations that you used.
Create a zip file and email it to
saratrallapalli@gmail.com before 11:59 P.M. on October 30th. The subject of the email should be
[hpc-hw3] Last Name/ First Name