CS 377P: Programming for Performance
Assignment 1: Performance counters
Due date: February 7, 2018, 9:00PM
You should do this assignment in teams of two.
Late submission policy: Submissions can be at most 1 day
late. There will be a 10% penalty for late submissions.
Description
Write C code for the six variants of matrix-matrix multiply
(MMM) that you can generate by permuting the loops in the
standard three-nested-loop version of MMM. The matrix elements
should be of type double.
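For reference, a minimal sketch of one loop order (i-j-k) is shown
below; it assumes N x N matrices of doubles stored as flat row-major
arrays, and the names mmm_ijk, A, B, C and n are placeholders. The
other five variants are obtained by permuting the three loops.

    /* i-j-k loop order; the other five variants permute these loops */
    void mmm_ijk(int n, const double *A, const double *B, double *C)
    {
        int i, j, k;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                for (k = 0; k < n; k++)
                    C[i * n + j] += A[i * n + k] * B[k * n + j];
    }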
- Instrument your implementations with PAPI to measure the
following (a sketch of the instrumentation appears after this
list):
- Total cycles
- Total instructions
- Total load/store instructions
- Total floating-point instructions
- L1 data cache accesses and misses
- L2 data cache accesses and misses
- Compile your code with ICC using the flags '-O3 -fp-model
precise'.
- Collect these measurements for the following 8 matrix sizes:
50x50, 100x100, 200x200, 400x400, 800x800, 1200x1200,
1600x1600 and 2000x2000. To ensure that there is no
interference, make sure that you are the only one running
experiments on the machine, and do one measurement at a time.
- You can collect measurements on the following 10 CS machines
(with your CS login): orcrist-20.cs.utexas.edu, orcrist-21,
orcrist-22, orcrist-24, orcrist-25, orcrist-26, orcrist-27,
orcrist-28, orcrist-29 and orcrist-30.
- Create a table in which the rows correspond to the
loop-order variants (i-j-k, j-i-k, j-k-i, k-j-i, i-k-j, k-i-j)
and the columns correspond to the matrix sizes, and fill in
each position in the table with four values: the L1 and L2
miss rates, the total number of load/store instructions, and
the number of committed floating-point instructions. You can
create four separate tables if you prefer.
- Answer the following questions.
- For the smallest matrix size, do the L1 and L2 miss rates
vary for the different loop-order variants? Do they vary for
the larger matrix sizes? Is there any difference in behavior
between the different problem sizes? Can you explain
intuitively the reasons for this behavior?
- Re-instrument your code by removing the PAPI calls and using
clock_gettime with CLOCK_THREAD_CPUTIME_ID to measure the
execution times of the six versions of MMM for the 8 matrix
sizes specified above (see the timing sketch after the hint
below). How do your timing measurements compare to the
execution times you obtained using PAPI? Repeat this study
using CLOCK_REALTIME. Explain your results briefly.
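For the PAPI measurements requested above, the instrumentation might
follow the pattern sketched below. This is only a sketch: it reuses
the mmm_ijk kernel from the earlier example, counts three events,
must be linked with -lpapi, and assumes the chosen presets exist on
the machine. Commonly used presets for the requested events are
PAPI_TOT_CYC, PAPI_TOT_INS, PAPI_LST_INS, PAPI_FP_INS,
PAPI_L1_DCA/PAPI_L1_DCM and PAPI_L2_DCA/PAPI_L2_DCM, but confirm
what is available with papi_avail and papi_event_chooser.

    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    extern void mmm_ijk(int n, const double *A, const double *B, double *C);

    /* Count three events around a single kernel invocation. */
    void measure(int n, const double *A, const double *B, double *C)
    {
        int events[3] = { PAPI_TOT_CYC, PAPI_TOT_INS, PAPI_L1_DCM };
        long long values[3];
        int eventset = PAPI_NULL;

        /* PAPI_library_init should be called once per process. */
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);
        if (PAPI_create_eventset(&eventset) != PAPI_OK) exit(1);
        if (PAPI_add_events(eventset, events, 3) != PAPI_OK) exit(1);

        PAPI_start(eventset);
        mmm_ijk(n, A, B, C);               /* kernel being measured */
        PAPI_stop(eventset, values);

        printf("PAPI_TOT_CYC, %lld\n", values[0]);
        printf("PAPI_TOT_INS, %lld\n", values[1]);
        printf("PAPI_L1_DCM, %lld\n", values[2]);
    }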
Hint: To check the cache sizes on the machine, run: lscpu
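For the timing study, a minimal sketch using clock_gettime is given
below; it again assumes the mmm_ijk kernel from the earlier sketch
(on older glibc you may need to link with -lrt), and CLOCK_REALTIME
can be substituted for the second set of measurements.

    #include <time.h>

    extern void mmm_ijk(int n, const double *A, const double *B, double *C);

    /* Returns the elapsed thread CPU time of one kernel call, in seconds. */
    double time_mmm(int n, const double *A, const double *B, double *C)
    {
        struct timespec start, end;

        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
        mmm_ijk(n, A, B, C);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &end);

        return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) * 1e-9;
    }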
Deliverables
Submit (on Canvas) the following two files:
- A .tar.gz file with your code, a README.txt and a
Makefile.
- The README.txt describes how to run your program and what
the output will be. A reasonable output will be pairs of
"name of measured event, value".
- Using the Makefile, your code should compile on the 10 CS
machines by running only "make".
- A report (in .pdf) containing the tables, and the answers to
the questions. Clearly list both teammates' names in the
report.
Grading
Code: 40 points
Measurements (plots): 40 points
Explanation: 20 points
PAPI:
To see which PAPI counters are available on a host, run:
papi_avail
To see which PAPI counters can be collected at the same
time, run papi_event_chooser, passing the list of events you
want to combine.
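For example, to check whether the L1 and L2 miss counters can be
scheduled together, one possible invocation (the event names here
are only illustrative; substitute the events you plan to collect) is:
papi_event_chooser PRESET PAPI_L1_DCM PAPI_L2_DCM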
Read the PAPI manual http://icl.cs.utk.edu/projects/papi/wiki/PAPIC:EventSets
for more information, including example code.
"Warning! num_cntrs is more than num_mpx_cntrs" can be ignored.
ICC:
To run ICC on the indicated CS machines, run:
export PATH=$PATH:/opt/intel/bin
icc [compiler commands]
To check the availability of icc, run:
icc -v