Section 1.1 Opening Remarks

Unit 1.1.1 Launch

Homework 1.1.1

Compute \(\left(\begin{array}{rrr} 1 & -2 & 2\\ -1 & 1 & 3\\ -2 & 2 & -1 \end{array}\right) \left(\begin{array}{rr} -2 & 1\\ 1 & 3\\ -1 & 2 \end{array}\right) + \left(\begin{array}{rr} 1 & 0\\ -1 & 2\\ -2 & 1 \end{array}\right) = \)

Answer

\(\left(\begin{array}{rrr} 1 & -2 & 2\\ -1 & 1 & 3\\ -2 & 2 & -1 \end{array}\right) \left(\begin{array}{rr} -2 & 1\\ 1 & 3\\ -1 & 2 \end{array}\right) + \left(\begin{array}{rr} 1 & 0\\ -1 & 2\\ -2 & 1 \end{array}\right) = \left(\begin{array}{cc} -5 & -1\\ -1 & 10\\ 5 & 3 \end{array}\right)\)
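For example, the \((0,0) \) entry of the result is the dot product of the first row of the left matrix with the first column of the middle matrix, plus the \((0,0) \) entry of the third matrix:

\begin{equation*} (1)(-2) + (-2)(1) + (2)(-1) + 1 = -6 + 1 = -5. \end{equation*}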

Let \(A \text{,}\) \(B \text{,}\) and \(C \) be \(m \times k \text{,}\) \(k \times n \text{,}\) and \(m \times n \) matrices, respectively. We can expose their individual entries as

\begin{equation*} A = \left(\begin{array}{c c c c} \alpha_{0,0} & \alpha_{0,1} & \cdots & \alpha_{0,k-1} \\ \alpha_{1,0} & \alpha_{1,1} & \cdots & \alpha_{1,k-1} \\ \vdots & \vdots & & \vdots \\ \alpha_{m-1,0} & \alpha_{m-1,1} & \cdots & \alpha_{m-1,k-1} \end{array} \right), \end{equation*}
\begin{equation*} B = \left(\begin{array}{c c c c} \beta_{0,0} & \beta_{0,1} & \cdots & \beta_{0,n-1} \\ \beta_{1,0} & \beta_{1,1} & \cdots & \beta_{1,n-1} \\ \vdots & \vdots & & \vdots \\ \beta_{k-1,0} & \beta_{k-1,1} & \cdots & \beta_{k-1,n-1} \end{array} \right), \end{equation*}

and

\begin{equation*} C = \left(\begin{array}{c c c c} \gamma_{0,0} & \gamma_{0,1} & \cdots & \gamma_{0,n-1} \\ \gamma_{1,0} & \gamma_{1,1} & \cdots & \gamma_{1,n-1} \\ \vdots & \vdots & & \vdots \\ \gamma_{m-1,0} & \gamma_{m-1,1} & \cdots & \gamma_{m-1,n-1} \end{array} \right). \end{equation*}

The computation \(C := A B + C \text{,}\) which adds the result of matrix-matrix multiplication \(A B \) to a matrix \(C \text{,}\) is defined as

\begin{equation*} \gamma_{i,j} = \sum_{p=0}^{k-1} \alpha_{i,p} \beta_{p,j} + \gamma_{i,j} \end{equation*}

for all \(0 \leq i < m \) and \(0 \leq j < n \text{.}\) This leads to the following pseudo-code for computing \(C := A B + C \text{:}\)

\begin{equation*} \begin{array}{l} {\bf for~} i := 0, \ldots , m-1 \\ ~~~ {\bf for~} j := 0, \ldots , n-1 \\ ~~~~~~ {\bf for~} p := 0, \ldots , k-1 \\ ~~~~~~~~~ \gamma_{i,j} := \alpha_{i,p} \beta_{p,j} + \gamma_{i,j} \\ ~~~~~~{\bf end} \\ ~~~{\bf end} \\ {\bf end} \end{array} \end{equation*}

The outer two loops visit each element of \(C \text{,}\) and the inner loop updates \(\gamma_{i,j} \) with the dot product of the \(i\)th row of \(A \) with the \(j \)th column of \(B \text{.}\)

#include <stdio.h>
#include <stdlib.h>

/* Matrices are stored in column-major order: entry (i,j) of a matrix
   with leading dimension ld lives at array offset (j)*ld + i. */
#define alpha( i,j ) A[ (j)*ldA + i ]   // map alpha( i,j ) to array A
#define beta( i,j )  B[ (j)*ldB + i ]   // map beta( i,j )  to array B
#define gamma( i,j ) C[ (j)*ldC + i ]   // map gamma( i,j ) to array C

/* Compute C := A B + C, where A is m x k, B is k x n, and C is m x n,
   using the IJP loop ordering. */
void MyGemm( int m, int n, int k, double *A, int ldA,
             double *B, int ldB, double *C, int ldC )
{
  for ( int i=0; i<m; i++ )
    for ( int j=0; j<n; j++ )
      for ( int p=0; p<k; p++ )
        gamma( i,j ) += alpha( i,p ) * beta( p,j );
}

Figure 1.1.1 C implementation of IJP ordering for computing matrix-matrix multiplication.
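To experiment with the routine in Figure 1.1.1, it helps to wrap it in a small driver. The sketch below is not part of the course materials: it reuses the matrices from Homework 1.1.1, packed column by column as the macros above expect.

#include <stdio.h>

void MyGemm( int m, int n, int k, double *A, int ldA,
             double *B, int ldB, double *C, int ldC );

int main( void )
{
  // Matrices from Homework 1.1.1, stored column by column
  // (column-major order), as MyGemm assumes.
  double A[] = {  1, -1, -2,    // column 0
                 -2,  1,  2,    // column 1
                  2,  3, -1 };  // column 2
  double B[] = { -2,  1, -1,    // column 0
                  1,  3,  2 };  // column 1
  double C[] = {  1, -1, -2,    // column 0
                  0,  2,  1 };  // column 1

  MyGemm( 3, 2, 3, A, 3, B, 3, C, 3 );

  // Print C row by row; this should reproduce the answer
  // to Homework 1.1.1.
  for ( int i = 0; i < 3; i++ )
    printf( "%5.1f %5.1f\n", C[ 0*3 + i ], C[ 1*3 + i ] );

  return 0;
}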

Figure 1.1.2 Performance comparison of a simple triple-nested loop and a high-performance implementation of the matrix-matrix multiplication \(C := A B + C \text{.}\) Top: Execution time. Bottom: Rate of computation, in billions of floating point operations per second (GFLOPS).

On Robert's laptop, Homework 1.1.2 yields the two graphs in Figure 1.1.2, as the curves labeled IJP. In the top graph, the time required to compute matrix-matrix multiplication is plotted as a function of the matrix size, where \(m = n = k \) (each matrix is square). The "dips" in the time required can be attributed to a number of factors, including other processes executing on the same processor and disrupting the computation. One should not be too concerned about those. The graph also shows the time it takes to complete the same computations with a highly optimized implementation. To the uninitiated in high-performance computing (HPC), the difference may be a bit of a shock: it makes one realize how inefficient many of the programs we write are.

The performance of a matrix-matrix multiplication implementation is measured in billions of floating point operations (flops) per second (GFLOPS). We know that it takes \(2 m n k \) flops to compute \(C := A B + C \) where \(C \) is \(m \times n \text{,}\) \(A \) is \(m \times k \text{,}\) and \(B \) is \(k \times n \text{.}\) If we measure the time it takes to complete the computation, \(T( m, n, k ) \text{,}\) then the rate at which we compute is given by

\begin{equation*} \frac{2 m n k}{T(m,n,k)} \times 10^{-9} {\rm ~GFLOPS}. \end{equation*}

For our implementation and the reference implementation, this rate is reported in Figure 1.1.2 (bottom). Again, don't worry too much about the dips in the curves. If we controlled the environment in which we performed the experiments (for example, by making sure few other programs are running at the time), these would largely disappear.
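As a sketch of how such a measurement might be taken (this is not the course's timing harness; the matrix size, the drand48 initialization, and the POSIX clock_gettime timer are all assumptions), one could time MyGemm and convert the elapsed time to a rate:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void MyGemm( int m, int n, int k, double *A, int ldA,
             double *B, int ldB, double *C, int ldC );

int main( void )
{
  int m = 1000, n = 1000, k = 1000;   // square matrices: m = n = k

  // Allocate the matrices and fill them with random values.
  double *A = malloc( m * k * sizeof( double ) );
  double *B = malloc( k * n * sizeof( double ) );
  double *C = malloc( m * n * sizeof( double ) );
  for ( int i = 0; i < m * k; i++ ) A[ i ] = drand48();
  for ( int i = 0; i < k * n; i++ ) B[ i ] = drand48();
  for ( int i = 0; i < m * n; i++ ) C[ i ] = drand48();

  // Time C := A B + C.
  struct timespec start, stop;
  clock_gettime( CLOCK_MONOTONIC, &start );
  MyGemm( m, n, k, A, m, B, k, C, m );
  clock_gettime( CLOCK_MONOTONIC, &stop );

  double T = ( stop.tv_sec - start.tv_sec )
           + ( stop.tv_nsec - start.tv_nsec ) * 1e-9;

  // 2 m n k flops, reported in GFLOPS.
  printf( "time = %f s, rate = %f GFLOPS\n",
          T, 2.0 * m * n * k / T * 1e-9 );

  free( A ); free( B ); free( C );
  return 0;
}

A single call gives a noisy estimate; repeating the measurement and reporting the best time reduces the influence of other processes, which also cause the dips discussed above.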

Unit 1.1.2 Outline Week 1

Under construction.

Unit 1.1.3 What you should know

Under construction.

Unit 1.1.4 What you will learn

This week, we not only review matrix-matrix multiplication, but also get you to think about this operation in different ways.

Upon completion of this week, you should be able to

  • Map matrices to memory;

  • Apply conventions to describe how to index into arrays that store matrices;

  • Observe the effects of loop order on performance;

  • Recognize that simple implementations may not provide the performance that can be achieved;

  • Realize that compilers don't automagically do all the optimization for you.