## Section1.1Opening Remarks

### Unit1.1.1Launch

###### Homework1.1.1

Compute $\left(\begin{array}{rrr} 1 \amp -2 \amp 2\\ -1 \amp 1 \amp 3\\ -2 \amp 2 \amp -1 \end{array}\right) \left(\begin{array}{rr} -2 \amp 1\\ 1 \amp 3\\ -1 \amp 2 \end{array}\right) + \left(\begin{array}{rr} 1 \amp 0\\ -1 \amp 2\\ -2 \amp 1 \end{array}\right) =$

$\left(\begin{array}{rrr} 1 \amp -2 \amp 2\\ -1 \amp 1 \amp 3\\ -2 \amp 2 \amp -1 \end{array}\right) \left(\begin{array}{rr} -2 \amp 1\\ 1 \amp 3\\ -1 \amp 2 \end{array}\right) + \left(\begin{array}{rr} 1 \amp 0\\ -1 \amp 2\\ -2 \amp 1 \end{array}\right) = \left(\begin{array}{cc} -5 \amp -1\\ -1 \amp 10\\ 5 \amp 3 \end{array}\right)$

Let $A \text{,}$ $B \text{,}$ and $C$ be $m \times k \text{,}$ $k \times n \text{,}$ and $m \times n$ matrices, respectively. We can expose their individual entries as

\begin{equation*} A = \left(\begin{array}{c c c c} \alpha_{0,0} \amp \alpha_{0,1} \amp \cdots \amp \alpha_{0,k-1} \\ \alpha_{1,0} \amp \alpha_{1,1} \amp \cdots \amp \alpha_{1,k-1} \\ \vdots \amp \vdots \amp \amp \vdots \\ \alpha_{m-1,0} \amp \alpha_{m-1,1} \amp \cdots \amp \alpha_{m-1,k-1} \end{array} \right), \end{equation*}
\begin{equation*} B = \left(\begin{array}{c c c c} \beta_{0,0} \amp \beta_{0,1} \amp \cdots \amp \beta_{0,n-1} \\ \beta_{1,0} \amp \beta_{1,1} \amp \cdots \amp \beta_{1,n-1} \\ \vdots \amp \vdots \amp \amp \vdots \\ \beta_{k-1,0} \amp \beta_{k-1,1} \amp \cdots \amp \beta_{k-1,n-1} \end{array} \right), \end{equation*}

and

\begin{equation*} C = \left(\begin{array}{c c c c} \gamma_{0,0} \amp \gamma_{0,1} \amp \cdots \amp \gamma_{0,n-1} \\ \gamma_{1,0} \amp \gamma_{1,1} \amp \cdots \amp \gamma_{1,n-1} \\ \vdots \amp \vdots \amp \amp \vdots \\ \gamma_{m-1,0} \amp \gamma_{m-1,1} \amp \cdots \amp \gamma_{m-1,n-1} \end{array} \right). \end{equation*}

The computation $C := A B + C \text{,}$ which adds the result of matrix-matrix multiplication $A B$ to a matrix $C \text{,}$ is defined as

\begin{equation*} \gamma_{i,j} = \sum_{p=0}^{k-1} \alpha_{i,p} \beta_{p,j} + \gamma_{i,j} \end{equation*}

for all $0 \leq i \lt m$ and $0 \leq j \lt n \text{.}$ This leads to the following pseudo-code for computing $C := A B + C \text{:}$

\begin{equation*} \begin{array}{l} {\bf for~} i := 0, \ldots , m-1 \\ ~~~ {\bf for~} j := 0, \ldots , n-1 \\ ~~~~~~ {\bf for~} p := 0, \ldots , k-1 \\ ~~~~~~~~~ \gamma_{i,j} := \alpha_{i,p} \beta_{p,j} + \gamma_{i,j} \\ ~~~~~~{\bf end} \\ ~~~{\bf end} \\ {\bf end} \end{array} \end{equation*}

The outer two loops visit each element of $C \text{,}$ and the inner loop updates $\gamma_{i,j}$ with the dot product of the $i$th row of $A$ with the $j$th column of $B \text{.}$

###### Homework1.1.2

In the file Assignments/Week1/C/Gemm_IJP.c you will find a simple implementation given in Figure 1.1.1 that computes $C := A B + C \text{.}$ In the directory Assignments/Week1/C execute

make IJP


to compile, link, and execute it. You can view the performance attained on your computer with the Matlab Live Script in Assignments/Week1/C/data/Plot_IJP.mlx (Alternatively, read and execute Assignments/Week1/C/data/Plot_IJP_m.m.)

On Robert's laptop, Homework 1.1.2 yields the two graphs given in Figure 1.1.2 (top) as the curve labeled with IJP. In the first, the time required to compute matrix-matrix multiplication as a function of the matrix size is plotted, where $m = n = k$ (each matrix is square). The "dips" in the time required to complete can be attributed to a number of factors, including that other processes that are executing on the same processor may be disrupting the computation. One should not be too concerned about those. In the graph, we also show the time it takes to complete the same computations with a highly optimized implementation. To the uninitiated in high-performance computing (HPC), the difference may be a bit of a shock. It makes one realize how inefficient many of the programs we write are.

The performance of a matrix-matrix multiplication implementation is measured in billions of floating point operations (flops) per second (GFLOPS). The idea is that we know that it takes $2 m n k$ flops to compute $C := A B + C$ where $C$ is $m \times n \text{,}$ $A$ is $m \times k \text{,}$ and $B$ is $k \times n \text{.}$ If we measure the time it takes to complete the computation, $T( m, n, k ) \text{,}$ then the rate at which we compute is given by

\begin{equation*} \frac{2 m n k}{T(m,n,k)} \times 10^{-9} {\rm ~GFLOPS}. \end{equation*}

For our implementation and the reference implementation, this is reported in Figure 1.1.2 (Bottom). Again, don't worry too much about the dips in the curves. If we controlled the environment in which we performed the experiments, (for example, by making sure few other programs are running at the time of the experiments) these would largely disappear.

### Unit1.1.2Outline Week 1

Under construction.

### Unit1.1.3What you should know

Under construction

### Unit1.1.4What you will learn

In this week, we not only review matrix-matrix multiplication, but also get you to think about this operation in different ways.

Upon completion of this week, you should be able to

• Map matrices to memory;

• Apply conventions to describe how to index into arrays that store matrices;

• Observe the effects of loop order on performance;

• Recognize that simple implementations may not provide the performance that can be achieved;

• Realize that compilers don't automagically do all the optimization for you.