###### Homework 1.2.1

Let the following picture represent data stored in memory starting at address A:

and let \(A \) be the \(2 \times 3 \) matrix stored there in column-major order. Then

In this section, learners find out

How to map matrices to memory.

Conventions we will use to describe how we index into the arrays that store matrices.

That loop order affects performance.

The learner is left wondering "why?"

Matrices are stored in two-dimensional arrays while computer memory is inherently one-dimensional in the way it is addressed. So, we need to agree on how we are going to store matrices in memory.

Consider the matrix

\begin{equation*}
\left(\begin{array}{rrr} 1 \amp -2 \amp 2\\ -1 \amp 1 \amp 3\\ -2 \amp 2 \amp -1 \end{array}\right)
\end{equation*}

from the opener of this week. In memory, this may be stored in an array A by columns:

\begin{equation*}
\begin{array}{ l r r}
{\tt A} \amp\longrightarrow\amp 1 \\
{\tt A[1]}\amp \longrightarrow\amp -1 \\
{\tt A[2]} \amp\longrightarrow\amp -2
\\
{\tt A[3]}\amp \longrightarrow\amp -2 \\
~~\vdots\amp\amp 1 \\
\amp\amp 2 \\
\amp\amp 2 \\
\amp\amp 3 \\
\amp\amp -1 \\
\end{array}
\end{equation*}

which is known as column-major ordering.

More generally, consider the matrix

\begin{equation*}
\left(\begin{array}{c c c c}
\alpha_{0,0} \amp \alpha_{0,1} \amp \cdots \amp \alpha_{0,k-1} \\
\alpha_{1,0} \amp \alpha_{1,1} \amp \cdots \amp \alpha_{1,k-1} \\
\vdots \amp \vdots \amp \amp \vdots \\
\alpha_{m-1,0} \amp \alpha_{m-1,1} \amp \cdots \amp \alpha_{m-1,k-1}
\end{array}
\right).
\end{equation*}

Column-major ordering would store this in array A as illustrated in

Obviously, one could use the alternative known as row-major ordering.Let the following picture represent data stored in memory starting at address A:

\begin{equation*}
\begin{array}{ l r r}
{\tt A} \amp\longrightarrow\amp 1 \\
\amp\amp -1 \\
\amp\amp -2
\\
\amp\amp -2 \\
\amp\amp 1 \\
\amp\amp 2 \\
\amp\amp 2 \\
\end{array}
\end{equation*}

and let \(A \) be the \(2 \times 3 \) matrix stored there in column-major order. Then

\begin{equation*}
A =
\end{equation*}

Answer

\begin{equation*}
A =
\left(\begin{array}{rrr}
1 \amp -2 \amp 1 \\
-1 \amp -2 \amp 2
\end{array}
\right)
\end{equation*}

Let the following picture represent data stored in memory starting at address A.

\begin{equation*}
\begin{array}{ l r r}
{\tt A} \amp\longrightarrow\amp 1 \\
\amp\amp -1 \\
\amp\amp -2
\\
\amp\amp -2 \\
\amp\amp 1 \\
\amp\amp 2 \\
\amp\amp 2 \\
\end{array}
\end{equation*}

and let \(A \) be the \(2 \times 3 \) matrix stored there in row-major order. Then

\begin{equation*}
A =
\end{equation*}

Answer

\begin{equation*}
A =
\left(\begin{array}{rrr}
1 \amp -1 \amp -2 \\
-2 \amp 1 \amp 2
\end{array}
\right)
\end{equation*}

Very frequently, we will work with a matrix that is a submatrix of a larger matrix. Consider Figure 1.2.2.

What we depict there is a matrix that is embedded in a larger matrix. The larger matrix consists of ldA (the leading dimension) rows and some number of columns. If column-major order is used to store the larger matrix, then addressing the elements in the submatrix requires knowledge of the leading dimension of the larger matrix. In the C programming language, if the top-left element of the submatrix is stored at address A, then one can address the \((i,j) \) element as A[ j*ldA + i ]. We typically define a macro that makes addressing the elements more natural:#define alpha(i,j) A[ (j)*ldA + (i) ]where we assume that the variable or constant ldA holds the leading dimension parameter.

Consider the matrix

\begin{equation*}
\left( \begin{array}{r r r r r}
0.0 \amp 0.1 \amp 0.2 \amp 0.3 \\
1.0 \amp \color{red} {1.1} \amp \color{red} {1.2} \amp \color{red} {1.3}
\\
2.0 \amp \color{red} {2.1} \amp \color{red} {2.2} \amp \color{red} {2.3} \\
3.0 \amp 3.1 \amp 3.2 \amp 3.3 \\
4.0 \amp 4.1 \amp 4.2 \amp 4.3 \\
\end{array}
\right)
\end{equation*}

If this matrix is stored in column-major order in a linear array A,

The submatrix highlighted in red starts at A[...].

The row size of the boxed submatrix is ....

The column size of the boxed submatrix is ....

The leading dimension of the boxed submatrix is ....

Solution

The submatrix highlighted in red starts at A[ 6 ].

The row size of the boxed submatrix is 2.

The column size of the boxed submatrix is 3.

The leading dimension of the boxed submatrix is 5.

When we talk about loops for matrix-matrix multiplication, it helps to keep in mind the picture

which illustrates which loop index (variable name) is used for what row or column of the matrices. We try to be consistent in this use, as should you.Consider again a simple algorithm for computing \(C := A B + C \text{.}\)

\begin{equation*}
\begin{array}{l}
{\bf for~} i := 0, \ldots , m-1 \\
~~~ {\bf for~} j := 0, \ldots , n-1 \\
~~~~~~ {\bf for~} p := 0, \ldots , k-1 \\
~~~~~~~~~ \gamma_{i,j} := \alpha_{i,p} \beta_{p,j} + \gamma_{i,j} \\
~~~~~~ {\bf end} \\
~~~ {\bf end} \\
{\bf end}
\end{array}
\end{equation*}

Given that we now embrace the convention that \(i \) indexes rows of \(C \) and \(A \text{,}\) \(j \) indexes columns of \(C \) and \(B \text{,}\) and \(p \) indexes "the other dimension," we can call this the IJP ordering of the loops around the assignment statement.

The order in which the elements of \(C \) are updated, and the order in which terms of

\begin{equation*}
\alpha_{i,0} \beta_{0,j} +
\cdots + \alpha_{i,k-1} \beta_{k-1,j}
\end{equation*}

are added to \(\gamma_{i,j} \) is mathematically equivalent. (There would be a possible change in the result due to round-off errors when floating point arithmetic occurs, but this is ignored.) What this means is that the three loops can be reordered without changing the result. That is, in exact arithmetic, the result does not change. However, computation is typically performed in floating point arithmetic, in which case roundoff error may accumulate slightly differently, depending on the order of the loops.

The IJP ordering is one possible ordering of the loops. How many distinct reorderings of those loops are there?

Answer

There are three choices for the outer-most loop: \(i \text{,}\) \(j \text{,}\) or \(p \text{.}\) Once a choice is made for the outer-most loop, there are two choices left for the second loop. Once that choice is made, there is only one choice left for the inner-most loop. Thus, there are \(3! = 3 \times 2 \times 1 = 6 \) loop orderings.

In directory Assignments/Week1/C make copies of `Assignments/Week1/C/Gemm_IJP.c`

into files with names that reflect the different loop orderings (Gemm_IPJ.c, etc.). Next, make the necessary changes to the loops in each file to reflect the ordering encoded in its name. Test the implementions by executing

make IPJ make JIP ...for each of the implementations and view the resulting performance by making the indicated changes to the Live Script in

`Assignments/Week1/C/data/Plot_All_Orderings.mlx}`

(Alternatively, use the script in `Assignments/Week1/C/data/Plot_All_Orderings_m.mlx`

If you have implemented them all, you can test them all by executing make All_Orderings

In Figure 1.2.3, the results of Homework 1.2.5 on Robert's laptop are reported. What do the two loop orderings that result in the best performances have in common?

Answer

They both have the loop indexed with \(i \) as the inner-most loop.

What is obvious is that ordering the loops matters. Changing the order changes how data in memory are accessed, and that can have a considerable impact on the performance that is achieved. The bottom graph includes a plot of the performance of the reference implementation and scales the y-axis so that the top of the graph represents the theoretical peak of a single core of the processor. What it demonstrates is that there is a lot of room for improvement.