## Section 3.2 Leveraging the Caches

### Unit 3.2.1 Adding cache memory into the mix

We now refine our model of the processor slightly, adding one cache memory into the mix.

• Our processor has only one core.

• That core has three levels of memory: registers, a cache memory, and main memory.

• Moving data between the cache and registers takes time $\beta_{C\leftrightarrow R}$ per double, while moving it between main memory and the cache takes time $\beta_{M\leftrightarrow C}$ per double.

• The registers can hold 64 doubles.

• The cache memory can hold one or more smallish matrices.

• Performing a flop with data in registers takes time $\gamma_R \text{.}$

• Data movement and computation cannot overlap.

The idea now is to figure out how to block the matrices into submatrices and then compute while these submatrices are in cache to avoid having to access memory more than necessary.

A naive approach partitions $C \text{,}$ $A \text{,}$ and $B$ into (roughly) square blocks:

\begin{equation*} C = \left( \begin{array}{c | c | c | c } C_{0,0} \amp C_{0,1} \amp \cdots \amp C_{0,N-1} \\ \hline C_{1,0} \amp C_{1,1} \amp \cdots \amp C_{1,N-1} \\ \hline \vdots \amp \vdots \amp \amp \vdots \\ \hline C_{M-1,0} \amp C_{M-1,1} \amp \cdots \amp C_{M-1,N-1} \end{array} \right), \end{equation*}
\begin{equation*} A = \left( \begin{array}{c | c | c | c } A_{0,0} \amp A_{0,1} \amp \cdots \amp A_{0,K-1} \\ \hline A_{1,0} \amp A_{1,1} \amp \cdots \amp A_{1,K-1} \\ \hline \vdots \amp \vdots \amp \amp \vdots \\ \hline A_{M-1,0} \amp A_{M-1,1} \amp \cdots \amp A_{M-1,K-1} \end{array} \right), \end{equation*}

and

\begin{equation*} B = \left( \begin{array}{c | c | c | c } B_{0,0} \amp B_{0,1} \amp \cdots \amp B_{0,N-1} \\ \hline B_{1,0} \amp B_{1,1} \amp \cdots \amp B_{1,N-1} \\ \hline \vdots \amp \vdots \amp \amp \vdots \\ \hline B_{K-1,0} \amp B_{K-1,1} \amp \cdots \amp B_{K-1,N-1} \end{array} \right), \end{equation*}

where $C_{i,j}$ is $m_C \times n_C \text{,}$ $A_{i,p}$ is $m_C \times k_C \text{,}$ and $B_{p,j}$ is $k_C \times n_C \text{.}$ Then

\begin{equation*} C_{i,j} := \sum_{p=0}^{K-1} A_{i,p} B_{p,j} + C_{i,j}, \end{equation*}

which can be written as the triple-nested loop

\begin{equation*} \begin{array}{l} {\bf for~} i := 0, \ldots , M-1 \\[0.05in] ~~~ \left. \begin{array}{l} {\bf for~} j := 0, \ldots , N-1 \\[0.05in] ~~~ ~~~ \left. \begin{array}{l} {\bf for~} p := 0, \ldots , K-1 \\ ~~~ ~~~ C_{i,j} := A_{i,p} B_{p,j} + C_{i,j} \\ {\bf end} \end{array} \right. \\[0.1in] ~~~ {\bf end} \end{array} \right. \\[0.1in] {\bf end} \end{array} \end{equation*}

which is one of $3! = 6$ possible loop orderings.
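A sketch in C of this blocked algorithm may help make the loop structure concrete. It assumes column-major storage with leading dimensions (as in the assignments), tiny illustrative block sizes, and matrix dimensions that are multiples of those sizes; `Gemm_block` and `Gemm_blocked` are our names for this sketch, not routines from the assignments.

```c
#include <assert.h>

#define alpha(i,j) A[(j)*ldA + (i)]  /* map (i,j) into column-major A */
#define beta(i,j)  B[(j)*ldB + (i)]
#define gamma(i,j) C[(j)*ldC + (i)]

/* Update one block: C_ij := A_ip B_pj + C_ij, computed naively. */
static void Gemm_block(int m, int n, int k,
                       const double *A, int ldA,
                       const double *B, int ldB,
                       double *C, int ldC)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            for (int p = 0; p < k; p++)
                gamma(i, j) += alpha(i, p) * beta(p, j);
}

/* Blocked C := A B + C with the IJP ordering of the block loops.
   MC, NC, KC are illustrative block sizes; m, n, and k are assumed
   to be multiples of them to keep the sketch simple. */
void Gemm_blocked(int m, int n, int k,
                  const double *A, int ldA,
                  const double *B, int ldB,
                  double *C, int ldC)
{
    const int MC = 4, NC = 4, KC = 4;        /* tiny blocks for the sketch */
    for (int i = 0; i < m; i += MC)          /* i: block row of C          */
        for (int j = 0; j < n; j += NC)      /* j: block column of C       */
            for (int p = 0; p < k; p += KC)  /* p: block of the k dimension */
                Gemm_block(MC, NC, KC,
                           &alpha(i, p), ldA,
                           &beta(p, j), ldB,
                           &gamma(i, j), ldC);
}
```

Permuting the three block loops yields the other five orderings.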

If we choose $m_C\text{,}$ $n_C \text{,}$ and $k_C$ such that $C_{i,j} \text{,}$ $A_{i,p} \text{,}$ and $B_{p,j}$ all fit in the cache, then we meet our conditions. We can then compute $C_{i,j} := A_{i,p} B_{p,j} + C_{i,j}$ by "bringing these blocks into the cache" and computing with them before writing out the result, as before. The difference is that, while one can explicitly load registers, the movement of data into the cache can only be encouraged by careful ordering of the computation: replacement of data in the cache is handled by the hardware, which applies a cache replacement policy, typically one in which the "least recently used" data gets evicted.

For reasons that will become clearer later, we are going to assume that "the cache" in our discussion is the L2 cache, which for the Intel Haswell processor has a size of 256 Kbytes. If we assume all three (square) matrices fit in that cache during the computation, what size should they be?

In our later discussions, 12 becomes a magic number (from the $12 \times 4$ microkernel) and therefore we should pick the size of the block to be the largest multiple of 12 less than the size determined above. What is it?

Later we will further tune this parameter.

Create all six loop orderings by copying Assignments/Week3/C/Gemm_IJP_JI_12x4Kernel.c into

Gemm_IPJ_JI_12x4Kernel.c
Gemm_JIP_JI_12x4Kernel.c
Gemm_JPI_JI_12x4Kernel.c
Gemm_PIJ_JI_12x4Kernel.c
Gemm_PJI_JI_12x4Kernel.c


reordering the loops as indicated by the XYZ in Gemm_XYZ_JI_12x4Kernel.c. Collect performance data by executing

make XYZ_JI_12x4Kernel


for ${\tt XYZ} \in \{ {\tt IPJ},{\tt JIP},{\tt JPI},{\tt PIJ},{\tt PJI} \} \text{.}$ (If you don't want to do them all, implement at least Gemm_PIJ_JI_12x4Kernel.c.)

The resulting performance can be viewed with Live Script Assignments/Week3/C/data/Plot_XYZ_JI_MRxNR.mlx.

The performance attained by our implementation is shown in Figure~\ref{fig-JI_Kernel_FourxFour}~(top-right).

### Unit 3.2.2 Streaming submatrices of $C$ and $B$

We illustrate the execution of Gemm_JI_4x4Kernel with one set of three submatrices $C_{i,j} \text{,}$ $A_{i,p} \text{,}$ and $B_{p,j}$ in Figure 3.2.7 for the case where $m_R = n_R = 4$ and $m_C = n_C = k_C = 4 \times m_R \text{.}$ What we notice is that each of the $m_R \times n_R$ submatrices of $C_{i,j}$ is not reused once it has been updated. This means that at any given time only one such submatrix needs to be in the cache memory (as well as in registers). Similarly, a panel of $B_{p,j}$ is not reused and hence only one such panel needs to be in the cache. It is submatrix $A_{i,p}$ that must remain in the cache during the entire computation. By not keeping the entire submatrices $C_{i,j}$ and $B_{p,j}$ in the cache memory, the size of the submatrix $A_{i,p}$ can be increased, which can be expected to improve performance. We will refer to this as "streaming" "micro-blocks" of $C_{i,j}$ and "micro-panels" of $B_{p,j} \text{,}$ while keeping all of block $A_{i,p}$ in the cache.

###### Remark 3.2.8

We could have chosen a different order of the loops, and then we may have concluded that submatrix $B_{p,j}$ needs to stay in cache while streaming micro-blocks of $C$ and small panels of $A\text{.}$ Obviously, there is a symmetry here and hence we will just focus on the case where $A_{i,p}$ is kept in cache.

The second observation is that now that $C_{i,j}$ and $B_{p,j}$ are being streamed, there is no need for those submatrices to be square. Furthermore, for Gemm_PIJ_JI_12x4Kernel.c and Gemm_IPJ_JI_12x4Kernel.c, the larger $n_C \text{,}$ the more effectively the cost of bringing $A_{i,p}$ into the cache is amortized over computation: we should pick $n_C = n \text{.}$ (Later, we will reintroduce $n_C \text{.}$) Another way to think about this is that if we choose the PIJ loop around the JI ordering around the microkernel (in other words, if we consider the Gemm_PIJ_JI_12x4Kernel implementation), then we notice that the inner loop of PIJ (indexed with J) matches the outer loop of the JI double loop. Those two loops can hence be combined, leaving us with an implementation that is a double loop (PI) around the double loop JI.
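The fused structure might be sketched in C as follows. The block sizes here are tiny stand-ins chosen so the sketch is easy to test (the course uses $m_R = 12 \text{,}$ $n_R = 4 \text{,}$ and tuned $m_C \text{,}$ $k_C$), the microkernel is a scalar placeholder rather than vectorized code, and dimensions are assumed to be multiples of the block sizes:

```c
#include <assert.h>

#define alpha(i,j) A[(j)*ldA + (i)]  /* column-major indexing */
#define beta(i,j)  B[(j)*ldB + (i)]
#define gamma(i,j) C[(j)*ldC + (i)]

/* Tiny illustrative block sizes, not the course's tuned values. */
#define MC 8
#define KC 8
#define MR 4
#define NR 4

/* Scalar stand-in for the microkernel: an MR x NR microtile of C is
   updated with an MR x k micro-panel of A times a k x NR micro-panel of B. */
static void MicroKernel(int k, const double *A, int ldA,
                        const double *B, int ldB, double *C, int ldC)
{
    for (int p = 0; p < k; p++)
        for (int j = 0; j < NR; j++)
            for (int i = 0; i < MR; i++)
                gamma(i, j) += alpha(i, p) * beta(p, j);
}

/* PI double loop around the JI double loop around the microkernel: the
   inner J loop of PIJ has been fused with the outer J loop of JI, so the
   J loop runs over all of n (that is, n_C = n). */
void Gemm_PI_JI(int m, int n, int k,
                const double *A, int ldA,
                const double *B, int ldB,
                double *C, int ldC)
{
    for (int p = 0; p < k; p += KC)       /* P: partition the k dimension */
        for (int i = 0; i < m; i += MC)   /* I: partition the m dimension */
            for (int j = 0; j < n; j += NR)             /* fused J loop   */
                for (int ii = i; ii < i + MC; ii += MR) /* I loop of JI   */
                    MicroKernel(KC, &alpha(ii, p), ldA,
                                &beta(p, j), ldB, &gamma(ii, j), ldC);
}
```

Note how the $m_C \times k_C$ block of $A$ indexed by (i, p) is revisited for every j, which is what encourages it to stay in the cache.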

Execute

make PI_JI_MCxKC


and view the performance results with Live Script Assignments/Week3/C/data/Plot_MC_KC_Performance.mlx.

This experiment tries many different choices for $m_C$ and $k_C \text{,}$ and presents them as a scatter graph so that the optimal choice can be determined and visualized. (It will take a long time to execute. Go take a nap!)

The result of this last homework on Robert's laptop is shown in Figure 3.2.11.

To summarize, our insights suggest

1. Bring an $m_C \times k_C$ submatrix of $A$ into the cache, at a cost of $m_C \times k_C$ memops. (It is the order of the loops and instructions that encourages the submatrix of $A$ to stay in cache.) This cost is amortized over $2 m_C n k_C$ flops (since $n_C = n$), for a ratio of $2 m_C n k_C / (m_C k_C) = 2 n \text{.}$ The larger $n \text{,}$ the better.

2. The cost of reading a $k_C \times n_R$ submatrix of $B$ is $k_C \times n_R$ memops, amortized over $2 m_C n_R k_C$ flops, for a ratio of $2 m_C n_R k_C / (k_C n_R) = 2 m_C \text{.}$ Obviously, the greater $m_C \text{,}$ the better.

3. The cost of reading and writing an $m_R \times n_R$ submatrix of $C$ is now amortized over $2 m_R n_R k_C$ flops, for a ratio of $2 m_R n_R k_C / (2 m_R n_R) = k_C \text{.}$ Obviously, the greater $k_C \text{,}$ the better.

Item 2 and Item 3 suggest that the $m_C \times k_C$ submatrix $A_{i,p}$ be roughly square. (This comes with a caveat. See the video!)
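To get a feel for the magnitudes involved, the three ratios can be evaluated for sample sizes; here $m_C = k_C = 96 \text{,}$ $m_R = 12 \text{,}$ $n_R = 4 \text{,}$ and $n = 960$ are illustrative choices of ours, not prescribed values:

```c
#include <assert.h>

/* The three flops-per-memop ratios derived above, evaluated directly
   so the cancellations can be checked numerically. */
int ratio_A(int mC, int n, int kC)  { return 2 * mC * n  * kC / (mC * kC);     } /* = 2n  */
int ratio_B(int mC, int nR, int kC) { return 2 * mC * nR * kC / (kC * nR);     } /* = 2mC */
int ratio_C(int mR, int nR, int kC) { return 2 * mR * nR * kC / (2 * mR * nR); } /* = kC  */
```

For these choices, the blocks of $A \text{,}$ micro-panels of $B \text{,}$ and microtiles of $C$ achieve ratios of 1920, 192, and 96 flops per memop, respectively, which shows why the traffic for $A$ is the cheapest to amortize.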

If we revisit the performance data plotted in Figure 3.2.11, we notice that matrix $A_{i,p}$ fits in part of the L2 cache, but is too big for the L1 cache. What this means is that the bandwidth between the L2 cache and the registers is high enough for these blocks of $A$ to reside in the L2 cache. Because the L2 cache is larger than the L1 cache, $m_C$ and $k_C$ can be larger, which benefits performance.

###### Remark 3.2.12
• An $m_R \times n_R$ submatrix of $C$ that is being updated we will call a microtile.

• The $m_R \times k_C$ submatrix of $A$ and $k_C \times n_R$ submatrix of $B$ we will call micro-panels.

• The routine that updates a microtile by multiplying two micro-panels we will call the microkernel.

### Unit 3.2.3

The last section motivated Figure 3.2.13, which describes one way of blocking for multiple levels of cache. While some details still remain, this brings us very close to how matrix-matrix multiplication is implemented in practice in libraries that are widely used. The approach illustrated there was first suggested by Kazushige Goto~\cite{Goto1} and is referred to as the Goto approach (to matrix-matrix multiplication) or GotoBLAS approach, since he incorporated it in an implementation of the Basic Linear Algebra Subprograms (BLAS) by that name.

Each nested box represents a single loop that partitions one of the three dimensions ($m\text{,}$ $n\text{,}$ or $k$). The submatrices $C_{i,j} \text{,}$ $A_{i,p} \text{,}$ and $B_{p,j}$ are those already discussed in Section~\ref{sec:towards}.

• The roughly square matrix $A_{i,p}$ is "placed" in the L2 cache. If data from $A_{i,p}$ can be fetched from the L2 cache fast enough, it is better to place it there since there is an advantage to making $m_C$ and $k_C$ larger and the L2 cache is larger than the L1 cache.

• Our analysis in Section~\ref{sec:towards} also suggests that $k_C$ be chosen to be large. Since the submatrix $B_{p,j}$ is reused multiple times for many iterations of the "fourth loop around the microkernel" it may be beneficial to choose $n_C$ so that $k_C \times n_C$ submatrix $B_{p,j}$ stays in the L3 cache. Later we will see there are other benefits to this.

• In this scheme, the $m_R \times n_R$ micro-block of $C_{i,j}$ ends up in registers and the $k_C \times n_R$ "micro-panel" of $B_{p,j}$ ends up in the L1 cache.

Data in the L1 cache typically are also in the L2 and L3 caches. The update of the $m_R \times n_R$ micro-block of $C$ also brings that data into the various caches. Thus, the described situation is similar to what we described in Section~\ref{sec:towards}, where "the cache" discussed there corresponds to the L2 cache here. As our discussion proceeds in Week 3, we will try to make this all more precise.

Which of the implementations

• Gemm_IJP_JI_12x4Kernel.c

• Gemm_JPI_JI_12x4Kernel.c

• Gemm_PJI_JI_12x4Kernel.c

best captures the loop structure illustrated in Figure 3.2.13?

We always advocate, when it does not substantially impede performance, instantiating an algorithm in code in a way that closely resembles how one explains it. In this spirit, the algorithm described in Figure 3.2.13 can be coded by making each loop (box) in the figure a separate routine. An outline of how this might be accomplished is given in Figure 3.2.15 and file Assignments/Week3/C/Gemm_Five_Loops_4x4Kernel.c. Complete the code and execute it with \begin{center} \tt make Five_Loops_4x4Kernel \end{center} Then, view the performance with Live Script Assignments/Week3/C/data/Plot_Five_Loops.mlx.
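As a rough sketch of the overall structure (not the assignment's routine-per-loop organization), the five loops can also be written as one nested loop structure. As before, the block sizes are tiny stand-ins, the microkernel is a scalar placeholder, and dimensions are assumed to be multiples of the block sizes; the comments indicate which level of the memory hierarchy each loop targets:

```c
#include <assert.h>

#define alpha(i,j) A[(j)*ldA + (i)]  /* column-major indexing */
#define beta(i,j)  B[(j)*ldB + (i)]
#define gamma(i,j) C[(j)*ldC + (i)]

/* Tiny illustrative block sizes; real implementations tune these. */
#define NC 8
#define KC 8
#define MC 8
#define NR 4
#define MR 4

/* Scalar stand-in for the microkernel. */
static void MicroKernel(int k, const double *A, int ldA,
                        const double *B, int ldB, double *C, int ldC)
{
    for (int p = 0; p < k; p++)
        for (int j = 0; j < NR; j++)
            for (int i = 0; i < MR; i++)
                gamma(i, j) += alpha(i, p) * beta(p, j);
}

void Gemm_Five_Loops(int m, int n, int k,
                     const double *A, int ldA,
                     const double *B, int ldB,
                     double *C, int ldC)
{
    for (int j = 0; j < n; j += NC)          /* loop 5: partition n                  */
        for (int p = 0; p < k; p += KC)      /* loop 4: B_{p,j} targets the L3 cache */
            for (int i = 0; i < m; i += MC)  /* loop 3: A_{i,p} targets the L2 cache */
                for (int jj = j; jj < j + NC; jj += NR)     /* loop 2: micro-panel of B in L1      */
                    for (int ii = i; ii < i + MC; ii += MR) /* loop 1: microtile of C in registers */
                        MicroKernel(KC, &alpha(ii, p), ldA,
                                    &beta(p, jj), ldB, &gamma(ii, jj), ldC);
}
```

The figure's version packs submatrices into contiguous buffers between loops, a detail this sketch omits.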