Skip to main content

Section 3.2 Leveraging the Caches

Unit 3.2.1 Adding cache memory into the mix

We now refine our model of the processor slightly, adding one cache memory into the mix.

  • Our processor has only one core.

  • That core has three levels of memory: registers, a cache memory, and main memory.

  • Moving data between the cache and registers takes time \(\beta_{C\leftrightarrow R} \) per double while moving it between main memory and the cache takes time \(\beta_{M\leftrightarrow C} \)

  • The registers can hold 64 doubles.

  • The cache memory can hold one or more smallish matrices.

  • Performing a flop data in registers takes time \(\gamma_R \text{.}\)

  • Data movement and computation cannot overlap.

The idea now is to figure out how to block the matrices into submatrices and then compute while these submatrices are in cache to avoid having to access memory more than necessary.

A naive approach partitions \(C \text{,}\) \(A \text{,}\) and \(B \) into (roughly) square blocks:

\begin{equation*} C = \left( \begin{array}{c | c | c | c } C_{0,0} \amp C_{0,1} \amp \cdots \amp C_{0,N-1} \\ \hline C_{1,0} \amp C_{1,1} \amp \cdots \amp C_{1,N-1} \\ \hline \vdots \amp \vdots \amp \amp \vdots \\ \hline C_{M-1,0} \amp C_{M,1} \amp \cdots \amp C_{M-1,N-1} \end{array} \right), \end{equation*}
\begin{equation*} A = \left( \begin{array}{c | c | c | c } A_{0,0} \amp A_{0,1} \amp \cdots \amp A_{0,K-1} \\ \hline A_{1,0} \amp A_{1,1} \amp \cdots \amp A_{1,K-1} \\ \hline \vdots \amp \vdots \amp \amp \vdots \\ \hline A_{M-1,0} \amp A_{M,1} \amp \cdots \amp A_{M-1,K-1} \end{array} \right), \end{equation*}


\begin{equation*} B = \left( \begin{array}{c | c | c | c } B_{0,0} \amp B_{0,1} \amp \cdots \amp B_{0,N-1} \\ \hline B_{1,0} \amp B_{1,1} \amp \cdots \amp B_{1,N-1} \\ \hline \vdots \amp \vdots \amp \amp \vdots \\ \hline B_{K-1,0} \amp B_{K,1} \amp \cdots \amp B_{K-1,N-1} \end{array} \right), \end{equation*}

where \(C_{i,j} \) is \(m_C \times n_C \text{,}\) \(A_{i,p} \) is \(m_C \times k_C \text{,}\) and \(B_{p,j} \) is \(k_C \times n_C \text{.}\) Then

\begin{equation*} C_{i,j} := \sum_{p=0}^{K-1} A_{i,p} B_{p,j} + C_{i,j}, \end{equation*}

which can be written as the triple-nested loop

\begin{equation*} \begin{array}{l} {\bf for~} i := 0, \ldots , M-1 \\[0.05in] ~~~ \left. \begin{array}{l} {\bf for~} j := 0, \ldots , N-1 \\[0.05in] ~~~ ~~~ \left. \begin{array}{l} {\bf for~} p := 0, \ldots , K-1 \\ ~~~ ~~~ C_{i,j} := A_{i,p} B_{p,j} + B_{i,j} \\ {\bf end} \end{array} \right. \\[0.1in] ~~~ {\bf end} \end{array} \right. \\[0.1in] {\bf end} \end{array} \end{equation*}

which is one of \(3! = 6 \) possible loop orderings.

If we choose \(m_C\text{,}\) \(n_C \text{,}\) and \(k_C \) such that \(C_{i,j} \text{,}\) \(A_{i,p} \text{,}\) and \(B_{p,j} \) all fit in the cache, then we meet our conditions. We can then compute \(C_{i,j} := A_{i,p} B_{p,j} + C_{i,j} \) by "bringing these blocks into cache" and computing with them before writing out the result, as before. The difference here is that while one can explicitly load registers, the movement of data into caches is merely encouraged by careful ordering of the computation, since replacement of data in cache is handled by the hardware, which has some cache replacement policy similar to "least recently used" data gets evicted.

#define MC 96
#define NC 96
#define KC 96

void MyGemm( int m, int n, int k, double *A, int ldA,
	     double *B, int ldB, double *C, int ldC )
  if ( m % MR != 0 || MC % MR != 0 ){
    printf( "m and MC must be multiples of MR\n" );
    exit( 0 );
  if ( n % NR != 0 || NC % NR != 0 ){
    printf( "n and NC must be multiples of NR\n" );
    exit( 0 );
  for ( int i=0; i<m; i+=MC ) {
    int ib = min( MC, m-i );        /* Last block may not be a full block */
    for ( int j=0; j<n; j+=NC ) {
      int jb = min( NC, n-j );        /* Last block may not be a full block */
      for ( int p=0; p<k; p+=KC ) {
        int pb = min( KC, k-p );        /* Last block may not be a full block */
          ( ib, jb, pb, &alpha( i,p ), ldA, &beta( p,j ), ldB, &gamma( i,j ), ldC );

void Gemm_JI_4x4Kernel( int m, int n, int k, double *A, int ldA,
                                  double *B, int ldB, double *C, int ldC )
  for ( int j=0; j<n; j+=NR ) /* n is assumed to be a multiple of NR */
    for ( int i=0; i<m; i+=MR ) /* m is assumed to be a multiple of MR */
      Gemm_4x4Kernel( k, &alpha( i,0 ), ldA, &beta( 0,j ), ldB, &gamma( i,j ), ldC );
Figure 3.2.1 Triple loop around \(C_{i,j} := A_{i,p} B_{p,j} + C_{i,j} \text{.}\)

For reasons that will become clearer later, we are going to assume that " the cache" in our discussion is the L2 cache, which for the Intel Haswell processor is of size 256Kbytes. If we assume all three (square) matrices fit in that cache during the computation, what size should they be?

In our later discussions 12 becomes a magic number (from the \(12 \times 4 \) microkernel) and therefore we should pick the size of the block to be the largest multiple of 12 less than that number. What is it?

Later we will further tune this parameter.

Create all six loop orderings by copying Assignments/Week3/C/Gemm_IJP_JI_12x4Kernel.c into


reordering the loops as indicated by the XYZ in Gemm_XYZ_JI_12x4Kernel.c. Collect performance data by executing

make XYZ_JI_12x4Kernel

for \({\tt XYZ} \in \{ {\tt IPJ},{\tt PIJ},{\tt PJI},{\tt PIJ},{\tt PJI} \} \text{.}\) (If you don't want to do them all, implement at least Gemm_PIJ_JI_12x4Kernel.c.)

The resulting performance can be viewed with Live Script Assignments/Week3/C/data/Plot_XYZ_JI_MRxNR.mlx.

The performance attained by our implementation is shown in Figure~\ref{fig-JI_Kernel_FourxFour}~(top-right).

Unit 3.2.2 Streaming submatrices of \(C \) and \(B \)

Figure 3.2.7 Illustration of computation in the cache with one block from each of \(A \text{,}\) \(B \text{,}\) and \(C \text{.}\)

We illustrate the execution of Gemm_JI_4x4Kernel with one set of three submatrices \(C_{i,j} \text{,}\) \(A_{i,p} \text{,}\) and \(B_{p,j} \) in Figure 3.2.7 for the case where \(m_R = n_R = 4 \) and \(m_C = n_C = k_C = 4 \times m_R \text{.}\) What we notice is that each the \(m_R \times n_R \) submatrices of \(C_{i,j} \) is not reused again once it has been updated. This means that at any given time only one such matrix needs to be in the cache memory (as well as registers). Similarly, a panel of \(B_{p,j} \) is not reused and hence only one such panel needs to be in cache. It is submatrix \(A_{i,p} \) that must remain in cache during the entire computation. By not keeping the entire submatrices \(C_{i,j} \) and \(B_{p,j} \) in the cache memory, the size of the submatrix \(A_{i,p} \) can be increased, which can be expected to improve performance. We will refer to this as "streaming" "micro-blocks" of \(C_{i,j} \) and "micro-panels" of \(B_{p,j} \text{,}\) while keeping all of block \(A_{i,p} \) in cache.

Remark 3.2.8

We could have chosen a different order of the loops, and then we may have concluded that submatrix \(B_{p,j} \) needs to stay in cache while streaming a micro-blocks of \(C \) and small panels of \(A\text{.}\) Obviously, there is a symmetry here and hence we will just focus on the case where \(A_{i,p} \) is kept in cache.

The second observation is that now that \(C_{i,j} \) and \(B_{p,j} \) are being streamed, there is no need for those submatrices to be square. Furthermore, for Gemm_PIJ_JI_12x4Kernel.c and Gemm_IPJ_JI_12x4Kernel.c the larger \(n_C \text{,}\) the more effectively the cost of bringing \(A_{i,p} \) into cache is amortized over computation: we should pick \(n_C = n \text{.}\) (Later, we will reintroduce \(n_C \text{.}\)) Another way of think of this is that if we choose the PIJ loop around the JI ordering around the microkernel (in other words, if we consider the Gemm_PIJ_JI_12x4Kernel implementation), then we notice that the inner loop of PIJ (indexed with J) matches the outer loop of the JI double loop, and hence those two loops can be combined, leaving us with an implementation that is a double loop (PI) around the double loop JI.


make PI_JI_MCxKC

and view the performance results with Live Script Assignments/Week3/C/data/Plot_MC_KC_Performance.mlx.

This experiment tries many different choices for \(m_C \) and \(k_C \text{,}\) and presents them as a scatter graph so that the optimal choice can be determined and visualized. (It will take a long time to execute. Go take a nap!) }

The result of this last homework on Robert's laptop is shown in Figure 3.2.11.

Figure 3.2.11 Performance when \(m_R = 12 \) and \(n_R = 4 \) and \(m_C \) and \(k_C \) are varied. The best performance is empirically determined by searching the results. We notice that \(\widetilde A \) is sized to fit in the L2 cache. As we apply additional optimizations, the optimal choice for \(m_C \) and \(k_C\) will change. Notice that on Robert's laptop, the optimal choice varies greatly from one experiment to the next. You may observe the same level of variability.

To summarize, our insights suggest

  1. Bring an \(m_C \times k_C \) submatrix of \(A \) into the cache, at a cost of \(m_C \times k_C \) memops. (It is the order of the loops and instructions that encourages the submatrix of \(A \) to stay in cache.) This cost is amortized over \(2 m_C n_C k_C\) flops for a ratio of \(2 m_C n k_C / (m_C k_C) = 2 n \text{.}\) The larger \(n \text{,}\) the better.

  2. The cost of reading an \(k_C \times n_R \) submatrix of \(B \) is amortized over \(2 m_C n_R k_C \) flops, for a ratio of \(2 m_C n_R k_C / (2 m_C n_R) = m_C \text{.}\) Obviously, the greater \(m_C \text{,}\) the better.

  3. The cost of reading and writing an \(m_R \times n_R \) submatrix of \(C \) is now amortized over \(2 m_R n_R k_C \) flops, for a ratio of \(2 m_R n_R k_C / (2 m_R n_R) = k_C \text{.}\) Obviously, the greater \(k_C \text{,}\) the better.

Item 2 and Item 3 suggest that \(m_C \times k_C \) submatrix \(A_{i,p} \) be roughly square. (This comes with a caviat. See video!)

If we revisit the performance data plotted in Figure 3.2.11, we notice that matrix \(A_{i,p} \) fits in part of the L2 cache, but is too big for the L1 cache. What this means is that the bandwidth between the L2 and registers is enough to allow these blocks of \(A \) to reside in the L2 cache. Since the L2 cache is larger than the L1 cache, this benefits performance, since the \(m_C \) and \(k_C \) can be larger.

Remark 3.2.12
  • A \(m_R \times n_R \) submatrix of \(C \) that is begin updated we will call a microtile.

  • The \(m_R \times k_C \) submatrix of \(A \) and \(k_C \times n_R \) submatrix of \(B \) we will call micro-panels.

  • The routine that updates a microtile by multiplying two micro-panels we will call the microkernel.

Unit 3.2.3

Figure adapted from [4].

Figure 3.2.13 Blocking for multiple levels of cache.

The last section motivated Figure 3.2.13, which describes one way for blocking for multiple levels of cache. While some details still remain, this brings us very close to how matrix-matrix multiplication is implemented in practice in libraries that are widely used. The approach illustrated there was first suggested by Kazushige Goto~\cite{Goto1} and is referred to as the Goto approach (to matrix-matrix multiplication) or GotoBLAS approach, since he incorporated it in an implementation of the Basic Linear Algebra Subprograms (BLAS) by that name.

Each nested box represents a single loop that partitions one of the three dimensions (\(m\text{,}\) \(n\text{,}\) or \(k \)). The submatrices \(C_{i,j} \text{,}\) \(A_{i,p} \text{,}\) and \(B_{p,j}\) are those already discussed in Section~\ref{sec:towards}.

  • The roughly square matrix \(A_{i,p} \) is "placed" in the L2 cache. If data from \(A_{i,p} \) can be fetched from the L2 cache fast enough, it is better to place it there since there is an advantage to making \(m_C \) and \(k_C \) larger and the L2 cache is larger than the L1 cache.

  • Our analysis in Section~\ref{sec:towards} also suggests that \(k_C \) be chosen to be large. Since the submatrix \(B_{p,j} \) is reused multiple times for many iterations of the "fourth loop around the microkernel" it may be beneficial to choose \(n_C \) so that \(k_C \times n_C \) submatrix \(B_{p,j} \) stays in the L3 cache. Later we will see there are other benefits to this.

  • In this scheme, the \(m_R \times n_R \) micro-block of \(C_{i,j} \) end up in registers and the \(k_C \times n_R \) "micro-panel" of \(B_{p,j} \) ends up in the L1 cache.

Data in the L1 cache typically are also in the L2 and L3 caches. The update of the \(m_R \times n_R \) micro-block of \(C\) also brings that in to the various caches. Thus, the described situation is similar to what we described in Section~\ref{sec:towards}, where "the cache" discussed there corresponds to the L2 cache here. As our discussion proceeds, in Week 3, we will try to make this all more precise.

Which of the implementations

  • Gemm_IJP_JI_12x4Kernel.c

  • Gemm_JPI_JI_12x4Kernel.c

  • Gemm_PJI_JI_12x4Kernel.c

best captures the loop structure illustrated in Figure 3.2.13?

void LoopFive( int m, int n, int k, double *A, int ldA,
		   double *B, int ldB, double *C, int ldC )
  for ( int j=0; j<n; j+=NC ) {
    int jb = min( NC, n-j );    /* Last loop may not involve a full block */
    LoopFour( m, jb, k, A, ldA, &beta(  ,   ), ldB, &gamma(   ,   ), ldC );

void LoopFour( int m, int n, int k, double *A, int ldA, 
	       double *B, int ldB, double *C, int ldC )
  for ( int p=0; p<k; p+=KC ) {
    int pb = min( KC, k-p );    /* Last loop may not involve a full block */
    LoopThree( m, n, pb, &alpha(  ,   ), ldA, &beta(  ,   ), ldB, C, ldC );

void LoopThree( int m, int n, int k, double *A, int ldA, 
		double *B, int ldB, double *C, int ldC )
  for ( int i=0; i<m; i+=MC ) {
    int ib = min( MC, m-i );    /* Last loop may not involve a full block */
    LoopTwo( ib, n, k, &alpha(  ,   ), ldA, B, ldB, &gamma(   ,   ), ldC );

void LoopTwo( int m, int n, int k, double *A, int ldA,
	      double *B, int ldB, double *C, int ldC )
  for ( int j=0; j<n; j+=NR ) {
    int jb = min( NR, n-j );
    LoopOne( m, jb, k, A, ldA, &beta(   ,   ), ldB, &gamma(   ,   ), ldC );

void LoopOne( int m, int n, int k, double *A, int ldA,
	      double *B, int ldB, double *C, int ldC )
  for ( int i=0; i<m; i+=MR ) {
    int ib = min( MR, m-i );
    Gemm_4x4Kernel( k, &alpha(   ,    ), ldA, B, ldB, &gamma(   ,   ), ldC );
Figure 3.2.15 Blocking for multiple levels of cache.

We always advocate, when it does not substantially impede performance, to instantiate an algorithm in code in a way that closely resembles how one explains it. In this spirit, the algorithm described in Figure 3.2.13 can be coded by making each loop (box) in the figure a separate routine. An outline of how this might be accomplished is given in Figure 3.2.15 and file Assignments/Week3/C/Gemm_Five_Loops_4x4Kernel.c. Complete the code and execute it with \begin{center} \tt make Five_Loops_4x4Kernel \end{center} Then, view the performance with Live Script Assignments/Week3/C/data/Plot_Five_Loops.mlx. }