
Unit 2.2.1 Basic idea

Key to understanding how we are going to optimize matrix-matrix multiplication is to understand blocked matrix-matrix multiplication (the multiplication of matrices that have been partitioned into submatrices).

Consider \(C := A B + C \text{.}\) In terms of the elements of the matrices, this means that each \(\gamma_{i,j} \) is computed by

\begin{equation*} \gamma_{i,j} := \sum_{p=0}^{k-1} \alpha_{i,p} \beta_{p,j} + \gamma_{i,j} . \end{equation*}
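As a point of reference, this computation can be expressed directly as three nested loops. Below is a minimal sketch in C, assuming the column-major storage and leading-dimension convention used by the routines later in this unit; the name Gemm_IJP follows the course's loop-ordering naming scheme and is introduced here only for illustration.

void Gemm_IJP( int m, int n, int k, double *A, int ldA,
               double *B, int ldB, double *C, int ldC )
{
  /* Compute C := A B + C one element at a time: entry ( i,j ) of C
     accumulates the inner product of row i of A with column j of B.
     Column-major storage is assumed, so element ( i,j ) of a matrix X
     with leading dimension ldX is X[ (j)*ldX + (i) ]. */
  for ( int i=0; i<m; i++ )
    for ( int j=0; j<n; j++ )
      for ( int p=0; p<k; p++ )
        C[ (j)*ldC + (i) ] += A[ (p)*ldA + (i) ] * B[ (j)*ldB + (p) ];
}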

Now, partition the matrices into submatrices:

\begin{equation*} C = \left( \begin{array}{c | c | c | c } C_{0,0} & C_{0,1} & \cdots & C_{0,N-1} \\ \hline C_{1,0} & C_{1,1} & \cdots & C_{1,N-1} \\ \hline \vdots & \vdots & & \vdots \\ \hline C_{M-1,0} & C_{M-1,1} & \cdots & C_{M-1,N-1} \end{array} \right), \end{equation*}
\begin{equation*} A = \left( \begin{array}{c | c | c | c } A_{0,0} & A_{0,1} & \cdots & A_{0,K-1} \\ \hline A_{1,0} & A_{1,1} & \cdots & A_{1,K-1} \\ \hline \vdots & \vdots & & \vdots \\ \hline A_{M-1,0} & A_{M-1,1} & \cdots & A_{M-1,K-1} \end{array} \right), \end{equation*}

and

\begin{equation*} B = \left( \begin{array}{c | c | c | c } B_{0,0} & B_{0,1} & \cdots & B_{0,N-1} \\ \hline B_{1,0} & B_{1,1} & \cdots & B_{1,N-1} \\ \hline \vdots & \vdots & & \vdots \\ \hline B_{K-1,0} & B_{K-1,1} & \cdots & B_{K-1,N-1} \end{array} \right), \end{equation*}

where \(C_{i,j} \) is \(m_i \times n_j \text{,}\) \(A_{i,p} \) is \(m_i \times k_p \text{,}\) and \(B_{p,j} \) is \(k_p \times n_j \text{,}\) with \(\sum_{i=0}^{M-1} m_i = m \text{,}\) \(\sum_{j=0}^{N-1} n_j = n \text{,}\) and \(\sum_{p=0}^{K-1} k_p = k \text{.}\) Then

\begin{equation*} C_{i,j} := \sum_{p=0}^{K-1} A_{i,p} B_{p,j} + C_{i,j} . \end{equation*}
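For example, if \(M = N = K = 2 \text{,}\) then the top-left block of \(C \) is updated by

\begin{equation*} C_{0,0} := A_{0,0} B_{0,0} + A_{0,1} B_{1,0} + C_{0,0} , \end{equation*}

which is exactly how \(\gamma_{0,0} \) would be computed if the blocks were scalars.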
Remark 2.2.2.

In other words, provided the partitioning of the matrices is "conformal," matrix-matrix multiplication with partitioned matrices proceeds exactly as does matrix-matrix multiplication with the elements of the matrices except that the individual multiplications with the submatrices do not necessarily commute.
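The last point matters even for \(2 \times 2 \) matrices. For instance,

\begin{equation*} \left( \begin{array}{c c} 0 & 1 \\ 0 & 0 \end{array} \right) \left( \begin{array}{c c} 0 & 0 \\ 1 & 0 \end{array} \right) = \left( \begin{array}{c c} 1 & 0 \\ 0 & 0 \end{array} \right) \quad \text{while} \quad \left( \begin{array}{c c} 0 & 0 \\ 1 & 0 \end{array} \right) \left( \begin{array}{c c} 0 & 1 \\ 0 & 0 \end{array} \right) = \left( \begin{array}{c c} 0 & 0 \\ 0 & 1 \end{array} \right) , \end{equation*}

so the order of the factors in each term \(A_{i,p} B_{p,j} \) must be preserved.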

We illustrate how to block \(C \) into \(m_b \times n_b\) submatrices, \(A \) into \(m_b \times k_b \) submatrices, and \(B \) into \(k_b \times n_b \) submatrices in Figure 2.2.3.

Figure 2.2.3. An illustration of a blocked algorithm where \(m_b = n_b = k_b = 4 \text{.}\)

The blocks used when updating \(C_{1,2} := A_{1,3} B_{3,2} + C_{1,2}\) are highlighted in that figure.

An implementation of one such algorithm is given in Figure 2.2.4.

void MyGemm( int m, int n, int k, double *A, int ldA,
             double *B, int ldB, double *C, int ldC )
{
  /* MB, NB, KB, min, and the indexing macros alpha, beta, and gamma
     are defined elsewhere in the course materials (a sketch of what
     they might look like follows the figure). */
  for ( int j=0; j<n; j+=NB ){
    int jb = min( n-j, NB );    /* Size of "fringe" block in the n dimension */
    for ( int i=0; i<m; i+=MB ){
      int ib = min( m-i, MB );    /* Size of "fringe" block in the m dimension */
      for ( int p=0; p<k; p+=KB ){
        int pb = min( k-p, KB );    /* Size of "fringe" block in the k dimension */
        Gemm_PJI( ib, jb, pb, &alpha( i,p ), ldA, &beta( p,j ), ldB,
                  &gamma( i,j ), ldC );
      }
    }
  }
}

void Gemm_PJI( int m, int n, int k, double *A, int ldA,
               double *B, int ldB, double *C, int ldC )
{
  /* Compute C := A B + C with the p, j, i loop ordering. */
  for ( int p=0; p<k; p++ )
    for ( int j=0; j<n; j++ )
      for ( int i=0; i<m; i++ )
        gamma( i,j ) += alpha( i,p ) * beta( p,j );
}
Figure 2.2.4. Simple blocked algorithm that calls Gemm_PJI, which in this setting updates an \(m_b \times n_b \) submatrix of \(C \) with the result of multiplying an \(m_b \times k_b \) submatrix of \(A \) times a \(k_b \times n_b \) submatrix of \(B \text{.}\)
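The code in Figure 2.2.4 depends on definitions that appear elsewhere in the course materials. A plausible minimal sketch of those definitions follows, assuming column-major storage; the block sizes of 4 are illustrative placeholders rather than tuned values, and these lines must precede the two routines above.

#define MB 4    /* block size in the m dimension (illustrative only) */
#define NB 4    /* block size in the n dimension (illustrative only) */
#define KB 4    /* block size in the k dimension (illustrative only) */

#define min( x, y )  ( (x) < (y) ? (x) : (y) )

/* Column-major indexing macros; they expand correctly only inside
   functions whose parameters are named A, ldA, B, ldB, C, and ldC. */
#define alpha( i,j ) A[ (j)*ldA + (i) ]
#define beta( i,j )  B[ (j)*ldB + (i) ]
#define gamma( i,j ) C[ (j)*ldC + (i) ]

A small driver, again only a sketch, illustrates a call:

#include <stdio.h>

int main()
{
  double A[6] = { 1, 4, 2, 5, 3, 6 };  /* 2 x 3, stored by columns */
  double B[3] = { 1, 2, 3 };           /* 3 x 1 */
  double C[2] = { 0, 0 };              /* 2 x 1 */

  MyGemm( 2, 1, 3, A, 2, B, 3, C, 2 );
  printf( "C = ( %g, %g )\n", C[0], C[1] );   /* prints ( 14, 32 ) */
  return 0;
}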

Homework 2.2.1.2.

Examine the graph created by Plot_Blocked_MMM.mlx. What block sizes MB and NB would you expect to yield better performance? Change these in Gemm_JIP_PJI.c and execute

make JIP_PJI

View the updated performance with Assignments/Week2/C/data/Plot_Blocked_MMM.mlx. (Don't worry about trying to pick the best block size; we'll get to that later.)

Solution

When we choose MB=100, NB=100, and KB=100, better performance is maintained as the problem size gets larger.

This is because the subproblems now fit in one of the caches. More on this later this week.
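A quick check of the memory footprint is consistent with this, assuming 8-byte double-precision values: the three \(100 \times 100 \) blocks of \(A \text{,}\) \(B \text{,}\) and \(C \) together occupy

\begin{equation*} 3 \times 100 \times 100 \times 8 \text{ bytes} = 240{,}000 \text{ bytes} \approx 240 \text{ KBytes}, \end{equation*}

which is small enough to fit in the L2 cache of a typical processor.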