Unit 2.3.3 Streaming \(A_{i,p} \) and \(B_{p,j} \)
Now, if the submatrix \(C_{i,j} \) were larger (i.e., \(m_R \) and \(n_R \) were larger), then the time spent moving data around (overhead),
\begin{equation*}
\left[ 2 m_R n_R + \left( m_R + n_R \right) k \right] \beta_{R \leftrightarrow M},
\end{equation*}
relative to the time spent in useful computation,
\begin{equation*}
2 m_R n_R k \, \gamma_R,
\end{equation*}
is reduced: the ratio
\begin{equation*}
\frac{\left[ 2 m_R n_R + \left( m_R + n_R \right) k \right] \beta_{R \leftrightarrow M}}{2 m_R n_R k \, \gamma_R}
= \left[ \frac{1}{k} + \frac{1}{2 n_R} + \frac{1}{2 m_R} \right] \frac{\beta_{R \leftrightarrow M}}{\gamma_R}
\end{equation*}
decreases as \(m_R \) and \(n_R \) increase. In other words, the larger the submatrix of \(C \) that is kept in registers, the lower the overhead (under our model). Before we increase \(m_R \) and \(n_R \text{,}\) we first show how for fixed \(m_R \) and \(n_R \) the number of registers that are needed for storing elements of \(A \) and \(B \) can be reduced, which then allows \(m_R \) and \(n_R \) to (later this week) be increased in size.
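For example, under this model and ignoring the \(1/k \) term (which is negligible when \(k \) is large), we find
\begin{equation*}
m_R = n_R = 4: \quad \frac{1}{2 n_R} + \frac{1}{2 m_R} = \frac{1}{4},
\qquad \qquad
m_R = n_R = 8: \quad \frac{1}{2 n_R} + \frac{1}{2 m_R} = \frac{1}{8},
\end{equation*}
so that doubling \(m_R \) and \(n_R \) roughly halves the data movement overhead per floating point operation.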
Recall from Week 1 that if the \(p \) loop of matrix-matrix multiplication is the outermost loop, then the inner two loops implement a rank-1 update. If we do this for the kernel, then each iteration of the loop indexed with \(p \) updates the \(m_R \times n_R \) submatrix \(C_{i,j} \text{,}\) which is kept in registers, with a rank-1 update that uses one column of \(A_{i,p} \) and one row of \(B_{p,j} \text{.}\)
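In the notation of Week 1, and writing \(a_{p'} \) for the \(p' \)th column of \(A_{i,p} \) and \(\widetilde b_{p'}^T \) for the \(p' \)th row of \(B_{p,j} \) (labels introduced here only to make the point), this means the kernel computes
\begin{equation*}
C_{i,j} := C_{i,j} + A_{i,p} B_{p,j} = C_{i,j} + \sum_{p'=0}^{k_R - 1} a_{p'} \widetilde b_{p'}^T,
\end{equation*}
where \(k_R \) is the number of columns of \(A_{i,p} \) (the parameter KR in the code below), and each term \(a_{p'} \widetilde b_{p'}^T \) is a rank-1 update of the submatrix of \(C \) that resides in registers.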
If \(m_R = n_R = 4 \text{,}\) this means that \(16 + 4 + 4 = 24 \) doubles are stored in registers. If we now consider the entire computation performed by the loop indexed with \(p \text{,}\) it becomes a loop around such rank-1 updates, which is captured by the following implementation:

void MyGemm( int m, int n, int k, double *A, int ldA,
             double *B, int ldB, double *C, int ldC )
{
  for ( int i=0; i<m; i+=MR )     /* m assumed to be multiple of MR */
    for ( int j=0; j<n; j+=NR )   /* n assumed to be multiple of NR */
      for ( int p=0; p<k; p+=KR ) /* k assumed to be multiple of KR */
        Gemm_P_Ger( MR, NR, KR, &alpha( i,p ), ldA,
                    &beta( p,j ), ldB, &gamma( i,j ), ldC );
}

/* Compute C := A B + C, with C m x n and A m x k, as k rank-1 updates,
   each using a column of A and a row of B.                             */
void Gemm_P_Ger( int m, int n, int k, double *A, int ldA,
                 double *B, int ldB, double *C, int ldC )
{
  for ( int p=0; p<k; p++ )
    MyGer( m, n, &alpha( 0,p ), 1, &beta( p,0 ), ldB, C, ldC );
}
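The routine MyGer performs the rank-1 update \(C := x y^T + C \text{,}\) as implemented in Week 1. As a reminder of the operation only, here is a minimal sketch of such a routine (the assignments provide their own version, whose details may differ):

#define gamma( i,j ) C[ (j)*ldC + (i) ]   /* column-major access into C */

/* Rank-1 update C := x y^T + C, where C is m x n, x is accessed with
   stride incx, and y with stride incy.                                 */
void MyGer( int m, int n, double *x, int incx,
            double *y, int incy, double *C, int ldC )
{
  for ( int j=0; j<n; j++ )
    for ( int i=0; i<m; i++ )
      gamma( i,j ) += x[ i*incx ] * y[ j*incy ];
}

In the call from Gemm_P_Ger, the column of \(A \) is contiguous in memory (stride 1) while the row of \(B \) is accessed with stride ldB, which is why the two strides are passed separately.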
Homework 2.3.3.1.
In directory Assignments/Week2/C, examine the file Assignments/Week2/C/Gemm_JIP_P_Ger.c. Execute it with
make JIP_P_Ger
and view its performance with Assignments/Week2/C/data/Plot_register_blocking.mlx.
This is the performance on Robert's laptop, with MB=NB=KB=4:

It is tempting to start playing with the parameters MB, NB, and KB. However, the point of this exercise is to illustrate the discussion about how the multiplication with the blocks can be cast in terms of a loop around rank-1 updates.
Remark 2.3.6.
The purpose of this implementation is to emphasize, again, that matrix-matrix multiplication with blocks can be orchestrated as a loop around rank-1 updates, as you already learned in Unit 1.4.2.
