## Unit 3.4.4 Broadcasting elements of $A$ and loading elements of $B$ (tricky and maybe not that worthwhile)

So far, we have focused on choices of $m_R \times n_R$ where $m_R$ is a multiple of four. Before packing was added, this was natural: one loads into vector registers from contiguous memory, and with data mapped to memory in column-major order it is natural to load four elements of $A$ at a time while broadcasting individual elements of $B \text{.}$ Once packing was introduced, one could instead contemplate loading from $B$ and broadcasting elements of $A \text{,}$ which makes kernels like $6 \times 8$ possible. The problem is that the vector instructions would then operate on elements that lie in rows of $C \text{,}$ and hence a transpose operation would be needed when loading elements of $C$ into registers and again when storing them back to memory.

There are reasons why a $6 \times 8$ kernel edges out even an $8 \times 6$ kernel. On the surface, it may seem like this would require a careful reengineering of the five loops as well as the micro-kernel. However, there are tricks that can be played to make it much simpler than it seems.

Intuitively, if matrices are stored by columns and you leave the data as it is stored but work with it as if it were stored by rows, this amounts to computing with the transposes of the matrices. Mathematically, if

\begin{equation*} C := A B + C \end{equation*}

then

\begin{equation*} C^T := ( A B + C )^T = ( A B )^T + C^T = B^T A^T + C^T. \end{equation*}

If you think about it: transposing a matrix is equivalent to viewing the matrix, when stored in column-major order, as being stored in row-major order.

Consider the simplest implementation for computing $C := A B + C$ from Week 1:

```c
#define alpha( i,j ) A[ (j)*ldA + i ]
#define beta( i,j )  B[ (j)*ldB + i ]
#define gamma( i,j ) C[ (j)*ldC + i ]

void MyGemm( int m, int n, int k, double *A, int ldA,
             double *B, int ldB, double *C, int ldC )
{
  for ( int i=0; i<m; i++ )
    for ( int j=0; j<n; j++ )
      for ( int p=0; p<k; p++ )
        gamma( i,j ) += alpha( i,p ) * beta( p,j );
}
```


From this, a code that instead computes $C^T := B^T A^T + C^T \text{,}$ with matrices still stored in column-major order, is given by

```c
#define alpha( i,j ) A[ (j)*ldA + i ]
#define beta( i,j )  B[ (j)*ldB + i ]
#define gamma( i,j ) C[ (j)*ldC + i ]

void MyGemm( int m, int n, int k, double *A, int ldA,
             double *B, int ldB, double *C, int ldC )
{
  for ( int i=0; i<m; i++ )
    for ( int j=0; j<n; j++ )
      for ( int p=0; p<k; p++ )
        gamma( j,i ) += alpha( p,i ) * beta( j,p );
}
```


Notice how the roles of m and n are swapped and how working with the transpose replaces i,j with j,i, etc. However, we could have been clever: instead of replacing i,j with j,i in the loop body, we could swap the order of the arguments in the macro definitions, so that #define alpha( i,j ) becomes #define alpha( j,i ):

```c
#define alpha( j,i ) A[ (j)*ldA + i ]
#define beta( j,i )  B[ (j)*ldB + i ]
#define gamma( j,i ) C[ (j)*ldC + i ]

void MyGemm( int m, int n, int k, double *A, int ldA,
             double *B, int ldB, double *C, int ldC )
{
  for ( int i=0; i<m; i++ )
    for ( int j=0; j<n; j++ )
      for ( int p=0; p<k; p++ )
        gamma( i,j ) += alpha( i,p ) * beta( p,j );
}
```


Next, we can get right back to our original triple-nested loop by swapping the roles of A and B:

```c
#define alpha( j,i ) A[ (j)*ldA + i ]
#define beta( j,i )  B[ (j)*ldB + i ]
#define gamma( j,i ) C[ (j)*ldC + i ]

void MyGemm( int m, int n, int k, double *B, int ldB,
             double *A, int ldA, double *C, int ldC )
{
  for ( int i=0; i<m; i++ )
    for ( int j=0; j<n; j++ )
      for ( int p=0; p<k; p++ )
        gamma( i,j ) += alpha( i,p ) * beta( p,j );
}
```


The point is this: computing $C := A B + C$ with matrices stored by columns is the same as computing $C := B A + C$ viewing the same data as if stored by rows, with appropriate adjustments for the sizes of the matrices. With this insight, any matrix-matrix multiplication implementation we have encountered can be transformed into one that views the data as if stored by rows: redefine the macros that translate indexing into the arrays into how the matrices are stored, swap the roles of $A$ and $B \text{,}$ and adjust the sizes of the matrices.

Because the packing is implemented in terms of these macros, those routines can be used as is. What is left is to modify the micro-kernel to use $m_R \times n_R = 6 \times 8 \text{.}$