PfHP Implementing the micro-kernel with vector instructions

Unit 2.4.2 Implementing the micro-kernel with vector instructions

Remark 2.4.1.

This unit is one of the most important ones in the course. Take your time to understand it. The syntax of the intrinsic library is less important than the concepts: as new architectures become popular, new vector instructions will be introduced.

In this section, we are going to AVX2 vector instruction set to illustrate the ideas. Vector intrinsic functions support the use of these instructions from within the C programming language. You discover how the intrinsic operations are used by incorporating them into an implemention of the micro-kernel for the case where \(m_R \times n_R = 4 \times 4 \text{.}\) Thus, for the remainder of this unit, \(C \) is \(m_R \times n_R = 4 \times 4 \text{,}\) \(A \) is \(m_R \times k = 4 \times k \text{,}\) and \(B \) is \(k \times n_R = k \times 4 \text{.}\)

The architectures we target have 256 bit vector registers, which mean they can store four double precision floating point numbers. Vector operations with vector registers hence operate with vectors of size four.

Let's recap: Partitioning

\begin{equation*} C = \left( \begin{array}{c | c | c} C_{0,0} \amp \cdots \amp C_{0,N-1} \\ \hline \vdots \amp \amp \vdots \\ \hline C_{M-1,0} \amp \cdots \amp C_{M-1,N-1} \end{array} \right), \end{equation*}

\begin{equation*} A = \left( \begin{array}{c | c | c} A_{0} \\ \hline \vdots \\ \hline A_{M-1} \end{array} \right), \quad B = \left( \begin{array}{c | c | c} B_{0} \amp \cdots \amp B_{N-1} \\ \end{array} \right), \end{equation*}

the algorithm we discovered in Section 2.3 is given by

\begin{equation*} \begin{array}{l} {\bf for~} i := 0, \ldots , M-1 \\ ~~~ {\bf for~} j := 0, \ldots , N-1 \\ ~~~ ~~~ {\rm Load~} C_{i,j} \mbox{ into registers} \\ ~~~ ~~~ C_{i,j} := A_{i} B_{j} + C_{i,j} {\rm ~~with~micro-kernel} \\ ~~~ ~~~ {\rm Store~} C_{i,j} \mbox{ to memory} \\ ~~~ {\bf end} \\ {\bf end} \end{array} \end{equation*}

and translates into the code in Figure 2.4.2.

void MyGemm( int m, int n, int k, double *A, int ldA, 
		  double *B, int ldB, double *C, int ldC ) 
{
  if ( m % MR != 0 || n % NR != 0 ){
    printf( "m and n must be multiples of MR and NR, respectively \n" ); 
    exit( 0 ); 
  }
  
  for ( int j=0; j<n; j+=NR ) /* n is assumed to be a multiple of NR */
    for ( int i=0; i<m; i+=MR ) /* m is assumed to be a multiple of MR */
      Gemm_MRxNRKernel( k, &alpha( i,0 ), ldA, &beta( 0,j ), ldB, &gamma( i,j ), ldC ); 
}

Figure 2.4.2. General routine for calling a \(m_R \times n_R \) kernel in Assignments/Week2/C/Gemm_JI_MRxNRKernel.c. The constants MR and NR are specified at compile time by passing -D'MR=??' and -D'NR=??' to the compiler, where the ??s equal the desired choices of \(m_R \) and \(n_R \) (see also the Makefile in Assignments/Week2/C).

Let's drop the subscripts, and focus on computing \(C +:= A B \) when \(C \) is \(4 \times 4 \text{,}\) with the micro-kernel. This translates to the computation

\begin{equation*} \begin{array}{l} \left( \begin{array}{c | c | c | c} \gamma_{0,0} \amp \gamma_{0,1} \amp \gamma_{0,2} \amp \gamma_{0,3} \\ \gamma_{1,0} \amp \gamma_{1,1} \amp \gamma_{1,2} \amp \gamma_{1,3} \\ \gamma_{2,0} \amp \gamma_{2,1} \amp \gamma_{2,2} \amp \gamma_{2,3} \\ \gamma_{3,0} \amp \gamma_{3,1} \amp \gamma_{3,2} \amp \gamma_{3,3} \end{array}\right) \\ ~~~ +:= \left( \begin{array}{c | c | c } \alpha_{0,0} \amp \color{red} \alpha_{0,1} \amp \cdots \\ \alpha_{1,0} \amp \color{red} \alpha_{1,1} \amp \cdots \\ \alpha_{2,0} \amp \color{red} \alpha_{2,1} \amp \cdots \\ \alpha_{3,0} \amp \color{red} \alpha_{3,1} \amp \cdots \end{array}\right) \left( \begin{array}{c c c c} \beta_{0,0} \amp \beta_{0,1} \amp \beta_{0,2} \amp \beta_{0,3} \\ \beta_{1,0} \amp \color{blue} \beta_{1,1} \amp \beta_{1,2} \amp \beta_{1,3} \\ \vdots \amp \vdots \amp \vdots \amp \vdots \end{array}\right) \\ ~~~ = \left(\begin{array}{c | c | c | c } \alpha_{0,0} \beta_{0,0} \!\!+\!\! {\color{red} \alpha_{0,1}} \beta_{1,0} \!\!+\!\! \cdots \amp \alpha_{0,0} \beta_{0,1} \!\!+\!\! {\color{red} \alpha_{0,1}} {\color{blue} \beta_{1,1}} \!\!+\!\! \cdots \amp \alpha_{0,0} \beta_{0,2} \!\!+\!\! {\color{red} \alpha_{0,1}} \beta_{1,2} \!\!+\!\! \cdots \amp \alpha_{0,0} \beta_{0,2} \!\!+\!\! {\color{red} \alpha_{0,1}} \beta_{1,2} \!\!+\!\! \cdots \\ \alpha_{1,0} \beta_{0,0} \!\!+\!\! {\color{red} \alpha_{1,1}} \beta_{1,0} \!\!+\!\! \cdots \amp \alpha_{1,0} \beta_{0,1} \!\!+\!\! {\color{red} \alpha_{1,1}} {\color{blue} \beta_{1,1}} \!\!+\!\! \cdots \amp \alpha_{1,0} \beta_{0,2} \!\!+\!\! {\color{red} \alpha_{1,1}} \beta_{1,2} \!\!+\!\! \cdots \amp \alpha_{1,0} \beta_{0,3} \!\!+\!\! {\color{red} \alpha_{1,1}} \beta_{1,3} \!\!+\!\! \cdots \\ \alpha_{2,0} \beta_{0,0} \!\!+\!\! {\color{red} \alpha_{2,1}} \beta_{1,0} \!\!+\!\! \cdots \amp \alpha_{2,0} \beta_{0,1} \!\!+\!\! {\color{red} \alpha_{2,1}} {\color{blue} \beta_{1,1}} \!\!+\!\! \cdots \amp \alpha_{2,0} \beta_{0,2} \!\!+\!\! {\color{red} \alpha_{2,1}} \beta_{1,2} \!\!+\!\! \cdots \amp \alpha_{2,0} \beta_{0,3} \!\!+\!\! {\color{red} \alpha_{2,1}} \beta_{1,3} \!\!+\!\! \cdots \\ \alpha_{3,0} \beta_{0,0} \!\!+\!\! {\color{red} \alpha_{3,1}} \beta_{1,0} \!\!+\!\! \cdots \amp \alpha_{3,0} \beta_{0,1} \!\!+\!\! {\color{red} \alpha_{3,1}} {\color{blue} \beta_{1,1}} \!\!+\!\! \cdots \amp \alpha_{3,0} \beta_{0,2} \!\!+\!\! {\color{red} \alpha_{3,1}} \beta_{1,2} \!\!+\!\! \cdots \amp \alpha_{3,0} \beta_{0,3} \!\!+\!\! {\color{red} \alpha_{3,1}} \beta_{1,3} \!\!+\!\! \cdots \end{array} \right) \\ ~~~ = \left( \begin{array}{c} \alpha_{0,0} \\ \alpha_{1,0} \\ \alpha_{2,0} \\ \alpha_{3,0} \end{array}\right) \left( \begin{array}{c c c c} \beta_{0,0} \amp \beta_{0,1} \amp \beta_{0,2} \amp \beta_{0,3} \\ \end{array}\right) + \left( \begin{array}{c c c c} {\color{red} \alpha_{0,1}} \\ {\color{red} \alpha_{1,1}} \\ {\color{red} \alpha_{2,1}} \\ {\color{red} \alpha_{3,1}} \end{array}\right) \left( \begin{array}{c c c c} \beta_{1,0} \amp {\color{blue} \beta_{1,1}} \amp \beta_{1,2} \amp \beta_{1,3} \\ \end{array}\right) + \cdots \end{array} \end{equation*}

Thus, updating \(4 \times 4 \) matrix \(C \) can be implemented as a loop around rank-1 updates:

\begin{equation*} \begin{array}{l} {\bf for~} p = 0, \ldots, k-1\\ ~~~ \left( \begin{array}{c | c | c | c} \gamma_{0,0} \amp \gamma_{0,1} \amp \gamma_{0,2} \amp \gamma_{0,3} \\ \gamma_{1,0} \amp \gamma_{1,1} \amp \gamma_{1,2} \amp \gamma_{1,3} \\ \gamma_{2,0} \amp \gamma_{2,1} \amp \gamma_{2,2} \amp \gamma_{2,3} \\ \gamma_{3,0} \amp \gamma_{3,1} \amp \gamma_{3,2} \amp \gamma_{3,3} \end{array}\right) +:= \left( \begin{array}{c c c c} \alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p} \end{array}\right) \left( \begin{array}{c c c c} \beta_{p,0} \amp \beta_{p,1} \amp \beta_{p,2} \amp \beta_{p,3} \\ \end{array}\right) \\ {\bf end} \end{array} \end{equation*}

or, equivalently to emphasize computations with vectors,

\begin{equation*} \begin{array}{l} {\bf for~} p = 0, \ldots, k-1 \\ ~~~ \left( \begin{array}{c | c | c | c} \begin{array}{c c c c} \gamma_{0,0} +:= \alpha_{0,p} \times \beta_{p,0} \\ \gamma_{1,0} +:= \alpha_{1,p} \times \beta_{p,0} \\ \gamma_{2,0} +:= \alpha_{2,p} \times \beta_{p,0} \\ \gamma_{3,0} +:= \alpha_{3,p} \times \beta_{p,0} \end{array} \amp \begin{array}{c c c c} \gamma_{0,1} +:= \alpha_{0,p} \times \beta_{p,1} \\ \gamma_{1,1} +:= \alpha_{1,p} \times \beta_{p,1} \\ \gamma_{2,1} +:= \alpha_{2,p} \times \beta_{p,1} \\ \gamma_{3,1} +:= \alpha_{3,p} \times \beta_{p,1} \end{array} \amp \begin{array}{c c c c} \gamma_{0,2} +:= \alpha_{0,p} \times \beta_{p,2} \\ \gamma_{1,2} +:= \alpha_{1,p} \times \beta_{p,2} \\ \gamma_{2,2} +:= \alpha_{2,p} \times \beta_{p,2} \\ \gamma_{3,2} +:= \alpha_{3,p} \times \beta_{p,2} \end{array} \amp \begin{array}{c c c c} \gamma_{0,3} +:= \alpha_{0,p} \times \beta_{p,3} \\ \gamma_{1,3} +:= \alpha_{1,p} \times \beta_{p,3} \\ \gamma_{2,3} +:= \alpha_{2,p} \times \beta_{p,3} \\ \gamma_{3,3} +:= \alpha_{3,p} \times \beta_{p,3} \end{array} \end{array}\right) \\ {\bf end} \end{array} \end{equation*}

This, once again, shows how matrix-matrix multiplication can be cast in terms of a sequence of rank-1 updates, and that a rank-1 update can be implemented in terms of axpy operations with columns of \(C \text{.}\) This micro-kernel translates into the code that employs vector instructions, given in Figure 2.4.3. We hope that the instrinsic function calls are relatively self-explanatory. However, you may want to consult Intel's Intrinsics Reference Guide. When doing so, you may want to search on the name of the routine, without checking the AVX2 box on the left.

#include <immintrin.h> 

void Gemm_MRxNRKernel( int k, double *A, int ldA, double *B, int ldB, 
		double *C, int ldC ) 
{
  /* Declare vector registers to hold 4x4 C and load them */
  __m256d gamma_0123_0 = _mm256_loadu_pd( &gamma( 0,0 ) ); 
  __m256d gamma_0123_1 = _mm256_loadu_pd( &gamma( 0,1 ) ); 
  __m256d gamma_0123_2 = _mm256_loadu_pd( &gamma( 0,2 ) ); 
  __m256d gamma_0123_3 = _mm256_loadu_pd( &gamma( 0,3 ) ); 
   	
  for ( int p=0; p<k; p++ ){
    /* Declare vector register for load/broadcasting beta( p,j ) */
    __m256d beta_p_j; 
    
    /* Declare a vector register to hold the current column of A and load 
       it with the four elements of that column. */
    __m256d alpha_0123_p = _mm256_loadu_pd( &alpha( 0,p ) ); 

    /* Load/broadcast beta( p,0 ). */
    beta_p_j = _mm256_broadcast_sd( &beta( p, 0) ); 
    
    /* update the first column of C with the current column of A times 
       beta ( p,0 ) */
    gamma_0123_0 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_0 ); 
    
    /* REPEAT for second, third, and fourth columns of C.  Notice that the 
       current column of A needs not be reloaded. */

  
    
  }
  
  /* Store the updated results */
  _mm256_storeu_pd( &gamma(0,0), gamma_0123_0 ); 
  _mm256_storeu_pd( &gamma(0,1), gamma_0123_1 ); 
  _mm256_storeu_pd( &gamma(0,2), gamma_0123_2 ); 
  _mm256_storeu_pd( &gamma(0,3), gamma_0123_3 ); 
}

Figure 2.4.3. Partially instantiated \(m_R \times n_R \) kernel for the case where \(m_R = n_R = 4 \) (see Assignments/Week2/C/Gemm_4x4Kernel.c).

An illustration of how Figure 2.4.2 and Figure 2.4.3 compute is found in Figure 2.4.4.