
Unit 3.3.6 Micro-kernel with packed data

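#include <immintrin.h>

/* Map gamma( i,j ) to the (i,j) element of C, stored in column-major order. */
#define gamma( i,j ) C[ (j)*ldC + (i) ]

/* MR and NR (both 4 for this kernel) are assumed to be defined at compile
   time, e.g. with -DMR=4 -DNR=4. */
void Gemm_MRxNRKernel_Packed( int k,
		        double *BlockA, double *PanelB, double *C, int ldC )
{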
  __m256d gamma_0123_0 = _mm256_loadu_pd( &gamma( 0,0 ) );
  __m256d gamma_0123_1 = _mm256_loadu_pd( &gamma( 0,1 ) );
  __m256d gamma_0123_2 = _mm256_loadu_pd( &gamma( 0,2 ) );
  __m256d gamma_0123_3 = _mm256_loadu_pd( &gamma( 0,3 ) );

  __m256d beta_p_j;
   	
  for ( int p=0; p<k; p++ ){
    /* load alpha( 0:3, p ) */
    __m256d alpha_0123_p = _mm256_loadu_pd( BlockA );

    /* load beta( p, 0 ); update gamma( 0:3, 0 ) */
    beta_p_j = _mm256_broadcast_sd( PanelB );
    gamma_0123_0 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_0 );

    /* load beta( p, 1 ); update gamma( 0:3, 1 ) */
    beta_p_j = _mm256_broadcast_sd( PanelB+1 );
    gamma_0123_1 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_1 );

    /* load beta( p, 2 ); update gamma( 0:3, 2 ) */
    beta_p_j = _mm256_broadcast_sd( PanelB+2 );
    gamma_0123_2 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_2 );

    /* load beta( p, 3 ); update gamma( 0:3, 3 ) */
    beta_p_j = _mm256_broadcast_sd( PanelB+3 );
    gamma_0123_3 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_3 );

    BlockA += MR;
    PanelB += NR;
  }

  /* Store the updated results.  This should be done more carefully since
     there may be an incomplete micro-tile. */
  _mm256_storeu_pd( &gamma(0,0), gamma_0123_0 );
  _mm256_storeu_pd( &gamma(0,1), gamma_0123_1 );
  _mm256_storeu_pd( &gamma(0,2), gamma_0123_2 );
  _mm256_storeu_pd( &gamma(0,3), gamma_0123_3 );
}

Figure 3.3.8. Blocking for multiple levels of cache, with packing.

Reference implementations of packing routines can be found in Figure 3.3.2, Figure 3.3.3, Figure 3.3.5, and Figure 3.3.6. While these implementations can be optimized, the cost of packing is dominated by the movement of data between main memory and faster memory. As a result, optimizing the packing code itself has relatively little effect.
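To make the point about data movement concrete, here is a minimal sketch of packing one micro-panel of \(A \text{.}\) It is not the routine from the figures cited above: the name pack_micro_panel_A_sketch and the explicit mr parameter are illustrative assumptions. The routine does nothing but copy each column once (and zero-pad an incomplete panel), so there is little arithmetic to optimize.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (name and signature are illustrative, not the
   course's routine): pack an m x k block of column-major A, with leading
   dimension ldA, into a contiguous micro-panel of mr rows.  Column p of the
   block lands in Atilde[ p*mr ] .. Atilde[ p*mr + mr-1 ]; rows m..mr-1 are
   set to zero. */
void pack_micro_panel_A_sketch( int m, int k, const double *A, int ldA,
                                int mr, double *Atilde )
{
  assert( 0 <= m && m <= mr && m <= ldA );

  for ( int p=0; p<k; p++ ) {
    /* Copy the m stored entries of column p: one pass over the data. */
    memcpy( Atilde, &A[ (size_t)p*ldA ], (size_t)m * sizeof( double ) );

    /* Pad an incomplete micro-panel with zeros. */
    for ( int i=m; i<mr; i++ ) Atilde[ i ] = 0.0;

    Atilde += mr;
  }
}
```

Each element of the block is read once and written once, which is why the time spent here is governed by memory traffic rather than by how the loop is written.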

Unit 3.3.5 illustrates how to modify the five loops to incorporate packing. A micro-kernel that computes with the packed data when \(m_R \times n_R = 4 \times 4 \) is illustrated in Figure 3.3.8.
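The access pattern of the packed micro-kernel may be easier to see without vector intrinsics. The following scalar sketch performs the same update, assuming the packed layout used by the kernel: BlockA stores \(k \) columns of \(m_R \) contiguous doubles and PanelB stores \(k \) rows of \(n_R \) contiguous doubles. The name Gemm_Kernel_Packed_ref and the explicit mr and nr parameters are illustrative assumptions, not part of the course code.

```c
#include <assert.h>

/* Scalar reference for C := A B + C on packed operands (a sketch; the name
   and the explicit mr, nr parameters are illustrative assumptions).
   BlockA holds k columns of mr doubles; PanelB holds k rows of nr doubles;
   C is column-major with leading dimension ldC. */
void Gemm_Kernel_Packed_ref( int k, int mr, int nr,
                             const double *BlockA, const double *PanelB,
                             double *C, int ldC )
{
  assert( ldC >= mr );

  for ( int p=0; p<k; p++ ) {
    for ( int j=0; j<nr; j++ ) {
      /* beta( p,j ) is the j-th entry of the current packed row of B. */
      double beta_p_j = PanelB[ j ];

      /* Rank-1 update of column j of the micro-tile with alpha( 0:mr-1, p ). */
      for ( int i=0; i<mr; i++ )
        C[ j*ldC + i ] += BlockA[ i ] * beta_p_j;
    }

    /* Both operands are read with unit stride. */
    BlockA += mr;
    PanelB += nr;
  }
}
```

A routine like this can serve as a check when writing a vectorized kernel for another micro-tile size: run both on the same packed buffers and compare the resulting micro-tiles of \(C \text{.}\)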

Homework 3.3.6.2.

Copy the file Gemm_4x4Kernel_Packed.c into file Gemm_12x4Kernel_Packed.c. Modify that file so that it uses \(m_R \times n_R = 12 \times 4 \text{.}\) Test the result with

make Five_Loops_Packed_12x4Kernel

and view the resulting performance with Live Script Plot_Five_Loops.mlx.

Solution

Assignments/Week3/Answers/Gemm_12x4Kernel_Packed.c

On Robert's laptop:

Now we are getting somewhere!