## Unit 4.3.3 Parallelizing the third loop around the microkernel

Let's similarly parallelize the third loop around the microkernel:

###### Homework 4.3.3.1.

In directory Week4/C,

• Copy Gemm_Parallel_Loop2_12x4.c into Gemm_Parallel_Loop3_12x4.c.

• Modify it so that only the third loop around the microkernel is parallelized.

• Set the number of threads to some number between $2$ and the number of cores in the target processor.

• Execute make Parallel_Loop3_12x4.

• View the resulting performance with ShowPerformance.mlx, uncommenting the appropriate lines.

• Be sure to check if you got the right answer! This actually is trickier than you might at first think!

Observations:

• Parallelizing the third loop is a bit trickier. Likely, you started by inserting the #pragma statement and got the wrong answer. The problem is that you are now parallelizing the loop

implemented by

\begin{center}
\lstinputlisting[firstline=62,lastline=71]{github/LAFF-On-PfHPC/Assignments/WeekB/C/GemmFiveLoops.c}
\end{center}


The problem with a naive parallelization is that all threads pack their block of $A$ into the same buffer Atilde. There are two solutions to this:

• Quick and dirty: allocate and free a new buffer in this loop, so that each iteration, and hence each thread, gets a different buffer.

• Cleaner: In the routine LoopThree, we currently allocate one buffer for Atilde:

```c
void LoopThree( int m, int n, int k, double *A, int ldA,
                double *Btilde, double *C, int ldC )
{
  double *Atilde = ( double * ) malloc( MC * KC * sizeof( double ) );

  for ( int i=0; i<m; i+=MC ) {
    int ib = min( MC, m-i );
    PackBlockA_MCxKC( ib, k, &alpha( i, 0 ), ldA, Atilde );
    LoopTwo( ib, n, k, Atilde, Btilde, &gamma( i, 0 ), ldC );
  }

  free( Atilde );
}
```


What one can do is allocate a larger buffer here, big enough for as many blocks $\widetilde A$ as there are threads, and then, in the third loop around the microkernel, index into this larger buffer to give each thread its own space. For this, one needs to know about a few OpenMP inquiry routines:

• omp_get_num_threads() returns the number of threads being used in a parallel section.

Now, this is all a bit tricky, because omp_get_num_threads() reports the size of the team of threads only when called inside a parallel section; outside of one, it simply returns $1$. So, to allocate enough space, one would change the code in LoopThree to

```c
double *Atilde;

#pragma omp parallel
{
  if ( omp_get_thread_num() == 0 )
    Atilde = ( double * ) malloc( MC * KC * sizeof( double )
                                  * omp_get_num_threads() );
}
```


Here the curly brackets indicate the scope of the parallel section. The call to free must then be correspondingly modified so that only thread 0 releases the buffer.

Now, in the third loop, each thread can use omp_get_thread_num() to index into this larger buffer, so that it packs into, and computes with, its own block $\widetilde A$.