## Unit 4.3.3 Parallelizing the second loop around the micro-kernel

Let us next consider how to parallelize the second loop around the micro-kernel. This time, the situation is very similar to the one considered in Unit 4.1.1 when parallelizing the JPI loop ordering. There, in Homework 4.1.1.2, we observed that if $C$ and $B$ are partitioned by columns, the matrix-matrix multiplication can be described by

\begin{equation*} \begin{array}{rcl} \left( \begin{array}{c | c | c} c_0 & \cdots & c_{n-1} \end{array} \right) & := & A \left( \begin{array}{c | c | c} b_0 & \cdots & b_{n-1} \end{array} \right) + \left( \begin{array}{c | c | c} c_0 & \cdots & c_{n-1} \end{array} \right) \\ & = & \left( \begin{array}{c | c | c} A b_0 + c_0 & \cdots & A b_{n-1} + c_{n-1} \end{array} \right). \end{array} \end{equation*}

We then observed that the update of each column of $C$ can proceed in parallel.
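As a reminder, here is a minimal sketch of that earlier parallelization, written with the column-major indexing macros used throughout the course; the routine name MyGemm_JPI and the exact placement of the OpenMP directive are illustrative rather than the verbatim contents of the assignment files.

```c
#define alpha( i,j ) A[ (j)*ldA + (i) ]   // map alpha( i,j ) to array A
#define beta( i,j )  B[ (j)*ldB + (i) ]   // map beta( i,j )  to array B
#define gamma( i,j ) C[ (j)*ldC + (i) ]   // map gamma( i,j ) to array C

void MyGemm_JPI( int m, int n, int k, double *A, int ldA,
                 double *B, int ldB, double *C, int ldC )
{
  // Each iteration of the j loop updates a different column of C,
  // so the iterations are independent and can execute in parallel.
  // (Compile with -fopenmp.)
#pragma omp parallel for
  for ( int j=0; j<n; j++ )
    for ( int p=0; p<k; p++ )
      for ( int i=0; i<m; i++ )
        gamma( i,j ) += alpha( i,p ) * beta( p,j );
}
```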

The matrix-matrix multiplication performed by the second loop around the micro-kernel instead partitions the row panel of $C$ and the row panel of $B$ into micro-panels, each NR columns wide:

\begin{equation*} \begin{array}{rcl} \left( \begin{array}{c | c | c} C_0 & \cdots & C_{N-1} \end{array} \right) & := & A \left( \begin{array}{c | c | c} B_0 & \cdots & B_{N-1} \end{array} \right) + \left( \begin{array}{c | c | c} C_0 & \cdots & C_{N-1} \end{array} \right) \\ & = & \left( \begin{array}{c | c | c} A B_0 + C_0 & \cdots & A B_{N-1} + C_{N-1} \end{array} \right). \end{array} \end{equation*}

The bottom line: the updates of the micro-panels of $C$ can happen in parallel.

###### Homework 4.3.3.1.

In directory Assignments/Week4/C,

• Copy Gemm_MT_Loop1_MRxNRKernel.c into Gemm_MT_Loop2_MRxNRKernel.c.

• Modify it so that only the second loop around the micro-kernel is parallelized.

• Execute it with

export OMP_NUM_THREADS=4
make MT_Loop2_8x6Kernel


• Be sure to check if you got the right answer!

• View the resulting performance with data/Plot_MT_performance_8x6.mlx, changing 0 to 1 for the appropriate section.

Solution

On Robert's laptop (using 4 threads): Parallelizing the second loop appears to work very well. The granularity of the computation in each iteration is larger. The ratio between NC and NR is typically large: in the makefile I set NC=2016 and NR=6, so that there are $2016/6 = 336$ tasks that can be scheduled to the threads. The only issue is that all cores load their L2 cache with the same block of $A$, which is a waste of a valuable resource (the aggregate size of the L2 caches).
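For concreteness, here is a minimal sketch of what the parallelized second loop might look like. It assumes a LoopTwo routine in the style of the Week 4 assignment code; the names LoopOne and Btilde, and the min macro, follow that style but may differ slightly from the actual files.

```c
#define min( x, y ) ( (x) < (y) ? (x) : (y) )

// NR is assumed to be set at compile time (e.g., -DNR=6 in the makefile).
void LoopTwo( int m, int n, int k, double *A, double *Btilde,
              double *C, int ldC )
{
  // Each iteration updates a different micro-panel of C, NR columns
  // wide, using the corresponding packed micro-panel of Btilde, so
  // the iterations are independent and can be scheduled to threads.
#pragma omp parallel for
  for ( int j=0; j<n; j+=NR ) {
    int jb = min( NR, n-j );   // width of the current micro-panel
    LoopOne( m, jb, k, A, &Btilde[ j*k ], &C[ j*ldC ], ldC );
  }
}
```

Note that every thread is passed the same block $A$, consistent with the observation above that all cores fill their L2 caches with identical data.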