## Unit 4.4.3 Parallelizing the packing

Since leaving part of the computation sequential can translate into a slow ramp-up of performance as threads are added, it is worthwhile to consider what else might need to be parallelized. Packing can contribute significantly to the overhead (although we have not analyzed or measured this), and hence is worth a look. We are going to look at each loop in our "five loops around the micro-kernel" and reason through whether parallelizing the packing of the block of $A$ and/or the packing of the row panel of $B$ should be considered.

### Subsubsection 4.4.3.1 Loop two and parallelizing the packing

Let's start by considering the case where the second loop around the micro-kernel has been parallelized. Notice that all packing happens before this loop is reached. This means that, unless one explicitly parallelizes the packing of the block of $A$ and/or the packing of the row panel of $B$, these components of the computation are performed by a single thread.
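One way to parallelize the packing of the block of $A$ is to observe that its micro-panels occupy disjoint slices of the packed buffer, so the loop over micro-panels can be divided among threads with an OpenMP directive. The subtlety is that a packing routine often carries a pointer forward from one micro-panel to the next; that loop-carried dependence must be replaced by computing each panel's destination from the loop index. Here is a simplified sketch (a hypothetical stand-in, not the course's actual PackA.c, assuming column-major $A$ and $m_R = 8$ to match the $8 \times 6$ micro-kernel):

```c
#define MR 8   /* micro-panel height, matching the 8x6 micro-kernel */

/* Simplified, hypothetical packing of an m x k block of column-major A
   (leading dimension ldA) into micro-panels of MR rows each.  Every
   micro-panel is written to its own disjoint slice of Atilde, computed
   from the loop index i rather than a pointer carried across
   iterations, so the iterations are independent and the loop can
   safely be parallelized. */
void MT_PackA( int m, int k, const double *A, int ldA, double *Atilde )
{
#pragma omp parallel for
  for ( int i = 0; i < m; i += MR ) {
    int     ib  = ( m - i < MR ) ? m - i : MR;  /* height of (possibly partial) panel */
    double *dst = &Atilde[ i * k ];             /* this panel's own destination */
    for ( int p = 0; p < k; p++ )
      for ( int ii = 0; ii < MR; ii++ )
        *dst++ = ( ii < ib ) ? A[ ( i + ii ) + p * ldA ] : 0.0;  /* zero-pad edge panel */
  }
}
```

Note that each micro-panel is padded with zeroes to a full $m_R$ rows, so the micro-kernel never needs a special case for the edge panel.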

###### Homework 4.4.3.1.
• Copy PackA.c into MT_PackA.c. Change this file so that packing is parallelized.

• Execute it with

```
export OMP_NUM_THREADS=4
make MT_Loop2_MT_PackA_8x6Kernel
```


• Be sure to check if you got the right answer! Parallelizing the packing, the way PackA.c is written, is a bit tricky.

• View the resulting performance with data/Plot_MT_Loop2_MT_Pack_8x6.mlx.

Solution

On Robert's laptop (using 4 threads), the performance is not noticeably changed.

###### Homework 4.4.3.2.
• Copy PackB.c into MT_PackB.c. Change this file so that packing is parallelized.

• Execute it with

```
export OMP_NUM_THREADS=4
make MT_Loop2_MT_PackB_8x6Kernel
```


• Be sure to check if you got the right answer! Again, parallelizing the packing, the way PackB.c is written, is a bit tricky.

• View the resulting performance with data/Plot_MT_Loop2_MT_Pack_8x6.mlx.

Solution

On Robert's laptop (using 4 threads), the performance is again not noticeably changed.

###### Homework 4.4.3.3.

Now that you have parallelized both the packing of the block of $A$ and the packing of the row panel of $B$, you are set to check if doing both shows a benefit.

• Execute

```
export OMP_NUM_THREADS=4
make MT_Loop2_MT_PackAB_8x6Kernel
```


• Be sure to check if you got the right answer!

• View the resulting performance with data/Plot_MT_Loop2_MT_Pack_8x6.mlx.

Solution

On Robert's laptop (using 4 threads), the performance is still not noticeably changed.

Why don't we see an improvement? Packing is a memory-intensive task. Depending on how much bandwidth there is between the cores and memory, packing with a single core (thread) may already saturate that bandwidth. In that case, parallelizing the operation so that multiple cores are employed does not actually speed up the process. It appears that on Robert's laptop, the bandwidth is indeed saturated. Those with access to beefier processors with more bandwidth to memory may see some benefit from parallelizing the packing, especially when utilizing more cores.

### Subsubsection 4.4.3.2 Loop three and parallelizing the packing

Next, consider the case where the third loop around the micro-kernel has been parallelized. Now, each thread packs a different block of $A$ and hence there is no point in parallelizing the packing of that block. The packing of the row panel of $B$ happens before the third loop around the micro-kernel is reached, and hence one can consider parallelizing that packing.
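The packing of the row panel of $B$ parallelizes along the same lines as the packing of the block of $A$: its micro-panels, now $n_R$ columns wide, land in disjoint slices of the packed buffer, so the loop over them can be shared among threads. A simplified sketch (again a hypothetical stand-in for the actual MT_PackB.c, assuming column-major $B$ and $n_R = 6$):

```c
#define NR 6   /* micro-panel width, matching the 8x6 micro-kernel */

/* Simplified, hypothetical packing of a k x n row panel of column-major B
   (leading dimension ldB) into micro-panels of NR columns each.  As with
   the packing of A, each micro-panel's destination is computed from the
   loop index j, so the iterations are independent and the loop can be
   parallelized. */
void MT_PackB( int k, int n, const double *B, int ldB, double *Btilde )
{
#pragma omp parallel for
  for ( int j = 0; j < n; j += NR ) {
    int     jb  = ( n - j < NR ) ? n - j : NR;  /* width of (possibly partial) panel */
    double *dst = &Btilde[ j * k ];             /* this panel's own destination */
    for ( int p = 0; p < k; p++ )
      for ( int jj = 0; jj < NR; jj++ )
        *dst++ = ( jj < jb ) ? B[ p + ( j + jj ) * ldB ] : 0.0;  /* zero-pad edge panel */
  }
}
```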

###### Homework 4.4.3.4.

You already parallelized the packing of the row panel of $B$ in Homework 4.4.3.2.

• Execute

```
export OMP_NUM_THREADS=4
make MT_Loop3_MT_PackB_8x6Kernel
```


• Be sure to check if you got the right answer!

• View the resulting performance with data/Plot_MT_Loop3_MT_Pack_8x6.mlx.

Solution

On Robert's laptop (using 4 threads), the performance is again not noticeably changed.

### Subsubsection 4.4.3.3 Loop five and parallelizing the packing

Next, consider the case where the fifth loop around the micro-kernel has been parallelized. Now, each thread packs a different row panel of $B$ and hence there is no point in parallelizing the packing of that panel. As each thread executes the subsequent loops (loops four through one around the micro-kernel), it packs blocks of $A$ redundantly, since each thread allocates its own space for the packed block. It should be possible to have the threads collaborate on packing a block, but that would require careful synchronization between them. The details are beyond the scope of this course. If you know a bit about OpenMP, you may want to try this idea.
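For those who want to experiment with this idea: within a single OpenMP parallel region, a worksharing `#pragma omp for` loop ends with an implicit barrier, which provides exactly the synchronization needed for the threads to finish packing a shared block before any of them computes with it. The sketch below illustrates the principle with hypothetical names; the second loop is merely a stand-in for the computation (loops two and below) that would consume the packed block:

```c
#define MR 8   /* micro-panel height, matching the 8x6 micro-kernel */

/* Hypothetical sketch: the threads of one parallel region collaborate
   on packing a SHARED buffer Atilde, then all of them use it.  The
   implicit barrier at the end of the first worksharing loop guarantees
   the whole block is packed before any thread reads it. */
void CollaborativePack( int m, int k, const double *A, int ldA,
                        double *Atilde, double *sums )
{
#pragma omp parallel
  {
    /* Threads divide the micro-panels of the shared buffer. */
#pragma omp for
    for ( int i = 0; i < m; i += MR ) {
      int     ib  = ( m - i < MR ) ? m - i : MR;
      double *dst = &Atilde[ i * k ];
      for ( int p = 0; p < k; p++ )
        for ( int ii = 0; ii < MR; ii++ )
          *dst++ = ( ii < ib ) ? A[ ( i + ii ) + p * ldA ] : 0.0;
    }
    /* Implicit barrier here: no thread proceeds until the whole block
       is packed, so every thread may now safely read all of Atilde. */
#pragma omp for
    for ( int i = 0; i < m; i += MR ) {
      double s = 0.0;             /* stand-in for calling the micro-kernel */
      for ( int j = 0; j < MR * k; j++ )
        s += Atilde[ i * k + j ];
      sums[ i / MR ] = s;
    }
  }
}
```

The price of this scheme is that every block of $A$ now incurs a barrier, which is why the text warns that careful synchronization is required; whether the saved redundant packing outweighs the barrier cost is something to measure.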