
Unit 4.4.3 Parallelizing the packing

Some observations:

  • If you choose to parallelize Loop 3 around the microkernel, then the packing into \(\widetilde A \) is already parallelized, since each thread packs its own block of \(\widetilde A \text{.}\) In that case, only the packing of \(\widetilde B \) remains to be parallelized. If you instead choose to parallelize Loop 2 around the microkernel, all threads share a single \(\widetilde A \text{,}\) so you will also want to parallelize the packing of \(\widetilde A \text{.}\)

  • Be careful: we purposely wrote the routines that pack \(\widetilde A \) and \(\widetilde B \) so that a naive parallelization will give the wrong answer. Analyze carefully what happens when multiple threads execute the loop...
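One way such a hazard can arise is sketched below. This is an illustration under stated assumptions, not the course's packing code: it assumes a column-major \(B \text{,}\) a micro-panel width of 4, and a packing loop that advances the destination pointer at the end of each iteration. The names `pack_micro_panel`, `pack_B_sequential`, and `pack_B_parallel` are hypothetical, and padding of a partial last micro-panel is omitted. A pointer that is carried from one iteration to the next makes the iterations dependent, so placing `#pragma omp parallel for` directly on that loop would let threads write to the wrong locations. Computing the destination from the loop index removes the dependence.

```c
#include <assert.h>
#include <stddef.h>

#define NR 4  /* assumed micro-panel width, matching a 12x4 microkernel */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Copy a k x jb micro-panel of column-major B (leading dimension ldB)
   into dst, one row of the micro-panel at a time. */
static void pack_micro_panel(int k, int jb, const double *B, int ldB, double *dst)
{
    for (int p = 0; p < k; p++)
        for (int j = 0; j < jb; j++)
            *dst++ = B[(size_t)j * ldB + p];
}

/* Sequential version: the destination pointer is carried from one
   iteration to the next, so the iterations are NOT independent. */
void pack_B_sequential(int k, int n, const double *B, int ldB, double *Btilde)
{
    for (int j = 0; j < n; j += NR) {
        int jb = min_int(NR, n - j);
        pack_micro_panel(k, jb, &B[(size_t)j * ldB], ldB, Btilde);
        Btilde += (size_t)k * jb;  /* loop-carried dependence */
    }
}

/* Parallel-safe variant: each iteration derives its destination from j.
   Every micro-panel before column j is full (NR wide), so together they
   occupy exactly k * j entries of the packed buffer. */
void pack_B_parallel(int k, int n, const double *B, int ldB, double *Btilde)
{
    assert(k >= 0 && n >= 0 && ldB >= k);
    #pragma omp parallel for
    for (int j = 0; j < n; j += NR) {
        int jb = min_int(NR, n - j);
        pack_micro_panel(k, jb, &B[(size_t)j * ldB], ldB, &Btilde[(size_t)j * k]);
    }
}
```

Both routines should fill the packed buffer identically. The same idea would apply to a routine that packs \(\widetilde A \) if it advances its destination pointer in the same way.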


In directory Week4/C,

  • Copy Gemm_Parallel_Loop2_12x4.c into Gemm_Parallel_Loop2_Parallel_Pack_12x4.c.

  • Parallelize the packing of \(\widetilde B \) and/or \(\widetilde A \text{.}\)

  • Set the number of threads to some number between \(1 \) and the number of cores in the target processor.

  • Execute

    make Parallel_Loop2_Parallel_Pack_12x4

  • View the resulting performance with ShowPerformance.mlx, uncommenting the appropriate lines.

  • Be sure to check whether you got the right answer! This is trickier than you might at first think.
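A correctness check can be as simple as comparing the computed \(C \) entry by entry against a trusted reference result. The helper below is a minimal sketch of such a comparison, assuming column-major storage. The name `max_abs_diff` and its interface are hypothetical and are not part of the course's driver.

```c
#include <assert.h>
#include <stddef.h>

/* Largest absolute entrywise difference between two m x n column-major
   matrices C and Cref with leading dimensions ldC and ldRef. */
double max_abs_diff(int m, int n, const double *C, int ldC,
                    const double *Cref, int ldRef)
{
    assert(ldC >= m && ldRef >= m);
    double diff = 0.0;
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            double d = C[(size_t)j * ldC + i] - Cref[(size_t)j * ldRef + i];
            if (d < 0.0) d = -d;      /* absolute value without math.h */
            if (d > diff) diff = d;
        }
    }
    return diff;
}
```

Because the effect of a data race can vary from run to run, it is worth repeating the check several times with more than one thread rather than trusting a single passing run.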