## Unit 4.4.3 Parallelizing the packing

Some observations:

• If you choose to parallelize Loop 3 around the microkernel, then the packing into $\widetilde A$ is already parallelized, since each thread packs its own such block. So, you will want to parallelize the packing of $\widetilde B$. If you choose to parallelize Loop 2 around the microkernel, you will also want to parallelize the packing of $\widetilde A$.

• Be careful: we purposely wrote the routines that pack $\widetilde A$ and $\widetilde B$ so that a naive parallelization will give the wrong answer. Analyze carefully what happens when multiple threads execute the loop...
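One common way a packing loop resists naive parallelization is a destination pointer that is advanced inside the loop, so that each iteration depends on the one before it. The sketch below illustrates that pattern on a hypothetical routine that packs a column-major $B$ into micro-panels of width `NR`. The names `pack_micro_panel_B`, `pack_panel_B_serial`, and `pack_panel_B_parallel` are assumptions made for illustration and are not the routines shipped with the course, which may differ in detail. The parallel variant computes each micro-panel's destination from the loop index instead of from a running pointer.

```c
#include <assert.h>
#include <stddef.h>

#define NR 4  /* micro-tile width; assumed here to match a 12x4 kernel */

/* Column-major access to B with leading dimension ldB. */
#define beta(i, j) B[(size_t)(j) * ldB + (i)]

static int min_int(int a, int b) { return a < b ? a : b; }

/* Pack a k x n micro-panel (n <= NR) row by row, zero-padding to NR columns. */
static void pack_micro_panel_B(int k, int n, const double *B, int ldB,
                               double *Btilde)
{
    assert(n >= 0 && n <= NR);
    for (int p = 0; p < k; p++) {
        for (int j = 0; j < n; j++)  *Btilde++ = beta(p, j);
        for (int j = n; j < NR; j++) *Btilde++ = 0.0;
    }
}

/* Serial packing: Btilde is advanced after every micro-panel, so iteration
   j needs the pointer value left behind by the previous iteration. Adding
   "#pragma omp parallel for" to this loop as written would let threads
   read and update the shared pointer concurrently. */
void pack_panel_B_serial(int k, int n, const double *B, int ldB,
                         double *Btilde)
{
    for (int j = 0; j < n; j += NR) {
        int jb = min_int(NR, n - j);
        pack_micro_panel_B(k, jb, &beta(0, j), ldB, Btilde);
        Btilde += (size_t)k * NR;
    }
}

/* Parallel-friendly packing: the destination is derived from j alone.
   Every micro-panel before column j is full, so it starts at offset j * k.
   The pragma is ignored when OpenMP is not enabled at compile time. */
void pack_panel_B_parallel(int k, int n, const double *B, int ldB,
                           double *Btilde)
{
    #pragma omp parallel for
    for (int j = 0; j < n; j += NR) {
        int jb = min_int(NR, n - j);
        pack_micro_panel_B(k, jb, &beta(0, j), ldB,
                           Btilde + (size_t)j * k);
    }
}
```

Both variants assume that `Btilde` has room for `k` times `n` rounded up to a multiple of `NR` doubles. The same idea, computing the offset from the loop index, would apply to a routine that packs $\widetilde A$ in micro-panels of height `MR`.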

###### Homework 4.4.3.1.

In directory Week4/C,

• Copy Gemm_Parallel_Loop2_12x4.c into Gemm_Parallel_Loop2_Parallel_Pack_12x4.c.

• Parallelize the packing of $\widetilde B$ and/or $\widetilde A$.

• Set the number of threads to some number between $1$ and the number of CPUs in the target processor.

• Execute

make Parallel_Loop2_Parallel_Pack_12x4

• View the resulting performance with ShowPerformance.mlx, uncommenting the appropriate lines.

• Be sure to check if you got the right answer! This actually is trickier than you might at first think!
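One way to check the answer is to compare the matrix computed by the parallel implementation against one computed by a trusted reference implementation. The helper below is a minimal sketch of such a comparison; the name `max_abs_diff` and its argument list are hypothetical and are not part of the course's driver code. It assumes column-major storage with separate leading dimensions for the two matrices.

```c
#include <stddef.h>

/* Return the largest absolute entrywise difference between the m x n
   column-major matrices C (leading dimension ldC) and Cref (leading
   dimension ldCref). */
double max_abs_diff(int m, int n, const double *C, int ldC,
                    const double *Cref, int ldCref)
{
    double diff = 0.0;
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            double d = C[(size_t)j * ldC + i] - Cref[(size_t)j * ldCref + i];
            if (d < 0.0) d = -d;       /* absolute value without math.h */
            if (d > diff) diff = d;
        }
    }
    return diff;
}
```

Because a race condition depends on how the threads happen to interleave, a single run that matches the reference does not establish correctness. It is worth repeating the comparison for several problem sizes and thread counts.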