Unit 4.3.2 Parallelizing the first loop around the microkernel
One can similarly parallelize the first loop around the microkernel.
Homework 4.3.2.1.
In directory Week4/C,
Copy Gemm_Parallel_Loop2_12x4.c into Gemm_Parallel_Loop1_12x4.c.
Modify it so that only the first loop around the microkernel is parallelized.
Set the number of threads to some number between \(1\) and the number of cores in the target processor.
Execute make Parallel_Loop1_12x4.
View the resulting performance with data/ShowPerformance.mlx, uncommenting the appropriate lines. (You should be able to do this so that you see previous performance curves as well.)
Be sure to check if you got the right answer!
How does the performance improve as the number of threads increases?
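As a hint for the modification asked for above, here is a minimal sketch of where the parallel directive might go. It assumes OpenMP's `#pragma omp parallel for` is the parallelization mechanism. It is not the contents of Gemm_Parallel_Loop1_12x4.c: the names `micro_kernel`, `loop_one`, and `loop_two` are hypothetical stand-ins, the microkernel is plain unoptimized C, and packing is omitted. Matrices are assumed to be stored in column-major order.

```c
#include <assert.h>
#include <stddef.h>

#define MR 12   /* microtile rows, matching the 12x4 kernel in the file names */
#define NR 4    /* microtile columns */

/* Hypothetical stand-in for the microkernel (plain C, no packing):
   C(0:mb-1, 0:nb-1) += A(0:mb-1, 0:k-1) * B(0:k-1, 0:nb-1), column-major. */
void micro_kernel(int mb, int nb, int k,
                  const double *A, int ldA,
                  const double *B, int ldB,
                  double *C, int ldC)
{
    for (int p = 0; p < k; p++)
        for (int j = 0; j < nb; j++)
            for (int i = 0; i < mb; i++)
                C[(size_t)j * ldC + i] +=
                    A[(size_t)p * ldA + i] * B[(size_t)j * ldB + p];
}

/* First loop around the microkernel: steps through the rows of C in
   blocks of MR. This is the only loop that is parallelized. Each
   iteration updates a different block of rows of C, so iterations do
   not write to the same memory. */
void loop_one(int m, int nb, int k,
              const double *A, int ldA,
              const double *B, int ldB,
              double *C, int ldC)
{
    #pragma omp parallel for
    for (int i = 0; i < m; i += MR) {
        int ib = (m - i < MR) ? m - i : MR;   /* fringe block if m % MR != 0 */
        micro_kernel(ib, nb, k, &A[i], ldA, B, ldB, &C[i], ldC);
    }
}

/* Second loop around the microkernel: steps through the columns of C in
   blocks of NR. It is left sequential in this sketch. */
void loop_two(int m, int n, int k,
              const double *A, int ldA,
              const double *B, int ldB,
              double *C, int ldC)
{
    assert(ldA >= m && ldB >= k && ldC >= m);  /* sanity check on leading dimensions */
    for (int j = 0; j < n; j += NR) {
        int jb = (n - j < NR) ? n - j : NR;   /* fringe block if n % NR != 0 */
        loop_one(m, jb, k, A, ldA,
                 &B[(size_t)j * ldB], ldB,
                 &C[(size_t)j * ldC], ldC);
    }
}
```

When compiled with OpenMP support (for example, `-fopenmp` with GCC), the number of threads can be set through the standard `OMP_NUM_THREADS` environment variable. Without OpenMP support the pragma is ignored and the loop runs sequentially. Note that the parallel loop has only about \(m/M_R\) iterations for a block with \(m\) rows and microtile height \(M_R\) (`MR` in the code), which is worth keeping in mind when interpreting the performance curves.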