Unit 3.4.5 Loop unrolling (easy enough; see what happens)
There is some overhead that comes from having a loop in the micro-kernel. That overhead is in part due to the cost of updating the loop index p and the pointers to the buffers, MP_A and MP_B. Unrolling the loop may also allow the compiler to rearrange more instructions in the loop and/or the hardware to better perform out-of-order computation. The reason is that each iteration of the loop involves a branch instruction, where the branch depends on whether the loop is done, and branches get in the way of both compiler optimizations and out-of-order computation by the CPU.
What is loop unrolling? In the case of the micro-kernel, unrolling the loop indexed by p by a factor of two means that each iteration of that loop updates the micro-tile of \(C\) twice instead of once. In other words, each iteration of the unrolled loop performs two iterations of the original loop, and updates p+=2 instead of p++. Unrolling by a larger factor is the natural extension of this idea. Obviously, a loop that is unrolled requires some care in case the number of iterations is not a multiple of the unrolling factor. You may have noticed that our performance experiments always use problem sizes that are multiples of 48. This means that if you use a reasonable unrolling factor that divides 48, you are probably going to be OK.
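To make the idea concrete, here is a minimal sketch of unrolling by a factor of two, written in plain scalar C rather than with vector intrinsics. The names kernel_rolled, kernel_unrolled2, and rank1_update, the values MR = NR = 4, and the packed layout (MR entries of MP_A and NR entries of MP_B per value of p, with the micro-tile of C stored in column-major order) are assumptions made for illustration only; your own micro-kernel will look different.

```c
/* Hypothetical micro-tile dimensions, chosen only for this sketch. */
#define MR 4
#define NR 4

/* One rank-1 update of the MR x NR micro-tile of C (column-major, leading
   dimension ldC): C += a * b^T, where a has MR entries and b has NR. */
static void rank1_update(const double *a, const double *b, double *C, int ldC)
{
    for (int j = 0; j < NR; j++)
        for (int i = 0; i < MR; i++)
            C[j * ldC + i] += a[i] * b[j];
}

/* Original loop: one update of the micro-tile per iteration, p++. */
void kernel_rolled(int k, const double *MP_A, const double *MP_B,
                   double *C, int ldC)
{
    for (int p = 0; p < k; p++) {
        rank1_update(MP_A, MP_B, C, ldC);
        MP_A += MR;
        MP_B += NR;
    }
}

/* Unrolled by a factor of two: two updates per iteration, p += 2.
   The index and the pointers are updated half as often, and the
   loop branch is executed half as often. */
void kernel_unrolled2(int k, const double *MP_A, const double *MP_B,
                      double *C, int ldC)
{
    int p = 0;
    for (; p + 1 < k; p += 2) {
        rank1_update(MP_A,      MP_B,      C, ldC);  /* original iteration p   */
        rank1_update(MP_A + MR, MP_B + NR, C, ldC);  /* original iteration p+1 */
        MP_A += 2 * MR;
        MP_B += 2 * NR;
    }
    /* Clean-up: if k is odd, one iteration of the original loop remains. */
    if (p < k)
        rank1_update(MP_A, MP_B, C, ldC);
}
```

The clean-up step after the loop is the "care" mentioned above. If k is known to be a multiple of the unrolling factor, it never executes, but leaving it in keeps the routine correct for any k.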
Similarly, one can unroll the loops in the packing routines.
See what happens!
Modify your favorite implementation so that it unrolls the loop in the micro-kernel with different unrolling factors.
(Note: I have not implemented this myself yet...)