## Unit 3.4.5 Loop unrolling (easy enough; see what happens)

Having a loop in the micro-kernel incurs some overhead. Part of that overhead is the cost of updating the loop index p and the pointers into the buffers, MP_A and MP_B. In addition, each iteration of the loop involves a branch instruction, where the branch depends on whether the loop is done, and branches get in the way of both compiler optimizations and out-of-order computation by the CPU. "Unrolling" the loop reduces the number of branches executed, which may allow the compiler to rearrange more instructions within the loop body and/or the hardware to better perform out-of-order computation.

What is loop unrolling? In the case of the micro-kernel, unrolling the loop indexed by p by a factor of two means that each iteration of that loop updates the micro-tile of $C$ twice instead of once. In other words, each iteration of the unrolled loop performs two iterations of the original loop and updates p+=2 instead of p++. Unrolling by a larger factor is the natural extension of this idea. An unrolled loop requires some care when the number of iterations is not a multiple of the unrolling factor: the leftover iterations must be handled separately. You may have noticed that our performance experiments always use problem sizes that are multiples of 48. This means that if you use a reasonable unrolling factor that divides 48, you are probably going to be OK.
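To make the idea concrete, here is a minimal sketch of unrolling by a factor of two, written in plain scalar C rather than with vector intrinsics. The function names `microkernel_ref`, `microkernel_unroll2`, and `rank1_update`, the 4 x 4 micro-tile size, and the column-major storage of $C$ with leading dimension `ldC` are assumptions made for illustration; only p, MP_A, and MP_B come from the text. The second routine performs two rank-1 updates per loop iteration and finishes with a cleanup step for the case where k is odd.

```c
#include <assert.h>

#define MR 4   /* rows of the micro-tile (assumed for this sketch)    */
#define NR 4   /* columns of the micro-tile (assumed for this sketch) */

/* One rank-1 update of the MR x NR micro-tile: C += a * b^T,
   where a has MR entries and b has NR entries. */
static void rank1_update(const double *a, const double *b, double *C, int ldC)
{
    for (int j = 0; j < NR; j++)
        for (int i = 0; i < MR; i++)
            C[i + j * ldC] += a[i] * b[j];
}

/* Original loop: one update of the micro-tile per iteration. */
void microkernel_ref(int k, const double *MP_A, const double *MP_B,
                     double *C, int ldC)
{
    assert(k >= 0 && ldC >= MR);
    for (int p = 0; p < k; p++) {
        rank1_update(MP_A, MP_B, C, ldC);
        MP_A += MR;
        MP_B += NR;
    }
}

/* Unrolled by a factor of two: two updates per iteration, p += 2. */
void microkernel_unroll2(int k, const double *MP_A, const double *MP_B,
                         double *C, int ldC)
{
    assert(k >= 0 && ldC >= MR);
    int p = 0;
    for (; p + 1 < k; p += 2) {
        rank1_update(MP_A,      MP_B,      C, ldC);
        rank1_update(MP_A + MR, MP_B + NR, C, ldC);
        MP_A += 2 * MR;   /* pointers are updated once per two updates */
        MP_B += 2 * NR;
    }
    /* Cleanup: one leftover iteration when k is odd. */
    if (p < k)
        rank1_update(MP_A, MP_B, C, ldC);
}
```

Because both versions apply the updates to each element of $C$ in the same order, they should produce the same result; the unrolled version simply executes about half as many loop branches and pointer updates. Whether that translates into better performance depends on the compiler and the hardware, which is exactly what the experiment is meant to reveal.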

Similarly, one can unroll the loops in the packing routines.
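As a rough illustration of unrolling in a packing routine, the sketch below copies an MR x k block of a column-major matrix into a contiguous micro-panel, two columns per loop iteration. The names `pack_micropanel_ref` and `pack_micropanel_unroll2`, the value of `MR`, and the restriction to a full block of exactly MR rows are assumptions made for this example; they are not the course's actual packing routines.

```c
#include <assert.h>

#define MR 4   /* rows in a micro-panel of A (assumed for this sketch) */

/* Original loop: copy one column of MR entries per iteration.
   A is column-major with leading dimension ldA. */
void pack_micropanel_ref(int k, const double *A, int ldA, double *Atilde)
{
    assert(k >= 0 && ldA >= MR);
    for (int p = 0; p < k; p++)
        for (int i = 0; i < MR; i++)
            *Atilde++ = A[i + p * ldA];
}

/* Unrolled by a factor of two: copy columns p and p+1 per iteration. */
void pack_micropanel_unroll2(int k, const double *A, int ldA, double *Atilde)
{
    assert(k >= 0 && ldA >= MR);
    int p = 0;
    for (; p + 1 < k; p += 2) {
        for (int i = 0; i < MR; i++) {
            Atilde[i]      = A[i + p * ldA];
            Atilde[MR + i] = A[i + (p + 1) * ldA];
        }
        Atilde += 2 * MR;
    }
    /* Cleanup: one leftover column when k is odd. */
    if (p < k)
        for (int i = 0; i < MR; i++)
            Atilde[i] = A[i + p * ldA];
}
```

Packing is dominated by memory traffic rather than arithmetic, so the benefit of unrolling here may well be smaller than in the micro-kernel.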

See what happens!

###### Homework 3.4.5.1.

Modify your favorite implementation so that it unrolls the loop in the micro-kernel with different unrolling factors.

(Note: I have not implemented this myself yet...)