## Unit 3.3.6 Micro-kernel with packed data

Reference implementations of the packing routines can be found in Figure 3.3.2, Figure 3.3.3, Figure 3.3.5, and Figure 3.3.6. While these implementations can be optimized, the cost of packing is dominated by the movement of data between main memory and faster memory. As a result, optimizing the packing routines has relatively little effect.
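To make the data movement concrete, here is a minimal sketch of packing one micro-panel of $A$. It is not the reference implementation from the figures. The function name `pack_micropanel_a_sketch`, the column-major storage with leading dimension `ldA`, and the zero padding of a short panel are all assumptions made for illustration.

```c
#include <assert.h>

#define MR 4   /* assumed micro-tile height m_R */

/* Hypothetical sketch: copy an m x k block of column-major A (m <= MR)
   into the contiguous buffer Atilde, one column of MR entries at a time.
   Rows m..MR-1 are filled with zeros (an assumption of this sketch). */
void pack_micropanel_a_sketch(int m, int k, const double *A, int ldA,
                              double *Atilde)
{
    assert(m >= 0 && m <= MR && k >= 0 && ldA >= m);

    for (int p = 0; p < k; p++) {
        /* Copy column p of the block. */
        for (int i = 0; i < m; i++)
            *Atilde++ = A[p * ldA + i];
        /* Pad so that every packed column holds exactly MR entries. */
        for (int i = m; i < MR; i++)
            *Atilde++ = 0.0;
    }
}
```

Each element is read once and written once, so the loop body does almost no arithmetic. This is consistent with the observation that the time goes into moving data rather than into the instructions that move it.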

Unit 3.3.5 illustrates how to modify the five loops to incorporate packing. Figure 3.3.8 illustrates a micro-kernel that computes with the packed data when $m_R \times n_R = 4 \times 4$.
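The structure of such a micro-kernel can be sketched in plain, portable C. This is a scalar illustration, not the code in Figure 3.3.8, and it would not reach the performance of a vectorized kernel. The name `gemm_4x4_packed_sketch` is hypothetical. The sketch assumes that the micro-panel of $A$ is packed as $k$ consecutive columns of $m_R$ entries, that the micro-panel of $B$ is packed as $k$ consecutive rows of $n_R$ entries, and that $C$ is stored in column-major order with leading dimension `ldC`.

```c
#include <assert.h>

#define MR 4   /* assumed micro-tile height m_R */
#define NR 4   /* assumed micro-tile width  n_R */

/* Hypothetical scalar sketch: C(0:MR-1, 0:NR-1) += Atilde * Btilde,
   where Atilde and Btilde are packed micro-panels (layout assumed above). */
void gemm_4x4_packed_sketch(int k, const double *Atilde,
                            const double *Btilde, double *C, int ldC)
{
    assert(k >= 0 && ldC >= MR);

    /* Load the micro-tile of C into a local accumulator. */
    double acc[MR * NR];
    for (int j = 0; j < NR; j++)
        for (int i = 0; i < MR; i++)
            acc[j * MR + i] = C[j * ldC + i];

    /* One rank-1 update per step: column p of A times row p of B. */
    for (int p = 0; p < k; p++) {
        for (int j = 0; j < NR; j++)
            for (int i = 0; i < MR; i++)
                acc[j * MR + i] += Atilde[i] * Btilde[j];
        Atilde += MR;   /* the next packed column is contiguous */
        Btilde += NR;   /* the next packed row is contiguous */
    }

    /* Store the updated micro-tile back into C. */
    for (int j = 0; j < NR; j++)
        for (int i = 0; i < MR; i++)
            C[j * ldC + i] = acc[j * MR + i];
}
```

Because both micro-panels are packed, the kernel needs no leading dimensions for $A$ and $B$: it advances through each buffer with unit stride, which is the access pattern packing is meant to provide.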

###### Homework 3.3.6.2.

Copy the file Gemm_4x4Kernel_Packed.c into file Gemm_12x4Kernel_Packed.c. Modify that file so that it uses $m_R \times n_R = 12 \times 4$. Test the result with

make Five_Loops_Packed_12x4Kernel


and view the resulting performance with Live Script Plot_Five_Loops.mlx.

Solution

Assignments/Week3/Answers/Gemm_12x4Kernel_Packed.c

On Robert's laptop:

Now we are getting somewhere!