As mentioned, there are many choices for $m_R \times n_R \text{.}$ In our discussions, we focused on $12 \times 4 \text{.}$ It may be that different choices for $m_R \times n_R$ yield better performance now that packing has been added into the implementation.
Once you have determined the best $m_R \times n_R \text{,}$ you may want to go back and redo Homework 3.3.6.3 to determine the best $m_C$ and $k_C \text{.}$ Then, collect final performance data once you have updated the Makefile with the best choices.