Unit3.2.3Which cache to target?

Homework3.2.3.1.

In Assignments/Week3/C, execute

make PI_JI_12x4_MCxKC

and view the performance results with Live Script Assignments/Week3/C/data/Plot_MC_KC_Performance.mlx.

This experiment tries many different choices for $m_C$ and $k_C \text{,}$ and presents them as a scatter graph so that the optimal choice can be determined and visualized.

It will take a long time to execute. Go take a nap, catch a movie, and have some dinner! If you have a relatively old processor, you may even want to run it over night. Try not to do anything else on your computer, since that may affect the performance of an experiment. We went as far as to place something under the laptop to lift it off the table, to improve air circulation. You will notice that the computer will start making a lot of noise, as fans kick in to try to cool the core.

Solution

The last homework on Robert's laptop yields the performance in Figure 3.2.8. Figure 3.2.8. Performance when $m_R = 12$ and $n_R = 4$ and $m_C$ and $k_C$ are varied. The best performance is empirically determined by searching the results. We notice that the submatrix $A_{i,p}$ is sized to fit in the L2 cache. As we apply additional optimizations, the optimal choice for $m_C$ and $k_C$ will change. Notice that on Robert's laptop, the optimal choice varies greatly from one experiment to the next. You may observe the same level of variability on your computer.

To summarize, our insights suggest

1. Bring an $m_C \times k_C$ submatrix of $A$ into the cache, at a cost of $m_C \times k_C \beta_{M \leftrightarrow C} \text{,}$ or $m_C \times k_C$ memory operations (memops). (It is the order of the loops and instructions that encourages the submatrix of $A$ to stay in cache.) This cost is amortized over $2 m_C n k_C$ flops for a ratio of $2 m_C n k_C / (m_C k_C) = 2 n \text{.}$ The larger $n \text{,}$ the better.

2. The cost of reading an $k_C \times n_R$ submatrix of $B$ is amortized over $2 m_C n_R k_C$ flops, for a ratio of $2 m_C n_R k_C / ( k_C n_R) = 2 m_C \text{.}$ Obviously, the greater $m_C \text{,}$ the better.

3. The cost of reading and writing an $m_R \times n_R$ submatrix of $C$ is now amortized over $2 m_R n_R k_C$ flops, for a ratio of $2 m_R n_R k_C / ( 2 m_R n_R) = k_C \text{.}$ Obviously, the greater $k_C \text{,}$ the better.

Item 2 and Item 3 suggest that $m_C \times k_C$ submatrix $A_{i,p}$ be squarish.

If we revisit the performance data plotted in Figure 3.2.8, we notice that matrix $A_{i,p}$ fits in part of the L2 cache, but is too big for the L1 cache. What this means is that the cost of bringing elements of $A$ into registers from the L2 cache is low enough that computation can offset that cost. Since the L2 cache is larger than the L1 cache, this benefits performance, since the $m_C$ and $k_C$ can be larger.