## Unit2.4.4More options

You have now experienced the fact that modern architectures have vector registers and how to (somewhat) optimize the update of a $4 \times 4$ submatrix of $C$ with vector instructions.

In the implementation in Unit 2.4.2, the block of $C$ kept in registers is $4 \times 4 \text{.}$ In general, the block is $m_R \times n_R \text{.}$ If we assume we will use $R_C$ registers for elements of the submatrix of $C \text{,}$ what should $m_R$ and $n_R$ be to attain the best performance?

A big part of optimizing is to amortize the cost of moving data over useful computation. In other words, we want the ratio between the number of flops that are performed to the number of data that are moved between, for now, registers and memory to be high.

Let's recap: Partitioning

\begin{equation*} C = \left( \begin{array}{c | c | c} C_{0,0} \amp \cdots \amp C_{0,N-1} \\ \hline \vdots \amp \amp \vdots \\ \hline C_{M-1,0} \amp \cdots \amp C_{M-1,N-1} \end{array} \right), \end{equation*}
\begin{equation*} A = \left( \begin{array}{c | c | c} A_{0} \\ \hline \vdots \\ \hline A_{M-1} \end{array} \right), \quad B = \left( \begin{array}{c | c | c} B_{0} \amp \cdots \amp B_{N-1} \\ \end{array} \right), \end{equation*}

the algorithm we discovered in Section 2.3 is given by

\begin{equation*} \begin{array}{l} {\bf for~} i := 0, \ldots , M-1 \\ ~~~ {\bf for~} j := 0, \ldots , N-1 \\ ~~~ ~~~ {\rm Load~} C_{i,j} \mbox{ into registers} \\ ~~~ ~~~ C_{i,j} := A_{i} B_{j} + C_{i,j} {\rm ~~with~micro-kernel} \\ ~~~ ~~~ {\rm Store~} C_{i,j} \mbox{ to memory} \\ ~~~ {\bf end} \\ {\bf end} \end{array} \end{equation*}

We analyzed that its cost is given by

\begin{equation*} 2 m n k \gamma_R+ \left[ 2 m n + m n k ( \frac{1}{n_R} + \frac{1}{m_R} ) \right] \beta_{R \leftrightarrow M}. \end{equation*}

The ratio between flops and memory operations between the registers and memory is then

\begin{equation*} \frac{2 m n k} {2 m n + m n k ( \frac{1}{n_R} + \frac{1}{m_R} )} \end{equation*}

If $k$ is large, then $2 m n$ (the cost of loading and storing the $m_R \times n_R$ submatrices of $C$) can be ignored in the denominator, yielding, approximately,

\begin{equation*} \frac{2 m n k} {m n k ( \frac{1}{n_R} + \frac{1}{m_R} )} = \frac{2}{\frac{1}{n_R}+ \frac{1}{m_R} } = \frac{2}{\frac{m_R}{m_R n_R}+ \frac{n_R}{m_R n_R} } = \frac{2 m_R n_R}{m_R + n_R}. \end{equation*}

This is the ratio of floating point operations to memory operations that we want to be high.

If $m_R = n_R = 4$ then this ratio is $4 \text{.}$ For every memory operation (read) of an element of $A$ or $B \text{,}$ approximately $4$ floating point operations are performed with data that resides in registers.

###### Homework2.4.4.1.

Modify Assignments/Week2/C/Gemm_4x4Kernel.c to implement the case where $m_R = 8$ and $n_R = 4 \text{,}$ storing the result in Gemm_8x4Kernel.c. You can test the result by executing

make JI_8x4Kernel


in that directory. View the resulting performance by appropriately modifying Assignments/Week2/C/data/Plot_optimize_MRxNR.mlx.

Hint

A vector register can only hold four doubles... Each column now has four doubles. How many vector registers do you need to store a column?

Solution
###### Homework2.4.4.2.

Consider the implementation in Homework 2.4.4.1 with $m_R \times n_R = 8 \times 4 \text{.}$

• How many vector registers are needed?

• Ignoring the cost of loading the registers with the $8 \times 4$ submatrix of $C \text{,}$ what is the ratio of flops to loads?

Solution

Number of registers:

\begin{equation*} 8 + 3 = 11 \end{equation*}

Ratio:

\begin{equation*} 64 {\rm ~flops}/12 {\rm ~loads} \approx 5.33 {\rm ~flops}/{\rm load}. \end{equation*}
###### Homework2.4.4.3.

We have considered $m_R \times n_R = 4 \times 4$ and $m_R \times n_R = 8 \times 4 \text{,}$ where elements of $A$ are loaded without duplication into vector registers (and hence $m_R$ must be a multiple of $4$), and elements of $B$ are loaded/broadcast. Extending this approach to loading $A$ and $B \text{,}$ complete the entries in the following table:

\begin{equation*} \begin{array}{| c | c | c |} \hline \amp {\rm \#~of~vector~regs} \amp {\rm flops/load} \\ \hline 4 \times 1 \amp ~~~~ \amp ~~~ / ~~~ = ~~~~~~ \\ \hline 4 \times 2 \amp ~~~~ \amp ~~~ / ~~~ = ~~~~~~ \\ \hline 4 \times 4 \amp 6 \amp 32 / 8 = 4 \\ \hline 4 \times 8 \amp ~~~~ \amp ~~~ / ~~~ = ~~~~~~~ \\ \hline 4 \times 12 \amp ~~~~ \amp ~~~ / ~~~ = ~~~~~~ \\ \hline 4 \times 14 \amp ~~~~ \amp ~~~ / ~~~ = ~~~~~~ \\ \hline 8 \times 1 \amp ~~~~ \amp ~~~ / ~~~ = ~~~~~~ \\ \hline 8 \times 2 \amp ~~~~ \amp ~~~ / ~~~ = ~~~~~~ \\ \hline 8 \times 4 \amp ~~~~ \amp ~~~ / ~~~ = ~~~~~~ \\ \hline 8 \times 6 \amp ~~~~ \amp ~~~ / ~~~ = ~~~~~~ \\ \hline 8 \times 8 \amp ~~~~ \amp ~~~ / ~~~ = ~~~~~~ \\ \hline 12 \times 1 \amp ~~~~ \amp ~~~ / ~~~ = ~~~~~~ \\ \hline 12 \times 2 \amp ~~~~ \amp ~~~ / ~~~ = ~~~~~~ \\ \hline 12 \times 4 \amp ~~~~ \amp ~~~ / ~~~ = ~~~~~~ \\ \hline 12 \times 6 \amp ~~~~ \amp ~~~ / ~~~ = ~~~~~~ \\ \hline \end{array} \end{equation*}

Solution
\begin{equation*} \begin{array}{| c | c | r |} \hline \amp {\rm \#~of~vector~regs} \amp {\rm flops/load} \\ \hline 4 \times 1 \amp 3 \amp 8 / ~5 = 1.60 \\ \hline 4 \times 2 \amp 4 \amp 16 / ~6 = 2.66 \\ \hline 4 \times 4 \amp 6 \amp 32 /~ 8 = 4.00 \\ \hline 4 \times 8 \amp 10 \amp 64/12 = 5.44 \\ \hline 4 \times 12 \amp 14 \amp 96 / 16 = 6.00 \\ \hline 4 \times 14 \amp 16 \amp 112 / 18 = 6.22 \\ \hline 8 \times 1 \amp 5 \amp 16 / ~9 = 1.78 \\ \hline 8 \times 2 \amp 7 \amp 32 / 10 = 3.20 \\ \hline 8 \times 4 \amp 11 \amp 64 / 12 = 5.33 \\ \hline 8 \times 6 \amp 15 \amp 96 /14 = 6.86 \\ \hline 8 \times 8 \amp 19 \amp 128 /16 = 8.00 \\ \hline 12 \times 1 \amp 7 \amp 24 / 13 = 1.85 \\ \hline 12 \times 2 \amp 10 \amp 48 /14 = 3.43 \\ \hline 12 \times 4 \amp 16 \amp 96 /16 = 6.00 \\ \hline 12 \times 6 \amp 22 \amp 144 /18 = 8.00 \\ \hline \end{array} \end{equation*}
###### Remark2.4.6.

Going forward, keep in mind that the cores of the architectures we target only have 16 vector registers that can store four doubles each.

###### Homework2.4.4.4.

At this point, you have already implemented the following kernels: Gemm_4x4Kernel.c and Gemm_8x4Kernel.c. Implement as many of the more promising of the kernels you analyzed in the last homework as you like. Your implementations can be executed by typing

make JI_?x?Kernel


where the ?'s are replaced with the obvious choices of $m_R$ and $n_R \text{.}$ The resulting performance can again be viewed with Live Script in Assignments/Week2/C/data/Plot_optimize_MRxNR.mlx.

Solution

On Robert's laptop:

Performance of implementations for $m_R = 4$ and various choices for $n_R \text{:}$

Performance of implementations for $m_R = 8$ and various choices for $n_R \text{:}$

Performance of implementations for $m_R = 12$ and various choices for $n_R \text{:}$