## Unit 2.4.1 Vector registers and instructions

While the last unit introduced the notion of registers, modern CPUs accelerate computation by operating on small vectors of numbers (doubles) simultaneously.

As the reader should have noticed by now, in matrix-matrix multiplication every floating point multiplication is paired with a floating point addition that accumulates the result:

\begin{equation*} \gamma_{i,j} := \alpha_{i,p} \beta_{p,j} + \gamma_{i,j} \end{equation*}

For this reason, such floating point computations are usually cast in terms of fused multiply-add (FMA) operations, performed by a floating point unit (FPU) of the core.
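The multiply-add pairing can be seen in a minimal C sketch of the update of a single element $\gamma_{i,j}$. The function name `gamma_update` and its parameters are hypothetical and chosen only for illustration; the sketch assumes the entries $\alpha_{i,p}$ are separated by a stride `lda` in memory and the entries $\beta_{p,j}$ are contiguous.

```c
/* Hypothetical helper (not from the text): computes
   gamma := alpha_{i,p} * beta_{p,j} + gamma   for p = 0, ..., k-1.
   a   points to alpha_{i,0}; consecutive p are lda doubles apart.
   b   points to beta_{0,j};  consecutive p are contiguous. */
double gamma_update(int k, const double *a, int lda,
                    const double *b, double gamma)
{
    for (int p = 0; p < k; p++)
        gamma += a[p * lda] * b[p];   /* one multiply paired with one add */
    return gamma;
}
```

Each iteration of the loop performs exactly one multiplication and one addition. Whether a compiler actually emits a hardware FMA instruction for that line depends on the target and the compiler flags.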

What is faster than computing one FMA at a time? Computing multiple FMAs at a time! For this reason, modern cores compute with small vectors of data, performing the same FMA on corresponding elements in those vectors. This is referred to as "SIMD" computation: Single Instruction, Multiple Data. It exploits data-level parallelism: one instruction operates on several data elements at once.

Let's revisit the computation

\begin{equation*} \begin{array}{l} \left( \begin{array}{c c c c} \gamma_{0,0} & \gamma_{0,1} & \gamma_{0,2} & \gamma_{0,3} \\ \gamma_{1,0} & \gamma_{1,1} & \gamma_{1,2} & \gamma_{1,3} \\ \gamma_{2,0} & \gamma_{2,1} & \gamma_{2,2} & \gamma_{2,3} \\ \gamma_{3,0} & \gamma_{3,1} & \gamma_{3,2} & \gamma_{3,3} \end{array} \right) +:= \left( \begin{array}{c} \alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p} \end{array} \right) \left( \begin{array}{c c c c} \beta_{p,0} & \beta_{p,1} & \beta_{p,2} & \beta_{p,3} \end{array} \right) \\ ~~~= \left( \begin{array}{c | c | c | c} \beta_{p,0} \left( \begin{array}{c} \alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p} \end{array} \right) & \beta_{p,1} \left( \begin{array}{c} \alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p} \end{array} \right) & \beta_{p,2} \left( \begin{array}{c} \alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p} \end{array} \right) & \beta_{p,3} \left( \begin{array}{c} \alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p} \end{array} \right) \end{array} \right) \end{array} \end{equation*}

that is at the core of the micro-kernel.
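Written out with scalar operations, this update might look like the following C sketch. It assumes column-major storage of the $4 \times 4$ block of $C$ with a leading dimension `ldC`; the function name `rank1_update_4x4` and the constants `MR` and `NR` are illustrative choices, not names taken from the text.

```c
#define MR 4   /* rows in the block of C (assumed) */
#define NR 4   /* columns in the block of C (assumed) */

/* Hypothetical sketch: C(0:3,0:3) += a * b^T, where
   a holds alpha_{0,p}, ..., alpha_{3,p} and
   b holds beta_{p,0},  ..., beta_{p,3}.
   C is stored column-major with leading dimension ldC. */
void rank1_update_4x4(double *C, int ldC, const double *a, const double *b)
{
    for (int j = 0; j < NR; j++)          /* one column of C at a time */
        for (int i = 0; i < MR; i++)      /* column j += beta_{p,j} * a */
            C[j * ldC + i] += a[i] * b[j];
}
```

The inner loop applies the same multiply-add to four consecutive entries of a column of $C$, each time with the same value `b[j]`.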

If a vector register has length four, then it can hold four double-precision numbers. Let's load one such vector register with the first column of the submatrix of $C \text{,}$ a second vector register with the column vector from $A \text{,}$ and a third with the element $\beta_{p,0}$ of $B \text{,}$ duplicated in all four entries:

\begin{equation*} \begin{array}{|c|}\hline \gamma_{0,0} \\ \hline \gamma_{1,0} \\ \hline \gamma_{2,0} \\ \hline \gamma_{3,0} \\ \hline \end{array} ~~~~~ \phantom{ \begin{array}{c} +:= \\ +:= \\ +:= \\ +:= \end{array} } ~~~~~ \begin{array}{|c|}\hline \alpha_{0,p} \\ \hline \alpha_{1,p} \\ \hline \alpha_{2,p} \\ \hline \alpha_{3,p} \\ \hline \end{array} ~~~~~ \phantom{ \begin{array}{c} \times \\ \times \\ \times \\ \times \end{array} } ~~~~~ \begin{array}{|c|}\hline \beta_{p,0} \\ \hline \beta_{p,0} \\ \hline \beta_{p,0} \\ \hline \beta_{p,0} \\ \hline \end{array} \end{equation*}

A vector instruction that simultaneously performs FMAs with each tuple $( \gamma_{i,0} , \alpha_{i,p}, \beta_{p,0} )$ can then be performed:

\begin{equation*} \begin{array}{|c|}\hline \gamma_{0,0} \\ \hline \gamma_{1,0} \\ \hline \gamma_{2,0} \\ \hline \gamma_{3,0} \\ \hline \end{array} ~~~~~ \begin{array}{c} +:= \\ +:= \\ +:= \\ +:= \end{array} ~~~~~ \begin{array}{|c|}\hline \alpha_{0,p} \\ \hline \alpha_{1,p} \\ \hline \alpha_{2,p} \\ \hline \alpha_{3,p} \\ \hline \end{array} ~~~~~ \begin{array}{c} \times \\ \times \\ \times \\ \times \end{array} ~~~~~ \begin{array}{|c|}\hline \beta_{p,0} \\ \hline \beta_{p,0} \\ \hline \beta_{p,0} \\ \hline \beta_{p,0} \\ \hline \end{array} \end{equation*}

You may recognize that this setup is ideal for performing an axpy operation with a small vector (of size 4 in this example).
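As a rough illustration, the vector FMA described above can be sketched in C with Intel AVX/FMA intrinsics. This is a sketch under stated assumptions rather than a definitive implementation: the function name `axpy4` is hypothetical, unaligned loads and stores are assumed, and the intrinsic path is only compiled when the compiler defines `__AVX__` and `__FMA__` (for example with `-mavx2 -mfma` on GCC or Clang). Otherwise a plain scalar loop computes the same result.

```c
#if defined(__AVX__) && defined(__FMA__)
#include <immintrin.h>
#endif

/* Hypothetical helper: c(0:3) += beta * a(0:3), i.e.
   gamma_{i,0} += alpha_{i,p} * beta_{p,0} for i = 0, ..., 3. */
void axpy4(double *c, const double *a, double beta)
{
#if defined(__AVX__) && defined(__FMA__)
    __m256d gamma_0123 = _mm256_loadu_pd(c);    /* four entries of a column of C */
    __m256d alpha_0123 = _mm256_loadu_pd(a);    /* four entries of the column of A */
    __m256d beta_p0    = _mm256_set1_pd(beta);  /* beta duplicated in all four slots */

    /* One instruction performs four FMAs: gamma := alpha * beta + gamma */
    gamma_0123 = _mm256_fmadd_pd(alpha_0123, beta_p0, gamma_0123);

    _mm256_storeu_pd(c, gamma_0123);            /* write the updated column back */
#else
    /* Portable fallback: the same four multiply-adds, one at a time */
    for (int i = 0; i < 4; i++)
        c[i] += a[i] * beta;
#endif
}
```

Calling `axpy4` once per column of the $4 \times 4$ block, each time with a different element of the row of $B \text{,}$ carries out the full rank-1 update.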