Unit 2.4.2 Implementing the micro-kernel with vector instructions

Remark 2.4.1.

This unit is one of the most important ones in the course. Take your time to understand it. The syntax of the intrinsic library is less important than the concepts: as new architectures become popular, new vector instructions will be introduced.

In this section, we are going to use the AVX2 vector instruction set to illustrate the ideas. Vector intrinsic functions support the use of these instructions from within the C programming language. You will discover how the intrinsic operations are used by incorporating them into an implementation of the micro-kernel for the case where $m_R \times n_R = 4 \times 4 \text{.}$ Thus, for the remainder of this unit, $C$ is $m_R \times n_R = 4 \times 4 \text{,}$ $A$ is $m_R \times k = 4 \times k \text{,}$ and $B$ is $k \times n_R = k \times 4 \text{.}$

The architectures we target have 256-bit vector registers, which means each register can store four double precision floating point numbers. Vector operations with vector registers hence operate on vectors of size four.

Let's recap: Partitioning

\begin{equation*} C = \left( \begin{array}{c | c | c} C_{0,0} \amp \cdots \amp C_{0,N-1} \\ \hline \vdots \amp \amp \vdots \\ \hline C_{M-1,0} \amp \cdots \amp C_{M-1,N-1} \end{array} \right), \end{equation*}
\begin{equation*} A = \left( \begin{array}{c} A_{0} \\ \hline \vdots \\ \hline A_{M-1} \end{array} \right), \quad B = \left( \begin{array}{c | c | c} B_{0} \amp \cdots \amp B_{N-1} \\ \end{array} \right), \end{equation*}

the algorithm we discovered in Section 2.3 is given by

\begin{equation*} \begin{array}{l} {\bf for~} i := 0, \ldots , M-1 \\ ~~~ {\bf for~} j := 0, \ldots , N-1 \\ ~~~ ~~~ {\rm Load~} C_{i,j} \mbox{ into registers} \\ ~~~ ~~~ C_{i,j} := A_{i} B_{j} + C_{i,j} {\rm ~~with~micro-kernel} \\ ~~~ ~~~ {\rm Store~} C_{i,j} \mbox{ to memory} \\ ~~~ {\bf end} \\ {\bf end} \end{array} \end{equation*}

and translates into the code in Figure 2.4.2.

Let's drop the subscripts and focus on computing $C +:= A B$ with the micro-kernel, where $C$ is $4 \times 4 \text{.}$ This translates to the computation

\begin{equation*} \begin{array}{l} \left( \begin{array}{c | c | c | c} \gamma_{0,0} \amp \gamma_{0,1} \amp \gamma_{0,2} \amp \gamma_{0,3} \\ \gamma_{1,0} \amp \gamma_{1,1} \amp \gamma_{1,2} \amp \gamma_{1,3} \\ \gamma_{2,0} \amp \gamma_{2,1} \amp \gamma_{2,2} \amp \gamma_{2,3} \\ \gamma_{3,0} \amp \gamma_{3,1} \amp \gamma_{3,2} \amp \gamma_{3,3} \end{array}\right) \\ ~~~ +:= \left( \begin{array}{c | c | c } \alpha_{0,0} \amp \color{red} \alpha_{0,1} \amp \cdots \\ \alpha_{1,0} \amp \color{red} \alpha_{1,1} \amp \cdots \\ \alpha_{2,0} \amp \color{red} \alpha_{2,1} \amp \cdots \\ \alpha_{3,0} \amp \color{red} \alpha_{3,1} \amp \cdots \end{array}\right) \left( \begin{array}{c c c c} \beta_{0,0} \amp \beta_{0,1} \amp \beta_{0,2} \amp \beta_{0,3} \\ \beta_{1,0} \amp \color{blue} \beta_{1,1} \amp \beta_{1,2} \amp \beta_{1,3} \\ \vdots \amp \vdots \amp \vdots \amp \vdots \end{array}\right) \\ ~~~ = \left(\begin{array}{c | c | c | c } \alpha_{0,0} \beta_{0,0} \!\!+\!\! {\color{red} \alpha_{0,1}} \beta_{1,0} \!\!+\!\! \cdots \amp \alpha_{0,0} \beta_{0,1} \!\!+\!\! {\color{red} \alpha_{0,1}} {\color{blue} \beta_{1,1}} \!\!+\!\! \cdots \amp \alpha_{0,0} \beta_{0,2} \!\!+\!\! {\color{red} \alpha_{0,1}} \beta_{1,2} \!\!+\!\! \cdots \amp \alpha_{0,0} \beta_{0,3} \!\!+\!\! {\color{red} \alpha_{0,1}} \beta_{1,3} \!\!+\!\! \cdots \\ \alpha_{1,0} \beta_{0,0} \!\!+\!\! {\color{red} \alpha_{1,1}} \beta_{1,0} \!\!+\!\! \cdots \amp \alpha_{1,0} \beta_{0,1} \!\!+\!\! {\color{red} \alpha_{1,1}} {\color{blue} \beta_{1,1}} \!\!+\!\! \cdots \amp \alpha_{1,0} \beta_{0,2} \!\!+\!\! {\color{red} \alpha_{1,1}} \beta_{1,2} \!\!+\!\! \cdots \amp \alpha_{1,0} \beta_{0,3} \!\!+\!\! {\color{red} \alpha_{1,1}} \beta_{1,3} \!\!+\!\! \cdots \\ \alpha_{2,0} \beta_{0,0} \!\!+\!\! {\color{red} \alpha_{2,1}} \beta_{1,0} \!\!+\!\! \cdots \amp \alpha_{2,0} \beta_{0,1} \!\!+\!\! {\color{red} \alpha_{2,1}} {\color{blue} \beta_{1,1}} \!\!+\!\! \cdots \amp \alpha_{2,0} \beta_{0,2} \!\!+\!\! 
{\color{red} \alpha_{2,1}} \beta_{1,2} \!\!+\!\! \cdots \amp \alpha_{2,0} \beta_{0,3} \!\!+\!\! {\color{red} \alpha_{2,1}} \beta_{1,3} \!\!+\!\! \cdots \\ \alpha_{3,0} \beta_{0,0} \!\!+\!\! {\color{red} \alpha_{3,1}} \beta_{1,0} \!\!+\!\! \cdots \amp \alpha_{3,0} \beta_{0,1} \!\!+\!\! {\color{red} \alpha_{3,1}} {\color{blue} \beta_{1,1}} \!\!+\!\! \cdots \amp \alpha_{3,0} \beta_{0,2} \!\!+\!\! {\color{red} \alpha_{3,1}} \beta_{1,2} \!\!+\!\! \cdots \amp \alpha_{3,0} \beta_{0,3} \!\!+\!\! {\color{red} \alpha_{3,1}} \beta_{1,3} \!\!+\!\! \cdots \end{array} \right) \\ ~~~ = \left( \begin{array}{c} \alpha_{0,0} \\ \alpha_{1,0} \\ \alpha_{2,0} \\ \alpha_{3,0} \end{array}\right) \left( \begin{array}{c c c c} \beta_{0,0} \amp \beta_{0,1} \amp \beta_{0,2} \amp \beta_{0,3} \\ \end{array}\right) + \left( \begin{array}{c c c c} {\color{red} \alpha_{0,1}} \\ {\color{red} \alpha_{1,1}} \\ {\color{red} \alpha_{2,1}} \\ {\color{red} \alpha_{3,1}} \end{array}\right) \left( \begin{array}{c c c c} \beta_{1,0} \amp {\color{blue} \beta_{1,1}} \amp \beta_{1,2} \amp \beta_{1,3} \\ \end{array}\right) + \cdots \end{array} \end{equation*}

Thus, updating the $4 \times 4$ matrix $C$ can be implemented as a loop around rank-1 updates:

\begin{equation*} \begin{array}{l} {\bf for~} p = 0, \ldots, k-1\\ ~~~ \left( \begin{array}{c | c | c | c} \gamma_{0,0} \amp \gamma_{0,1} \amp \gamma_{0,2} \amp \gamma_{0,3} \\ \gamma_{1,0} \amp \gamma_{1,1} \amp \gamma_{1,2} \amp \gamma_{1,3} \\ \gamma_{2,0} \amp \gamma_{2,1} \amp \gamma_{2,2} \amp \gamma_{2,3} \\ \gamma_{3,0} \amp \gamma_{3,1} \amp \gamma_{3,2} \amp \gamma_{3,3} \end{array}\right) +:= \left( \begin{array}{c c c c} \alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p} \end{array}\right) \left( \begin{array}{c c c c} \beta_{p,0} \amp \beta_{p,1} \amp \beta_{p,2} \amp \beta_{p,3} \\ \end{array}\right) \\ {\bf end} \end{array} \end{equation*}

or, equivalently, to emphasize computations with vectors,

\begin{equation*} \begin{array}{l} {\bf for~} p = 0, \ldots, k-1 \\ ~~~ \left( \begin{array}{c | c | c | c} \begin{array}{c c c c} \gamma_{0,0} +:= \alpha_{0,p} \times \beta_{p,0} \\ \gamma_{1,0} +:= \alpha_{1,p} \times \beta_{p,0} \\ \gamma_{2,0} +:= \alpha_{2,p} \times \beta_{p,0} \\ \gamma_{3,0} +:= \alpha_{3,p} \times \beta_{p,0} \end{array} \amp \begin{array}{c c c c} \gamma_{0,1} +:= \alpha_{0,p} \times \beta_{p,1} \\ \gamma_{1,1} +:= \alpha_{1,p} \times \beta_{p,1} \\ \gamma_{2,1} +:= \alpha_{2,p} \times \beta_{p,1} \\ \gamma_{3,1} +:= \alpha_{3,p} \times \beta_{p,1} \end{array} \amp \begin{array}{c c c c} \gamma_{0,2} +:= \alpha_{0,p} \times \beta_{p,2} \\ \gamma_{1,2} +:= \alpha_{1,p} \times \beta_{p,2} \\ \gamma_{2,2} +:= \alpha_{2,p} \times \beta_{p,2} \\ \gamma_{3,2} +:= \alpha_{3,p} \times \beta_{p,2} \end{array} \amp \begin{array}{c c c c} \gamma_{0,3} +:= \alpha_{0,p} \times \beta_{p,3} \\ \gamma_{1,3} +:= \alpha_{1,p} \times \beta_{p,3} \\ \gamma_{2,3} +:= \alpha_{2,p} \times \beta_{p,3} \\ \gamma_{3,3} +:= \alpha_{3,p} \times \beta_{p,3} \end{array} \end{array}\right) \\ {\bf end} \end{array} \end{equation*}

This, once again, shows how matrix-matrix multiplication can be cast in terms of a sequence of rank-1 updates, and how a rank-1 update can be implemented in terms of axpy operations with the columns of $C \text{.}$ This micro-kernel translates into code that employs vector instructions, given in Figure 2.4.3. We hope that the intrinsic function calls are relatively self-explanatory. However, you may want to consult Intel's Intrinsics Reference Guide. When doing so, you may want to search on the name of the routine, without checking the AVX2 box on the left.
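To make the idea concrete, here is our own sketch of such a vectorized micro-kernel (in the spirit of Figure 2.4.3, not reproduced verbatim from it). The four columns of $C$ are kept in 256-bit registers for the duration of the loop; each iteration of the loop over $p$ performs one rank-1 update, implemented as four vector axpy operations: broadcast $\beta_{p,j}$, multiply by the column of $A$, and accumulate. Column-major storage, the leading-dimension arguments, and the function name are assumptions; running it requires an AVX2/FMA-capable CPU.

```c
#include <immintrin.h>

#define alpha(i,p) A[(p)*ldA + (i)]   /* column-major indexing (assumed) */
#define beta(p,j)  B[(j)*ldB + (p)]
#define gamma(i,j) C[(j)*ldC + (i)]

/* C (4x4) += A (4xk) * B (kx4). Each __m256d register holds one column
   of C; the p loop performs k rank-1 updates, each as four vector axpys. */
__attribute__((target("avx2,fma")))
void Gemm_4x4Kernel(int k, const double *A, int ldA,
                    const double *B, int ldB, double *C, int ldC)
{
    __m256d gamma_0 = _mm256_loadu_pd(&gamma(0,0));  /* load columns of C */
    __m256d gamma_1 = _mm256_loadu_pd(&gamma(0,1));
    __m256d gamma_2 = _mm256_loadu_pd(&gamma(0,2));
    __m256d gamma_3 = _mm256_loadu_pd(&gamma(0,3));

    for (int p = 0; p < k; p++) {
        /* Column p of A, held in one vector register. */
        __m256d a_p = _mm256_loadu_pd(&alpha(0,p));

        /* Broadcast beta(p,j) to all four lanes, then fused multiply-add:
           gamma_j := a_p * beta(p,j) + gamma_j  (an axpy per column). */
        __m256d b_p0 = _mm256_broadcast_sd(&beta(p,0));
        gamma_0 = _mm256_fmadd_pd(a_p, b_p0, gamma_0);
        __m256d b_p1 = _mm256_broadcast_sd(&beta(p,1));
        gamma_1 = _mm256_fmadd_pd(a_p, b_p1, gamma_1);
        __m256d b_p2 = _mm256_broadcast_sd(&beta(p,2));
        gamma_2 = _mm256_fmadd_pd(a_p, b_p2, gamma_2);
        __m256d b_p3 = _mm256_broadcast_sd(&beta(p,3));
        gamma_3 = _mm256_fmadd_pd(a_p, b_p3, gamma_3);
    }

    _mm256_storeu_pd(&gamma(0,0), gamma_0);  /* store C back to memory */
    _mm256_storeu_pd(&gamma(0,1), gamma_1);
    _mm256_storeu_pd(&gamma(0,2), gamma_2);
    _mm256_storeu_pd(&gamma(0,3), gamma_3);
}
```

Note that the matrix $C$ is loaded into registers once before the loop and stored once after it, so the k rank-1 updates touch memory only for the columns of $A$ and rows of $B$.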

An illustration of how Figure 2.4.2 and Figure 2.4.3 compute is found in Figure 2.4.4. Figure 2.4.4. Illustration of how the routines in Figure 2.4.2 and Figure 2.4.3 index into the matrices.