Section 1.3 Thinking in Terms of Vector-Vector Operations

Unit 1.3.1 The Basic Linear Algebra Subprograms (BLAS)

Linear algebra operations are fundamental to computational science. In the 1970s, when vector supercomputers reigned supreme, it was recognized that if applications and software libraries were written in terms of a standardized interface to routines that implement operations with vectors, and vendors of computers provided high-performance instantiations for that interface, then applications would attain portable high performance across different computer platforms. This observation yielded the original Basic Linear Algebra Subprograms (BLAS) interface [3] for Fortran 77, whose routines are now referred to as the level-1 BLAS. The interface was expanded in the 1980s to encompass matrix-vector operations (level-2 BLAS)~\cite{BLAS2} and matrix-matrix operations (level-3 BLAS) [1].

An overview of the BLAS and how they are used to achieve portable high performance is given in the Encyclopedia of Parallel Computing [5]. This article is somewhat out of date. In a later enrichment we will point you to the BLAS-like Library Instantiation Software (BLIS) [6], which is now a widely used open source framework for rapidly instantiating the BLAS and similar functionality.

Expressing code in terms of the BLAS has another benefit: the call to the routine hides the loop that otherwise implements the vector-vector operation and clearly reveals the operation being performed, thus improving readability of the code.

Unit 1.3.2 Notation

In our discussions, we use capital letters for matrices (\(A, B, C, \ldots \)), lower case letters for vectors (\(a, b, c, \ldots \)), and lower case Greek letters for scalars (\(\alpha, \beta, \gamma, \ldots \)). Exceptions are integer scalars, for which we will use \(i, j, k, m, n,\) and \(p \text{.}\)

Vectors in our universe are column vectors or, equivalently, \(n \times 1 \) matrices if the vector has \(n \) components (size \(n \)). A row vector we view as a column vector that has been transposed. So, \(x \) is a column vector and \(x^T \) is a row vector.

In the subsequent discussion, we will want to expose the rows or columns of a matrix. If \(X \) is an \(m \times n\) matrix, then we expose its columns as

\begin{equation*} X = \left(\begin{array}{c | c | c | c} x_0 \amp x_1 \amp \cdots \amp x_{n-1} \end{array} \right) \end{equation*}

so that \(x_j \) equals the column with index \(j \text{.}\) We expose its rows as

\begin{equation*} X = \left(\begin{array}{c} \widetilde x_0^T \\ \widetilde x_1^T \\ \vdots \\ \widetilde x_{m-1}^T \end{array} \right) \end{equation*}

so that \(\widetilde x_i^T \) equals the row with index \(i \text{.}\) Here the \(~^T\) indicates it is a row (a column vector that has been transposed). The tilde is added for clarity since \(x_i^T \) would in this setting equal the column indexed with \(i \) that has been transposed, rather than the row indexed with \(i \text{.}\) When there isn't a cause for confusion, we will sometimes leave the \(\widetilde ~\) off. We use the lower case letter that corresponds to the upper case letter used to denote the matrix, as an added visual clue that \(x_j \) is a column of \(X \) and \(\widetilde x_i^T \) is a row of \(X \text{.}\)

We have already seen that the scalars that constitute the elements of a matrix or vector are denoted with the lower case Greek letter that corresponds to the letter used for the matrix or vector:

\begin{equation*} X = \left( \begin{array}{c c c c} \chi_{0,0} \amp \chi_{0,1} \amp \cdots \amp \chi_{0,n-1} \\ \chi_{1,0} \amp \chi_{1,1} \amp \cdots \amp \chi_{1,n-1} \\ \vdots \amp \vdots \amp \amp \vdots \\ \chi_{m-1,0} \amp \chi_{m-1,1} \amp \cdots \amp \chi_{m-1,n-1} \end{array} \right) \quad {\rm and} \quad x = \left( \begin{array}{c c c c} \chi_0 \\ \chi_1 \\ \vdots \\ \chi_{m-1} \end{array} \right). \end{equation*}

If you look carefully, you will notice the difference between \(x \) and \(\chi \text{.}\) The latter is the lower case Greek letter "chi."

Remark 1.3.1

Since this course will discuss the computation \(C := A B + C \text{,}\) you will only need to remember the Greek letters \(\alpha \) (alpha), \(\beta \) (beta), and \(\gamma \) (gamma).

Unit 1.3.3 The dot product (inner product)

Given two vectors \(x \) and \(y \) of size \(n \)

\begin{equation*} x = \left( \begin{array}{c c c c} \chi_0 \\ \chi_1 \\ \vdots \\ \chi_{n-1} \end{array} \right) \quad {\rm and} \quad y = \left( \begin{array}{c c c c} \psi_0 \\ \psi_1 \\ \vdots \\ \psi_{n-1} \end{array} \right), \end{equation*}

their dot product is given by

\begin{equation*} x^T y = \sum_{i=0}^{n-1} \chi_i \psi_i. \end{equation*}

The notation \(x^T y \) comes from the fact that the dot product also equals the result of multiplying \(1 \times n\) matrix \(x^T \) times \(n \times 1 \) matrix \(y \text{.}\)
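As a small worked instance of this definition (the numbers are chosen here only for illustration), take \(n = 3 \) and

\begin{equation*} x = \left( \begin{array}{r} 1 \\ 2 \\ 3 \end{array} \right) \quad {\rm and} \quad y = \left( \begin{array}{r} 4 \\ -1 \\ 2 \end{array} \right), \quad \mbox{so that} \quad x^T y = (1)(4) + (2)(-1) + (3)(2) = 4 - 2 + 6 = 8. \end{equation*}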

A routine, coded in C, that computes \(x^T y + \gamma \) where \(x \) and \(y \) are stored at location x with stride incx and location y with stride incy, respectively, and \(\gamma \) is stored at location gamma is given by

#define chi( i ) x[ (i)*incx ]   // map chi( i ) to array x 
#define psi( i ) y[ (i)*incy ]   // map psi( i ) to array y

void Dots( int n, double *x, int incx, double *y, int incy, double *gamma )
{
  for ( int i=0; i<n; i++ )
    *gamma += chi( i ) * psi( i );
}

in Assignments/Week1/C/Dots.c. Here stride refers to the number of items in memory between the stored components of the vector. For example, the stride when accessing a row of a matrix is lda when the matrix is stored in column-major order with leading dimension lda.
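To make the notion of stride concrete, the following sketch uses Dots to compute the dot product of one row of a small matrix, stored in column-major order, with a contiguous vector. The helper RowDotDemo and its data are made up for this illustration and are not part of the course files. The row is accessed with stride lda and the vector with stride 1.

```c
#define chi( i ) x[ (i)*incx ]   // map chi( i ) to array x
#define psi( i ) y[ (i)*incy ]   // map psi( i ) to array y

// Same routine as above: adds x^T y to the value stored at gamma.
void Dots( int n, double *x, int incx, double *y, int incy, double *gamma )
{
  for ( int i=0; i<n; i++ )
    *gamma += chi( i ) * psi( i );
}

// Hypothetical helper (an assumption for this example only).
double RowDotDemo( void )
{
  int lda = 2;   // leading dimension: here the number of rows of A

  // A = ( 1 3 5 )
  //     ( 2 4 6 )   stored column by column
  double A[] = { 1.0, 2.0, 3.0, 4.0, 5.0, 6.0 };
  double y[] = { 1.0, 1.0, 1.0 };
  double gamma = 0.0;

  // Row 1 of A starts at &A[ 1 ]; its consecutive entries are lda apart.
  Dots( 3, &A[ 1 ], lda, y, 1, &gamma );

  return gamma;   // 2 + 4 + 6 = 12
}
```

Passing &A[ 1 ] selects the starting element of the row, and the stride lda then steps from one column to the next.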

The BLAS include a function for computing the dot operation. Its calling sequence in Fortran, for double precision data, is

DDOT( N, X, INCX, Y, INCY )

where

(input) N is an integer that equals the size of the vectors.
(input) X is the address where \(x \) is stored.
(input) INCX is the stride in memory between entries of \(x \text{.}\)
(input) Y is the address where \(y \) is stored.
(input) INCY is the stride in memory between entries of \(y \text{.}\)

The function returns the result as a scalar of type double precision. If the datatype were single precision, complex double precision, or complex single precision, then the first D is replaced by S, Z, or C, respectively.

To call the same routine in a code written in C, it is important to keep in mind that Fortran passes data by address. The call

Dots( n, x, incx, y, incy, &gamma );

which, recall, adds the result of the dot product to the value in gamma, translates to

gamma += ddot_( &n, x, &incx, y, &incy );

When one of the strides equals one, as in

Dots( n, x, 1, y, incy, &gamma );

one has to declare an integer variable (e.g., i_one) with value one and pass the address of that variable:

int i_one=1;
gamma += ddot_( &n, x, &i_one, y, &incy );

We will see examples of this later in this section.
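Calling ddot_ from C also requires a declaration that the compiler can see. The sketch below shows one plausible declaration, assuming 32-bit Fortran integers and the common (but compiler-dependent) convention of a trailing underscore in the symbol name. So that the sketch compiles without a BLAS library, it also supplies a simple stand-in body for ddot_ that assumes nonnegative strides; when linking against a real BLAS, that stand-in definition should be deleted. The helper DdotDemo and its data are made up for this illustration.

```c
// Declaration as seen from C: every argument is passed by address.
double ddot_( int *n, double *x, int *incx, double *y, int *incy );

// Stand-in definition (an assumption, NOT the library routine), only so
// that this example is self-contained. Assumes nonnegative strides.
double ddot_( int *n, double *x, int *incx, double *y, int *incy )
{
  double result = 0.0;
  for ( int i=0; i<*n; i++ )
    result += x[ i * (*incx) ] * y[ i * (*incy) ];
  return result;
}

// Hypothetical helper showing the call with a unit stride for x.
double DdotDemo( void )
{
  int n = 3, incy = 2;
  int i_one = 1;                               // the constant 1 must live in a variable
  double x[] = { 1.0, 2.0, 3.0 };
  double y[] = { 4.0, 0.0, 5.0, 0.0, 6.0 };    // entries of y are 2 apart
  double gamma = 10.0;

  gamma += ddot_( &n, x, &i_one, y, &incy );

  return gamma;   // 10 + ( 1*4 + 2*5 + 3*6 ) = 42
}
```

Note that the literal 1 cannot be passed directly, since the routine expects an address rather than a value.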

In this course, we use the BLIS implementation of the BLAS as our library. This library also has its own (native) BLAS-like interface that we refer to as the BLIS Typed API. (BLIS is actually a framework for the rapid instantiation of BLAS-like functionality. It comes with four different interfaces to that functionality: the classic Fortran BLAS interface; the CBLAS interface for the C language (which is rarely used); the BLIS Typed API, which is reminiscent of the BLAS interface but with added functionality and flexibility; and the BLIS Object API, for which a Users' Guide can be found at https://github.com/flame/blis/blob/master/docs/BLISObjectAPI.md.) A Users' Guide for the BLIS Typed API can be found at https://github.com/flame/blis/blob/master/docs/BLISTypedAPI.md. There, we find the routine bli_ddotxv that computes \(\gamma := \alpha x^T y + \beta \gamma \text{,}\) optionally conjugating the elements of the vectors. The call

Dots( n, x, incx, y, incy, &gamma );

translates to

double one=1.0;    
bli_ddotxv( BLIS_NO_CONJUGATE, BLIS_NO_CONJUGATE,
            n, &one, x, incx, y, incy, &one, &gamma );

The BLIS_NO_CONJUGATE is to indicate that the vectors are not to be conjugated. Those parameters are there for consistency with the complex versions of this routine (bli_zdotxv and bli_cdotxv).
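For real-valued data, conjugation has no effect, so the operation reduces to \(\gamma := \alpha x^T y + \beta \gamma \text{.}\) The plain C sketch below spells out that formula to show why choosing \(\alpha = \beta = 1 \) reproduces what Dots computes. It is not BLIS code: dotxv_ref and DotxvDemo are hypothetical names introduced only for this illustration, and the sketch assumes nonnegative strides.

```c
// Hypothetical reference for gamma := alpha * x^T y + beta * gamma
// (real double precision data; not the BLIS implementation).
void dotxv_ref( int n, double alpha, double *x, int incx,
                double *y, int incy, double beta, double *gamma )
{
  double dot = 0.0;
  for ( int i=0; i<n; i++ )
    dot += x[ i*incx ] * y[ i*incy ];

  *gamma = alpha * dot + beta * (*gamma);
}

// Hypothetical helper: with alpha = beta = 1 the dot product is
// added to the value already in gamma, just as Dots does.
double DotxvDemo( void )
{
  double x[] = { 1.0, 2.0 };
  double y[] = { 3.0, 4.0 };
  double gamma = 5.0;

  dotxv_ref( 2, 1.0, x, 1, y, 1, 1.0, &gamma );

  return gamma;   // 1 * ( 1*3 + 2*4 ) + 1 * 5 = 16
}
```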

Unit 1.3.4 The IJP and JIP orderings

Let us return once again to the IJP ordering of the loops that compute matrix-matrix multiplication:

\begin{equation*} \begin{array}{l} {\bf for~} i := 0, \ldots , m-1 \\ ~~~ {\bf for~} j := 0, \ldots , n-1 \\ ~~~ ~~~ {\bf for~} p := 0, \ldots , k-1 \\ ~~~ ~~~ ~~~ \gamma_{i,j} := \alpha_{i,p} \beta_{p,j} + \gamma_{i,j} \\ ~~~ ~~~ {\bf end} \\ ~~~ {\bf end} \\ {\bf end} \end{array} \end{equation*}

This pseudo-code translates into the routine coded in C given in Figure 1.1.1.
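Figure 1.1.1 is not reproduced in this section. As a reminder of its general shape, here is a minimal sketch of such a routine, assuming column-major storage with leading dimensions ldA, ldB, and ldC. The routine name MyGemm and the macro names are assumptions made for this sketch and may differ from the course file.

```c
#define alpha( i,j ) A[ (j)*ldA + (i) ]   // map alpha( i,j ) to array A
#define beta( i,j )  B[ (j)*ldB + (i) ]   // map beta( i,j )  to array B
#define gamma( i,j ) C[ (j)*ldC + (i) ]   // map gamma( i,j ) to array C

// Computes C := A B + C with the IJP ordering of the loops,
// where C is m x n, A is m x k, and B is k x n.
void MyGemm( int m, int n, int k, double *A, int ldA,
             double *B, int ldB, double *C, int ldC )
{
  for ( int i=0; i<m; i++ )
    for ( int j=0; j<n; j++ )
      for ( int p=0; p<k; p++ )
        gamma( i,j ) += alpha( i,p ) * beta( p,j );
}
```

For example, with the \(2 \times 2 \) matrices stored as A = { 1, 3, 2, 4 }, B = { 5, 7, 6, 8 }, and C = { 1, 1, 1, 1 } (column by column), the call MyGemm( 2, 2, 2, A, 2, B, 2, C, 2 ) leaves C = { 20, 44, 23, 51 }.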

Using the notation that we introduced in Unit 1.3.2, we expose the elements of \(C \text{,}\) the rows of \(A \text{,}\) and the columns of \(B \text{:}\)

\begin{equation*} C = \left(\begin{array}{c | c | c | c} \gamma_{0,0} \amp \gamma_{0,1} \amp \cdots \amp \gamma_{0,n-1} \\ \hline \gamma_{1,0} \amp \gamma_{1,1} \amp \cdots \amp \gamma_{1,n-1} \\ \hline \vdots \amp \vdots \amp \amp \vdots \\ \hline \gamma_{m-1,0} \amp \gamma_{m-1,1} \amp \cdots \amp \gamma_{m-1,n-1} \end{array} \right), \quad A = \left(\begin{array}{c} \widetilde a_0^T \\ \hline \widetilde a_1^T \\ \hline \vdots \\ \hline \widetilde a_{m-1}^T \end{array}\right), \quad \mbox{and } B = \left(\begin{array}{c | c | c | c} b_0 \amp b_1 \amp \cdots \amp b_{n-1} \end{array} \right). \end{equation*}

We then notice that

\begin{equation*} \begin{array}{l} \left(\begin{array}{c | c | c | c} \gamma_{0,0} \amp \gamma_{0,1} \amp \cdots \amp \gamma_{0,n-1} \\ \hline \gamma_{1,0} \amp \gamma_{1,1} \amp \cdots \amp \gamma_{1,n-1} \\ \hline \vdots \amp \vdots \amp \amp \vdots \\ \hline \gamma_{m-1,0} \amp \gamma_{m-1,1} \amp \cdots \amp \gamma_{m-1,n-1} \end{array} \right) \\ ~~~:= \left(\begin{array}{c} \widetilde a_0^T \\ \hline \widetilde a_1^T \\ \hline \vdots \\ \hline \widetilde a_{m-1}^T \end{array}\right)\left(\begin{array}{c | c | c | c} b_0 \amp b_1 \amp \cdots \amp b_{n-1} \end{array} \right) + \left(\begin{array}{c | c | c | c} \gamma_{0,0} \amp \gamma_{0,1} \amp \cdots \amp \gamma_{0,n-1} \\ \hline \gamma_{1,0} \amp \gamma_{1,1} \amp \cdots \amp \gamma_{1,n-1} \\ \hline \vdots \amp \vdots \amp \amp \vdots \\ \hline \gamma_{m-1,0} \amp \gamma_{m-1,1} \amp \cdots \amp \gamma_{m-1,n-1} \end{array} \right)\\ ~~~= \left(\begin{array}{c | c | c | c} \widetilde a_0^T b_0 + \gamma_{0,0} \amp \widetilde a_0^T b_1 + \gamma_{0,1} \amp \cdots \amp \widetilde a_0^T b_{n-1} + \gamma_{0,n-1} \\ \hline \widetilde a_1^T b_0 + \gamma_{1,0} \amp \widetilde a_1^T b_1 + \gamma_{1,1} \amp \cdots \amp \widetilde a_1^T b_{n-1} + \gamma_{1,n-1} \\ \hline \vdots \amp \vdots \amp \amp \vdots \\ \hline \widetilde a_{m-1}^T b_0 + \gamma_{m-1,0} \amp \widetilde a_{m-1}^T b_1 + \gamma_{m-1,1} \amp \cdots \amp \widetilde a_{m-1}^T b_{n-1} + \gamma_{m-1,n-1} \end{array} \right). \end{array} \end{equation*}

If this makes your head spin, you will want to quickly go through Weeks 3-5 of our MOOC titled "Linear Algebra: Foundations to Frontiers," which is an introductory undergraduate course. The above captures that the outer two loops visit all of the elements in \(C \text{,}\) and the inner loop implements the dot product of the appropriate row of \(A \) with the appropriate column of \(B \text{:}\) \(\gamma_{i,j} := \widetilde a_i^T b_j + \gamma_{i,j} \text{,}\) as illustrated by

\begin{equation*} \begin{array}{l} {\bf for~} i := 0, \ldots , m-1 \\ ~~~ {\bf for~} j := 0, \ldots , n-1 \\[0.15in] ~~~ ~~~ \left. \begin{array}{l} {\bf for~} p := 0, \ldots , k-1 \\ ~~~ ~~~ \gamma_{i,j} := \alpha_{i,p} \beta_{p,j} + \gamma_{i,j} \\ {\bf end} \end{array} \right\} ~~~\gamma_{i,j} := \widetilde a_i^T b_j + \gamma_{i,j} \\[0.2in] ~~~ {\bf end} \\ {\bf end} \end{array} \end{equation*}

which is, again, the IJP ordering of the loops.
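To see this on a concrete case, here is a small illustration with \(m = n = k = 2 \) (the values are made up for this example and do not come from the course materials). Let

\begin{equation*} A = \left(\begin{array}{c c} 1 \amp 2 \\ \hline 3 \amp 4 \end{array}\right), \quad B = \left(\begin{array}{c | c} 5 \amp 6 \\ 7 \amp 8 \end{array}\right), \quad \mbox{and } C = \left(\begin{array}{c | c} 1 \amp 1 \\ \hline 1 \amp 1 \end{array}\right). \end{equation*}

Then \(\widetilde a_1^T = \left(\begin{array}{c c} 3 \amp 4 \end{array}\right) \) and \(b_0 = \left(\begin{array}{c} 5 \\ 7 \end{array}\right) \text{,}\) so for \(i = 1 \) and \(j = 0 \) the inner loop computes

\begin{equation*} \gamma_{1,0} := \widetilde a_1^T b_0 + \gamma_{1,0} = (3)(5) + (4)(7) + 1 = 44. \end{equation*}

Repeating this for the remaining elements yields

\begin{equation*} C := A B + C = \left(\begin{array}{c | c} (1)(5) + (2)(7) + 1 \amp (1)(6) + (2)(8) + 1 \\ \hline (3)(5) + (4)(7) + 1 \amp (3)(6) + (4)(8) + 1 \end{array}\right) = \left(\begin{array}{c | c} 20 \amp 23 \\ \hline 44 \amp 51 \end{array}\right). \end{equation*}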

Homework 1.3.1

In directory Assignments/Week1/C copy file Assignments/Week1/C/Gemm_IJP.c into file Gemm_IJ_Dots.c. Replace the inner-most loop with an appropriate call to Dots, then compile and execute the result with

make IJ_Dots

View the resulting performance by making the necessary changes to the Live Script in Assignments/Week1/C/data/Plot_Inner_P.mlx. (Alternatively, use the script in Assignments/Week1/C/data/Plot_Inner_P_m.mlx.)
