We start with a few exercises that let you experience the BLIS typed API and the traditional BLAS interface.

###### Homework1.6.1.1. (Optional).

In Unit 1.3.3 and Unit 1.3.5, you wrote implementations of the routine

MyGemv( int m, int n, double *A, int ldA, double *x, int incx, double *y, int incy )


The BLIS typed API (Unit 1.5.2) includes the routine

bli_dgemv( trans_t transa, conj_t conjx, dim_t  m,  dim_t  n,
double* alpha, double* a, inc_t rsa, inc_t csa,
double* x, inc_t incx,
double* beta,  double* y, inc_t incy );


which supports the operations

\begin{equation*} \begin{array}{rcl} y \amp:=\amp \alpha A x + \beta y \\ y \amp:=\amp \alpha A^T x + \beta y \end{array} \end{equation*}

and other variations on matrix-vector multiplication.

To get experience with the BLIS interface, copy Assignments/Week1/C/Gemm_J_Gemv.c into Gemm_J_bli_dgemv.c and replace the call to MyGemv with a call to bli_dgemv instead. You will also want to include the header file blis.h,

#include "blis.h"


replace the call to MyGemv with a call to bli_dgemv and, if you wish, comment out or delete the now unused prototype for MyGemv. Test your implementation with

make J_bli_dgemv


You can then view the performance with Assignments/Week1/C/data/Plot_Outer_J.mlx.

Hint

Replace

MyGemv( m, k, A, ldA, &beta( 0, j ), 1, &gamma( 0,j ), 1 );


with

double d_one=1.0;
bli_dgemv( BLIS_NO_TRANSPOSE, BLIS_NO_CONJUGATE, m, k,
& d_one, A, 1, ldA, %amp;beta( 0,j ), 1, &gamma( 0,j ), 1 );


Note: the BLIS_NO_TRANSPOSE indicates we want to compute $y := \alpha A x + \beta y$ rather than $y := \alpha A^T x + \beta y \text{.}$ The BLIS_NO_CONJUGATE parameter indicates that the elements of $x$ are not conjugated (something one may wish to do with complex valued vectors).

The ldA parameter in MyGemv now becomes two paramters: 1, ldA. The first indicates the stride in memory between elements in the same column by consecutive rows (which we refer to as the row stride) and the second refers to the stride in memory between elements in the same row and consecutive columns (which we refer to as the column stride).

Solution
###### Homework1.6.1.2. (Optional).

The traditional BLAS interface (Unit 1.5.1) includes the Fortran call

DGEMV( TRANSA, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )


which also supports the operations

\begin{equation*} \begin{array}{rcl} y \amp:=\amp \alpha A x + \beta y \\ y \amp:=\amp \alpha A^T x + \beta y \end{array} \end{equation*}

and other variations on matrix-vector multiplication.

To get experience with the BLAS interface, copy Assignments/Week1/C/Gemm_J_Gemv.c (or Gemm_J_bli_dgemv) into Gemm_J_dgemv and replace the call to MyGemv with a call to dgemv. Some hints:

• In calling a Fortran routine from C, you will want to call the routine using lower case letters, and adding an underscore:

dgemv_


• You will need to place a "prototype" for dgemv_ near the beginning of Gemm_J_dgemv:

dgemv_( char *, int *, int *, double *, double *, int *, double *, int *, double *, double *, int * );


• Fortran passes parameters by address. So,

MyGemv( m, k, A, ldA, &beta( 0, j ), 1, &gamma( 0,j ), 1 );


becomes

{
int i_one=1;
double d_one=1.0;
dgemv_( 'No transpose', &m, &d_one, &k, A, &ldA, &beta( 0, j ), &i_one, &d_one, &gamma( 0,j ), &i_one );
}


Test your implementation with

make J_dgemv


You can then view the performance with Assignments/Week1/C/data/Plot_Outer_J.mlx.

On Robert's laptop, the last two exercises yield

Notice

• Linking to an optmized implementation (provided by BLIS) helps.

• The calls to bli_dgemv and dgemv are wrappers to the same implementations of matrix-vector multiplication within BLIS. The difference in performance that is observed should be considered "noise."

• It pays to link to a library that someone optimized.

###### Homework1.6.1.3. (Optional).

In Unit 1.4.1, You wrote the routine

MyGer( int m, int n, double *x, int incx, double *y, int incy, double *A, int ldA )


that compues a rank-1 update. The BLIS typed API (Unit 1.5.2) includes the routine

bli_dger( conj_t conjx, conj_t conjy, dim_t  m,  dim_t  n,
double* alpha, double* x, inc_t incx,  double* y, inc_t incy,
double* A, inc_t rsa,  inc_t csa );


which supports the operation

\begin{equation*} \begin{array}{rcl} A \amp:=\amp \alpha x y^T + A \end{array} \end{equation*}

and other variations on rank-1 update.

To get experience with the BLIS interface, copy Assignments/Week1/C/Gemm_P_Ger.c into Gemm_P_bli_dger and replace the call to MyGer with a call to bli_dger. Test your implementation with

make P_bli_dger


You can then view the performance with Assignments/Week1/C/data/Plot_Outer_P.mlx.

Hint

Replace

MyGer( m, n, &alpha( 0, p ), 1, &beta( p,0 ), ldB, C, ldC );


with

double d_one=1.0;
bli_dger( BLIS_NO_CONJUGATE, BLIS_NO_CONJUGATE, m, n,
&d_one, &alpha( 0, p ), 1, &beta( p,0 ), ldB, C, 1, ldC );


The BLIS_NO_CONJUGATE parameters indicate that the elements of $x$ and $y$ are not conjugated (something one may wish to do with complex valued vectors).

The ldC parameter in MyGer now becomes two paramters: 1, ldC. The first indicates the stride in memory between elements in the same column by consecutive rows (which we refer to as the row stride) and the second refers to the stride in memory between elements in the same row and consecutive columns (which we refer to as the column stride).

Solution
###### Homework1.6.1.4. (Optional).

The traditional BLAS interface (Unit 1.5.1) includes the Fortran call

DGER( M, N, ALPHA, X, INCX, Y, INCY, A, LDA );


which also supports the operation

\begin{equation*} \begin{array}{rcl} A \amp:=\amp \alpha x y^T + A. \end{array} \end{equation*}

To get experience with the BLAS interface, copy Assignments/Week1/C/Gemm_P_Ger.c (or Gemm_P_bli_dger) into Gemm_P_dger and replace the call to MyGer with a call to dger. Test your implementation with

make P_dger


You can then view the performance with Assignments/Week1/C/data/???.

Hint

Replace

Ger( m, n, &alpha( 0,p ), 1, &beta( p,0 ), ldB, C, ldC );


with

double d_one=1.0;
int i_one=1;
dger_( m, n,  &d_one, &alpha( 0,p ), &i_one, &beta( p,0 ), \amp;ldB, C, \amp;ldC );


The _ (underscore) allows C to call the Fortran routine. (This is how it works for most compilers, but not all.) The "No transpose" indicates we want to compute $y := \alpha A x + \beta y$ rather than $y := \alpha A^T x + \beta y \text{.}$ Fortran passes parameters by address, which is why d_one and i_one are passed as in this call.

Solution

On Robert's laptop, the last two exercises yield

Notice

• Linking to an optmized implementation (provided by BLIS) does not seem to help (relative to PJI).

• The calls to bli_dger and dger are wrappers to the same implementations of matrix-vector multiplication within BLIS. The difference in performance that is observed should be considered "noise."

• It doesn't always pay to link to a library that someone optimized.

Next, we turn to computing $C := A B + C$ via a series of row-times-matrix multiplications.

Recall that the BLAS include the routine dgemv that computes

\begin{equation*} y := \alpha A x + \beta y \quad \mbox{or} \quad y := \alpha A^T x + \beta y. \end{equation*}

What we want is a routine that computes

\begin{equation*} y^T := x^T A + y^T. \end{equation*}

What we remember from linear algebra is that if $A = B$ then $A^T = B^T \text{,}$ that $( A + B)^T = A^T + B^T \text{,}$ and that $(A B)^T = B^T A^T \text{.}$ Thus, starting with the the equality

\begin{equation*} y^T = x^T A + y^T, \end{equation*}

and transposing both sides, we get that

\begin{equation*} (y^T)^T = ( x^T A + y^T)^T \end{equation*}

which is equivalent to

\begin{equation*} y = ( x^T A )^T + (y^T)^T \end{equation*}

and finally

\begin{equation*} y = A^T x + y. \end{equation*}

So, updating

\begin{equation*} y^T := x^T A + y^T \end{equation*}

is equivalent to updating

\begin{equation*} y := A^T x + y. \end{equation*}

It this all seems unfamiliar, you may want to look at Linear Algebra: Foundations to Frontiers

Now, if

• $m \times n$ matrix $A$ is stored in array A with its leading dimension stored in variable ldA,

• $m$ is stored in variable m and $n$ is stored in variable n,

• vector $x$ is stored in array x with its stride stored in variable incx,

• vector $y$ is stored in array y with its stride stored in variable incy, and

• $\alpha$ and $\beta$ are stored in variables alpha and beta, respectively,

then $y := \alpha A^T x + \beta y$ translates, from C, into the call

dgemv_( "Transpose", &m, &n, &alpha, A, &ldA, x, &incx, &beta, y, &incy );


###### Homework1.6.1.5.

In directory Assignments/Week1/C complete the code in file Gemm_I_dgemv.c that casts matrix-matrix multiplication in terms of the dgemv BLAS routine, but compute the result by rows. Compile and execute it with

make I_dgemv


View the resulting performance by making the necessary changes to the Live Script in Assignments/Week1/C/Plot_All_Outer.mlx. (Alternatively, use the script in Assignments/Week1/C/data/Plot_All_Outer_m.mlx.)

The BLIS native call that is similar to the BLAS dgemv routine in this setting translates to

bli_dgemv( BLIS_TRANSPOSE, BLIS_NO_CONJUGATE, m, n,
&alpha, A, ldA, 1, x, incx, &beta, y, incy );


Because of the two parameters after A that capture the stride between elements in a column (the row stride) and elements in rows (the column stride), one can alternatively swap these parameters:

bli_dgemv( BLIS_NO_TRANSPOSE, BLIS_NO_CONJUGATE, m, n,
&alpha, A, ldA, 1, x, incx, &beta, y, incy );


On Robert's laptop, the last two exercises yield

Notice

• Casting matrix-matrix multiplication in terms of matrix-vector multiplications attains the best performance. The reason is that as columns of $C$ are computed, they can stay in cache, reducing the number of times elements of $C$ have to be read and written from main memory.

• Accessing $C$ and $A$ by rows really gets in the way of performance.