ALAFF BLAS and BLIS

Subsection 12.2.5 BLAS and BLIS

To facilitate portable high performance, scientific and machine learning software is often written in terms of the Basic Linear Algebra Subprograms (BLAS) interface [25] [15] [14]. This interface supports widely-used basic linear algebra functionality. The idea is that if optimized implementations of this interface are available for different target computer architectures, then a degree of portable high performance can be achieved by software that is written in terms of this interface. The interface was designed for the Fortran programming language starting in the 1970s. In our discussions, we will also show how to interface to it from the C programming language. As we discuss the BLAS, it may be useful to keep the "Quick Reference Guide" [9] to this interface handy.

In addition, our own BLAS-like Library Instantiation Software (BLIS) [56][51] is not only a framework for the rapid instantiation of high-performing BLAS through the traditional BLAS interface, but also an extended interface that, we believe, is more natural for C programming. We refer to this interface as the BLIS typed interface [52], to distinguish it from the traditional BLAS interface and the object based BLIS Object API.

A number of open source and vendors high-performance implementations of the BLAS interface are availabe. For example,

Our BLAS-like Library Instantiation Software (BLIS) is a widely-used open source implementation of the BLAS for modern CPUs. It underlies AMD's Optimizing CPU Libraries (AOCL).
Arm's Arm Performance Libraries.
Cray's Cray Scientific and Math Libraries (CSML).
IBM's Engineering and Scientific Subroutine Library (ESSL).
Intel's Math Kernels Library (MKL).
NVIDIA's cuBLAS.

Subsubsection 12.2.5.1 Level-1 BLAS (vector-vector functionality)

The original BLAS [25] interface was proposed in the 1970s, when vector supercomputers like the Cray 1 and Cray 2 reigned supreme. On this class of computers, high performance was achieved if computation could be cast in terms of vector-vector operations with vectors that were stored contiguous in memory (with "stride one"). These are now called "level-1 BLAS" because they perform \(O( n ) \) computation on \(O( n ) \) data (when the vectors have size \(n \)). The "1" refers to \(O( n ) = O( n^1 ) \) computation.

We here list the vector-vector functionality that is of importance in this course, for the case where we compute with double precision real-valued floating point numbers.

DOT: Returns \(x^T y \text{,}\) the dot product of real-valued \(x \) and \(y \text{.}\)

Traditional BLAS interface:
```
FUNCTION DDOT( N, X, INCX, Y, INCY )
```

double ddot_( int* n, double* x, int* incx, double* y, int* incy );

BLIS typed interface for computing \(\rho := \alpha x^T y + \beta \rho \) (details):

void bli_ddotxv( conj_t conjx, conj_t conjy, dim_t n,
    double* alpha, double* x, inc_t incx, double* y, inc_t incy,
    double* beta,  double* rho );

AXPY: Updates \(y := \alpha x + y \text{,}\) the scaled vector addition of \(x \) and \(y \text{.}\)

Traditional BLAS interface:

SUBROUTINE DAXPY( N, ALPHA, X, INCX, Y, INCY )

void daxpy_( int* n, double* alpha, double* x, int* incx,
	      double* y, int* incy );

BLIS typed interface (details):

void bli_daxpyf( conj_t conjx, dim_t n,
    double* alpha, double* x, inc_t incx, double* y, inc_t incy );

IAMAX: Returns the index of the element of \(x \) with maximal absolute value (indexing starts at 1).
- Traditional BLAS interface:
```
FUNCTION IDAMAX( N, X, INCX )
```
- C:
```
int idamax_( int* n, double* x, int* incx );
```
- BLIS typed interface (details):
```
void bli_damaxv( dim_t n, double* x, inc_t incx, dim_t* index );
```

NRM2: Returns \(\| x \|_2 \text{,}\) the 2-norm of real-valued \(x \text{.}\)

Traditional BLAS interface:
```
FUNCTION DNRM2( N, X, INCX )
```

double dnrm2_( int* n, double* x, int* incx );

BLIS typed interface (details):

void bli_dnormfv( dim_t n, double* x, inc_t incx, double* norm );

Versions of these interfaces for single precision real, single precision complex, and double precision complex can be attained by replacing the appropriate D with S, C, or Z in the call to the Fortran BLAS interface, or d with s, c, or z in the C and BLIS typed interfaces.

Subsubsection 12.2.5.2 Level-2 BLAS (matrix-vector functionality)

In addition to providing portable high performance, a second purpose of the BLAS is to improve readability of code. Many of the algorithms we have encountered are cast in terms of matrix-vector operations like matrix-vector multiplication and the rank-1 update. By writing the code in terms of routines that implement such operations, the code more closely resembles the algorithm that is encoded. The level-2 BLAS [15] interface provides matrix-vector functionality. The "level-2" captures that they perform \(O( n^2 ) \) computation (on \(O( n^2 ) \) data), when the matrix invoved is of size \(n \times n\text{.}\)

We here list the the matrix-vector operations that are of importance in this course, for the case where we compute with double precision real-valued floating point numbers.

GEMV: Updates \(y := \alpha {\rm op}(A) x + \beta y \text{,}\) where \({\rm op}( A ) \) indicates whether \(A\) is to be transposed.

Traditional BLAS interface:

SUBROUTINE DGEMV( TRANSA, M, N, ALPHA, A, LDA, X, INCX,
                                BETA, Y, INCY )

void dgemv_( char* transa, int* m, int* n,
               double* alpha, double* a, int* lda, double* x, int* incx,
               double* beta, double* y, int* incy );

BLIS typed interface for computing \(y := \alpha {\rm op}_A (A) {\rm op}_x(x) + \beta y \) (details):

void bli_dgemv( trans_t transa, conj_t conjx, dim_t m, dim_t n,
    double* alpha, double* A, inc_t rsa, inc_t csa,
    double* x, inc_t incx, double* beta, double* y, inc_t incy );

SYMV: Updates \(y := \alpha A x + \beta y \text{,}\) where \(A \) is symmetric and only the upper or lower triangular part is stored.

Traditional BLAS interface:

SUBROUTINE DSYMV( UPLO, N, ALPHA, A, LDA, X, INCX,
                           BETA, Y, INCY )

void dsymv_( char* uplo, int* n,
               double* alpha, double* a, int* lda, double* x, int* incx,
               double* beta, double* y, int* incy );

BLIS typed interface (details):

void bli_dhemv( uplo_t uploa, conj_t conja, conj_t conjx, dim_t n,
    double* alpha, double* A, inc_t rsa, inc_t csa,
    double* x, inc_t incx, double* beta, double* y, inc_t incy );

TRSV: Updates \(x := \alpha {\rm op}(A)^{-1} x \text{,}\) where \({\rm op}( A ) \) indicates whether \(A\) is to be transposed and \(A \) is either (unit) upper or lower triangular.

Traditional BLAS interface:

SUBROUTINE DTRSV( UPLO, TRANSA, DIAG, N, A, LDA, X, INCX )

void dtrsv_( char* uplo, char* transa, char* diag, int* n,
               double* a, int* lda, double* x, int* incx )

BLIS typed interface (details):

void bli_dtrsv( uplo_t uploa, trans_t transa, diag_t diag, dim_t n,
    double* alpha, double* A, inc_t rsa, inc_t csa,
    double* x, inc_t incx );

GER: Updates \(A := \alpha x y^T + A \text{:}\)

Traditional BLAS interface:

SUBROUTINE DGER( M, N, ALPHA, X, INCX, Y, INCY, A, LDA  )

void dger_( int* m, int* n, double* alpha, double* x, int* incx, 
               double* y, int* incy, double* a, int* lda );

BLIS typed interface (details):

void bli_dger( conj_t conjx, conj_t conjy, dim_t m, dim_t n,
    double* alpha, double* x, inc_t incx, double* y, inc_t incy,
    double* A, inc_t rsa, inc_t csa );

SYR: Updates \(A := \alpha x x^T + A \text{,}\) where \(A \) is symmetric and stored in only the upper or lower triangular part of array A:

Traditional BLAS interface:

SUBROUTINE DSYR( UPLO, N, ALPHA, X, INCX, A, LDA  )

void dsyr_( char* uplo, int* n, double* alpha, double* x, int* incx,
    double* a, int* lda );

BLIS typed interface (details):

void bli_dher( uplo_t uploa, conj_t conjx, dim_t n,
    double* alpha, double* x, inc_t incx, double* A, inc_t rsa, inc_t csa );

SYR2: Updates \(A := \alpha ( x y^T + y x^T) + A \) , where \(A \) is symmetric and stored in only the upper or lower triangular part of array A:

Traditional BLAS interface:

SUBROUTINE DSYR2( UPLO, N, ALPHA, X, INCX, Y, INCY, A, LDA  )

void dsyr2_( char* uplo, int* n, double* alpha, double* x, int* incx, 
               double* y, int* incy,  double* a, int* lda );

BLIS typed interface (details):

void bli_dher2( uplo_t uploa, conj_t conjx, conj_t conjy, dim_t n,
    double* alpha, double* x, inc_t incx, double* y, inc_t incy,
    double* A, inc_t rsa, inc_t csa );

Subsubsection 12.2.5.3 Level-3 BLAS (matrix-matrix functionality)

To attain high performance, computation has to be cast in terms of operations that reuse data many times (for many floating point operations). Matrix-matrix operations that are special cases of matrix-matrix multiplication fall into this category. The strategy is to cast higher level linear algebra functionality can be implemented so that most computation is in terms of matrix-matrix operations by calling the level-3 BLAS routines [14]. The "level-3" captures that these perform \(O( n^3 ) \) computation (on \(O( n^2 ) \) data), when the matrices involved are of size \(n \times n\text{.}\)

We here list the the matrix-matrix operations that are of importance in this course, for the case where we compute with double precision real-valued floating point numbers.

GEMM: Updates \(C := \alpha {\rm op}_A(A) {\rm op}_B(B) + \beta C \text{,}\) where \({\rm op}_A( A ) \) and \({\rm op}_A( A ) \) indicate whether \(A\) and/or \(B \) are to be transposed.

Traditional BLAS interface:

SUBROUTINE DGEMM( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB,
                                BETA, C, LDC )

void dgemm_( char* transa, char* transb, int* m, int* n, int* k,
               double* alpha, double* a, int* lda, double* b, int* ldb,
               double* beta, double* c, int* ldc );

BLIS typed interface for computing \(C := \alpha {\rm op}_A (A) {\rm op}_B(B) + \beta C \) (details):

void bli_dgemm( trans_t transa, trans_t transb,
    dim_t m, dim_t n, dim_t k,
    double* alpha, double* A, inc_t rsa, inc_t csa,
    double* B, inc_t rsB, inc_t csb, 
    double* beta, double* C, inc_t rsc, inc_t csc );

SYMM: Updates \(C := \alpha A B + \beta C \) or \(C := \alpha B A + \beta C \text{,}\) where \(A \) is symmetric and only the upper or lower triangular part is stored.

Traditional BLAS interface:

SUBROUTINE DSYMM( SIDE, UPLO, M, N, ALPHA, A, LDA, B, LDB,
                           BETA, C, LDC )

void dsymm_( char* uplo, char* uplo, int* m, int* n,
               double* alpha, double* a, int* lda, double* b, int* ldb,
               double* beta, double* c, int* ldc );

BLIS typed interface (details):

void bli_dhemm( side_t sidea, uplo_t uploa, conj_t conja, trans_t transb,
    dim_t m, dim_t n, double* alpha, double* A, inc_t rsa, inc_t csa,
    double* B, inc_t rsb, inc_t csb, 
    double* beta, double* C, inc_t  rsc, inc_t csc );

TRSM: Updates \(B := \alpha {\rm op}(A)^{-1} B \) or \(B := \alpha B {\rm op}(A)^{-1} \text{,}\) where \({\rm op}( A ) \) indicates whether \(A\) is to be transposed and \(A \) is either (unit) upper or lower triangular.

Traditional BLAS interface:

SUBROUTINE DTRSM( SIDE, UPLO, TRANSA, DIAG, M, N,
		      ALPHA, A, LDA, B, LDB, C, LDC )

void dtrsm_( char* side, char* uplo, char* transa, char* diag,
    int* m, int*, n, double *alpha, double* a, int* lda,
    double* b, int* ldb )

BLIS typed interface (details):

void bli_dtrsm( side_t sidea, uplo_t uploa, trans_t transa, diag_t diag,
    dim_t m, dim_t n, double* alpha, double* A, inc_t rsa, inc_t csa,
    double* B, inc_t rsb, inc_t csb );

SYRK: Updates \(C := \alpha A A^T + \beta C \) or \(C := \alpha A^T A + \beta C \text{,}\) where \(C \) is symmetric and stored in only the upper or lower triangular part of array C:

Traditional BLAS interface:

SUBROUTINE DSYRK( UPLO, TRANS, N, K, ALPHA, A, LDA,
		      BETA, C, LDC )

void dsyrk_( char* uplo, char* trans, int* n, int* k,
    double* alpha, double* A, int* lda, double* beta, double* C, int* ldc );

BLIS typed interface (details):

void bli_dherk( uplo_t uploc, transt_t transa, dim_t n, dim_t k,
    double* alpha, double* A, inc_t rsa, inc_t csa 
    double* beta, double* C, inc_t rsc, inc_t csc );

SYR2K: Updates \(C := \alpha ( A B^T + B A^T) + \beta C\) or \(C := \alpha ( A^T B + B^T A) + \beta C \) , where \(C \) is symmetric and stored in only the upper or lower triangular part of array C:

Traditional BLAS interface:

SUBROUTINE DSYR2K( UPLO, TRANS, N, K, ALPHA, A, LDA, B, LDB,
		      BETA, C, LDC  )

void dsyr2k_( char* uplo, char *trans, int* n, int* k, double* alpha, double* A, int* lda,
    double* b, int* ldb, double* beta, double* c, int* ldc );

BLIS typed interface (details):

void bli_dher2k( uplo_t uploc, trans_t transab, dim_t n, dim_t k,
    double* alpha, double* A, inc_t rsa, inc_t csa,
    double* B, inc_t rsc, inc_t csc,
    double* beta,  double* C, inc_t rsc, inc_t csc ); 
    );

These operations are often of direct importance to scientific or machine learning applications. In the next next section, we show how higher-level linear algebra operations can be cast in terms of this basic functionality.