# BLIS Retreat 2016

## Contributed talks

### A Case Study for Malleable Thread-Level Linear Algebra Libraries: The LU Factorization

**Sandra Catalan, Univ. Jaume I, Spain**

### BLAS for Deep Learning: tuple, mixed-precision, fixed-point, and binary GEMM

**Marat Dukhan, GATech**

Convolutional neural networks are the state-of-the-art method for computer vision tasks. The most computationally intensive parts of convolutional neural networks - fully connected layers, and convolutional layers, can be cast to GEMM primitives for a high-performance implementation. This talk would suggest non-standard modifications to GEMM functions and kernels, which are useful for implementation of deep learning primitives, and discuss their implementation in NNPACK library.

### PeachPy.io: a platform for crowdsourcing performance tuning

**Marat Dukhan, GATech**

We present PeachPy.io, an open-source cloud-based infrastructure for performance tuning. PeachPy.io lets developers optimize kernels, run them on a variety of hardware platforms, collect and visualize information from hardware event counters, compare against alternatives, and submit a pull request -- all within a Web browser and without need for any type of client-side configuration. PeachPy.io leverages PeachPy - Python-based assembly-like domain-specific language, which automates routine tasks in assembly programming, and lets developers use Python functions and statements for metaprogramming.

### Using BLIS for tensor computations in Q-Chem

**Evgeny Epifanovsky, Q-Chem**

TBD

### Cl1ck + LGen: FLAME for small scale linear algebra

**Diego Fabregat**

### Extended BLAS, Integer BLAS, Batched BLAS, etc.

**Greg Henry, Intel**

### Implementing Strassen-like Fast Matrix Multiplication Algorithms with BLIS

**Jianyu Huang and Leslie Rice, UT**

Matrix multiplication (GEMM) is a core operation to many scientific disciplines. Traditional implementations of Strassen-like fast matrix multiplication algorithm (Strassen) often do not perform well except for very large matrix sizes, due to the cost of memory movements, which is particularly noticeable for when non-square matrices are being multiplied. Such implementations also require considerable workspace and modifications to the standard BLAS interface.

Inspired by the BLIS framework, we present novel insights into several implementations of Strassen, yielding a family of algorithms that outperform both high-performance implementations of conventional GEMM and naive implementations of Strassen. We identify a micro-kernel for Strassen and show that the matrix additions that must be performed for Strassen can be incorporated into the inherent packing inside of GEMM or other level-3 BLAS operations. We present a performance model to predict the run time of the various implementations and demonstrate the performance benefits for single-core, multi-core, and many-core implementations. This shows that Strassen-like fast matrix multiplication can be incorporated into libraries for practical use.

### A BLIS affair with FPGAs

**Tze Meng Low, CMU**

Recent improvements to high level synthesis (HLS) tools seemed to have lowered the barrier for library developers to design FPGA implementations. In this talk, we share our experience with porting BLIS onto a FPGA board. Specifically, we discuss how the analytical models for the BLIS framework can be used to design FPGA implementations.

### Tensor Contraction with BLIS

**Devin Matthews, UT**

Tensor computations–in particular tensor contraction (TC)–are important kernels in many scientific computing applications (SCAs). Due to the fundamental similarity of TC to matrix multiplication (MM) and to the availability of optimized implementations such as the BLAS, tensor operations have traditionally been implemented in terms of BLAS operations, incurring both a performance and a storage overhead. Instead, we implement TC using the flexible BLIS framework, which allows for reshaping of the tensor to be fused with internal partitioning and packing operations, requiring no explicit reshaping operations or additional workspace. This implementation, TBLIS, achieves performance approaching that of MM, and in some cases considerably higher than that of traditional TC. Our implementation supports multithreading using an approach identical to that used for MM in BLIS, with similar performance characteristics. The complexity of managing tensor- to-matrix transformations is also handled automatically in our approach, greatly simplifying its use in SCAs.

Link: Paper on arXiv. (Submitted to SISC).### An Implementation of GEMM for DMA-enabled Architectures

**Devangi Parikh, TI**

Our previous implementation of Dense Linear Algebra libraries for TI DSP processors assumes all on-chip memory is available as working space to efficiently move data through the various levels of memory using DMA and packing routines. However, this assumption prevents applications from using any on-chip memory to store data that the application may be using frequently. In this talk, we will describe an implementation of GEMM that uses a limited amount of working space, and DMAs to pack matrices freeing up most of the on-chip memory for the applications' use.

### Dark Memory and Accelerator-Rich System Optimization in the Dark Silicon Era

**Ardavan Pedram, Movidius and Stanford**

### Design of a high-performance GEMM-like Tensor-Tensor Multiplication

**Paul Springer, RWTH-Aachen University**

We present ''GEMM-like Tensor-Tensor multiplication'' (GETT), a novel approach to tensor contractions that mirrors the design of a high-performance general matrix-matrix multiplication (GEMM). The critical insight behind GETT is the identification of three index sets, involved in the tensor contraction, which enable us to systematically reduce an arbitrary tensor contraction to loops around a highly tuned ''macro-kernel''. This macro-kernel operates on suitably prepared (''packed'') sub-tensors that reside in a specified level of the cache hierarchy. In contrast to previous approaches to tensor contractions, GETT exhibits desirable features such as unit-stride memory accesses, cache-awareness, as well as full vectorization, without requiring auxiliary memory. To compare our technique with other modern tensor contractions, we integrate GETT alongside the so called Transpose-Transpose-GEMM-Transpose and Loops-over-GEMM approaches into an open source ''Tensor Contraction Code Generator'' (TCCG). The performance results for a wide range of tensor contractions suggest that GETT has the potential of becoming the method of choice: While GETT exhibits excellent performance across the board, its effectiveness for bandwidth-bound tensor contractions is especially impressive, outperforming existing approaches by up to 12.3x. More precisely, GETT achieves speedups of up to 1.42x over an equivalent-sized GEMM for bandwidth-bound tensor contractions while attaining up to 91.3% of peak floating-point performance for compute-bound tensor contractions.

Link: Paper on arXiv.

### A New I/O Lower Bound for GEMM with a Tight Constant

**Tyler Smith, UT**

We present new and improved lower bound on the I/O cost that must be incurred by algorithms that compute the general matrix-matrix multiplication (GEMM) for a machine with two levels of memory: One fast and small, and one large and slow. We then demonstrate that this new lower bound is tight by producing an algorithm with an I/O cost that differs from this lower bound only in its lower order terms.

We use a modified version of the approach of breaking the operation into subcalculations, where each subcalculation has an equal I/O cost. The improvement in the lower bound can be attributed to two things. First, we examine the GEMM operation C+=AB instead of C:=AB. Secondly, we show that the lower bound can be improved by analyzing the operation in terms of fewer and larger subcalculations.

This work applies only to conventional algorithms for GEMM.

### NSF Software Infrastructure for Sustained Innovation: Our Second SSI Grant

**Robert van de Geijn, UT**

The development of BLIS was supported in part by a National Science Foundation SSI grant, starting Summer 2012. In Summer 2015, we submitted our second SSI grant proposal to NSF, which was eventually funded starting in late Summer 2016. We will discuss what development, supported by this grant, is planned for the next two years. We also hope to receive feedback from the participants on how to make the software, developed as part of these two grants, self-sustaining in the future.

### BLIS: Year In Review, 2015-2016

**Field Van Zee, UT**

Since its inception in 2012, BLIS has steadily matured into a stable and highly functional framework for implementing libraries for BLAS and BLAS-like operations. In this talk, we review the current status of BLIS, as well as improvements made to the framework in the previous year. These changes fall into three broad categories: performance-related, user experience, and developer experience.

### libFLAME Optimizations with BLIS

**Kiran Varaganti, AMD**

ibFLAME is a LAPACK-like library providing dense linear algebra functionality. The performance of libFLAME can be boosted by linking with high-performance BLAS-compatible libraries. BLIS is one such library. In this talk we present our experience of optimizing libFLAME’s Cholesky, LU & QR factorizations by optimizing the BLIS library.

### An Algorithmic Specific Code Generator for Matrix-Matrix Multiply-Like Operations

**Richard Veras, CMU**

Matrix-Matrix Multiply-like operations occur beyond Dense Linear Algebra (DLA) in fields such as graph analytics, genomics and machine learning. These fields deal with ever growing datasets and expectations of timely results to make critical decisions, thus performance is paramount. However, translating the performance achieved for General Matrix-Matrix Multiplication (DGEMM) in DLA requires a monumental effort.

For the domain expert there are two options: rely on the performance achieved by the compiler or expend considerable effort to tune these domain specific Matrix-Matrix Multiply-like kernels by hand. In this talk, we present an algorithm specific code generator for generating high performance matrix-matrix multiplication-like operations. We build on our previous work for implementing efficient DGEMM kernels and apply compiler techniques in order to generalize and automate the generation of these specialized operations.

### Scalable Dense Matrix Multiplication on Multi-Socket Many-Core Systems with Fast Shared Memory

**Natalia Vassilieva, Hewlett Packard Labs**

### A set of high performance kernel matrix operations on CPU and GPU

**Chenhan Yu**

GEMM (general matrix-matrix multiplication) is known to be the speed of light in high performance computing by exploiting memory hierarchy and instruction scheduling. We present GKMX (general kernel matrix operations) to abstract GEMM algorithms with semiring operations with flexible datatypes. We show that by fusing additional kernel and reduce operations GKMX can quickly instantiate high performance and low latency compute kernels for different applications. We present up to 10x speedup on Intel x86, 50x on KNL and 3x on NVIDIA GPU architectures with several N-body applications such as kernel summation, k-nearest neighbor and k-means clustering.

Last modified: Mon Sep 19 20:30:46 CDT 2016