BLIS Retreat 2015

Contributed talks

Implementation of Linear Algebra Libraries for Embedded Architectures Using BLIS.

Devangi Parikh

In this talk, we describe the implementation of dense linear algebra libraries on TI embedded processors using BLIS. We integrated DMA capabilities into BLIS to move data efficiently, hence improving the performance of Level 3 BLAS functions on TI DSP. We will present the framework of the DMA integration, and also present results of this work.

Porting the BLIS micro-kernel to PeachPy

Marat Dukhan

PeachPy is a Python-embedded domain-specific language for assembly programming of high-performance kernels. PeachPy represents assembly instructions, registers, and operands as Python objects and lets performance optimization experts use Python for assembly meta-programming. This talk will cover new features of PeachPy and author's experience with porting BLIS micro-kernels to PeachPy. It will demonstrate how PeachPy simplifies many aspects of low-level assembly programming in BLIS kernels:

Software implementation of correlated quantum chemistry methods that exploits advanced programming tools and new computer architectures

Evgeny Epifanovsky

Efficient implementation of post-Hartree-Fock methods in quantum chemistry such as coupled cluster theory for the ground state and equation-of-motion theory for excited states faces two primary challenges. Firstly, the programmable equations arising from the theory can be quite complex and non-trivial to implement as a computer program without the right tools. Secondly, the quickly growing computational cost of the methods when used for larger molecular systems demands the use of the latest computer technology.

In this paper we present a software development strategy that separates the process of computational method development from the process of software optimization and tuning for a particular computer platform. This removes the necessity to reimplement and reoptimize each electronic structure method for a computer architecture du jour. This strategy is adopted in the development of the Q-Chem quantum chemistry package for coupled cluster, algebraic diagrammatic construction, and other method suites based on our open-source tensor algebra library (libtensor).

To demonstrate the success of this strategy we show how a massively parallel capability is added to our existing suite of coupled cluster methods by interfacing libtensor with CTF, an open-source library for distributed memory tensor computations. The result is a single set of high-level codes that enable both shared and distributed memory calculations using the same executable, allowing the user to select an implementation that offers the most favorable time to solution.

Adding Efficient Scheduling Policy to SuperMatrix on Heterogeneous Platforms

Jianyu Huang

Here we provide an orthogonal view to show our resilient dense linear algebra software stack to embrace BLIS. Supermatrix is a run-time system extension to libflame that enables task-level parallel execution. It maps dense linear algebra operations (BLAS3 and LAPACK) into tasks as nodes in directed acyclic graph (DAG), analyzes the dependencies between tasks and schedules tasks on independent threads of execution. However, it mainly targets at shared-memory multicore platforms.

In this talk, we propose some efficient supports for Supermatrix to work on heterogeneous platforms composed of multi-core processor and multiple hardware accelerators (GPU, MIC, etc). We leverage HEFT (Heterogeneous Earliest Finish Time) scheduling policies and achieve reasonable load balance results.

Solving the Small Matrix Linear Algebra Problem

Greg Henry

Small matrix computations run fastest when they have hand-tuned algorithms just for an application’s specific sizes. While “border” computations never matter for larger problems, for smaller problems they are critical but also far too numerous to include in a library every possible small case. Since there are too many small cases to cover exhaustively, many libraries compromise performance to create a smaller and more manageable foot-print. Sometimes the resulting “compromised” performance is between 0.1 and 0.5 times what’s possible. In this work, we present a high performance implementation of small matrix BLAS for the latest Intel architectures. We show a new way of looking at math libraries to dynamically achieve top performance without this unmanageable foot-print problem, and show how other statically-built solutions (such as exist in BLIS, Intel® MKL, ATLAS, OpenBLAS) are missing performance opportunities.

Linkage Disequilibrium is BLIS

Tze Meng Low

Linkage disequilibrium (LD) is a fundamental technique used in population genetics studies to understand how mutations interact, and affect the population. Improvements in DNA sequencing technology has spurred population studies, making LD-based analysis even more pervasive.

In this talk, we share our initial experience in using BLIS to compute LD. We cast LD into GEMM, and show how the BLIS micro-kernel needs to be adapted in order to compute LD. We also show that as DNA sequencing technology improves, the BLIS approach naturally scales with the amount of DNA data available. Finally, we discuss some possible hardware features that may be needed to compute LD efficiently.

BLAS for Tensor: What, Why, and How

Devin Matthews

The BLAS interface has stood as a standard interface for basic matrix operations for more than twenty years. While this interface has its limitations, a compromise of low-level access patterns, sufficient functionality, and a reference implementation that (at the time) provided sufficient performance has allowed both users and implementers to easily adopt it. For tensors operations though, which otherwise share many computational and mathematical similarities to matrix analogues, no such universal interface has emerged.

In this talk I will discuss just what such an interface would, should, or could include in analogy to the BLAS, and how a reference implementation can (and will) be implemented using the BLIS framework. Lastly, I will discuss the important differences between tensors and matrices that make the issue of universal interfaces more complicated for tensor applications.

A Systematic Approach for Blocking of Convolutional Neural Networks for Architectures with Hierarchical Memories

Ardavan Pedram

ROTE + DxTer = Bigger, Faster CCSD

Martin Schatz

Multi-dimensional arrays, also known as tensors, are useful structures to represent complex data. The large-scale datasets in computational chemistry often require distributed-memory parallel methods to perform tensor contractions. Typically, an optimization across a large number of operations (including data distribution, redistribution, and algorithm choice) is required to develop efficient implementations of applications in this domain (such as spin-adapted CCSD). This process is daunting due to the higher-order nature of these models and the complexity of the applications. To address this, we define a notation that expresses the data distribution and redistribution of dense, non-symmetric tensors (implemented in the Redistribution Operations and Tensor Expressions or ROTE API). This notation exposes a systematic procedure for algorithm derivation (implemented in the DxTer prototype system), enabling the automation of creating high-performance, distributed-memory implementations for given applications. Experiments performed on the IBM BlueGene/Q and Cray XC30 architectures show that the ideas presented here improve upon other state-of-the-art methods in terms of performance and storage requirements.

Optimizing Tensor Contraction Sequences via Data Movement Lower Bounds

Saday Sadayappann

Sequences of tensor contractions with producer-consumer relationships arise frequently in ab initio quantum chemistry models such as coupled cluster methods. A specific computationally intensive example is the so called 4-index transformation to transform a four-dimensional tensor from an atomic basis to a molecular basis. This transformation is most efficiently implemented as a sequence of four tensor contractions that each contract a 4-D tensor with a 2D transformation matrix. It is very challenging to develop an effective mapping/scheduling of the computation on a distributed-memory parallel system to control the amount of data movement, across nodes of a parallel machine, as well as within the disk/memory/cache hierarchy. This is because of the huge number of possible fusion/tiling choices for the sequence of four tensor contractions. We present an approach that uses lower bounds modeling of data movement complexity to address this problem. The methodology has been effective in enabling the development of a new implementation of the 4-index transform that improves on the currently available implementations in the NWCHEM quantum chemistry software suite.

Tony Skjellum

Adding Algorithm-Based Fault Tolerance (ABFT) to BLIS

Tyler Smith

Fault-tolerant computing is likely to become a necessary ingredient if supercomputers are to operate in the Exascale era within a reasonable power budget. In this paper we examine how Algorithm-Based Fault Tolerance (ABFT) can be added to BLIS's implementation of general matrix-matrix multiplication (GEMM). Our approach exploits BLIS's internal organization in order to balance the workspace requirements and the overheads introduced by error detection and correction. We also investigate the implications on parallel performance achieved both within an Intel Xeon E5-2680 node and across nodes of a distributed memory architecture.

BLIS/libflame: past, present, and future

Robert van de Geijn

BLIS and libflame on the Myriad2 MA2100

Cormac Brick

The Current State of BLIS

Field Van Zee

Since its inception in 2012, BLIS has steadily matured into a stable and highly functional framework for implementing libraries for BLAS and BLAS-like operations. In this talk, we review the current status of BLIS, its marquee features (including those of interest to end-users as well as developers), and also provide a look ahead to changes and improvements planned for the future.

Generation of SIMD Dense Linear Algebra Kernels with Analytical Models

Richard Veras

The popular use of autotuning and code generation in the domain of dense linear algebra (DLA) is driven by the need to reduce the effort required of a domain expert to manually optimize the general matrix multiply (GEMM) kernel in a high performance DLA library for every supported architecture. However, many existing autotuning and code generation tools used within the DLA domain require a domain expert to manually identify and implement high-performance architecture-specific kernels. In this talk we discuss how we automate the last mile in generating the GEMM kernel using a modern architecture as our motivating example. Additionally, we will discuss open issues and how we can extend our technique to other architectures.

BLIS-based High-Performance Compute Kernels in N-body Problems

Chenhan Yu

N-body problems such as kernel summation and nearest neighbor search are cornerstone problems in computational physics, non-parametric statistics, and machine learning. For N points, exact kernel summation and neighbor search require quadratic works, but many fast algorithms can approximate the results by divide-and-conquer. The common computational kernel in all these algorithms involve a GEMM-like operation, which is benefited but also limited from the current BLAS development. Inspired by the BLIS framework, we propose an efficient implementation that fuses multiple operations in the micro-kernel to utilize the memory throughput. We present an analysis of the algorithm and explain parameter selection. Overall we observe significant speedups (1X-10X), portability inherited from BLIS and higher flexibility in supporting different kernel functions.

Exploiting BLIS to optimize LU with pivoting

Xianyi Zhang