BLIS Retreat

Contributed talks


BLIS Overview

Field Van Zee (UT-Austin)

In this talk, I explain the what, why and how of BLIS. What are the goals of BLIS and how does it relate to BLAS, LAPACK, FLAME, GotoBLAS, etc.? Why do we need BLIS and not just a better implementation of BLAS? How does the BLIS framework enable rapid implementations of not just BLAS but higher-level libraries that require dense linear algebra? What techniques are we using to achieve a high ratio of performance to programmer effort?


Developing low-level assembly kernels with Peach-Py

Marat Dukhan (Georgia Tech)

Peach-Py is a new Python framework that allows developers to write assembly kernels in Python. Peach-Py was created to simplify the writing of assembly kernels for HPC while preserving the efficiency of hand-tuned assembly code. The framework represents assembly instructions as Python objects, and it simplifies the mixing of instruction streams, software pipelining, and the generation of multiple similar kernels (e.g., targeting different instruction sets or different data types). Peach-Py also handles register allocation, allocation of constants in the data section, and adaptation of programs to different calling conventions, and it collects information about the instruction sets used.
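
For a flavor of the framework, here is a minimal sketch adapted from the Peach-Py README (https://github.com/Maratyszcza/PeachPy); the exact API may differ between Peach-Py versions.

```python
from peachpy import *
from peachpy.x86_64 import *

x = Argument(int32_t)
y = Argument(int32_t)

# Instructions are Python objects issued inside a Function context;
# Peach-Py itself handles register allocation and ABI adaptation.
with Function("Add", (x, y), int32_t) as asm_function:
    reg_x = GeneralPurposeRegister32()
    LOAD.ARGUMENT(reg_x, x)

    reg_y = GeneralPurposeRegister32()
    LOAD.ARGUMENT(reg_y, y)

    ADD(reg_x, reg_y)
    RETURN(reg_x)

# Bind to the host calling convention, encode to machine code, and
# load the result as a callable Python function.
python_function = asm_function.finalize(abi.detect()).encode().load()
print(python_function(2, 4))  # prints 6
```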


Discussion of requirements for the BLIS Fortran interface(s)

Jeff Hammond (ALCF/UChicago)

The BLAS is a legacy library with a Fortran 77 interface, which is still one of the most common ways in which it is used. If BLIS is to be a modern successor to the BLAS, then what is the right way to serve Fortran users? Supporting the BLAS interface is critical to ensuring drop-in replaceability, but the BLIS team should aspire to something more than this. I would like to discuss user requirements for a modern Fortran interface for BLIS. To stimulate critical thinking, I will attempt to create a strawman Fortran 2003 interface.
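
To make the legacy situation concrete, here is a sketch of what calling the Fortran 77 dgemm looks like from a C-oriented language (Python via ctypes): every argument, including scalars and flags, goes by reference. The library name, the "dgemm_" symbol mangling, and the omission of hidden string-length arguments are platform assumptions, not part of any BLIS interface.

```python
import ctypes
import numpy as np

# Assumption: a reference BLAS is installed under this name on Linux.
blas = ctypes.CDLL("libblas.so.3")

m = n = k = 2
a = np.array([[1.0, 2.0], [3.0, 4.0]], order="F")  # Fortran (column-major) order
b = np.eye(2, order="F")
c = np.zeros((2, 2), order="F")

intp = lambda v: ctypes.byref(ctypes.c_int(v))
dblp = lambda v: ctypes.byref(ctypes.c_double(v))
ptr = lambda arr: arr.ctypes.data_as(ctypes.POINTER(ctypes.c_double))

# C := alpha*A*B + beta*C; transpose flags are passed as char pointers.
# (Some Fortran ABIs append hidden string lengths, which BLAS implementations
# typically never read, so this sketch omits them.)
blas.dgemm_(b"N", b"N", intp(m), intp(n), intp(k),
            dblp(1.0), ptr(a), intp(m), ptr(b), intp(k),
            dblp(0.0), ptr(c), intp(m))
print(c)  # B is the identity, so C == A
```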


Porting BLIS to new architectures: early experiences

Fran Igual (UCM)

Portability has been one of the main premises in the design of BLIS since its inception. In this talk, we review some successful experiences porting the framework to different general- and special-purpose architectures (mainly ARM and DSPs), and we give some hints about future directions for adapting the framework to other types of present and future processing architectures.


BLIS as a Research Vehicle

Bryan Marker (UT-Austin CS)

With Design by Transformation (DxT), we encode knowledge about software to automatically generate code as an expert would. To date, generating sequential and shared-memory BLAS code has been a difficult undertaking, often missing the high performance attained by experts because of the way BLAS libraries are viewed from a software engineering perspective. With BLIS, this is changing. Microkernels are still manually coded (but could be automatically generated). The layers of code built on the microkernels, though, can be automatically generated for sequential and shared-memory machines thanks to the way BLIS is structured. This would not have been possible without the building blocks BLIS provides. In this talk, I discuss how BLIS's unique layering enables my research and can enable others' research, as we are no longer tied to the traditional BLAS interfaces.


Optimizing for the inner kernel: a low-power, high-performance core for matrix multiplication kernels

Ardavan Pedram (UT-Austin ECE)

When implemented in software, high-performance matrix-matrix multiplication is blocked for successive memory layers. At the bottom of the food chain is a "micro-kernel" that performs a "block dot product". We present details of the mapping and implementation of the GEMM micro-kernel onto a custom linear algebra core (LAC) that performs this same "block dot product". Further, we compare this architecture with other core configurations, such as conventional SIMD, 2D-SIMD, and a single flat register file, in terms of area and energy efficiency. The conclusion is that the LAC can achieve orders-of-magnitude improvements in power and area efficiency compared to conventional designs.
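
For orientation, here is a NumPy sketch of the "block dot product" that the micro-kernel performs: an mr x nr block of C accumulates a sequence of rank-1 updates. Block sizes are illustrative only.

```python
import numpy as np

def micro_kernel(A_panel, B_panel, C_block):
    """C_block += A_panel @ B_panel, expressed as k rank-1 updates.

    A_panel: (mr, k) slice of packed A; B_panel: (k, nr) slice of packed B.
    """
    k = A_panel.shape[1]
    for p in range(k):
        # outer product of one column of A with one row of B
        C_block += np.outer(A_panel[:, p], B_panel[p, :])
    return C_block

mr, nr, k = 4, 4, 64
A = np.random.rand(mr, k)
B = np.random.rand(k, nr)
C = np.zeros((mr, nr))
assert np.allclose(micro_kernel(A, B, C), A @ B)
```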


When all you have is linear algebra, everything looks like a matrix

Devin Matthews (UT-Austin Chem)

Computational quantum chemistry, and in particular high-accuracy, high-cost methods such as coupled cluster, relies heavily on tensor operations. However, the lack of efficient and flexible tensor libraries* requires that tensor operations either be written using simple nested loops or be recast into a form amenable to the use of optimized linear algebra packages such as the BLAS. Considerations such as point-group and index-permutational symmetry further restrict the tensor contraction "patterns" to which linear algebra can be applied, necessitating a large amount of extraneous data movement. The flexibility and extensibility of the BLIS framework can potentially alleviate some of these restrictions, as illustrated by its application to the CFOUR program suite.

*To be fair, this is an area of ongoing research with several promising projects.
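
For illustration, here is a sketch of the usual recasting of a tensor contraction into a matrix multiplication via transpose and reshape; the shapes and index labels are made up, not taken from the talk.

```python
import numpy as np

a, b, i, j, c = 5, 6, 4, 3, 7
T = np.random.rand(a, b, i, j)   # T[a,b,i,j]
W = np.random.rand(c, b)         # W[c,b]

# Contract over b: R[a,i,j,c] = sum_b T[a,b,i,j] * W[c,b].
# Move b last, flatten the free indices, multiply, reshape back.
# The transpose/reshape is exactly the extra data movement the
# abstract refers to.
Tm = T.transpose(0, 2, 3, 1).reshape(a * i * j, b)   # (a*i*j, b)
R = (Tm @ W.T).reshape(a, i, j, c)

assert np.allclose(R, np.einsum("abij,cb->aijc", T, W))
```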


Performance-Portable Kernels in OpenCL: Lessons Learned

Karl Rupp (Argonne MCS)

(Abstract skipped because the title says it all!)

The talk will discuss recent results from OpenCL-based autotuning for GPUs and CPUs.

Read the paper
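
As a flavor of the approach, here is a toy autotuning sweep over work-group sizes, timed with OpenCL profiling events via pyopencl; the kernel and parameter space are illustrative, not taken from the talk.

```python
import numpy as np
import pyopencl as cl

src = """
__kernel void saxpy(const float alpha, __global const float *x,
                    __global float *y) {
    int i = get_global_id(0);
    y[i] = alpha * x[i] + y[i];
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(
    ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
prg = cl.Program(ctx, src).build()

n = 1 << 20
x = np.random.rand(n).astype(np.float32)
y = np.zeros(n, dtype=np.float32)
mf = cl.mem_flags
x_d = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
y_d = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=y)

# Sweep one tuning parameter (the work-group size) and keep the fastest.
best = None
for wg in (32, 64, 128, 256):
    evt = prg.saxpy(queue, (n,), (wg,), np.float32(2.0), x_d, y_d)
    evt.wait()
    t = (evt.profile.end - evt.profile.start) * 1e-9  # nanoseconds -> seconds
    if best is None or t < best[1]:
        best = (wg, t)
print("best work-group size:", best)
```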


Parallelizing BLIS

Tyler Smith (UTCS)

I discuss how DGEMM has been parallelized within BLIS and its implementation using what we are calling thread communicators. I also discuss how the multiple levels of parallelization within BLIS map to the multiple levels of the cache hierarchy, and a heuristic for choosing which levels should be parallelized on a given architecture. Finally, I discuss the results of parallelizing BLIS for a couple of architectures.
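
As a toy sketch of one such level of parallelism, the outermost GEMM loop below partitions C (and B) into column blocks divided among threads; a Python thread pool stands in for BLIS thread communicators, and the partitioning is illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def gemm_block(A, B, C, j0, j1):
    # Each thread owns a disjoint column block of C, so no locking is
    # needed; NumPy releases the GIL inside the matrix multiply.
    C[:, j0:j1] += A @ B[:, j0:j1]

def parallel_gemm(A, B, C, nthreads=4):
    n = C.shape[1]
    step = (n + nthreads - 1) // nthreads
    with ThreadPoolExecutor(nthreads) as pool:
        for t in range(nthreads):
            j0, j1 = t * step, min((t + 1) * step, n)
            if j0 < j1:
                pool.submit(gemm_block, A, B, C, j0, j1)
    return C  # the 'with' block waits for all submitted tasks

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
C = np.zeros((256, 256))
assert np.allclose(parallel_gemm(A, B, C), A @ B)
```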


Performance Modeling for DLA Kernels

Elmar Peise (RWTH Aachen University)

It is well known that the performance behavior of dense linear algebra programs is greatly influenced by factors such as target architecture, underlying libraries, and problem size; because of this, the accurate prediction of their performance is a real challenge. Aware of the hierarchical structure of dense linear algebra routines and libraries, we develop a framework for the automatic generation of statistical performance models for linear algebra kernels. By evaluating and combining such models, we aim at predicting and optimizing the performance of routines at higher levels in the software hierarchy, entirely avoiding their execution and empirical tuning.

Related paper
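
A minimal sketch of the underlying idea, assuming a simple least-squares polynomial model rather than the statistical models of the paper: measure a kernel at a few sizes, fit, and predict without running the unmeasured case.

```python
import time
import numpy as np

sizes = [128, 256, 384, 512, 640]
times = []
for n in sizes:
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)
    t0 = time.perf_counter()
    _ = A @ B
    times.append(time.perf_counter() - t0)

# GEMM performs 2n^3 flops, so fit t(n) ~ a*n^3 + b*n^2 + c*n + d,
# then evaluate the model at a size that was never measured.
coeffs = np.polyfit(sizes, times, deg=3)
print("predicted time at n=1024:", np.polyval(coeffs, 1024))
```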


Effective Methods for Propagating the Schroedinger Equation in Time

Barry Schneider (NSF)

Many current approaches to propagating the Schroedinger equation in time use explicit propagators based on Krylov approximations to advance the solution. Our group has developed an efficient technique based on the short iterative Lanczos method. Here we suggest how that technique can be extended, and perhaps made more efficient, using an exponential time-differencing approach.

Extended Abstract
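
For illustration, here is a NumPy sketch of one short-iterative-Lanczos step: build a small Krylov basis from H and psi, exponentiate the tridiagonal projection of H exactly, and advance psi. The Hamiltonian here is a random Hermitian stand-in, and the basis size is illustrative.

```python
import numpy as np

def lanczos_step(H, psi, dt, m=10):
    """Advance psi by exp(-i*H*dt) approximated in an m-dim Krylov space."""
    n = len(psi)
    V = np.zeros((n, m), dtype=complex)
    alpha = np.zeros(m)
    beta = np.zeros(m - 1)
    V[:, 0] = psi / np.linalg.norm(psi)
    # Three-term Lanczos recurrence (no reorthogonalization in this sketch).
    for j in range(m):
        w = H @ V[:, j]
        alpha[j] = np.real(np.vdot(V[:, j], w))
        w -= alpha[j] * V[:, j]
        if j > 0:
            w -= beta[j - 1] * V[:, j - 1]
        if j < m - 1:
            beta[j] = np.linalg.norm(w)
            V[:, j + 1] = w / beta[j]
    # Exponentiate the small tridiagonal matrix T via its eigendecomposition.
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    evals, evecs = np.linalg.eigh(T)
    e1 = np.zeros(m)
    e1[0] = 1.0
    small = evecs @ (np.exp(-1j * evals * dt) * (evecs.T @ e1))
    return np.linalg.norm(psi) * (V @ small)

n = 200
M = np.random.rand(n, n) + 1j * np.random.rand(n, n)
H = (M + M.conj().T) / 2            # Hermitian stand-in Hamiltonian
psi = np.random.rand(n).astype(complex)
psi /= np.linalg.norm(psi)
psi_new = lanczos_step(H, psi, dt=0.01)
print(abs(np.linalg.norm(psi_new) - 1.0) < 1e-6)  # unitary step preserves norm
```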


Lessons from a parallel sparse direct solver with multilevel scheduling

Kyungjoo Kim (Sandia)

We present a parallel sparse direct solver for multi-core architectures based on Directed Acyclic Graph (DAG) scheduling. The solver is based on the multifrontal method, which converts the sparse matrix factorization into dense subproblems that are hierarchically related in an assembly tree. Our solver exploits two-level task parallelism: tasks are first generated from a parallel traversal of the assembly tree; next, tasks are further refined using algorithms-by-blocks to obtain fine-grained parallelism. The resulting fine-grained tasks are executed asynchronously after their dependencies are analyzed. We discuss the performance of the solver on problems arising from the high-order Finite Element (FE) method. While the solver uses a runtime system to schedule tasks, an open question is whether the ability to define thread groups would further enhance performance and programmability.
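
A toy illustration of the two-level task generation described above: a postorder traversal of an assembly tree yields one coarse task per front, each refined into block tasks whose front-level dependencies (children before parent) a runtime could execute asynchronously. The tree shape and block counts are invented for illustration.

```python
def generate_tasks(tree, node, tasks, nblocks=4):
    """Postorder traversal: emit children first, then the parent front."""
    deps = [generate_tasks(tree, child, tasks, nblocks)
            for child in tree.get(node, [])]
    # Level 1: one coarse factorization task per front in the assembly tree.
    # Level 2: refine it into fine-grained block tasks (algorithms-by-blocks).
    block_tasks = [f"factor({node},blk{b})" for b in range(nblocks)]
    tasks.append({"front": node, "tasks": block_tasks, "after_fronts": deps})
    return node

tree = {"root": ["f1", "f2"], "f1": ["f3", "f4"]}  # children of each front
tasks = []
generate_tasks(tree, "root", tasks)
for t in tasks:
    print(t["front"], "depends on", t["after_fronts"])
```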


Distributed Contraction of Symmetric Tensors

Saday Sadayappan (Ohio State)

Tensor contractions constitute a computationally significant kernel for many high-accuracy methods in quantum chemistry, such as coupled cluster methods. The tensors in these methods typically have significant degrees of symmetry/anti-symmetry among subsets of dimensions and are therefore represented in a compact form that minimizes redundancy in the storage of the tensors. Production parallel quantum chemistry suites such as NWCHEM implement distributed contraction algorithms for symmetric tensors, where dynamic load balancing is achieved but structure within the contraction algorithm is not exploited for communication optimization. This talk presents a framework for load-balanced communication optimization for the distributed contraction of symmetric tensors. The approach is compared to the recently developed CTF (Cyclops Tensor Framework).
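
A small illustration of the compact storage the abstract refers to, using a symmetric matrix as the simplest case: only the unique index pairs i <= j are stored, and expanding back to the full redundant form is the extra data movement that general dense kernels force. Sizes are illustrative.

```python
import numpy as np

n = 4
full = np.random.rand(n, n)
full = (full + full.T) / 2          # symmetric: A[i,j] = A[j,i]

iu = np.triu_indices(n)
packed = full[iu]                   # n*(n+1)/2 unique entries

# Expand back to full storage before handing the data to a GEMM-based kernel.
expanded = np.zeros((n, n))
expanded[iu] = packed
expanded += np.triu(expanded, 1).T

assert np.allclose(expanded, full)
print(f"{packed.size} entries stored vs {full.size} used by the dense kernel")
```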


Last modified: Mon Aug 26 21:19:56 CDT 2013