
Section 5.2 Talks

Subsection 5.2.1 John Gunnels, "Context is (Almost) Everything: A Framework for Productizing Ozaki-Style Emulation"

NVIDIA
Abstract
The rapid growth of AI has led to a marked increase in low-precision capabilities on modern GPUs. Given that these capabilities are ~100x greater than native FP64 compute, it is natural to look at emulation methods, such as the Ozaki scheme for matrix multiplication. Some users are reluctant to employ such methods in non-research environments without sufficient guardrails. In this talk, we will discuss the framework we employ for providing performance, a familiar environment, and reliability. We delve into the matrix analysis required, the trade-offs between the depth of analysis and efficiency, and show early results on NVIDIA’s Blackwell GB200 and RTX Pro 6000 Server Edition.
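The splitting idea behind the Ozaki scheme can be sketched in a few lines of NumPy. This is a toy illustration of the general approach, not NVIDIA's production framework: the slice count and 26-bit split width are illustrative parameters, and a real implementation would execute each partial product on low-precision tensor-core hardware rather than in FP64.

```python
import numpy as np

def split_top_bits(a, bits):
    """From each entry, extract a component with at most `bits`
    significant bits; returns (component, remainder)."""
    mag = np.abs(a) + (a == 0)          # avoid log2(0); zeros split as 0 + 0
    scale = 2.0 ** (np.floor(np.log2(mag)) - (bits - 1))
    top = np.round(a / scale) * scale   # exactly representable in `bits` bits
    return top, a - top

def ozaki_matmul(A, B, bits=26, slices=2):
    """Reconstruct A @ B as a sum of pairwise products of slices.
    Each slice pair product is exact in reduced precision, so the
    partial GEMMs could run on low-precision units (sketch only)."""
    A_parts, B_parts = [], []
    for _ in range(slices):
        tA, A = split_top_bits(A, bits)
        tB, B = split_top_bits(B, bits)
        A_parts.append(tA)
        B_parts.append(tB)
    C = np.zeros((A_parts[0].shape[0], B_parts[0].shape[1]))
    for Ai in A_parts:
        for Bj in B_parts:
            C += Ai @ Bj                # in production: a low-precision GEMM
    return C
```

With two 26-bit slices, the dropped remainders are at the level of the FP64 unit roundoff, so the reconstructed product matches a native FP64 GEMM to near machine precision.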

Subsection 5.2.2 Stepan Nassyr, "Continued experiments in microkernel generation"

Juelich Supercomputing Center

Subsection 5.2.3 Marco Barbone, "Fast NUFFT and SIMD"

Flatiron Institute
Abstract
The nonuniform fast Fourier transform (NUFFT) is widely used in imaging, astronomy, and scientific computing. This talk reviews the kernel-based NUFFT algorithm and presents recent performance advances in FINUFFT on CPUs. We show how accurate, fast kernel evaluation via piecewise-polynomial approximations and vectorized Horner’s rule removes the kernel as a bottleneck, shifting focus to the “spreading” (type-1) and interpolation (type-2) stages. To unlock reliable vectorization, we introduce a compile-time dispatch mechanism that specializes over small integer parameters (e.g., kernel width \(w/N_s\)) while preserving a clean runtime API. We then detail a SIMD spreading scheme using fused multiply-adds and lane “zip” swizzles to reuse kernel loads across complex lanes, reducing memory traffic and improving throughput. Across compilers and CPUs, these techniques deliver 10–100% speedups for spreading and 30–100% faster end-to-end transforms at typical tolerances (\(10^{-6}\) to \(10^{-12}\)).
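As a minimal illustration of the Horner-based kernel-evaluation strategy described above (the coefficients below come from a generic least-squares fit of a stand-in function, not from FINUFFT's actual kernel tables):

```python
import numpy as np

def horner(coeffs, x):
    """Evaluate a polynomial at every point of x with Horner's rule.
    coeffs[0] is the highest-degree coefficient; each loop iteration
    is one multiply-add, which maps cleanly onto SIMD FMA instructions."""
    y = np.full_like(x, coeffs[0], dtype=np.float64)
    for c in coeffs[1:]:
        y = y * x + c
    return y

# Illustrative stand-in for a spreading kernel: fit exp(x) on [-1, 1]
# with one degree-10 polynomial (FINUFFT uses piecewise fits, one low-
# degree polynomial per subinterval of the kernel support).
xs = np.linspace(-1.0, 1.0, 200)
coeffs = np.polyfit(xs, np.exp(xs), 10)

x = np.linspace(-1.0, 1.0, 1000)
max_err = np.max(np.abs(horner(coeffs, x) - np.exp(x)))
```

A degree-10 fit on this interval is already accurate to well below typical NUFFT tolerances, which is why polynomial evaluation can replace direct (and much slower) special-function kernel evaluation.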

Subsection 5.2.4 Chao Yin, "Low-Precision GEMM on ARM Architecture"

SMU
Abstract
Low-precision general matrix multiplication (GEMM) has become increasingly important due to its widespread use in machine learning workloads. Despite its significance, the BLIS framework does not yet provide support for low-precision operations. In this work, we investigate the implementation and performance of low-precision GEMM on ARM architectures, with a focus on FP16 arithmetic and mixed-precision FP32–FP16 computations.
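The accumulation question at the heart of mixed-precision GEMM can be shown with a toy example (NumPy stands in here for the ARM FP16 kernels; names are illustrative): storing inputs in FP16 but accumulating in FP32 avoids the stagnation that pure FP16 accumulation hits once the running sum outgrows FP16's spacing.

```python
import numpy as np

def gemm_mixed(A16, B16):
    """FP16 storage, FP32 accumulation: upcast, then multiply.
    A sketch of the mixed FP32-FP16 scheme, not the BLIS kernel."""
    return A16.astype(np.float32) @ B16.astype(np.float32)

n = 4096
ones = np.ones(n, dtype=np.float16)

# FP32 accumulation recovers the exact dot product of 4096 ones.
mixed = float(gemm_mixed(ones[None, :], ones[:, None])[0, 0])

# Naive FP16 accumulation stalls: above 2048 the FP16 spacing is 2,
# so adding 1.0 rounds back down and the sum sticks at 2048.
acc = np.float16(0.0)
for x in ones:
    acc = np.float16(acc + x)
```

The same trade-off drives the choice of microkernel accumulator width: FP16 inputs halve memory traffic, but a wider accumulator is what keeps long inner products accurate.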

Subsection 5.2.5 Devin Matthews, "BLIS Day the 13th: BLIS out of BLAS"

SMU
Abstract
This talk explores themes which extend the core BLIS framework in new directions outside of the traditional BLAS API. Starting with a brief retrospective of the BLIS project, we first examine the TBLIS library which leverages BLIS to perform dense and structured tensor contractions. Then, we revisit the libflame layer and explore how the BLIS framework can be stretched all the way from low-level unblocked linear algebra to massively parallel distributed computing across both CPUs and GPUs.
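For a flavor of the tensor-contraction-to-GEMM mapping that TBLIS generalizes (this sketch uses explicit transpose-and-reshape; TBLIS itself avoids the intermediate copies via its block-scatter matrix layout):

```python
import numpy as np

# Contract C[a, c] = sum over b, d of A[a, b, d] * B[d, b, c]
# by folding the contracted modes (b, d) into one GEMM dimension.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 5))    # modes (a, b, d)
B = rng.standard_normal((5, 4, 6))    # modes (d, b, c)

A_mat = A.reshape(3, 4 * 5)                     # (a, bd)
B_mat = B.transpose(1, 0, 2).reshape(4 * 5, 6)  # (bd, c)
C = A_mat @ B_mat                               # one plain GEMM

C_ref = np.einsum('abd,dbc->ac', A, B)          # reference contraction
```

Once the modes are folded this way, any high-performance GEMM, including the BLIS microkernel stack, can carry the contraction.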

Subsection 5.2.6 Tze Meng Low, "TBD"

CMU

Subsection 5.2.7 Harsh Dave, "Decomposition-Aware GEMM Kernels for High-Performance Skinny Matrix Computation"

AMD

Subsection 5.2.8 Harsh Dave, "Memory Allocator Impact on AOCL-BLAS: Jemalloc"

AMD

Subsection 5.2.9 Vasanthakumar Rajagopal, "Accelerating Eigenvalue and Singular Value Decompositions: Techniques for Improved Performance"

AMD

Subsection 5.2.10 Bhaskar Nallani, "AOCL-DLP Overview"

AMD

Subsection 5.2.11 Shiv Sundram, "REPTILE: Performant Tiling of Recurrences and Linear Solvers"

Stanford University

Subsection 5.2.12 Nima Sahraneshan, "“No-I-Meant-Another QR” (NIMA-QR): Two-Stage Newton–Schulz-Refined Mixed-Precision QR Factorisation"

Universitat Jaume I

Subsection 5.2.13 Cem Bassoy, "Fast and layout-oblivious tensor-matrix multiplication with BLAS"

DeepL

Additional Materials

Subsection 5.2.14 Jackson Vanover, "EXCVATE: Spoofing Exceptions and Solving Constraints to Test Exception Handling in Numerical Libraries"

University of California, Davis

Abstract
Testing a numerical library’s exception handling is often left to its regression tests. However, designing floating-point inputs that exercise exceptional behavior is difficult. Furthermore, existing input generation techniques are designed with the view that any exception-causing input is of interest when most are unremarkable from the standpoint of exception handling: most functions should handle exceptions correctly by construction, i.e., by returning exceptional values (NaN, ±Inf), due to IEEE 754’s mandate that most operations propagate such values. To test library exception handling, we propose an approach which – given only a set of regression test executables and a set of function prototypes – cheaply identifies potential failures via exception-spoofing and then reifies these failures using an off-the-shelf SMT solver that generates concrete inputs inducing buggy behavior. We implement this approach in a prototype tool EXCVATE, present an evaluation that targets 26 BLAS functions across three implementations, two compilers, and multiple compiler optimizations, and ultimately identify exception-handling failures in five functions in multiple BLAS versions.
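The core idea, inject an exceptional value and check whether it propagates to the output, can be illustrated with toy code. The function names and the checker below are hypothetical and are not EXCVATE's actual interface:

```python
import math

def nrm2(x):
    """Toy Euclidean norm. IEEE 754 arithmetic propagates the NaN
    through the squares, the sum, and sqrt, so this function handles
    exceptions correctly "by construction"."""
    return math.sqrt(sum(xi * xi for xi in x))

def buggy_nrm2(x):
    """Toy norm with an exception-handling bug: it silently drops
    NaN entries instead of letting them propagate."""
    return math.sqrt(sum(xi * xi for xi in x if not math.isnan(xi)))

def propagates_nan(f, n=3):
    """Spoof an exceptional input: place a NaN in one slot and check
    whether the output reports it (hypothetical checker)."""
    x = [1.0] * n
    x[0] = math.nan
    return math.isnan(f(x))
```

A checker like this flags `buggy_nrm2` immediately, while a tool such as EXCVATE goes further by using an SMT solver to find concrete non-spoofed inputs that actually trigger the exceptional path.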

Additional Materials