Fast Fourier Transforms on the Cell Broadband Engine

In collaboration with Nikos Pitsianis and Xiaobai Sun.

The Cell Broadband Engine.
Figure 1
2D FFT: Time Diagram.
Figure 2
Streaming 2D FFTs: Overlapping computation and communication.
Figure 3



Hierarchical Sparse Direct Solver for hp-adaptive Finite Element Methods

In collaboration with Victor Eijkhout, Kyungjoo Kim, Jason Kurtz and Robert van de Geijn.

We set out to efficiently compute the solution of a sequence of linear systems A_i x_i = b_i, where the matrix A_i is tightly related to matrix A_{i-1}.

In the setting of an hp-adaptive Finite Element Method, the sequence of matrices A_i results from successive local refinements of the problem domain. At any step i > 1, a factorization already exists and it is the updated linear system relative to the refined mesh for which a factorization must be computed in the least amount of time. This observation holds the promise of a tremendous reduction in the cost of an individual refinement step.

We argue that traditional matrix storage schemes, whether dense or sparse, are a bottleneck, limiting the potential efficiency of the solvers. We propose a new hierarchical data structure, the Unassembled Hyper-Matrix (UHM), which allows the matrix to be stored as a tree of unassembled element matrices, hierarchically ordered to mirror the refinement history of the physical domain. The factorization of such an UHM proceeds in terms of element matrices, only assembling nodes when they need to be eliminated. Efficiency comes in terms of both performance and space requirements.

A Systematic, Goal-oriented, Modular Approach to Stability Analysis

In collaboration with Robert van de Geijn.

Stability Analysis for LU factorization (Crout variant).
Table 1
Stability Analysis for a blocked LU factorization.
Table 2



Accurate and Fast Evaluation of the Prolate Spheroidal Functions

In collaboration with Xiaobai Sun.

Prolate functions of degree 1.
Figure 1
Prolate functions of degree 2.
Figure 2
Eigenvectors:
c=50, n=50.
Figure 3
Eigenvectors:
c fixed, n varying.
Video 1 (1.1Mb)

Performance Modelling and Prediction for Dense Linear Algebra Algorithms

In collaboration with Kazushige Goto, Robert van de Geijn.

Scheduling for multi-core and shared memory architectures.
Video (1Mb)

Mechanical Generation and Implementation of Correct Algorithms for Dense Linear Algebra

This project is at the core of my Ph.D. dissertation.
As part of the FLAME project at UT-Austin, we discovered a systematic procedure for generating formally correct algorithms for a large class of linear algebra operations. The input to the procedure is a mathematical description of a target operation; the output is an algorithm or a routine computing the operation. The output algorithm is formally correct in the sense of Hoare's triples. Debugging and testing is needed only to certify the translation to code.
The key insight to this result is that for many linear algebra operations it is possible to state -a priori- one or more loop-invariants for loop-based algorithms computing the operation. Once one loop-invariant is stated, a corresponding formally correct algorithm can be built around it. The generation procedure can be repeated, resulting in a family of different algorithms implementing the same operation.

In my dissertation "Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms" ([PS] [PDF]), I present the methodology, in the form of a procedure, for generating families of correct algorithms. Then I demonstrate that the methodology is so systematic that it is possible to automate it. I describe a prototype of a mechanical system, developed in Mathematica, which implements the generation procedure and requires only limited human intervention.

My system returns algorithms as well as code, reducing even further the need for extensive and time consuming testing. The scope of the procedure includes, but it is not limited to, all the operations in the BLAS library, many operations arising in control theory (all the operations in the RECSY library), and several operations from LAPACK.

In the paper "Families of Algorithms Related to the Inversion of a Symmetric Positive Definite Matrix" ([PS] [PDF]), we use my system to generate a family of algorithms for computing the covariance matrix. There, we show that having a set of different routines (algorithms) computing the same operation, is indispensable in order to obtain high-performance on different architectures and settings.

PMR3: A Parallel Eigensolver for Symmetric Matrices Based on the MRRR Algorithm

In collaboration with Inderjit Dhillon and Robert van de Geijn.

PMR3: Breakdown of execution time.
Figure 1
Comparison of Parallel Eigensolvers.
Figure 2
PMR3: Incremental Speedup. Constant memory allocation per processor.
Figure 3


Given a symmetric matrix A, we are interested in computing all or a subset of its eigenvalues L and the corresponding eigenvectors Z.
In symbols: A Z = Z L, where A is the input matrix, L is a diagonal matrix containing eigenvalues and the columns of matrix Z are the corresponding eigenvectors.

Together with Prof I. Dhillon and Prof. R. van de Geijn, I developed the first parallel dense symmetric eigensolver based on the Algorithm MRRR (Algorithm of Multiple Relatively Robust Representations), proposed by Dhillon and Parlett. Algorithm MRRR is the first stable O(n k) algorithm for computing k eigenvalues and eigenvectors of a symmetric tridiagonal n-by-n matrix.
Our parallel algorithm consists of three stages: 1) reduction to tridiagonal form, 2) solution of the symmetric tridiagonal eigenproblem, and 3) backtransformation. Both the reduction and backtransformation phases require O(n^3) arithmetic operations. Until recently, all algorithms for the tridiagonal eigenproblem too had cubic complexity in the worst case; these include the QR algorithm, inverse iteration and the Divide&Conquer method. Some implementations, such as Divide&Conquer, have an extra O(n^2) memory requirement.
Our parallel algorithm balances the workload as well as the memory requirement among the processors, allowing problems of very large size to be solved efficiently and very satisfactory speedup is attained. Thanks to the O(n^2) complexity of the tridiagonal eigensolver, the time spent on Stage 2) is now negligible compared to the time spent on the reduction and backtransformation. Comparisons against SCALAPACK's routine PDSTEDC (Divide & Conquer) show that our tridiagonal eigensolver is always faster. Stages 1) and 3) have been implemented using the PLAPACK library and allow problems of very large size to be tackled, while Stage 2) is implemented in Fortran and C with explicit invocations to MPI, and attains almost perfect speedup and scalability. Our parallel software is being used and studied in many institutions and laboratories.

Paolo Bientinesi, Inderjit S. Dhillon, Robert A. van de Geijn. A Parallel Eigensolver for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations. [PS] [PDF]
SIAM Journal on Scientific Computing, Vol. 27, No. 1, 2005.

Numerical Linear Algebra

Matrix-Vector Multiplication
Column-wise & row-wise computation. (click to start)