Projects:
-
Fast Fourier Transforms on the Cell Broadband Engine.
-
Hierarchical Sparse Direct Solver for hp-adaptive Finite Element Methods.
-
A Systematic, Goal-oriented, Modular Approach to Stability Analysis.
-
Accurate and Fast Evaluation of the Prolate Spheroidal Functions.
-
Performance Modelling and Prediction for Dense Linear Algebra Algorithms.
-
Mechanical Generation and Implementation of Correct Algorithms for Dense Linear Algebra.
-
PMR3: A Parallel Eigensolver for Symmetric Matrices Based on the MRRR Algorithm.
Teaching Aids:
-
Parallel Computing: Scheduling. (coming soon)
-
Linear Algebra: Matrix Computations.
Fast Fourier Transforms on the Cell Broadband Engine
In collaboration with Nikos Pitsianis and Xiaobai Sun.
|
The Cell Broadband Engine.
|
|
|
Figure 1
|
|
|
2D FFT: Time Diagram.
|
|
|
Figure 2
|
|
|
Streaming 2D FFTs: Overlapping computation and communication.
|
|
|
Figure 3
|
|
Hierarchical Sparse Direct Solver for hp-adaptive Finite Element Methods
In collaboration with Victor Eijkhout, Kyungjoo Kim, Jason Kurtz and Robert van de Geijn.
We set out to efficiently compute the solution of a sequence of linear
systems A_i x_i = b_i, where the matrix A_i is tightly related to matrix
A_{i-1}.
In the setting of an hp-adaptive Finite Element Method, the
sequence of matrices A_i results from successive local
refinements of the problem domain. At any step i > 1, a
factorization already exists and it is the updated linear system relative
to the refined mesh for which a factorization must be computed in the
least amount of time. This observation holds the promise of a tremendous
reduction in the cost of an individual refinement step.
We argue that traditional matrix storage schemes, whether dense or
sparse, are a bottleneck, limiting the potential efficiency of the
solvers. We propose a new hierarchical data structure, the Unassembled
Hyper-Matrix (UHM), which allows the matrix to be stored as a tree of
unassembled element matrices, hierarchically ordered to mirror the
refinement history of the physical domain. The factorization of such an UHM
proceeds in terms of element matrices, only assembling nodes when they need
to be eliminated. Efficiency comes in terms of both performance and space
requirements.
A Systematic, Goal-oriented, Modular Approach to Stability Analysis
In collaboration with Robert van de Geijn.
|
Stability Analysis for LU factorization (Crout variant).
|
|
|
Table 1
|
|
|
Stability Analysis for a blocked LU factorization.
|
|
|
Table 2
|
|
Accurate and Fast Evaluation of the Prolate Spheroidal Functions
In collaboration with Xiaobai Sun.
|
Prolate functions of degree 1.
|
|
|
Figure 1
|
|
|
Prolate functions of degree 2.
|
|
|
Figure 2
|
|
Eigenvectors: c=50, n=50.
|
 |
|
Figure 3
|
|
Eigenvectors: c fixed, n varying.
|
 |
|
Video 1 (1.1Mb)
|
|
Performance Modelling and Prediction for Dense Linear Algebra Algorithms
In collaboration with Kazushige Goto, Robert van de Geijn.
|
Scheduling for multi-core and shared memory architectures.
|
 |
|
Video (1Mb)
|
Mechanical Generation and Implementation of Correct Algorithms for Dense Linear Algebra
This project is at the core of my Ph.D. dissertation.
As part of the FLAME project at UT-Austin, we discovered a systematic procedure for
generating formally correct algorithms for a large class of linear algebra operations.
The input to the procedure is a mathematical description of a target operation; the
output is an algorithm or a routine computing the operation.
The output algorithm is formally correct in the sense of Hoare's triples.
Debugging and testing is needed only to certify the translation to code.
The key insight to this result is that for many linear algebra operations it is possible
to state -a priori- one or more loop-invariants for loop-based algorithms computing the
operation. Once one loop-invariant is stated, a corresponding formally correct algorithm
can be built around it.
The generation procedure can be repeated, resulting in a family of different algorithms
implementing the same operation.
In my dissertation "Mechanical Derivation and Systematic Analysis of
Correct Linear Algebra Algorithms"
(
[PS] [PDF]),
I present the methodology, in the form of a procedure, for generating
families of correct algorithms. Then I demonstrate that the methodology is
so systematic that it is possible to automate it. I describe a prototype
of a mechanical system, developed in Mathematica, which implements the
generation procedure and requires only limited human intervention.
My system returns algorithms as well as code, reducing even further the
need for extensive and time consuming testing. The scope of the procedure
includes, but it is not limited to, all the operations in the BLAS library,
many operations arising in control theory (all the operations in the RECSY library),
and several operations from LAPACK.
In the paper "Families of Algorithms Related to the Inversion of a
Symmetric Positive Definite Matrix"
(
[PS] [PDF]),
we use my system to generate a family of algorithms for computing the covariance matrix.
There, we show that having a set of different routines (algorithms)
computing the same operation, is indispensable in order to obtain
high-performance on different architectures and settings.
PMR3: A Parallel Eigensolver for Symmetric Matrices Based on the MRRR Algorithm
In collaboration with Inderjit Dhillon and Robert van de Geijn.
|
PMR3: Breakdown of execution time.
|
 |
|
Figure 1
|
|
|
Comparison of Parallel Eigensolvers.
|
 |
|
Figure 2
|
|
|
PMR3: Incremental Speedup. Constant memory allocation per processor.
|
 |
|
Figure 3
|
|
Given a symmetric matrix
A, we are interested in computing all or a subset
of its eigenvalues
L and the corresponding eigenvectors
Z.
In symbols:
A Z = Z L, where
A is the input matrix,
L is a diagonal matrix containing eigenvalues and
the columns of matrix
Z are the corresponding eigenvectors.
Together with Prof I. Dhillon and Prof. R. van de Geijn, I
developed the first parallel dense symmetric eigensolver based on the
Algorithm MRRR (Algorithm of Multiple Relatively Robust
Representations), proposed by Dhillon and Parlett. Algorithm MRRR is
the first stable
O(n k) algorithm for computing
k eigenvalues and
eigenvectors of a symmetric tridiagonal
n-by-
n matrix.
Our parallel algorithm consists of three stages: 1) reduction to tridiagonal form, 2)
solution of the symmetric tridiagonal eigenproblem, and 3) backtransformation. Both the
reduction and backtransformation phases require
O(n^3) arithmetic operations.
Until recently, all algorithms for the tridiagonal eigenproblem too had cubic complexity
in the worst case; these include the QR algorithm, inverse iteration and the
Divide&Conquer method. Some implementations, such as Divide&Conquer,
have an extra
O(n^2) memory requirement.
Our parallel algorithm balances the workload as well as the memory requirement
among the processors, allowing problems of very large size to be solved efficiently and
very satisfactory speedup is attained. Thanks to the
O(n^2) complexity of the
tridiagonal eigensolver, the time spent on Stage 2) is now negligible compared to the
time spent on the reduction and backtransformation. Comparisons against SCALAPACK's
routine PDSTEDC (Divide & Conquer) show that our tridiagonal eigensolver is always
faster. Stages 1) and 3) have been implemented using the PLAPACK library and allow
problems of very large size to be tackled, while Stage 2) is implemented in Fortran and
C with explicit invocations to MPI, and attains almost perfect speedup and scalability.
Our parallel software is being used and studied in many institutions and laboratories.
Paolo Bientinesi, Inderjit S. Dhillon, Robert A. van de Geijn.
A Parallel Eigensolver for Dense Symmetric Matrices Based on
Multiple Relatively Robust Representations.
[PS] [PDF]
SIAM Journal on Scientific Computing, Vol. 27, No. 1, 2005.
Numerical Linear Algebra
|
Matrix-Vector Multiplication
|
 |
|
Column-wise & row-wise computation. (click to start)
|
|