Let us consider the simple example of the implementation of the Cholesky factorization, which we will revisit in Chapter . Given a symmetric positive definite matrix $A$, the Cholesky factorization of $A$ is given by
$$
A = L L^T
$$
where $L$ is lower triangular.
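As a concrete (illustrative) instance, one can verify the factorization for a small symmetric positive definite matrix:
$$
\begin{pmatrix} 4 & -2 \\ -2 & 10 \end{pmatrix}
=
\begin{pmatrix} 2 & 0 \\ -1 & 3 \end{pmatrix}
\begin{pmatrix} 2 & -1 \\ 0 & 3 \end{pmatrix},
$$
since multiplying out the right-hand side reproduces each entry of the matrix on the left.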
The algorithm for implementing this operation can be described by partitioning (blocking) the matrices
$$
A = \left( \begin{array}{c|c} \alpha_{11} & \star \\ \hline a_{21} & A_{22} \end{array} \right)
\quad \mbox{and} \quad
L = \left( \begin{array}{c|c} \lambda_{11} & 0 \\ \hline l_{21} & L_{22} \end{array} \right)
$$
where $\alpha_{11}$ and $\lambda_{11}$ are scalars, and $a_{21}$ and $l_{21}$ are vectors of length $n-1$. The $\star$ indicates the symmetric part of $A$. Now,
$$
\left( \begin{array}{c|c} \alpha_{11} & \star \\ \hline a_{21} & A_{22} \end{array} \right)
=
\left( \begin{array}{c|c} \lambda_{11} & 0 \\ \hline l_{21} & L_{22} \end{array} \right)
\left( \begin{array}{c|c} \lambda_{11} & l_{21}^T \\ \hline 0 & L_{22}^T \end{array} \right)
=
\left( \begin{array}{c|c} \lambda_{11}^2 & \star \\ \hline \lambda_{11} l_{21} & l_{21} l_{21}^T + L_{22} L_{22}^T \end{array} \right).
$$
This in turn yields the equations
$$
\lambda_{11} = \sqrt{\alpha_{11}}, \qquad
l_{21} = a_{21} / \lambda_{11}, \qquad
L_{22} L_{22}^T = A_{22} - l_{21} l_{21}^T.
$$
We now conclude that the following steps will implement the Cholesky factorization, overwriting the lower triangular portion of $A$ with $L$:
- $\alpha_{11} \leftarrow \lambda_{11} = \sqrt{\alpha_{11}}$.
- $a_{21} \leftarrow l_{21} = a_{21} / \lambda_{11}$.
- $A_{22} \leftarrow A_{22} - l_{21} l_{21}^T$, updating only the lower triangular part of $A_{22}$.
- compute the Cholesky factorization of updated $A_{22}$ recursively, yielding $L_{22}$.
We make a few observations:
- The primary opportunity for parallelism is in the symmetric rank-1 update: $A_{22} \leftarrow A_{22} - l_{21} l_{21}^T$.
- The algorithm does not explicitly index individual elements of the original matrix. Notice that it references parts of the current matrix, not the original matrix.
- The blocks are simply references into the original matrix.
- There is an inherent use of recursion.
Using the PLAPACK infrastructure, parallel code for the above algorithm is given by

```c
PLA_Obj_view_all( a, &acur );
while ( TRUE ){
  PLA_Obj_global_length( acur, &size );
  if ( 0 == size ) break;
  PLA_Obj_split_4( acur, 1, 1, &a11, &a12,
                               &a21, &acur );
  Take_sqrt( a11 );
  PLA_Inv_scal( a11, a21 );
  PLA_Syr( PLA_LOW_TRIAN, min_one, a21, acur );
}
```

Previewing the rest of the book, all information that describes $A$, its data, and how it is distributed to processors (nodes) is encoded in an object (data structure) referenced by a. The PLA_Obj_view_all call creates a second reference, acur, into the same data. The PLA_Obj_global_length call extracts the current size of the matrix that acur references. If this is zero, the recursion has completed. The call to PLA_Obj_split_4 partitions the matrix, creating new references for the four quadrants. Element $\alpha_{11}$ is updated by taking its square root in subroutine Take_sqrt, $a_{21}$ is scaled by $1/\lambda_{11}$ by calling PLA_Inv_scal, and finally the symmetric rank-1 update is achieved through the call to PLA_Syr. This sets up the next iteration, which is also the next level of recursion in our original algorithm.
In order to understand how parallelism is achieved in the above calls, we must first examine how PLAPACK distributes vectors and matrices, in Section 1.3. After this, we show how operations like the symmetric rank-1 update can be naturally parallelized, in Sections 1.5 and 1.6.