CS383C Numerical Analysis: Linear Algebra Project Report I Title: Performance Analysis of QR factorization on superscalar processors ------ Team members: Haiming Liu, Nitya Ranganathan ------------- Introduction: ------------- QR factorization arises in many applications like solving the least squares problem, eigen value decomposition, singular value decomposition etc. We plan to implement algorithms for QR factorization that try to optimize performance on a given superscalar processor. [1], [2] describe the loss in performance when the standard QR algorithms are implemented and give ways to improve performance by considering cache misses, memory stalls and pipeline stalls through parameters. Since superscalar processors are state-of-the-art microprocessors, and are also a building block for several multiprocessors, such optimizations are necessary to get a high level of ILP (Instruction Level Parallelism) on the core. Householder QR factorization: ----------------------------- Background Problem with basic Householder QR decomposition One drawback of the basic Householder algorithm is that it doesn't take into account the registers and cache hierarchy of modern superscalar processors. In each step, it computes a Householder reflection matrix that will zero out the the elements below the diagonal element of a column and updates the remaining columns. If the matrix to decompose can't fit into the cache, this will flush out the elements in the cache before they are used next time. Furthermore, most of the computations are column updates, which are BLAS-2 operations. This will result in high cache miss rates and poor performance on a modern superscalar processor. To improve the performance of Householder QR decomposition on modern superscalar processors, the algorithm has to consider the registers and cache hierarchy of these processors. In a paper by James J. Carrig and Gerard G. L. Meyer, two parameterized Householder QR decomposition algorithms are proposed which take into account of the registers and cache hierarchy of modern superscalar processors. These algorithms use the same floating-point operations as the basic Householder algorithm, they are numerically correct for all problems and parameter combinations. Since the sizes of register file and cache are different on different superscalar processors, one must choose different parameters to achieve best performance on a target platform. In their paper, they evaluated their algorithms on Intel Pentium Pro, IBM SP2, and SGI Power Challenge XL. They also gave guidelines about choosing parameter values for different superscalar processors. Cache Householder QR decomposition The cache model for the cache Householder decomposition algorithm model is: the processor is connected to L>=2 caches; the caches are fully associative and use least-recently-used(LRU) replacement policy. In the cache Householder algorithm, instead of updating columns j+1 to n immediately after computing u(j), it performs only those updates necessary to accumulate u(j), u(j+1), ..., u(j+s-1) using basic Householder algorithm. Then, the updates due to s vectors are performed together. As a result, most columns are read into the cache only n/s times instead of n times as in the basic Householder algorithm. This is very similar to Gaussian elimination using BLAS_3 operations. Register Householder QR decomposition The purpose of the register Householder algorithm is to save on register loads and stores. After computing u(j), instead of updating columns j+1 to n one column a time, this algorithm interleaves the computations that are involved when updating w columns. By accumulating w inner products over w columns and holding the partial inner products in registers(provided w is sufficiently small), each element of u(j) can be loaded into the register only once to process the w columns. The Matlab implementation of basic Householder algorithm and cache Householder algorithm function [A,v,alpha]=BH(A) %QR using basic Householder reflection %R is stored in the upper triangular of A %P is stored as a product form, actually only the u(i)'s are stored in the %lower triangular part of A, the first entry of u(i) is stored in v(i), %u(i) is not normalized [m,n]=size(A); for i=1:min(m-1,n) v(i) = norm(A(i:m,i)); u = A(i:m,i); u(1) = u(1)-v(i); alpha(i) = 1/(norm(u)^2); %A(i:m,i+1:n) = A(i:m,i+1:n) - 2*u*(u'*A(i:m,i+1:n)); for k = i+1:n A(i:m,k) = A(i:m,k) - 2*alpha(i)*u*(u'*A(i:m,k)); end end function [A,v,alpha] = CH(A,s) %blocked QR using Householder reflection, step size is s %R is stored in the upper triangular of A %P is stored as a product form, actually only the u(i)'s are stored in the %lower triangular part of A, the first entry of u(i) is stored in v(i), %u(i) is not normalized, alpha(i)=1/(norm(u(i))^2) [m,n]=size(A); for c = 1:s:n p = min(s,n-c+1); [A(c:m,c:c+p-1),v(c:c+p-1),alpha(c:c+p-1)]=BH(A(c:m,c:c+p-1)); %BH is original QR using Householder reflection for k = c+p:n for j = 1:p u = [v(c+j-1),A(c+j:m,c+j-1)']'; A(c+j-1:m,k) = A(c+j-1:m,k) - 2*alpha(c+j-1)*u*(u'*A(c+j-1:m,k)); end end end5~ Givens QR algorithm: -------------------- This section attempts to determine the likely points for improvement in the standard fast Givens algorithm for a superscalar processor. [2] describes adapting a fast Givens algorithm to superscalars by optimizing it through 3 parameters. The standard Givens rotation algorithm takes more operations than the Householder algorithm for doing a QR factorization. The fast Givens algorithm tries to speed this procedure by reducing the number of flops. The idea is to represent the matrix Q by a pair (M, D) where M'M = D = diag(di) and each di is positive. The relationship between A, M and D is given by Q = MD^(-1/2) = M * diag(1/sqrt(di)); This algorithm is given below. (A,d) = Fast_Givens_QR(A): Matrices: A in R(m * (n+b)), A = [A0, B], B in R(m*b), d in Rm 1. d = 1 2. for j = 1 to n 3. for i = m to j+1 by -1 4. (A, d, t, alpha1, beta1) = rotate_1(A, d, i, j) 5. (A) = update_1_1_type_t(A, i, j, alpha1, beta1); 6. end for 7. end for Basically the standard fast Givens algorithm initializes each element of the vector d to 1. Then it updates d while transforming A into (D^1/2)Q'A. Zeroes are introduced from top to bottom and left to right in the matrix A. The zero in row i and column j is obtained by combining rows i and i-1 through either a type 0 or a type 1 fast Givens rotation. The subroutines rotate_1 and update_1_1_type_t where t is in {0, 1} accomplish the necessary transformation. 3-parameter algorithm for a superscalar architecture: The above algorithm is efficient for processors were serial floating point operations dominate the execution time, but not for most superscalar processors. We can adapt the above algorithm by considering the three parameters that are most likely to influence performance in a superscalar processor. a) The cache parameter C: This parameter is targeted to improve performance on problems where the cache is large enough to hold between 4 and m-1 rows of the A matrix. Smaller problems are cache efficient since all the elements fit in the cache. The idea is to make use of the flexibility provided by the zeroing order of the fast Givens algorithm. The standard ordering zeroes columns from bottom to top and from left to right. Placing a zero in row i requires rows i and i-1 to be read and modified. Thus the entire matrix has been read by the time the first column of zeroes is complete. If the cache is not large enough, when we come to the 2nd column, the bottom rows would have been evicted out, especially if the cache used an LRU replacement policy. The new algorithm introduces zeroes into the A matrix in diagonal bands of height C. The requirement that should be satisfied when making A(i,j) = 0 is that Condition 1: A(l, j) = 0, for l = i+1, ..., m Condition 2: A(l, k) = 0, for l = i-1, ..., m and k = 1, ... , j-1. b) The register parameter G: Note that introducing a zero in row i and column j modifies the matrix elements A(i-1, k) for k = j+1, ..., n+b. We later modify these same elements for the zero in row i-1. So the storing and reloading of each intermediate value of A(i-1, k) is overhead. If the zeroes in rows i and i-1 were introduced together, this overhead could be eliminated. If G zeroes are introduced together, the overhead of storing and reloading G-1 elements per column could be eliminated. Increasing the parameter G yields a benefit depending on the number of registers available in the processor. c) The pipeline parameter P: The pipeline parameter reorders the computation further to avoid pipeline stalls. To avoid stalls, we need every set of N (issue width of the processor) consecutive operations to be independent. This may not be satisfied by the standard algorithm because the dependence distance is very small, and if N is large then we have very few instructions to fill up the issue slots. This problem can be avoided by recognizing that the computations for different columns are still independent of one another (after the first 2 transformations discussed above). The parameter P specifies the number of columns that we completely update before moving on to the next P columns. That is, we apply alpha1 and beta1 to P columns, then alpha2 and beta2 to the same P columns and so on. The new algorithm uses the above 3 parameters to implement a superscalar-efficient procedure. Project status: --------------- Householder factorization: a) Read paper [1]. b) Implementation of the algorithm in Matlab: We've implemented the proposed cache Householder algorithm in Matlab to see the performance difference between cache Householder algorithm and the basic Householder algorithm. We measured the time to do QR decomposition using basic Householder algorithm and cache Householder algorithm for matrices of different sizes. But the performance difference is small. We're trying to find out why. Givens rotation: a) Read paper [2]. Work to do: ---------- The register Householder algorithm is more complex and it uses both the basic Householder algorithm and the cache Householder algorithm. We need to understand the algorithm and implement it. We'll try to see if we can optimize these 2 algorithms by using matrix multiplication which is BLAS_3 because the cache Householder algorithm updates the remaining columns one column a time, which is BLAS_2. Our final goal is to implement these 2 algorithms in C and optimize and evaluate their performance on more platforms: Pentium 4, Xeon, UltraSparc, and Alpha. The 3-parameter Givens algorithm will be implemented first in Matlab and its performance tested. We intend to optimize the algorithm for Pentium 4 and Alpha 21264 by doing a search space over the parameters. References: ----------- 1. James J. Carrig Jr. Gerard G. L. Meyer, "Efficient householder QR factorization for superscalar processors", ACM Transactions on Mathematical Software (TOMS), Volume 23, Issue 3 (September 1997). 2. James J. Carrig Jr. and Gerard G. L. Meyer, "A three-parameter fast Givens QR algorithm for superscalar processors'', in 1996 International Conference on Parallel Processing, August 1996, vol. II, pp. 11-18.