Analysis of a Class of
Parallel Matrix Multiplication Algorithms

John Gunnels
Calvin Lin
Greg Morrow
Robert van de Geijn
Department of Computer Sciences
The University of Texas at Austin
Austin, TX 78712
{gunnels,lin,morrow,rvdg}@cs.utexas.edu
A Technical Paper Submitted to IPPS 98

Abstract:

Publications concerning the parallel implementation of matrix-matrix multiplication continue to appear with some regularity. It may seem odd that an algorithm that can be expressed as one statement and three nested loops deserves this much attention. This paper provides some insight into why the problem is complex: practical algorithms that use matrix multiplication tend to operate on matrices of different shapes, and the shape of the matrices can significantly impact the performance of matrix multiplication. We provide theoretical analysis and experimental results to explain the differences in performance achieved when these algorithms are applied to differently shaped matrices. This analysis sets the stage for hybrid algorithms that choose between the algorithms based on the shapes of the matrices involved. While the paper resolves a number of issues, it concludes with a discussion of a number of directions yet to be pursued.

1 Introduction

Over the last three decades, a number of different approaches have been proposed for the implementation of matrix-matrix multiplication on distributed memory architectures. These include Cannon's algorithm [7], the broadcast-multiply-roll algorithm [16, 15], and the Parallel Universal Matrix Multiplication Algorithm (PUMMA) [11]. This last algorithm is a generalization of broadcast-multiply-roll to non-square meshes of processors. An alternative generalization of this algorithm [19, 20] and an interesting algorithm based on a three-dimensional data distribution [2] have also been developed.

The approach now considered the most practical, sometimes known as broadcast-broadcast, was first proposed by Agarwal et al. [1], who showed that a sequence of parallel rank-k (panel-panel) updates is a highly effective way to parallelize C = A B. The same observation was made independently by van de Geijn et al. [24], who introduced the Scalable Universal Matrix Multiplication Algorithm (SUMMA). In addition to computing C = A B as a sequence of panel-panel updates, SUMMA implements C = A B^T and C = A^T B as sequences of matrix-panel (of vectors) multiplications, and C = A^T B^T as a sequence of panel-panel updates. Later work [8] showed how these techniques can be extended to a large class of commonly used matrix-matrix operations that are part of the Level-3 Basic Linear Algebra Subprograms [12].

All of the above-mentioned efforts have concentrated primarily on the special case where the input matrices are approximately square. More recently, work by Li et al. [22] attempts to create a ``poly-algorithm'' that chooses between three of the above algorithms (Cannon's, broadcast-multiply-roll, and broadcast-broadcast). Through empirical data they show some advantage when meshes are non-square or oddly shaped. However, the algorithms considered are not inherently better suited to specific shapes of matrices, and thus only limited benefit is observed.

In earlier work [23, 4, 3] we observe the need for algorithms to be sensitive to the shape of the matrices, and we hint at the fact that the class of algorithms naturally supported by the Parallel Linear Algebra Package (PLAPACK) infrastructure provides a convenient basis for such shape-adaptive algorithms. As a result, a number of hybrid algorithms are now part of PLAPACK. This observation was also made as part of the ScaLAPACK [9] project, and indeed ScaLAPACK includes a number of matrix-matrix operations that are implemented to choose algorithms based on the shape of the matrices. We comment on the difference between our approach and that of ScaLAPACK later.

This paper provides a description and analysis of the class of parallel matrix multiplication algorithms that naturally lends itself to hybridization. The analysis combines theoretical results with empirical results to provide a complete picture of the benefits of the different algorithms and how they can be combined into hybrid algorithms. A number of simplifications are made to provide a clear picture that illustrates the issues. Having done so, a number of extensions are identified for further study.

2 Background

2.1 Model of Parallel Architecture

For our analysis, we will assume a logical r × c mesh of computational nodes, indexed P_{i,j} with 0 ≤ i < r and 0 ≤ j < c, so that the total number of nodes is p = r c. These nodes are connected through some communication network, which could be a hypercube (Intel iPSC/860), a higher dimensional mesh (Intel Paragon, Cray T3D/E), or a multistage network (IBM SP2).

2.2 Model of Communication Cost

2.2.1 Individual messages

We will assume that, in the absence of network conflict, sending a message of length n between any two nodes can be modeled by

\[
\alpha + n \beta ,
\]

where α represents the latency (startup cost) and β the cost per byte transferred. For all available communication networks, the collective communications in Section 2.2.2 can be implemented to avoid network conflicts. On some networks this requires an increased number of messages, which affects the latency incurred.

2.2.2 Collective communication

 

In the matrix-matrix multiplication algorithms discussed here, all communication encountered is collective in nature. These collective communications are summarized in Table 1.

Table 1: Collective communication routines used in matrix-matrix multiplication.

Frequently, operations like broadcast are implemented using minimum spanning trees, in which case the cost can be modeled by ⌈log₂(p)⌉ (α + n β). However, when n is large, better algorithms are available. Details on the algorithms and estimated costs can be found in the literature (e.g., [6, 15, 18]).
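To make the communication model concrete, the short sketch below encodes the two cost formulas above as simple functions; it is purely illustrative, and the parameter names follow Section 2.2.1.

#include <math.h>

/* Point-to-point cost model: alpha + n*beta, with n in bytes. */
double msg_cost(double alpha, double beta, double n)
{
    return alpha + n * beta;
}

/* Minimum-spanning-tree broadcast among p nodes:
   ceil(log2(p)) * (alpha + n*beta).                            */
double mst_bcast_cost(double alpha, double beta, double n, int p)
{
    return ceil(log2((double) p)) * (alpha + n * beta);
}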

2.3 Model of Computational Cost

All computation on the individual nodes will be implemented using calls to highly optimized sequential matrix-matrix multiply kernels that are part of the Level-3 Basic Linear Algebra Subprograms (BLAS) [12]. In our model, we will represent the cost of one add or one multiply by γ.

For many vendors, the shape of the matrices involved in a sequential matrix-matrix multiply (dgemm call) influences the performance of this kernel. Given the operation

\[
C = A B ,
\]

there are three dimensions: m, n, and k, where C is m × n, A is m × k, and B is k × n. In a typical use, each of these dimensions can be either large or small, leading to the different shapes given in Table 2. In our model, we will use a different cost for each shape, as indicated in that table. The fact that this cost can vary considerably will be demonstrated in Section 4 and Table 5.

We identify the different shapes by the operation they become when ``small'' equals one, in which case one or more of the matrices become scalars or vectors. The different operations gemm, gemv, ger, axpy, and dot refer to the BLAS routines for general matrix-matrix multiplication, matrix-vector multiplication, rank-1 update, scalar-times-vector plus vector, and dot product, respectively. (The more general operation actually performed is C := α A B + β C, for scalars α and β, which further explains the choice of BLAS routine.) The names for some of the cases reported in the last column come from recognizing that when ``small'' is greater than one but still small, a matrix that would have a row or column dimension of one instead becomes a row panel or column panel of vectors, respectively.

Table 2: Different possible cases for C = A B .
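As an illustration of the shapes in Table 2, the following sketch performs a panel-panel (rank-b) update using the C interface to the BLAS; the routine name and the choice of a column-major layout are ours, and b plays the role of ``small.''

#include <cblas.h>

/* Panel-panel (rank-b) update: C (m x n) += A (m x b) * B (b x n),
   i.e., the "k small" case of Table 2.  Column-major storage.      */
void panel_panel_update(int m, int n, int b,
                        const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, b,
                1.0, A, m,   /* A with leading dimension m */
                     B, b,   /* B with leading dimension b */
                1.0, C, m);  /* C is updated in place      */
}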

2.4 Data distribution

2.4.1 Physically Based Matrix Distribution

We have previously observed [14] that data distributions should focus on the decomposition of the physical problem to be solved, rather than on the decomposition of matrices. Typically, it is the elements of vectors that are associated with data of physical significance, and it is therefore their distribution to nodes that is significant. A matrix, which is a discretized operator, merely represents the relation between two vectors, which are discretized spaces: y = A x . Since it is more natural to start by distributing the problem to nodes, we partition x and y , and assign portions of these vectors to nodes. The matrix A is then distributed to nodes so that it is consistent with the distribution of the vectors, as we describe below. We call such matrix distributions physically based if the layout of the vectors which induce the distribution of A to nodes is consistent with where an application would naturally want them. We will use the abbreviation PBMD for Physically Based Matrix Distribution.

We now describe the distribution of the vectors x and y to nodes, after which we will show how the matrix distribution is induced (derived) by this vector distribution. Let us assume x and y are of length n and m, respectively. We will distribute these vectors using a distribution block size of b_distr. For simplicity, assume n = N b_distr and m = M b_distr. Partition x and y so that

\[
x = \left( \begin{array}{c} x_0 \\ x_1 \\ \vdots \\ x_{N-1} \end{array} \right)
\quad \mbox{and} \quad
y = \left( \begin{array}{c} y_0 \\ y_1 \\ \vdots \\ y_{M-1} \end{array} \right),
\]

where N >> p and M >> p and each x_j and y_i is of length b_distr. Partitioning A conformally yields the blocked matrix

\[
A = \left( \begin{array}{cccc}
A_{0,0}   & A_{0,1}   & \cdots & A_{0,N-1}   \\
A_{1,0}   & A_{1,1}   & \cdots & A_{1,N-1}   \\
\vdots    & \vdots    &        & \vdots      \\
A_{M-1,0} & A_{M-1,1} & \cdots & A_{M-1,N-1}
\end{array} \right),
\]

where each subblock A_{i,j} is of size b_distr × b_distr. Blocks of x and y are assigned to a two-dimensional mesh of nodes by wrapping in column-major order, as illustrated in Fig. 1. The matrix distribution is induced by these vector distributions by assigning the blocks of columns A_{*,j} to the same column of nodes as subvector x_j, and the blocks of rows A_{i,*} to the same row of nodes as subvector y_i. This is illustrated in Fig. 1, taken from [23].

Figure 1: Inducing a matrix distribution from vector distributions when the vectors are over-decomposed. Top: The subvectors of x and y are assigned to a logical r × c mesh of nodes by wrapping in column-major order. By projecting the indices of y to the left, we determine the distribution of the matrix row-blocks of A. By projecting the indices of x to the top, we determine the distribution of the matrix column-blocks of A. Bottom: Distribution of the subblocks of A. The indices indicate the node to which each subblock is mapped.

Notice that the distribution of the matrix wraps blocks of rows and columns onto rows and columns of processors, respectively. The wrapping is necessary to improve load balance.
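To make the wrapping concrete, the sketch below computes the node that owns a given block; the index arithmetic is our reading of Fig. 1 (column-major wrapping, 0-based indices) and is meant as an illustration rather than as PLAPACK's exact indexing rules.

/* Node coordinates that own distribution block i of a vector wrapped
   in column-major order onto an r x c mesh (0-based indexing).        */
typedef struct { int row; int col; } node_t;

node_t vector_block_owner(int i, int r, int c)
{
    node_t node;
    node.row = i % r;          /* wrap down a column of nodes first ... */
    node.col = (i / r) % c;    /* ... then move to the next column      */
    return node;
}

/* Owner of matrix subblock A_{I,J} under the induced distribution:
   the node row comes from y_I, the node column from x_J.              */
node_t matrix_block_owner(int I, int J, int r, int c)
{
    node_t node;
    node.row = vector_block_owner(I, r, c).row;
    node.col = vector_block_owner(J, r, c).col;
    return node;
}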

2.4.2 Size of local data

As part of our cost analysis in the subsequent sections, it will be important to compute the maximum number of elements of a vector x of length n assigned to a processor. Similarly, we need the maximum number of matrix elements assigned to any node. Functions for these quantities are given in Table 3.

Table 3: Functions that give the maximum local number of elements for distributed vectors and matrices.
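As a rough illustration of the kind of function collected in Table 3, the sketch below bounds the number of local elements of a distributed vector; it assumes the simple block wrapping described above and is not taken from the table itself.

/* Upper bound on the number of elements of a length-n vector,
   distributed with block size b over p nodes by wrapping blocks:
   at most ceil(ceil(n/b)/p) blocks land on any one node.          */
int max_local_vector_elements(int n, int b, int p)
{
    int nblocks    = (n + b - 1) / b;        /* total number of blocks  */
    int max_blocks = (nblocks + p - 1) / p;  /* blocks on busiest node  */
    return max_blocks * b;                   /* element count (bound)   */
}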

2.4.3 Data movements for matrix-matrix multiplication

In the PBMD representation, vectors are fully distributed (not duplicated!), while a single row or column of a matrix will reside in a single row or column of processors. One benefit of the PBMD representation is the clean manner in which vectors, matrix rows, and matrix columns interact. Redistributing a column of a matrix like a vector requires a scatter communication within rows. The natural extension is that redistributing a panel of columns like a multivector (a group of vectors) requires each of the columns of the panel to be scattered within rows. In Table 4, we show the natural data movements encountered during matrix-matrix multiplication, and their costs. In that table, we also count the time for packing and unpacking data into a format that allows the collective communication to proceed. The need for this can be seen by carefully examining Fig. 1, which shows that when the indices that describe the vector distribution are projected to derive the row and column distributions, an interleaving occurs.

Table 4: Data movements encountered as part of matrix-matrix multiplication.
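All of the data movements in Table 4 are scoped to rows or columns of the logical mesh. A minimal sketch of how such scoped communicators might be set up with MPI is shown below; the column-major ranking of nodes is an assumption chosen to match Fig. 1.

#include <mpi.h>

/* Split MPI_COMM_WORLD into row and column communicators for a logical
   r x c mesh, assuming ranks are assigned in column-major order, i.e.,
   rank = my_row + my_col * r (so c = p / r is implicit).               */
void build_mesh_comms(int r, MPI_Comm *row_comm, MPI_Comm *col_comm)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int my_row = rank % r;   /* which row of the mesh this node lies in    */
    int my_col = rank / r;   /* which column of the mesh this node lies in */

    /* Nodes sharing my_row form a row communicator; ranks within it are
       ordered by my_col, and vice versa for the column communicator.      */
    MPI_Comm_split(MPI_COMM_WORLD, my_row, my_col, row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, my_col, my_row, col_comm);
}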

3 A Class of Algorithms

 

The previous section laid the foundation for the analysis of a class of parallel matrix-matrix multiplication algorithms. We show that different blockings of the operands lead to different algorithms, each of which can be built from a simple parallel matrix-matrix multiplication kernel. These kernels themselves can be recognized as the special cases in Table 2 where one or more of the dimensions are small.

The kernels can be separated into two groups. For the first, at least two dimensions are large, leading to operations involving matrices and panels of rows or columns: the panel-panel update, matrix-panel multiply, and panel-matrix multiply. Notice that these cases are similar to the kernels used in our earlier work [8] on the parallel implementation of matrix-matrix operations, as well as to the basic building blocks used by ScaLAPACK (the PB-BLAS, for Parallel Blocked BLAS) for the parallel implementation of the Basic Linear Algebra Subprograms. However, it is the systematic data movement supported by PLAPACK that allows the straightforward implementation, explanation, and analysis of these operations, which we present in the rest of this section.

The second group of kernels assumes that only one of the dimensions is large, making one of the operands a small matrix and the other two row or column panels. Their implementation recognizes that when the bulk of the computation involves operands that are panels, redistributing those panels like multivectors becomes beneficial. These kernels inherently cannot be supported by ScaLAPACK due to fundamental design decisions underlying that package. In contrast, Physically Based Matrix Distribution and PLAPACK naturally support their implementation.

We will see that all algorithms are implemented as a series of calls to these kernels. One benefit of this approach is that a high level of performance is achieved while the workspace required for storing duplicates of data is kept reasonable.

We recognize that due to page limitations, the explanations of the algorithms are somewhat terse. We refer the reader to the literature [8, 23, 24] for additional details.

3.1 Notation

For simplicity, we will assume that the dimensions m, n, and k are integer multiples of the algorithmic block size b_alg. When discussing these algorithms, we will use the following partitionings of matrices A, B, and C:

\[
X = \left( \begin{array}{c|c|c|c} X_1 & X_2 & \cdots & X_{n_X / b_{\rm alg}} \end{array} \right),
\]

where X ∈ {A, B, C}, and m_X and n_X are the row and column dimension of the indicated matrix.

Also,

\[
X = \left( \begin{array}{c} \hat X_1 \\ \hline \hat X_2 \\ \hline \vdots \\ \hline \hat X_{m_X / b_{\rm alg}} \end{array} \right),
\]

so that X_j denotes a column panel of b_alg columns and X̂_i a row panel of b_alg rows. In the above partitionings, the block size b_alg is chosen to maximize the performance of the local matrix-matrix multiplication operation.

3.2 Panel-panel based algorithm

Observe that

\[
C = A B
= \left( \begin{array}{c|c|c|c} A_1 & A_2 & \cdots & A_{k/b_{\rm alg}} \end{array} \right)
\left( \begin{array}{c} \hat B_1 \\ \hline \hat B_2 \\ \hline \vdots \\ \hline \hat B_{k/b_{\rm alg}} \end{array} \right)
= A_1 \hat B_1 + A_2 \hat B_2 + \cdots + A_{k/b_{\rm alg}} \hat B_{k/b_{\rm alg}} .
\]

Thus, matrix-matrix multiplication can be implemented as k/b_alg calls to a kernel that performs one panel-panel (rank-b_alg) update, C := C + A_p B̂_p. This kernel can be implemented as follows: the column panel A_p is duplicated (broadcast) within rows of nodes, the row panel B̂_p is duplicated (broadcast) within columns of nodes, and each node then updates its local portion of C with a local panel-panel multiply.

The cost of calling this kernel k/b_alg times to compute C = A B is then the sum of the broadcast costs and of the local updates over all iterations; the local computation alone totals (2 m n k / p) γ, with γ the cost per floating point operation associated with the panel-panel shape in Table 2, assuming perfect load balance.

3.3 Matrix-panel based algorithm

Observe that

\[
\left( \begin{array}{c|c|c|c} C_1 & C_2 & \cdots & C_{n/b_{\rm alg}} \end{array} \right)
= A \left( \begin{array}{c|c|c|c} B_1 & B_2 & \cdots & B_{n/b_{\rm alg}} \end{array} \right)
= \left( \begin{array}{c|c|c|c} A B_1 & A B_2 & \cdots & A B_{n/b_{\rm alg}} \end{array} \right).
\]

Thus, matrix-matrix multiplication can be implemented by computing each column panel of C, C_j := A B_j, by a call to a matrix-panel kernel. In this kernel, A remains in place (parallelism is derived from its distribution); the panel B_j is redistributed like a multivector and duplicated so that each node holds the part of B_j corresponding to its local columns of A, each node performs a local matrix-panel multiply, and the partial results are summed within rows of nodes to form C_j.

The total cost to compute C = A B in this fashion is that of the n/b_alg kernel calls; the local computation again totals (2 m n k / p) γ, now with γ the cost associated with the matrix-panel shape, while the communication cost is incurred in redistributing and duplicating each panel B_j and in summing the partial results for each C_j.
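For reference, the sequential blocking that this algorithm follows is sketched below, one column panel of C at a time; it shows only the order of the kernel calls, not the redistribution and summation performed by the parallel kernel.

#include <stddef.h>
#include <cblas.h>

/* Sequential reference for the matrix-panel blocking:
   C_j := A * B_j for each column panel of width b
   (column-major storage; b assumed to divide n).      */
void matrix_panel_reference(int m, int n, int k, int b,
                            const double *A, const double *B, double *C)
{
    for (int j = 0; j < n; j += b) {
        /* B_j and C_j are the contiguous column panels starting at column j. */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, b, k,
                    1.0, A, m,
                         B + (size_t) j * k, k,
                    0.0, C + (size_t) j * m, m);  /* C_j is overwritten */
    }
}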

3.4 Panel-matrix based algorithm

Observe that

\[
\left( \begin{array}{c} \hat C_1 \\ \hline \hat C_2 \\ \hline \vdots \\ \hline \hat C_{m/b_{\rm alg}} \end{array} \right)
= \left( \begin{array}{c} \hat A_1 \\ \hline \hat A_2 \\ \hline \vdots \\ \hline \hat A_{m/b_{\rm alg}} \end{array} \right) B
= \left( \begin{array}{c} \hat A_1 B \\ \hline \hat A_2 B \\ \hline \vdots \\ \hline \hat A_{m/b_{\rm alg}} B \end{array} \right).
\]

Thus, matrix-matrix multiplication can be implemented by computing each row panel of C, Ĉ_i := Â_i B, by a call to the panel-matrix kernel. This kernel is the mirror image of the matrix-panel kernel: B remains in place (parallelism is derived from its distribution), the panel Â_i is redistributed like a multivector and duplicated so that each node holds the part of Â_i corresponding to its local rows of B, and the partial results are summed within columns of nodes to form Ĉ_i.

The total cost is completely analogous to that of the matrix-panel based algorithm: m/b_alg kernel calls, with the roles of rows and columns of nodes interchanged and with γ the cost associated with the panel-matrix shape.

3.5 Panel-axpy based algorithm

Observe that

\[
C_j = A B_j
= \left( \begin{array}{c|c|c|c} A_1 & A_2 & \cdots & A_{k/b_{\rm alg}} \end{array} \right)
\left( \begin{array}{c} B_{1,j} \\ \hline B_{2,j} \\ \hline \vdots \\ \hline B_{k/b_{\rm alg},j} \end{array} \right)
= A_1 B_{1,j} + A_2 B_{2,j} + \cdots + A_{k/b_{\rm alg}} B_{k/b_{\rm alg},j} ,
\]

where B_{p,j} denotes the b_alg × b_alg block of B in row-block p and column-block j.

Thus, each column panel of C, C_j, can be computed by k/b_alg calls to the panel-axpy kernel C_j := C_j + A_p B_{p,j}. Since the bulk of the computation involves the panels A_p and C_j, these panels are redistributed like multivectors so that all nodes share the work, the small block B_{p,j} is duplicated, and the update proceeds locally.

For computing all n/b_alg column panels of C in this fashion, the cost is that of (n/b_alg)(k/b_alg) kernel calls; the local computation totals (2 m n k / p) γ, with γ the cost associated with the panel-axpy shape, but the communication required to redistribute and duplicate panels is incurred by every call and therefore does not become relatively less significant as the matrices grow.

Alternatively, C and B can be partitioned by row panels and A into b_alg × b_alg blocks, which leads to an algorithm that computes each row panel of C by k/b_alg calls to an alternative panel-axpy kernel, Ĉ_i := Ĉ_i + A_{i,p} B̂_p. The total cost has the same structure, with the roles of the row and column dimensions interchanged.

3.6 Panel-dot based algorithm

Observe that

\[
C_{i,j} = \hat A_i B_j ,
\]

where C_{i,j} denotes the b_alg × b_alg block of C in row-block i and column-block j, Â_i is the i-th row panel of A, and B_j is the j-th column panel of B.

Thus, each block C_{i,j} can be computed as a panel-dot operation: the panels Â_i and B_j are redistributed like multivectors, each node multiplies its local pieces, and the local contributions are summed (reduced) to form C_{i,j}.

The total cost is that of the (m/b_alg)(n/b_alg) kernel calls; the local computation totals (2 m n k / p) γ, with γ the cost associated with the panel-dot shape, while every call incurs the cost of redistributing the panels and reducing the result.
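The corresponding sequential blocking for the panel-dot based algorithm loops over the b_alg × b_alg blocks of C, as sketched below; again, only the order of the kernel calls is shown, with the parallel kernel handling the redistribution of Â_i and B_j and the reduction of the result.

#include <stddef.h>
#include <cblas.h>

/* Sequential reference for the panel-dot blocking:
   C_{i,j} := A_hat_i * B_j, where A_hat_i is a b x k row panel of A and
   B_j is a k x b column panel of B (column-major; b divides m and n).   */
void panel_dot_reference(int m, int n, int k, int b,
                         const double *A, const double *B, double *C)
{
    for (int j = 0; j < n; j += b)
        for (int i = 0; i < m; i += b)
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        b, b, k,
                        1.0, A + (size_t) i, m,                  /* row panel A_hat_i */
                             B + (size_t) j * k, k,              /* column panel B_j  */
                        0.0, C + (size_t) i + (size_t) j * m, m);
}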

4 Preliminary Results

 

In this section, we report preliminary performance results on the Cray T3E of implementations of the described algorithms. Our implementations were coded using the PLAPACK library infrastructure, which naturally supports these kinds of algorithms. In addition, we report results from our performance model.

4.1 Low-level parameters

In order to report results from our performance model, a number of parameters must first be determined. These can be grouped into two sets: communication-related parameters and computation-related parameters.

4.1.1 Communication

All PLAPACK communication is accomplished by calls to the Message-Passing Interface (MPI). Independent of the algorithms, we determined the communication-related parameters by performing a simple "bounce-back" communication. From this, we measured the latency (startup cost) α and the cost per byte transferred β; the latter corresponds to a transfer rate of 22 Mbytes per second (roughly 45 nanoseconds per byte).

For the data movements that require packing, we also measured the cost per byte packed. Considerable improvements in the PLAPACK packing routines can still be achieved, and thus we expect this overhead to be reduced in the future.

4.1.2 Computation

Ultimately, it is the performance of the local 64-bit matrix-matrix multiplication implementation that determines the asymptotic performance rate that can be achieved. Notice that our algorithms inherently operate on matrices and panels, and thus the performance of the different cases in Table 2, with "small" taken to equal the algorithmic blocking size, needs to be determined. In our preliminary performance tests, we chose the algorithmic blocking size to equal 128, which has in the past given us reasonable performance for a wide range of algorithms. Given this blocking size, the performance of the local BLAS matrix-matrix multiplication kernel is given in Table 5.

Table 5: Values of the local computational parameters measured on the Cray T3E. Notice that these parameters represent the performance of an early version of the BLAS.

It is interesting to note that it is the panel-panel kernel that performs best. To those of us experienced with high-performance architectures this is not a surprise: this is the case that is heavily used by the LINPACK benchmark, and thus the first to be fully optimized. There is really no technical reason why, for a block size of 128, the other shapes should not perform equally well. Indeed, as an architecture matures, those cases also get optimized. For example, on the Cray T3D the performance of the local matrix-matrix BLAS kernel is not nearly as dependent on the shape.

4.2 Performance of the algorithms

In this section, we demonstrate both through the cost model and measured data how the different algorithms perform as a function of shape. To do so, we fix one or two of the dimensions to be large and vary the other dimension(s).

4.2.1 Dimensions m and n are large

In Figure 2, we show the predicted and measured performance of the various algorithms when both m and n are fixed to be large.

First, let us comment on the asymptotic behavior. For the panel-panel, matrix-panel, and panel-matrix algorithms, as k gets large, communication and other overhead eventually become lower order terms compared to computation. Thus, asymptotically, they should all approach a performance rate equal to the rate achieved by the local matrix-matrix kernel. This argument does not hold for the panel-axpy and panel-dot based algorithms: for each call to the parallel panel-axpy or panel-dot kernel, an amount of data proportional to the panel size is communicated, while only on the order of b_alg floating point operations are performed per item communicated. Thus, if b_alg, r, and c are fixed, communication does not become less significant, even as the matrix dimensions are increased. All of these observations are borne out by the graphs.

Figure 2: Performance of the various algorithms on a 16 processor configuration of the Cray T3E. In this graph, two dimensions, m and n, are fixed to be large, while k is varied. A distribution block size equal to the algorithmic block size of 128 was used to collect this data. The graph on the left reports data derived from the model, while the graph on the right reports measured data.

Next, notice from the graphs that the panel-panel based algorithm always outperforms the others. A moment of thought leads us to conclude that this is natural, since the panel-panel kernel is inherently designed for the case where m and n are large, and k is small. By comparison, a matrix-panel based algorithm will inherently perform badly when k is small, since in that algorithm parallelism is derived from the distribution of A , which is only distributed to one or a few columns of nodes when k is small. Similar arguments explain the poor behavior of the other algorithms when k is small.

Finally, comparing the graph derived from the model to the graph reporting measured data, we notice that while the model is clearly not perfect, it does capture qualitatively the measured data. Also, the predicted performance reasonably reflects the measured performance, especially for the panel-panel, matrix-panel, and panel-matrix based algorithms.

4.2.2 Dimensions m and k are large

In Fig. 3, we show the predicted and measured performance of the various algorithms when both m and k are fixed to be large.

Again, we start by commenting on the asymptotic behavior. As in the case where m and n were fixed to be large, for the panel-panel, matrix-panel, and panel-matrix algorithms, as n gets large, communication and other overhead eventually become lower order terms compared to computation. Thus, asymptotically, they should and do all approach a performance rate equal to the rate achieved by the local matrix-matrix kernel. Again, this argument does not hold for the panel-axpy and panel-dot based algorithms. All of these observations are borne out by the graphs.

Figure 3: Performance of the various algorithms on a 16 processor configuration of the Cray T3E. In this graph, two dimensions, m and k, are fixed to be large, while n is varied. A distribution block size equal to the algorithmic block size of 128 was used to collect this data. The graph on the left reports data derived from the model, while the graph on the right reports measured data.

Next, notice from the graphs that the matrix-panel based algorithm initially outperforms the others. This is natural, since the matrix-panel kernel is inherently designed for the case where m and k are large and n is small. By comparison, a panel-panel or panel-matrix based algorithm will inherently perform badly when n is small, since in those algorithms parallelism is derived from the distribution of C and B, respectively, which are only distributed to one or a few columns of nodes when n is small. Similar arguments explain the poor behavior of the other algorithms when n is small.

Notice that eventually the panel-panel based algorithm outperforms the matrix-panel algorithm. This is solely due to the fact that the implementation of the local matrix-matrix multiply is more efficient for the panel-panel case.

Again, the model is clearly not perfect, but it does capture the measured data qualitatively. For the panel-panel, matrix-panel, and panel-matrix based algorithms, the model is quite good: for example, it predicts the cross-over point where the panel-panel algorithm becomes superior to within five percent.

4.2.3 Dimension n is large

In Fig. 4, we present results for one of the cases where only one dimension is fixed to be large. This time, we would expect one of the panel-axpy based algorithms to be superior, at least when the other dimensions are small. While it is clear from the graphs that the other algorithms perform poorly when m and k are small, due to time pressures we did not collect data for the case where m and k are extremely small (much less than the algorithmic blocking size), and thus at this moment we must declare the evidence inconclusive.

Figure 4: Performance of the various algorithms on a 16 processor configuration of the Cray T3E. In this graph, one dimension, n, is fixed to be large, while m = k is varied. A distribution block size equal to the algorithmic block size of 128 was used to collect this data. The graph on the left reports data derived from the model, while the graph on the right reports measured data.

5 Conclusion

In this paper, we have systematically presented algorithms, analysis, and performance results for a class of parallel matrix multiplication algorithms. This class of algorithms allows high performance to be attained across a range of matrix shapes.

As mentioned, the results presented in the previous section are extremely preliminary. They are literally the first results computed with the model and the first results for implementations of the panel-axpy and panel-dot based algorithms.

The good news is that certainly qualitatively the model reflects the measured performance. Even quantitatively, the initial results look good. Finally, there were no surprises: both the model and the empirical results reflected what we expected from experience.

A lot of further experimentation must and will be performed in order to paint a complete picture:

Experimental and model data must be collected for various architectures. Of particular interest will be those architectures where the local matrix-matrix multiplication kernel is more mature, yielding a more uniform performance regardless of shape.
The algorithmic blocking size was chosen to be one that has performed well for us across a range of algorithms. However, this choice is somewhat skewed towards the blocking size that makes the local panel-panel multiply efficient. In particular, we would expect that using a much larger algorithmic blocking size for the panel-axpy and panel-dot based algorithms would improve the ratio of computation to communication and thus the parallel performance.
Both the model and PLAPACK support the case where the distribution and algorithmic block sizes are not equal. Much of the jagged behavior in the curves in the graphs is due to load imbalance, which can be reduced by decreasing the distribution blocking size. By keeping the algorithmic blocking size large, the asymptotic performance will not change.
The model is greatly affected by the choice of implementation for the collective communication algorithms on a given architecture. Our results for the model were derived by assuming that commonly used algorithms were implemented. However, optimizations (or lack thereof) on different architectures may make this assumption invalid.
The differences between the algorithms can be expected to become much more pronounced when larger numbers of processors are utilized. Thus, we must collect data from larger machine configurations.

We conclude by remarking that if the model were perfect, it could be used to pick between the different algorithms, thus creating a hybrid algorithm. Clearly, too many unknown parameters make it difficult to tune the model for different vendors. Thus there is a need for simple heuristics that are not as sensitive to imperfections in the model. If the vendors supplied local matrix-matrix multiply kernels that performed reasonably well for all shapes of matrices encountered in this paper, this task would be greatly simplified. In particular, choosing between the algorithms that target the case where at least two dimensions are large would become a matter of determining which matrix has the largest number of elements. The algorithm that keeps that matrix stationary, communicating data from the other matrices, should perform best, since parallelism is derived from the distribution of that stationary matrix. Similarly, choosing between the algorithms that target the case where only one dimension is large would become a matter of determining which matrix has the least number of elements. The algorithm that redistributes the other two matrices should perform best, since parallelism is derived from the length of the matrices that are redistributed. Thus, what would be left would be to determine whether the shapes of the matrices are such that at least two dimensions are large, or only one dimension is large. Investigation of this heuristic, sketched below, is left as future research.
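A minimal sketch of this heuristic is given below; the names, the single ``large'' threshold, and the treatment of ties are ours and serve only to make the selection rule concrete.

typedef enum {
    PANEL_PANEL, MATRIX_PANEL, PANEL_MATRIX, PANEL_AXPY, PANEL_DOT
} mm_algorithm_t;

/* Illustrative selection heuristic (enum names and threshold are ours):
   - at least two large dimensions: keep the largest matrix stationary;
   - only one large dimension: redistribute the two panel operands.      */
mm_algorithm_t choose_algorithm(long m, long n, long k, long large)
{
    int num_large = (m >= large) + (n >= large) + (k >= large);
    long c_size = m * n, a_size = m * k, b_size = k * n;

    if (num_large >= 2) {
        /* Keep the matrix with the most elements stationary.            */
        if (c_size >= a_size && c_size >= b_size) return PANEL_PANEL;    /* C stationary */
        if (a_size >= b_size)                     return MATRIX_PANEL;   /* A stationary */
        return PANEL_MATRIX;                                             /* B stationary */
    }

    /* Redistribute the two matrices other than the smallest one.        */
    if (c_size <= a_size && c_size <= b_size) return PANEL_DOT;          /* C smallest   */
    return PANEL_AXPY;        /* A or B smallest: row or column panel-axpy variant */
}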

Acknowledgments

We are indebted to the rest of the PLAPACK team (Philip Alpatov, Dr. Greg Baker, Dr. Carter Edwards, James Overfelt, and Dr. Yuan-Jye Jason Wu) for providing the infrastructure that made this research possible.

Primary support for this project came from the Parallel Research on Invariant Subspace Methods (PRISM) project (ARPA grant P-95006) and the Environmental Molecular Sciences construction project at Pacific Northwest National Laboratory (PNNL) (PNNL is a multiprogram national laboratory operated by Battelle Memorial Institute for the U.S. Department of Energy under Contract DE-AC06-76RLO 1830). Additional support for PLAPACK came from the NASA High Performance Computing and Communications Program's Earth and Space Sciences Project (NRA Grants NAG5-2497 and NAG5-2511), and the Intel Research Council.

We gratefully acknowledge access to equipment provided by the National Partnership for Advanced Computational Infrastructure (NPACI), The University of Texas Computation Center's High Performance Computing Facility, and the Texas Institute for Computational and Applied Mathematics' Distributed Computing Laboratory.

Additional information

For additional information, visit the PLAPACK web site: http://www.cs.utexas.edu/users/plapack

References

1
Agarwal, R. C., F. Gustavson, and M. Zubair, ``A high-performance matrix multiplication algorithm on a distributed memory parallel computer using overlapped communication,'' IBM Journal of Research and Development, Volume 38, Number 6, 1994.

2
Agarwal, R. C., S. M. Balle, F. G. Gustavson, M. Joshi, and P. Palkar, ``A 3-Dimensional Approach to Parallel Matrix Multiplication,'' IBM Journal of Research and Development, Volume 39, Number 5, pp. 1-8, Sept. 1995.

3
Alpatov, P., G. Baker, C. Edwards, J. Gunnels, G. Morrow, J. Overfelt, Robert van de Geijn, and J. Wu, "PLAPACK: Parallel Linear Algebra Package - Design Overview", Proceedings of SC97, to appear.

4
Alpatov, P., G. Baker, C. Edwards, J. Gunnels, G. Morrow, J. Overfelt, Robert van de Geijn, and J. Wu, "PLAPACK: Parallel Linear Algebra Package," Proceedings of the SIAM Parallel Processing Conference, 1997.

5
Anderson, E., Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. DuCroz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, ``LAPACK: A Portable Linear Algebra Library for High Performance Computers,'' Proceedings of Supercomputing '90, IEEE Press, 1990, pp. 1-10.

6
Barnett, M., S. Gupta, D. Payne, L. Shuler, R. van de Geijn, and J. Watts, ``Interprocessor Collective Communication Library (InterCom),'' Scalable High Performance Computing Conference 1994.

7
Cannon, L.E., A Cellular Computer to Implement the Kalman Filter Algorithm, Ph.D. Thesis (1969), Montana State University.

8
Chtchelkanova, A., J. Gunnels, G. Morrow, J. Overfelt, R. van de Geijn, "Parallel Implementation of BLAS: General Techniques for Level 3 BLAS," TR-95-40, Department of Computer Sciences, University of Texas, Oct. 1995.

9
Choi, J., J. J. Dongarra, R. Pozo, and D. W. Walker, ``ScaLAPACK: A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers,'' Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation, IEEE Comput. Soc. Press, 1992, pp. 120-127.

10
Choi, J., J. J. Dongarra, and D. W. Walker, ``Level 3 BLAS for distributed memory concurrent computers'', CNRS-NSF Workshop on Environments and Tools for Parallel Scientific Computing, Saint Hilaire du Touvet, France, Sept. 7-8, 1992. Elsevier Science Publishers, 1992.

11
Choi, J., J. J. Dongarra, and D. W. Walker, ``PUMMA: Parallel Universal Matrix Multiplication Algorithms on distributed memory concurrent computers,'' Concurrency: Practice and Experience, Vol 6(7), 543-570, 1994.

12
Dongarra, J. J., J. Du Croz, S. Hammarling, and I. Duff, ``A Set of Level 3 Basic Linear Algebra Subprograms,'' TOMS, Vol. 16, No. 1, pp. 1-16, 1990.

13
Dongarra, J. J., R. A. van de Geijn, and D. W. Walker, ``Scalability Issues Affecting the Design of a Dense Linear Algebra Library,'' Journal of Parallel and Distributed Computing, Vol. 22, No. 3, Sept. 1994, pp. 523-537.

14
C. Edwards, P. Geng, A. Patra, and R. van de Geijn, Parallel matrix distributions: have we been doing it all wrong?, Tech. Report TR-95-40, Dept of Computer Sciences, The University of Texas at Austin, 1995.

15
Fox, G. C., M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker, Solving Problems on Concurrent Processors, Vol. 1, Prentice Hall, Englewood Cliffs, N.J., 1988.

16
Fox, G., S. Otto, and A. Hey, ``Matrix algorithms on a hypercube I: matrix multiplication,'' Parallel Computing 3 (1987), pp 17-31.

17
Gropp, W., E. Lusk, A. Skjellum, Using MPI: Portable Programming with the Message-Passing Interface, The MIT Press, 1994.

18
C.-T. Ho and S. L. Johnsson, Distributed Routing Algorithms for Broadcasting and Personalized Communication in Hypercubes, In Proceedings of the 1986 International Conference on Parallel Processing, pages 640-648, IEEE, 1986.

19
Huss-Lederman, S., E. Jacobson, and A. Tsao, ``Comparison of Scalable Parallel Matrix Multiplication Libraries,'' in Proceedings of the Scalable Parallel Libraries Conference, Starkville, MS, Oct. 1993.

20
Huss-Lederman, S., E. Jacobson, A. Tsao, G. Zhang, "Matrix Multiplication on the Intel Touchstone DELTA," Concurrency: Practice and Experience, Vol. 6 (7), Oct. 1994, pp. 571-594.

21
Lin, C., and L. Snyder, ``A Matrix Product Algorithm and its Comparative Performance on Hypercubes,'' in Proceedings of Scalable High Performance Computing Conference, (Stout, Q, and M. Wolfe, eds.), IEEE Press, Los Alamitos, CA, 1992, pp. 190-3.

22
Li, J., A. Skjellum, and R. D. Falgout, ``A Poly-Algorithm for Parallel Dense Matrix Multiplication on Two-Dimensional Process Grid Topologies,'' Concurrency, Practice and Experience, VOL. 9(5), pp. 345-389, May 1997.

23
van de Geijn, R., Using PLAPACK: Parallel Linear Algebra Package, The MIT Press, 1997.

24
van de Geijn, R. and J. Watts, ``SUMMA: Scalable Universal Matrix Multiplication Algorithm,'' Concurrency: Practice and Experience, VOL. 9(4), 255-274 (April 1997).

