

A Parallel Linear Algebra Server for Matlab-like Environments

Greg Morrow
and
Robert van de Geijn
Department of Computer Sciences
The University of Texas at Austin
Austin, Texas 78712
{morrow,rvdg}@cs.utexas.edu

An Extended Abstract Submitted to SC98

Introduction

Mathematical software packages such as Mathematica, Matlab, HiQ and others allow scientists and engineers to perform complex analysis and modeling in their familiar workstation environment. However, these systems are limited in the size of problem they can solve by the memory and CPU power of a single workstation. Obviously, the benefit of interactive software is lost if the problem takes two weeks to run. With the advent of inexpensive ``beowulf''-type parallel machines [4], and the proliferation of parallel computers in general, it is natural to consider combining the user-friendliness and interactivity of the commercially available mathematical packages with the computing power of parallel machines.

We have implemented a system, which we call the PLAPACK-Mathpackage Interface (``PMI''), that allows a user of one of the supported mathematical packages to export computationally intensive problems to a parallel computer running PLAPACK. The interface consists of a set of functions, called from within the mathematical package, for creating, filling, manipulating, and freeing matrix, vector, and scalar objects on the parallel computer. Both memory and CPU power scale linearly with the number of processing elements in the parallel machine. Thus, PMI allows the interactive software packages to break the bonds of the workstation and solve ever larger and more complex problems.

PMI is not the first attempt to exploit parallelism from within interactive mathematical software packages. MultiMATLAB [6] (from the Cornell Theory Center) and the MATLAB toolbox [5] (from the University of Rostock, Germany) are extensions of the Matlab interpreter that essentially run on each node of the parallel machine. A similar product for Mathematica [10] is available from Mathconsult in Switzerland. Compiler-based systems such as FALCON [9] (from the University of Illinois), Otter [8] (from Oregon State University), and CONLAB [7] (from the University of Umea, Sweden) start with Matlab script files and use compiler technology to create explicit message-passing codes, which then execute essentially independently of the script's original interactive software platform. This list is not exhaustive, but it should convey that there are many approaches to this problem. Our approach is most similar to the MultiMATLAB and MATLAB toolbox approaches, with one important difference: in PMI, the third-party software (Matlab, Mathematica, etc.) runs on only one node of the machine, rather than on all nodes. All parallel communication is handled from within PLAPACK.

This paper is organized as follows. Section 2 gives a brief overview of PLAPACK and of the interactive mathematical packages. Section 3 discusses some implementation details of PMI. Section 4 shows what PMI looks like from a user's point of view. Section 5 details some measurements of PMI's performance on a parallel system. Finally, Section 6 gives some concluding remarks.

Overview of PLAPACK and the interactive packages

This section gives a brief description of PLAPACK and of the interactive packages that PMI connects to PLAPACK.

PLAPACK

PLAPACK (Parallel Linear Algebra Package) is an object-oriented system for dense linear algebra on parallel computers [3,1,2]. It is written in C and uses the Message Passing Interface (MPI) for communication. It is distinguished by the fact that the programmer is not exposed to error-prone index computations. Instead, the concept of a ``view'' into a matrix or vector allows index-free programming, even of highly complex algorithms. PLAPACK's high-level abstraction and user-friendliness do not come at the expense of performance: our Cholesky factorization, for example, achieves over 360 MFLOPS per PE on 16 PE's of the Cray T3E-600 (300 MHz).
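To illustrate the flavor of view-based, index-free programming, the following is a highly simplified sketch in C. The data structure and function names are our own illustration of the general idea, not PLAPACK's actual API: a view records where it sits inside a parent matrix, and a blocked algorithm is expressed by repeatedly splitting views rather than by manipulating indices directly.

    #include <stdio.h>

    /* Hypothetical, simplified "view" into a column-major matrix.
     * These are NOT PLAPACK's actual types or routines. */
    typedef struct {
        double *data;   /* storage of the parent matrix            */
        int     ld;     /* leading dimension of the parent storage */
        int     m, n;   /* dimensions of this view                 */
        int     i, j;   /* offset of this view within the parent   */
    } view_t;

    /* Split A into its first b columns and the rest: A -> ( A_left | A_right ). */
    static void split_cols(view_t A, int b, view_t *A_left, view_t *A_right)
    {
        *A_left  = A;  A_left->n  = b;
        *A_right = A;  A_right->n = A.n - b;  A_right->j = A.j + b;
    }

    int main(void)
    {
        double storage[8 * 8] = {0};
        view_t A = { storage, 8, 8, 8, 0, 0 };
        view_t cur = A, panel, rest;
        int    nb = 2;

        /* A blocked sweep over A expressed entirely in terms of views:
         * the algorithm never computes indices into `storage` itself. */
        while (cur.n > 0) {
            int b = cur.n < nb ? cur.n : nb;
            split_cols(cur, b, &panel, &rest);
            printf("processing a panel of width %d at column offset %d\n",
                   panel.n, panel.j);
            cur = rest;
        }
        return 0;
    }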

X-lab

Interactive mathematics packages such as Matlab (from The MathWorks, www.mathworks.com), Mathematica (from Wolfram Research, www.wolfram.com), HiQ (from National Instruments, www.natinst.com), and others give their users access to sophisticated mathematics in an interactive workstation environment. The packages typically include functionality for linear algebra, curve fitting, differential equations, signal processing, and sophisticated graphics, among many other areas.

Because PMI can connect with any of these products, and in an attempt to be even-handed, throughout the text we will refer to the interactive package as ``X-lab.'' This is intended to refer to any of the above products.

Implementation

This section briefly describes the implementation of PMI. We begin by discussing the basic mechanism of communication in PMI. Then we describe the ``third-party'' part of the program, i.e. the part of the PMI software associated with a particular platform (Matlab, Mathematica, etc.). Next, we detail the PLAPACK side of the interface. Finally, we give some remarks about software layering in PMI.

Overview of implementation

Figure 1 shows a diagram of the PMI implementation. The third-party software (``X-lab'') communicates, through its native mechanism for calling user-supplied C code, with a set of C routines. These routines in turn communicate via shared memory with the PLAPACK process running on the same node as the third-party software. That PLAPACK process then uses the PLAPACK application interface and MPI routines to communicate with the other PLAPACK nodes. For sending results back to the third-party software, the process is reversed.


  
Figure 1: Diagram of the PMI machine layout.
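To make the shared-memory hand-off concrete, the following is a minimal sketch of what the command header and ownership flag exchanged between the two processes might look like. The layout, field names, and command encoding are our own illustration (hypothetical), not the actual PMI source.

    #include <stddef.h>

    /* Hypothetical layout of the shared-memory segment shared by the
     * X-lab glue code and the master PLAPACK process. */
    typedef enum { OWNER_XLAB = 0, OWNER_PLAPACK = 1 } owner_t;

    typedef struct {
        volatile owner_t owner;     /* whose turn it is to use the segment  */
        int              command;   /* e.g. create, free, axpy-to, LU, ...  */
        int              object_id; /* which parallel object is involved    */
        int              m, n;      /* dimensions, where applicable         */
        int              status;    /* return/error code written by PLAPACK */
        size_t           nbytes;    /* size of the data area that follows   */
        /* matrix/vector elements follow the header within the segment */
    } pmi_header_t;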

The ``X-lab'' side

Each piece of third-party software that is a candidate for a PLAPACK interface must be able to call user-supplied C code, and must also be able to pass data back and forth to that C code. (In practice, if user C code is callable within a package, then there is always a way to pass data back and forth.)

Given that the third-party program possesses the features described above, the following outline shows how it processes PMI commands (a minimal C sketch of this sequence appears after the list). As before, we refer to the third-party software as ``X-lab,'' which may be any of the PMI-supported software platforms.

1.
The user calls a PMI function from within X-lab
2.
X-lab invokes a call to the PMI code, passing whatever parameters are necessary for this particular function

3.
The PMI code writes a header to the shared memory area, followed by data items to be transferred, where applicable

4.
The PMI code turns over ownership of the shared memory to PLAPACK

5.
The PMI code waits for ownership of the shared memory to be returned by PLAPACK

6.
The PMI code performs any post-processing required, and either returns the relevant data to X-lab, or, if no return data is required, returns an integer whose value signifies the success or failure of the requested operation
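The sequence above can be sketched in C as follows. This is an illustration only, reusing the hypothetical pmi_header_t layout and OWNER_* flags from the earlier sketch; the function and command names are made up, and a real implementation would use a semaphore or sleep rather than the busy-wait shown here for brevity.

    #include <string.h>

    /* One PMI call, end to end, on the X-lab side (steps 3-6 above). */
    static int pmi_roundtrip(pmi_header_t *shm, int command, int object_id,
                             const void *in, size_t in_bytes,
                             void *out, size_t out_bytes)
    {
        /* step 3: write the header, followed by any data to be transferred */
        shm->command   = command;
        shm->object_id = object_id;
        shm->nbytes    = in_bytes;
        if (in_bytes > 0)
            memcpy((char *)(shm + 1), in, in_bytes);

        shm->owner = OWNER_PLAPACK;            /* step 4: hand over ownership */
        while (shm->owner != OWNER_XLAB)       /* step 5: wait for the reply  */
            ;                                  /*         (spin for brevity)  */

        if (out_bytes > 0)                     /* step 6: post-process        */
            memcpy(out, (char *)(shm + 1), out_bytes);
        return shm->status;                    /* success or failure          */
    }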

PLAPACK side

The PLAPACK side of PMI is a parallel application that plays the role of a compute server: it sets up the PLAPACK environment, then waits for commands from the client (the third-party side of PMI). Within the parallel application, one processing element (PE) plays a special role, called the master PE, and the rest of the PE's are slaves. The following is an outline of how the PLAPACK side processes commands (a sketch of the corresponding server loop appears after the list).

1.
The master PE waits for ownership of the shared memory to be granted to PLAPACK
2.
The master PE reads the command information from the shared memory header area, and broadcasts any relevant information to the slave PE's

3.
In the case of a request to place data into a parallel object or to retrieve data from a parallel object, the master PE uses the PLAPACK application interface to transfer the data to or from the intended PE's

4.
The PLAPACK PE's perform whatever parallel computation is required by the command

5.
The master PE writes any return information, error messages, and an error return code to the shared memory header

6.
In cases where numerical data (e.g. matrix elements) need to be transferred back to the third-party software, the data are written by the master PE into the data area of the shared memory

7.
Ownership of the shared memory area is relinquished by PLAPACK
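The server loop outlined above might look roughly as follows. Again, this is only a sketch under the assumptions of the earlier header sketch: everything except the MPI call is hypothetical, and the movement of data through the PLAPACK application interface as well as the parallel computation itself (steps 3 and 4) are omitted.

    #include <mpi.h>

    enum { PMI_CMD_CLOSE = 0 /* , PMI_CMD_CREATE, PMI_CMD_LU, ... */ };

    static void pmi_server_loop(pmi_header_t *shm, int rank, MPI_Comm comm)
    {
        for (;;) {
            pmi_header_t hdr;
            if (rank == 0) {                         /* master PE            */
                while (shm->owner != OWNER_PLAPACK)  /* step 1: wait         */
                    ;
                hdr = *shm;                          /* step 2: read header  */
            }
            /* step 2 (cont.): broadcast the command info to the slave PE's */
            MPI_Bcast(&hdr, (int)sizeof hdr, MPI_BYTE, 0, comm);
            if (hdr.command == PMI_CMD_CLOSE)
                break;

            /* steps 3-4: move data through the PLAPACK application
             * interface and perform the requested parallel computation
             * (omitted in this sketch) */
            int status = 0;

            if (rank == 0) {                         /* steps 5-7            */
                shm->status = status;                /* return/error code    */
                /* any numerical results are copied into the data area here */
                shm->owner = OWNER_XLAB;             /* relinquish ownership */
            }
        }
    }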

Software layering

PMI is intended to be a general-purpose interface. That is, it should be able to plug into a variety of third-party packages and to run on a variety of parallel machines. Portability of the parallel program is easy: since PLAPACK itself is highly portable (requiring only C and MPI), this part of PMI's portability is automatic.

Because of the quirks and idiosyncrasies of the third-party interface specifications, there are certain functions that must be reimplemented for each intended third-party platform. We have attempted to layer our software in such a way that these non-portable parts of the code are isolated and small.

Figure 2 shows a diagram of the PMI layering. Notice that the non-portable sections are limited to the ``shared memory mechanism'' and the ``PMIPutData'' and ``PMIGetData'' modules. The shared memory mechanism is only slightly non-portable: one version works for all Unix platforms, but some slight changes are required for Windows NT.

The functions ``PMIPutData'' and ``PMIGetData'', which move numerical data between the third-party software and PMI, must be reimplemented for each third-party platform.
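As an illustration of what such a platform-specific routine might look like, the following sketches a Matlab flavor of PMIPutData built on Matlab's C MEX API. The body, and its use of the hypothetical pmi_header_t from the earlier sketch, are our own illustration rather than the actual PMI source; a Mathematica version would presumably do the analogous copying through MathLink instead.

    #include <string.h>
    #include "mex.h"   /* Matlab's C MEX API */

    /* One possible Matlab-side PMIPutData: copy the elements of a Matlab
     * matrix into the data area of the shared segment so that they can be
     * shipped to the parallel objects on the PLAPACK side. */
    static void PMIPutData(pmi_header_t *shm, const mxArray *A)
    {
        size_t  m   = mxGetM(A);
        size_t  n   = mxGetN(A);
        double *dst = (double *)(shm + 1);    /* data follows the header */

        shm->m      = (int)m;
        shm->n      = (int)n;
        shm->nbytes = m * n * sizeof(double);
        memcpy(dst, mxGetPr(A), shm->nbytes); /* Matlab is column-major  */
    }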


  
Figure 2: Software layering in PMI.

Using PMI

This section describes PMI from a user's point of view. What does she do to initiate a session? What does she see from within her third party mathematical software? What does a sample application look like?

The parallel application

The PLAPACK side of PMI is just like any other parallel MPI executable. Hence, whatever steps are necessary to launch a parallel program on your computer must be followed. We assume here that the ``mpirun'' facility is available for the parallel machine.

Let us suppose that we wish to have 16 PE's involved in the parallel side of PMI. Then we would launch the PMI_plapack.x executable, specifying 16 processors on the command line. This would look something like the following:

	% mpirun -np 16 PMI_plapack.x
	  PLAPACK software interface waiting for connection

Once the parallel executable is running, the user is free to start PMI from within her mathematical software. (Actually, the order in which the two sides of PMI are initiated is immaterial. However, a PMIOpen[] call from within Mathematica will block until the parallel executable is started.)

Within the math-environment

From within a third-party mathematical package (which, for concreteness, we will again refer to as X-lab), PMI is accessed through a set of commands, all prefaced with ``PMI''. These commands fall into three basic categories.

The first category consists of commands to initialize, finalize, and manipulate the environment. Examples of commands in this category are PMIOpen[], PMIClose[], and PMIVerbose[].

The second class of commands in PMI performs parallel object manipulations. The purpose of these commands is to create, free, query, and fill parallel matrices, multivectors, and multiscalars. Examples of commands in this category are PMICreateObject[], PMIFreeObject[], PMIAxpyToObject[], and PMIAxpyFromObject[]. The latter two functions are used to put values into a parallel object and to get values from a parallel object, respectively.

The third class of commands in PMI causes some action to be taken on parallel objects that already exist within the PMI parallel application (i.e., objects that have already been created and filled with data values). Examples of these commands are PMILU[] and PMIGemm[], which perform LU factorization and matrix multiplication, respectively.

A sample application

This section presents a sample application. This program would be executed from within a Matlab session, and of course would require a copy of the PMI parallel application to be running as well. This example performs the following steps.

1.
Open the PMI interface
2.
Create parallel objects: a matrix, a multivector, and a multiscalar

3.
Fill the matrix and vector with data values

4.
Perform a general linear solve (LU factorization of the matrix, triangular solves with the vector as right hand side, LU pivots stored in the multiscalar)

5.
Retrieve the data from the vector

6.
Close the PMI interface

Figure 3 shows the above program from within the Matlab version of PMI.


  
Figure 3: A PMI program

Performance

This section gives some performance measurements for the PMI system, as measured on a ``beowulf'' system located at the Texas Institute for Computational and Applied Mathematics at the University of Texas at Austin. This system consists of sixteen 300 MHz Pentium II workstations connected by a 100 Mbit/s network.

We concentrate on properties of the interface itself, rather than on the properties of the parallel executable. (The parallel executable is simply a PLAPACK program in disguise, and performance numbers for PLAPACK are available in the literature and from the PLAPACK web page.) We do, however, show some speedup values to give an idea of the overheads inherent in the interface.

The main performance metrics for PMI concern the speed of the shared-memory connection. In particular, we measure the latency of the connection (the time required to get a zero-length message back and forth to the parallel executable) and the inverse bandwidth (the time per byte of data sent to or received from the parallel executable). In addition to these measurements, we also provide a profile of the sample application described in the previous section.
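In other words, under the usual linear cost model (our notation, not taken from the PMI measurements themselves), the time to move an n-byte payload across the interface behaves roughly as

    T(n) \approx \alpha + \beta n

where \alpha is the latency and \beta the inverse bandwidth; the bandwidth figures reported below correspond to 1/\beta.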

Latency

The latency of the PMI shared memory connection is measured by timing a round-trip message, where the command involved requires no processing on the parallel side. Table 1 shows some measured latencies.


 
Table 1: Latency and bandwidth of the PMI interface

                                    bandwidth (Mbyte/sec)
   # PE's     latency (sec)       nb = 16        nb = 32
      1          0.020              32.0           43.0
      4          0.019               3.0            7.3
      8          0.019               2.4            2.7
     16          0.019               2.4            2.7
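For reference, the measurement loop behind these latency numbers can be sketched as follows, reusing the hypothetical pmi_roundtrip helper and pmi_header_t layout from the implementation section; PMI_CMD_NOOP is likewise a made-up name for a command that requires no processing on the parallel side.

    #include <sys/time.h>

    enum { PMI_CMD_NOOP = 99 };   /* hypothetical "do nothing" command */

    static double seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1e-6 * tv.tv_usec;
    }

    /* Average the cost of `trips` zero-length round trips through PMI. */
    double pmi_measure_latency(pmi_header_t *shm, int trips)
    {
        double t0 = seconds();
        for (int i = 0; i < trips; i++)
            pmi_roundtrip(shm, PMI_CMD_NOOP, 0, NULL, 0, NULL, 0);
        return (seconds() - t0) / trips;   /* seconds per round trip */
    }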
 

Bandwidth

The bandwidth of the PMI connection is measured by timing the action of putting data into a parallel object, and by timing the action of retrieving data from a parallel object. The speed of these operations depends upon the specifics of the parallel mesh, in particular the number of processors and the distribution blocksize.

We report bandwidth measurements for several sizes of parallel mesh and values of distribution blocksize in Table 1.

Performance of the sample application

This section presents data concerning the performance of the sample application described in the previous section. First, in Figure 4, we show a profile of the application (generated from within Matlab, the environment in which we ran the PMI program). The plot is a bar graph, with the vertical axis representing time in seconds and the horizontal axis representing matrix size. These numbers were generated with a four-processor configuration. Second, in Figure 5 we present performance numbers for a particular kernel (LU factorization with row pivoting, followed by forward and backward substitution). The horizontal axis is the matrix size, and the vertical axis is total MFLOPS (millions of floating point operations per second). The different curves are for Matlab's LU and for PMI in one-, four-, and sixteen-PE configurations.


  
Figure 4: Profile of a PMI application. Vertical axis is time in seconds.


  
Figure 5: Total MFLOPS of the LU kernel : single node Matlab vs. Matlab with PMI in 1, 4, and 16 PE configurations.

Conclusions and future work

This paper has described the PLAPACK-Mathpackage Interface, a software connection that allows a user to plug a parallel computer into the back of their favorite interactive mathematical software. We have given some details of the implementation, use, and performance of PMI. As yet, we have realized only a small subset of what can be done with this package. First, the package can be extended to support other software packages, subject only to the constraints referred to in the implementation section of this paper. Second, we intend to thoroughly test the connection of interactive software running on a workstation to a completely separate parallel machine. Third, we intend to incorporate more of the unique features of PLAPACK (for example, the use of ``views'' into matrices and vectors) into PMI. Finally, we are interested in experimenting with ``real applications.'' That is, we wish to take an existing Matlab or Mathematica application code and parallelize it through PMI.

Acknowledgements

This work was sponsored in part by the Intel Research Council. The PLAPACK project was sponsored in part by the Parallel Research on Invariant Subspace Methods (PRISM) project (ARPA grant P-95006), the NASA High Performance Computing and Communications Program's Earth and Space Sciences Project (NRA Grants NAG5-2497 and NAG5-2511), and the Environmental Molecular Sciences construction project at Pacific Northwest National Laboratory (PNNL) (PNNL is a multiprogram national laboratory operated by Battelle Memorial Institute for the U.S. Department of Energy under Contract DE-AC06-76RLO 1830).

References

1
P. Alpatov, G. Baker, C. Edwards, J. Gunnels, G. Morrow, J. Overfelt, R. van de Geijn, Y.-J. J. Wu, "PLAPACK: Parallel Linear Algebra Package," in Proceedings of the SIAM Parallel Processing Conference, 1997.

2
P. Alpatov, G. Baker, C. Edwards, J. Gunnels, G. Morrow, J. Overfelt, R. van de Geijn, Y.-J. J. Wu, "PLAPACK: Parallel Linear Algebra Package Design Overview," in Proceedings of SC97, 1997.

3
Robert van de Geijn, Using PLAPACK: Parallel Linear Algebra Package,
The MIT Press, 1997.

4
D. J. Becker, T. Sterling, D. Savarese, J. E. Dorband, U. A. Ranawake, and C. V. Packer. BEOWULF: A parallel workstation for scientific computation. In Proceedings of the 1995 International Conference on Parallel Processing (ICPP), pages 11-14, 1995.

5
Pawletta, S., Drewelow, W., Duenow, P., Pawlette, T., and Suesse, M. ``A MATLAB toolbox for Distributed and Parallel Processing,'' in Proceedings of the MATLAB Conference 95, Cambridge, MA (1995).

6
A. E. Trefethen, V. S. Menon, C. C. Chang, G. J. Czajkowski, C. Myers, L. N. Trefethen, ``MultiMATLAB: MATLAB on multiple processors,'' Technical Report 96-239, Cornell Theory Center, Ithaca, NY (1996).

7
P. Drakenberg, P. Jacobson, B. Kagstrom, ``A CONLAB compiler for a Distributed Memory Multicomputer,'' in Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computation, Volume 2 (1993), pp. 814-821.

8
M. J. Quinn, A. Malishevsky, N. Seelam, Y. Zhao, ``Preliminary results from a parallel MATLAB compiler,'' in Proceedings of the IEEE International Parallel Processing Symposium (1998), pp. 81-87.

9
L. DeRose and D. Padua, ``A MATLAB to Fortran 90 Translator and its effectiveness,'' in Proceedings of the 10th ACM International Conference on Supercomputing (May 1996).

10
R. A. Maeder, ``Demonstration programs from keynote lectures given by R. Maeder at IMS'97 (Second International Mathematica Symposium), Rovaniemi, Finland, June 29 - July 4, 1997,'' available at http://www.mathconsult.ch/math/stuff/ims.html.
