
## Simple implementation

We start by explaining the simple case of matrix-vector multiplication

y = A x

where A is an m × n matrix. For our explanations, we assume that matrix A and vectors x and y are created according to the driver given in Example 2.3, so the template is created with the communicator and blocking size used there. Given this template, vectors x and y are aligned with the first element of the template vector, and matrix A with the upper-left element of the template matrix. This yields objects a, x, and y.

As explained in Section 1.5, the following steps perform the matrix-vector multiply y = A x:

• spread (collect) the entries of x within columns of nodes,
• perform the local matrix-vector multiply, and
• perform a distributed reduce (summation) of the local results within rows, leaving the global result in vector y.
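
The three steps above can be simulated serially to see why they produce y = A x. The following is a minimal sketch, not PLAPACK code: the function name `mesh_matvec`, the mesh dimensions `MESH_ROWS` and `MESH_COLS`, and the contiguous block distribution of A are all assumptions made for illustration, and PLAPACK's actual data distribution may differ. No communication takes place; the loops over p and q stand in for the nodes of the mesh.

```c
#include <assert.h>

#define MESH_ROWS 2   /* hypothetical number of rows of nodes */
#define MESH_COLS 3   /* hypothetical number of columns of nodes */

/* Serial simulation of y = A x on a MESH_ROWS x MESH_COLS mesh.
   A is m x n, stored row-major. Node (p,q) is assumed to own the
   contiguous block of A with row block p and column block q. */
void mesh_matvec(int m, int n, const double *a, const double *x, double *y)
{
    /* Assumption: the blocks divide the matrix evenly. */
    assert(m % MESH_ROWS == 0 && n % MESH_COLS == 0);
    int mb = m / MESH_ROWS;   /* rows of A per node */
    int nb = n / MESH_COLS;   /* columns of A per node */

    for (int i = 0; i < m; i++)
        y[i] = 0.0;           /* start the summation from zero */

    for (int p = 0; p < MESH_ROWS; p++) {
        for (int q = 0; q < MESH_COLS; q++) {
            /* Step 1 (spread): every node in column q sees the piece
               of x that matches the columns of A it owns. */
            const double *xloc = x + q * nb;

            for (int i = 0; i < mb; i++) {
                /* Step 2 (local multiply): one row of the local block. */
                double partial = 0.0;
                for (int j = 0; j < nb; j++)
                    partial += a[(p * mb + i) * n + q * nb + j] * xloc[j];

                /* Step 3 (reduce): sum the partial results of all
                   nodes in row p into the global result. */
                y[p * mb + i] += partial;
            }
        }
    }
}
```

Calling `mesh_matvec(2, 3, a, x, y)` with a 2 × 3 matrix gives each simulated node a single entry of A, so every entry of y is formed purely by the summation in step 3.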

We will now show how to translate these operations into PLAPACK code.

PLAPACK expresses communication by describing the initial and final distributions of the data as objects and then performing a copy or a reduce between them. Thus, the following statements perform the spread of x within columns of nodes:

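```c
/* Query the datatype of a so that the new vector matches it. */
PLA_Obj_datatype( a, &datatype );
/* Create a length-n projected row vector, duplicated in all rows of nodes. */
PLA_Pvector_create( datatype, PLA_PROJ_ONTO_ROW, PLA_ALL_ROWS,
                    n, template, PLA_ALIGN_FIRST, &xdup );
/* Copying x into xdup performs the spread within columns of nodes. */
PLA_Copy( x, xdup );
```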

After this, all the information needed for the local matrix-vector multiply is available locally. Before performing it, we need to create duplicated multiscalars to hold the constants "0" and "1". Also, a duplicated projected vector (column) must be created to hold the result:

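```c
/* Duplicated 1 x 1 multiscalars holding the constants one and zero. */
PLA_Mscalar_create( datatype, PLA_ALL_ROWS, PLA_ALL_COLS, 1, 1, template, &one );
PLA_Obj_set_to_one( one );
PLA_Mscalar_create( datatype, PLA_ALL_ROWS, PLA_ALL_COLS, 1, 1, template, &zero );
PLA_Obj_set_to_zero( zero );

/* Length-m projected column vector, duplicated in all columns of nodes. */
PLA_Pvector_create( datatype, PLA_PROJ_ONTO_COL, PLA_ALL_COLS,
                    m, template, PLA_ALIGN_FIRST, &ydup );

/* Each node multiplies its local part of a by its local part of xdup. */
PLA_Local_gemv( PLA_NO_TRANSPOSE, one, a, xdup, zero, ydup );
```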
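
The argument order of the local multiply mirrors the BLAS gemv convention, y := alpha A x + beta y, which would explain why the constants one and zero have to be created first. The sketch below is a plain C reference for that convention, assuming it carries over to `PLA_Local_gemv`; the function name `ref_gemv` and the row-major storage are hypothetical and chosen only for illustration.

```c
#include <assert.h>

/* Reference for the gemv convention y := alpha * A * x + beta * y,
   with A an m x n matrix stored row-major (an assumption of this sketch). */
void ref_gemv(int m, int n, double alpha, const double *a,
              const double *x, double beta, double *y)
{
    assert(m >= 0 && n >= 0);
    for (int i = 0; i < m; i++) {
        double dot = 0.0;
        for (int j = 0; j < n; j++)
            dot += a[i * n + j] * x[j];

        /* With beta == 0 the old contents of y are ignored entirely,
           so y does not need to be initialized by the caller. */
        y[i] = (beta == 0.0) ? alpha * dot : alpha * dot + beta * y[i];
    }
}
```

With alpha = 1 and beta = 0 the call overwrites y with A x, which corresponds to passing one and zero in the PLAPACK call; with beta = 1 it would accumulate into the existing contents of y instead.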

Finally, the local results (held in the different copies of the duplicated projected vector ydup) must be reduced into a single vector y:

```c
/* Zero y, then sum the partial results in ydup within rows of nodes into y. */
PLA_Obj_set_to_zero( y );
PLA_Reduce( ydup, MPI_SUM, y );
```


rvdg@cs.utexas.edu