CS 378: Homework 4

Due: April 24th, 2008

This assignment is optional. You should do it only if you want to make up for a bad grade on a previous assignment.

Introduction

The goal of this assignment is to implement dense matrix-vector multiplication (MVM) on Lonestar in two different ways, and measure the performance of both implementations.

In this write-up, we refer to the MVM as A*x, where A is the matrix and x is the vector. You can assume that the matrix A is square, of size n x n. Recall from the class discussion that we rarely perform a single MVM in isolation. Instead, we usually perform a large number of MVMs with the same matrix A and different vectors x1, x2, ..., where the vector x2 is obtained from the vector A*x1, the vector x3 is obtained from the vector A*x2, and so on. Our MVM routine will therefore be optimized for this case: we will partition the matrix A among the processes just once, and we require that the distribution of the result vector A*x be the same as the distribution of the vector x.
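
As a point of reference for the repeated-MVM pattern described above, here is a minimal serial sketch in C. The function names mvm_serial and iterate_mvm are hypothetical, the matrix is assumed to be stored in row-major order, and the next vector is assumed to be exactly A*x with no scaling (the assignment only says that it is "obtained from" A*x). A serial version like this can be useful for checking the output of the parallel programs on small inputs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* y = A*x for a dense n x n matrix stored in row-major order. */
void mvm_serial(int n, const double *A, const double *x, double *y)
{
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[(size_t)i * n + j] * x[j];
        y[i] = sum;
    }
}

/* Apply x <- A*x `steps` times.
 * Assumption: the next vector is exactly A*x (no normalization).
 * tmp is caller-provided scratch space of length n. */
void iterate_mvm(int n, const double *A, double *x, double *tmp, int steps)
{
    for (int k = 0; k < steps; k++) {
        mvm_serial(n, A, x, tmp);
        memcpy(x, tmp, (size_t)n * sizeof(double));
    }
}
```

Because the output vector replaces the input vector at every step, the parallel versions need the result to have the same distribution as x, which is the requirement stated above.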

One approach to building the distributed matrix is to create the entire matrix on the root process and then have the root process send the appropriate portions to the other processes. This approach limits the size of the matrix you can work with, because the whole matrix must fit in the memory available to the root process, so a better approach is to have each process create its own portion of the global matrix. In a real code, this could be accomplished by having each process read its own portion of the global matrix from disk, or by having some other program, such as a finite-element mesh generator and formulator, create the appropriate portion of the global matrix in the local memory of each process. We will do something simpler: the root process should broadcast the size of the global matrix to the other processes, and every process (including the root) should allocate a local sub-matrix of the appropriate size and initialize its entries as A(i,j) = i+j, where i and j are the global row and column indices. Each process should also create its own piece of the vector x, initializing all of its elements to 1. Both the matrix and the vector should contain doubles.
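
The local initialization described above might look like the following sketch for a block of consecutive rows. The names init_local_rows, init_local_x, and expected_y are hypothetical, and the sketch assumes 0-based global indices and row-major storage, neither of which the assignment specifies. Under those assumptions, entry i of the first product A*x (with x all ones) is n*i + n(n-1)/2, which gives a cheap correctness check.

```c
#include <assert.h>
#include <stddef.h>

/* Fill nrows consecutive rows of the global n x n matrix A(i,j) = i + j,
 * starting at global row row_lo. Assumes 0-based global indices.
 * A_local is a row-major nrows x n buffer owned by the caller. */
void init_local_rows(int n, int row_lo, int nrows, double *A_local)
{
    assert(n > 0 && row_lo >= 0 && nrows >= 0 && row_lo + nrows <= n);
    for (int li = 0; li < nrows; li++) {
        int gi = row_lo + li;               /* global row index */
        for (int j = 0; j < n; j++)
            A_local[(size_t)li * n + j] = (double)(gi + j);
    }
}

/* Set every element of the local piece of x to 1. */
void init_local_x(int len, double *x_local)
{
    for (int k = 0; k < len; k++)
        x_local[k] = 1.0;
}

/* Entry i of A*x when x is all ones (0-based indices):
 * sum over j = 0..n-1 of (i + j) = n*i + n*(n-1)/2. */
double expected_y(int n, int i)
{
    return (double)n * i + (double)n * (n - 1) / 2.0;
}
```

The closed form only applies to the first MVM, since later input vectors are no longer all ones.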

Problem 1 -- Block-row distribution of matrix

Write an MPI program for MVM in which the matrix A is distributed by block row, and the vector x has a matching block distribution.
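
To make the block-row layout concrete, the sketch below computes which rows a given process owns and performs the local part of the product. It is one possible arrangement, not a required one: the names block_range and local_mvm are hypothetical, leftover rows (when n is not divisible by the number of processes) are assumed to go to the lowest-numbered ranks, and the full vector x is assumed to have already been assembled on each process. In an MPI program, one common way to assemble it is a call such as MPI_Allgatherv; that step is left out here so the sketch compiles without MPI.

```c
#include <assert.h>
#include <stddef.h>

/* Range of indices owned by `rank` when n items are split into p
 * contiguous blocks. The first (n % p) ranks get one extra item. */
void block_range(int rank, int p, int n, int *lo, int *count)
{
    assert(p > 0 && rank >= 0 && rank < p && n >= 0);
    int base = n / p;
    int rem = n % p;
    *count = base + (rank < rem ? 1 : 0);
    *lo = rank * base + (rank < rem ? rank : rem);
}

/* Local part of y = A*x for a block of nrows rows.
 * A_local is row-major nrows x n; x_full has length n;
 * y_local has length nrows and matches the block distribution of x. */
void local_mvm(int nrows, int n, const double *A_local,
               const double *x_full, double *y_local)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A_local[(size_t)i * n + j] * x_full[j];
        y_local[i] = sum;
    }
}
```

Each process ends up with exactly the entries of A*x that correspond to its own rows, so the result already has the same distribution as the local piece of x.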

For a global matrix A of size 1000x1000, determine the time for performing MVM using 1, 2, 4, 8, 16, and 32 processes.
Repeat this experiment for matrices of sizes 200x200, 2000x2000, and 4000x4000 (if you have problems with large matrix sizes, use smaller sizes).
On a single graph in which the x-axis is the number of processes and the y-axis is the time to perform MVM, plot the running time of MVM for the different matrix sizes (use a separate line for each matrix size, but overlay all lines in the same graph). From these curves, estimate, if possible, the optimal number of processes for running each problem size.
For the largest problem size you were able to run, graph the speed-up (relative to the running time on one process) as a function of the number of processes.
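
The quantities asked for above can be computed from the measured times with a few small helpers. This is only a sketch: the names speedup, efficiency, and best_process_count are hypothetical, speed-up is assumed to be measured against the one-process time, and the "optimal" process count is taken to be the one with the smallest measured time. The timings themselves would typically come from calls to MPI_Wtime placed around the MVM, often after an MPI_Barrier so that all processes start together.

```c
#include <assert.h>

/* Speed-up on p processes: one-process time divided by p-process time. */
double speedup(double t1, double tp)
{
    assert(t1 > 0.0 && tp > 0.0);
    return t1 / tp;
}

/* Parallel efficiency: speed-up divided by the number of processes. */
double efficiency(double t1, double tp, int p)
{
    assert(p > 0);
    return speedup(t1, tp) / p;
}

/* Process count with the smallest measured time.
 * procs[k] is a process count and times[k] the time measured for it. */
int best_process_count(const int *procs, const double *times, int m)
{
    assert(m > 0);
    int best = 0;
    for (int k = 1; k < m; k++)
        if (times[k] < times[best])
            best = k;
    return procs[best];
}
```

For small matrices the measured minimum may well occur at a low process count, since communication cost can outweigh the reduction in computation per process.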

Problem 2 -- Block/block distribution of matrix

Repeat Problem 1 for a block/block distribution of the matrix A and the corresponding distribution of the vector x.
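
For the block/block layout, each process owns a rectangular sub-block of A. The sketch below shows one way to map ranks to sub-blocks; it is an illustration under stated assumptions, not the layout used in class. The names choose_grid, grid_coords, block_range, and owned_block are hypothetical, the process grid is assumed to be as close to square as the process count allows (for example 4 x 8 for 32 processes), and ranks are assumed to be numbered in row-major order across the grid. The distribution of the vector x and the communication needed to combine partial results are not shown.

```c
#include <assert.h>

/* Pick a pr x pc process grid with pr * pc == p and pr <= pc,
 * choosing pr as the largest divisor of p not exceeding sqrt(p). */
void choose_grid(int p, int *pr, int *pc)
{
    assert(p > 0);
    int best = 1;
    for (int d = 1; d * d <= p; d++)
        if (p % d == 0)
            best = d;
    *pr = best;
    *pc = p / best;
}

/* Grid coordinates of a rank, assuming row-major rank numbering. */
void grid_coords(int rank, int pc, int *row, int *col)
{
    *row = rank / pc;
    *col = rank % pc;
}

/* Contiguous range owned by block `b` of `nb` blocks over n items. */
void block_range(int b, int nb, int n, int *lo, int *count)
{
    assert(nb > 0 && b >= 0 && b < nb && n >= 0);
    int base = n / nb;
    int rem = n % nb;
    *count = base + (b < rem ? 1 : 0);
    *lo = b * base + (b < rem ? b : rem);
}

/* Rows and columns of the global n x n matrix owned by `rank`. */
void owned_block(int rank, int p, int n,
                 int *row_lo, int *nrows, int *col_lo, int *ncols)
{
    int pr, pc, r, c;
    assert(rank >= 0 && rank < p);
    choose_grid(p, &pr, &pc);
    grid_coords(rank, pc, &r, &c);
    block_range(r, pr, n, row_lo, nrows);
    block_range(c, pc, n, col_lo, ncols);
}
```

With 8 processes and n = 1000, for instance, this gives a 2 x 4 grid in which each process holds a 500 x 250 sub-block.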

Problem 3 -- Conclusions

Answer the following question.

In class, we argued quantitatively that the algorithm using the block/block distribution scales better than the one using the block-row distribution. Do your measurements support the conclusions drawn from the performance model? Explain your answer briefly.