
Subsection 4.3.1 Transpose matrix-vector multiplication

Homework 4.3.1.1.

Implement the routines

  • [ y_out ] = Mvmult_t_unb_var1( A, x, y )
    

  • [ y_out ] = Mvmult_t_unb_var2( A, x, y )
    

that compute \(y := A^T x + y \) via the algorithms in Figure~\ref{fig-mvmult_t}.
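The two variants can be sketched as follows. This is a minimal Python rendering rather than the course's MATLAB, using nested lists (`A[i][j]` is the entry in row `i`, column `j`); the lowercase function names and list representation are assumptions for illustration, not the routines the homework asks for.

```python
def mvmult_t_unb_var1(A, x, y):
    """y := A^T x + y via dot products: psi_j += a_j^T x, a_j = column j of A."""
    m = len(A)         # rows of A
    n = len(A[0])      # columns of A (= length of y)
    y_out = list(y)
    for j in range(n):
        # Dot product of column j of A with x, added to psi_j.
        y_out[j] += sum(A[i][j] * x[i] for i in range(m))
    return y_out


def mvmult_t_unb_var2(A, x, y):
    """y := A^T x + y via axpy operations: y += chi_i * a_i, a_i^T = row i of A."""
    m = len(A)
    n = len(A[0])
    y_out = list(y)
    for i in range(m):
        # Axpy: scale row i of A by chi_i and add it to y.
        for j in range(n):
            y_out[j] += x[i] * A[i][j]
    return y_out
```

Both variants perform the same \(2 m n \) flops; they differ only in the order in which the entries of \(A \) are visited.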

Homework 4.3.1.2.

Implementations achieve better performance (finish faster) if one accesses data consecutively in memory. Most scientific computing codes store matrices in "column-major order," which means that the first column of a matrix is stored consecutively in memory, then the second column, and so forth. As a result, an algorithm that accesses a matrix by columns tends to be faster than an algorithm that accesses it by rows. That, in turn, means that when one is presented with more than one algorithm, one should pick the algorithm that accesses the matrix by columns.
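Column-major order can be made concrete with a little index arithmetic: entry \((i,j)\) of an \(m \times n \) matrix lives at flat position \(i + j\,m \text{,}\) so walking down a column touches consecutive locations. A small illustrative sketch (the matrix values are made up):

```python
def colmajor_index(i, j, m):
    # Flat position of entry (i, j) in column-major storage of an m x n matrix.
    return i + j * m

m, n = 3, 2
A = [[10, 11],
     [20, 21],
     [30, 31]]
# Flatten A in column-major order: all of column 0, then all of column 1.
flat = [A[i][j] for j in range(n) for i in range(m)]
# Walking down column 1 visits flat positions j*m, j*m + 1, ... -- consecutive.
col1_indices = [colmajor_index(i, 1, m) for i in range(m)]
```

Here `flat` equals `[10, 20, 30, 11, 21, 31]` and `col1_indices` equals `[3, 4, 5]`: column accesses are unit-stride, while row accesses jump by `m` positions.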

Our FLAME notation makes it easy to recognize algorithms that access the matrix by columns.

  • For the matrix-vector multiplication \(y := A x + y \text{,}\) would you recommend the algorithm that uses dot products or the algorithm that uses axpy operations?

  • For the matrix-vector multiplication \(y := A^T x + y \text{,}\) would you recommend the algorithm that uses dot products or the algorithm that uses axpy operations?

Solution

When computing \(A x + y \text{,}\) it is when you view \(A x \) as taking linear combinations of the columns of \(A \) that you end up accessing the matrix by columns. Hence, the axpy-based algorithm will access the matrix by columns.

When computing \(A^T x + y \text{,}\) it is when you view \(A^T x \) as taking dot products of columns of \(A \) with vector \(x \) that you end up accessing the matrix by columns. Hence, the dot-based algorithm will access the matrix by columns. The important thing is: one algorithm doesn't fit all situations. So, it is important to be aware of all algorithms for computing an operation.
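One way to see this is to trace which entries of \(A \) the dot-based update for \(y := A^T x + y \) reads, and in what order. A hypothetical tracer (the function name is invented for this sketch):

```python
def trace_dot_based(m, n):
    """Record the (row, col) entries of A read by the dot-based
    algorithm for y := A^T x + y on an m x n matrix."""
    accesses = []
    for j in range(n):        # one dot product per column of A
        for i in range(m):    # a_j^T x reads column j from top to bottom
            accesses.append((i, j))
    return accesses

# For a 3 x 2 matrix: all of column 0, then all of column 1.
trace = trace_dot_based(3, 2)
```

The trace comes out as `[(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]`: the dot-based algorithm for \(y := A^T x + y \) sweeps through \(A \) column by column, which is exactly what column-major storage rewards.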

Let \(A = \left( \begin{array}{r r r} {1}\amp {-2} \amp {0} \\ {2}\amp {-1} \amp {1} \\ {1}\amp {2} \amp {3} \end{array}\right)\) and \(x = \left( \begin{array}{r} {-1} \\ {2} \\ {-3} \end{array}\right) \text{.}\) Then

\begin{equation*} A^T x = \left( \begin{array}{r | r | r} {1}\amp {-2} \amp {0} \\ {2}\amp {-1} \amp {1} \\ {1}\amp {2} \amp {3} \end{array} \right)^T \left( \begin{array}{r} {-1} \\ {2} \\ {-3} \end{array}\right) = \left( \begin{array}{r r r} {1}\amp {2} \amp {1} \\ \hline {-2}\amp {-1} \amp {2} \\ \hline {0}\amp {1} \amp {3} \end{array} \right) \left( \begin{array}{r} {-1} \\ {2} \\ {-3} \end{array}\right) = \left( \begin{array}{r} {0} \\ \hline {-6} \\ \hline {-7} \end{array}\right) . \end{equation*}

The thing to notice is that what was a column in \(A \) becomes a row in \(A^T \text{.}\)
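The arithmetic above can be checked directly: entry \(j \) of \(A^T x \) is the dot product of column \(j \) of \(A \) with \(x \text{.}\) A minimal Python sketch of that check (nested lists are an assumption for illustration, not course code):

```python
A = [[1, -2, 0],
     [2, -1, 1],
     [1, 2, 3]]
x = [-1, 2, -3]
m, n = 3, 3
# (A^T x)_j = sum_i A[i][j] * x[i]: column j of A becomes row j of A^T.
ATx = [sum(A[i][j] * x[i] for i in range(m)) for j in range(n)]
```

This reproduces the result above: `ATx` equals `[0, -6, -7]`.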

Let us consider how to compute \(y := A^T x + y \text{.}\) It would be possible to explicitly transpose matrix \(A \) into a new matrix \(B \) (using, for example, the transpose function you wrote in Week 3) and to then compute \(y := B x + y\text{.}\) This approach has at least two drawbacks:

  • You will need space for the matrix \(B \text{.}\) Computational scientists tend to push the limits of available memory, and hence are always hesitant to use large amounts of space that isn't absolutely necessary.

  • Transposing \(A \) into \(B \) takes time. A matrix-vector multiplication requires \(2 m n \) flops. Transposing a matrix requires \(2 m n \) memops (\(m n \) reads from memory and \(m n \) writes to memory). Memory operations are very slow relative to floating point operations, so the time spent transposing the matrix will dominate the overall computation.

\begin{equation*}
% Define the operation to be computed
\renewcommand{\routinename}{ y := \mbox{Mvmult\_t\_unb\_var1}( A, x, y ) }
% Step 3: Loop-guard
\renewcommand{\guard}{ m( y_T ) \lt m( y ) }
% Step 4: Define Initialize
\renewcommand{\partitionings}{
  \setlength{\arraycolsep}{2pt}
  A \rightarrow \FlaOneByTwo{A_L}{A_R} ,
  y \rightarrow \FlaTwoByOne{y_{T}}{y_{B}}
}
\renewcommand{\partitionsizes}{ A_L is m \times 0 and y_{T} is 0 \times 1 }
% Step 5a: Repartition the operands
\renewcommand{\repartitionings}{
  \setlength{\arraycolsep}{2pt}
  \FlaOneByTwo{A_L}{A_R} \rightarrow \FlaOneByThreeR{A_0}{a_1}{A_2} ,
  \FlaTwoByOne{y_T}{y_B} \rightarrow \FlaThreeByOneB{y_0}{\psi_1}{y_2}
}
\renewcommand{\repartitionsizes}{ a_1 is a column }
% Step 5b: Move the double lines
\renewcommand{\moveboundaries}{
  \setlength{\arraycolsep}{2pt}
  \FlaOneByTwo{A_L}{A_R} \leftarrow \FlaOneByThreeL{A_0}{a_1}{A_2} ,
  \FlaTwoByOne{y_T}{y_B} \leftarrow \FlaThreeByOneT{y_0}{\psi_1}{y_2}
}
% Step 8: Insert the updates required to change the
% state from that given in Step 6 to that given in Step 7
\renewcommand{\update}{
  \begin{array}{l}
    \psi_1 \becomes a_1^T x + \psi_1
  \end{array}
}
\FlaAlgorithm
\end{equation*}
\begin{equation*}
% Define the operation to be computed
\renewcommand{\routinename}{ y := \mbox{Mvmult\_t\_unb\_var2}( A, x, y ) }
% Step 3: Loop-guard
\renewcommand{\guard}{ m( A_T ) \lt m( A ) }
% Step 4: Define Initialize
\renewcommand{\partitionings}{
  \setlength{\arraycolsep}{2pt}
  A \rightarrow \FlaTwoByOne{A_{T}}{A_{B}} ,
  x \rightarrow \FlaTwoByOne{x_{T}}{x_{B}}
}
\renewcommand{\partitionsizes}{ A_{T} is 0 \times n and x_{T} is 0 \times 1 }
% Step 5a: Repartition the operands
\renewcommand{\repartitionings}{
  \setlength{\arraycolsep}{2pt}
  \FlaTwoByOne{A_T}{A_B} \rightarrow \FlaThreeByOneB{A_0}{a_1^T}{A_2} ,
  \FlaTwoByOne{x_T}{x_B} \rightarrow \FlaThreeByOneB{x_0}{\chi_1}{x_2}
}
\renewcommand{\repartitionsizes}{ a_1^T is a row }
% Step 5b: Move the double lines
\renewcommand{\moveboundaries}{
  \setlength{\arraycolsep}{2pt}
  \FlaTwoByOne{A_T}{A_B} \leftarrow \FlaThreeByOneT{A_0}{a_1^T}{A_2} ,
  \FlaTwoByOne{x_T}{x_B} \leftarrow \FlaThreeByOneT{x_0}{\chi_1}{x_2}
}
% Step 8: Insert the updates required to change the
% state from that given in Step 6 to that given in Step 7
\renewcommand{\update}{
  \begin{array}{l}
    y \becomes \chi_1 a_1 + y
  \end{array}
}
\FlaAlgorithm
\end{equation*}

Figure 4.3.1.1. Algorithms for computing \(y := A^T x + y \text{.}\)

Now, the motivation for this unit suggests that we can simply use columns of \(A \) for the dot products in the dot product based algorithm, yielding the algorithm for \(y := A^T x + y \) given in FLAME notation in Figure~\ref{fig-mvmult_t}~(left). Alternatively, one can exploit the fact that columns of \(A \) become rows of \(A^T \) to change the algorithm for computing \(y := A x + y \) that is based on axpy operations into an algorithm for computing \(y := A^T x + y \text{,}\) as shown in Figure~\ref{fig-mvmult_t}~(right).

The point of this last exercise is to make you aware of the fact that knowing more than one algorithm can give you a performance edge. (Useful if you pay \$30 million for a supercomputer and you want to get the most out of its use.)