## Unit12.2.2Opportunities for optimization

We now examine the opportunity for the reuse of data for different linear algebra operations that we have encountered. In our discussion, we assume that scalars, the elements of vectors, and the elements of matrices are all stored as floating point numbers and that arithmetic involves floating point computation.

###### Example12.2.2.1.

Consider the dot product

\begin{equation*} \rho := x^T y + \rho, \end{equation*}

where $\rho$ is a scalar, and $x$ and $y$ are vectors of size $m \text{.}$

• How many floating point operations (flops) are required to compute this operation?

• If the scalar $\rho$ and the vectors $x$ and $y$ are initially stored in main memory and are written back to main memory, how many reads and writes (memops) are required? (Give a reasonably tight lower bound.)

• What is the ratio of flops to memops?

Solution
• How many floating point operations are required to compute this operation?

The dot product requires $m$ multiplies and $m$ additions, for a total of $2 m$ flops.

• If the scalar $\rho$ and the vectors $x$ and $y$ are initially stored in main memory and (if necessary) are written back to main memory, how many reads and writes (memory operations) are required? (Give a reasonably tight lower bound.)

• The scalar $\rho$ is moved into a register and hence only needs to be read once and written once.

• The $m$ elements of $x$ and $m$ elements of $y$ must be read (but not written).

Hence the number of memops is $2( m + 1 ) \approx 2 m \text{.}$

• What is the ratio of flops to memops?

\begin{equation*} \frac{2 m \mbox{ flops}}{2( m+1 ) \mbox{ memops}} \approx 1 \frac{\mbox{flops}}{\mbox{memops}}. \end{equation*}

We conclude that the dot product does not exhibit an opportunity for the reuse of most data.

###### Homework12.2.2.1.

Consider the axpy operation

\begin{equation*} y := \alpha x + y, \end{equation*}

where $\alpha$ is a scalar, and $x$ and $y$ are vectors of size $m \text{.}$

• How many floating point operations (flops) are required to compute this operation?

• If the scalar $\alpha$ and the vectors $x$ and $y$ are initially stored in main memory and are written back to main memory (if necessary), how many reads and writes (memops) are required? (Give a reasonably tight lower bound.)

• What is the ratio of flops to memops?

Solution
• How many floating point operations are required to compute this operation?

The axpy operation requires $m$ multiplies and $m$ additions, for a total of $2 m$ flops.

• If the scalar $\alpha$ and the vectors $x$ and $y$ are initially stored in main memory and (if necessary) are written back to main memory, how many reads and writes (memory operations) are required? (Give a reasonably tight lower bound.)

• The scalar $\alpha$ is moved into a register, and hence only needs to be read once. It does not need to be written back to memory.

• The $m$ elements of $x$ are read, and the $m$ elements of $y$ must be read and written.

Hence the number of memops is $3m + 1 \approx 3 m \text{.}$

• What is the ratio of flops to memops?

\begin{equation*} \frac{2 m \mbox{ flops}}{3m+1 \mbox{ memops}} \approx \frac{2}{3} \frac{\mbox{flops}}{\mbox{memops}}. \end{equation*}

We conclude that the axpy operation also does not exhibit an opportunity for the reuse of most data.

The time for performing a floating point operation is orders of magnitude less than that of moving a floating point number from and to memory. Thus, for an individual dot product or axpy operation, essentially all time will be spent in moving the data and the attained performance, in GFLOPS, will be horrible. The important point is that there just isn't much reuse of data when executing these kinds of "vector-vector" operations.

Example 12.2.2.1 and Homework 12.2.2.1 appear to suggest that, for example, when computing a matrix-vector multiplication, one should do so by taking dot products of rows with the vector rather than by taking linear combinations of the columns, which casts the computation in terms of axpy operations. It is more complicated than that: the fact that the algorithm that uses axpys computes with columns of the matrix, which are stored contiguously when column-major order is employed, makes accessing memory cheaper.

###### Homework12.2.2.2.

Consider a matrix-vector multiplication

\begin{equation*} y := A x + y, \end{equation*}

where $A$ is $m \times m$ and $x$ and $y$ are vectors of appropriate size.

• How many floating point operations (flops) are required to compute this operation?

• If the matrix and vectors are initially stored in main memory and are written back to main memory (if necessary), how many reads and writes (memops) are required? (Give a reasonably tight lower bound.)

• What is the ratio of flops to memops?

Solution
• How many floating point operations are required to compute this operation?

$y := A x + y$ requires $m^2$ multiplies and $m^2$ additions, for a total of $2 m^2$ flops.

• If the matrix and vectors are initially stored in main memory and are written back to main memory, how many reads and writes (memops) are required? (Give a reasonably tight lower bound.)

To come up with a reasonably tight lower bound, we observe that every element of $A$ must be read (but not written). Thus, a lower bound is $m^2$ memops. The reading and writing of $x$ and $y$ contribute a lower order term, which we tend to ignore.

• What is the ratio of flops to memops?

\begin{equation*} \frac{2 m^2 \mbox{ flops}}{m^2 \mbox{ memops}} \approx 2 \frac{\mbox{flops}}{\mbox{memops}}. \end{equation*}

While this ratio is better than either the dot product's or the axpy operation's, it still does not look good.

What we notice is that there is a (slighly) better opportunity for reuse of data when computing a matrix-vector multiplication than there is when computing a dot product or axpy operation. Can the lower bound on data movement that is given in the solution be attained? If you bring $y$ into, for example, the L1 cache, then it only needs to be read from main memory once and is kept in a layer of the memory that is fast enough to keep up with the speed of floating point computation for the duration of the matrix-vector multiplication. Thus, it only needs to be read and written once from and to main memory. If we then compute $y := A x + y$ by taking linear combinations of the columns of $A \text{,}$ staged as axpy operations, then at the appropriate moment an element of $x$ with which an axpy is performed can be moved into a register and reused. This approach requires each element of $A$ to be read once, each element of $x$ to be read once, and each element of $y$ to be read and written (from and to main memory) once. If the vector $y$ is too large for the L1 cache, then it can be partitioned into subvectors that do fit. This would require the vector $x$ to be read into registers multiple times. However, $x$ itself might then be reused from one of the cache memories.

###### Homework12.2.2.3.

Consider the rank-1 update

\begin{equation*} A := x y^T + A, \end{equation*}

where $A$ is $m \times m$ and $x$ and $y$ are vectors of appropriate size.

• How many floating point operations (flops) are required to compute this operation?

• If the matrix and vectors are initially stored in main memory and are written back to main memory (if necessary), how many reads and writes (memops) are required? (Give a reasonably tight lower bound.)

• What is the ratio of flops to memops?

Solution
• How many floating point operations are required to compute this operation?

$A := x y^T + A$ requires $m^2$ multiplies and $m^2$ additions, for a total of $2 m^2$ flops. (One multiply and one add per element in $A \text{.}$)

• If the matrix and vectors are initially stored in main memory and are written back to main memory, how many reads and writes (memops) are required? (Give a reasonably tight lower bound.)

To come up with a reasonably tight lower bound, we observe that every element of $A$ must be read and written. Thus, a lower bound is $2 m^2$ memops. The reading of $x$ and $y$ contribute a lower order term. These vectors need not be written, since they don't change.

• What is the ratio of flops to memops?

\begin{equation*} \frac{2 m^2 \mbox{ flops}}{ 2 m^2 \mbox{ memops}} \approx 1 \frac{\mbox{flops}}{\mbox{memops}}. \end{equation*}

What we notice is that data reuse when performing a matrix-vector multiplication with an $m \times m$ matrix is twice as favorable as is data reuse when performing a rank-1 update. Regardless, there isn't much reuse and since memory operations are much slower than floating point operations in modern processors, none of the operations discussed so far in the unit can attain high performance (if the data starts in main memory).

###### Homework12.2.2.4.

Consider the matrix-matrix multiplication

\begin{equation*} C := A B + C, \end{equation*}

where $A \text{,}$ $B \text{,}$ and $C$ are all $m \times m$ matrices.

• How many floating point operations (flops) are required to compute this operation?

• If the matrices are initially stored in main memory and (if necessary) are written back to main memory, how many reads and writes (memops) are required? (Give a simple lower bound.)

• What is the ratio of flops to memops for this lower bound?

Solution
• How many floating point operations are required to compute this operation?

$C := A B + C$ requires $m^3$ multiplies and $m^3$ additions, for a total of $2 m^3$ flops.

• If the matrices are initially stored in main memory and (if necessary) are written back to main memory, how many reads and writes (memops) are required? (Give a simple lower bound.)

To come up with a reasonably tight lower bound, we observe that every element of $A \text{,}$ $B \text{,}$ and $C$ must be read and every element of $C$ must be written. Thus, a lower bound is $4 m^2$ memops.

• What is the ratio of flops to memops?

\begin{equation*} \frac{2 m^3 \mbox{ flops}}{ 4 m^2 \mbox{ memops}} \approx \frac{m}{2} \frac{\mbox{flops}}{\mbox{memops}}. \end{equation*}

The lower bounds that we give in this unit are simple, but useful. There is actually a tight lower bound on the number of reads and writes that must be performed by a matrix-matrix multiplication. Details can be found in

•  Tyler Michael Smith, Bradley Lowery, Julien Langou, Robert A. van de Geijn, A Tight I/O Lower Bound for Matrix Multiplication, arxiv.org:1702.02017v2, 2019. (To appear in ACM Transactions on Mathematical Software.)

The bottom line: operations like matrix-matrix multiplication exhibit the opportunity for reuse of data.