Unit1.5.3Counting flops

Floating point multiplies or adds are examples of floating point operation (flops). What we have noticed is that for all of our computations (dots, axpy, gemv, ger, and gemm) every floating point multiply is paired with a floating point add into a fused multiply-add (FMA).

Determining how many floating point operations are required for the different operations is relatively straight forward: If $x$ and $y$ are of size $n \text{,}$ then

• $\gamma := x^T y + \gamma = \chi_0 \psi_0 + \chi_1 \psi_1 + \cdots + \chi_{n-1} \psi_{n-1} + \gamma$ requires $n$ FMAs and hence $2 n$ flops.

• $y := \alpha x + y$ requires $n$ FMAs (one per pair of elements, one from from $x$ and one from $y$) and hence $2 n$ flops.

Similarly, it is pretty easy to establish that if $A$ is $m \times n \text{,}$ then
• $y := A x + y$ requires $m n$ FMAs and hence $2 mn$ flops.

$n$ axpy operations each of size $m$ for $n \times 2 m$ flops or $m$ \dots\ operations each of size $n$ for $m \times 2 n$ flops.

• $A := x y^T + A$ required $m n$ FMAs and hence $2 mn$ flops.

$n$ axpy operations each of size $m$ for $n \times 2 m$ flops or $m$ axpy operations each of size $n$ for $m \times 2 n$ flops.

Finally, if $C$ is $m \times n \text{,}$ $A$ is $m \times k \text{,}$ and $B$ is $k \times n \text{,}$ then $C := A B + C$ requires $2 m n k$ flops. We can estiblish this by recognizing that if $C$ is updated one column at a time, this takes $n$ matrix-vector multiplication operations each with a matrix of size $m \times k \text{,}$ for a total of $n \times 2 m k \text{.}$ Alternatively, if $C$ is updated with rank-1 updates, then we get $k \times 2 m n \text{.}$

When we run experiments, we tend to execute with matrices that are $n \times n \text{,}$ where $n$ ranges from small to large. Thus, the total operations required equal

\begin{equation*} \sum_{n = {\tt n_{\tt first}}}^{\tt n_{\tt last}} 2 n^3 ~\mbox{flops}, \end{equation*}

where $n_{\tt first}$ is the first problem size, $n_{\tt last}$ the last, and the summation is in increments of $n_{\tt inc}\text{.}$

If ${\tt n_{\tt last}}={\tt n_{\tt inc}}\times N$ and $n_{\tt inc}=n_{\tt first} \text{,}$ then we get

\begin{equation*} \sum_{n = n_{\tt first}}^{n_{\tt last}} 2 n^3 = \sum_{i=1}^{N} 2{(i\times n_{\tt inc})^3} = 2 n_{\tt inc}^3 \sum_{i=1}^N i^3. \end{equation*}

A standard trick is to recognize that

\begin{equation*} \sum_{i=1}^N i^3 \approx \int_0^N x^3 dx = \frac{1}{4} N^4 . \end{equation*}

So,

\begin{equation*} \sum_{n = n_{\tt first}}^{n_{\tt last}} 2 n^3 \approx \frac{1}{2} n_{\tt inc}^3 N^4 = \frac{1}{2} \frac{n_{\tt inc}^4 N^4}{n_{\tt inc}} = \frac{1}{2n_{\tt inc}} n_{\tt last}^4. \end{equation*}

The important thing is that every time we double $n_{\tt last} \text{,}$ we have to wait, approximately, sixteen times as long for our experiments to finish...

An interesting question is why we count flops rather than FMAs. By now, you have noticed we like to report the rate of computation in billions of floating point operations per second, GFLOPS. Now, if we counted FMAs rather than flops, the number that represents the rate of computation would be cut in half. For marketing purposes, bigger is better and hence flops are reported! Seriously!