## Unit 4.4.2 Amdahl's law

Amdahl's law is a simple observation about the limits on parallel speedup and efficiency. Consider an operation that requires sequential time

\begin{equation*} T \end{equation*}

for computation. What if fraction $f$ of this time is not parallelized (or cannot be parallelized), and the remaining fraction $(1-f)$ is parallelized among $t$ threads? We can then write

\begin{equation*} T = \begin{array}[t]{c} \underbrace{( 1 - f ) T} \\ \mbox{part that is parallelized} \end{array} + \begin{array}[t]{c} \underbrace{f T} \\ \mbox{part that is not parallelized} \end{array} \end{equation*}

and, assuming super-linear speedup is not expected, the total execution time with $t$ threads, $T_t \text{,}$ obeys the inequality

\begin{equation*} T_t \geq \frac{(1-f) T }{t} + f T \geq f T. \end{equation*}

This, in turn, means that the speedup is bounded by

\begin{equation*} S_t = \frac{T}{T_t} \leq \frac{T}{f T} = \frac{1}{f}. \end{equation*}

We notice that the maximal speedup is bounded by the inverse of the fraction of the execution time that is not parallelized. If, for example, $1/5$ of the execution time is spent in code that is not parallelized, then no matter how many threads you use, you can at best compute the result $1 / ( 1/ 5 ) = 5$ times faster. This means that one must think about parallelizing all parts of a code.
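As a quick sanity check, the bound can be computed directly from the formula for $T_t \text{.}$ A minimal sketch in Python (the function name `amdahl_speedup` is ours, introduced only for this illustration):

```python
def amdahl_speedup(f, t):
    """Speedup S_t = T / T_t with t threads, where T_t = (1-f) T / t + f T
    and fraction f of the execution time is not parallelized.
    (T cancels, so it need not be specified.)"""
    return 1.0 / ((1.0 - f) / t + f)

# With f = 1/5, the speedup is bounded by 1/f = 5 no matter how
# many threads are used:
for t in (1, 4, 16, 256, 4096):
    print(t, amdahl_speedup(0.2, t))
```

Notice that the speedup keeps improving as $t$ grows, but never reaches $5 \text{.}$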

###### Remark 4.4.3.

Amdahl's law says that if one does not parallelize a part of the sequential code in which fraction $f$ of the time is spent, then the speedup attained, regardless of how many threads are used, is bounded by $1/f \text{.}$

The point is: so far, we have focused on parallelizing the computational part of matrix-matrix multiplication. We should also parallelize the packing, since it requires a nontrivial part of the total execution time.

For parallelizing matrix-matrix multiplication, there is some hope. In our situation, the execution time is a function of the problem size, which we can model by the computation time plus overhead. A high-performance implementation may have a cost function that looks something like

\begin{equation*} T( n ) = 2 n^3 \gamma + C n^2 \beta, \end{equation*}

where $\gamma$ is the time required for executing a floating point operation, $\beta$ is some constant related to the time required to move a floating point number, and $C$ is some other constant. (Notice that we have not done a thorough analysis of the cost of the final implementation in Week 3, so this formula is for illustrative purposes only.) If now the computation is parallelized perfectly among the $t$ threads while, pessimistically, the time related to moving data is not parallelized at all, then we get

\begin{equation*} T_t( n ) = \frac{ 2 n^3 \gamma}{t} + C n^2 \beta, \end{equation*}

so that the speedup is given by

\begin{equation*} S_t( n ) = \frac{ 2 n^3 \gamma + C n^2 \beta} {\frac{ 2 n^3 \gamma}{t} + C n^2 \beta} \end{equation*}

and the efficiency is given by

\begin{equation*} \begin{aligned} E_t( n ) = \frac{S_t( n )}{t} &= \frac{ 2 n^3 \gamma + C n^2 \beta} { 2 n^3 \gamma + t C n^2 \beta} = \frac{ 2 n^3 \gamma} { 2 n^3 \gamma + t C n^2 \beta} + \frac{C n^2 \beta}{ 2 n^3 \gamma + t C n^2 \beta} \\ &= \frac{ 1 } { 1 + \left( \frac{t C \beta}{2 \gamma} \right) \frac{1}{n}} + \frac{1}{ \left( \frac{2 \gamma}{C \beta} \right) n + t}. \end{aligned} \end{equation*}

Now, if $t$ is also fixed, then the expressions in parentheses are constants and, as $n$ gets large, the first term converges to $1$ while the second term converges to $0 \text{.}$ So, as the problem size gets large, the efficiency approaches 100%.
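This convergence can be observed numerically with a small sketch of the cost model above; the values of $\gamma \text{,}$ $\beta \text{,}$ and $C$ below are made up purely for illustration and are not measured constants:

```python
def efficiency(n, t, gamma=1.0e-9, beta=1.0e-8, C=10.0):
    """E_t(n) = T(n) / (t T_t(n)) under the illustrative cost model
    T(n) = 2 n^3 gamma + C n^2 beta, with only the computation
    (the 2 n^3 gamma term) parallelized among the t threads."""
    T = 2 * n**3 * gamma + C * n**2 * beta
    Tt = 2 * n**3 * gamma / t + C * n**2 * beta
    return T / (t * Tt)

# With t fixed, the efficiency approaches 1 (100%) as n grows:
for n in (100, 1000, 10000, 100000):
    print(n, efficiency(n, t=16))
```

Running this shows the efficiency climbing monotonically toward 1 as the problem size increases, exactly as the limit argument predicts.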

In other words, whether all or part of the overhead is also parallelized affects how quickly we attain high efficiency, not whether we eventually attain high efficiency for a large enough problem. Of course, there is a limit to how large a problem we can store and/or how large a problem we actually want to compute.