PfHP Speedup, efficiency, etc.

Unit 4.4.1 Speedup, efficiency, etc.

Subsubsection 4.4.1.1 Sequential time

We denote the sequential time for an operation by

\begin{equation*} T( n ), \end{equation*}

where \(n \) is a parameter that captures the size of the problem being solved. For example, if we only time problems with square matrices, then \(n \) would be the size of the matrices. \(T(n) \) is equal the time required to compute the operation on a single core. If we consider all three sizes of the matrices, then \(n\) might be a vector with three entries.

A question is "Which sequential time?" Should we use the time by a specific implementation? Should we use the time for the fastest implementation? That's a good question! Usually it is obvious which to use from context.

Subsubsection 4.4.1.2 Parallel time

The parallel time for an operation and a given implementation using \(t \) threads is given by

\begin{equation*} T_t( n ). \end{equation*}

This allows us to define the sequential time for a specific implementation by \(T_1(n)\text{.}\) Itis the sequential time of the parallel implementation when using only one thread.

Subsubsection 4.4.1.3 Speedup

In the launch we already encountered the concept of speedup. There, we used it more informally.

The speedup for an operation and a given implementation using \(t \) threads is given by

\begin{equation*} S_t(n) = T( n ) / T_t( n ). \end{equation*}

It measures how much speedup is attained by using \(t \) threads.

Often, the ratio

\begin{equation*} T_1(n) / T_t(n) \end{equation*}

is used to report speedup (as we did in Unit 4.1.1). We now understand that this may give an optimistic result: If the sequential execution of the given implementation yields poor performance, it is still possible to attain very good speedup. The fair thing is to use \(T(n)\) of the best implementation when computing speedup.

Remark 4.4.1.

Intuitively, it would seem that

\begin{equation*} S_t( n ) = T( n ) / T_t( n ) \leq t . \end{equation*}

when you have at least \(t\) cores. In other words, that speedup is less than or equal to the number of threads/cores at your disposal. After all, it would seem that \(T(n) \le t T_t(n)\) since one could create a sequential implementation from the parallel implementation by mapping all computations performed by the threads to one thread, and this would remove the overhead of coordinating between the threads.

Let's use a simple example to show that speedup can be greater than \(t\text{:}\) Generally speaking, when you make a bed with two people, it takes less than half the time than when you do it by yourself. Why? Because usually two people have two sets of hands and the "workers" can therefore be on both sides of the bed simultaneously. Overhead incurrred when you make the bed by yourself (e.g., walking from one side to the other) is avoided. The same is true when using \(t\) threads on \(t\) cores: The task to be completed now has \(t\) sets of registers, \(t\) L1 caches, etc.

Attaining better than a factor \(t\) speedup with \(t\) threads and cores is called super-linear speedup. It should be viewed as the exception rather than the rule.

Subsubsection 4.4.1.4 Efficiency

The efficiency for an operation and a given implementation using \(t \) threads is given by

\begin{equation*} E_t(n) = S_t(n) / t . \end{equation*}

It measures how efficiently the \(t \) threads are being used.

Subsubsection 4.4.1.5 GFLOPS/thread

We like to report GFLOPS (billions of floating point operations per second). Notice that this gives us an easy way of judging efficiency: If we know what the theoretical peak of a core is, in terms of GFLOPS, then we can easily judge how efficiently we are using the processor relative to the theoretical peak. How do we compute the theoretical peak? We can look up the number of cycles a core can perform per second and the number of floating point operations it can perform per cycle. This can then be converted to a peak rate of computation for a single core (in billions of floating point operations per second). If there are multiple cores, we just multiply this peak by the number of cores.

Remark 4.4.2.

For current processors, it is almost always the case that as more threads are used, the frequency (cycles/second) of the cores is "clocked down" to keep the processor from overheating. This, obviously, complicates matters...