Unit 4.4.1 Speedup, efficiency, etc.
ΒΆAmdahl's law is a simple observation regarding how much speedup can be expected when one parallelizes a code. First some definitions.

Sequential time.
The sequential time for an operation is given by
\begin{equation*} T( n ), \end{equation*}where \(n \) is a parameter that captures the size of the problem being solved. For example, if we only time problems with square matrices, then \(n \) would be the size of the matrices. \(T(n) \) is equal the time required to compute the operation on a single core.

Parallel time.
The parallel time for an operation and a given implementation using \(t \) threads is given by
\begin{equation*} T_t( n ). \end{equation*} 
Speedup.
The speedup for an operation and a given implementation using \(t \) threads is given by
\begin{equation*} T( n ) / T_t( n ). \end{equation*}It measures how much speedup is attained by using \(t \) threads. Generally speaking, it is assumed that
\begin{equation*} S_t( n ) = T( n ) / T_t( n ) \leq t . \end{equation*}This is not always the case: sometimes \(t \) threads collective have resources (e.g., more cache memory) that allows speedup to exceed \(t \text{.}\) Think of the case where you make a bed with one person versus two people: You can often finish the job more than twice as fast because two people do not need to walk back and forth from one side of the bed to the other.

Efficiency.
The efficiency for an operation and a given implementation using \(t \) threads is given by
\begin{equation*} E_t(n) = S_t(n) / t . \end{equation*}It measures how efficiently the \(t \) threads are being used.

GFLOPS.
We like to report GFLOPS (billions of floating point operations per second). Notice that this gives us an easy way of judging efficiency: If we know what the theoretical peak of a core is, in terms of GFLOPS, then we can easily judge how efficiently we are using the processor relative to the theoretical peak. How do we compute the theoretical peak? We can look up the number of cycles a core can perform per second and the number of floating point operations it can before per cycle. This can then be converted to a peak rate of computation for a single core (in billions of floating point operations per second). If there are multiple cores, we just multiply this peak by the number of cores.
Of course, this is all complicated by the fact that as more cores are utilized, each core often executes at a lower clock speed so as to keep the processor from overheating.