ALAFF Error in storing a real number as a floating point number

Subsection 6.2.2 Error in storing a real number as a floating point number

Remark 6.2.2.1.

We consider the case where a real number is truncated to become the stored floating point number. This makes the discussion a bit simpler.

Let positive \(\chi \) be represented by

\begin{equation*} \chi = . \delta_0 \delta_1 \cdots \times 2^e, \end{equation*}

where \(\delta_i \) are binary digits and \(\delta_0 = 1 \) (the mantissa is normalized). If \(t \) binary digits are stored by our floating point system, then

\begin{equation*} \check \chi = . \delta_0 \delta_1 \cdots \delta_{t-1} \times 2^e \end{equation*}

is stored (if truncation is employed). If we let \(\delta\!\chi = \chi - \check \chi \text{,}\) then

\begin{equation*} \begin{array}{rcl} \delta\!\chi \amp = \amp \begin{array}[t]{c} \underbrace{. \delta_0 \delta_1 \cdots \delta_{t-1} \delta_t \cdots \times 2^e} \\ \chi \end{array} - \begin{array}[t]{c} \underbrace{. \delta_0 \delta_1 \cdots \delta_{t-1} \times 2^e} \\ \check \chi \end{array} \\ \amp = \amp \begin{array}[t]{c} \underbrace{. 0 \cdots 00} \\ t \end{array} \delta_t \cdots \times 2^e \\ \amp \lt \amp \begin{array}[t]{c} \underbrace{. 0 \cdots 01} \\ t \end{array} \times 2^e = 2^{-t} 2^{e} . \end{array} \end{equation*}

Since \(\chi \) is positive and \(\delta_0 = 1 \text{,}\)

\begin{equation*} \chi = . \delta_0 \delta_1 \cdots \times 2^e \geq \frac{1}{2} \times 2^e . \end{equation*}

Thus,

\begin{equation*} \frac{\delta\!\chi}{\chi} \leq \frac{2^{-t} 2^{e}}{\frac{1}{2} 2^{e}} = 2^{-(t-1)} , \end{equation*}

which can also be written as

\begin{equation*} \delta\!\chi \leq 2^{-(t-1)} \chi. \end{equation*}

A careful analysis of what happens when \(\chi \) equals zero or is negative yields

\begin{equation*} \vert \delta\!\chi \vert \leq 2^{-(t-1)} \vert \chi \vert. \end{equation*}
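The bound can be checked numerically for positive values. The following is a minimal sketch, not part of the original text: the function name `truncate_to_t_digits` is hypothetical, and exact rational arithmetic (`fractions.Fraction`) stands in for real numbers so that the only error is the truncation itself.

```python
from fractions import Fraction
import math

def truncate_to_t_digits(chi: Fraction, t: int) -> Fraction:
    """Keep the first t binary mantissa digits of positive chi, dropping the rest."""
    if chi <= 0:
        raise ValueError("this sketch handles positive chi only")
    # Find e such that 2^(e-1) <= chi < 2^e, i.e. chi = .d0 d1 ... x 2^e with d0 = 1.
    e = 0
    while chi >= Fraction(2) ** e:
        e += 1
    while chi < Fraction(2) ** (e - 1):
        e -= 1
    last_digit = Fraction(2) ** (e - t)   # weight of digit d_{t-1}
    return math.floor(chi / last_digit) * last_digit

# Check delta_chi <= 2^-(t-1) * chi for a few arbitrary sample values.
for chi in (Fraction(1, 10), Fraction(4, 3), Fraction(22, 7), Fraction(1000, 3)):
    for t in (4, 8, 24):
        delta = chi - truncate_to_t_digits(chi, t)
        assert 0 <= delta <= Fraction(1, 2 ** (t - 1)) * chi

print(truncate_to_t_digits(Fraction(4, 3), 4))  # 5/4
```

The sample values and digit counts are arbitrary choices; any positive rational should satisfy the same assertion.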

Example 6.2.2.2.

The number \(4/3 = 1.3333\cdots \) can be written as

\begin{equation*} \begin{array}{l} 1.3333 \cdots \\ ~~~ = ~~~~ \\ 1 + \frac{0}{2} + \frac{1}{4} + \frac{0}{8} + \frac{1}{16} + \cdots \\ ~~~ = ~~~~ \lt \mbox{ convert to binary representation } \gt \\ 1.0101 \cdots \times 2^0 \\ ~~~ = ~~~~ \lt \mbox{ normalize } \gt \\ .10101 \cdots \times 2^1 \end{array} \end{equation*}

Now, if \(t = 4 \) then this would be truncated to

\begin{equation*} .1010 \times 2^1, \end{equation*}

which equals the number

\begin{equation*} \begin{array}{l} .1010 \times 2^1 \\ ~~~ = ~~~~ \\ \left( \frac{1}{2} + \frac{0}{4} + \frac{1}{8} + \frac{0}{16} \right) \times 2^1 \\ ~~~ = ~~~~ \\ 0.625 \times 2 \\ ~~~ = ~~~~ \lt \mbox{ convert to decimal } \gt \\ 1.25 \end{array} \end{equation*}

The relative error equals

\begin{equation*} \frac{1.333\cdots - 1.25} {1.333\cdots} = 0.0625. \end{equation*}
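
For comparison with the bound derived for truncation: here \(t = 4 \text{,}\) so \(2^{-(t-1)} = 2^{-3} = 0.125 \text{,}\) and the observed relative error stays within it. Working with exact fractions,

\begin{equation*} \frac{\delta\!\chi}{\chi} = \frac{\frac{4}{3} - \frac{5}{4}}{\frac{4}{3}} = \frac{\frac{1}{12}}{\frac{4}{3}} = \frac{1}{16} = 2^{-4} \leq 2^{-(t-1)} = 2^{-3} . \end{equation*}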

If \(\check \chi \) is computed by rounding instead of truncating, then

\begin{equation*} \vert \delta\!\chi \vert \leq 2^{-t} \vert \chi \vert . \end{equation*}

We can abstract away from the details of the base that is chosen and whether rounding or truncation is used by stating that storing \(\chi \) as the floating point number \(\check \chi \) obeys

\begin{equation*} \vert \delta\!\chi \vert \leq \meps \vert \chi \vert \end{equation*}

where \(\meps \) is known as the machine epsilon or unit roundoff. When single precision floating point numbers are used \(\meps \approx 10^{-8} \text{,}\) yielding roughly eight decimal digits of accuracy in the stored value. When double precision floating point numbers are used \(\meps \approx 10^{-16} \text{,}\) yielding roughly sixteen decimal digits of accuracy in the stored value.
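
The single precision figure can be observed directly. The sketch below is an illustration under stated assumptions, not part of the original text: it assumes Python floats are IEEE 754 double precision, uses the standard `struct` module with format `"<f"` to round a value to IEEE single precision, and introduces a hypothetical helper `to_single`. With a 24-digit mantissa and rounding, the relative error should not exceed \(2^{-24} \approx 6 \times 10^{-8} \text{.}\)

```python
import struct

def to_single(x: float) -> float:
    """Round a double precision value to the nearest IEEE single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

chi = 4.0 / 3.0                      # already rounded once, to double precision
chi_single = to_single(chi)          # rounded again, to single precision
rel_err = abs(chi - chi_single) / abs(chi)

print(chi_single, rel_err)
assert rel_err <= 2.0 ** -24         # bound for t = 24 with rounding
```

Here the double precision value plays the role of \(\chi \text{,}\) which is reasonable because its own storage error is many orders of magnitude smaller than the single precision error being measured.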

Example 6.2.2.3.

As in Example 6.2.2.2, the number \(4/3 = 1.3333\cdots \) can be written as \(.10101 \cdots \times 2^1 \text{.}\)

Now, if \(t = 4 \) then this would be rounded to

\begin{equation*} .1011 \times 2^1, \end{equation*}

which equals the number

\begin{equation*} \begin{array}{l} .1011 \times 2^1 \\ ~~~ = ~~~~ \\ \left( \frac{1}{2} + \frac{0}{4} + \frac{1}{8} + \frac{1}{16} \right) \times 2^1 \\ ~~~ = ~~~~ \\ 0.6875 \times 2 \\ ~~~ = ~~~~ \lt \mbox{ convert to decimal } \gt \\ 1.375 \end{array} \end{equation*}

The relative error equals

\begin{equation*} \frac{\vert 1.333\cdots - 1.375 \vert} {1.333\cdots} = 0.03125. \end{equation*}
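
The rounding case can be reproduced in exact arithmetic. This is a minimal sketch, not part of the original text: the name `round_to_t_digits` is hypothetical, and ties are rounded upward, which is one of several possible tie-breaking rules (the text does not specify one).

```python
from fractions import Fraction
import math

def round_to_t_digits(chi: Fraction, t: int) -> Fraction:
    """Round positive chi to t binary mantissa digits (ties rounded upward)."""
    if chi <= 0:
        raise ValueError("this sketch handles positive chi only")
    # Find e such that 2^(e-1) <= chi < 2^e (normalized mantissa).
    e = 0
    while chi >= Fraction(2) ** e:
        e += 1
    while chi < Fraction(2) ** (e - 1):
        e -= 1
    last_digit = Fraction(2) ** (e - t)   # weight of digit d_{t-1}
    return math.floor(chi / last_digit + Fraction(1, 2)) * last_digit

chi = Fraction(4, 3)
check_chi = round_to_t_digits(chi, 4)
rel_err = abs(chi - check_chi) / chi

print(check_chi, float(check_chi))   # 11/8 1.375
print(rel_err, float(rel_err))       # 1/32 0.03125
assert rel_err <= Fraction(1, 2 ** 4)   # the bound 2^-t for t = 4
```

The observed relative error \(1/32 \) is half of the bound \(2^{-t} = 1/16 \text{,}\) so the example is consistent with the inequality for rounding.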

Definition 6.2.2.4. Machine epsilon (unit roundoff).

The machine epsilon (unit roundoff), \(\meps \text{,}\) is defined as the smallest positive floating point number \(\chi \) such that the floating point number that represents \(1 + \chi \) is greater than one.
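
A common way to estimate this quantity on an actual machine is to halve a candidate until adding it to one no longer changes the stored result. The sketch below is an illustration under stated assumptions, not part of the original text: it assumes Python floats are IEEE 754 double precision with round-to-nearest, and it searches only over powers of two, so it returns the gap between 1 and the next larger floating point number rather than testing every possible \(\chi \text{.}\)

```python
import sys

def estimate_machine_epsilon() -> float:
    """Halve eps while 1 + eps/2 is still stored as a number greater than 1."""
    eps = 1.0
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps

eps = estimate_machine_epsilon()
print(eps)
print(eps == sys.float_info.epsilon)   # compare with the value Python reports
```

On a system with a 53-digit binary mantissa the loop is expected to stop at \(2^{-52} \text{,}\) which matches \(2^{-(t-1)} \) with \(t = 53 \text{.}\)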

Remark 6.2.2.5.

The quantity \(\meps\) is machine dependent. It is a function of the parameters characterizing how a specific architecture converts reals to floating point numbers.

Homework 6.2.2.1.

Assume a floating point number system with \(\beta = 2 \text{,}\) a mantissa with \(t \) digits, and truncation when storing.

Write the number \(1 \) as a floating point number in this system.
What is the \(\meps \) for this system?

Solution

Write the number \(1 \) as a floating point number.

Answer:

\begin{equation*} \begin{array}[t]{c} \underbrace{ . 1 0\cdots 0} \\ t \\ \mbox{ digits} \end{array} \times 2^1. \end{equation*}

What is the \(\meps \) for this system?

Answer:

\begin{equation*} \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ . 1 0\cdots 0} \\ t \mbox{ digits} \end{array} \times 2^1} \\ 1 \end{array} + \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ .0 0 \cdots 1} \\ t \mbox{ digits} \end{array} \times 2^1} \\ 2^{-(t-1)} \end{array} = \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ .1 0\cdots 1} \\ t \mbox{ digits} \end{array} \times 2^1} \\ \gt 1 \end{array} \end{equation*}

and

\begin{equation*} \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ . 1 0\cdots 0} \\ t \mbox{ digits} \end{array} \times 2^1} \\ 1 \end{array} + \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ .0 0 \cdots 0} \\ t \mbox{ digits} \end{array} 1 1 \cdots \times 2^1} \\ \lt 2^{-(t-1)} \end{array} = \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ .1 0\cdots 0} \\ t \mbox{ digits} \end{array} 1 1 \cdots \times 2^1} \\ \mbox{ truncates to } 1 \end{array} \end{equation*}

Notice that

\begin{equation*} \begin{array}[t]{c} \underbrace{ .0 0 \cdots 1} \\ t \mbox{ digits} \end{array} \times 2^1 \end{equation*}

can be represented as

\begin{equation*} \begin{array}[t]{c} \underbrace{ .1 0 \cdots 0} \\ t \mbox{ digits} \end{array} \times 2^{-(t-2)} \end{equation*}

and

\begin{equation*} \begin{array}[t]{c} \underbrace{ .0 0 \cdots 0} \\ t \mbox{ digits} \end{array} 1 1 \cdots \times 2^1 \end{equation*}

as

\begin{equation*} \begin{array}[t]{c} \underbrace{ .1 1 \cdots 1} \\ t \mbox{ digits} \end{array} \times 2^{-(t-1)} \end{equation*}

Hence \(\meps = 2^{-(t-1)} \text{.}\)
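
The conclusion can be checked by simulating the number system of this homework. The following is a minimal sketch, not part of the original text: `fl_truncate` is a hypothetical helper that stores a positive rational with \(t \) binary mantissa digits by truncation, and the values of \(t \) tried are arbitrary.

```python
from fractions import Fraction
import math

def fl_truncate(x: Fraction, t: int) -> Fraction:
    """Store positive x with t binary mantissa digits, truncating the rest."""
    if x <= 0:
        raise ValueError("this sketch handles positive x only")
    # Find e such that 2^(e-1) <= x < 2^e (normalized mantissa).
    e = 0
    while x >= Fraction(2) ** e:
        e += 1
    while x < Fraction(2) ** (e - 1):
        e -= 1
    last_digit = Fraction(2) ** (e - t)   # weight of digit d_{t-1}
    return math.floor(x / last_digit) * last_digit

for t in (4, 8, 24):
    meps = Fraction(1, 2 ** (t - 1))
    # Adding meps to 1 changes the stored value ...
    assert fl_truncate(1 + meps, t) > 1
    # ... while anything slightly smaller is truncated away.
    slightly_less = meps - Fraction(1, 2 ** (4 * t))
    assert fl_truncate(1 + slightly_less, t) == 1

print("meps = 2^-(t-1) confirmed for t = 4, 8, 24")
```

The second assertion uses one particular value below \(2^{-(t-1)} \text{;}\) the argument in the solution shows that the same holds for every smaller positive value.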