## Subsection 6.2.2 Error in storing a real number as a floating point number

###### Remark 6.2.2.1.

We consider the case where a real number is truncated to become the stored floating point number. This makes the discussion a bit simpler.

Let positive $\chi$ be represented by

\begin{equation*} \chi = . \delta_0 \delta_1 \cdots \times 2^e, \end{equation*}

where $\delta_i$ are binary digits and $\delta_0 = 1$ (the mantissa is normalized). If $t$ binary digits are stored by our floating point system, then

\begin{equation*} \check \chi = . \delta_0 \delta_1 \cdots \delta_{t-1} \times 2^e \end{equation*}

is stored (if truncation is employed). Let $\delta\!\chi = \chi - \check \chi \text{.}$ Then

\begin{equation*} \begin{array}{rcl} \delta\!\chi \amp = \amp \begin{array}[t]{c} \underbrace{. \delta_0 \delta_1 \cdots \delta_{t-1} \delta_t \cdots \times 2^e} \\ \chi \end{array} - \begin{array}[t]{c} \underbrace{. \delta_0 \delta_1 \cdots \delta_{t-1} \times 2^e} \\ \check \chi \end{array} \\ \amp = \amp \begin{array}[t]{c} \underbrace{. 0 \cdots 00} \\ t \end{array} \delta_t \cdots \times 2^e \\ \amp \lt \amp \begin{array}[t]{c} \underbrace{. 0 \cdots 01} \\ t \end{array} \times 2^e = 2^{-t} 2^{e} . \end{array} \end{equation*}

Since $\chi$ is positive and $\delta_0 = 1 \text{,}$

\begin{equation*} \chi = . \delta_0 \delta_1 \cdots \times 2^e \geq \frac{1}{2} \times 2^e . \end{equation*}

Thus,

\begin{equation*} \frac{\delta\!\chi}{\chi} \leq \frac{2^{-t} 2^{e}}{\frac{1}{2} 2^{e}} = 2^{-(t-1)} , \end{equation*}

which can also be written as

\begin{equation*} \delta\!\chi \leq 2^{-(t-1)} \chi. \end{equation*}

A careful analysis of what happens when $\chi$ equals zero or is negative yields

\begin{equation*} \vert \delta\!\chi \vert \leq 2^{-(t-1)} \vert \chi \vert. \end{equation*}

###### Example 6.2.2.2.

The number $4/3 = 1.3333\cdots$ can be written as

\begin{equation*} \begin{array}{l} 1.3333 \cdots \\ ~~~ = ~~~~ \\ 1 + \frac{0}{2} + \frac{1}{4} + \frac{0}{8} + \frac{1}{16} + \cdots \\ ~~~ = ~~~~ \lt \mbox{ convert to binary representation } \gt \\ 1.0101 \cdots \times 2^0 \\ ~~~ = ~~~~ \lt \mbox{ normalize } \gt \\ .10101 \cdots \times 2^1 \end{array} \end{equation*}

Now, if $t = 4$ then this would be truncated to

\begin{equation*} .1010 \times 2^1, \end{equation*}

which equals the number

\begin{equation*} \begin{array}{l} .1010 \times 2^1 \\ ~~~ = ~~~~ \\ \left( \frac{1}{2} + \frac{0}{4} + \frac{1}{8} + \frac{0}{16} \right) \times 2^1 \\ ~~~ = ~~~~ \\ 0.625 \times 2 \\ ~~~ = ~~~~ \lt \mbox{ convert to decimal } \gt \\ 1.25 \end{array} \end{equation*}

The relative error equals

\begin{equation*} \frac{1.333\cdots - 1.25} {1.333\cdots} = 0.0625. \end{equation*}
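The truncation in Example 6.2.2.2 can be checked numerically. The following is a minimal sketch, not part of the original text: the helper name `truncate_to_t_bits` is hypothetical, and it assumes that $\chi$ is positive and that $t$ is small enough for every intermediate value to be exact in Python's double precision floats. It relies on `math.frexp`, which returns a mantissa in $[1/2, 1)$ and an exponent, matching the normalized form $. \delta_0 \delta_1 \cdots \times 2^e$ used above.

```python
import math


def truncate_to_t_bits(chi: float, t: int) -> float:
    """Keep only the first t binary digits of the normalized mantissa of chi > 0.

    Hypothetical helper for illustration; assumes chi > 0 and small t.
    """
    mantissa, e = math.frexp(chi)            # chi = mantissa * 2**e, 0.5 <= mantissa < 1
    kept = math.floor(mantissa * 2**t)       # integer made of the first t binary digits
    return math.ldexp(float(kept), e - t)    # kept * 2**(e - t)


chi = 4 / 3
t = 4
chi_check = truncate_to_t_bits(chi, t)       # .1010 x 2^1
rel_err = (chi - chi_check) / chi            # approximately 0.0625

print(chi_check)
print(rel_err, "<=", 2.0 ** -(t - 1))
```

The computed relative error, about $0.0625$, is consistent with the bound $2^{-(t-1)} = 0.125$ for $t = 4 \text{.}$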

If $\check \chi$ is computed by rounding instead of truncating, then

\begin{equation*} \vert \delta\!\chi \vert \leq 2^{-t} \vert \chi \vert . \end{equation*}

We can abstract away from the details of the base that is chosen and whether rounding or truncation is used by stating that storing $\chi$ as the floating point number $\check \chi$ obeys

\begin{equation*} \vert \delta\!\chi \vert \leq \meps \vert \chi \vert \end{equation*}

where $\meps$ is known as the machine epsilon or unit roundoff. When single precision floating point numbers are used, $\meps \approx 10^{-8} \text{,}$ yielding roughly eight decimal digits of accuracy in the stored value. When double precision floating point numbers are used, $\meps \approx 10^{-16} \text{,}$ yielding roughly sixteen decimal digits of accuracy in the stored value.

###### Example 6.2.2.3.

The number $4/3 = 1.3333\cdots$ can be written as

\begin{equation*} \begin{array}{l} 1.3333 \cdots \\ ~~~ = ~~~~ \\ 1 + \frac{0}{2} + \frac{1}{4} + \frac{0}{8} + \frac{1}{16} + \cdots \\ ~~~ = ~~~~ \lt \mbox{ convert to binary representation } \gt \\ 1.0101 \cdots \times 2^0 \\ ~~~ = ~~~~ \lt \mbox{ normalize } \gt \\ .10101 \cdots \times 2^1 \end{array} \end{equation*}

Now, if $t = 4$ then this would be rounded to

\begin{equation*} .1011 \times 2^1, \end{equation*}

which equals the number

\begin{equation*} \begin{array}{l} .1011 \times 2^1 \\ ~~~ = ~~~~ \\ \left( \frac{1}{2} + \frac{0}{4} + \frac{1}{8} + \frac{1}{16} \right) \times 2^1 \\ ~~~ = ~~~~ \\ 0.6875 \times 2 \\ ~~~ = ~~~~ \lt \mbox{ convert to decimal } \gt \\ 1.375 \end{array} \end{equation*}

The relative error equals

\begin{equation*} \frac{\vert 1.333\cdots - 1.375 \vert} {1.333\cdots} = 0.03125. \end{equation*}
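The rounded value in Example 6.2.2.3 can be checked the same way. This is a sketch under stated assumptions rather than a definitive implementation: the helper name `round_to_t_bits` is hypothetical, $\chi$ is assumed positive, $t$ is assumed small, and ties are rounded up (one of several possible tie-breaking rules, none of which is specified in the text).

```python
import math


def round_to_t_bits(chi: float, t: int) -> float:
    """Round the normalized mantissa of chi > 0 to t binary digits (ties round up).

    Hypothetical helper for illustration; assumes chi > 0 and small t.
    """
    mantissa, e = math.frexp(chi)                 # chi = mantissa * 2**e, 0.5 <= mantissa < 1
    kept = math.floor(mantissa * 2**t + 0.5)      # nearest integer to mantissa * 2**t
    return math.ldexp(float(kept), e - t)         # kept * 2**(e - t)


chi = 4 / 3
t = 4
chi_check = round_to_t_bits(chi, t)               # .1011 x 2^1
rel_err = abs(chi - chi_check) / chi              # approximately 0.03125

print(chi_check)
print(rel_err, "<=", 2.0 ** -t)
```

The computed relative error, about $0.03125$, is consistent with the rounding bound $2^{-t} = 0.0625$ for $t = 4 \text{,}$ and is half the error observed with truncation for the same number.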

###### Definition 6.2.2.4. Machine epsilon (unit roundoff).

The machine epsilon (unit roundoff), $\meps \text{,}$ is defined as the smallest positive floating point number $\chi$ such that the floating point number that represents $1 + \chi$ is greater than one.

###### Remark 6.2.2.5.

The quantity $\meps$ is machine dependent. It is a function of the parameters characterizing how a specific architecture converts reals to floating point numbers.
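To see this machine dependence in practice, one can probe the floating point system of the machine at hand. The sketch below is an illustration, not the text's method: the function name `smallest_power_of_two_eps` is hypothetical, and it assumes Python floats are IEEE 754 double precision numbers (53 binary digits in the mantissa, rounding to nearest), which is the case on common platforms. It searches only over powers of two, so it finds the smallest power of two $\chi$ for which the stored $1 + \chi$ exceeds one, which need not be the smallest such floating point number overall.

```python
import sys


def smallest_power_of_two_eps() -> float:
    """Halve eps until adding the next candidate to 1.0 no longer changes 1.0.

    Hypothetical helper; only powers of two are tried.
    """
    eps = 1.0
    while 1.0 + eps / 2 > 1.0:
        eps /= 2
    return eps


eps = smallest_power_of_two_eps()
print(eps)
print(eps == sys.float_info.epsilon)   # spacing between 1.0 and the next larger float
```

Under the stated assumption, the value found equals $2^{-52} \text{,}$ which is $2^{-(t-1)}$ for $t = 53 \text{.}$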

###### Homework 6.2.2.1.

Assume a floating point number system with $\beta = 2 \text{,}$ a mantissa with $t$ digits, and truncation when storing.

• Write the number $1$ as a floating point number in this system.

• What is the $\meps$ for this system?

Solution

• Write the number $1$ as a floating point number.

\begin{equation*} \begin{array}[t]{c} \underbrace{ . 1 0\cdots 0} \\ t \\ \mbox{ digits} \end{array} \times 2^1. \end{equation*}

• What is the $\meps$ for this system?

\begin{equation*} \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ . 1 0\cdots 0} \\ t \mbox{ digits} \end{array} \times 2^1} \\ 1 \end{array} + \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ .0 0 \cdots 1} \\ t \mbox{ digits} \end{array} \times 2^1} \\ 2^{-(t-1)} \end{array} = \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ .1 0\cdots 1} \\ t \mbox{ digits} \end{array} \times 2^1} \\ \gt 1 \end{array} \end{equation*}

and

\begin{equation*} \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ . 1 0\cdots 0} \\ t \mbox{ digits} \end{array} \times 2^1} \\ 1 \end{array} + \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ .0 0 \cdots 0} \\ t \mbox{ digits} \end{array} 1 1 \cdots \times 2^1} \\ \lt 2^{-(t-1)} \end{array} = \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ .1 0\cdots 0} \\ t \mbox{ digits} \end{array} 1 1 \cdots \times 2^1} \\ \mbox{ truncates to } 1 \end{array} \end{equation*}

Notice that

\begin{equation*} \begin{array}[t]{c} \underbrace{ .0 0 \cdots 1} \\ t \mbox{ digits} \end{array} \times 2^1 \end{equation*}

can be represented as

\begin{equation*} \begin{array}[t]{c} \underbrace{ .1 0 \cdots 0} \\ t \mbox{ digits} \end{array} \times 2^{-(t-2)} \end{equation*}

and

\begin{equation*} \begin{array}[t]{c} \underbrace{ .0 0 \cdots 0} \\ t \mbox{ digits} \end{array} 1 1 \cdots \times 2^1 \end{equation*}

as

\begin{equation*} \begin{array}[t]{c} \underbrace{ .1 1 \cdots 1} \\ t \mbox{ digits} \end{array} \times 2^{-(t-1)} \end{equation*}

Hence $\meps = 2^{-(t-1)} \text{.}$
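The result $\meps = 2^{-(t-1)}$ can be sanity-checked by simulating a $t$-digit binary system with truncation. The following is a minimal sketch: the helper name `fl_trunc` and the particular test values are assumptions chosen for illustration, and $t$ is kept small so that all quantities are exact in Python's double precision floats.

```python
import math


def fl_trunc(x: float, t: int) -> float:
    """Store x > 0 in a base-2 system with a t-digit mantissa, using truncation.

    Hypothetical helper for illustration; assumes x > 0 and small t.
    """
    mantissa, e = math.frexp(x)                        # x = mantissa * 2**e, 0.5 <= mantissa < 1
    kept = math.floor(mantissa * 2**t)                 # first t binary digits as an integer
    return math.ldexp(float(kept), e - t)              # kept * 2**(e - t)


for t in range(2, 11):
    eps = 2.0 ** -(t - 1)
    # 1 + eps is stored as a number greater than one ...
    bigger = fl_trunc(1.0 + eps, t) > 1.0
    # ... while 1 + chi truncates back to one for sample values chi < eps.
    smaller = all(
        fl_trunc(1.0 + chi, t) == 1.0
        for chi in (eps - 2.0 ** -20, 0.75 * eps, eps / 2)
    )
    print(t, eps, bigger, smaller)
```

For each $t$ tried, both checks hold, in agreement with the argument above. The sample values of $\chi$ are only spot checks and do not replace that argument.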