## Subsection 6.2.2 Error in storing a real number as a floating point number

###### Remark 6.2.2.1.

We consider the case where a real number is truncated to become the stored floating point number. This makes the discussion a bit simpler.

Let positive $\chi$ be represented by

\begin{equation*} \chi = . d_0 d_1 \cdots \times 2^e, \end{equation*}

where $d_i$ are binary digits and $d_0 = 1$ (the mantissa is normalized). If $t$ binary digits are stored by our floating point system, then

\begin{equation*} \check \chi = . d_0 d_1 \cdots d_{t-1} \times 2^e \end{equation*}

is stored (if truncation is employed). Let $\delta\!\chi = \chi - \check \chi \text{.}$ Then

\begin{equation*} \begin{array}{rcl} \delta\!\chi \amp = \amp \begin{array}[t]{c} \underbrace{. d_0 d_1 \cdots d_{t-1} d_t \cdots \times 2^e} \\ \chi \end{array} - \begin{array}[t]{c} \underbrace{. d_0 d_1 \cdots d_{t-1} \times 2^e} \\ \check \chi \end{array} \\ \amp = \amp \begin{array}[t]{c} \underbrace{. 0 \cdots 00} \\ t \end{array} d_t \cdots \times 2^e \\ \amp \lt \amp \begin{array}[t]{c} \underbrace{. 0 \cdots 01} \\ t \end{array} \times 2^e = 2^{-t} 2^{e} . \end{array} \end{equation*}

Since $\chi$ is positive and $d_0 = 1 \text{,}$

\begin{equation*} \chi = . d_0 d_1 \cdots \times 2^e \geq \frac{1}{2} \times 2^e . \end{equation*}

Thus,

\begin{equation*} \frac{\delta\!\chi}{\chi} \leq \frac{2^{-t} 2^{e}}{\frac{1}{2} 2^{e}} = 2^{-(t-1)} , \end{equation*}

which can also be written as

\begin{equation*} \delta\!\chi \leq 2^{-(t-1)} \chi. \end{equation*}

A careful analysis of what happens when $\chi$ equals zero or is negative yields

\begin{equation*} \vert \delta\!\chi \vert \leq 2^{-(t-1)} \vert \chi \vert. \end{equation*}

If $\check \chi$ is computed by rounding instead of truncating, then

\begin{equation*} \vert \delta\!\chi \vert \leq 2^{-t} \vert \chi \vert . \end{equation*}
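The truncation and rounding bounds above can be checked numerically. What follows is a minimal sketch in Python, assuming exact rational arithmetic via `fractions.Fraction`. The helper `store`, the choice $t = 5$, and the sample values are illustrative and not part of the text.

```python
from fractions import Fraction
import math

def store(chi, t, mode="truncate"):
    """Return the t-digit binary number . d_0 ... d_{t-1} x 2^e that stores chi > 0."""
    chi = Fraction(chi)
    # Find the exponent e with 1/2 <= chi / 2^e < 1 (normalized mantissa, d_0 = 1).
    e = 0
    while chi >= Fraction(2) ** e:
        e += 1
    while chi < Fraction(2) ** (e - 1):
        e -= 1
    # Shift the mantissa so that its first t digits sit left of the binary point.
    scaled = chi / Fraction(2) ** e * 2 ** t
    if mode == "truncate":
        digits = math.floor(scaled)
    else:
        digits = math.floor(scaled + Fraction(1, 2))  # round to nearest, ties up
    return Fraction(digits, 2 ** t) * Fraction(2) ** e

t = 5
samples = [Fraction(1, 3), Fraction(22, 7), Fraction(1, 10), Fraction(1000, 3)]
for chi in samples:
    delta_trunc = chi - store(chi, t, "truncate")
    delta_round = chi - store(chi, t, "round")
    # |delta chi| <= 2^{-(t-1)} |chi| for truncation, 2^{-t} |chi| for rounding.
    assert abs(delta_trunc) <= Fraction(1, 2 ** (t - 1)) * chi
    assert abs(delta_round) <= Fraction(1, 2 ** t) * chi
    print(chi, float(delta_trunc / chi), float(delta_round / chi))
```

Working with exact fractions keeps the check itself free of floating point error, so any violation of the bounds would point to the analysis rather than to the test.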

We can abstract away from the details of the base that is chosen and whether rounding or truncation is used by stating that storing $\chi$ as the floating point number $\check \chi$ obeys

\begin{equation*} \vert \delta\!\chi \vert \leq \meps \vert \chi \vert \end{equation*}

where $\meps$ is known as the machine epsilon or unit roundoff. When single precision floating point numbers are used, $\meps \approx 10^{-8} \text{,}$ yielding roughly eight decimal digits of accuracy in the stored value. When double precision floating point numbers are used, $\meps \approx 10^{-16} \text{,}$ yielding roughly sixteen decimal digits of accuracy in the stored value.
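To see these orders of magnitude on an actual machine, one can measure the relative error incurred when storing a real number. The sketch below assumes IEEE 754 single and double precision with round to nearest (an assumption, since the text does not name a standard). Under that assumption the bounds are $2^{-24} \approx 6 \times 10^{-8}$ and $2^{-53} \approx 1.1 \times 10^{-16} \text{.}$ The helper names and the choice $\chi = 1/3$ are illustrative.

```python
import struct
from fractions import Fraction

def to_double(chi):
    # Python floats are assumed to be IEEE 754 double precision.
    return float(chi)

def to_single(chi):
    # Round-trip through a 4-byte IEEE 754 single precision encoding.
    return struct.unpack("<f", struct.pack("<f", float(chi)))[0]

def relative_storage_error(chi, to_float):
    stored = Fraction(to_float(chi))  # exact value of the stored number
    return abs(chi - stored) / abs(chi)

chi = Fraction(1, 3)
err_single = relative_storage_error(chi, to_single)
err_double = relative_storage_error(chi, to_double)

print("single:", float(err_single), "bound:", 2.0 ** -24)
print("double:", float(err_double), "bound:", 2.0 ** -53)
```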

###### Definition 6.2.2.2. Machine epsilon (unit roundoff).

The machine epsilon (unit roundoff), $\meps \text{,}$ is defined as the smallest floating point number $\chi$ such that the floating point number that represents $1 + \chi$ is greater than one.
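This definition suggests a simple experiment: repeatedly halve a candidate until adding it to one no longer changes the stored result. The sketch below does this for Python floats, assumed here to be IEEE 754 double precision with round to nearest. It only searches over powers of two, so it is an approximation of the definition rather than a literal implementation; with rounding, some numbers slightly larger than half the returned value also produce a result greater than one.

```python
import sys

def smallest_power_of_two_increment():
    """Return the smallest power of two eps for which 1.0 + eps > 1.0 in float arithmetic."""
    eps = 1.0
    # Keep halving while the next candidate still changes the stored sum.
    while 1.0 + eps / 2 > 1.0:
        eps /= 2
    return eps

eps = smallest_power_of_two_increment()
print(eps, sys.float_info.epsilon)
```

On such a system the loop is expected to agree with `sys.float_info.epsilon`, which is the gap between one and the next larger floating point number.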

###### Remark 6.2.2.3.

The quantity $\meps$ is machine dependent. It is a function of the parameters characterizing how a specific architecture converts reals to floating point numbers.

###### Homework 6.2.2.1.

Assume a floating point number system with $\beta = 2 \text{,}$ a mantissa with $t$ digits, and truncation when storing.

• Write the number $1$ as a floating point number in this system.

• What is the $\meps$ for this system?

Solution
• Write the number $1$ as a floating point number.

\begin{equation*} \begin{array}[t]{c} \underbrace{ . 1 0\cdots 0} \\ t \mbox{ digits} \end{array} \times 2^1\text{.} \end{equation*}
• What is the $\meps$ for this system?

\begin{equation*} \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ . 1 0\cdots 0} \\ t \mbox{ digits} \end{array} \times 2^1} \\ 1 \end{array} + \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ .0 0 \cdots 1} \\ t \mbox{ digits} \end{array} \times 2^1} \\ 2^{-(t-1)} \end{array} = \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ .1 0\cdots 1} \\ t \mbox{ digits} \end{array} \times 2^1} \\ \gt 1 \end{array} \end{equation*}

and

\begin{equation*} \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ . 1 0\cdots 0} \\ t \mbox{ digits} \end{array} \times 2^1} \\ 1 \end{array} + \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ .0 0 \cdots 0} \\ t \mbox{ digits} \end{array} 1 1 \cdots \times 2^1} \\ \lt 2^{-(t-1)} \end{array} = \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ .1 0\cdots 0} \\ t \mbox{ digits} \end{array} 1 1 \cdots \times 2^1} \\ \mbox{ truncates to } 1 \end{array} \end{equation*}

Notice that

\begin{equation*} \begin{array}[t]{c} \underbrace{ .0 0 \cdots 1} \\ t \mbox{ digits} \end{array} \times 2^1 \end{equation*}

can be represented as

\begin{equation*} \begin{array}[t]{c} \underbrace{ .1 0 \cdots 0} \\ t \mbox{ digits} \end{array} \times 2^{-(t-2)} \end{equation*}

and

\begin{equation*} \begin{array}[t]{c} \underbrace{ .0 0 \cdots 0} \\ t \mbox{ digits} \end{array} 1 1 \cdots \times 2^1 \end{equation*}

as

\begin{equation*} \begin{array}[t]{c} \underbrace{ .1 1 \cdots 1} \\ t \mbox{ digits} \end{array} \times 2^{-(t-1)} \end{equation*}

Hence $\meps = 2^{-(t-1)} \text{.}$
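The conclusion can be sanity-checked by simulating the number system of this homework. The following is a minimal sketch, assuming exact rational arithmetic; the helper `truncate` and the range of $t$ values are illustrative. For each $t$ it confirms that $1 + 2^{-(t-1)}$ is stored as a number greater than one, while adding the largest floating point number below $2^{-(t-1)}$ truncates back to one.

```python
from fractions import Fraction
import math

def truncate(chi, t):
    """Truncate chi > 0 to t binary digits: . d_0 ... d_{t-1} x 2^e with d_0 = 1."""
    chi = Fraction(chi)
    # Find the exponent e with 1/2 <= chi / 2^e < 1.
    e = 0
    while chi >= Fraction(2) ** e:
        e += 1
    while chi < Fraction(2) ** (e - 1):
        e -= 1
    ulp = Fraction(2) ** (e - t)  # weight of the last stored digit d_{t-1}
    return math.floor(chi / ulp) * ulp

for t in range(2, 12):
    meps = Fraction(1, 2 ** (t - 1))
    # Largest t-digit floating point number below meps: .11...1 x 2^{-(t-1)}.
    below = meps * (1 - Fraction(1, 2 ** t))
    assert truncate(1 + meps, t) == 1 + meps   # stored result is greater than 1
    assert truncate(1 + below, t) == 1         # stored result truncates to 1
    print(t, meps)
```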