
Subsection 6.2.2 Error in storing a real number as a floating point number

Remark 6.2.2.1.

We consider the case where a real number is truncated to become the stored floating point number. This makes the discussion a bit simpler.

Let positive \(\chi \) be represented by

\begin{equation*} \chi = . d_0 d_1 \cdots \times 2^e, \end{equation*}

where \(d_i \) are binary digits and \(d_0 = 1 \) (the mantissa is normalized). If \(t \) binary digits are stored by our floating point system, then

\begin{equation*} \check \chi = . d_0 d_1 \cdots d_{t-1} \times 2^e \end{equation*}

is stored (if truncation is employed). Let \(\delta\!\chi = \chi - \check \chi \text{.}\) Then

\begin{equation*} \begin{array}{rcl} \delta\!\chi \amp = \amp \begin{array}[t]{c} \underbrace{. d_0 d_1 \cdots d_{t-1} d_t \cdots \times 2^e} \\ \chi \end{array} - \begin{array}[t]{c} \underbrace{. d_0 d_1 \cdots d_{t-1} \times 2^e} \\ \check \chi \end{array} \\ \amp = \amp \begin{array}[t]{c} \underbrace{. 0 \cdots 00} \\ t \end{array} d_t \cdots \times 2^e \\ \amp \lt \amp \begin{array}[t]{c} \underbrace{. 0 \cdots 01} \\ t \end{array} \times 2^e = 2^{-t} 2^{e} . \end{array} \end{equation*}

Since \(\chi \) is positive and \(d_0 = 1 \text{,}\)

\begin{equation*} \chi = . d_0 d_1 \cdots \times 2^e \geq \frac{1}{2} \times 2^e . \end{equation*}

Thus,

\begin{equation*} \frac{\delta\!\chi}{\chi} \leq \frac{2^{-t} 2^{e}}{\frac{1}{2} 2^{e}} = 2^{-(t-1)} , \end{equation*}

which can also be written as

\begin{equation*} \delta\!\chi \leq 2^{-(t-1)} \chi. \end{equation*}

A careful analysis of what happens when \(\chi \) equals zero or is negative yields

\begin{equation*} \vert \delta\!\chi \vert \leq 2^{-(t-1)} \vert \chi \vert. \end{equation*}

If \(\check \chi \) is computed by rounding instead of truncating, then

\begin{equation*} \vert \delta\!\chi \vert \leq 2^{-t} \vert \chi \vert . \end{equation*}
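The two bounds can be checked on a toy system with exact rational arithmetic. What follows is a minimal sketch, not part of the original text: the function name `store` and its `mode` parameter are hypothetical, and the mantissa is normalized as above so that \(\frac{1}{2} \leq . d_0 d_1 \cdots \lt 1 \text{.}\)

```python
from fractions import Fraction
import math

def store(chi, t, mode="truncate"):
    """Return the t-digit binary number .d0 d1 ... d_{t-1} x 2^e that
    represents chi, as an exact Fraction (hypothetical helper)."""
    chi = Fraction(chi)
    if chi == 0:
        return Fraction(0)
    sign = 1 if chi > 0 else -1
    a = abs(chi)
    # Find the exponent e such that 1/2 <= a / 2^e < 1.
    e = 0
    while a >= Fraction(2) ** e:
        e += 1
    while a < Fraction(2) ** (e - 1):
        e -= 1
    # Shift the mantissa so that its first t digits form the integer part.
    scaled = a / Fraction(2) ** e * 2 ** t
    if mode == "truncate":
        digits = math.floor(scaled)
    else:  # round to nearest
        digits = math.floor(scaled + Fraction(1, 2))
    return sign * Fraction(digits, 2 ** t) * Fraction(2) ** e

t = 4
chi = Fraction(1, 3)
rel_trunc = abs(chi - store(chi, t, "truncate")) / abs(chi)
rel_round = abs(chi - store(chi, t, "round")) / abs(chi)
print(rel_trunc, "<=", Fraction(2) ** -(t - 1))  # 1/16 <= 1/8
print(rel_round, "<=", Fraction(2) ** -t)        # 1/32 <= 1/16
```

Working with `Fraction` keeps the experiment free of the rounding of the host machine, so the only error observed is the one introduced by `store`.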

We can abstract away from the details of the base that is chosen and whether rounding or truncation is used by stating that storing \(\chi \) as the floating point number \(\check \chi \) obeys

\begin{equation*} \vert \delta\!\chi \vert \leq \meps \vert \chi \vert \end{equation*}

where \(\meps \) is known as the machine epsilon or unit roundoff. When single precision floating point numbers are used, \(\meps \approx 10^{-8} \text{,}\) yielding roughly eight decimal digits of accuracy in the stored value. When double precision floating point numbers are used, \(\meps \approx 10^{-16} \text{,}\) yielding roughly sixteen decimal digits of accuracy in the stored value.
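These orders of magnitude can be observed directly. The sketch below assumes that Python floats are IEEE 754 double precision numbers (\(t = 53 \)) and that the `struct` format `"f"` yields IEEE 754 single precision (\(t = 24 \)), both with rounding, so that the bound \(2^{-t} \) applies. The helper names `to_single` and `relative_error` are hypothetical.

```python
import struct
from fractions import Fraction

def to_single(x):
    """Round a Python float to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def relative_error(chi, stored):
    """Exact |chi - stored| / |chi|; Fraction(stored) converts without error."""
    return abs(chi - Fraction(stored)) / abs(chi)

chi = Fraction(1, 10)          # 1/10 has no finite binary expansion
as_double = float(chi)         # stored in double precision
as_single = to_single(as_double)

err_double = relative_error(chi, as_double)
err_single = relative_error(chi, as_single)

# Compare each relative error with the rounding bound 2^{-t}.
print(float(err_single), "<=", 2.0 ** -24)
print(float(err_double), "<=", 2.0 ** -53)
```

The choice of \(1/10 \) is only for illustration; any real number that is not exactly representable shows the same behavior.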

Definition 6.2.2.2. Machine epsilon (unit roundoff).

The machine epsilon (unit roundoff), \(\meps \text{,}\) is defined as the smallest floating point number \(\chi \) such that the floating point number that represents \(1 + \chi \) is greater than one.

Remark 6.2.2.3.

The quantity \(\meps\) is machine dependent. It is a function of the parameters characterizing how a specific architecture converts reals to floating point numbers.
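On a given machine, a value of this kind can be found experimentally. The following is a minimal sketch, assuming Python floats are IEEE 754 double precision numbers; the function name `machine_epsilon` is hypothetical. It only searches over powers of two, so what it finds is the gap between one and the next larger floating point number. Under rounding, values slightly smaller than this gap can also make the stored \(1 + \chi \) exceed one, which is why this should be read as an illustration of the definition rather than a literal implementation of it.

```python
import sys

def machine_epsilon():
    """Halve a candidate until adding half of it to 1 no longer changes 1."""
    eps = 1.0
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps

eps = machine_epsilon()
print(eps)
# The standard library reports the same platform dependent quantity.
print(eps == sys.float_info.epsilon)
```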

Homework 6.2.2.1.

Assume a floating point number system with \(\beta = 2 \text{,}\) a mantissa with \(t \) digits, and truncation when storing.

  • Write the number \(1 \) as a floating point number in this system.

  • What is the \(\meps \) for this system?

Solution
  • Write the number \(1 \) as a floating point number.

    Answer:

    \begin{equation*} \begin{array}[t]{c} \underbrace{ . 1 0\cdots 0} \\ t \mbox{ digits} \end{array} \times 2^1\text{.} \end{equation*}
  • What is the \(\meps \) for this system?

    Answer:

    \begin{equation*} \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ . 1 0\cdots 0} \\ t \mbox{ digits} \end{array} \times 2^1} \\ 1 \end{array} + \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ .0 0 \cdots 1} \\ t \mbox{ digits} \end{array} \times 2^1} \\ 2^{-(t-1)} \end{array} = \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ .1 0\cdots 1} \\ t \mbox{ digits} \end{array} \times 2^1} \\ \gt 1 \end{array} \end{equation*}

    and

    \begin{equation*} \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ . 1 0\cdots 0} \\ t \mbox{ digits} \end{array} \times 2^1} \\ 1 \end{array} + \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ .0 0 \cdots 0} \\ t \mbox{ digits} \end{array} 1 1 \cdots \times 2^1} \\ \lt 2^{-(t-1)} \end{array} = \begin{array}[t]{c} \underbrace{ \begin{array}[t]{c} \underbrace{ .1 0\cdots 0} \\ t \mbox{ digits} \end{array} 1 1 \cdots \times 2^1} \\ \mbox{ truncates to } 1 \end{array} \end{equation*}

    Notice that

    \begin{equation*} \begin{array}[t]{c} \underbrace{ .0 0 \cdots 1} \\ t \mbox{ digits} \end{array} \times 2^1 \end{equation*}

    can be represented as

    \begin{equation*} \begin{array}[t]{c} \underbrace{ .1 0 \cdots 0} \\ t \mbox{ digits} \end{array} \times 2^{-(t-2)} \end{equation*}

    and

    \begin{equation*} \begin{array}[t]{c} \underbrace{ .0 0 \cdots 0} \\ t \mbox{ digits} \end{array} 1 1 \cdots \times 2^1 \end{equation*}

    as

    \begin{equation*} \begin{array}[t]{c} \underbrace{ .1 1 \cdots 1} \\ t \mbox{ digits} \end{array} \times 2^{-(t-1)} \end{equation*}

    Hence \(\meps = 2^{-(t-1)} \text{.}\)
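The argument in this solution can be replayed numerically for a small \(t \text{.}\) This is a sketch under the assumptions of the homework (\(\beta = 2 \text{,}\) \(t \) digits, truncation), using exact rational arithmetic; the function name `truncate` and the choice \(t = 5 \) are hypothetical.

```python
from fractions import Fraction
import math

def truncate(chi, t):
    """Store positive chi as .d0 d1 ... d_{t-1} x 2^e by truncation."""
    chi = Fraction(chi)
    # Find the exponent e such that 1/2 <= chi / 2^e < 1.
    e = 0
    while chi >= Fraction(2) ** e:
        e += 1
    while chi < Fraction(2) ** (e - 1):
        e -= 1
    # Keep only the first t binary digits of the mantissa.
    digits = math.floor(chi / Fraction(2) ** e * 2 ** t)
    return Fraction(digits, 2 ** t) * Fraction(2) ** e

t = 5
meps = Fraction(2) ** -(t - 1)                 # .10...0 x 2^{-(t-2)}
just_below = (1 - Fraction(2) ** -t) * meps    # .11...1 x 2^{-(t-1)}

print(truncate(1 + meps, t) > 1)               # True
print(truncate(1 + just_below, t) == 1)        # True
```

Here `just_below` is the largest floating point number in the system that is smaller than `meps`, so no smaller floating point number can push the stored sum above one.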