ALAFF Storing real numbers as floating point numbers

Subsection 6.2.1 Storing real numbers as floating point numbers

Only a finite number of (binary) digits can be used to store a real number in the memory of a computer. For so-called single-precision and double-precision floating point numbers, 32 bits and 64 bits are typically employed, respectively.
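As a small illustration of these sizes, the following Python sketch packs the same real number into a 32-bit and a 64-bit format with the standard `struct` module. It assumes that the format codes `"f"` and `"d"` correspond to single and double precision on the platform at hand, which is the usual case; the variable names are chosen only for this example.

```python
import struct

# Pack the same real number as a 32-bit and as a 64-bit floating point number.
single = struct.pack("<f", 0.1)   # single precision: 4 bytes
double = struct.pack("<d", 0.1)   # double precision: 8 bytes

bits_single = 8 * len(single)
bits_double = 8 * len(double)

# Unpacking shows that the two formats store different nearby values.
value_single = struct.unpack("<f", single)[0]
value_double = struct.unpack("<d", double)[0]

print(bits_single, bits_double)
print(value_single, value_double)
```

The two unpacked values differ because fewer binary digits are available in the 32-bit format.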

Recall that any real number can be written as \(\mu \times \beta^e \text{,}\) where \(\beta \) is the base (an integer greater than one), \(\mu \in [ -1, 1 ] \) is the mantissa, and \(e \) is the exponent (an integer). For our discussion, we will define the set of floating point numbers, \(F \text{,}\) as the set of all numbers \(\chi = \mu \times \beta^e \) such that

\(\beta = 2 \text{,}\)
\(\mu = \pm. \delta_0 \delta_1 \cdots \delta_{t-1}\) (\(\mu \) has only \(t \) (binary) digits, where \(\delta_j \in \{ 0, 1 \} \)),
\(\delta_0 = 0 \) iff \(\mu = 0 \) (the mantissa is normalized), and
\(-L \leq e \leq U \text{.}\)

With this, the elements in \(F \) can be stored with a finite number of (binary) digits.
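To make the definition concrete, here is a minimal sketch that enumerates \(F \) for small parameters using exact rational arithmetic. The function name `floating_point_set` and the choice \(t = 3 \text{,}\) \(L = U = 1 \) are assumptions made only for this illustration.

```python
from fractions import Fraction

def floating_point_set(t, L, U):
    """Return the sorted elements of F for base 2, t mantissa digits, -L <= e <= U."""
    F = {Fraction(0)}  # mu = 0 is the only mantissa with delta_0 = 0
    # Normalized mantissas .1 delta_1 ... delta_{t-1} equal m / 2^t
    # for integers m with 2^(t-1) <= m < 2^t.
    for m in range(2 ** (t - 1), 2 ** t):
        mu = Fraction(m, 2 ** t)
        for e in range(-L, U + 1):
            chi = mu * Fraction(2) ** e
            F.add(chi)
            F.add(-chi)
    return sorted(F)

F = floating_point_set(t=3, L=1, U=1)
largest = max(F)
smallest_positive = min(chi for chi in F if chi > 0)

print(len(F), largest, smallest_positive)
```

For these parameters the set is small enough to list by hand: four normalized mantissas, three exponents, two signs, plus zero.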

Example 6.2.1.1.

Let \(\beta = 2 \text{,}\) \(t = 3 \text{,}\) \(\mu = .101 \text{,}\) and \(e = 1 \text{.}\) Then

\begin{equation*} \begin{array}{l} \mu \times \beta^e \\ ~~~ = ~~~~ \\ .101 \times 2^1 \\ ~~~ = ~~~~ \\ \left( 1 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3} \right) \times 2^1 \\ ~~~ = ~~~~ \\ \left( \frac{1}{2} + \frac{0}{4} + \frac{1}{8} \right) \times 2 \\ ~~~ = ~~~~ \\ 1.25 \end{array} \end{equation*}
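The same evaluation can be checked with a few lines of Python. This is only a sketch: the helper `mantissa_value` is a hypothetical name introduced here, and exact fractions are used so that no rounding enters the check.

```python
from fractions import Fraction

def mantissa_value(digits):
    """Value of .delta_0 delta_1 ... for a string of binary digits."""
    return sum(Fraction(int(d), 2 ** (j + 1)) for j, d in enumerate(digits))

mu = mantissa_value("101")   # 1/2 + 0/4 + 1/8
chi = mu * 2 ** 1            # mu times beta^e with beta = 2 and e = 1

print(float(chi))
```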

Observe that

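There is a largest number (in absolute value) that can be stored. Any number with larger magnitude "overflows". Typically (for example, under the IEEE 754 standard with its default rounding), this causes a value that denotes \(\pm \infty \) (Inf) to be stored; later operations on such a value, like \(\infty - \infty \text{,}\) then produce a NaN (Not-a-Number).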
There is a smallest number (in absolute value) that can be stored. Any number that is smaller in magnitude "underflows". Typically, this causes a zero to be stored.
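Both effects can be observed directly. The sketch below assumes that Python floats are IEEE 754 double-precision numbers, which holds on common platforms; that format also stores some numbers smaller than its smallest normalized number, which the set \(F \) defined above does not model, so the example divides far enough to pass below those as well.

```python
import math
import sys

largest = sys.float_info.max          # largest finite double
smallest_normal = sys.float_info.min  # smallest positive normalized double

overflowed = largest * 2.0                # magnitude too large to be stored
underflowed = smallest_normal / 2.0**60   # magnitude too small to be stored

print(overflowed, underflowed)
print(overflowed - overflowed)  # an operation on the overflowed value
```

Note that multiplication is used on purpose: Python raises an `OverflowError` for some operations, such as `**` on floats, instead of returning the overflowed value.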

In practice, one needs to be careful to consider overflow and underflow. The following example illustrates the importance of paying attention to this.

Example 6.2.1.2.

Computing the (Euclidean) length of a vector is an operation we will frequently employ. Careful attention must be paid to overflow and underflow when computing it.

Given \(x \in \Rn \text{,}\) consider computing

\begin{equation} \| x \|_2 = \sqrt{\sum_{i=0}^{n-1} \chi_i^2} .\label{eqn-norm2}\tag{6.2.1} \end{equation}

Notice that

\begin{equation*} \| x \|_2 \leq \sqrt{n} \max_{i=0}^{n-1} \vert \chi_i \vert \end{equation*}

and hence, unless some \(\chi_i \) is close to overflowing, the result will not overflow. The problem is that if some element \(\chi_i \) has the property that \(\chi_i^2 \) overflows, intermediate results in the computation in (6.2.1) will overflow. The solution is to determine \(k \) such that

\begin{equation*} \vert \chi_k \vert = \max_{i=0}^{n-1} \vert \chi_i \vert \end{equation*}

and to then instead compute (assuming \(\chi_k \neq 0 \text{;}\) otherwise \(x = 0 \) and \(\| x \|_2 = 0 \))

\begin{equation*} \| x \|_2 = \vert \chi_k \vert \sqrt{\sum_{i=0}^{n-1} \left( \frac{\chi_i}{\chi_k} \right)^2} . \end{equation*}

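It can be argued that the same approach also avoids harmful underflow: every scaled term \(\left( \chi_i / \chi_k \right)^2 \) is at most one, and any term that underflows to zero is negligible relative to the term for \(i = k \text{,}\) which equals one.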
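The difference between evaluating (6.2.1) directly and scaling first can be sketched as follows. The function names `norm2_naive` and `norm2_scaled` and the test vectors are assumptions for this illustration, and the code assumes IEEE 754 double precision; it is not meant as a tuned implementation.

```python
import math

def norm2_naive(x):
    """Evaluate (6.2.1) directly; chi_i^2 may overflow or underflow."""
    return math.sqrt(sum(chi * chi for chi in x))

def norm2_scaled(x):
    """Divide by the entry of largest magnitude before squaring."""
    chi_k = max((abs(chi) for chi in x), default=0.0)
    if chi_k == 0.0:  # x = 0 (or empty): nothing to scale by
        return 0.0
    return chi_k * math.sqrt(sum((chi / chi_k) ** 2 for chi in x))

big = [1e200, 1e200]      # squares overflow
small = [1e-200, 1e-200]  # squares underflow

print(norm2_naive(big), norm2_scaled(big))
print(norm2_naive(small), norm2_scaled(small))
```

The scaled version makes two passes over the vector (one to find the maximum, one to accumulate), which is the price paid here for keeping the intermediate results in range.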

In our discussion, we mostly ignore this aspect of floating point computation.

Remark 6.2.1.3.

Any time a real number is stored in our computer, it is stored as a nearby floating point number (an element of \(F \)), obtained through either rounding or truncation. Nearby, of course, could mean that the number is stored exactly, if it happens to also be a floating point number.
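A short Python check illustrates the remark. It assumes IEEE 754 double precision with rounding to nearest, and it relies on the fact that `Fraction(x)` returns the exact value of the stored float `x`; the variable names are chosen only for this example.

```python
from fractions import Fraction

# Fraction(x) recovers the exact value of the stored floating point number.
stored_tenth = Fraction(0.1)
stored_half = Fraction(0.5)

tenth_is_exact = stored_tenth == Fraction(1, 10)  # 1/10 has no finite binary expansion
half_is_exact = stored_half == Fraction(1, 2)     # 1/2 = .1 (binary) times 2^0

relative_error = abs(stored_tenth - Fraction(1, 10)) / Fraction(1, 10)

print(tenth_is_exact, half_is_exact, float(relative_error))
```

The stored value of one tenth is close to, but not equal to, the real number, while one half is stored exactly.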