## Subsection 6.2.1 Storing real numbers as floating point numbers

Only a finite number of (binary) digits can be used to store a real number in the memory of a computer. For so-called single-precision and double-precision floating point numbers, 32 bits and 64 bits are typically employed, respectively.

Recall that any real number can be written as $\mu \times \beta^e \text{,}$ where $\beta$ is the base (an integer greater than one), $\mu \in ( -1, 1 )$ is the mantissa, and $e$ is the exponent (an integer). For our discussion, we will define the set of floating point numbers, $F \text{,}$ as the set of all numbers $\chi = \mu \beta^e$ such that

• $\beta = 2 \text{,}$

• $\mu = \pm. \delta_0 \delta_1 \cdots \delta_{t-1}$ ($\mu$ has only $t$ (binary) digits), where $\delta_j \in \{ 0, 1 \} \text{,}$

• $\delta_0 = 0$ iff $\mu = 0$ (the mantissa is normalized), and

• $-L \leq e \leq U \text{.}$

With this, the elements in $F$ can be stored with a finite number of (binary) digits.
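To make the representation concrete, here is a minimal Python sketch that splits a double-precision number into a mantissa and exponent of the form described above. It relies on the standard library function `math.frexp`, which returns a mantissa with $0.5 \leq \vert \mu \vert \lt 1$ (or $\mu = 0$), which is the normalization $\delta_0 = 1$ for nonzero numbers. The helper names `mantissa_exponent` and `mantissa_bits` are hypothetical and chosen only for this illustration.

```python
import math

def mantissa_exponent(chi):
    """Return (mu, e) with chi == mu * 2**e and 0.5 <= |mu| < 1, or mu == 0."""
    return math.frexp(chi)

def mantissa_bits(mu, t):
    """Return the first t binary digits delta_0 ... delta_{t-1} of |mu|, for |mu| < 1."""
    frac = abs(mu)
    digits = []
    for _ in range(t):
        frac *= 2.0        # move the next binary digit in front of the point
        digit = int(frac)  # that digit is 0 or 1
        digits.append(str(digit))
        frac -= digit
    return "".join(digits)

# 6 = 0.75 * 2**3, and 0.75 is .1100 in binary, so delta_0 = 1.
mu, e = mantissa_exponent(6.0)
print(mu, e, mantissa_bits(mu, 4))
```

For zero, `math.frexp` returns a zero mantissa, so every digit (including $\delta_0$) is zero, consistent with the normalization condition.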

Observe that

• There is a largest number (in absolute value) that can be stored. Any number with larger magnitude "overflows". In IEEE 754 arithmetic, this typically causes a value that denotes infinity (Inf) to be stored; further operations on such a value can then yield a NaN (Not-a-Number).

• There is a smallest number (in absolute value) that can be stored. Any number that is smaller in magnitude "underflows". Typically, this causes a zero to be stored.
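Both effects can be observed directly. The following is a small sketch, assuming Python floats are IEEE 754 double-precision numbers (the case on common platforms); `sys.float_info` reports the limits of the format.

```python
import math
import sys

largest = sys.float_info.max    # largest finite double, about 1.8e308
smallest = sys.float_info.min   # smallest positive normalized double, about 2.2e-308

overflowed = largest * 2.0              # magnitude too large to store: inf
underflowed = smallest * 1e-20          # magnitude too small to store: 0.0
not_a_number = overflowed - overflowed  # inf - inf has no meaningful value: nan

print(overflowed, underflowed, not_a_number)
```

Note that IEEE 754 also provides subnormal numbers below `sys.float_info.min`, so a result slightly below that threshold loses precision gradually before it finally becomes zero.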

In practice, one needs to be careful to consider overflow and underflow. The following example illustrates the importance of paying attention to this.

###### Example 6.2.1.1.

Computing the (Euclidean) length of a vector is an operation we will frequently employ. Careful attention must be paid to overflow and underflow when computing it.

Given $x \in \Rn \text{,}$ consider computing

\begin{equation} \| x \|_2 = \sqrt{\sum_{i=0}^{n-1} \chi_i^2} .\label{eqn-norm2}\tag{6.2.1} \end{equation}

Notice that

\begin{equation*} \| x \|_2 \leq \sqrt{n} \max_{i=0}^{n-1} \vert \chi_i \vert \end{equation*}

and hence, unless some $\chi_i$ is close to overflowing, the result will not overflow. The problem is that if some element $\chi_i$ has the property that $\chi_i^2$ overflows, intermediate results in the computation in (6.2.1) will overflow. The solution is to determine $k$ such that

\begin{equation*} \vert \chi_k \vert = \max_{i=0}^{n-1} \vert \chi_i \vert \end{equation*}
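One standard way to use this $k$, given here as an assumption about how the idea is usually completed, is to divide every entry by $\vert \chi_k \vert$ before squaring, so that $\| x \|_2 = \vert \chi_k \vert \sqrt{\sum_{i=0}^{n-1} ( \chi_i / \chi_k )^2} \text{.}$ Every scaled entry then has magnitude at most one, so the squares cannot overflow. The sketch below contrasts this with the direct evaluation of the formula; the function names `norm2_naive` and `norm2_scaled` are hypothetical.

```python
import math

def norm2_naive(x):
    """Evaluate the defining formula directly; chi * chi may overflow or underflow."""
    return math.sqrt(sum(chi * chi for chi in x))

def norm2_scaled(x):
    """Scale by the entry of largest magnitude before squaring."""
    chi_k = max((abs(chi) for chi in x), default=0.0)
    if chi_k == 0.0:
        return 0.0  # zero vector (or empty input): avoid dividing by zero
    total = 0.0
    for chi in x:
        r = chi / chi_k  # |r| <= 1, so r * r cannot overflow
        total += r * r
    return chi_k * math.sqrt(total)

x = [3e200, 4e200]
print(norm2_naive(x))   # the squares overflow, so the result is inf
print(norm2_scaled(x))  # close to 5e200
```

The same scaling also helps with underflow: for `[3e-200, 4e-200]` the direct formula returns zero because each square underflows, while the scaled version returns a value close to `5e-200`. This sketch does not treat inputs that already contain infinities or NaNs.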

Any time a real number is stored in our computer, it is stored as a nearby floating point number (an element of $F$), obtained through either rounding or truncation.
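As an illustration, the decimal number $0.1$ has no finite binary expansion, so it cannot be an element of $F \text{.}$ The sketch below, which assumes IEEE 754 double precision with rounding to nearest, uses exact rational arithmetic from the standard library to compare the stored value with the intended one.

```python
from fractions import Fraction

stored = Fraction(0.1)      # exact rational value of the double nearest to 1/10
intended = Fraction(1, 10)  # the real number we meant to store

relative_error = abs(stored - intended) / intended
bound = Fraction(1, 2**53)  # assumed bound for round-to-nearest with a 53-bit mantissa

print(stored == intended)       # False: a nearby number was stored instead
print(relative_error <= bound)  # True: the stored number is very close
```

The stored value differs from $0.1$, but only by a relative error on the order of $2^{-53} \text{.}$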