Numerical Representation

IEEE Floating Point Standard

Precision, loss of significance, and the hidden dangers of computer arithmetic.

🎯 Lecture Goals

  1. Highlight the IEEE-754 standard.
  2. Understand precision.
  3. Explore loss of significance.
  4. Analyze cancellation errors.

Why this matters

  • IEEE-754 is the universal standard for hardware/software floating point arithmetic.
  • Floating point arithmetic appears in nearly every piece of code.
  • Even modest operations can yield loss of significant bits.
  • We must avoid pitfalls in common mathematical expressions.

Floating Point Representation

Normalized Floating-Point Form

\[ x = \pm(d_0.d_1d_2d_3\dots d_m)_{\beta} \times \beta^{e} \]
  • $d_0.d_1d_2\dots d_m$ is the mantissa (also called the significand).
  • $e$ is the exponent (negative, positive, or zero).
  • In normalized binary form, $d_0 = 1$ (and this leading bit is often not stored).

Base 10 Example

  • $1000.12345 \rightarrow (0.100012345)_{10} \times 10^4$
  • $0.000812345 \rightarrow (0.812345)_{10} \times 10^{-3}$

A Toy Model: 3-bit Mantissa

Suppose we have 3 bits for a mantissa and 2 bits for an exponent:

Bit layout: $e_1\,e_2 \mid .d_1\,d_2\,d_3$ (two exponent bits, three mantissa bits).

This produces the discrete values $0, \frac{1}{16}, \frac{2}{16}, \dots, \frac{7}{8}$, as enumerated in the sketch below.
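A minimal Python sketch that enumerates this toy system, assuming the two exponent bits encode $e \in \{-1, 0\}$ (an assumption consistent with the smallest nonzero value $\frac{1}{16}$ and the largest value $\frac{7}{8}$):

from fractions import Fraction

# Toy system: the mantissa .d1 d2 d3 takes the values k/8 for k = 0..7.
# Assumption: the two exponent bits encode e in {-1, 0}.
values = sorted({Fraction(k, 8) * Fraction(2) ** e
                 for k in range(8) for e in (-1, 0)})
print(values)   # 0, 1/16, 1/8, 3/16, ..., 3/4, 7/8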

Discrete Points on Real Line

Visualizing gaps between representable numbers.

Normalized Gaps

Effect of normalization on representable range.

Underflow

Computations whose results are too close to zero to represent; the result usually falls back to 0.

Overflow

Computations too large to represent. Considered a severe error.


IEEE Floating Point (IEEE 754)

Typical word lengths are 32-bit (Single) and 64-bit (Double). The goal is to use these bits to best represent the normalized number $x = \pm q \times 2^m$.

Single Precision (32-bit)

Bit layout: 1 (sign) | 8 (exponent) | 23 (mantissa)
  • Sign: 1 bit
  • Exponent: 8 bits, storing the biased value $m + 127$ (bias = 127).
  • Mantissa: 23 bits ($q$). Hidden leading bit ($1.f$).
  • Precision: ~6 decimal digits.
  • Range: $\approx 10^{-38}$ to $10^{38}$.

Double Precision (64-bit)

Bit layout: 1 (sign) | 11 (exponent) | 52 (mantissa)
  • Sign: 1 bit
  • Exponent: 11 bits, storing $m + 1023$ (bias = 1023).
  • Mantissa: 52 bits.
  • Precision: ~15 decimal digits.
  • Range: $\approx 10^{-308}$ to $10^{308}$.
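These parameters can be inspected directly; here is a small Python sketch (assuming NumPy is available):

import numpy as np

# Print machine epsilon, the smallest normalized value, and the largest
# finite value for single and double precision.
for t in (np.float32, np.float64):
    fi = np.finfo(t)
    print(t.__name__, fi.eps, fi.tiny, fi.max)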

Example: Converting $x = -52.125$ to Single Precision

  1. Convert to binary: $52 = 110100_2$, $0.125 = 0.001_2$. So $x = -(110100.001)_2$.
  2. Normalize ($1.f$ form): $x = -1.10100001 \times 2^5$.
  3. Exponent: $5 + 127 (\text{bias}) = 132 = (10000100)_2$.
  4. Mantissa: $10100001000\dots0$ (drop leading 1).
Result: 1 | 10000100 | 10100001000000000000000
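This bit pattern can be checked in Python with the standard struct module:

import struct

# Pack -52.125 as a big-endian IEEE-754 single and split the 32 bits
# into sign | exponent | mantissa.
bits = struct.unpack('>I', struct.pack('>f', -52.125))[0]
s = format(bits, '032b')
print(s[0], s[1:9], s[9:])   # 1 10000100 10100001000000000000000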

Machine Epsilon ($\epsilon_m$)

The machine epsilon $\epsilon_m$ is the smallest number such that $fl(1+\epsilon_m) \neq 1$.

Key Values

  • Double Precision: $\approx 2^{-52} \approx 2.22 \times 10^{-16}$
  • Single Precision: $\approx 2^{-23} \approx 1.19 \times 10^{-7}$

MATLAB Check

>> eps
ans = 2.2204e-16
>> eps('single')
ans = 1.1921e-07

Rounding Modes

  • To nearest (default)
  • Toward 0
  • Toward $+\infty$ (up)
  • Toward $-\infty$ (down)

Floating Point Arithmetic

The set of representable machine numbers is FINITE. Basic algebra breaks down:

$a + (b + c) \neq (a + b) + c$
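For instance, a minimal Python demonstration (values chosen to force the effect in double precision):

a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0, because b + c rounds back to -1e16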

Arithmetic Rules with Error

Rule 1: $fl(x) = x(1 + \epsilon), \quad |\epsilon| \le \epsilon_m$
Rule 2: $fl(x \odot y) = (x \odot y)(1 + \epsilon_\odot), \quad |\epsilon_\odot| \le \epsilon_m$

Insight: Order of Summation

Errors amplify based on the magnitude of intermediate sums. To sum $n$ numbers more accurately:

Start with the small numbers first!
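A small Python experiment (with made-up random data) comparing both orders against math.fsum, which returns a correctly rounded sum:

import math
import random

random.seed(0)
xs = [random.uniform(0, 1) * 10.0 ** random.randint(-8, 8)
      for _ in range(10_000)]

exact = math.fsum(xs)                       # correctly rounded reference
large_first = sum(sorted(xs, reverse=True))
small_first = sum(sorted(xs))

# Summing the small numbers first typically lands closer to the reference.
print(abs(large_first - exact), abs(small_first - exact))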


Catastrophic Cancellation

One of the most serious problems in floating point arithmetic. It occurs when two nearly equal numbers are subtracted; the result carries very few accurate digits.

The Problem

$a = x.xxxx1\dots$
$b = x.xxxx0\dots$
$a - b = 0.00001\dots$

The leading significant digits cancel out, leaving only the "garbage" digits at the end.
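A quick Python illustration (values chosen for demonstration): both operands carry about 16 significant digits, but the difference does not:

a = 1.0000001
b = 1.0
# The result is close to 1e-07, but the rounding error inherited from
# storing a now sits near its leading digits, so only ~9 digits are accurate.
print(a - b)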

Example: Quadratic Formula

To find the root of $x^2 + 2px - q = 0$ with smallest absolute value:

Bad: $y = -p + \sqrt{p^2 + q}$

Good: $y = \frac{q}{p + \sqrt{p^2 + q}}$

The "Good" version avoids subtracting two large numbers.

Example: Function Rearrangement

Consider $f(x) = \sqrt{x^2 + 1} - 1$. For $x \approx 0$, this subtracts two nearly equal numbers ($\approx 1 - 1$).

Fix: Multiply by the conjugate:

\[ f(x) = (\sqrt{x^2 + 1} - 1) \cdot \frac{\sqrt{x^2+1}+1}{\sqrt{x^2+1}+1} = \frac{x^2}{\sqrt{x^2+1}+1} \]

The result involves addition in the denominator, which is safe from cancellation.
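In Python, with the illustrative input $x = 10^{-8}$ (the true value is approximately $x^2/2 = 5 \times 10^{-17}$):

import math

x = 1e-8
naive = math.sqrt(x * x + 1.0) - 1.0             # 0.0: x*x + 1.0 rounds to 1.0
stable = x * x / (math.sqrt(x * x + 1.0) + 1.0)  # 5e-17
print(naive, stable)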

Loss of Precision Theorem

If $x > y > 0$ and $2^{-p} \le 1 - y/x \le 2^{-q}$, then the number of significant binary bits lost in calculating $x - y$ is at least $q$ and at most $p$.
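For example, with $x = 1$ and $y = 0.9999$ we have $1 - y/x = 10^{-4}$, and since $2^{-14} \approx 6.10 \times 10^{-5} \le 10^{-4} \le 2^{-13} \approx 1.22 \times 10^{-4}$, between 13 and 14 significant bits are lost.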


Case Study: Intel Pentium Bug

[Images: Intel chip architecture, the Pentium logo, and newspaper headlines.]

June 1994

Intel engineers discover division error. Managers keep it internal.

Oct 1994

Dr. Thomas Nicely (Lynchburg College) discovers the bug while calculating prime reciprocals.

Oct 30, 1994

Nicely sends an email detailing the error: $1/824633702441.0$ is calculated incorrectly.

"It appears that there is a bug in the floating point unit... of many, and perhaps all, Pentium processors."

Dec 20, 1994

Intel admits fault. Sets aside $475 million to fix it.

🧠 Final Challenge +20 XP

Why is calculating $y = -p + \sqrt{p^2 + q}$ problematic when $p \gg q$ and $p > 0$?

Interactive Computing

Interactive Playground

Explore machine epsilon and precision limitations. You can run MATLAB code via Octave Online or Python code directly in your browser.

MATLAB / Octave (Online): copy & paste the code into the Octave Online terminal.

Python (Client-Side): runs directly in the browser.

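As a starting point, here is a minimal Python sketch that estimates machine epsilon by repeated halving:

import sys

# Halve eps until adding it to 1.0 no longer changes the result;
# the last value that did change it is the machine epsilon.
eps = 1.0
while 1.0 + eps / 2 != 1.0:
    eps /= 2

print(eps)                     # 2.220446049250313e-16
print(sys.float_info.epsilon)  # library value, for comparison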