IEEE Floating Point Standard
Precision, loss of significance, and the hidden dangers of computer arithmetic.
🎯 Lecture Goals
1. Highlight the IEEE-754 standard.
2. Understand precision.
3. Explore loss of significance.
4. Analyze cancellation errors.
Why this matters
- IEEE-754 is the universal standard for hardware/software floating point arithmetic.
- Floating point arithmetic appears in nearly every piece of code.
- Even simple operations can lose significant bits.
- We must avoid pitfalls in common mathematical expressions.
Floating Point Representation
Normalized Floating-Point Form
A number is stored as $x = \pm (0.d_1 d_2 \dots d_m)_\beta \times \beta^e$, where:
- $d_1 d_2 \dots d_m$ is the mantissa.
- $e$ is the exponent (negative, positive, or zero).
- Normalized form requires the leading digit to be nonzero; in binary it is therefore always 1 and often is not stored (the hidden bit of the $1.f$ form below).
Base 10 Example
- $1000.12345 \rightarrow (0.100012345)_{10} \times 10^4$
- $0.000812345 \rightarrow (0.812345)_{10} \times 10^{-3}$
A Toy Model: 3-bit Mantissa
Suppose we have 3 bits for a mantissa and 2 bits for an exponent:
Produces discrete values: $0, \frac{1}{16}, \frac{2}{16}, \dots, \frac{7}{8}$.
(Figures: the gaps between representable numbers, and the effect of normalization on the representable range.)
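A short Python sketch of this toy system; the exponent range $e \in \{-1, 0\}$ is my assumption, chosen so the values match the endpoints $\frac{1}{16}$ and $\frac{7}{8}$ listed above:

```python
from fractions import Fraction

# Unnormalized 3-bit mantissa (0.b1b2b3)_2; exponent range {-1, 0} assumed.
values = sorted({Fraction(m, 8) * Fraction(2) ** e
                 for m in range(8) for e in (-1, 0)})
print([str(v) for v in values])               # 0, 1/16, 1/8, 3/16, ..., 7/8
gaps = [float(b - a) for a, b in zip(values, values[1:])]
print(gaps)                                   # the spacing doubles past 1/2
```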
Underflow
The result of a computation is too close to zero to be represented. It typically falls back to 0, which is often harmless.
Overflow
The result of a computation is too large to be represented. This is considered a severe error.
IEEE Floating Point (IEEE 754)
Typical word lengths are 32-bit (Single) and 64-bit (Double). The goal is to use these bits to best represent the normalized number $x = \pm q \times 2^m$.
Single Precision (32-bit)
- Sign: 1 bit
- Exponent: 8 bits, storing the biased value $m + 127$ (bias = 127).
- Mantissa: 23 bits ($q$). Hidden leading bit ($1.f$).
- Precision: ~6 decimal digits.
- Range: $\approx 10^{-38}$ to $10^{38}$.
Double Precision (64-bit)
- Sign: 1 bit
- Exponent: 11 bits. Bias = 1023.
- Mantissa: 52 bits.
- Precision: ~15 decimal digits.
- Range: $\approx 10^{-308}$ to $10^{308}$.
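As a quick check, these constants can be read off programmatically. The sketch below assumes NumPy is available (as in the playground at the end):

```python
import numpy as np  # assumed available, e.g. in the playground below

for dtype in (np.float32, np.float64):
    info = np.finfo(dtype)
    # machine epsilon, smallest normalized value, largest finite value
    print(dtype.__name__, info.eps, info.tiny, info.max)
# float32: eps ~ 1.19e-07, tiny ~ 1.18e-38,  max ~ 3.40e+38
# float64: eps ~ 2.22e-16, tiny ~ 2.23e-308, max ~ 1.80e+308
```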
Example: Converting $x = -52.125$ to Single Precision
- Convert to binary: $52 = 110100_2$, $0.125 = 0.001_2$. So $x = -(110100.001)_2$.
- Normalize ($1.f$ form): $x = -1.10100001 \times 2^5$.
- Exponent: $5 + 127 (\text{bias}) = 132 = (10000100)_2$.
- Mantissa: $10100001000\dots0$ (drop leading 1).
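This can be verified in Python with the standard struct module, which packs the value as a 32-bit float and exposes the raw bits:

```python
import struct

# Pack -52.125 as a big-endian 32-bit float, reread it as a 32-bit integer.
(bits,) = struct.unpack('>I', struct.pack('>f', -52.125))
s = f'{bits:032b}'
print(s[0], s[1:9], s[9:])   # sign, exponent, mantissa fields
# 1 10000100 10100001000000000000000
```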
Machine Epsilon ($\epsilon_m$)
The machine epsilon $\epsilon_m$ is the smallest positive number such that $fl(1+\epsilon_m) \neq 1$.
Key Values
- Double Precision: $\approx 2^{-52} \approx 2.22 \times 10^{-16}$
- Single Precision: $\approx 2^{-23} \approx 1.19 \times 10^{-7}$
Matlab Check
In MATLAB or Octave, the built-in eps gives the double-precision value ($\approx 2.2204 \times 10^{-16}$), and eps('single') gives the single-precision value ($\approx 1.1921 \times 10^{-7}$).
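The same value can be estimated in Python by repeated halving (a sketch; any IEEE-754 double behaves this way):

```python
import sys

# Halve eps until 1 + eps/2 rounds back to exactly 1.
eps = 1.0
while 1.0 + eps / 2.0 != 1.0:
    eps /= 2.0
print(eps)                      # 2.220446049250313e-16 = 2**-52
print(sys.float_info.epsilon)   # the built-in constant agrees
```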
Rounding Modes
IEEE 754 defines four rounding modes: round to nearest (ties to even; the default), round toward zero (chopping), round toward $+\infty$, and round toward $-\infty$.
Floating Point Arithmetic
The set of representable machine numbers is FINITE, so the familiar rules of algebra break down: addition is not associative, and in general $(a+b)+c \neq a+(b+c)$.
Arithmetic Rules with Error
Each basic operation is computed exactly and then rounded: $fl(x \odot y) = (x \odot y)(1+\delta)$ with $|\delta| \le \epsilon_m$, for each of $\odot \in \{+, -, \times, \div\}$.
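A two-line Python check that associativity really fails under these rounded operations:

```python
lhs = (0.1 + 0.2) + 0.3   # each sum is rounded before the next add
rhs = 0.1 + (0.2 + 0.3)
print(lhs == rhs)         # False
print(lhs, rhs)           # 0.6000000000000001 vs 0.6
```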
Insight: Order of Summation
Errors amplify based on the magnitude of intermediate sums. To sum $n$ numbers more accurately:
Start with the small numbers first!
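A sketch of the effect, summing $1/k^2$ in NumPy's float32 (single precision is used only to make the error visible at modest $n$):

```python
import numpy as np  # assumed available; float32 exaggerates the effect

terms = 1.0 / np.arange(1, 10**6 + 1, dtype=np.float32) ** 2

fwd = np.float32(0.0)
for t in terms:           # large terms first: later tiny terms are absorbed
    fwd += t
bwd = np.float32(0.0)
for t in terms[::-1]:     # small terms first: partial sums stay comparable
    bwd += t

print(fwd, bwd)           # bwd matches the limit to single precision
print(np.pi**2 / 6)       # series limit: 1.6449340668...
```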
Catastrophic Cancellation
One of the most serious problems. It occurs when two large, nearly equal numbers are subtracted. The result carries very few accurate digits.
The Problem
$a = x.xxxx1\dots$
$b = x.xxxx0\dots$
$a - b = 0.00001\dots$
The leading significant digits cancel out, leaving only the "garbage" digits at the end.
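A small Python illustration: the difference below should be exactly 0.000123, but cancellation leaves mostly rounding noise in the trailing digits:

```python
a = 123456789.0 + 0.000123   # ~16 significant digits in play
b = 123456789.0
print(a - b)                 # 0.0001229941..., not 0.000123
```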
Example: Quadratic Formula
To find the root of $x^2 + 2px - q = 0$ with smallest absolute value:
Bad: $y = -p + \sqrt{p^2 + q}$
Good: $y = \frac{q}{p + \sqrt{p^2 + q}}$
The "Good" version avoids subtracting two large numbers.
Example: Function Rearrangement
Consider $f(x) = \sqrt{x^2 + 1} - 1$. For $x \approx 0$, this subtracts two nearly equal numbers ($\approx 1 - 1$).
Fix: Multiply by the conjugate:
$f(x) = \frac{(\sqrt{x^2+1}-1)(\sqrt{x^2+1}+1)}{\sqrt{x^2+1}+1} = \frac{x^2}{\sqrt{x^2+1}+1}$
The result involves addition in the denominator, which is safe from cancellation.
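In Python, with the illustrative value $x = 10^{-9}$ (true answer $\approx x^2/2 = 5 \times 10^{-19}$):

```python
from math import sqrt

x = 1e-9                                     # illustrative small value
naive  = sqrt(x * x + 1.0) - 1.0             # 1 + 1e-18 rounds to 1: 0.0
stable = x * x / (sqrt(x * x + 1.0) + 1.0)   # 5e-19, the correct value
print(naive, stable)
```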
Loss of Precision Theorem
Let $x > y > 0$ be normalized floating-point numbers. If $2^{-p} \le 1 - y/x \le 2^{-q}$, then the number of significant binary digits lost in calculating $x - y$ is at least $q$ and at most $p$. (Example: if $y/x = 3/4$, then $1 - y/x = 2^{-2}$, so exactly two bits are lost.)
Case Study: Intel Pentium Bug
- June 1994: Intel engineers discover a division error in the Pentium's floating-point unit. Managers keep it internal.
- Oct 1994: Dr. Thomas Nicely (Lynchburg College) discovers the bug while calculating reciprocals of primes.
- Oct 30, 1994: Nicely sends an email detailing the error: 1/824633702441.0 is calculated incorrectly.
- Dec 20, 1994: Intel admits fault and sets aside $475 million to cover replacements.
🧠 Final Challenge +20 XP
Why is calculating $y = -p + \sqrt{p^2 + q}$ problematic when $p \gg q$ and $p > 0$?
Interactive Playground
Explore machine epsilon and precision limitations. You can run MATLAB code via Octave Online or Python code directly in your browser.