Floating-Point Handout
Jeff Arnold
CERN openlab
9 May 2017
Agenda
• Introduction
• Standards
• Properties
• Error-Free Transformations
• Summation Techniques
• Dot Products
• Polynomial Evaluation
• Value Safety
• Pitfalls and Gremlins
• Tools
• References and Bibliography
Why is Floating-Point Arithmetic Important?
Important to Teach About Floating-Point Arithmetic
Reasoning about Floating-Point Arithmetic
Classification of real numbers
Some Properties of Floating-Point Numbers
Floating-Point Numbers are Rational Numbers
How Many Floating-Point Numbers Are There?
• ∼ 2^p (2e_max + 1)
• Single-precision: ∼ 4.3 × 10^9
• Double-precision: ∼ 1.8 × 10^19
• Number of protons circulating in the LHC: ∼ 6.7 × 10^14
(a quick numerical check of the first two counts is sketched below)
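A small check of those counts, written against the formula above (a sketch of mine, using the standard binary32/binary64 parameters p = 24, e_max = 127 and p = 53, e_max = 1023):

#include <cmath>
#include <cstdio>

int main(void) {
    // Approximate count of binary floating-point values: ~ 2^p * (2*e_max + 1).
    const double count32 = std::ldexp(1.0, 24) * (2.0 * 127.0 + 1.0);
    const double count64 = std::ldexp(1.0, 53) * (2.0 * 1023.0 + 1.0);
    std::printf("binary32: %.2e\n", count32);   // ~ 4.3e9
    std::printf("binary64: %.2e\n", count64);   // ~ 1.8e19
    return 0;
}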
Standards
IEEE 754-2008
Operations Specified by IEEE 754-2008
Other Operations Specified by IEEE 754-2008
Special Values
• Zero
  • zero is signed
• Infinity
  • infinity is signed
• Subnormals
• NaN (Not a Number)
  • Quiet NaN
  • Signaling NaN
  • the sign of a NaN carries no meaning
Rounding Modes in IEEE 754-2008
Exceptions Specified by IEEE 754-2008
• Underflow
  • Absolute value of a non-zero result is less than the smallest non-zero finite floating-point number
  • Result is 0
• Overflow
  • Absolute value of a result is greater than the largest finite floating-point number
  • Result is ±∞
• Division by Zero
  • x/y where x is finite and non-zero and y = 0
• Inexact
  • The result, after rounding, is different from the infinitely-precise result
Exceptions Specified by IEEE 754-2008
• Invalid
  • An operand is a NaN
  • √x where x < 0
    • however, √(−0) = −0
  • (±∞) ± (±∞)
  • (±0) × (±∞)
  • (±0)/(±0)
  • (±∞)/(±∞)
  • some floating-point→integer or decimal conversions
Formats Specified in IEEE 754-2008
• Basic Formats:
  • Binary with sizes of 32, 64 and 128 bits
  • Decimal with sizes of 64 and 128 bits
• Other formats:
  • Binary with a size of 16 bits
  • Decimal with a size of 32 bits
Transcendental and Algebraic Functions
We’re Not Going to Consider Everything...
Storage Format of a Binary Floating-Point Number
Layout: sign s (1 bit) | exponent field E (w bits) | significand (p − 1 bits)
The Value of a Floating-Point Number
x = (−1)^s · m · β^e
with
0 ≤ m < β
Equivalently, with an integral significand M,
x = (−1)^s · M · β^(e−p+1)
where
0 ≤ M < β^p
Requiring Uniqueness
x = (−1)^s · β^e · Σ_{i=0}^{p−1} x_i β^(−i)
where the x_i are base-β digits (0 ≤ x_i < β); uniqueness is obtained by requiring x_0 ≠ 0 (a normalized significand).
Subnormal Floating-Point Numbers
m = Σ_{i=0}^{p−1} x_i β^(−i)
Why have Subnormal Floating-Point Numbers?
Why p − 1?
A Walk Through the Doubles
0x0000000000000000 plus 0
0x0000000000000001 smallest subnormal
...
0x000fffffffffffff largest subnormal
0x0010000000000000 smallest normal
...
0x001fffffffffffff
0x0020000000000000 2× smallest normal
...
0x7fefffffffffffff largest normal
0x7ff0000000000000 +∞
A Walk Through the Doubles
0x8000000000000000 minus 0
0x8000000000000001 −(smallest subnormal)
...
0x800fffffffffffff −(largest subnormal)
0x8010000000000000 −(smallest normal)
...
0x801fffffffffffff
...
0xffefffffffffffff −(largest normal)
0xfff0000000000000 −∞
Notation
Some Inconvenient Properties of Floating-Point Numbers
The Fused Multiply-Add Instruction (FMA)
x  =                      0x1.3333333333333p+0
x1 = x * x              = 0x1.70a3d70a3d70ap+0
x2 = fma(x, x, 0)       = 0x1.70a3d70a3d70ap+0
x3 = fma(x, x, -x * x)  = -0x1.eb851eb851eb8p-55
x3 is the difference between the exact value of x*x and its value
converted to double precision. The relative error is ≈ 0.24 ulp
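The same idea gives an error-free transformation for multiplication: the rounded product plus the FMA-computed residual equals the exact product. A minimal sketch (the name TwoProdFMA is mine, not from the handout; it assumes a hardware FMA and the C99/C++11 fma function):

#include <cmath>
#include <cstdio>

// Error-free product: returns s = fl(a*b) and *t such that a*b == s + *t exactly
// (assuming fma performs a*b + c with a single rounding and no over/underflow).
static double TwoProdFMA(const double a, const double b, double *const t) {
    const double s = a * b;
    *t = std::fma(a, b, -s);   // exact residual of the rounded product
    return s;
}

int main(void) {
    const double x = 0x1.3333333333333p+0;    // ~1.2, as on the slide
    double t;
    const double s = TwoProdFMA(x, x, &t);
    std::printf("s = %a\nt = %a\n", s, t);    // expect t ~ -0x1.eb851eb851eb8p-55
    return 0;
}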
The Fused Multiply-Add Instruction (FMA)
Floating-point contractions
• Evaluate an expression as though it were a single operation
double a, b, c, d;
// Single expression; may be replaced
// by a = FMA(b, c, d)
a = b * c + d;
Forward and Backward Errors
f (x) → y
f (x̂) → ŷ
Forward and Backward Errors
For example, if
f (x) = sin(x) and x = π
then
y=0
However, if
x̂ = M_PI
then
x̂ ≠ x and f(x̂) ≠ f(x)
Note: we are assuming that if x̂ ≡ x then std::sin(x̂) ≡ sin(x)
Forward and Backward Errors
[Figure by J.G. Nagy, Emory University, from Brief Notes on Conditioning, Stability and Finite Precision Arithmetic]
Condition Number
condition number = (relative change in y) / (relative change in x)
                 = (Δy/y) / (Δx/x)
                 ≈ x f′(x) / f(x)
Condition Number
• ln x for x ≈ 1
  Condition number ≈ x f′(x)/f(x) = 1/ln x → ∞
• sin x for x ≈ π
  Condition number ≈ x cos x/sin x → ∞
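A small numerical illustration of the first case (my example, not from the handout): perturbing x by one ulp near 1 changes ln x by a relative amount roughly 1/ln x times larger.

#include <cmath>
#include <cstdio>

int main(void) {
    const double x  = 1.000000000000001;        // x close to 1
    const double xp = std::nextafter(x, 2.0);   // x perturbed by one ulp
    const double rel_dx = (xp - x) / x;
    const double rel_dy = (std::log(xp) - std::log(x)) / std::log(x);
    // The ratio approximates the condition number x*f'(x)/f(x) = 1/ln(x),
    // which is enormous for x this close to 1.
    std::printf("relative change in x:    %.3e\n", rel_dx);
    std::printf("relative change in ln x: %.3e\n", rel_dy);
    std::printf("amplification: %.3e   (1/ln x = %.3e)\n",
                rel_dy / rel_dx, 1.0 / std::log(x));
    return 0;
}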
Error Measures
IEEE 754 and ulps
IEEE 754 requires that the results of the basic operations be correctly
rounded, i.e., equal to the infinitely-precise result rounded to the
destination format.
If x is the infinitely-precise result and x̂ is the “round-to-nearest-even”
result, then
|x − x̂| ≤ 0.5 ulp(x̂)
Approximation Error
Approximating π
This explains why sin(M_PI) is not zero: the argument is not exactly π.
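A quick check (my example; it assumes M_PI is provided by <cmath>, as on most platforms). Since M_PI differs from π by about 1.2 × 10^−16, that is roughly the value sin(M_PI) returns:

#include <cmath>
#include <cstdio>

int main(void) {
    // M_PI is the double nearest to pi, so sin(M_PI) is approximately
    // (pi - M_PI), not zero: roughly 1.2246e-16.
    std::printf("sin(M_PI) = %.17g\n", std::sin(M_PI));
    return 0;
}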
Associativity
Distributivity
The “Pigeonhole” Principle
Catastrophic Cancellation
Sterbenz’s Lemma
Lemma
Let a and b be floating-point numbers with
b/2 ≤ a ≤ 2b
Then a − b is exactly representable, i.e., a ⊖ b = a − b with no rounding error.
Error-Free Transformations
EFTs are most useful when they can be implemented using only
the precision of the floating-point numbers involved.
EFTs exist for
• Addition: a + b = s + t where s = a ⊕ b
• Multiplication: a × b = s + t where s = a ⊗ b
• Splitting: a = s + t
An EFT for Addition
1: s ← a ⊕ b
2: z ← s ⊖ a
3: t ← (a ⊖ (s ⊖ z)) ⊕ (b ⊖ z)
4: return (s, t)
Ensure: a + b = s + t exactly, where s = a ⊕ b; both s and t are
floating-point numbers
A possible implementation
void
TwoSum(const double a, const double b,
       double *const s, double *const t) {
    // No unsafe optimizations!
    *s = a + b;
    const double z = *s - a;
    *t = (a - (*s - z)) + (b - z);
    return;
}
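A small usage check (my example, not from the handout): for a = 1.0 and b = 1e−16, the rounded sum s is 1.0 and t recovers exactly the part that was rounded away.

#include <cstdio>

void TwoSum(const double a, const double b, double *const s, double *const t);

int main(void) {
    double s, t;
    TwoSum(1.0, 1e-16, &s, &t);
    // a + b == s + t exactly; here s == 1.0 and t == 1e-16 (as a double),
    // because 1e-16 is below half an ulp of 1.0 and is lost in s.
    std::printf("s = %.17g\nt = %.17g\n", s, t);
    return 0;
}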
Comparing FastSum and TwoSum
Precise Splitting Algorithm
Precise Splitting EFT
Possible implementation
void
Split(const double x, const int delta,
      double *const x_h, double *const x_l) {
    // No unsafe optimizations!
    const double c = (double)((1UL << delta) + 1);
    *x_h = (c * x) + (x - (c * x));
    *x_l = x - *x_h;
    return;
}
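A usage sketch (my example): for binary64, delta = 27 is the usual Dekker/Veltkamp choice, giving two halves of roughly half the significand each, whose sum reproduces x exactly.

#include <cstdio>

void Split(const double x, const int delta, double *const x_h, double *const x_l);

int main(void) {
    const double x = 1.0 / 3.0;
    double hi, lo;
    Split(x, 27, &hi, &lo);
    // x == hi + lo exactly; hi and lo each fit in about half the significand,
    // so products such as hi*hi are exact doubles.
    std::printf("x  = %a\nhi = %a\nlo = %a\n", x, hi, lo);
    std::printf("hi + lo == x ? %d\n", hi + lo == x);
    return 0;
}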
Precise Multiplication
Summation Techniques
• Traditional
• Sorting and Insertion
• Compensated
• Reference: Higham: Accuracy and Stability of Numerical
Algorithms
Summation Techniques
Condition number:
C_sum = Σ_i |x_i| / |Σ_i x_i|
Traditional Summation
s = Σ_{i=0}^{n−1} x_i
double
Sum(const double *x, const unsigned int n)
{   // No unsafe optimizations!
    double sum = x[0];
    for (unsigned int i = 1; i < n; i++) {
        sum += x[i];
    }
    return sum;
}
Sorting and Insertion
Compensated Summation
Compensated (Kahan) Summation
double
Kahan(const double *x, const unsigned int n)
{   // No unsafe optimizations!
    double s = x[0];
    double t = 0.0;
    for (unsigned int i = 1; i < n; i++) {
        const double y = x[i] - t;
        const double z = s + y;
        t = (z - s) - y;
        s = z;
    }
    return s;
}
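A small comparison of the two routines (my example, not from the handout): summing 10^7 copies of 0.1, whose double value is slightly above 0.1, lets the recursive sum drift while the compensated sum typically stays within an ulp or so of the exact sum of the stored values.

#include <cstdio>
#include <vector>

double Sum(const double *x, const unsigned int n);
double Kahan(const double *x, const unsigned int n);

int main(void) {
    const unsigned int n = 10000000;       // 1e7 terms
    std::vector<double> x(n, 0.1);         // 0.1 is not exact in binary
    std::printf("recursive:   %.17g\n", Sum(x.data(), n));
    std::printf("compensated: %.17g\n", Kahan(x.data(), n));
    // One would expect the compensated result to agree with
    // n * (double)0.1 = 1000000.0000000000555... to within an ulp or so,
    // while the recursive sum is off in several of the last digits.
    return 0;
}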
Choice of Summation Technique
• Performance
• Error Bound
  • Is it (weakly) dependent on n?
• Condition Number
  • Is it known?
  • Is it difficult to determine?
  • Some algorithms allow it to be determined simultaneously with an estimate of the sum
    • permits easy evaluation of the suitability of the result
• No one technique fits all situations all the time
Dot Product
S = x^T y = Σ_{i=0}^{n−1} x_i · y_i
Dot Product
Traditional algorithm
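A minimal sketch of the traditional loop (my version; each product and each addition is rounded, so the error bound grows with n and with the condition number of the data):

double
Dot(const double *x, const double *y, const unsigned int n)
{   // No unsafe optimizations!
    double s = 0.0;
    for (unsigned int i = 0; i < n; i++) {
        s += x[i] * y[i];        // each product and each sum is rounded
    }
    return s;
}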
Dot Product
Recall
• Sum(x, y) computes s and t with x + y = s + t and s = x ⊕ y
• Prod(x, y) computes s and t with x × y = s + t and s = x ⊗ y
Since each individual product in the sum for the dot product is
transformed using Prod(x, y) into the sum of two floating-point
numbers, the dot product of two vectors of length n can be reduced to
computing the sum of 2n floating-point numbers.
To accurately compute that sum, Sum(x, y) is used. A sketch of such a
compensated dot product follows.
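A compensated dot product along these lines (a sketch of mine, not necessarily the handout's exact algorithm: it folds all the product and summation error terms into a single running correction, in the style of Ogita, Rump and Oishi's Dot2):

#include <cmath>

// Error-free transformations (see the earlier slides).
static void TwoSum(const double a, const double b, double *const s, double *const t) {
    *s = a + b;
    const double z = *s - a;
    *t = (a - (*s - z)) + (b - z);
}

static double TwoProdFMA(const double a, const double b, double *const t) {
    const double s = a * b;
    *t = std::fma(a, b, -s);
    return s;
}

// Compensated dot product: accumulate the rounded result in s and
// all product/summation errors in a single correction term c.
double CompDot(const double *x, const double *y, const unsigned int n)
{
    double s = 0.0, c = 0.0;
    for (unsigned int i = 0; i < n; i++) {
        double p, pe, se;
        p = TwoProdFMA(x[i], y[i], &pe);   // x[i]*y[i] == p + pe exactly
        TwoSum(s, p, &s, &se);             // old s + p == new s + se exactly
        c += pe + se;                      // gather the error terms
    }
    return s + c;
}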
Polynomial Evaluation
Evaluate
p(x) = Σ_{i=0}^{n} a_i x^i
     = a_0 + a_1 x + a_2 x^2 + · · · + a_{n−1} x^{n−1} + a_n x^n
Condition number
C(p, x) = Σ_{i=0}^{n} |a_i| |x|^i / |Σ_{i=0}^{n} a_i x^i|
Horner’s Scheme
A possible implementation
double
Horner(const double x,
       const double *const a,
       const int n) {
    double s = a[n];
    for (int i = n - 1; i >= 0; i--) {
        // s = s * x + a[i];
        s = fma(s, x, a[i]);
    }
    return s;
}
Applying EFTs to Horner’s Scheme
s0 + (π(x) + σ(x))
is an improved approximation to
Σ_{i=0}^{n} a_i x^i
Second Order Horner’s Scheme
Estrin’s Method
Isolate subexpressions of the form (a_k + a_{k+1} x) and x^(2^k) from p(x):
p(x) = (a_0 + a_1 x) + (a_2 + a_3 x) x^2 + ((a_4 + a_5 x) + (a_6 + a_7 x) x^2) x^4 + · · ·
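A sketch for a degree-7 polynomial (my example): the four first-level pairs are independent of one another, as are the two second-level combinations, which is what gives Estrin's method its instruction-level parallelism.

#include <cmath>

// p(x) = a[0] + a[1]*x + ... + a[7]*x^7, evaluated Estrin-style.
double Estrin7(const double x, const double a[8])
{
    const double x2 = x * x;
    const double x4 = x2 * x2;
    // Independent first-level pairs (each can use an FMA):
    const double p01 = std::fma(a[1], x, a[0]);
    const double p23 = std::fma(a[3], x, a[2]);
    const double p45 = std::fma(a[5], x, a[4]);
    const double p67 = std::fma(a[7], x, a[6]);
    // Combine with x^2, then with x^4:
    const double q0 = std::fma(p23, x2, p01);   // (a0 + a1 x) + (a2 + a3 x) x^2
    const double q1 = std::fma(p67, x2, p45);   // (a4 + a5 x) + (a6 + a7 x) x^2
    return std::fma(q1, x4, q0);
}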
Value Safety
A Note on Compiler Options
Optimizations Affecting Value Safety
• Expression rearrangements
• Flush-to-zero
• Approximate division and square root
• Math library accuracy
Expression Rearrangements
Subnormal Numbers and Flush-To-Zero
Reductions
The Hardware Floating-Point Environment
Precise Exceptions
Math Library Features – icc
Tools
MPFR
• a C library for multiple-precision floating-point computations
• all results are correctly rounded
• used by gcc and g++
• C++ interface available
• free with a GNU LGPL license
CRlibm
• a C library
• all results are correctly rounded
• C++ interface available
• Python bindings available
• free with a GNU LGPL license
• limits
  • defines characteristics of arithmetic types
  • provides the class template std::numeric_limits
  • #include <limits>
  • requires -std=c++11
  • specializations for each fundamental type
  • compiler and platform specific (see the example sketched below)
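A short example of querying these characteristics (standard C++; my example):

#include <limits>
#include <cstdio>

int main(void) {
    using dbl = std::numeric_limits<double>;
    std::printf("radix        = %d\n", dbl::radix);         // beta
    std::printf("digits       = %d\n", dbl::digits);        // p (53 for binary64)
    std::printf("epsilon      = %a\n", dbl::epsilon());     // 2^(1-p)
    std::printf("min (normal) = %a\n", dbl::min());
    std::printf("denorm_min   = %a\n", dbl::denorm_min());  // smallest subnormal
    std::printf("max          = %a\n", dbl::max());
    std::printf("IEEE 754 semantics? %d\n", dbl::is_iec559);
    return 0;
}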
• cmath
  • functions to compute common mathematical operations and transformations
  • #include <cmath>
  • frexp
    • get exponent and significand
  • ldexp
    • create value from exponent and significand
  • Note: frexp and ldexp assume a different “normalization” than usual: 1/2 ≤ m < 1
  • nextafter
    • create next representable value
  • fpclassify
    • returns one of FP_INFINITE, FP_NAN, FP_ZERO, FP_SUBNORMAL, FP_NORMAL (see the example sketched below)
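A short illustration of these functions (my example):

#include <cmath>
#include <cstdio>

int main(void) {
    int e;
    const double m = std::frexp(48.0, &e);     // 48 = 0.75 * 2^6, so m = 0.75, e = 6
    std::printf("frexp(48.0): m = %g, e = %d\n", m, e);
    std::printf("ldexp(m, e) = %g\n", std::ldexp(m, e));    // back to 48
    std::printf("nextafter(1.0, 2.0) - 1.0 = %a\n",
                std::nextafter(1.0, 2.0) - 1.0);            // one ulp of 1.0: 2^-52
    std::printf("fpclassify(1e-310) == FP_SUBNORMAL ? %d\n",
                std::fpclassify(1e-310) == FP_SUBNORMAL);   // 1e-310 is subnormal
    return 0;
}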
Pitfalls and Gremlins
Catastrophic Cancellation
• x² − y² for x ≈ y
  • (x + y)(x − y) may be preferable
    • x − y is computed with no round-off error (Sterbenz's Lemma)
    • x + y is computed with relatively small error
  • FMA(x, x, -y*y) can be very accurate
    • however, FMA(x, x, -x*x) is not usually 0!
• similarly, 1 − x² for x ≈ 1
  • -FMA(x, x, -1.0) is very accurate
A small demonstration is sketched below.
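A demonstration of the first case (my example): with x and y nearly equal, the naive form loses digits while the other two forms agree closely.

#include <cmath>
#include <cstdio>

int main(void) {
    const double x = 1.0 + 1.0e-8;
    const double y = 1.0;
    const double direct   = x * x - y * y;             // subtracts two nearby squares
    const double factored = (x + y) * (x - y);         // x - y is exact (Sterbenz)
    const double with_fma = std::fma(x, x, -(y * y));  // x*x is exact inside the fma; one rounding total
    std::printf("direct:   %.17g\n", direct);
    std::printf("factored: %.17g\n", factored);
    std::printf("fma:      %.17g\n", with_fma);
    return 0;
}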
Pitfalls and Gremlins
“Gratuitous” Overflow
Consider √(x² + 1) for large x
• √(x² + 1) → |x| as |x| → ∞
• |x| · √(1 + 1/x²) may be preferable
  • if x² overflows, 1/x² → 0
  • |x| · √(1 + 1/x²) → |x|
A demonstration is sketched below.
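A sketch of the difference for a large argument (my example; note that the C/C++ standard library also provides hypot(x, 1.0), which is designed to avoid exactly this kind of overflow):

#include <cmath>
#include <cstdio>

int main(void) {
    const double x = 1.0e200;                         // x*x overflows
    const double naive    = std::sqrt(x * x + 1.0);   // inf: the overflow is "gratuitous"
    const double rescaled = std::fabs(x) * std::sqrt(1.0 + 1.0 / (x * x));
    // Even though x*x is inf here, 1.0/(x*x) is 0, so the rescaled form gives |x|.
    std::printf("naive:    %g\n", naive);             // inf
    std::printf("rescaled: %g\n", rescaled);          // 1e200
    std::printf("hypot:    %g\n", std::hypot(x, 1.0));
    return 0;
}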
Pitfalls and Gremlins
Consider the Newton–Raphson iteration for 1/√x:
y_{n+1} = y_n + y_n (1 − x y_n²)/2
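A sketch of the iteration (my code, not the handout's): a crude starting guess is refined a few times; in practice the initial estimate usually comes from a low-precision reciprocal square-root instruction.

#include <cmath>
#include <cstdio>

// One Newton-Raphson step for y ~ 1/sqrt(x):  y <- y + y*(1 - x*y*y)/2
static double rsqrt_step(const double x, const double y) {
    return y + y * (1.0 - x * y * y) / 2.0;
}

int main(void) {
    const double x = 2.0;
    double y = 0.7;                        // crude initial guess for 1/sqrt(2) ~ 0.7071
    for (int i = 0; i < 4; i++) {
        y = rsqrt_step(x, y);
        std::printf("iter %d: y = %.17g\n", i, y);
    }
    std::printf("1/sqrt(2) = %.17g\n", 1.0 / std::sqrt(2.0));
    return 0;
}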
Bibliography