Finite Word Length Effects
(Number representation in register)
Sugumar D
Finite Word Length Effects
Digital signal processors have a finite data-bus width.
When the word length of a result of a mathematical operation
exceeds the bus width, the excess bits must be discarded.
This discarding is a source of serious errors.
We now discuss the attributes that cause such errors.
First we discuss
Number representation and
Quantization error.
1. Number representation in registers
The Binary Number System
In conventional digital computers, integers are
represented as binary numbers of fixed length n:
an ordered sequence x_{n-1} x_{n-2} ... x_1 x_0
of binary digits.
Each digit x_i (bit) is 0 or 1.
The above sequence represents the integer value
X = x_{n-1}·2^{n-1} + ... + x_1·2 + x_0.
Upper-case letters represent numerical values or
sequences of digits.
Lower-case letters, usually indexed, represent
individual digits.
Radix of a Number System
The weight of the digit x_i is the i-th power of 2,
so its contribution is x_i · 2^i.
2 is the radix of the binary number system.
Binary numbers are radix-2 numbers; the allowed digits are 0, 1.
Decimal numbers are radix-10 numbers; the allowed digits are 0, 1, 2, ..., 9.
The radix is indicated by a decimal subscript.
Example:
(101)_10 - decimal value 101
(101)_2 - decimal value 5
Range of Representations
Operands and results are stored in registers of
fixed length n, so only a finite number of distinct
values can be represented within an arithmetic unit.
X_min, X_max - smallest and largest
representable values.
[X_min, X_max] - range of the representable
numbers.
A result larger than X_max or smaller than X_min
is incorrectly represented.
The arithmetic unit should indicate that the
generated result is in error - an overflow indication.
Signed-magnitude Representation
Uses the high-order bit to indicate the sign
0 for positive
1 for negative
remaining low-order bits indicate the magnitude of the
value
Signed-magnitude representation of +41 and -41:
+41 = 0 0101001   (sign 0; magnitude 32 + 8 + 1 = 41)
-41 = 1 0101001   (sign 1; magnitude 32 + 8 + 1 = 41)
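As a quick illustration, the +41/-41 encoding above can be reproduced in a few lines of Python (the function name and the 8-bit width are illustrative choices, not part of the original):

```python
def to_sign_magnitude(x, n=8):
    """Encode integer x as an n-bit sign-magnitude string:
    one sign bit followed by n-1 magnitude bits."""
    assert -(2**(n - 1) - 1) <= x <= 2**(n - 1) - 1, "magnitude out of range"
    sign = '1' if x < 0 else '0'
    return sign + format(abs(x), f'0{n - 1}b')

print(to_sign_magnitude(41))   # 00101001
print(to_sign_magnitude(-41))  # 10101001
```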
Disadvantage of the Signed-Magnitude
Representation
An operation may depend on the signs of the operands.
Example - adding a positive number X and a negative
number -Y:
X + (-Y)
If Y > X, the final result is -(Y - X).
The calculation must:
switch the order of the operands,
perform subtraction rather than addition,
attach the minus sign.
A sequence of decisions must be made, costing
excess control logic and execution time.
This is avoided in the complement representation
methods
Complement Representations of
Negative Numbers
Two alternatives -
Radix complement (called two's complement in the
binary system)
Diminished-radix complement (called one's complement
in the binary system)
In both complement methods - positive numbers
represented as in the signed-magnitude method
Advantage of Complement Representation
No decisions made before executing addition or
subtraction
No need to interchange the order of the two
operands
Ones Complement
Ones complement replaced signed magnitude
because the signed-magnitude arithmetic circuitry
was too complicated.
Negative numbers are represented in ones-complement
form by complementing each bit - even the sign bit
is reversed.
+41 = 0 0 1 0 1 0 0 1
-41 = 1 1 0 1 0 1 1 0   (each 1 is replaced with a 0, each 0 is replaced with a 1)
Twos Complement
The twos complement form of a negative integer
is created by adding one to the ones complement
representation.
+41             = 0 0 1 0 1 0 0 1
ones complement = 1 1 0 1 0 1 1 0
-41 = 1 1 0 1 0 1 1 0 + 1 = 1 1 0 1 0 1 1 1
Twos-complement representation has a single
representation of zero (unlike ones complement, there is no separate -0).
The sign is represented by the most significant
bit.
The notation for positive integers is identical to
their signed-magnitude representations.
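A sketch of both complement encodings in Python (helper names and the register width n are illustrative):

```python
def ones_complement(x, n=8):
    """One's-complement bit string: for negative x, flip every bit of |x|."""
    v = x if x >= 0 else ~(-x) & (2**n - 1)
    return format(v, f'0{n}b')

def twos_complement(x, n=8):
    """Two's-complement bit string: one's complement plus one,
    which Python's masking does in a single step."""
    return format(x & (2**n - 1), f'0{n}b')

print(ones_complement(-41))  # 11010110
print(twos_complement(-41))  # 11010111
```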
Representation of Mixed Numbers
A sequence of n digits in a register - not
necessarily representing an integer
Can represent a mixed number with a fractional
part and an integral part
The n digits are partitioned into two - k in the
integral part and m in the fractional part (k+m=n)
The value of an n-tuple with a radix point between
the k most significant digits and the m least
significant digits is
X = Σ_{i=-m}^{k-1} x_i · 2^i
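The mixed-number sum Σ x_i · 2^i can be evaluated directly; this small helper (an illustrative sketch, not from the original) takes the digit string plus the k/m split:

```python
def fixed_point_value(bits, k, m):
    """Value of a k+m bit string with the radix point after the k-th bit,
    i.e. the sum of x_i * 2**i for i from k-1 down to -m."""
    assert len(bits) == k + m
    return sum(int(b) * 2.0**(k - 1 - i) for i, b in enumerate(bits))

print(fixed_point_value('10111', 3, 2))  # 5.75, i.e. 101.11 in binary
```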
Fractional Binary Numbers
Representation: b_k b_{k-1} ... b_2 b_1 b_0 . b_{-1} b_{-2} b_{-3} ... b_{-j}
Bit weights: ..., 2^i, ..., 4, 2, 1 for the integer bits; 1/2, 1/4, 1/8, ... for the fraction bits.
Bits to the right of the binary point represent fractional (negative) powers of 2.
The string represents the rational number Σ_{i=-j}^{k} b_i · 2^i.
Fractional Binary Number Examples
Value    Representation
5 3/4    101.11_2
2 7/8    10.111_2
63/64    0.111111_2
Observations:
Divide by 2 by shifting right.
Numbers of the form 0.111111..._2 are just below 1.0; use the notation 1.0 - ε.
Limitation:
Can only exactly represent numbers of the form x/2^k.
Other numbers have repeating bit representations:
Value    Representation
1/3      0.0101010101[01]..._2
1/5      0.001100110011[0011]..._2
1/10     0.0001100110011[0011]..._2
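The x/2^k limitation can be checked mechanically: a rational is exactly representable in binary if and only if its denominator (in lowest terms) is a power of two. A small sketch using Python's Fraction:

```python
from fractions import Fraction

def exactly_representable(q):
    """True iff Fraction q equals m / 2**k, i.e. its reduced
    denominator is a power of two."""
    d = q.denominator
    return d & (d - 1) == 0

print(exactly_representable(Fraction(3, 4)))   # True  (0.11 in binary)
print(exactly_representable(Fraction(1, 10)))  # False (repeating)
print(0.1 + 0.2 == 0.3)                        # False: both sides are rounded
```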
Fixed Point Representations
Radix point not stored in register - understood to
be in a fixed position between the k most
significant digits and the m least significant digits
These are called fixed-point representations
One bit is used for the sign and the remaining bits
for the magnitude.
Clearly there is a restriction to the numbers which
can be represented.
With 7 bits reserved for the magnitude, the
largest and smallest numbers represented are +127
and -127:
+127 = 0 1111111   (sign bit 0, +ve number)
-127 = 1 1111111   (sign bit 1, -ve number)
Fixed Point Representations
Things to note:
1. Fixed-point numbers are represented exactly.
2. Arithmetic between fixed-point numbers is also
exact, provided the answer is within range.
3. Division is also exact if interpreted as
producing an integer and discarding any
remainder.
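Points 2 and 3 can be seen with a toy Q4.3 fixed-point format (the scale factor and names are illustrative): sums stay exact, while products must be rescaled and lose the discarded bits:

```python
SCALE = 2**3                      # Q4.3: 3 fraction bits

def q(x):
    """Quantize a real value to the nearest Q4.3 level (stored as an int)."""
    return round(x * SCALE)

a, b = q(1.625), q(2.25)          # raw integers 13 and 18
print((a + b) / SCALE)            # 3.875 -- addition is exact
print((a * b) // SCALE / SCALE)   # 3.625 -- true product 3.65625, remainder discarded
```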
Floating Point
A floating-point representation consists of:
A sign bit s
An exponent e
A mantissa/fraction M or F
In floating-point representation, numbers are represented by a sign
bit s, an integer exponent e, and a positive integer mantissa M:
(-1)^s × M × B^e, or (-1)^s × F × B^e
Example layout: S (1 bit) | exponent e (3 bits) | fraction M or F (4 bits)
e - the exponent.
B - the base, usually 2 or 16.
E - the bias: a fixed integer, machine dependent.
If the mantissa is assumed to be of the form 1.xxxxx (thus, one bit of the
mantissa is implied as 1),
this is called a normalized representation.
8-bit floating point format
sign (1 bit) | exponent (3 bits) | significand (4 bits) | number (base 2) | number (base 10)
0 | 001 | 1001 | 1.001 × 2^1  | 2.25
0 | 011 | 1100 | 1.1 × 2^3    | 12.0
0 | 111 | 1110 | 1.11 × 2^7   | 224.0
0 | 001 | 1110 | 1.11 × 2^-1  | 0.875
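The table rows can be decoded with a short helper, assuming the 3-bit exponent is read as an unsigned integer and the 4-bit significand field includes the leading integer bit (1.xxx); the slide does not fully spell out its bias convention, so this is only a sketch:

```python
def decode8(sign, exp_bits, sig_bits):
    """Decode the toy 8-bit format: (-1)**sign * 1.xxx * 2**e, where the
    4-bit significand holds the integer bit plus 3 fraction bits."""
    e = int(exp_bits, 2)
    m = int(sig_bits, 2) / 2**3
    return (-1)**sign * m * 2**e

print(decode8(0, '001', '1001'))  # 2.25
print(decode8(0, '011', '1100'))  # 12.0
print(decode8(0, '111', '1110'))  # 224.0
```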
Distribution of Floating Point Numbers
(3-bit mantissa 1.xx; 2-bit exponent field, e ∈ {-1, 0, 1})
e = -1: 1.00 × 2^-1 = 1/2, 1.01 × 2^-1 = 5/8, 1.10 × 2^-1 = 3/4, 1.11 × 2^-1 = 7/8
e = 0:  1.00 × 2^0 = 1,    1.01 × 2^0 = 5/4,  1.10 × 2^0 = 3/2,  1.11 × 2^0 = 7/4
e = 1:  1.00 × 2^1 = 2,    1.01 × 2^1 = 5/2,  1.10 × 2^1 = 3,    1.11 × 2^1 = 7/2
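Enumerating this toy format shows why floating-point values are denser near zero: the spacing doubles every time the exponent increases by one. A minimal sketch:

```python
from fractions import Fraction

# 3-bit mantissa 1.xx (two fraction bits), exponent e in {-1, 0, 1}
values = sorted(Fraction(4 + m, 4) * Fraction(2)**e
                for e in (-1, 0, 1) for m in range(4))
print(values)  # 12 values from 1/2 up to 7/2
# spacing within one exponent range is 2**-2 * 2**e, doubling with e
```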
Father of the Floating Point Standard
Prof. Kahan - IEEE Standard 754 for Binary Floating-Point Arithmetic.
1989 ACM Turing Award winner!
www.cs.berkeley.edu/~wkahan/ieee754status/754story.html
IEEE Floating Point
Defined by IEEE Std 754-1985.
Developed in response to divergence of representations:
established in 1985 as a uniform standard for floating-point
arithmetic; before that, many idiosyncratic formats caused
portability issues for scientific code.
Supported by all major CPUs; now almost universally adopted.
Two representations:
Single precision (32-bit)
Double precision (64-bit)
Driven by numerical concerns:
Nice standards for rounding, overflow, underflow.
Hard to make go fast: numerical analysts predominated over
hardware types in defining the standard.
IEEE 754 Floating-Point Format
Layout: S | Exponent | Fraction
Exponent: single 8 bits, double 11 bits. Fraction: single 23 bits, double 52 bits.
x = (-1)^S × (1 + Fraction) × 2^(Exponent - Bias)
Single precision: bit 31 - sign; bits 30-23 - biased exponent;
bits 22-0 - normalized mantissa (the leading 1 bit is implicit):
x = (-1)^S × 1.F × 2^(E - 127)
S: sign bit (0 for non-negative, 1 for negative).
Normalized significand: 1.0 ≤ |significand| < 2.0.
It always has a leading pre-binary-point 1 bit, so there is no need
to represent it explicitly (hidden bit); the significand is the
Fraction with the "1." restored.
Exponent: excess representation: stored exponent = actual exponent + Bias.
This ensures the stored exponent is unsigned.
Single: Bias = 127; Double: Bias = 1023.
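The single-precision fields can be pulled apart with Python's struct module, which exposes the raw 32-bit pattern of a float (variable names are illustrative):

```python
import struct

def decode_single(x):
    """Return (sign, biased exponent, 23-bit fraction) of x as an IEEE single."""
    (bits,) = struct.unpack('>I', struct.pack('>f', x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

s, e, f = decode_single(-0.75)
print(s, e - 127, 1 + f / 2**23)   # 1 -1 1.5  ->  -0.75 = -1.5 * 2**-1
```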
2. Quantization error in number representation
Quantization
1. Fixed-point: truncation
To truncate a fixed-point number from (β+1) bits to (b+1) bits, we simply
discard the least significant (β-b) bits. The truncation error is denoted by
ε_t = Q(X) - X
Here Q(X) is the truncated version of the number X. For a positive X, the
error is equal to zero if all bits being discarded are zeros and is largest if all
discarded bits are ones:
-(2^-b - 2^-β) ≤ ε_t ≤ 0
Quantization
For a negative X, the truncation error will be different for three different
formats:
1) Sign-magnitude:    0 ≤ ε_t ≤ (2^-b - 2^-β)
2) Ones-complement:   0 ≤ ε_t ≤ (2^-b - 2^-β)
3) Twos-complement:   -(2^-b - 2^-β) ≤ ε_t ≤ 0
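Two's-complement truncation of a binary word behaves like rounding toward minus infinity, so the error is never positive regardless of sign; a quick check (b and the sample values are arbitrary):

```python
import math

def truncate(x, b):
    """Truncate x to b fraction bits, two's-complement style (floor)."""
    return math.floor(x * 2**b) / 2**b

for x in (0.7, -0.7):
    err = truncate(x, 4) - x
    print(f"x = {x:+.2f}: error = {err:+.6f}")
    assert -2**-4 < err <= 0       # error always in (-2**-b, 0]
```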
Quantization
2. Fixed-point: rounding
In the case of rounding, the number is quantized to the nearest quantization
level. The rounding error does not depend on the format used to represent
negative numbers:
-(1/2)(2^-b - 2^-β) ≤ ε_r ≤ (1/2)(2^-b - 2^-β)
In practice, β >> b; therefore, 2^-β ≈ 0 in all the expressions considered.
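Rounding to b fraction bits keeps the error within half a quantization step for either sign, which is what makes it preferable to truncation; a minimal check:

```python
def round_to_b(x, b):
    """Round x to the nearest multiple of 2**-b."""
    return round(x * 2**b) / 2**b

for x in (0.7, -0.7, 0.1):
    err = round_to_b(x, 4) - x
    assert abs(err) <= 2**-4 / 2   # |error| <= (1/2) * 2**-b
```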
Quantization Noise
Quantization mechanisms (fixed point):
Rounding: the error is uniformly distributed between -2^-b/2 and 2^-b/2,
with probability density 2^b; the distribution is the same for all sign formats.
Truncation, twos complement (and all positive numbers): the error lies
between -2^-b and 0, with probability density 2^b.
Truncation, sign magnitude and ones complement: the error lies between
-2^-b and 2^-b, with probability density 2^b/2.
(The slide's input/output staircase plots and error probability-density sketches are not reproduced here.)
Quantization Noise
Quantization mechanisms (floating point): quantization acts on the
mantissa, so the relative error is considered.
Rounding: the relative error is uniformly distributed between -2^-b and
2^-b, with probability density 2^b/2.
Truncation, twos complement (and all positive numbers): the relative error
lies between -2·2^-b and 0, with probability density 2^b/2.
Truncation, sign magnitude and ones complement: the relative error lies
between -2·2^-b and 2·2^-b, with probability density 2^b/4.
(Input/output and error probability-density plots omitted.)
Quantization
3. Floating-point
Consider a floating-point representation of a number, X = 2^E · M.
Its quantized version is
Q(X) = 2^E · Q(M)
Quantization is carried out on the mantissa only in the case of floating-point
numbers. Therefore, it is more reasonable to consider the relative error:
ε = (Q(X) - X) / X = (Q(M) - M) / M
In practice, a rounding quantizer can be modeled as follows:
Q(X) = 2^-B · round(X · 2^B)
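This rounding-quantizer model is a one-liner; increasing B shrinks the error below 2^-B/2 (the sample value is arbitrary):

```python
def quantize(x, B):
    """Model of a rounding quantizer: Q(x) = 2**-B * round(x * 2**B)."""
    return 2**-B * round(x * 2**B)

x = 0.123456
for B in (4, 8, 16):
    print(B, quantize(x, B), abs(quantize(x, B) - x))
    assert abs(quantize(x, B) - x) <= 2**-B / 2
```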