CEF352 Lect2

The document outlines a course on floating-point arithmetic and the IEEE 754 specification. It discusses how computers represent numbers in floating-point format using a sign bit, exponent field, and mantissa field. Numbers are stored as normalized binary numbers. The IEEE 754 standard defines single and double precision floating-point formats. Special values like infinity, NaN, normalized, denormalized and inexact numbers are also covered. Operations like multiplication and division are performed by adding exponents and multiplying mantissas.

Uploaded by

Tabi Joseph

Course outline: CEF352

Chapter 1: Floating-point arithmetic (with IEEE 754 specifications)

1. Computer representation of numbers: special numbers

2. Floating-point formats, accuracy requirements and floating-point exceptions

3. Ranges and precisions in decimal representation

4. Machine epsilon

5. Number rounding: direction, precision, significant figures and round-off error


Chapter 1: Floating-point arithmetic (with IEEE 754 specifications)

Normalized scientific notation

Scientific or exponential notation: writing a number in the form $\pm M \times b^{P}$, where M is a fractional number with a single digit to the left of the decimal point.

Illustration: the numbers 0.7e-2, $0.7 \times 10^{-2}$ and 0.007 are the same.

Example (in decimal): $5.4 \times 10^{-5}$, $1.25 \times 10^{-5}$, $0.125 \times 10^{-4}$, $0.0125 \times 10^{-2}$

The first two numbers are normalized while the last two are not.

Scientific notation is said to be normalized when the mantissa has no leading zeros, i.e. the single digit to the left of the decimal point is nonzero.

Example of a normalized number (in binary): $1.01 \times 2^{-5}$, or any number of the form $1.m_1 m_2 \ldots \times 2^{p_1 p_2 \ldots}$
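As a quick check of the decimal examples, Python's `e` format code prints a number in normalized scientific notation. This is only a sketch; the helper name `normalized_decimal` is made up for illustration.

```python
def normalized_decimal(x: float, digits: int = 2) -> str:
    """Format x as d.ddd e±pp with exactly one nonzero digit before the point."""
    return f"{x:.{digits}e}"

# The two non-normalized examples from the slide, rewritten in normalized form.
print(normalized_decimal(0.125e-4))    # 1.25e-05
print(normalized_decimal(0.0125e-2))   # 1.25e-04
print(normalized_decimal(0.007, 1))    # 7.0e-03
```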

Floating number representation: definitions

Bit = 0 or 1 (binary digit)
Byte = 8 bits
Word = Reals: 4 bytes (single precision) or 8 bytes (double precision)
     = Integers: signed or unsigned, 1, 2, 4, or 8 bytes
Machines store real numbers as normalized binary numbers in a very specific way called floating point, with two IEEE formats: single precision and double precision.

A floating-point format can only represent a finite set of numbers (those that can be written as per the specifications of the format).
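The finiteness can be made concrete with a small sketch. It assumes the 32-bit single-precision layout described in the following slides and uses Python's standard `struct` module to reinterpret the bytes; the helper name `next_float32_up` is hypothetical.

```python
import struct

def next_float32_up(x: float) -> float:
    """Return the next representable single-precision number above a positive finite x."""
    # Reinterpret the 4 bytes of the float32 as an unsigned 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    # For positive finite values, adding 1 to the bit pattern gives the next larger number.
    (y,) = struct.unpack(">f", struct.pack(">I", bits + 1))
    return y

print(2**32)                                   # distinct 32-bit patterns: 4294967296
print(next_float32_up(1.0) - 1.0 == 2**-23)    # True: no single-precision number lies in between
```

The gap just above 1.0 is the quantity that item 4 of the outline calls machine epsilon.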

Floating number representation: IEEE floating-point formats, overflow/underflow


$r = (-1)^{s} \times M \times b^{P}$, where M = mantissa, b = base, P = exponent, s = sign.
Examples:
Decimal: $0.00527_{10} = 0.527_{10} \times 10^{-2}$
Binary: $10.1_{2} = 0.101_{2} \times 2^{2_{10}} = 0.101_{2} \times 2^{10_{2}}$

b = 2 (most modern computers)
b = 10 (our mind, shop's calculator)

Single-precision format (uses 32 bits), bit index 31, 30, …, 0:
sign: bit 31; exponent: bits 30 ... 23; mantissa: bits 22 ... 0

Field name                   Sign (s)   Exponent (P)   Mantissa (M)   Approx. range
Bit No. (single precision)   0          1–8            9–31           $\pm 10^{-38} \ldots 10^{38}$
                             (1 bit)    (8 bits)       (23 bits)
Bit No. (double precision)   0          1–11           12–63          $\pm 10^{-308} \ldots 10^{308}$
                             (1 bit)    (11 bits)      (52 bits)

(In the table, "Bit No." counts from the leftmost, most significant bit.)

Overflow: the exponent is too large to be represented in the Exponent field.
Underflow: the exponent is too small to be represented in the Exponent field.
bit = binary digit
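The single-precision field layout can be inspected directly. The following is a minimal sketch using Python's standard `struct` module; the function name `float32_fields` is an assumption made for illustration. Note that the exponent field holds the stored (offset) exponent rather than P itself.

```python
import struct

def float32_fields(x: float) -> tuple[int, int, int]:
    """Return (sign, stored exponent field, mantissa field) of x in single precision."""
    # Pack as a big-endian float32, then read the same 4 bytes as an unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # bit index 31 (1 bit)
    exponent = (bits >> 23) & 0xFF    # bit indices 30..23 (8 bits)
    mantissa = bits & 0x7FFFFF        # bit indices 22..0 (23 bits)
    return sign, exponent, mantissa

print(float32_fields(1.0))    # (0, 127, 0)
print(float32_fields(-2.0))   # (1, 128, 0)
```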

Floating number representation: IEEE floating-point number


Convention 1: Since the normalized binary mantissa is always of the form 1.x...x (except for 0), omit the leading 1 and store only the fractional part of the mantissa (the IEEE mantissa, f).
Convention 2: If the actual exponent is p, it is stored as E = p + bias.
IEEE floating-point number representation:
$r = (-1)^{s} \times (1 + 0.f) \times 2^{E - \mathrm{bias}}$,
where f = IEEE mantissa (fractional part of the standard mantissa), p = exponent, s = sign, and bias = 127 for single precision and 1023 for double precision. E is the so-called biased exponent, 1.f is the significand, and the dot is the radix point (binary point for base 2, decimal point for base 10).
Advantages of IEEE formatting

One extra bit of precision at no storage cost, because the leading 1 is implicit (single precision: 23+1 bits, double precision: 52+1 bits)

No comparison issues with negative exponents
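The representation formula can be evaluated directly from the three fields. This is a sketch for normalized single-precision numbers only; the name `decode_float32` is hypothetical, and the field values E = 0 and E = 255 are rejected because they are reserved for special numbers.

```python
def decode_float32(s: int, E: int, f: int) -> float:
    """Evaluate r = (-1)^s * (1 + 0.f) * 2^(E - bias) for single precision."""
    bias = 127
    if not 1 <= E <= 254:
        raise ValueError("E = 0 and E = 255 are reserved for special numbers")
    # f is the 23-bit integer stored in the mantissa field, so 0.f = f / 2^23.
    return (-1) ** s * (1 + f / 2**23) * 2.0 ** (E - bias)

print(decode_float32(0, 127, 0x400000))   # 1.5
print(decode_float32(0, 126, 0))          # 0.5
```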

Floating number representation: IEEE floating-point numbers


Example of computer numbers: Find the sign, mantissa and biased exponent, and give the single-precision representation of the number 1.5.
1.5 is positive, so s = 0. We have (in decimal) $1.5 = 1 \times 1.5 \times 1 = (-1)^{0} \times 1.5 \times 2^{127-127}$.
Alternatively, convert the number into binary and use the normalized notation in base 2:
$1.5_{10} = 1.1_{2} = 1.1_{2} \times 2^{0} = (-1)^{0} \times 1.1_{2} \times 2^{127-127}$

Hence, s = 0, E = $127_{10}$ = 0111 1111, 0.f = 0.100... => f = 1000 0000 0000 0000 0000 000

Bin (sign, exponent, mantissa): 0 01111111 10000000000000000000000
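The steps of this example can be followed mechanically. The sketch below assumes a nonzero number that is exactly representable in single precision (no rounding is handled), and the name `encode_float32` is made up for illustration.

```python
import math

def encode_float32(x: float) -> str:
    """Build 'sign exponent mantissa' for a nonzero x that needs no rounding."""
    s = 0 if x > 0 else 1
    # math.frexp gives |x| = m * 2**e with 0.5 <= m < 1; rescale to 1.f * 2**p.
    m, e = math.frexp(abs(x))
    significand, p = 2 * m, e - 1
    E = p + 127                             # biased exponent, bias = 127
    f = round((significand - 1) * 2**23)    # the 23 stored fraction bits
    return f"{s} {E:08b} {f:023b}"

print(encode_float32(1.5))   # 0 01111111 10000000000000000000000
```

The same function can be used to check the coursework answers for 2.0 and 0.5.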


Coursework: Find the sign, mantissa and biased exponent, and give the single-precision representation of the numbers 2.0 and 0.5.
$2.0 = 2^{1} = (-1)^{0} \times 1.0 \times 2^{128-127}$, $0.5 = 1/2 = 2^{-1} = (-1)^{0} \times 1.0 \times 2^{126-127}$

Floating number representation: IEEE floating-point numbers


Inexact numbers:
In the decimal system, only rational numbers whose denominator (in lowest terms) factorizes into powers of 2 and 5 (i.e., $a/(2^{n} \times 5^{m})$) have a terminating expansion; all others do not.

Similarly, in the binary system, only rational numbers whose denominator is a power of 2 have a terminating expansion.
Example:
$-1/3 = -(0.0101010101\ldots)_{2} = (-1)^{1} \times (1.01010101\ldots)_{2} \times 2^{-2} = (-1)^{1} \times (1 + 0.01010101\ldots)_{2} \times 2^{125-127}$

The single-precision representation of -1/3 is then 10111110101010101010101010101011 (the last mantissa bit is rounded up).

$1/10 = (0.00011001100110011\ldots)_{2}$
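Both statements can be checked with a short sketch. It uses Python's standard `fractions` and `struct` modules; the helper names `terminates_in_binary` and `float32_bits` are assumptions for illustration.

```python
from fractions import Fraction
import struct

def terminates_in_binary(q: Fraction) -> bool:
    """True if q has a finite binary expansion (reduced denominator is a power of 2)."""
    d = q.denominator                 # Fraction always stores q in lowest terms
    return (d & (d - 1)) == 0         # a power of two has exactly one bit set

def float32_bits(x: float) -> str:
    """32-character bit string of x rounded to single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return f"{bits:032b}"

print(terminates_in_binary(Fraction(7, 16)))   # True
print(terminates_in_binary(Fraction(1, 3)))    # False
print(terminates_in_binary(Fraction(1, 10)))   # False
print(float32_bits(-1 / 3))                    # 10111110101010101010101010101011
```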
1. Computer representation of numbers: special numbers
Floating number representation: IEEE floating-point number, underflow, overflow
Comparison issues with negative exponents

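Number                  Sign   Exponent   Mantissa
$1.0 \times 2^{-1}$     0      11111111   0000000 00000000 00000000
$1.0 \times 2^{+1}$     0      00000001   0000000 00000000 00000000

$1.0 \times 2^{-1} < 1.0 \times 2^{+1}$, but 11111111 > 00000001!

If the exponent were stored as a signed (two's complement) number, the exponent field of the smaller number would read as the "larger" binary pattern, making direct comparison more difficult.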
Floating number representation: IEEE floating-point number, underflow, overflow
Comparison issues with negative exponents: the bias exponent

Number                  Exponent   Biased exponent (dec)   Biased exponent (bin)
$1.0 \times 2^{-1}$     -1         -1 + 127 = 126          $01111110_{2}$
$1.0 \times 2^{+1}$     +1         +1 + 127 = 128          $10000000_{2}$

$1.0 \times 2^{-1} < 1.0 \times 2^{+1}$, and 01111110 < 10000000: the difficulty is solved with the biased exponent!
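The comparison can be reproduced with a small sketch. It contrasts an 8-bit two's complement encoding of the exponent (an assumption about what the unbiased table illustrates) with the biased exponent actually stored in single precision; both helper names are hypothetical.

```python
import struct

def twos_complement_8bit(p: int) -> int:
    """Unsigned reading of p stored as an 8-bit two's complement integer."""
    return p & 0xFF

def exponent_field(x: float) -> int:
    """Biased exponent E (bit indices 30..23) of x stored in single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return (bits >> 23) & 0xFF

# 0.5 = 1.0 x 2^-1 and 2.0 = 1.0 x 2^+1
print(twos_complement_8bit(-1), twos_complement_8bit(+1))   # 255 1   (wrong order)
print(exponent_field(0.5), exponent_field(2.0))             # 126 128 (right order)
```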

Floating number representation: range of the mantissa

Convention: general range of the mantissa: $b^{-1} \leq M < b^{0}$
Decimal: $10^{-1} \leq M < 10^{0} \Leftrightarrow 0.1 \leq M < 1 \Rightarrow \min(M) = 0.1,\ \max(M) = 0.9999\ldots9$
Binary: $2^{-1} \leq M < 2^{0} \Leftrightarrow 0.5_{10} \leq M < 1_{10} \Rightarrow \min(M) = 0.5_{10},\ \max(M) = 0.9999\ldots9_{10}$

Case b=2     Max mantissa   Min mantissa   Max exponent   Min exponent   Range of $b^{e}$
Binary       0.111...1      0.10...0       $2^{7} - 1$    $-2^{7}$       $[2^{-128}, 2^{127}]$
in decimal   0.999...9      0.5            127            -128           $[2.9 \cdot 10^{-39}, 1.7 \cdot 10^{38}]$

Note: Computer arithmetic that supports normalized binary numbers is called Floating Point Arithmetic.
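The decimal values in the last column of the table follow from the exponent limits. A minimal sketch (variable names are illustrative only):

```python
max_exponent = 2**7 - 1       # 127
min_exponent = -(2**7)        # -128

smallest = 2.0 ** min_exponent
largest = 2.0 ** max_exponent

# Format with one fractional digit to compare with the table.
print(f"[{smallest:.1e}, {largest:.1e}]")   # [2.9e-39, 1.7e+38]
```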

Floating number representation: special numbers



Floating Arithmetic operations: Multiplication and Division


$r_1 = m_1 b^{e_1},\ r_2 = m_2 b^{e_2} \Rightarrow r_1 \times r_2 = m_1 m_2\, b^{e_1 + e_2}, \quad m_1 m_2 < 1$
Example 1 (decimal): Multiply the following two numbers in normalized scientific notation: $1.110 \times 10^{10}$ and $9.200 \times 10^{-4}$

1. Add the exponents: e = 10 + (-4) = 6

2. Multiply the mantissas: m = 1.110 × 9.200 = 10.212000

3. Keeping only 3 digits in the fractional part: m = 10.212, and thus $r = 10.212 \times 10^{6}$

4. Normalize the result: $r = 1.021 \times 10^{7}$
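The four steps can be sketched with Python's `decimal` module, which does exact decimal arithmetic. The function name `multiply_sci` is hypothetical, both mantissas are assumed to be normalized (one nonzero digit before the point), and a carry produced by the final rounding is not handled.

```python
from decimal import Decimal

def multiply_sci(m1: str, e1: int, m2: str, e2: int, digits: int = 3):
    """Multiply m1 x 10^e1 by m2 x 10^e2 following the four steps above."""
    step = Decimal(10) ** -digits        # 0.001 for three fractional digits
    e = e1 + e2                          # 1. add the exponents
    m = Decimal(m1) * Decimal(m2)        # 2. multiply the mantissas exactly
    m = m.quantize(step)                 # 3. keep `digits` fractional digits
    if abs(m) >= 10:                     # 4. normalize: shift one place, adjust exponent
        m = (m / 10).quantize(step)      #    quantize rounds half to even by default
        e += 1
    return m, e

print(multiply_sci("1.110", 10, "9.200", -4))   # (Decimal('1.021'), 7)
```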



Floating Arithmetic operations: Multiplication and Division

Example 2 (binary): Multiply the following two binary normalized numbers: $1.000 \times 2^{-1}$ and $-1.110 \times 2^{-2}$
1. Sign = minus
2. Add the exponents: e = -1 + (-2) = -3
3. Multiply the mantissas: m = 1.000 × 1.110 = 1.110000
4. Keeping only 3 digits in the fractional part: m = 1.110, and thus $r = -1.110 \times 2^{-3}$
Result is already normalized!

Long multiplication of the mantissas:

       1.000
     × 1.110
    --------
        0000
       1000
      1000
   + 1000
   ---------
     1110000   ==> 1.110000
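Steps 1 to 3 can be reproduced with integer arithmetic on the mantissa digits. This is only a sketch: the helper names are made up, the mantissas are given as binary strings, and the truncation to 3 fractional digits and any renormalization (step 4) are left out.

```python
def parse_mantissa(text: str) -> tuple[int, int, int]:
    """Split e.g. '-1.110' into (sign bit, digits read as an integer, fraction bits)."""
    sign = 1 if text.startswith("-") else 0
    whole, frac = text.lstrip("+-").split(".")
    return sign, int(whole + frac, 2), len(frac)

def multiply_binary(m1: str, e1: int, m2: str, e2: int) -> tuple[int, str, int]:
    """Return (sign bit, exact product mantissa, exponent) of m1 x 2^e1 times m2 x 2^e2."""
    s1, n1, k1 = parse_mantissa(m1)
    s2, n2, k2 = parse_mantissa(m2)
    sign = s1 ^ s2                     # 1. signs combine by exclusive or
    e = e1 + e2                        # 2. add the exponents
    product = n1 * n2                  # 3. integer product carries k1 + k2 fraction bits
    frac_bits = k1 + k2
    digits = f"{product:b}".zfill(frac_bits + 1)
    mantissa = digits[:-frac_bits] + "." + digits[-frac_bits:]
    return sign, mantissa, e

print(multiply_binary("1.000", -1, "-1.110", -2))   # (1, '1.110000', -3)
```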

Floating Arithmetic operations: Addition and subtraction


$r_1 = m_1 b^{e_1},\ r_2 = m_2 b^{e_2},\ e_1 > e_2 \Rightarrow r_1 \pm r_2 = (m_1 \pm m_2\, b^{e_2 - e_1})\, b^{e_1}$

Example 1 (decimal): Add the following two numbers in normalized scientific notation: $9.970 \times 10^{1}$ and $8.740 \times 10^{-1}$
1. Rewrite the smaller number using the exponent of the larger number: $8.740 \times 10^{-1} = (8.740 \times 10^{-2}) \times 10^{1} = 0.08740 \times 10^{1}$
2. Add the mantissas: m = 9.970 + 0.08740 = 10.05740 ==> $r = 10.05740 \times 10^{1}$
3. Normalize the result (if necessary, shift the mantissa and adjust the exponent): $r = 10.05740 \times 10^{1} = 1.005740 \times 10^{2}$
4. Check for overflow/underflow of the exponent and round off: $r = 1.006 \times 10^{2}$
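The alignment, addition, normalization and rounding steps can be sketched with the `decimal` module. The function name `add_sci` is hypothetical; the overflow/underflow check on the exponent and a possible carry from the final rounding are not handled.

```python
from decimal import Decimal

def add_sci(m1: str, e1: int, m2: str, e2: int, digits: int = 3):
    """Add m1 x 10^e1 and m2 x 10^e2 following the steps above."""
    if e1 < e2:                            # make (m1, e1) the term with the larger exponent
        m1, e1, m2, e2 = m2, e2, m1, e1
    a = Decimal(m1)
    b = Decimal(m2).scaleb(e2 - e1)        # 1. rewrite the smaller number with exponent e1
    m, e = a + b, e1                       # 2. add the mantissas
    while abs(m) >= 10:                    # 3. normalize: shift right ...
        m, e = m.scaleb(-1), e + 1
    while m != 0 and abs(m) < 1:           #    ... or left (after cancellation)
        m, e = m.scaleb(1), e - 1
    m = m.quantize(Decimal(10) ** -digits) # 4. round to `digits` fractional digits
    return m, e

print(add_sci("9.970", 1, "8.740", -1))    # (Decimal('1.006'), 2)
```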

Floating Arithmetic operations: Addition and subtraction


Example 2 (binary): Write the binary normalized notation of the numbers 0.5 and -0.4375, and add them.
1. Binary normalized notation: $0.5_{10} = 0.1_{2} = 0.1_{2} \times 2^{0} = 1.000_{2} \times 2^{-1}$, and $-0.4375_{10} = -0.0111_{2} = -0.0111_{2} \times 2^{0} = -1.110_{2} \times 2^{-2}$
2. Rewrite the smaller number using the exponent of the larger number: $-1.110 \times 2^{-2} = -(1.110 \times 2^{-1}) \times 2^{-1} = -0.111 \times 2^{-1}$
3. Add the mantissas: m = 1.000 + (-0.111) = 0.001 ==> $r = 0.001 \times 2^{-1}$
4. Normalize the result: $r = 1.000 \times 2^{-4}$, with $-4 \in [-126, 127]$
5. No overflow/underflow, no rounding required: $r = 1.000 \times 2^{-4}$
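The result can be cross-checked numerically. The sketch below uses `math.frexp` from the standard library to recover the normalized form 1.f × 2^p; the helper name `normalized` is an assumption, and the check relies on both operands being exactly representable in binary.

```python
import math

def normalized(x: float) -> tuple[float, int]:
    """Return (significand, p) with x = significand * 2**p and 1 <= |significand| < 2."""
    m, e = math.frexp(x)      # x = m * 2**e with 0.5 <= |m| < 1 (x nonzero)
    return 2 * m, e - 1

a, b = 0.5, -0.4375
print(normalized(a))        # (1.0, -1)    i.e.  1.000 x 2^-1
print(normalized(b))        # (-1.75, -2)  i.e. -1.110 x 2^-2
print(normalized(a + b))    # (1.0, -4)    i.e.  1.000 x 2^-4
```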

Floating Arithmetic operations: Coursework

Exercise 1: Find the sign, mantissa and biased exponent, and write the single-precision representation of the decimal numbers -1.5, 0.2 and 4.
Exercise 2: Find the sign, mantissa and biased exponent, and write the single-precision representation of the binary numbers -0.1 and 0.00101.
Exercise 3: Write the binary normalized notation of the numbers 1.5 and -0.6375, and add them.
Exercise 4: Write the binary normalized notation of the numbers 12.0 and -0.2375, and multiply them.

Floating Arithmetic operations: Coursework

Exercise 5: On changing from IEEE single to double precision, by how many bits do the mantissa and exponent fields each grow? Deduce whether the change prioritizes the precision or the range of expressible numbers.
