CEF352 Lect2

The document outlines a course on floating-point arithmetic and the IEEE 754 specification. It discusses how computers represent numbers in floating-point format using a sign bit, exponent field, and mantissa field. Numbers are stored as normalized binary numbers. The IEEE 754 standard defines single and double precision floating-point formats. Special values like infinity, NaN, normalized, denormalized and inexact numbers are also covered. Operations like multiplication and division are performed by adding exponents and multiplying mantissas.

Uploaded by

Tabi Joseph

Course outline: CEF352

Chapter 1: Floating-point arithmetic (with IEEE 754 specifications)

1. Computer representation of numbers: special numbers

2. Floating-point formats, accuracy requirements and floating-point exceptions

3. Ranges and precisions in decimal representation

4. Machine epsilon

5. Number rounding: direction, precision, significant figures and round-off error

Chapter 1: Floating-point arithmetic (with IEEE 754 specifications)

Normalized scientific notation

Scientific or exponential notation: writing a number in the form $\pm M \times b^{P}$, where $M$ is a fractional number with a single digit to the left of the decimal point.

Illustration: the numbers 0.7e-2, $0.7 \times 10^{-2}$ and 0.007 are the same.

Example (in decimal): $5.4 \times 10^{-5}$, $1.25 \times 10^{-5}$, $0.125 \times 10^{-4}$, $0.0125 \times 10^{-2}$

The first two numbers are normalized while the last two are not.

Scientific notation is said to be normalized when the number has no leading zeros, i.e. the single digit to the left of the point is nonzero.

Example of a normalized number (in binary): $1.01 \times 2^{-5}$, and more generally any number of the form $1.m_1 m_2 \ldots \times 2^{p_1 p_2 \ldots}$
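As a rough illustration of the definition above, the following Python sketch rewrites a nonzero decimal number as a mantissa with a single nonzero leading digit times a power of 10. The function name `normalize` is an assumption made for this example and is not part of the course material.

```python
from decimal import Decimal

def normalize(text):
    """Return (M, P) with 1 <= |M| < 10 such that the number equals M * 10**P."""
    d = Decimal(text)
    if d == 0:
        raise ValueError("zero has no normalized form")
    exponent = d.adjusted()          # power of ten of the leading nonzero digit
    mantissa = d.scaleb(-exponent)   # exact shift of the decimal point
    return mantissa, exponent

for number in ["5.4e-5", "0.125e-4", "0.0125e-2", "0.007"]:
    print(number, "->", normalize(number))
```

Using `Decimal` rather than `float` keeps the digits exactly as written, so the shift of the decimal point introduces no rounding.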

Floating number representation: definitions

Bit = 0 or 1 (binary digit)

Byte = 8 bits

Word:
Reals: 4 bytes (single precision) or 8 bytes (double precision)
Integers: signed or unsigned, each of 1, 2, 4 or 8 bytes

Machines store real numbers as normalized binary numbers in a very specific way called floating point, with two IEEE formats: single precision and double precision.

A floating-point format can only represent a finite set of numbers (those that can be written as per the specifications of the format).
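The storage sizes and the finiteness of the format can be checked with a short Python sketch. It assumes the standard `struct` module, whose `f` and `d` codes correspond to the 4-byte and 8-byte formats when a byte order such as `<` is given; the helper name `to_single` is made up for this example.

```python
import struct

def to_single(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Storage size in bytes of the two formats.
print(struct.calcsize("<f"), struct.calcsize("<d"))

# 16777217 = 2**24 + 1 needs 25 significant bits, one more than single
# precision provides, so it is replaced by a neighbouring representable number.
print(to_single(16777217.0))
```

Because only finitely many bit patterns exist, most real numbers are stored as a nearby representable value rather than exactly.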

Floating number representation: IEEE floating-point formats, overflow/underflow

$r = (-1)^{s} \times M \times b^{P}$, where $M$ = mantissa, $b$ = base, $P$ = exponent, $s$ = sign.

Examples:
Decimal: $0.00527_{10} = 0.527_{10} \times 10^{-2}$
Binary: $10.1_{2} = 0.101_{2} \times 2^{2_{10}} = 0.101_{2} \times 2^{10_{2}}$

b = 2 (most modern computers)
b = 10 (our mind, shop's calculator)

Single-precision format (uses 32 bits):

Bit index:   31     30 ... 23    22 ... 0
Field:       sign   exponent     mantissa

Field name                    Sign (s)   Exponent (P)   Mantissa (M)   Approx. range
Bit No. (single precision)    0          1–8            9–31           $\pm 10^{-38} \ldots 10^{38}$
Width (single precision)      1 bit      8 bits         23 bits
Bit No. (double precision)    0          1–11           12–63          $\pm 10^{-308} \ldots 10^{308}$
Width (double precision)      1 bit      11 bits        52 bits

"Bit No." counts from the leftmost bit (the sign), whereas the bit index counts from the rightmost bit.

Overflow: the exponent is too large to be represented in the exponent field.
Underflow: the exponent is too small to be represented in the exponent field.

bit = binary digit
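The field layout can be made concrete with a small Python sketch that cuts the 32-bit pattern of a number into its sign, exponent and mantissa fields. The function name `single_fields` is an assumption for this example; the sketch relies only on the standard `struct` module.

```python
import struct

def single_fields(x):
    """Split the 32-bit single-precision pattern of x into its three fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    text = f"{bits:032b}"            # leftmost character is "Bit No." 0
    return text[0], text[1:9], text[9:32]

sign, exponent, mantissa = single_fields(-6.5)
print("sign    :", sign)
print("exponent:", exponent)
print("mantissa:", mantissa)
```

The slices `[0]`, `[1:9]` and `[9:32]` follow the "Bit No." columns of the table, counted from the left.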

Floating number representation: IEEE floating-point number

Convention 1: Since the normalized binary mantissa always has the form 1.x...x (except for 0), omit the leading 1 and store only the fractional part of the mantissa (the IEEE mantissa, f).
Convention 2: If the actual exponent is p, it is stored as E = p + bias.
IEEE floating-point number representation:
$r = (-1)^{s} \times (1 + 0.f) \times 2^{E - \mathrm{bias}}$,
where f = IEEE mantissa (fractional part of the standard mantissa), p = exponent, s = sign, and bias = 127 for single precision and 1023 for double precision. E is the so-called biased exponent, 1.f is the significand, and the dot is the radix point (binary point for base 2, decimal point for base 10).
Advantages of IEEE formatting

More precision from the same storage for f: the omitted leading 1 gives one extra bit (single precision: 23+1 bits, double precision: 52+1 bits)

Avoids comparison issues with negative exponents
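The representation formula can be evaluated directly. The sketch below is a minimal illustration for normalized numbers only; the function name `decode_single` and its parameter names are assumptions for this example, and the fraction field f is passed as an integer holding the stored bits.

```python
def decode_single(s, E, f, bias=127, frac_bits=23):
    """Evaluate (-1)**s * (1 + 0.f) * 2**(E - bias) for a normalized number.

    s is the sign bit, E the biased exponent and f the stored fraction field
    taken as an integer, so that 0.f equals f / 2**frac_bits.
    """
    return (-1) ** s * (1 + f / 2 ** frac_bits) * 2.0 ** (E - bias)

# Fraction bits 100...0 mean 0.f = 0.5, so the significand is 1.5.
print(decode_single(0, 127, 1 << 22))
# The same formula with the double-precision parameters.
print(decode_single(0, 1023, 0, bias=1023, frac_bits=52))
```

Special values such as zero, infinity and NaN use reserved exponent patterns and are deliberately left out of this sketch.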

Floating number representation: IEEE floating-point numbers

Example of computer numbers: Find the sign, mantissa and biased exponent, and give the single-precision representation of the number 1.5.
1.5 is positive, so s = 0. We have (in decimal) $1.5 = 1 \times 1.5 \times 1 = (-1)^{0} \times 1.5 \times 2^{127-127}$.
Alternatively, convert the number into binary and use the normalized notation in base 2:
$1.5_{10} = 1.1_{2} = 1.1_{2} \times 2^{0} = (-1)^{0} \times 1.1_{2} \times 2^{127-127}$

Hence s = 0, E = $127_{10}$ = $0111\,1111_{2}$, 0.f = 0.100... ⇒ f = 100 0000 0000 0000 0000 0000

Bin: 0 | 0111 1111 | 100 0000 0000 0000 0000 0000

Coursework: Find the sign, mantissa and biased exponent, and give the single-precision representation of the numbers 2.0 and 0.5.
$2.0 = 2 = 2^{1} = (-1)^{0} \times 1.0 \times 2^{128-127}$, $0.5 = 1/2 = 2^{-1} = (-1)^{0} \times 1.0 \times 2^{126-127}$
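The hand procedure used in this example can be sketched in Python and compared with the pattern the machine actually stores. The names `encode_single` and `reference` are assumptions for this illustration, and the sketch only handles nonzero numbers that are exactly representable and in the normalized range.

```python
import math
import struct

def encode_single(x):
    """Hand-encode x as 'sign exponent fraction' following the slide's steps."""
    s = 0 if x > 0 else 1
    m, e = math.frexp(abs(x))            # abs(x) = m * 2**e with 0.5 <= m < 1
    significand, p = 2 * m, e - 1        # rewrite as 1.f * 2**p
    E = p + 127                          # biased exponent
    f = int((significand - 1) * 2 ** 23) # fraction bits as an integer
    return f"{s:01b} {E:08b} {f:023b}"

def reference(x):
    """Bit pattern produced by the platform's own conversion."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return f"{bits:032b}"

for value in (1.5, 2.0, 0.5):
    print(value, encode_single(value), reference(value))
```

For these three inputs the hand-built fields and the stored pattern should agree bit for bit.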

Floating number representation: IEEE floating-point numbers

Inexact numbers:
In the decimal system, only rational numbers whose denominator can be factorized in terms of 2 and 5 (i.e., $a/(2^{n} \times 5^{m})$) have a terminating expansion; all others do not.

Similarly, in the binary system, only rational numbers whose denominator is a power of 2 have a terminating expansion; all others do not.
Example:
$-1/3 = -(0.0101010101\ldots)_{2} = (-1)^{1} \times (1.01010101\ldots)_{2} \times 2^{-2} = (-1)^{1} \times (1 + 0.01010101\ldots_{2}) \times 2^{125-127}$

The single-precision representation of -1/3 is then 1 01111101 01010101010101010101011 (the last stored bit is rounded up).

$1/10 = (0.00011001100110011\ldots)_{2}$
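A short Python sketch can test the power-of-2 criterion and display the rounded pattern of -1/3. The function names `terminates_in_binary` and `single_bits` are assumptions for this example; exact fractions are used so that the test itself is not affected by rounding.

```python
from fractions import Fraction
import struct

def terminates_in_binary(q):
    """True when the reduced denominator of the rational q is a power of 2."""
    d = Fraction(q).denominator
    return (d & (d - 1)) == 0

def single_bits(x):
    """32-bit single-precision pattern of x as a string of 0s and 1s."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return f"{bits:032b}"

for q in (Fraction(1, 2), Fraction(7, 16), Fraction(1, 10), Fraction(-1, 3)):
    print(q, terminates_in_binary(q))

print(single_bits(-1 / 3))
```

Numbers that fail the test, such as 1/10, are stored as the nearest representable value, which is why they are called inexact.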
1. Computer representation of numbers: special numbers
Floating number representation: IEEE floating-point number, underflow, overflow
Comparison issues with negative exponents

Number                 Sign   Exponent   Mantissa
$1.0 \times 2^{-1}$    0      11111111   0000000 00000000 00000000
$1.0 \times 2^{+1}$    0      00000001   0000000 00000000 00000000

$1.0 \times 2^{-1} < 1.0 \times 2^{+1}$, but 11111111 > 00000001!

If the exponent were stored as a signed (two's complement) integer, the smaller number would carry the "larger" exponent bit pattern, making direct comparison more difficult.
Comparison issues with negative exponents: the bias exponent

Number                 Exponent   Biased exponent (dec)   Biased exponent (bin)
$1.0 \times 2^{-1}$    -1         -1 + 127 = 126          $01111110_{2}$
$1.0 \times 2^{+1}$    +1         +1 + 127 = 128          $10000000_{2}$

$1.0 \times 2^{-1} < 1.0 \times 2^{+1}$, and 01111110 < 10000000: the difficulty is solved by the biased exponent!
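The comparison argument can be replayed in Python. The sketch below contrasts an 8-bit two's complement exponent with the biased exponent, then checks that the stored patterns of two positive numbers compare in the same order as the numbers themselves. The helper names are assumptions for this example.

```python
import struct

def twos_complement_8(p):
    """8-bit two's complement pattern of the exponent p, read as unsigned."""
    return p & 0xFF

def biased(p, bias=127):
    """Biased exponent E = p + bias."""
    return p + bias

def pattern(x):
    """Single-precision bit pattern of x, read as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

for p in (-1, +1):
    print(p, f"{twos_complement_8(p):08b}", f"{biased(p):08b}")

# For positive numbers, the order of the patterns matches the order of the values.
print(pattern(0.5) < pattern(2.0))
```

This ordering holds for positive numbers; negative numbers need the sign bit handled separately.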

Floating number representation: range of the mantissa

Convention: general range of the mantissa: $b^{-1} \le M < b^{0}$
Decimal: $10^{-1} \le M < 10^{0} \Leftrightarrow 0.1 \le M < 1 \Rightarrow \min(M) = 0.1,\ \max(M) = 0.9999\ldots9$
Binary: $2^{-1} \le M < 2^{0} \Leftrightarrow 0.5_{10} \le M < 1_{10} \Rightarrow \min(M) = 0.5_{10},\ \max(M) = 0.9999\ldots9_{10}$

Case b = 2    Max mantissa   Min mantissa   Max exponent   Min exponent   Range of $b^{e}$
Binary        0.111...1      0.10...0       $2^{7} - 1$    $-2^{7}$       $[2^{-128}, 2^{127}]$
In decimal    0.999...9      0.5            127            -128           $[2.9 \cdot 10^{-39}, 1.7 \cdot 10^{38}]$

Note: Computer arithmetic that supports normalized binary numbers is called floating-point arithmetic.
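Python's standard `math.frexp` happens to use the same mantissa convention, returning a mantissa of magnitude at least 0.5 and below 1 together with a power of 2. The sketch below uses it to illustrate the binary range and to recompute the decimal bounds in the table; the wrapper name `mantissa_exponent` is an assumption for this example.

```python
import math

def mantissa_exponent(x):
    """Return (M, e) with x = M * 2**e and 0.5 <= |M| < 1 for nonzero x."""
    return math.frexp(x)

for x in (6.0, 1.5, 0.4375):
    m, e = mantissa_exponent(x)
    print(f"{x} = {m} * 2**{e}")

# Decimal equivalents of the bounds 2**-128 and 2**127.
print(f"{2.0 ** -128:.1e}", f"{2.0 ** 127:.1e}")
```

Note that this convention differs from the IEEE significand 1.f by a factor of 2, which is absorbed into the exponent.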

Floating number representation: special numbers


Floating Arithmetic operations: Multiplication and Division

$r_1 = m_1 b^{e_1}$, $r_2 = m_2 b^{e_2}$ $\Rightarrow$ $r_1 \times r_2 = m_1 m_2\, b^{e_1 + e_2}$, $m_1 m_2 < 1$
Example 1 (decimal): Multiply the following two numbers in normalized scientific notation: $1.110 \times 10^{10}$ and $9.200 \times 10^{-4}$

1. Add the exponents: e = 10 + (-4) = 6

2. Multiply the mantissas: m = 1.110 × 9.200 = 10.212000

3. Keeping only 3 digits in the fractional part: m = 10.212, and thus $r = 10.212 \times 10^{6}$

4. Normalize the result: $r = 1.021 \times 10^{7}$
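The four steps can be sketched with Python's `decimal` module. This is an illustration under stated assumptions, not the course's reference procedure: the function name `multiply` is made up, the sketch normalizes before rounding once (the slide trims digits first, which gives the same result here), and round-half-even is assumed as the rounding rule.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def multiply(m1, e1, m2, e2, digits=3):
    """Multiply m1*10**e1 by m2*10**e2; return a normalized (mantissa, exponent)."""
    e = e1 + e2                                  # step 1: add the exponents
    m = Decimal(m1) * Decimal(m2)                # step 2: multiply the mantissas
    shift = m.adjusted()                         # power of ten of the leading digit
    m, e = m.scaleb(-shift), e + shift           # normalize to one leading digit
    step = Decimal(1).scaleb(-digits)            # e.g. 0.001 for 3 digits
    m = m.quantize(step, rounding=ROUND_HALF_EVEN)
    if abs(m) >= 10:                             # rounding may carry into a new digit
        m, e = m.scaleb(-1).quantize(step, rounding=ROUND_HALF_EVEN), e + 1
    return m, e

print(multiply("1.110", 10, "9.200", -4))
```

Passing the mantissas as strings keeps the decimal digits exact, so the only rounding is the explicit one in the last step.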


Floating Arithmetic operations: Multiplication and Division

Example 2 (binary): Multiply the following two binary normalized numbers: $1.000 \times 2^{-1}$ and $-1.110 \times 2^{-2}$
1. Sign = minus
2. Add the exponents: e = -1 + (-2) = -3
3. Multiply the mantissas: m = 1.000 × 1.110 = 1.110000
4. Keeping only 3 digits in the fractional part: m = 1.110, and thus $r = -1.110 \times 2^{-3}$
Result is already normalized!

Long multiplication of the mantissas:

       1.000
     × 1.110
   ---------
        0000
       1000
      1000
   + 1000
   ---------
     1110000  ==>  1.110000
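The binary multiplication can be sketched with integer arithmetic on the mantissa bits. The function names `multiply_binary` and `to_float` are assumptions for this example, mantissas are passed as strings with exactly `frac_bits` digits after the point, and extra bits are truncated rather than rounded.

```python
def multiply_binary(m1, e1, m2, e2, frac_bits=3):
    """Multiply two binary mantissa strings such as '-1.110' with their exponents.

    Returns (sign, mantissa, exponent), where mantissa is an integer holding
    the bits of 1.xxx, i.e. the value scaled by 2**frac_bits.
    """
    sign = -1 if m1.startswith("-") != m2.startswith("-") else 1
    a = int(m1.lstrip("+-").replace(".", ""), 2)   # 1.000 -> 0b1000
    b = int(m2.lstrip("+-").replace(".", ""), 2)   # 1.110 -> 0b1110
    e = e1 + e2                                    # add the exponents
    product = (a * b) >> frac_bits                 # keep frac_bits fractional bits
    while product >= (2 << frac_bits):             # renormalize if mantissa >= 2
        product >>= 1
        e += 1
    return sign, product, e

def to_float(sign, mantissa, e, frac_bits=3):
    """Numerical value of the result, for checking."""
    return sign * mantissa / 2 ** frac_bits * 2.0 ** e

result = multiply_binary("1.000", -1, "-1.110", -2)
print(result, to_float(*result))
```

The two operands are 0.5 and -0.4375 in decimal, which gives an independent check on the value returned by `to_float`.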

Floating Arithmetic operations: Addition and subtraction

$r_1 = m_1 b^{e_1}$, $r_2 = m_2 b^{e_2}$, $e_1 > e_2$ $\Rightarrow$ $r_1 \pm r_2 = (m_1 \pm m_2\, b^{e_2 - e_1})\, b^{e_1}$

Example 1 (decimal): Add the following two numbers in normalized scientific notation: $9.970 \times 10^{1}$ and $8.740 \times 10^{-1}$
1. Rewrite the smaller number using the exponent of the larger number: $8.740 \times 10^{-1} = (8.740 \times 10^{-2}) \times 10^{1} = 0.08740 \times 10^{1}$
2. Add the mantissas: m = 9.970 + 0.08740 = 10.05740 ==> $r = 10.05740 \times 10^{1}$
3. Normalize the result (if necessary, shift the mantissa and adjust the exponent): $r = 10.05740 \times 10^{1} = 1.005740 \times 10^{2}$
4. Check for overflow/underflow of the exponent and round off: $r = 1.006 \times 10^{2}$
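The alignment, addition, normalization and rounding steps can be sketched with Python's `decimal` module. The function name `add` is an assumption for this example, round-half-even is assumed as the rounding rule, and the overflow/underflow check of step 4 is left out. The sum is assumed to be nonzero.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def add(m1, e1, m2, e2, digits=3):
    """Add m1*10**e1 and m2*10**e2; return a normalized (mantissa, exponent)."""
    m1, m2 = Decimal(m1), Decimal(m2)
    if e1 < e2:                                  # make the first operand the larger one
        m1, e1, m2, e2 = m2, e2, m1, e1
    m = m1 + m2.scaleb(e2 - e1)                  # steps 1-2: align, then add mantissas
    shift = m.adjusted()                         # power of ten of the leading digit
    m, e = m.scaleb(-shift), e1 + shift          # step 3: normalize
    step = Decimal(1).scaleb(-digits)
    m = m.quantize(step, rounding=ROUND_HALF_EVEN)   # step 4: round off
    if abs(m) >= 10:                             # rounding may carry into a new digit
        m, e = m.scaleb(-1).quantize(step, rounding=ROUND_HALF_EVEN), e + 1
    return m, e

print(add("9.970", 1, "8.740", -1))
```

Subtraction follows from the same function by passing a negative mantissa for the second operand.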

Floating Arithmetic operations: Addition and subtraction

Example 2 (binary): Write the binary normalized notation of the numbers 0.5 and -0.4375, then add them.
1. Binary normalized notation: $0.5_{10} = 0.1_{2} = 0.1_{2} \times 2^{0} = 1.000 \times 2^{-1}$, and $-0.4375_{10} = -0.0111_{2} = -0.0111_{2} \times 2^{0} = -1.110 \times 2^{-2}$

2. Rewrite the smaller number using the exponent of the larger number: $-1.110 \times 2^{-2} = -(1.110 \times 2^{-1}) \times 2^{-1} = -0.111 \times 2^{-1}$
3. Add the mantissas: m = 1.000 + (-0.111) = 0.001 ==> $r = 0.001 \times 2^{-1}$
4. Normalize the result: $r = 1.000 \times 2^{-4}$, $-4 \in [-126, 127]$
5. No overflow/underflow, no rounding required: $r = 1.000 \times 2^{-4}$
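The same steps can be followed exactly in Python with rational arithmetic. The function name `add_binary` is an assumption for this example; mantissas are given by their decimal values (1.0 for 1.000 and -1.75 for -1.110 in binary), and the sketch performs no rounding, which matches this example where none is required.

```python
from fractions import Fraction

def add_binary(m1, e1, m2, e2):
    """Add m1*2**e1 and m2*2**e2 exactly; return a normalized (mantissa, exponent)."""
    m1, m2 = Fraction(m1), Fraction(m2)
    if e1 < e2:                                  # make the first operand the larger one
        m1, e1, m2, e2 = m2, e2, m1, e1
    m = m1 + m2 * Fraction(2) ** (e2 - e1)       # align to the larger exponent, then add
    e = e1
    if m == 0:
        return m, 0
    while abs(m) >= 2:                           # shift right until 1 <= |m| < 2
        m /= 2
        e += 1
    while abs(m) < 1:                            # shift left until 1 <= |m| < 2
        m *= 2
        e -= 1
    return m, e

mantissa, exponent = add_binary(1.0, -1, -1.75, -2)
print(mantissa, exponent, -126 <= exponent <= 127)
```

The final comparison mirrors the range check of step 4 for single precision.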

Floating Arithmetic operations: Coursework

Exercise 1: Find the sign, mantissa and biased exponent, and write the single-precision representation of the decimal numbers -1.5, 0.2 and 4.
Exercise 2: Find the sign, mantissa and biased exponent, and write the single-precision representation of the binary numbers -0.1 and 0.00101.
Exercise 3: Write the binary normalized notation of the numbers 1.5 and -0.6375, then add them.
Exercise 4: Write the binary normalized notation of the numbers 12.0 and -0.2375, then multiply them.
Exercise 5: On changing from IEEE single to double precision, by how much do the numbers of bits representing the mantissa and the exponent change? Deduce whether the change prioritizes the precision or the range of expressible numbers.
