Arithmetic Pipeline
• Main topics in pipeline processing are:
  • Arithmetic pipeline:
    • Fixed-point arithmetic pipeline
    • Floating-point arithmetic pipeline
  • Vector processing: adder/multiplier pipeline
  • Array processing: array processor
    • Attached array processor
    • SIMD array processor
Parallel Processing
• Simultaneous data processing tasks for the purpose of increasing the computational speed
• Perform concurrent data processing to achieve faster execution time
• Multiple functional units:
  • Separate the execution unit into eight functional units operating in parallel.
[Figure: processor registers, connected to memory, feed eight functional units: adder-subtractor, integer multiply, logic unit, shift unit, incrementer, floating-point add-subtract, floating-point multiply, floating-point divide]
Pipelining: Laundry Example
[Figure: four loads A, B, C, D]
A small laundry has one washer, one dryer and one operator. It takes 90 minutes to finish one load:
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Operator folding” takes 20 minutes
Sequential Laundry
• This operator scheduled his loads to be delivered to the laundry every 90 minutes, which is the time required to finish one load.
• In other words, he will not start a new task until he is done with the previous task.
• The process is sequential. Sequential laundry takes 6 hours for 4 loads.
[Figure: timeline from 6 PM to midnight. Loads A, B, C, D (task order) run back to back, each taking 30 + 40 + 20 = 90 min.]
Efficiently Scheduled Laundry: Pipelined Laundry Operator
• Another operator asks for the delivery of loads to the laundry every 40 minutes.
• Pipelined laundry takes 3.5 hours for 4 loads.
[Figure: timeline starting at 6 PM. Loads A, B, C, D (task order) overlap: 30 min for the first wash, then four 40-min dryer slots, then 20 min for the last fold.]
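The two totals quoted for the laundry example can be checked with a small simulation. This is a minimal sketch, assuming all four loads are waiting at time 0 and that a load may queue in front of a busy stage; the function names are illustrative only.

```python
# Stage times in minutes, taken from the laundry example.
STAGES = [30, 40, 20]  # washer, dryer, folding

def sequential_minutes(loads, stages=STAGES):
    """Each load starts only after the previous load is completely done."""
    return loads * sum(stages)

def pipelined_minutes(loads, stages=STAGES):
    """A load enters a stage as soon as it has left the previous stage
    and the stage itself is free."""
    stage_free = [0] * len(stages)  # time at which each stage becomes free
    finish = 0
    for _ in range(loads):
        t = 0  # assumption: every load is available at time 0
        for s, duration in enumerate(stages):
            start = max(t, stage_free[s])
            t = start + duration
            stage_free[s] = t
        finish = t
    return finish

print(sequential_minutes(4))  # 360 minutes = 6 hours
print(pipelined_minutes(4))   # 210 minutes = 3.5 hours
```

The pipelined total is 30 + 4 x 40 + 20 minutes: the dryer, as the slowest stage, sets the pace once the pipeline is full.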
Pipelining Facts
• Multiple tasks operate simultaneously.
• Pipelining doesn’t help the latency (response time) of a single task; it helps the throughput of the entire workload.
• Pipeline rate is limited by the slowest pipeline stage.
• Potential speedup = number of pipe stages.
• Unbalanced lengths of pipe stages reduce speedup.
[Figure: pipelined laundry timeline starting at 6 PM with time slots 30, 40, 40, 40, 40, 20 for loads A to D. The washer waits for the dryer for 10 minutes.]
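The speedup facts above can be made concrete with a short calculation. The sketch below assumes a simple timing model that is not stated on the slide: the first task needs the sum of all stage times, and every further task completes one slowest-stage time later. The function name `pipeline_speedup` is hypothetical.

```python
def pipeline_speedup(stages, n_tasks):
    """Speedup of a pipeline over sequential execution.

    Assumed model: the first task takes sum(stages); each additional task
    finishes max(stages) later, because the slowest stage limits the rate.
    """
    sequential = n_tasks * sum(stages)
    pipelined = sum(stages) + (n_tasks - 1) * max(stages)
    return sequential / pipelined

# Balanced stages: speedup approaches the number of stages (3).
print(pipeline_speedup([30, 30, 30], 1000))
# Unbalanced stages (laundry example): speedup stays below 90 / 40 = 2.25.
print(pipeline_speedup([30, 40, 20], 1000))
# A single task gains nothing: latency is unchanged.
print(pipeline_speedup([30, 40, 20], 1))
```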
Pipelining
Decomposing a sequential process into suboperations
Each suboperation is executed in its own dedicated segment, concurrently with the others
• Instruction execution is divided into k segments or stages
• An instruction exits pipe stage k-1 and proceeds into pipe stage k
• All pipe stages take the same amount of time, called one processor cycle
• The length of the processor cycle is determined by the slowest pipe stage
[Figure: pipeline of k segments]
Pipelining
• Suppose we want to perform the combined multiply and add operations with a stream of numbers:
  • Ai * Bi + Ci for i = 1, 2, 3, …, 7
• The suboperations performed in each segment of the pipeline are as follows:
  • Segment 1: R1 ← Ai, R2 ← Bi
  • Segment 2: R3 ← R1 * R2, R4 ← Ci
  • Segment 3: R5 ← R3 + R4
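The register transfers above can be traced clock by clock. The following is a minimal sketch, assuming all registers are loaded on the same clock pulse and that an empty register is represented by `None`; the function name is illustrative only.

```python
def multiply_add_pipeline(A, B, C):
    """Clock-by-clock simulation of the three-segment Ai * Bi + Ci pipeline.

    Returns the list of results leaving R5 and the number of clock pulses.
    """
    n = len(A)
    R1 = R2 = R3 = R4 = R5 = None
    results = []
    # n pulses feed the inputs, 2 more drain the last two segments.
    for clock in range(1, n + 3):
        # Update from the last segment backwards so that each register
        # reads the values its predecessor held before this pulse.
        R5 = R3 + R4 if R3 is not None else None        # segment 3
        if R1 is not None:                              # segment 2
            R3, R4 = R1 * R2, C[clock - 2]
        else:
            R3, R4 = None, None
        if clock <= n:                                  # segment 1
            R1, R2 = A[clock - 1], B[clock - 1]
        else:
            R1, R2 = None, None
        if R5 is not None:
            results.append(R5)
    return results, clock

results, clocks = multiply_add_pipeline([1, 2, 3, 4, 5, 6, 7], [2] * 7, [10] * 7)
print(results)  # [12, 14, 16, 18, 20, 22, 24]
print(clocks)   # 9
```

With 3 segments and 7 input triples the pipeline needs 3 + 7 - 1 = 9 clock pulses, compared with 3 x 7 = 21 if each triple went through all three steps before the next one started.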
Arithmetic Pipeline
• Pipeline arithmetic units are usually found in very high-speed computers.
• Arithmetic pipelines are constructed for:
  • simple fixed-point arithmetic operations
  • floating-point arithmetic operations
• Arithmetic pipelines are generally implemented with the following two types of adder:
  • i) Carry propagation adder (CPA): adds two numbers such that the carries generated in successive digits are propagated.
  • ii) Carry save adder (CSA): adds two numbers such that the carries generated are not propagated; instead, they are saved in a carry vector.
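The difference between the two adder types can be sketched on plain integers. This is an illustration only, not a hardware description: the CPA is modelled as a ripple-carry adder of an assumed width of 8 bits, and the CSA is shown in the three-input form commonly used in multiplier trees (an assumption; the slide describes it for two numbers).

```python
def carry_propagate_add(x, y, width=8):
    """Ripple-carry addition: the carry out of each bit feeds the next bit."""
    carry, total = 0, 0
    for i in range(width):
        a, b = (x >> i) & 1, (y >> i) & 1
        total |= (a ^ b ^ carry) << i
        carry = (a & b) | (a & carry) | (b & carry)
    return total | (carry << width)  # final carry becomes the top bit

def carry_save_add(x, y, z):
    """Add three numbers bit by bit without propagating carries.

    Returns (sum vector, carry vector); sum + carry equals x + y + z.
    """
    sum_vector = x ^ y ^ z
    carry_vector = ((x & y) | (x & z) | (y & z)) << 1
    return sum_vector, carry_vector

print(carry_propagate_add(200, 100))  # 300
print(carry_save_add(5, 6, 7))        # (4, 14), and 4 + 14 == 18
```

Each bit position of the CSA works independently, so its delay does not grow with the word length; only the final CPA has to wait for carries to ripple.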
Fixed-Point Arithmetic Pipeline
• We take the example of multiplication of fixed-point numbers.
• Two fixed-point numbers are multiplied by the ALU using add and shift operations.
• This sequential execution makes multiplication a slow process.
• Observe that this is the process of adding multiple copies of the shifted multiplicand, as shown below:
[Figure: rows of shifted multiplicands (partial products) summed to give the product]
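The sequential add-and-shift process can be written out as follows. This is a minimal sketch for unsigned operands, assuming a 6-bit multiplier to match the six rows of shifted multiplicands mentioned for the pipeline; the function name is illustrative.

```python
def shift_add_multiply(multiplicand, multiplier, bits=6):
    """Unsigned multiplication by repeated add and shift.

    For every 1 bit of the multiplier, a copy of the multiplicand shifted
    left by that bit's position is added to the running product.
    """
    product = 0
    for i in range(bits):
        if (multiplier >> i) & 1:
            product += multiplicand << i  # one shifted multiplicand
    return product

print(shift_add_multiply(45, 27))  # 1215
```

Each iteration depends on the running product of the previous one, which is why the sequential form is slow.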
Fixed-Point Arithmetic Pipeline
Now, we can identify the following stages for the pipeline:
• The first stage generates the partial products of the numbers, which form the six rows of shifted multiplicands.
• In the second stage, the six numbers are given to two CSAs, which merge them into four numbers.
• In the third stage, a single CSA merges the four numbers into three.
• In the fourth stage, a single CSA merges the three numbers into two.
• In the fifth stage, the last two numbers are added through a CPA to get the final product.
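The five stages can be followed on integers. The sketch below assumes 6-bit unsigned operands and one particular wiring of the CSA inputs (which partial products go to which CSA is not given on the slide); the final CPA is modelled by ordinary integer addition.

```python
def csa(x, y, z):
    """Carry save adder: three numbers in, (sum vector, carry vector) out."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1

def pipelined_multiply(a, b):
    """Multiply two 6-bit unsigned numbers with the five pipeline stages."""
    # Stage 1: six partial products (rows of shifted multiplicands).
    p = [(a << i) if (b >> i) & 1 else 0 for i in range(6)]
    # Stage 2: two CSAs merge six numbers into four.
    s1, c1 = csa(p[0], p[1], p[2])
    s2, c2 = csa(p[3], p[4], p[5])
    # Stage 3: one CSA merges three of the four numbers; c2 passes through.
    s3, c3 = csa(s1, c1, s2)
    # Stage 4: one CSA merges the remaining three numbers into two.
    s4, c4 = csa(s3, c3, c2)
    # Stage 5: a CPA adds the last two numbers.
    return s4 + c4

print(pipelined_multiply(45, 27))  # 1215
```

In a real pipeline, latches between the stages would let a new pair of operands enter stage 1 on every clock pulse; this sketch only shows the data flow for one pair.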
Floating-Point Operations
• The inputs to the floating-point adder pipeline are two normalized floating-point numbers, each consisting of a mantissa and an exponent.
• A and B are the mantissas, and a and b are the exponents.
• Floating-point addition and subtraction can be performed in four segments.
Floating-Point Add/Subtraction Pipeline:
[Figure: four-segment floating-point add/subtract pipeline operating on the mantissas and exponents]
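The slide does not list the four segments. A commonly taught decomposition, assumed here, is: compare the exponents, align the mantissas, add or subtract the mantissas, and normalize the result. The sketch below follows that decomposition for base-2 numbers of the form mantissa x 2^exponent, with a normalized mantissa taken to satisfy 0.5 <= |mantissa| < 1; both conventions are assumptions.

```python
def fp_add(A, a, B, b):
    """Add A * 2**a and B * 2**b; returns (mantissa, exponent)."""
    # Segment 1: compare the exponents.
    diff = a - b
    exponent = max(a, b)
    # Segment 2: align the mantissas by shifting the one with the
    # smaller exponent to the right.
    if diff > 0:
        B = B / 2 ** diff
    elif diff < 0:
        A = A / 2 ** (-diff)
    # Segment 3: add the mantissas (subtraction is a negative mantissa).
    mantissa = A + B
    # Segment 4: normalize the result.
    if mantissa == 0:
        return 0.0, 0
    while abs(mantissa) >= 1:
        mantissa /= 2
        exponent += 1
    while abs(mantissa) < 0.5:
        mantissa *= 2
        exponent -= 1
    return mantissa, exponent

print(fp_add(0.75, 3, 0.5, 2))   # (0.5, 4): 6 + 2 = 8
print(fp_add(0.75, 3, -0.5, 3))  # (0.5, 2): 6 - 4 = 2
```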
Vector Processing
• Science and engineering applications:
  • Long-range weather forecasting
  • Petroleum exploration
  • Seismic data analysis
  • Medical diagnosis
  • Aerodynamics and space flight simulators
  • Artificial intelligence and expert systems
  • Mapping the human genome
  • Image processing
Vector Processing
Vector instruction format:

Operation code | Base address source 1 | Base address source 2 | Base address destination | Vector length
ADD            | A                     | B                     | C                        | 100
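The effect of the example instruction ADD A B C 100 can be emulated with a loop over a flat memory. This is a sketch only: the memory layout and the base addresses 0, 100 and 200 are made up for illustration.

```python
def vector_add(memory, src1, src2, dst, length):
    """Emulate one vector ADD: `length` element-wise additions."""
    for i in range(length):
        memory[dst + i] = memory[src1 + i] + memory[src2 + i]

# Hypothetical base addresses of the vectors A, B and C.
A, B, C = 0, 100, 200
memory = list(range(100)) + [10] * 100 + [0] * 100

vector_add(memory, A, B, C, 100)  # ADD A B C 100
print(memory[C:C + 3])  # [10, 11, 12]
```

A scalar machine would fetch and decode an add instruction plus loop control for every element; the vector instruction states the whole operation once.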
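Matrix Multiplication
• Multiplication of 3 x 3 matrices: $n^2 = 9$ inner products

$$
\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
\begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{bmatrix}
=
\begin{bmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \\ c_{31} & c_{32} & c_{33} \end{bmatrix}
$$

• One inner product: $c_{11} = a_{11} b_{11} + a_{12} b_{21} + a_{13} b_{31}$
• Cumulative multiply-add operation: $n^3 = 27$ multiply-adds
  • $c = c + a \times b$
  • $c_{11} = c_{11} + a_{11} b_{11} + a_{12} b_{21} + a_{13} b_{31}$, with initial value $c_{11} = 0$
  • Three such multiply-adds per inner product, therefore 9 x 3 = 27 multiply-adds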
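The count of 27 multiply-adds can be confirmed by instrumenting a straightforward triple loop. This is a minimal sketch with an illustrative function name; the counter exists only to make the operation count visible.

```python
def matmul_with_count(a, b):
    """Multiply two n x n matrices using cumulative multiply-adds.

    Returns the product and the number of multiply-add operations.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]  # every c[i][j] starts at 0
    count = 0
    for i in range(n):
        for j in range(n):            # n * n inner products
            for k in range(n):        # n multiply-adds per inner product
                c[i][j] = c[i][j] + a[i][k] * b[k][j]
                count += 1
    return c, count

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
product, count = matmul_with_count(a, identity)
print(product == a)  # True
print(count)         # 27
```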
• Pipeline for calculating an inner product:
  • Floating-point multiplier pipeline: 4 segments
  • Floating-point adder pipeline: 4 segments
• Example: $C = A_1 B_1 + A_2 B_2 + A_3 B_3 + \cdots + A_k B_k$
[Figure: four snapshots of source A and source B feeding the multiplier pipeline, whose output feeds the adder pipeline]
  • After the 1st clock input: $A_1 B_1$ is in the multiplier pipeline.
  • After the 4th clock input: the multiplier pipeline holds $A_4 B_4, A_3 B_3, A_2 B_2, A_1 B_1$.
  • After the 8th clock input: the multiplier pipeline holds $A_8 B_8, A_7 B_7, A_6 B_6, A_5 B_5$ and the adder pipeline holds $A_4 B_4, A_3 B_3, A_2 B_2, A_1 B_1$.
  • After the 9th, 10th, 11th, ... clock inputs: the adder pipeline accumulates $A_1 B_1 + A_5 B_5$, $A_2 B_2 + A_6 B_6$, and so on.
• The four partial sums are added to form the final sum:

$$
\begin{aligned}
C ={} & A_1 B_1 + A_5 B_5 + A_9 B_9 + A_{13} B_{13} + \cdots \\
{}+{} & A_2 B_2 + A_6 B_6 + A_{10} B_{10} + A_{14} B_{14} + \cdots \\
{}+{} & A_3 B_3 + A_7 B_7 + A_{11} B_{11} + A_{15} B_{15} + \cdots \\
{}+{} & A_4 B_4 + A_8 B_8 + A_{12} B_{12} + A_{16} B_{16} + \cdots
\end{aligned}
$$
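The grouping into four partial sums follows from the 4-segment adder: a product entering the adder meets the sum that entered four clock pulses earlier. The sketch below reproduces only that grouping, not the clock-level timing, and the function name is illustrative.

```python
def inner_product_four_partial_sums(A, B):
    """Accumulate products A[i] * B[i] into four interleaved partial sums.

    Product number i (counting from 0) goes to partial sum i mod 4, which
    mirrors a 4-segment adder pipeline whose output is fed back to its input.
    """
    partial = [0] * 4
    for i, (x, y) in enumerate(zip(A, B)):
        partial[i % 4] += x * y
    return sum(partial), partial

total, partial = inner_product_four_partial_sums(list(range(1, 17)), [1] * 16)
print(partial)  # [28, 32, 36, 40]
print(total)    # 136
```

Here the first partial sum collects the products 1, 5, 9 and 13, matching the first row of the expression above.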
Memory Interleaving
• Memory interleaving:
  • Simultaneous access to memory from two or more sources using one memory bus system.
  • One of 4 memory modules is selected using the lower 2 bits of the address register (AR).
  • Example: even/odd address memory access
[Figure: four memory modules, each with its own address register (AR), memory array and data register (DR), sharing a common address bus and data bus]
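Module selection with the lower 2 address bits can be sketched as a small decoding function. Using the remaining upper bits as the word address inside the module is an assumption; the slide only states how the module is chosen.

```python
def decode_address(address):
    """Split an address for a 4-way interleaved memory.

    Returns (module, word): the lower 2 bits select the module, the
    remaining bits are assumed to address the word inside that module.
    """
    module = address & 0b11
    word = address >> 2
    return module, word

# Consecutive addresses fall into consecutive modules.
print([decode_address(a)[0] for a in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
print(decode_address(13))                        # (1, 3)
```

Because consecutive addresses land in different modules, a stream of sequential accesses can keep all four modules busy at once.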
Array Processor
• A processor that performs computations on large arrays of data.
  • Vector processing: uses an adder/multiplier pipeline
  • Array processing: uses a separate array processor
• There are two different types of array processor:
  • Attached array processor
  • SIMD array processor
Attached Array Processor
• It is designed as a peripheral for a conventional host computer.
• Its purpose is to enhance the performance of the computer by providing vector processing.
• It achieves high performance by means of parallel processing with multiple functional units.
[Figure: a general-purpose computer is connected to the attached array processor through an input-output interface; the computer's main memory is connected to the array processor's local memory by a high-speed memory-to-memory bus]
SIMD Array Processor
• It is a processor that consists of multiple processing units operating in parallel.
• The processing units are synchronized to perform the same task under the control of a common control unit.
• Each processing element (PE) includes an ALU, a floating-point arithmetic unit and working registers.
[Figure: a master control unit, connected to main memory, drives processing elements PE 1, PE 2, PE 3, ..., PE n, which are paired with memories M1, M2, M3, ..., Mn]
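The organization above can be mimicked in software as one control routine issuing the same operation to every processing element. This is a loose sketch: the class and function names are hypothetical, each Mi is assumed to be the local memory of PE i, and the Python loop runs the PEs one after another, whereas the hardware PEs work in lockstep.

```python
import operator

class ProcessingElement:
    """One PE with its own local memory (assumed to be Mi in the figure)."""

    def __init__(self, local_memory):
        self.memory = local_memory

    def execute(self, op, dst, src1, src2):
        # Apply the broadcast operation to this PE's own operands.
        self.memory[dst] = op(self.memory[src1], self.memory[src2])

def master_control_broadcast(pes, op, dst, src1, src2):
    """Send the same instruction to every PE (single instruction, multiple data)."""
    for pe in pes:
        pe.execute(op, dst, src1, src2)

pes = [ProcessingElement({"a": i, "b": 10 * i, "c": None}) for i in range(1, 5)]
master_control_broadcast(pes, operator.add, "c", "a", "b")
print([pe.memory["c"] for pe in pes])  # [11, 22, 33, 44]
```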