EA 2004
Computer Architecture - II
Pipelining
Parallel processing
A parallel processing system is able to perform
simultaneous data processing to achieve faster
execution time
The system may have two or more ALUs and be able
to execute two or more instructions at the same time
The goal is to increase the throughput: the amount of processing that can
be accomplished during a given interval of time.
Parallel processing
Parallel processing involves two basic streams:
instruction stream - the sequence of instructions read from memory
data stream - the operations performed on the data in the processor
Based on the behavior of these two streams, computers can be classified
into 4 different categories (Flynn's classification)
Parallel processing classification
Single instruction stream, single data stream (SISD)
Single instruction stream, multiple data stream (SIMD)
Multiple instruction stream, single data stream (MISD)
Multiple instruction stream, multiple data stream (MIMD)
Single instruction stream, single
data stream SISD
A single control unit
A Processor Unit
A memory unit
Instructions are executed sequentially. Parallel processing
may be achieved by means of multiple functional
units or by pipeline processing
Single instruction stream,
multiple data stream SIMD
A single control unit
Many Processor Units
A memory unit
Includes multiple processing units with a single control
unit. All processors receive the same instruction, but
operate on different data.
Multiple instruction stream,
single data stream MISD
Many Processor Units
Each of which contains
A control unit
A local memory
Theoretical only: processors receive different instructions but operate on
the same data.
e.g. Space shuttle flight control systems
Multiple instruction stream,
multiple data stream MIMD
Many Processor Units
Many Control Units
A computer system capable of processing several
programs at the same time.
Most multiprocessor and supercomputer systems can
be classified in this category
Parallel processing can also be classified via pipelining, which concerns
operational and structural interconnections
What is a Pipeline
Pipelining is used by all modern microprocessors to
enhance performance by overlapping the execution
of instructions.
A common analogy for a pipeline is a factory
assembly line. Assume that there are three stages:
o Welding
o Painting
o Polishing
For simplicity, assume that each task takes one hour.
What is a Pipeline
A single person would take three hours to produce one
product.
With three people, one person could work on each stage; upon completing
their stage they pass the product on to the next person (since each stage
takes one hour there is no waiting).
The line then produces one product per hour, assuming the
assembly line has been filled.
Pipelining: Laundry Example
A small laundry has one washer, one dryer and one operator; it takes 90
minutes to finish one load (loads A, B, C and D):
Washer takes 30 minutes
Dryer takes 40 minutes
Operator folding takes 20 minutes
Sequential Laundry
[Timing diagram: loads A, B, C and D run one after another from 6 PM to
midnight; each load takes 30 + 40 + 20 = 90 minutes, and the next load only
starts when the previous one is completely finished.]
This operator schedules his loads to be delivered to the laundry every 90
minutes, which is the time required to finish one load. In other words, he
will not start a new load until he is done with the previous one.
The process is sequential. Sequential laundry takes 6 hours for 4 loads.
Efficiently scheduled laundry: Pipelined Laundry
[Timing diagram: load A starts washing at 6 PM; as soon as the washer is
free, load B starts washing while load A dries, and so on. After the first
30-minute wash, a load leaves the dryer every 40 minutes.]
Another operator asks for loads to be delivered to the laundry every 40
minutes.
Pipelined laundry takes 3.5 hours for 4 loads.
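The 3.5-hour figure can be checked with a short simulation. The sketch below (Python; the function and variable names are mine, only the 30/40/20-minute stage times come from the slides) starts each stage of each load as soon as both the load's previous stage and the stage's previous load have finished.

# Minimal sketch: simulate sequential vs. pipelined laundry.
# Stage durations in minutes (washer, dryer, folding), from the slides.
STAGES = [30, 40, 20]
LOADS = 4

def pipelined_finish_time(stage_times, n_loads):
    # end_of_stage[s] holds the time at which stage s last became free.
    end_of_stage = [0] * len(stage_times)
    finish = 0
    for _ in range(n_loads):
        t = 0  # earliest time this load can enter the first stage
        for s, duration in enumerate(stage_times):
            start = max(t, end_of_stage[s])  # wait until the stage is free
            t = start + duration
            end_of_stage[s] = t
        finish = t
    return finish

sequential = sum(STAGES) * LOADS                  # 360 minutes = 6 hours
pipelined = pipelined_finish_time(STAGES, LOADS)  # 210 minutes = 3.5 hours
print(sequential / 60, pipelined / 60)            # 6.0 3.5

The dryer (the slowest stage, 40 minutes) sets the rate of the whole line, which is why the pipelined schedule delivers loads every 40 minutes rather than every 30.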
Pipelining Facts
Multiple tasks operate simultaneously.
Pipelining doesn't help the latency of a single task; it helps the throughput
of the entire workload.
Pipeline rate is limited by the slowest pipeline stage (in the laundry
example the washer waits 10 minutes for the dryer).
Potential speedup = number of pipe stages.
Unbalanced lengths of pipe stages reduce speedup.
Time to fill the pipeline and time to drain it reduces speedup.
Building a Car
Unpipelined: start and finish a job before moving to the next.
Parallelism = 1 car
Latency = 24 hrs.
Throughput = 1/24 hrs.
[Diagram: jobs vs. time; each car takes 24 hrs. and the next car starts only
after the previous one is finished.]
Latency - the amount of time that a single operation takes to execute
Throughput - the rate at which operations get executed (generally
expressed as operations/second or operations/cycle)
The Assembly Line
Pipelined: break the job into smaller stages (Engine, Body, Paint), 8 hrs. each.
Parallelism = 3 cars
Latency = 24 hrs.
Throughput = 1/8 hrs.
[Diagram: jobs vs. time; cars A, B and C overlap in the Engine, Body and
Paint stages, giving roughly 3X the throughput of the unpipelined shop.]
In a computer...
Unpipelined: start and finish a job before moving to the next.
[Diagram: jobs vs. time; each instruction completes FET, DEC and EXE before
the next instruction starts.]
In a computer...
Pipelined: break the job into smaller stages.
[Diagram: in cycle 1, I1 is in FET; in cycle 2, I1 is in DEC while I2 is in FET;
in cycle 3, I1 is in EXE, I2 in DEC and I3 in FET.]
In a computer...
Unpipelined: start and finish a job before moving to the next.
[Diagram: each instruction takes one 3 ns cycle covering FET, DEC and EXE.]
Clock speed = 1/3 ns = 333 MHz
In a computer...
Pipelined: break the job into smaller stages.
[Diagram: the 3 ns job is split into three 1 ns stages (FET, DEC, EXE); a new
instruction enters the pipeline every cycle.]
Clock speed = 1/1 ns = 1 GHz
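A minimal sketch checking these clock-speed figures (only the 3 ns and 1 ns cycle times come from the slides; the helper name is just illustration):

# Clock speed of the unpipelined vs. the 3-stage pipelined datapath.
def clock_speed_mhz(cycle_time_ns):
    return 1000.0 / cycle_time_ns      # 1 / ns expressed in MHz

print(clock_speed_mhz(3.0))   # ~333 MHz: one long FET+DEC+EXE cycle
print(clock_speed_mhz(1.0))   # 1000 MHz = 1 GHz: cycle set by the 1 ns stage

# Once the pipeline is full, each design completes one instruction per cycle,
# so the pipelined version finishes roughly 3x as many instructions per second.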
Pipelining
Latency - the amount of time that a single operation takes to execute
Throughput - the rate at which operations get executed (generally
expressed as operations/second or operations/cycle)
Clocks and Latches
[Diagram: pipeline Stage 1 and Stage 2, each followed by a latch (L); the
latches are driven by a common clock (Clk).]
Four-segment pipeline:
[Diagram: the input flows through segments S1-S4, with a register (R1-R4)
after each segment; all registers are driven by a common clock.]
Example
Assume a 2 ns flip-flop delay
Characteristics Of Pipelining
Decomposes a sequential process into segments.
Divides the processor into segment processors, each one dedicated to a
particular segment.
Each segment is executed in a dedicated segment processor that operates
concurrently with all other segments.
Information flows through these multiple hardware segments.
If the stages of a pipeline are not balanced and one stage is slower than
another, the throughput of the entire pipeline is affected.
Pipelining
Instruction execution is divided into k segments or
stages
Instruction exits pipe stage k-1 and proceeds into pipe
stage k
All pipe stages take the same amount of time, called one processor cycle.
The length of the processor cycle is determined by the slowest pipe stage.
[Diagram: an instruction flowing through k segments.]
Pipeline Performance
n: number of instructions (equivalent to the number of loads in the
laundry example)
k: number of stages in the pipeline (washing, drying and folding)
tp: clock cycle time (the time of the slowest stage)
Tk: total time using the pipeline

Tk = (k + (n - 1)) * tp
T1 = n * k * tp (time without the pipeline)

Speedup S = T1 / Tk = n * k / (k + (n - 1))
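As a minimal sketch (plain Python; the helper names are mine, not from the slides), these formulas can be written directly and reused for the worked example later in the slides:

# Sketch of the pipeline performance formulas.
# k: number of stages, n: number of tasks, tp: clock cycle (slowest stage).

def pipelined_time(k, n, tp):
    return (k + (n - 1)) * tp       # Tk

def nonpipelined_time(k, n, tp):
    return n * k * tp               # T1

def speedup(k, n):
    return (n * k) / (k + (n - 1))  # S = T1 / Tk; tp cancels out

For very large n the (n - 1) term dominates the denominator, so speedup(k, n) approaches k, as shown in the following slides.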
Efficiently scheduled laundry: Pipelined Laundry
[Recap of the pipelined laundry timing diagram: k = 3 stages (wash, dry,
fold) operating on n = 4 loads, with a new load entering every 40 minutes.]
Speedup
Consider a k-segment pipeline operating on n data
sets. (In the above example, k = 3 and n = 4.)
It takes k clock cycles to fill the pipeline and get the
first result from the output of the pipeline.
After that, the remaining (n - 1) results come out at a rate of one per
clock cycle.
It therefore takes (k + n - 1) clock cycles to complete
the task.
Speedup
If we execute the same task sequentially in a
single processing unit, it takes (k * n) clock
cycles.
The speedup gained by using the pipeline is:
S = k * n / (k + n - 1 )
Speedup
S = k * n / (k + n - 1 )
For n >> k (such as 1 million data sets on a 3-stage
pipeline),
S ≈ k
So for large data sets we gain a speedup approximately equal to the number
of pipeline stages (functional units). This is because the multiple
functional units work in parallel, except during the filling and draining
cycles.
Speedup
Example
- 4-stage pipeline
- sub-operation in each stage: tp = 20 ns
- 100 tasks to be executed
- 1 task in a non-pipelined system: 20 * 4 = 80 ns
Pipelined system:
(k + n - 1) * tp = (4 + 99) * 20 = 2060 ns
Non-pipelined system:
n * k * tp = 100 * 80 = 8000 ns
Speedup
Sk = 8000 / 2060 = 3.88
As n grows, the speedup approaches k = 4, i.e. the 4-stage pipeline behaves
like a system with 4 identical functional units.
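These numbers can be reproduced directly (or with the helper sketched earlier); the variable names below are illustrative only:

# Reproduce the worked example: 4-stage pipeline, tp = 20 ns, 100 tasks.
k, n, tp = 4, 100, 20

pipelined = (k + n - 1) * tp   # (4 + 99) * 20 = 2060 ns
nonpipelined = n * k * tp      # 100 * 4 * 20 = 8000 ns
print(pipelined, nonpipelined, round(nonpipelined / pipelined, 2))  # 2060 8000 3.88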
Example of Pipelining
Suppose we want to perform the combined
multiply and add operations with a stream
of numbers:
Ai * Bi + Ci for i = 1, 2, 3, ..., 7
Example of Pipelining
The sub-operations performed in each
segment of the pipeline are as follows:
R1 ← Ai, R2 ← Bi
R3 ← R1 * R2, R4 ← Ci
R5 ← R3 + R4
Example of Pipelining
[Diagram: Ai and Bi are loaded into registers R1 and R2 (segment 1); the
multiplier computes R3 ← R1 * R2 while Ci is loaded into R4 (segment 2);
the adder computes R5 ← R3 + R4 (segment 3).]
Content of registers in pipeline example
Clock pulse | Segment 1   | Segment 2      | Segment 3
number      | R1    R2    | R3       R4    | R5
1           | A1    B1    | ----     ----  | ----
2           | A2    B2    | A1*B1    C1    | ----
3           | A3    B3    | A2*B2    C2    | A1*B1+C1
4           | A4    B4    | A3*B3    C3    | A2*B2+C2
5           | A5    B5    | A4*B4    C4    | A3*B3+C3
6           | A6    B6    | A5*B5    C5    | A4*B4+C4
7           | A7    B7    | A6*B6    C6    | A5*B5+C5
8           | ----  ----  | A7*B7    C7    | A6*B6+C6
9           | ----  ----  | ----     ----  | A7*B7+C7
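The table can be checked with a small simulation of the three segments. This is just an illustrative sketch, not the hardware itself: the register names R1-R5 mirror the slides, while the operand values are made up. Each clock pulse moves values one segment down the pipe (segments are updated back to front so that every register uses the value latched on the previous pulse).

# Minimal sketch: simulate the 3-segment pipeline computing Ai*Bi + Ci.
A = [9, 8, 7, 6, 5, 4, 3]          # example operands, i = 1..7
B = [1, 2, 3, 4, 5, 6, 7]
C = [10, 20, 30, 40, 50, 60, 70]

R1 = R2 = R3 = R4 = R5 = None      # pipeline registers, initially empty

results = []
for pulse in range(len(A) + 2):    # k - 1 = 2 extra pulses drain the pipe
    # Segment 3: R5 <- R3 + R4 (values latched on the previous pulse).
    R5 = R3 + R4 if R3 is not None else None
    # Segment 2: R3 <- R1 * R2, R4 <- Ci.
    R3 = R1 * R2 if R1 is not None else None
    R4 = C[pulse - 1] if 1 <= pulse <= len(C) else None
    # Segment 1: R1 <- Ai, R2 <- Bi.
    R1 = A[pulse] if pulse < len(A) else None
    R2 = B[pulse] if pulse < len(B) else None
    if R5 is not None:
        results.append(R5)

print(results)
assert results == [a * b + c for a, b, c in zip(A, B, C)]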
Exercise: Looking at the above example define how the operation of
Ai*Bi + Ci*Di+ Ei
is executed using a pipeline
Arithmetic Pipeline
Arithmetic has been an important aspect of computing from its early days,
yet arithmetic operations consume much of the time within the arithmetic
and logic unit.
Thus pipelining is used to boost the performance of ALUs and has opened
the way to many forms of high-performance computing.
Arithmetic pipelines are generally used for fixed-point and floating-point
operations.
Arithmetic Pipeline: Floating Point Adder
A generic floating point number can be stated as
X = A * 2^a
where X is a binary floating-point value, A is the mantissa and a is the
exponent.
Arithmetic Pipeline: Floating Point Adder
X = A * 2^a
Y = B * 2^b
A floating point addition can be executed via 4 simple sub-operations:
Compare the exponents.
Align the mantissas.
Add or subtract the mantissas.
Normalize the result.
Arithmetic Pipeline: Floating Point Adder
Given below is a simple demonstration of how two decimal floating-point
numbers are added.
Consider the two input values X and Y:
X = 0.9832 * 10^3
Y = 0.8929 * 10^2
Note: decimal numbers are used for simplicity of explanation.
Arithmetic Pipeline: Floating Point Adder
X = 0.9832 * 10^3
Y = 0.8929 * 10^2
In the first segment the two exponents are compared. The larger exponent
is 3, so it is chosen as the exponent of the result.
The difference between the two exponents is 1 (3 - 2).
Arithmetic Pipeline: Floating Point Adder
X = 0.9832 * 10^3
Y = 0.8929 * 10^2
Since Y has the smaller exponent, its mantissa is shifted to the right by one
digit, giving the aligned values
X = 0.9832 * 10^3
Y = 0.08929 * 10^3
Afterwards the two mantissas are simply added, giving
Z = 1.07249 * 10^3
Finally the result is normalized so that the mantissa is a fraction with a
non-zero digit immediately after the decimal point:
Z = 0.107249 * 10^4
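A small sketch of the four sub-operations applied to this decimal example (the function and variable names are mine; Python's Fraction is used only to keep the mantissa arithmetic exact):

# Minimal sketch of the 4-step floating point addition, in decimal,
# using the slide's example X = 0.9832 * 10^3, Y = 0.8929 * 10^2.
from fractions import Fraction

def fp_add(ma, ea, mb, eb):
    # 1. Compare the exponents; keep the larger one for the result.
    if ea < eb:
        ma, ea, mb, eb = mb, eb, ma, ea
    # 2. Align the mantissas: shift the smaller-exponent mantissa right.
    mb = mb / Fraction(10) ** (ea - eb)
    # 3. Add the mantissas.
    mz, ez = ma + mb, ea
    # 4. Normalize so the mantissa is a fraction with a non-zero first digit
    #    (0.1 <= |mantissa| < 1), adjusting the exponent accordingly.
    while abs(mz) >= 1:
        mz, ez = mz / 10, ez + 1
    while 0 < abs(mz) < Fraction(1, 10):
        mz, ez = mz * 10, ez - 1
    return mz, ez

m, e = fp_add(Fraction("0.9832"), 3, Fraction("0.8929"), 2)
print(float(m), e)   # 0.107249 4, i.e. Z = 0.107249 * 10^4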
Arithmetic Pipeline for Floating Point Adder
[Diagram: the exponents a and b and the mantissas A and B enter the
pipeline through registers (R).
Segment 1: compare the exponents by subtraction to find their difference.
Segment 2: choose the larger exponent and align the mantissas.
Segment 3: add or subtract the mantissas.
Segment 4: normalize the result and adjust the exponent.
Registers (R) separate consecutive segments.]
Instruction Pipeline
An instruction pipeline works in a similar manner to the arithmetic
pipeline, although it operates on a stream of instructions rather than a
stream of data.
Instruction Pipeline
Processing an instruction requires the following sequence of steps:
Fetch the instruction from memory.
Decode the instruction.
Calculate the effective address.
Fetch the operands from memory.
Execute the instruction.
Store the result in the proper place.
Instruction Pipeline
Consider a pipeline specified to have 4 separate segments.
In such a system up to 4 different instructions can be processed at the
same time.
Pipeline Conflicts
Difficulties in general can be caused by the following:
Resource conflicts
caused when two segments access memory at the same time.
Data dependency conflicts
occur when an instruction depends on the result of a previous
instruction which is not yet available.
Branch difficulties
arise from branch and other instructions that change the value of
the PC.
Four-segment CPU pipeline for overcoming pipeline conflicts
[Flowchart:
Segment 1: fetch instruction from memory.
Segment 2: decode instruction and calculate the effective address; if the
instruction is a branch, update the PC and empty the pipe.
Segment 3: fetch operand from memory.
Segment 4: execute instruction; if an interrupt is pending, perform
interrupt handling, update the PC and empty the pipe.]
Four-segment CPU pipeline for overcoming pipeline conflicts
Timing of Instruction Pipeline
Step            |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | 11 | 12 | 13 |
Instruction 1   | FI | DA | FO | EX |    |    |    |    |    |    |    |    |    |
Instruction 2   |    | FI | DA | FO | EX |    |    |    |    |    |    |    |    |
Instruction 3   |    |    | FI | DA | FO | EX |    |    |    |    |    |    |    |
Instruction 4   |    |    |    | FI | -- | -- | FI | DA | FO | EX |    |    |    |
Instruction 5   |    |    |    |    | -- | -- | -- | FI | DA | FO | EX |    |    |
Instruction 6   |    |    |    |    |    |    |    |    | FI | DA | FO | EX |    |
Instruction 7   |    |    |    |    |    |    |    |    |    | FI | DA | FO | EX |
(Instruction 3 is a branch.)
Four-segment CPU pipeline for overcoming pipeline conflicts
The four segments illustrated in the above table have the following
meanings:
FI is the segment that fetches an instruction.
DA is the segment that decodes the instruction and calculates the
effective address.
FO is the segment that fetches the operand.
EX is the segment that executes the instruction.
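As a rough check of the timing, the sketch below schedules each instruction's FI/DA/FO/EX steps and delays the fetch that follows a branch until the branch has executed. It reproduces the step numbers in the table above (the discarded fetch of instruction 4 at step 4 and the "--" bubbles are simply omitted); the function name and structure are illustrative only.

# Minimal sketch: schedule FI/DA/FO/EX for a stream of instructions where
# one instruction is a branch, reproducing the stall pattern in the table.
STAGES = ["FI", "DA", "FO", "EX"]

def schedule(n_instr, branch_at):
    timeline = {}                      # (instruction, stage) -> step number
    fetch_step = 1
    for i in range(1, n_instr + 1):
        for s, name in enumerate(STAGES):
            timeline[(i, name)] = fetch_step + s
        if i == branch_at:
            # The instruction after a branch can only be fetched once the
            # branch has executed and the new PC is known.
            fetch_step = timeline[(i, "EX")] + 1
        else:
            fetch_step += 1
    return timeline

tl = schedule(7, branch_at=3)
for i in range(1, 8):
    print(i, [(name, tl[(i, name)]) for name in STAGES])
# Instruction 4 starts FI at step 7 and finishes EX at step 10; instruction 7
# finishes EX at step 13, matching the table.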
Thank You