CS212
Computer Organization
UNIT-5
Parallel Processing &
Multiprocessor
Topics to be covered
• Flynn's taxonomy
• Parallel Processing
• Pipelining
• Arithmetic Pipeline
• Instruction Pipeline
• Vector Processing
• Array Processors
Flynn's Taxonomy

                              Data Stream
                        Single        Multiple
Instruction   Single    SISD          SIMD
Stream        Multiple  MISD          MIMD
Single Instruction Single Data (SISD)
• SISD represents the organization of a single
computer containing a control unit, a processor
unit, and a memory unit.
• Instructions are executed sequentially and the
system may or may not have internal parallel
processing capabilities.
Single Instruction Multiple Data (SIMD)
• SIMD represents an organization that includes
many processing units under the supervision of a
common control unit.
• All processors receive the same instruction from
the control unit but operate on different items of
data.
Multiple Instruction Single Data (MISD)
• There is no computer at present that can be
classified as MISD.
• MISD structure is only of theoretical interest since
no practical system has been constructed using
this organization.
Multiple Instruction Multiple Data (MIMD)
• MIMD organization refers to a computer system
capable of processing several programs at the
same time.
• Most multiprocessor and multicomputer systems
can be classified in this category.
• Contains multiple processing units.
• Execution of multiple instructions on multiple
data.
Parallel Processing
• Parallel processing denotes a large class of techniques that carry
out simultaneous data-processing tasks in order to increase the
computational speed of a computer system.
• Its purpose is to speed up the computer's processing capability and
increase its throughput.
• Throughput:
The amount of processing that can be
accomplished during a given interval of time.
Pipelining
• Pipelining is a technique of decomposing a sequential process into
suboperations, with each suboperation executed in a special dedicated
segment that operates concurrently with all other segments.
• A pipeline can be visualized as a collection of processing
segments through which binary information flows.
• Each segment performs partial processing dictated by the way
the task is partitioned.
• The result obtained from the computation in each segment is
transferred to the next segment in the pipeline.
• The registers provide isolation between each segment.
• The technique is efficient for those applications that need to
repeat the same task many times with different sets of data.
Pipelining example
• Combined multiply-and-add operation Ai * Bi + Ci performed with a
stream of operands.
• Segment 1: R1 ← Ai, R2 ← Bi (load the input operands)
• Segment 2: R3 ← R1 * R2, R4 ← Ci (multiply and transfer Ci)
• Segment 3: R5 ← R3 + R4 (add Ci to the product)
[Figure: registers R1–R5, with a Multiplier feeding an Adder]
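A minimal Python sketch of the three-segment pipeline above; the seven
data sets and the clock-by-clock register transfers are illustrative
assumptions, not part of the slides:

```python
# Registers R1..R5 are modeled as plain variables; one loop iteration
# is one clock pulse, with all three segments working concurrently
# on different items of data.

A = [1, 2, 3, 4, 5, 6, 7]
B = [7, 6, 5, 4, 3, 2, 1]
C = [1, 1, 1, 1, 1, 1, 1]

R1 = R2 = R3 = R4 = R5 = None
results = []

# n items need n + (segments - 1) clock pulses to drain the pipe.
for clock in range(len(A) + 2):
    # Segment 3: R5 <- R3 + R4 (adder)
    if R3 is not None:
        R5 = R3 + R4
        results.append(R5)
    # Segment 2: R3 <- R1 * R2, R4 <- Ci (multiplier)
    if R1 is not None:
        R3, R4 = R1 * R2, C[clock - 1]
    else:
        R3 = None
    # Segment 1: R1 <- Ai, R2 <- Bi (input registers)
    if clock < len(A):
        R1, R2 = A[clock], B[clock]
    else:
        R1 = None

print(results)  # A[i]*B[i] + C[i] for each i
```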
Pipelining
• General structure of a four-segment pipeline:
[Figure: Input → S1 → R1 → S2 → R2 → S3 → R3 → S4 → R4, with a common
Clock driving all the registers]
Space-time Diagram

Clock cycle:  1    2    3    4    5    6    7
Segment 1:   T1   T2   T3   T4
Segment 2:        T1   T2   T3   T4
Segment 3:             T1   T2   T3   T4
Segment 4:                  T1   T2   T3   T4

Non-pipelined architecture: each task needs 4 clock cycles and the
tasks run one after another, so 4 tasks take 4 x 4 = 16 cycles.
Pipelined architecture: the first task completes after 4 cycles and
each later task completes one cycle after the previous one, so 4 tasks
take 4 + (4 - 1) = 7 cycles.
Speedup
• The speedup of pipeline processing over an equivalent non-pipeline
processing is defined by the ratio

      S = n * tn / ((k + n - 1) * tp)

where n is the number of tasks, k is the number of pipeline segments,
tn is the time to complete a task without the pipeline, and tp is the
pipeline clock period.
• If the number of tasks n grows much larger than the number of
segments k, then k + n - 1 approaches n, and under this condition the
speedup becomes

      S = n * tn / (n * tp) = tn / tp

• Assuming the time to process a task is the same in the pipeline and
non-pipeline circuits, tn = k * tp, so

      S = k * tp / tp = k

• The theoretical maximum speedup is therefore k, the number of
segments in the pipeline.
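A small Python sketch that evaluates the speedup formula; the segment
count, clock period, and task counts below are illustrative numbers,
not from the slides:

```python
# Evaluate S = n*tn / ((k + n - 1)*tp) for a few task counts.

k = 4        # number of pipeline segments (assumed)
tp = 20      # pipeline clock period in ns (assumed)
tn = k * tp  # non-pipelined task time, assuming equal stage delays

for n in (4, 100, 10_000):               # number of tasks
    s = (n * tn) / ((k + n - 1) * tp)
    print(f"n = {n:6d}: speedup = {s:.3f}")
# As n grows, the speedup approaches k = 4, the theoretical maximum.
```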
Arithmetic Pipeline
• Arithmetic pipelines are usually found in very high-speed computers.
• They are used to implement floating-point operations, multiplication
of fixed-point numbers, and similar computations.
• Consider an example of floating point addition
and subtraction.
• A and B are two fractions that represent the
mantissas and a and b are the exponents.
Example of Arithmetic Pipeline
• Consider the two normalized floating-point numbers:
X = 0.9504 × 10^3    Y = 0.8200 × 10^2
• Segment-1: The larger exponent is chosen as the
exponent of result.
• Segment-2: Aligning the mantissa numbers
X = 0.9504 × 10^3    Y = 0.0820 × 10^3
• Segment-3: Addition of the two mantissas produces the
sum
Z = 1.0324 × 10^3
• Segment-4: Normalize the result
Z = 0.10324 × 10^4
Example of Arithmetic Pipeline
• The sub-operations that are performed in the
four segments are:
1. Compare the exponents
2. Align the mantissas
3. Add or subtract the mantissas
4. Normalize the result
[Figure: four-segment pipeline for floating-point addition/subtraction.
Exponents a, b and mantissas A, B enter through input registers R.
Segment 1: compare the exponents by subtraction.
Segment 2: choose the larger exponent; align the mantissas.
Segment 3: add or subtract the mantissas.
Segment 4: adjust the exponent; normalize the result.
Registers R isolate each segment from the next.]
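A rough Python sketch of the four suboperations, applied to the decimal
(mantissa, exponent) pairs from the worked example; real hardware
operates on binary fields, so this is only a behavioral model:

```python
# Four-segment floating-point addition on (mantissa, exponent) pairs.

def fp_add(x, y):
    (A, a), (B, b) = x, y
    # Segment 1: compare the exponents by subtraction.
    diff = a - b
    # Segment 2: choose the larger exponent and align the mantissas.
    if diff >= 0:
        exp, B = a, B / (10 ** diff)
    else:
        exp, A = b, A / (10 ** -diff)
    # Segment 3: add (or subtract) the mantissas.
    Z = A + B
    # Segment 4: normalize so the mantissa magnitude is below 1.
    while abs(Z) >= 1:
        Z, exp = Z / 10, exp + 1
    return Z, exp

print(fp_add((0.9504, 3), (0.8200, 2)))  # -> (0.10324, 4)
```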
Instruction Pipeline
• In the most general case, the computer needs to process each
instruction with the following sequence of steps
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.
• Different segments may take different times to operate on the
incoming information.
• Some segments are skipped for certain operations.
• The design of an instruction pipeline will be most efficient if the
instruction cycle is divided into segments of equal duration.
Instruction Pipeline
• Assume that the decoding of the instruction can be
combined with the calculation of the effective address into
one segment.
• Assume further that most of the instructions place the result into a
processor register, so that instruction execution and storing of the
result can be combined into one segment.
• This reduces the instruction pipeline into four segments.
1. FI: Fetch an instruction from memory
2. DA: Decode the instruction and calculate the effective address
of the operand
3. FO: Fetch the operand
4. EX: Execute the operation
Four segment CPU pipeline
[Flowchart:
Segment 1: fetch instruction from memory.
Segment 2: decode instruction and calculate the effective address;
if the instruction is a branch, empty the pipe and update PC.
Segment 3: fetch operand from memory.
Segment 4: execute instruction; if an interrupt is pending, empty the
pipe and transfer to interrupt handling; otherwise update PC and
continue with the next instruction.]
Space-time Diagram

Step:           1   2   3   4   5   6   7   8   9  10
Instruction 1: FI  DA  FO  EX
            2:     FI  DA  FO  EX
            3:         FI  DA  FO  EX
            4:             FI  DA  FO  EX
            5:                 FI  DA  FO  EX
            6:                     FI  DA  FO  EX
            7:                         FI  DA  FO  EX
Space-time Diagram (instruction 3 is a branch)

Step:           1   2   3   4   5   6   7   8   9  10  11  12  13
Instruction 1: FI  DA  FO  EX
            2:     FI  DA  FO  EX
   (Branch) 3:         FI  DA  FO  EX
            4:             FI   -   -  FI  DA  FO  EX
            5:                  -   -   -  FI  DA  FO  EX
            6:                             FI  DA  FO  EX
            7:                                 FI  DA  FO  EX
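A Python sketch that prints space-time diagrams like the two above; it
assumes the branch target's fetch must wait until the branch leaves EX,
and it omits the squashed fetch shown as dashes in the figure:

```python
# Print an FI/DA/FO/EX space-time diagram for the four-segment
# instruction pipeline, with an optional taken branch.

STAGES = ("FI", "DA", "FO", "EX")

def diagram(n_instr, branch=None):
    start = {i: i for i in range(1, n_instr + 1)}   # cycle of FI
    if branch is not None:
        for i in range(branch + 1, n_instr + 1):
            start[i] += 3       # wait for the branch to finish EX
    last = start[n_instr] + 3
    print("Step:   " + "".join(f"{c:>4}" for c in range(1, last + 1)))
    for i in range(1, n_instr + 1):
        row = {start[i] + j: st for j, st in enumerate(STAGES)}
        cells = "".join(f"{row.get(c, ''):>4}" for c in range(1, last + 1))
        print(f"I{i}:".ljust(8) + cells)

diagram(7)             # no branch: 7 instructions complete in 10 steps
diagram(7, branch=3)   # instruction 3 branches: completion slips to 13
```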
Pipeline Conflict
• There are three major difficulties that cause instruction pipeline
conflicts:
1. Resource conflicts, caused by access to memory by two segments at
the same time. Most of these conflicts can be resolved by using
separate instruction and data memories.
2. Data dependency conflicts, which arise when an instruction depends
on the result of a previous instruction, but that result is not yet
available (a small sketch follows this list).
3. Branch difficulties, which arise from branch and other instructions
that change the value of the program counter (PC).
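As a sketch of case 2, the fragment below checks whether one
instruction reads a register that a still-unfinished earlier
instruction writes; the instruction encoding is purely illustrative:

```python
# Read-after-write (RAW) hazard check between two instructions,
# each encoded as (destination_register, source_registers).

def raw_hazard(earlier, later):
    dest, _ = earlier
    _, sources = later
    return dest in sources       # later reads what earlier writes

i1 = ("R1", ("R2", "R3"))   # R1 <- R2 + R3
i2 = ("R4", ("R1", "R5"))   # R4 <- R1 - R5, needs R1 from i1
print(raw_hazard(i1, i2))   # True: i2 must stall until R1 is ready
```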
Vector Processing
• In many science and engineering applications, the
problems can be formulated in terms of vectors and
matrices that lend themselves to vector processing.
• Applications of Vector processing
1. Long-range weather forecasting
2. Petroleum explorations
3. Seismic data analysis
4. Medical diagnosis
5. Aerodynamics and space flight simulations
6. Artificial intelligence and expert systems
7. Mapping the human genome
8. Image processing
Vector Processing
Matrix Multiplication
• Matrix multiplication is one of the most
computationally intensive operations performed
in computers with vector processors.
• An n x m matrix of numbers has n rows and m
columns and may be considered as constituting a
set of n row vectors or a set of m column vectors.
• Consider, for example, the multiplication of two
3x3 matrices A and B.
Vector Processing
• The product matrix C is a 3 x 3 matrix whose elements are related to
the elements of A and B by the inner product

      cij = ai1 * b1j + ai2 * b2j + ai3 * b3j

(for example, c11 = a11 * b11 + a12 * b21 + a13 * b31).
• Each of the 9 elements of C is an inner product of length 3, so the
total number of multiplications (or additions) required to compute the
matrix product is 9 x 3 = 27.
• The values of A and B are either in memory or in processor
registers.
[Figure: Source A and Source B feed a Multiplier Pipeline whose
products flow into an Adder Pipeline that accumulates the inner
product.]
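A plain Python sketch of the 3 x 3 product, counting multiplications to
confirm the 9 x 3 = 27 figure; in a vector processor these inner
products would stream through the multiplier and adder pipelines shown
above (the matrix values are illustrative):

```python
# 3 x 3 matrix product via inner products, counting multiplications.

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]

C = [[0] * 3 for _ in range(3)]
multiplications = 0
for i in range(3):
    for j in range(3):
        for k in range(3):      # inner product of row i and column j
            C[i][j] += A[i][k] * B[k][j]
            multiplications += 1

print(C)
print(multiplications)          # 27
```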
SIMD Array Processor
[Figure: a Master control unit broadcasts instructions to processing
elements PE1, PE2, PE3, ..., PEn, each paired with a local memory
M1, M2, M3, ..., Mn; the array is attached to Main memory.]
Tightly coupled V/S Loosely coupled

Tightly Coupled System:
• Tasks and/or processors communicate in a highly synchronized fashion.
• Communicates through a common shared memory.
• Shared memory system.
• Overhead for data exchange is comparatively lower.

Loosely Coupled System:
• Tasks or processors do not communicate in a synchronized fashion.
• Communicates by message passing packets.
• Distributed memory system.
• Overhead for data exchange is comparatively higher.
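A toy Python contrast of the two styles, using the multiprocessing
module as a stand-in for real hardware; the value 42 and the process
structure are illustrative:

```python
# Shared memory (tightly coupled) vs. message passing (loosely coupled).

from multiprocessing import Process, Queue, Value

def shared_writer(x):
    x.value = 42              # communicate through common shared memory

def message_writer(q):
    q.put(42)                 # communicate by passing a message packet

if __name__ == "__main__":
    x = Value("i", 0)         # a shared-memory word
    p = Process(target=shared_writer, args=(x,))
    p.start()
    p.join()
    print("shared memory:", x.value)

    q = Queue()               # a message channel
    p = Process(target=message_writer, args=(q,))
    p.start()
    print("message passing:", q.get())
    p.join()
```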
Interconnection Structures
1. Time-shared common bus
2. Multiport memory
3. Crossbar switch
4. Multistage switching network
5. Hypercube system
1. Time-shared common bus
[Figure: CPU 1, CPU 2, CPU 3, IOP 1, and IOP 2 all share one common bus
to a single Memory unit.]
2. Multiport Memory
[Figure: memory modules MM 1–MM 4, each with four ports; CPU 1–CPU 4
each have a dedicated bus to every memory module.]
3. Crossbar switch
[Figure: crossbar switch — a grid of crosspoints connecting CPU 1–CPU 4
to memory modules MM 1–MM 4; each crosspoint can close to form an
independent path.]
4. Multistage switching network
Operation of 2 X 2 interchange switch
[Figure: the four states of the switch — A connected to output 0,
A connected to output 1, B connected to output 0, B connected to
output 1.]
4. Multistage switching network
[Figure: a three-stage network of 2 x 2 switches connecting processors
P1 and P2 to eight destinations 000–111; at each stage, the next bit of
the destination address selects the upper (0) or lower (1) switch
output.]
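A small Python sketch of the routing rule suggested by the figure: each
destination-address bit, from most significant to least, picks the
upper (0) or lower (1) output of the switch at that stage:

```python
# Route through a multistage network of 2 x 2 interchange switches.

def route(dest, n_bits=3):
    path = []
    for stage in range(n_bits):
        bit = (dest >> (n_bits - 1 - stage)) & 1   # next address bit
        path.append("lower" if bit else "upper")
    return path

print(route(0b011))   # ['upper', 'lower', 'lower'] -> destination 011
print(route(0b100))   # ['lower', 'upper', 'upper'] -> destination 100
```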
5. Hypercube Interconnection
[Figure: hypercube structures — a one-cube (nodes 0, 1), a two-cube
(nodes 00, 01, 10, 11), and a three-cube (nodes 000–111); each node
connects to every node whose address differs from its own in exactly
one bit.]
• Routing example: 010 XOR 001 = 011, so a message from node 010 to
node 001 must cross the two dimensions marked by the 1 bits.
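A short Python sketch of hypercube routing based on the XOR rule above;
flipping the differing bits lowest-first is an assumption, since any
order of the differing dimensions yields a valid route:

```python
# XOR the source and destination addresses; each 1 bit names a
# dimension whose link the message must cross.

def hypercube_route(src, dst, n=3):
    path = [src]
    diff = src ^ dst                 # e.g. 0b010 ^ 0b001 = 0b011
    for bit in range(n):
        if diff & (1 << bit):
            src ^= (1 << bit)        # flip one differing bit per hop
            path.append(src)
    return [format(node, f"0{n}b") for node in path]

print(hypercube_route(0b010, 0b001))   # ['010', '011', '001']
```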
Cache Coherence Problem
[Figure: Main memory holds a variable X; processors P1, P2, and P3 each
hold a cached copy of X. When one processor writes a new value of X,
with either write-through or write-back, the copies held by the other
caches (and, under write-back, by main memory as well) become stale, so
the processors observe inconsistent values.]
Cache Coherence Solution
• Write Update
• Write Invalidate
• Software approaches
– Compiler based cache coherence mechanism
• Hardware approaches
– Directory protocols
– Snoopy protocols
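A toy Python sketch of the write-invalidate idea, combined here with
write-through for simplicity; the addresses and values are illustrative:

```python
# A write to a shared block invalidates every other cached copy, so
# later readers miss and fetch the up-to-date value from memory.

memory = {"X": 100}
caches = {"P1": {}, "P2": {}, "P3": {}}

def read(p, addr):
    if addr not in caches[p]:               # miss: fetch from memory
        caches[p][addr] = memory[addr]
    return caches[p][addr]

def write(p, addr, value):                  # write-through + invalidate
    memory[addr] = value
    caches[p][addr] = value
    for other, cache in caches.items():
        if other != p:
            cache.pop(addr, None)           # invalidate stale copies

print(read("P2", "X"), read("P3", "X"))     # 100 100
write("P1", "X", 120)
print(read("P2", "X"), read("P3", "X"))     # 120 120 (re-fetched)
```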
Shared Memory Architecture
[Figure: shared-memory multiprocessor. A common shared memory is
attached to the system bus; each local bus holds a CPU, local memory,
possibly an IOP, and a system bus controller that links the local bus
to the system bus.]