
CS212

Computer Organization

UNIT-5
Parallel Processing &
Multiprocessor
Topics to be covered
• Flynn's taxonomy
• Parallel Processing
• Pipelining
• Arithmetic Pipeline
• Instruction Pipeline
• Vector Processing
• Array Processors
Flynn’s Taxonomy
                              Data Stream
                         Single        Multiple
Instruction   Single     SISD          SIMD
Stream        Multiple   MISD          MIMD
Single Instruction Single Data (SISD)
• SISD represents the organization of a single
computer containing a control unit, a processor
unit, and a memory unit.
• Instructions are executed sequentially and the
system may or may not have internal parallel
processing capabilities.
Single Instruction Multiple Data (SIMD)
• SIMD represents an organization that includes
many processing units under the supervision of a
common control unit.
• All processors receive the same instruction from
the control unit but operate on different items of
data.
Multiple Instruction Single Data (MISD)
• There is no computer at present that can be
classified as MISD.
• MISD structure is only of theoretical interest since
no practical system has been constructed using
this organization.
Multiple Instruction Multiple Data (MIMD)
• MIMD organization refers to a computer system
capable of processing several programs at the
same time.
• Most multiprocessor and multicomputer systems
can be classified in this category.
• Contains multiple processing units.
• Execution of multiple instructions on multiple
data.
Parallel Processing
• Parallel processing is a term used to denote a large
class of techniques that are used to provide
simultaneous data-processing tasks for the purpose
of increasing the computational speed of a
computer system.
• The purpose of parallel processing is to speed up the processing capability of the computer and increase its throughput.
• Throughput:
The amount of processing that can be
accomplished during a given interval of time.
Pipelining
• Pipelining is a technique of decomposing a sequential process into sub-operations, with each sub-operation executed in a special dedicated segment that operates concurrently with all other segments.
• A pipeline can be visualized as a collection of processing
segments through which binary information flows.
• Each segment performs partial processing dictated by the way
the task is partitioned.
• The result obtained from the computation in each segment is
transferred to the next segment in the pipeline.
• The registers provide isolation between each segment.
• The technique is efficient for those applications that need to
repeat the same task many times with different sets of data.
Pipelining example
• Compute Ai * Bi + Ci for a stream of operands i = 1, 2, 3, …
[Figure: input registers R1 and R2 latch Ai and Bi; a Multiplier forms R1 * R2 into R3 while R4 latches Ci; an Adder combines R3 and R4 and latches the result in R5.]
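To make the overlap concrete, here is a minimal Python sketch of this multiply-add pipeline (a behavioral simulation, not the hardware; names are illustrative):

from collections import deque

def multiply_add_pipeline(A, B, C):
    """Simulate the example pipeline: on each clock, the multiplier
    stage latches Ai*Bi into R3 and Ci into R4, while the adder stage
    combines the previous R3 and R4 into R5."""
    r3_r4 = deque()          # values latched between the two stages
    results = []             # successive contents of R5
    for a, b, c in zip(A, B, C):
        if r3_r4:
            p, cc = r3_r4.popleft()
            results.append(p + cc)        # adder stage -> R5
        r3_r4.append((a * b, c))          # multiplier stage -> R3, R4
    while r3_r4:                          # drain the pipe
        p, cc = r3_r4.popleft()
        results.append(p + cc)
    return results

print(multiply_add_pipeline([1, 2, 3], [4, 5, 6], [7, 8, 9]))
# [11, 18, 27] -- one result per clock once the pipe is full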
Pipelining
• General structure of four segment pipeline
[Figure: Input → S1 → R1 → S2 → R2 → S3 → R3 → S4 → R4; a common clock drives all the registers.]
Space-time Diagram
• Four tasks (T1-T4), four segments.
• Non-pipelined: each task takes 4 clock cycles and the tasks run back to back, so the total is 4 x 4 = 16 clock cycles.
• Pipelined: the first task completes after 4 cycles and one more task completes every cycle after that, so the total is 4 + (4 - 1) = 7 clock cycles.

Clock cycle:  1   2   3   4   5   6   7
Segment 1:   T1  T2  T3  T4
Segment 2:       T1  T2  T3  T4
Segment 3:           T1  T2  T3  T4
Segment 4:               T1  T2  T3  T4
Speedup
• The speedup of pipeline processing over an equivalent non-pipeline processing is defined by the ratio
      S = n * tn / ((k + n - 1) * tp)
  where n is the number of tasks, k the number of pipeline segments, tp the pipeline clock period, and tn the time to complete one task without the pipeline.
• As the number of tasks n grows large relative to the number of segments k, (k + n - 1) approaches n, and under this condition the speedup becomes
      S = n * tn / (n * tp) = tn / tp
• Assuming the time to process a task is the same in the pipeline and the non-pipeline circuit, tn = k * tp, so
      S = k * tp / tp = k
• Theoretically, the maximum speedup achievable is k, the number of segments in the pipeline.
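As a quick numerical check of these formulas, a minimal Python sketch (the function name is illustrative):

def pipeline_speedup(n, k, tp, tn=None):
    """S = n*tn / ((k + n - 1)*tp); tn defaults to k*tp, the case where
    a task takes the same total time with and without the pipeline."""
    if tn is None:
        tn = k * tp
    return (n * tn) / ((k + n - 1) * tp)

# With k = 4 segments, the speedup approaches k as n grows:
for n in (4, 100, 10_000):
    print(n, round(pipeline_speedup(n, k=4, tp=1), 3))
# 4 2.286 / 100 3.883 / 10000 3.999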
Arithmetic Pipeline
• Usually found in high-speed computers.
• Used to implement floating-point operations, multiplication of fixed-point numbers, and similar computations.
• Consider an example of floating-point addition and subtraction of two numbers
      X = A x 2^a    Y = B x 2^b
• A and B are two fractions that represent the mantissas, and a and b are the exponents.
Example of Arithmetic Pipeline
• Consider the two normalized floating-point numbers:
      X = 0.9504 x 10^3    Y = 0.8200 x 10^2
• Segment-1: The larger exponent is chosen as the exponent of the result.
• Segment-2: Align the mantissas
      X = 0.9504 x 10^3    Y = 0.0820 x 10^3
• Segment-3: Addition of the two mantissas produces the sum
      Z = 1.0324 x 10^3
• Segment-4: Normalize the result
      Z = 0.10324 x 10^4
Example of Arithmetic Pipeline
• The sub-operations that are performed in the
four segments are:
1. Compare the exponents
2. Align the mantissas
3. Add or subtract the mantissas
4. Normalize the result
[Figure: four-segment floating-point adder pipeline. Exponents a, b and mantissas A, B enter through input registers R. Segment 1: compare exponents by subtraction. Segment 2: choose the exponent and align the mantissas. Segment 3: add or subtract the mantissas. Segment 4: adjust the exponent and normalize the result. Registers R separate consecutive segments.]
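A minimal Python sketch of the four sub-operations, using decimal (mantissa, exponent) pairs as in the example above (a simplification of the hardware, which works in binary and handles more cases):

def float_add(A, a, B, b):
    """X = A * 10**a plus Y = B * 10**b via the four pipeline segments."""
    diff = a - b                      # Segment 1: compare exponents
    if diff >= 0:                     # Segment 2: choose larger exponent,
        exp, B = a, B / 10**diff      #            align the other mantissa
    else:
        exp, A = b, A / 10**-diff
    Z = A + B                         # Segment 3: add the mantissas
    while abs(Z) >= 1.0:              # Segment 4: normalize the result
        Z, exp = Z / 10, exp + 1
    return Z, exp

print(float_add(0.9504, 3, 0.8200, 2))   # ~(0.10324, 4)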
Instruction Pipeline
• In the most general case, the computer needs to process each
instruction with the following sequence of steps
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.
• Different segments may take different times to operate on the
incoming information.
• Some segments are skipped for certain operations.
• The design of an instruction pipeline will be most efficient if the
instruction cycle is divided into segments of equal duration.
Instruction Pipeline
• Assume that the decoding of the instruction can be
combined with the calculation of the effective address into
one segment.
• Assume further that most of the instructions place the result into a processor register, so that instruction execution and storing of the result can be combined into one segment.
• This reduces the instruction pipeline into four segments.
1. FI: Fetch an instruction from memory
2. DA: Decode the instruction and calculate the effective address
of the operand
3. FO: Fetch the operand
4. EX: Execute the operation
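The overlap of these four segments across consecutive instructions is exactly what the space-time diagrams below show; a small Python sketch that prints such a diagram (purely illustrative):

def space_time(n, stages=("FI", "DA", "FO", "EX")):
    """Print a space-time diagram for n overlapped instructions."""
    for i in range(n):
        # Instruction i enters the pipeline at clock cycle i + 1.
        print(f"I{i+1}: " + "    " * i + "  ".join(stages))

space_time(4)
# I1: FI  DA  FO  EX
# I2:     FI  DA  FO  EX
# I3:         FI  DA  FO  EX
# I4:             FI  DA  FO  EX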
Four segment CPU pipeline
[Flowchart: Segment 1 fetches the instruction from memory. Segment 2 decodes it and calculates the effective address, then tests for a branch. If there is no branch, Segment 3 fetches the operand from memory and Segment 4 executes the instruction. After execution, an interrupt test is made; if an interrupt is pending, interrupt handling updates the PC. In both the branch and interrupt cases the pipe is emptied before fetching resumes.]
Space-time Diagram
Step:            1   2   3   4   5   6   7   8   9  10
Instruction 1:  FI  DA  FO  EX
Instruction 2:      FI  DA  FO  EX
Instruction 3:          FI  DA  FO  EX
Instruction 4:              FI  DA  FO  EX
Instruction 5:                  FI  DA  FO  EX
Instruction 6:                      FI  DA  FO  EX
Instruction 7:                          FI  DA  FO  EX
Space-time Diagram (instruction 3 is a branch)
Step:                     1   2   3   4   5   6   7   8   9  10  11  12  13
Instruction 1:           FI  DA  FO  EX
Instruction 2:               FI  DA  FO  EX
Instruction 3 (Branch):          FI  DA  FO  EX
Instruction 4:                       FI   -   -  FI  DA  FO  EX
Instruction 5:                            -   -   -  FI  DA  FO  EX
Instruction 6:                                       FI  DA  FO  EX
Instruction 7:                                           FI  DA  FO  EX
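The cycle counts in the two diagrams follow from a simple formula; a hedged Python sketch (assuming each taken branch squashes the three partially processed slots, as drawn above):

def total_cycles(n, k=4, taken_branches=0, penalty=3):
    """k + n - 1 cycles for n instructions, plus a fixed penalty per
    taken branch while the pipe is emptied and refilled."""
    return (k + n - 1) + penalty * taken_branches

print(total_cycles(7))                    # 10 cycles, no branch
print(total_cycles(7, taken_branches=1))  # 13 cycles, as in the diagram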
Pipeline Conflict
• Three major difficulties cause instruction pipeline conflicts:
1. Resource conflicts, caused by two segments accessing memory at the same time. Most of these conflicts can be resolved by using separate instruction and data memories.
2. Data dependency conflicts, which arise when an instruction depends on the result of a previous instruction that is not yet available.
3. Branch difficulties, which arise from branch and other instructions that change the value of the PC.
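A data dependency conflict can be spotted mechanically; a small Python sketch (the register names and the 2-slot window are illustrative assumptions):

def data_hazards(instrs, window=2):
    """Report pairs (i, j) where instruction j reads a register that a
    still-in-flight instruction i (within `window` slots) writes."""
    hazards = []
    for j, (dst, srcs) in enumerate(instrs):
        for i in range(max(0, j - window), j):
            if instrs[i][0] in srcs:
                hazards.append((i, j))
    return hazards

# (destination, sources): the second instruction reads R1 immediately
# after the first writes it -- a classic read-after-write dependency.
prog = [("R1", ("R2", "R3")), ("R4", ("R1", "R5"))]
print(data_hazards(prog))   # [(0, 1)]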
Vector Processing
• In many science and engineering applications, the
problems can be formulated in terms of vectors and
matrices that lend themselves to vector processing.
• Applications of Vector processing
1. Long-range weather forecasting
2. Petroleum explorations
3. Seismic data analysis
4. Medical diagnosis
5. Aerodynamics and space flight simulations
6. Artificial intelligence and expert systems
7. Mapping the human genome
8. Image processing
Vector Processing
Matrix Multiplication
• Matrix multiplication is one of the most
computationally intensive operations performed
in computers with vector processors.
• An n x m matrix of numbers has n rows and m
columns and may be considered as constituting a
set of n row vectors or a set of m column vectors.
• Consider, for example, the multiplication of two
3x3 matrices A and B.
Vector Processing
• The product matrix C is a 3 x 3 matrix whose elements are related to the elements of A and B by the inner product
      Cij = Ai1 * B1j + Ai2 * B2j + Ai3 * B3j
• Each of the 9 elements of C therefore needs 3 multiply-add steps, so the total number of multiplications or additions required to compute the matrix product is 9 x 3 = 27.
• The values of A and B are either in memory or in processor registers.

[Figure: pipeline for computing an inner product. Source A and Source B feed a Multiplier Pipeline; its products feed an Adder Pipeline that accumulates the sum.]
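A minimal Python sketch of the same dataflow, with the multiplier and adder stages kept separate to mirror the figure (illustrative, not a model of the vector hardware):

def inner_product(a_row, b_col):
    products = [a * b for a, b in zip(a_row, b_col)]   # multiplier pipeline
    total = 0
    for p in products:                                 # adder pipeline
        total += p                                     # accumulates the sum
    return total

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
C = [[inner_product(A[i], [row[j] for row in B]) for j in range(3)]
     for i in range(3)]
print(C)   # [[30, 24, 18], [84, 69, 54], [138, 114, 90]]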


SIMD Array Processor
[Figure: a Master control unit broadcasts instructions to processing elements PE1 … PEn, each paired with its own local memory M1 … Mn; a Main memory holds the program and system data.]
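In Python terms, the master control unit's broadcast amounts to applying one operation across every PE's local data (a toy sketch; names are illustrative):

def simd_step(op, local_memories):
    """One broadcast instruction: every PE applies the same operation
    to the operand in its own local memory."""
    return [op(x) for x in local_memories]

# Each PEi holds one element; one instruction updates them all at once.
print(simd_step(lambda x: x * 2, [10, 20, 30, 40]))   # [20, 40, 60, 80]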
Tightly coupled V/S Loosely coupled

Tightly Coupled System:
• Tasks and/or processors communicate in a highly synchronized fashion.
• Communicate through a common shared memory.
• Shared-memory system.
• Overhead for data exchange is comparatively lower.

Loosely Coupled System:
• Tasks or processors do not communicate in a synchronized fashion.
• Communicate by message passing of packets.
• Distributed-memory system.
• Overhead for data exchange is comparatively higher.
Interconnection Structures
1. Time-shared common bus
2. Multiport memory
3. Crossbar switch
4. Multistage switching network
5. Hypercube system
1. Time-shared common bus
[Figure: a single time-shared common bus connects the Memory unit, CPU 1-3, and IOP 1-2.]
2. Multiport Memory
[Figure: multiport memory. Each memory module MM 1-4 has four ports; CPU 1-4 each have a dedicated bus to every module, and conflicts are resolved by fixed priorities among the ports of each module.]
3. Crossbar switch
[Figure: crossbar switch. A grid of crosspoint switches connects CPU 1-4 to memory modules MM 1-4; each crosspoint gives one CPU a path to one module.]
4. Multistage switching network
Operation of 2 x 2 interchange switch
[Figure: a 2 x 2 interchange switch has two inputs, A and B, and two outputs, 0 and 1. The four control states are: A connected to 0, A connected to 1, B connected to 0, and B connected to 1.]
4. Multistage switching network
[Figure: an 8 x 8 multistage switching network built from 2 x 2 interchange switches. Processors P1 and P2 on the left reach destinations 000-111 on the right through three switch stages; at each stage, one bit of the destination address (most significant first) selects the upper (0) or lower (1) switch output.]
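Destination-tag routing through such a network is easy to express; a small Python sketch (assuming, as in the figure, that a 0 bit selects the upper output and the most significant bit is used first):

def switch_settings(dst, stages=3):
    """Output port (0 = upper, 1 = lower) chosen at each of the
    `stages` switches on the way to destination `dst`."""
    return [(dst >> (stages - 1 - s)) & 1 for s in range(stages)]

print(switch_settings(0b101))   # [1, 0, 1]: lower, upper, lower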
5. Hypercube Interconnection
[Figure: hypercube structures. One-cube: nodes 0 and 1. Two-cube: nodes 00, 01, 10, 11. Three-cube: nodes 000-111. Two nodes are connected when their binary addresses differ in exactly one bit.]
• Routing: the exclusive-OR of the source and destination addresses marks the dimensions to traverse, e.g. 010 x-or 001 = 011, so a message from node 010 to node 001 crosses two links.
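The XOR routing rule translates directly into code; a minimal Python sketch:

def hypercube_route(src, dst, n=3):
    """Path from src to dst in an n-cube: the XOR of the two addresses
    marks the bits to flip, one link traversal per differing bit."""
    path, node = [src], src
    for bit in range(n):
        if (src ^ dst) & (1 << bit):
            node ^= 1 << bit          # move along one dimension
            path.append(node)
    return path

print([format(v, "03b") for v in hypercube_route(0b010, 0b001)])
# ['010', '011', '001'] -- 010 x-or 001 = 011, so two links are crossed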


Cache Coherence Problem
[Figure: a shared variable X, initially 100 in Main Memory, is cached by processors P1, P2, and P3. When one processor writes a new value into its cached copy of X, the other caches still hold the old value; with write-through, main memory is updated but the other caches remain stale, while with write-back even main memory holds the old value until the block is written back.]
Cache Coherence Solution
• Write Update
• Write Invalidate
• Software approaches
– Compiler-based cache coherence mechanisms
• Hardware approaches
– Directory protocols
– Snoopy protocols
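As a toy illustration of the write-invalidate idea behind snoopy protocols (a Python sketch with an invented bus list; not a real protocol implementation):

class SnoopyCache:
    """Write-invalidate: a write broadcasts on the bus and every other
    cache discards its copy; write-through keeps memory up to date."""
    def __init__(self, bus):
        self.data = {}
        self.bus = bus
        bus.append(self)

    def read(self, addr, memory):
        if addr not in self.data:            # miss: fetch from memory
            self.data[addr] = memory[addr]
        return self.data[addr]

    def write(self, addr, value, memory):
        for cache in self.bus:
            if cache is not self:
                cache.data.pop(addr, None)   # snooped invalidate
        self.data[addr] = value
        memory[addr] = value                 # write-through

bus, memory = [], {"X": 100}
p1, p2 = SnoopyCache(bus), SnoopyCache(bus)
p1.read("X", memory); p2.read("X", memory)
p1.write("X", 120, memory)
print(p2.read("X", memory))   # 120 -- P2's stale copy was invalidated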
Shared Memory Architecture
[Figure: shared-memory multiprocessor. Each local bus connects a CPU, an IOP, and a local memory; a system bus controller links each local bus to the common system bus, which gives all processors access to the common shared memory.]