
Generic Questions

Q1. Why do you need more than one VM? What if you are provided one powerful VM instead of two or more?
Q2. Your CEO asks you: how many VMs should we buy from AWS for an app?
Q3. How do you gain confidence in serverless computing?
Q4. How do you compare the performance and cost benefits of your on-premise infrastructure and AWS?
Q5. Do you think buying more VMs improves application performance? If not, how do you improve it?
Q6. What factors affect application performance?
Q7. How would you gain confidence in future computational advancements in the Cloud offered by service providers?
Q8. What are the fundamental computing limitations and bounds?
Q9. How would you go about designing load balancers and scheduling algorithms as an R&D person?
Q10. How would you optimize the cost and performance of your on-premise or Cloud infrastructure?
An Introduction to Parallel Computing
Agenda

01 Motivating Factor: The Need and the Human Brain
02 Feasibility of Parallel Computing
03 Moore's Law
04 Elements of Parallel Computing
05 Factors Affecting Parallel System Performance
06 Parallel Programming Models
07 Computational Power Improvement
08 Two Eras of Computing
10 Hardware Architectures for Parallel Processing
11 Dependency Analysis & Conditions of Parallelism
12 Levels of Software Parallelism in Program Execution
13 Software Parallelism Types
14 Laws of Caution
15 The Goal of Parallel Processing
16 Amdahl's Law
17 Gustafson's Law
18 Optimal Cost Model
List of books

• Mastering Cloud Computing, Rajkumar Buyya, Christian Vecchiola, S. Thamarai Selvi, McGraw Hill.
• Advanced Computer Architecture: Parallelism, Scalability, Programmability, or Scalable Parallel Computing: Technology, Architecture, Programming, by Kai Hwang et al.
• Cloud Computing: Principles and Paradigms, Rajkumar Buyya, James Broberg, Andrzej Goscinski, Wiley Publishers.
• The remaining book resources are listed in your syllabus.
Why Do We Need to Study Parallel Computing?

Evolution of Computer Performance/Cost

[Figure: mental power in four scales; from "Robots After All," by H. Moravec, CACM, pp. 90-97, October 2003.]
The Speed-of-Light Argument

The speed of light is about 30 cm/ns.
Signals travel at 40-70% of the speed of light (say, 15 cm/ns).
If signals must travel 1.5 cm during the execution of an instruction, that instruction will take at least 0.1 ns; thus, performance will be limited to 10 GIPS.
This limitation is eased by continued miniaturization and by architectural methods such as cache memory; however, a fundamental limit does exist.
How does parallel processing help? Wouldn't multiple processors need to communicate via signals as well?
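Restating the bound above symbolically (a small sketch of the same arithmetic, not from the original slide), with distance d = 1.5 cm and signal speed v = 15 cm/ns:

\[
t_{\min} = \frac{d}{v} = \frac{1.5\,\text{cm}}{15\,\text{cm/ns}} = 0.1\,\text{ns},
\qquad
\text{instruction rate} \le \frac{1}{t_{\min}} = 10\,\text{GIPS}.
\]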
Trends in Processor Chip Density, Performance, Clock Speed, Power, and Number of Cores

[Figure: trends plotted against year of introduction; original data up to 2010 collected/plotted by M. Horowitz et al.; data for the 2010-2017 extension collected by K. Rupp.]
Why High-Performance/Parallel Computing?

1. Higher speed (solve problems faster)
   Important when there are "hard" or "soft" deadlines; e.g., a 24-hour weather forecast.
2. Higher throughput (solve more problems)
   Important when we have many similar tasks to perform; e.g., transaction processing.
3. Higher computational power (solve larger problems)
   e.g., a weather forecast for a week rather than 24 hours, or with a finer mesh for greater accuracy.

Categories of supercomputers:
• Uniprocessor; vector machine
• Multiprocessor; centralized or distributed shared memory
• Multicomputer; communicating via message passing
• Massively parallel processor (MPP; 1K or more processors)
Need of Parallel Computing

• Application demand
• Technology trends
• Architecture trends
• Economics
Applications need Parallel Processing

Here are some challenging applications in applied science/engineering:
• IoT
• Blockchain
• AI/ML
• Astrophysics
• Atmospheric and Ocean Modeling
• Bioinformatics
• Biomolecular simulation: protein folding
• Computational Chemistry
• Computational Fluid Dynamics (CFD)
• Computational Physics
• Computer vision and image understanding
• Data Mining and Data-intensive Computing
• Engineering analysis (CAD/CAM)
• Global climate modeling and forecasting
Applications' Computation and Memory Needs

[Figure: computation and memory requirements of scientific computing methods.]
Parallel Computing in Cloud

1. Parallel computing is the base for cloud computing
2. Designing resource provisioning algorithms
3. Designing scheduling algorithms
4. Load balancing algorithms
5. Performance optimization
6. Resource identification/recommendation
7. Cloud infrastructure deployment
8. Cloud resource utilization/migration
9. Prerequisite for an R&D team
10. Application-specific infrastructure selection and cost optimization
Parallel Computing

Parallel computing is a type of computing architecture in which several processors simultaneously execute multiple, smaller calculations broken down from an overall larger, complex problem.

[Figure: serial computing vs. parallel computing.]
Motivating Factor: Human Brain

The human brain consists of a large number (more than a billion) of neural cells that process information. Each cell works like a simple processor, and only the massive interaction among all cells and their parallel processing makes the brain's abilities possible.

The response speed of an individual neuron is slow. The aggregate speed with which complex calculations are carried out by (billions of) neurons demonstrates the feasibility of parallel processing.
Computing elements

[Figure: layered organization of computing elements — applications; programming paradigms; threads interface; microkernel; operating system; multi-processor computing system (hardware: processors running threads and processes).]
A Generic Parallel Computer Architecture

[Figure: a parallel machine consisting of (1) processing (compute) nodes, each with one or more processing elements or processors (2-8 cores per chip per node; custom or commercial microprocessors; single or multiple processors per chip; homogeneous or heterogeneous), memory, and a communication assist (CA) / network interface, and (2) a parallel machine network (custom or industry-standard interconnects) connecting the nodes; the operating system and parallel programming environments sit on top.]
Parallel computing architecture

Processing nodes:
Each processing node contains one or more processing elements (PEs) or processor(s), a memory system, plus a communication assist (network interface and communication controller).

Parallel machine network (system interconnects):
The function of a parallel machine network is to transfer information (data, results, ...) efficiently (i.e., at low communication cost) from a source node to a destination node, as needed to allow cooperation among parallel processing nodes in solving large computational problems divided into a number of parallel computational tasks.
Dual-Core Chip-Multiprocessor (CMP) Architectures

• Single die, shared L2 cache, shared system interface: cores communicate through the shared cache (lowest communication latency). Examples: IBM POWER4/5, Intel Pentium Core Duo (Yonah), Conroe (Core 2), Sun UltraSPARC T1 (Niagara), AMD Phenom.
• Single die, private caches, shared system interface: cores communicate using on-chip interconnects (the shared system interface). Examples: AMD dual-core Opteron, Athlon 64 X2, Intel Itanium 2 (Montecito).
• Two dies in a shared package, private caches, private system interfaces: cores communicate over an external Front Side Bus (FSB) (highest communication latency). Examples: Intel Pentium D, Intel quad-core (two dual-core chips).
Eight-Core CMP: Intel Nehalem-EX

Overview
Eight processor cores sharing 24 MB of level 3 (L3) cache
Each core is 2-way SMT (2 threads per core), for a total of 16 threads
Elements of Parallel Computing
• Computing Problems:
• Numerical Computing: Science and engineering numerical problems demand intensive integer and floating-point computations.
• Logical Reasoning: Artificial intelligence (AI) demands logic inference, symbolic manipulation, and searches over large spaces.

• Parallel Algorithms and Data Structures:


• Special algorithms and data structures are needed to specify the computations and communication present in computing
problems (from dependency analysis).
• Most numerical algorithms are deterministic using regular data structures.
• Symbolic processing may use heuristics or non-deterministic searches.
• Parallel algorithm development requires interdisciplinary interaction.
• Hardware Resources
• Processors, memory, and peripheral devices (processing nodes) form the hardware core of a computer system.
• Processor connectivity (system interconnects, network) and memory organization influence the system architecture.

• Operating Systems
• Manages the allocation of resources to running processes.
• Mapping to match algorithmic structures with hardware architecture and vice versa: processor scheduling, memory
mapping, interprocessor communication.
• Parallelism exploitation possible at: 1- algorithm design, 2- program writing, 3- compilation, and 4- run time.
Elements of Parallel Computing

• System Software Support


• Needed for the development of efficient programs in high-level languages (HLLs).
• Assemblers, loaders.
• Portable parallel programming languages/libraries
• User interfaces and tools/APIs.

• Compiler Support
• Implicit Parallelism Approach
• Parallelizing compiler: Can automatically detect parallelism in sequential source code and transforms it into parallel
constructs/code.
• Source code written in conventional sequential languages

• Explicit Parallelism Approach:


• Programmer explicitly specifies parallelism using:
• A sequential compiler (for a conventional sequential HLL) plus a low-level library of the target parallel computer, or
• A concurrent (parallel) HLL.
• Concurrency Preserving Compiler: The compiler in this case preserves the parallelism explicitly specified by the
programmer. It may perform some program flow analysis, dependence checking, limited optimizations for
parallelism detection.
Factors Affecting Parallel System Performance
• Parallel Algorithm Related:
• Available concurrency and profile, grain size, uniformity, patterns.
• Dependencies between computations represented by dependency graph
• Type of parallelism present: Functional and/or data parallelism.
• Required communication/synchronization, uniformity and patterns.
• Communication to computation ratio (C-to-C ratio, lower is better).

• Parallel program Related:


• Programming model used.
• Resulting data/code memory requirements, locality and working set characteristics.
• Parallel task grain size.
• Assignment (mapping) of tasks to processors: Dynamic or static.
• Cost of communication/synchronization primitives.

• Hardware/Architecture related:
• Total CPU computational power available.
• Types of computation modes supported (sequential, functional, and concurrent models).
• Shared address space vs. message passing.
• Communication network characteristics (topology, bandwidth, latency)
• Memory hierarchy properties.
Computational Power Improvement

[Figure: computational power improvement (C.P.I.) plotted against the number of processors, comparing a uniprocessor with a multiprocessor.]
Two eras of computing

[Figure: timeline from 1940 to 2030 showing the sequential era and the parallel era; each era progresses through architectures, compilers, applications, and problem-solving environments.]
Hardware architectures for parallel processing

Classification based on the number of instruction and data streams:
1. Single-instruction, single-data (SISD) stream systems
2. Single-instruction, multiple-data (SIMD) stream systems
3. Multiple-instruction, single-data (MISD) stream systems
4. Multiple-instruction, multiple-data (MIMD) stream systems
Hardware architectures for parallel processing

Single-instruction, single-data (SISD) systems:

An SISD computing system is a uniprocessor machine capable of executing a single instruction, which operates on a single data stream.
Dominant representative SISD systems are the IBM PC, Macintosh, and workstations.

[Figure: SISD organization — a single processor with one instruction stream, one data input, and one data output.]
Hardware architectures for parallel processing

Single-instruction, multiple-data (SIMD) systems:

An SIMD computing system is a multiprocessor machine capable of executing the same instruction on all the CPUs while operating on different data streams.
SIMD models are well suited to scientific computing, since it involves lots of vector and matrix operations. Example: Ci = Ai * Bi.
A dominant representative SIMD system is Cray's vector processing machine.
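To make the element-wise example Ci = Ai * Bi concrete, here is a small C sketch (illustrative, not from the slides); the loop applies one operation to many data elements, which a vectorizing compiler, or the OpenMP simd directive used below, can map onto SIMD/vector instructions:

#include <stdio.h>

#define N 8

int main(void) {
    float a[N], b[N], c[N];

    /* Initialize the input vectors. */
    for (int i = 0; i < N; i++) {
        a[i] = (float)i;
        b[i] = 2.0f;
    }

    /* One instruction stream, many data elements: each iteration applies
       the same multiply to a different element, which the compiler can
       execute in SIMD fashion. */
    #pragma omp simd
    for (int i = 0; i < N; i++) {
        c[i] = a[i] * b[i];
    }

    for (int i = 0; i < N; i++) {
        printf("c[%d] = %.1f\n", i, c[i]);
    }
    return 0;
}

With OpenMP SIMD support enabled (e.g., -fopenmp-simd), the pragma asks the compiler to vectorize the loop; without it, the loop still runs correctly in scalar form.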
Hardware architectures for parallel processing

[Figure: SIMD organization — a single instruction stream drives processors 1..N, each with its own data input and data output.]
Hardware architectures for parallel processing

Multiple-instruction, single-data (MISD) systems:

An MISD computing system is a multiprocessor machine capable of executing different instructions on different PEs, with all of them operating on the same data set.
Example: y = sin(x) + cos(x) + tan(x)
Machines built using the MISD model are not useful in most applications; a few machines have been built, but none of them is available commercially.
Hardware architectures for parallel processing

[Figure: MISD organization — processors 1..N each receive a different instruction stream but operate on a single data input stream and produce a single data output stream.]
Hardware architectures for parallel processing

Multiple-instruction, multiple-data (MIMD) systems:

An MIMD computing system is a multiprocessor machine capable of executing multiple instructions on multiple data sets.
Each PE in the MIMD model has separate instruction and data streams, which makes MIMD machines well suited to any kind of application.
Unlike SIMD and MISD machines, the PEs in MIMD machines work asynchronously.
Hardware architectures for parallel processing

[Figure: MIMD organization — processors 1..N each have their own instruction stream, data input, and data output.]

MIMD machines are broadly categorized into shared-memory MIMD and distributed-memory MIMD.
Hardware architectures for parallel processing

Shared-memory MIMD:
• All the PEs are connected to a single global memory and all have access to it; such systems are referred to as tightly coupled multiprocessor systems.
• Communication among PEs takes place through the shared memory.
• A modification of the data stored in the global memory by one PE is visible to all other PEs.
• Examples: Silicon Graphics machines and Sun/IBM SMPs. API: OpenMP (a minimal sketch follows below).

Distributed-memory MIMD:
• All PEs have a local memory; such systems are referred to as loosely coupled multiprocessor systems. Communication between PEs takes place through the interconnection network.
• PEs can be connected in tree, mesh, or cube topologies, and so on.
• Each PE operates asynchronously; if communication/synchronization among tasks is necessary, they do so by exchanging messages. Programming API: MPI.
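As a minimal sketch of the shared-memory style and the OpenMP API mentioned above (an illustration, not from the slides), the threads below communicate implicitly through the shared array a and a reduction variable:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* Shared-memory style: all threads see the same array "a",
       so no explicit messages are needed. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = 0.5 * i;
    }

    /* The reduction clause performs the synchronization needed to
       combine per-thread partial sums into the shared variable. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += a[i];
    }

    printf("max threads = %d, sum = %.0f\n", omp_get_max_threads(), sum);
    return 0;
}

Compiled with OpenMP support (e.g., gcc -fopenmp), the iterations of each loop are divided among the available threads.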
Hardware architectures for parallel processing

• The shared-memory MIMD architecture is easier to program but is less tolerant of failures and harder to extend than the distributed-memory MIMD model.
• Failures in a shared-memory MIMD system affect the entire system, whereas this is not the case for the distributed model.
• Shared-memory MIMD architectures are less likely to scale, because the addition of more PEs leads to memory contention.
• This situation does not arise in the distributed-memory case, in which each PE has its own memory.
• As a result, distributed-memory MIMD architectures are the most popular today.
[Figure: shared-memory MIMD — processors 1..N connected through memory buses to a global system memory — versus distributed-memory MIMD — processors, each with its own local memory, connected through an IPC channel.]
Parallel Programming Models

A parallel programming model is the programming methodology used in coding parallel applications. It specifies: 1- communication and 2- synchronization.

Multiprogramming or multi-tasking (not true parallel processing):
• No communication or synchronization at the program level; a number of independent programs run on different processors in the system.

Shared memory address space (SAS):
• Parallel program threads or tasks communicate implicitly using a shared memory address space (shared data in memory).

Message passing:
• Explicit point-to-point communication (via send/receive pairs) is used between parallel program tasks using messages (see the MPI sketch after this list).

Data parallel:
• More regimented, global actions on data (i.e., the same operation over all elements of an array or vector).
• Can be implemented with a shared address space or message passing.
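As a concrete sketch of the message-passing model and the MPI API mentioned earlier (an illustration, not from the slides), the two tasks below communicate through an explicit send/receive pair; run it with at least two ranks, e.g. mpirun -np 2 ./a.out:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Task 0 explicitly sends a message to task 1. */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("rank 0 sent %d\n", value);
    } else if (rank == 1) {
        /* Task 1 explicitly receives it; no shared memory is assumed. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}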
Case Study: A Task Dependency Graph Example

[Figure: a task dependency graph with tasks A-G, scheduled (a) on one processor and (b) on two processors P0 and P1; the timelines show computation, communication, and idle periods.]

Assumptions:
• Computation time for each task A-G = 3
• Communication time between parallel tasks = 1
• Communication can overlap with computation

Running all tasks on one processor takes T1 = 21; on two processors, T2 = 16, so the speedup is T1/T2 = 21/16 ≈ 1.3.

What would the speedup be with 3 processors? 4 processors? 5?
Session 2: Parallel Computing
Parallel Programs: Definitions

A parallel program consists of a number of tasks running as threads (or processes) on a number of processing elements that cooperate/communicate as part of a single parallel computation.

Task:
• An arbitrary piece of undecomposed work in a parallel computation.
• Executed sequentially on a single processor; concurrency in a parallel computation exists only across tasks.
Parallel or independent tasks:
• Tasks with no dependencies among them, which can therefore run in parallel on different processing elements.
Parallel task grain size: the amount of computation in a task.
Process (thread):
• An abstract program entity that performs the computations assigned to a task.
• Processes communicate and synchronize to perform their tasks.
Processor (or processing element):
• The physical computing engine on which a process executes sequentially.
• Processes virtualize the machine to the programmer: first write the program in terms of processes, then map processes to processors.
Communication-to-computation ratio (C-to-C ratio): represents the amount of communication between the tasks of a parallel program relative to their computation.
Levels of parallelism

Levels of parallelism are decided based on the lumps of code (grain size) that can be a potential
candidate for parallelism.

Goal: to boost processor efficiency by hiding latency.

• The idea is to execute concurrently two or more single-threaded applications, such as compiling, text
formatting, database searching, and device simulation.
• Major levels of granularities:
• Coarse grain (or task level)
• Medium grain (or control level)
• Fine grain (data level)
Levels of parallelism

According to task (grain) size:
• Level 5: Jobs or programs (multiprogramming, i.e., multi-tasking) — coarse grain.
• Level 4: Subprograms, job steps, or related parts of a program — medium grain.
• Level 3: Procedures, subroutines, or co-routines — medium grain.
• Level 2: Non-recursive loops or unfolded iterations — fine grain (thread-level parallelism, TLP).
• Level 1: Instructions or statements — fine grain (instruction-level parallelism, ILP).

Moving toward finer grain means smaller tasks and a higher degree of software parallelism (DOP), but also increasing communication demand, a higher C-to-C ratio, and greater mapping/scheduling overheads.
Computational Parallelism and
Grain Size

Task grain size (granularity) is a measure of the amount of computation involved in a task in parallel
computation:

• Instruction Level (Fine Grain Parallelism):


• At instruction or statement level.
• 20 instructions grain size or less.
• Manual parallelism detection is difficult but assisted by parallelizing compilers.
• Loop level (Fine Grain Parallelism):
• Iterative loop operations.
• Typically, 500 instructions or less per iteration.
• Optimized on vector parallel computers.
• Independent successive loop operations can be vectorized or run in SIMD mode.
Computational Parallelism and Grain Size

Procedure level (Medium Grain Parallelism):


• Procedure, subroutine levels.
• Less than 2000 instructions.
• Parallelism is more difficult to detect than at finer-grain levels.
• Lower communication requirements than fine-grain parallelism.
• Relies heavily on effective operating system support.

Subprogram level (Coarse Grain Parallelism):


• Job and subprogram level.
• Thousands of instructions per grain.
• Often scheduled on message-passing multicomputer.

Job (program) level, or Multiprogramming:


• Independent programs executed on a parallel computer.
• Grain size in tens of thousands of instructions.
Conditions of Parallelism: Data & Name Dependence

Assume task S2 follows task S1 in sequential program order.

1. True data (flow) dependence: Task S2 is data dependent on task S1 if an execution path exists from S1 to S2 and if at least one output variable of S1 feeds in as an input operand used by S2. Represented by S1 → S2 in task dependency graphs.

Name dependencies:
2. Anti-dependence: Task S2 is anti-dependent on S1 if S2 follows S1 in program order and the output of S2 overlaps the input of S1. Represented by S1 → S2 in dependency graphs.
3. Output dependence: Two tasks S1 and S2 are output dependent if they produce the same output variables (or their outputs overlap). Represented by S1 → S2 in task dependency graphs.
Dependencies

Flow dependence:
Si: x := a + b
Sj: y := x + c
Sj is flow dependent on Si (Si → Sj).

Output dependence:
Si: z := x + a
Sj: z := y + b
Sj is output dependent on Si (Si →o Sj).

Anti-dependence:
Si: x := y + a
Sj: y := a + b
Sj is anti-dependent on Si (Si → Sj).

Control dependence:
Si: if (x = c)
Sj:   y := 1
Sj is control dependent on Si (Si →c Sj).
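The same dependence types can be seen in a small, self-contained C program (an illustrative sketch, not from the slides); the statement labels S1-S5 exist only in the comments:

#include <stdio.h>

int main(void) {
    int a = 1, b = 2, c = 3;
    int x, y, z = 0;

    x = a + b;       /* S1 */
    y = x + c;       /* S2: flow (true data) dependence on S1 -- S2 reads the x written by S1 */
    x = c + 5;       /* S3: anti-dependence on S2 -- S3 overwrites the x that S2 read;
                            output dependence on S1 -- S1 and S3 both write x */
    if (y == 6)      /* S4 */
        z = y + 1;   /* S5: control dependence on S4 -- S5 runs only if the branch is taken */

    printf("x=%d y=%d z=%d\n", x, y, z);
    return 0;
}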
Dependency Graph Example

Here, assume each instruction is treated as a task (sequential order S1-S4):

S1: Load R1, A    ; R1 ← Memory(A)
S2: Add R2, R1    ; R2 ← R1 + R2
S3: Move R1, R3   ; R1 ← R3
S4: Store B, R1   ; Memory(B) ← R1

True data dependence: (S1, S2), (S3, S4), i.e., S1 → S2 and S3 → S4.
Output dependence: (S1, S3), i.e., S1 → S3.
Anti-dependence: (S2, S3), i.e., S2 → S3.

[Figure: the corresponding dependency graph over S1-S4.]
Dependency Graph Example

Here, assume each instruction is treated as a task (MIPS code, in program order):

1: ADD.D F2, F1, F0
2: ADD.D F4, F2, F3
3: ADD.D F2, F2, F4
4: ADD.D F4, F2, F6

True data dependence: (1, 2), (1, 3), (2, 3), (3, 4), i.e., 1 → 2, 1 → 3, 2 → 3, 3 → 4.
Output dependence: (1, 3), (2, 4), i.e., 1 → 3, 2 → 4.
Anti-dependence: (2, 3), (3, 4), i.e., 2 → 3, 3 → 4.

[Figure: the corresponding task dependency graph over instructions 1-4.]
Conditions of Parallelism

• Control dependence (algorithm & program related):
  • The order of execution cannot be determined before runtime due to conditional statements.
• Resource dependence:
  • Concerned with conflicts in using shared resources among parallel tasks, including functional units (integer, floating point), memory areas, communication links, etc.
• Bernstein's conditions of parallelism:
  Two processes P1 and P2, with input sets I1, I2 (operands) and output sets O1, O2 (results produced), can execute in parallel (denoted P1 || P2) if:
    I1 ∩ O2 = ∅ and I2 ∩ O1 = ∅ (i.e., no flow (data) dependence or anti-dependence — which is which?)
    O1 ∩ O2 = ∅ (i.e., no output dependence)
Bernstein's Conditions: An Example

For the following statements P1-P5, assume each instruction requires one step to execute and that two adders are available:

P1: C = D × E
P2: M = G + C
P3: A = B + C
P4: C = L + M
P5: F = G ÷ E

Using Bernstein's conditions after checking statement pairs:
P1 || P5, P2 || P3, P2 || P5, P3 || P5, P4 || P5

A possible parallel structure consistent with these conditions: P1; Co-Begin P2, P3, P5; Co-End; P4.

[Figure: dataflow graphs of the sequential execution versus the parallel execution in three steps, assuming two adders are available per step; data dependences are shown as solid lines and resource dependences as dashed lines.]
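A small C sketch (illustrative, not from the slides) that encodes each statement's input and output sets as bitmasks over the variables and checks Bernstein's three conditions mechanically; the helper bernstein_parallel and the bit encoding are assumptions of this example:

#include <stdio.h>

/* One bit per program variable. */
enum { A = 1 << 0, B = 1 << 1, C = 1 << 2, D = 1 << 3, E = 1 << 4,
       F = 1 << 5, G = 1 << 6, L = 1 << 7, M = 1 << 8 };

struct stmt {
    const char *name;
    unsigned in;   /* input set  I */
    unsigned out;  /* output set O */
};

/* p || q holds iff In(p)&Out(q) == 0, In(q)&Out(p) == 0 and Out(p)&Out(q) == 0. */
static int bernstein_parallel(struct stmt p, struct stmt q) {
    return (p.in & q.out) == 0 && (q.in & p.out) == 0 && (p.out & q.out) == 0;
}

int main(void) {
    struct stmt P1 = { "P1", D | E, C };  /* C = D * E */
    struct stmt P2 = { "P2", G | C, M };  /* M = G + C */
    struct stmt P3 = { "P3", B | C, A };  /* A = B + C */
    struct stmt P5 = { "P5", G | E, F };  /* F = G / E */

    printf("P1 || P2 ? %s\n", bernstein_parallel(P1, P2) ? "yes" : "no"); /* no: flow dependence on C */
    printf("P2 || P3 ? %s\n", bernstein_parallel(P2, P3) ? "yes" : "no"); /* yes */
    printf("P1 || P5 ? %s\n", bernstein_parallel(P1, P5) ? "yes" : "no"); /* yes */
    return 0;
}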
The Goal of Parallel Processing

• The goal of parallel processing is to maximize parallel speedup:

  Speedup = Time(1) / Time(p) < (sequential work on one processor) / (max over processors of (work + synch wait time + comm cost + extra work))

• Ideal speedup = p = number of processors. This is very hard to achieve: it implies no parallelization overheads and perfect load balance among all processors.
• Maximize parallel speedup by:
  • Balancing computations across processors (every processor does the same amount of work) and balancing overheads.
  • Minimizing communication cost and the other overheads associated with each step of parallel program creation and execution.
• Performance scalability:
  Achieve a good speedup for the parallel application on the parallel architecture as problem size and machine size (number of processors) are increased.
Parallel Performance Metrics
Efficiency, Utilization, Redundancy, Quality of Parallelism

System efficiency: Let O(n) be the total number of unit operations performed by an n-processor system and T(n) be the parallel execution time in unit time steps (n = number of processors; O(1) = work on one processor; O(n) = total work on n processors).
• In general T(n) < O(n) (more than one operation is performed by more than one processor per unit time).
• For one processor, assume T(1) = O(1).
• Speedup factor: S(n) = T(1) / T(n)
• Ideal T(n) = T(1)/n -> ideal speedup = n
• Parallel system efficiency E(n) for an n-processor system: E(n) = S(n)/n = T(1) / [n T(n)]
• Ideally: ideal speedup S(n) = n, and thus ideal efficiency E(n) = n/n = 1.
Parallel Performance Metrics
Cost, Utilization, Redundancy, Quality of Parallelism

• Cost: The processor-time product, or cost, of a computation is defined as
  Cost(n) = n T(n) = n T(1) / S(n) = T(1) / E(n)
  (recall Speedup S(n) = T(1)/T(n) and Efficiency E(n) = S(n)/n).
• The cost of sequential computation on one processor (n = 1) is simply T(1).
• A cost-optimal parallel computation on n processors has a cost proportional to T(1); this happens when
  S(n) = n, E(n) = 1 -> Cost(n) = T(1)
• Redundancy: R(n) = O(n)/O(1)
  • Ideally, with no overheads/extra work, O(n) = O(1) -> R(n) = 1
• Utilization: U(n) = R(n) E(n)
  • Ideally R(n) = E(n) = U(n) = 1
• Quality of parallelism: Q(n) = S(n) E(n) / R(n)
  Ideally S(n) = n, E(n) = R(n) = 1 -> Q(n) = n
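A small C sketch (illustrative, not from the slides) that evaluates these metrics for hypothetical measured values of T(1), T(n), O(1), and O(n):

#include <stdio.h>

int main(void) {
    /* Hypothetical measurements for an n-processor run. */
    int    n  = 4;
    double t1 = 100.0, tn = 30.0;   /* T(1), T(n): execution times       */
    double o1 = 100.0, on = 110.0;  /* O(1), O(n): total unit operations */

    double speedup     = t1 / tn;                           /* S(n) = T(1)/T(n) */
    double efficiency  = speedup / n;                       /* E(n) = S(n)/n    */
    double cost        = n * tn;                            /* Cost(n) = n T(n) */
    double redundancy  = on / o1;                           /* R(n) = O(n)/O(1) */
    double utilization = redundancy * efficiency;           /* U(n) = R(n)E(n)  */
    double quality     = speedup * efficiency / redundancy; /* Q(n)             */

    printf("S=%.2f E=%.2f Cost=%.1f R=%.2f U=%.2f Q=%.2f\n",
           speedup, efficiency, cost, redundancy, utilization, quality);
    return 0;
}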
Laws of Caution

• Understand the technicalities: how much does an application actually benefit from parallelism?
• Parallelism performs multiple activities together to increase system throughput or speed.
• The relations that govern the increase in speed are not linear.
• For example, given n processors, the user expects speed to increase by n times, but this rarely happens, because of communication overhead.
• Guideline 1: The speed of computation is proportional to the square root of system cost; they never increase linearly. Thus, the faster a system becomes, the more expensive it is to increase its speed further.
• Guideline 2: Parallel computer speed increases as the logarithm of the number of processors.
Speedup

[Figure: (a) execution on one processor takes time ts, split into a serial section f·ts and parallelizable sections (1 − f)·ts; (b) with p processors, the parallelizable part takes (1 − f)·ts / p, giving a parallel time tp = f·ts + (1 − f)·ts / p.]
Another Way: Speedup

The speedup factor is given by:

S(p) = ts / (f·ts + (1 − f)·ts / p) = 1 / (f + (1 − f)/p), and lim (p → ∞) S(p) = 1/f.

This equation is known as Amdahl's law.

Even with an infinite number of processors, the maximum speedup is limited to 1/f.
Example: with only 5% of the computation being serial (f = 0.05), the maximum speedup is 20, irrespective of the number of processors.
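A minimal C sketch (illustrative, not from the slides) that evaluates Amdahl's law for a serial fraction f and a growing processor count p, reproducing the 5% example above:

#include <stdio.h>

/* Amdahl's law: S(p) = 1 / (f + (1 - f)/p), where f is the serial fraction. */
static double amdahl_speedup(double f, double p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void) {
    double f = 0.05;  /* 5% of the computation is serial */

    for (int p = 1; p <= 4096; p *= 4) {
        printf("p = %4d  ->  S(p) = %.2f\n", p, amdahl_speedup(f, (double)p));
    }
    /* As p grows, S(p) approaches the limit 1/f = 20. */
    return 0;
}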
Illustration of the Amdahl Effect

[Figure: speedup S(p) = 1 / (f + (1 − f)/p) plotted against the number of processors for f = 0.01, f = 0.1, and f = 0.85.]
Super-linear Speedup (S(n) > n)

• Most texts besides Quinn's argue that linear speedup is the maximum speedup obtainable.
• Occasionally, speedup that appears to be superlinear may occur because of:
  • extra memory in the parallel system;
  • a sub-optimal sequential algorithm being used for comparison;
  • an algorithm that has a random aspect in its design (e.g., random selection).
• Examples include "nonstandard" problems involving:
  • real-time requirements, where meeting deadlines is part of the problem;
  • data that is not all initially available, but has to be processed as it arrives;
  • real-life situations, such as a person who can only keep a driveway open during a severe snowstorm with the help of friends.
Super-linear Speedup (S(n) > n)

(a) Searching each sub-space sequentially

[Figure: starting at time 0, the sequential search examines the p sub-spaces one after another, each taking ts/p; the solution is found after x complete sub-space searches (x indeterminate), i.e., after time x·ts/p, plus a further Δt within the next sub-space.]
Super-linear Speedup (S(n) > n)

(b) Searching each sub-space in parallel

[Figure: all p sub-spaces are searched simultaneously; the solution is found after time Δt.]

The speedup is then given by:

S(p) = (x · ts/p + Δt) / Δt
Super-linear Speedup (S(n) > n)

The worst case for the sequential search is when the solution is found in the last sub-space searched; then the parallel version offers the greatest benefit:

S(p) = (((p − 1)/p) · ts + Δt) / Δt → ∞ as Δt tends to zero.

The least advantage for the parallel version occurs when the solution is found in the first sub-space of the sequential search:

S(p) = Δt / Δt = 1

The actual speedup depends on which sub-space holds the solution, but it can be extremely large.
Examples

• 95% of a program's execution time occurs inside a loop that can be executed in parallel. What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs?

  Speedup ≤ 1 / (0.05 + (1 − 0.05)/8) ≈ 5.9

• An oceanographer gives you a serial program and asks you how much faster it might run on 8 processors. You can only find one function amenable to a parallel solution. Benchmarking on a single processor reveals that 80% of the execution time is spent inside this function. What is the best speedup a parallel version is likely to achieve on 8 processors?

  Answer: 1 / (0.2 + (1 − 0.2)/8) ≈ 3.3


Other Way: Speedup (Gustafson's Law)

• As the machine size increases, the workload (or problem size) is also increased.
• Let Ts be the time to perform the sequential operations and Tp(n, W) the time to perform the parallel operations of problem size (workload) W using n processors.
• Then the speedup with n processors is:
  S(n) = (Ts + Tp(1, W)) / (Ts + Tp(n, W))
• Assume that the parallel operations achieve a linear speedup (i.e., these operations take 1/n of the time they take on one processor); then Tp(1, W) = n · Tp(n, W).
• Let α be the fraction of sequential workload, i.e., α = Ts / (Ts + Tp(n, W)).
• Then the speedup can be expressed as:
  S(n) = (Ts + n · Tp(n, W)) / (Ts + Tp(n, W)) = α + n(1 − α)
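A matching C sketch for Gustafson's scaled speedup S(n) = α + n(1 − α) (illustrative, not from the slides), using a 5% sequential fraction for comparison with Amdahl's bound of 20:

#include <stdio.h>

/* Gustafson's law: scaled speedup with sequential fraction alpha
   of the scaled (parallel) execution time. */
static double gustafson_speedup(double alpha, double n) {
    return alpha + n * (1.0 - alpha);
}

int main(void) {
    double alpha = 0.05;  /* 5% sequential fraction */

    for (int n = 1; n <= 1024; n *= 4) {
        printf("n = %4d  ->  scaled speedup = %.2f\n",
               n, gustafson_speedup(alpha, (double)n));
    }
    /* Unlike Amdahl's fixed-size bound of 1/alpha, the scaled speedup
       keeps growing roughly linearly as processors are added. */
    return 0;
}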
Cost Optimal

Cost = parallel running time × number of processors

• A parallel algorithm is cost-optimal if
  parallel cost = O(f(t)),
  where f(t) is the running time of an optimal sequential algorithm.
• Equivalently, a parallel algorithm for a problem is said to be cost-optimal if its cost is proportional to the running time of an optimal sequential algorithm for the same problem.
• By proportional, we mean that
  cost = tp × n = k × ts,
  where k is a constant, n is the number of processors, tp is the parallel running time, and ts is the sequential running time.
• In cases where no optimal sequential algorithm is known, the "fastest known" sequential algorithm is sometimes used instead.
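As a standard worked illustration (not from the slides; the usual parallel-sum example): adding m numbers sequentially takes ts = Θ(m), while p processors can form local partial sums and combine them in a logarithmic reduction:

\[
t_p = \Theta\!\left(\frac{m}{p} + \log p\right),
\qquad
\text{cost} = p\,t_p = \Theta\!\left(m + p\log p\right),
\]

so the cost remains proportional to ts, i.e., the algorithm is cost-optimal, as long as p·log p = O(m).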
