M3
Q1. Why do you need more than one VM? What if you are provided one powerful VM instead of two or more?
Q2. Your CEO asks you: how many VMs should we buy from AWS for an app?
Q3. How do you gain confidence in serverless computing?
Q4. How do you compare the performance and cost benefits of your on-premise infrastructure and AWS?
Q5. Do you think buying more VMs improves application performance? If not, how do you get it?
Q6. What factors affect application performance?
Q7. How would you gain confidence in future computational advancements in the Cloud offered by service providers?
Q8. What are the fundamental computing limitations and bounds?
Q9. How would you go about designing load balancers and scheduling algorithms as an R&D person?
Q10. How would you optimize the cost and performance of your on-premise or Cloud infrastructure?
An Introduction to Parallel Computing
Agenda
01 Motivating Factor: The Need and Human Brain
02 Feasibility of Parallel Computing
03 Moore’s Law
04 Elements of Parallel Computing
05 Factors Affecting Parallel System Performance
06 Parallel Programming Models
07 Computational Power Improvement
08 Two Eras of Computing
10 Hardware Architectures for Parallel Processing
11 Dependency Analysis & Conditions of Parallelism
12 Levels of Software Parallelism in Program Execution
13 Software Parallelism Types
[Figure: evolution of computing power, from “Robots After All,” by H. Moravec, CACM, pp. 90-97, October 2003.]
[Figure: microprocessor trends by year of introduction; original data up to 2010 collected/plotted by M. Horowitz et al., 2010-2017 extension collected by K. Rupp.]
Why High-Performance/Parallel Computing?
1. Higher speed (solve problems faster): important when there are “hard” or “soft” deadlines; e.g., a 24-hour weather forecast.
2. Higher throughput (solve more problems): important when we have many similar tasks to perform; e.g., transaction processing.
3. Higher computational power (solve larger problems): e.g., a weather forecast for a week rather than 24 hours, or with a finer mesh for greater accuracy.
Categories of supercomputers
Uniprocessor; vector machine
Multiprocessor; centralized or distributed shared memory
Multicomputer; communicating via message passing
Massively parallel processor (MPP; 1K or more processors)
Need of Parallel Computing
The human brain consists of a large number (more than a billion) of neural cells that process information. Each cell works like a simple processor, and only the massive interaction among all cells and their parallel processing makes the brain's abilities possible.
The response speed of an individual neuron is slow, yet the aggregate speed with which complex calculations are carried out by billions of neurons demonstrates the feasibility of parallel processing.
Computing elements
Processing Nodes:
Each processing node contains one or more processing elements (PEs) or processors, a memory system, plus a communication assist (network interface and communication controller).
Cores may communicate through a shared cache (lowest communication latency), through on-chip interconnects (shared system interface), or over an external Front Side Bus (FSB) (highest communication latency).
Overview
Eight processor cores sharing 24 MB of level 3 (L3) cache
Each core is 2-way SMT (2 threads per core), for a total of 16 threads
Elements of Parallel Computing
• Computing Problems:
• Numerical Computing: Science and engineering numerical problems demand intensive integer and floating point
computations.
• Logical Reasoning: Artificial intelligence (AI) demands logic inference, symbolic manipulation, and large search-space exploration.
• Operating Systems
• Manages the allocation of resources to running processes.
• Mapping to match algorithmic structures with hardware architecture and vice versa: processor scheduling, memory
mapping, interprocessor communication.
• Parallelism exploitation possible at: 1- algorithm design, 2- program writing, 3- compilation, and 4- run time.
Elements of Parallel Computing
• Compiler Support
• Implicit Parallelism Approach
• Parallelizing compiler: Can automatically detect parallelism in sequential source code and transforms it into parallel
constructs/code.
• Source code written in conventional sequential languages
• Hardware/Architecture related:
• Total CPU computational power available.
• Types of computation modes supported (sequential, functional, and concurrent models).
• Shared address space vs. message passing.
• Communication network characteristics (topology, bandwidth, latency)
• Memory hierarchy properties.
Computational Power Improvement
[Figure: computational power improvement (C.P.I.) vs. number of processors, contrasting uniprocessor and multiprocessor growth.]
Two eras of computing
[Figure: development of architectures, compilers, and applications across the two eras of computing, 1940–2030.]
Hardware architectures for parallel processing: SISD
Dominant representative SISD systems are the IBM PC, Macintosh, and workstations.
[Figure: SISD organization — a single instruction stream driving a single processor.]
Hardware architectures for parallel processing: SIMD
SIMD is a multiprocessor machine capable of executing the same instruction on all the CPUs but operating on different data streams.
SIMD models are well suited to scientific computing, since it involves many vector and matrix operations.
Ex.: Ci = Ai * Bi (a data-parallel sketch follows below).
A dominant representative SIMD system is Cray’s vector processing machine.
[Figure: SIMD organization — a single instruction stream broadcast to processors 1..N, each operating on its own data stream.]
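As an illustration of the Ci = Ai * Bi example (my own sketch, not taken from the slides), the loop below expresses the data-parallel pattern in C; the #pragma omp simd hint asks an OpenMP-capable compiler to vectorize the loop and is simply ignored otherwise:

#include <stdio.h>

#define N 8

int main(void) {
    double A[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    double B[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    double C[N];

    /* One instruction (multiply) applied across all array elements:
       the SIMD / data-parallel pattern C[i] = A[i] * B[i]. */
    #pragma omp simd
    for (int i = 0; i < N; i++)
        C[i] = A[i] * B[i];

    for (int i = 0; i < N; i++)
        printf("C[%d] = %.1f\n", i, C[i]);
    return 0;
}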
Hardware architectures for parallel processing: MISD
In the MISD model, multiple instruction streams operate on a single data stream, so each processor applies a different operation to the same data.
Ex.: Y = sin(x) + cos(x) + tan(x), where each term can be computed by a different processor on the same x (a sketch follows the figure below).
Machines built using the MISD model are not useful for most applications; a few machines have been built, but none of them are available commercially.
[Figure: MISD organization — instruction streams 1..N, each driving one of processors 1..N.]
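The MISD idea can be mimicked in software with OpenMP sections: each section applies a different operation to the same input x. This is only an analogy of my own (compile with -fopenmp and -lm), since true MISD hardware is rare:

#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 0.5;                 /* the single shared data item */
    double s = 0.0, c = 0.0, t = 0.0;

    /* Three different operations ("instruction streams") on the same x. */
    #pragma omp parallel sections
    {
        #pragma omp section
        s = sin(x);
        #pragma omp section
        c = cos(x);
        #pragma omp section
        t = tan(x);
    }

    printf("Y = %f\n", s + c + t);  /* Y = sin(x) + cos(x) + tan(x) */
    return 0;
}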
Hardware architectures for parallel processing: MIMD
[Figure: MIMD organization with processors 1..N.]
MIMD machines are broadly categorized into shared-memory MIMD and distributed-memory MIMD.
Shared-memory MIMD:
• All the PEs are connected to a single global memory and all have access to it; such systems are referred to as tightly coupled multiprocessor systems.
• Communication among PEs takes place through the shared memory.
• A modification of the data stored in the global memory by one PE is visible to all other PEs.
• Ex.: Silicon Graphics machines and Sun/IBM SMPs; programming API: OpenMP (a sketch follows below).
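A minimal OpenMP sketch of the shared-memory model (my example, not from the slides): threads communicate implicitly through the shared array and a reduction variable; assumes compilation with -fopenmp:

#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N];                 /* globally shared among all threads */

int main(void) {
    double sum = 0.0;

    /* Every thread works on a slice of the same shared array; the
       reduction combines the per-thread partial sums at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 0.5 * i;
        sum += a[i];
    }

    printf("sum = %.1f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}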
Distributed-memory MIMD:
• All PEs have a local memory; such systems are referred to as loosely coupled multiprocessor systems, and communication between PEs takes place through the interconnection network.
• PEs can be connected in tree, mesh, cube, and other topologies.
• Each PE operates asynchronously; if communication/synchronization among tasks is necessary, they do so by exchanging messages. Programming API: MPI (a sketch follows below).
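A minimal MPI sketch of the distributed-memory model (my example, not from the slides): each process owns only its local data and cooperates purely through explicit send/receive messages; assumes an MPI installation (compile with mpicc, launch with mpirun):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = rank * rank;          /* data private to each PE */

    if (rank != 0) {
        /* Explicit point-to-point communication: send the local result to PE 0. */
        MPI_Send(&local, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        int total = local, incoming;
        for (int src = 1; src < size; src++) {
            MPI_Recv(&incoming, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            total += incoming;
        }
        printf("sum of rank^2 over %d processes = %d\n", size, total);
    }

    MPI_Finalize();
    return 0;
}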
• The shared-memory MIMD architecture is easier to program but is less tolerant of failures and harder to extend than the distributed-memory MIMD model.
• Failures in a shared-memory MIMD system affect the entire system, whereas this is not the case for the distributed model.
• Shared-memory MIMD architectures are less likely to scale because the addition of more PEs leads to memory contention.
• This situation does not arise in the distributed-memory case, in which each PE has its own memory.
• As a result, distributed-memory MIMD architectures are the most popular today.
[Figure: distributed-memory processing elements connected through IPC channels.]
Message passing:
• Explicit point-to-point communication (via send/receive pairs) is used between the tasks of a parallel program, using messages.
Data parallel:
• More regimented: global actions on data (i.e., the same operation applied over all elements of an array or vector).
• Can be implemented with a shared address space or with message passing.
Case Study: An example
Task dependency graph over tasks A–G, scheduled on one processor (P0) and on two processors (P0, P1).
• Assume the computation time for each task A–G is 3.
• Assume the communication time between parallel tasks is 1.
• Assume communication can overlap with computation.
Running all tasks on one processor gives T1 = 21; the two-processor schedule gives T2 = 16.
Speedup on two processors: T1/T2 = 21/16 ≈ 1.3.
What would the speedup be with 3 processors? 4 processors? 5? (A critical-path sketch follows below.)
[Figure: the task dependency graph and the one- and two-processor schedules, showing computation, communication, and idle time.]
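To see where such numbers come from, the sketch below (my illustration; the edge list is hypothetical and not the slide's actual graph) sums the per-task times to get T1 and computes the critical-path length of a task dependency graph, which lower-bounds any schedule and therefore upper-bounds the achievable speedup when communication is ignored:

#include <stdio.h>

#define NTASKS 7   /* tasks A..G, each taking 3 time units as in the slide */

int main(void) {
    int time[NTASKS] = {3, 3, 3, 3, 3, 3, 3};
    /* Hypothetical dependency edges (pred -> succ), for illustration only. */
    int edges[][2] = { {0,1}, {0,2}, {1,3}, {2,4}, {2,5}, {3,6}, {4,6}, {5,6} };
    int nedges = sizeof(edges) / sizeof(edges[0]);

    /* Longest path ending at each task; tasks are indexed in topological
       order, so every predecessor is processed before its successor. */
    int finish[NTASKS];
    int t1 = 0, cp = 0;
    for (int v = 0; v < NTASKS; v++) {
        int start = 0;
        for (int e = 0; e < nedges; e++)
            if (edges[e][1] == v && finish[edges[e][0]] > start)
                start = finish[edges[e][0]];
        finish[v] = start + time[v];
        t1 += time[v];
        if (finish[v] > cp) cp = finish[v];
    }

    printf("T1 (serial) = %d, critical path = %d, speedup bound = %.2f\n",
           t1, cp, (double)t1 / cp);
    return 0;
}

With the slide's two-processor schedule the measured T2 is 16 (speedup ≈ 1.3); the critical path indicates how much headroom remains for 3, 4, or 5 processors before dependencies, rather than processor count, become the limit.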
Session 2: Parallel Computing
Parallel Programs: Definitions
A parallel program is composed of a number of tasks running as threads (or processes) on a number of processing elements that cooperate and communicate as parts of a single parallel computation.
Task:
• Arbitrary piece of undecomposed work in parallel computation
• Executed sequentially on a single processor; concurrency in parallel computation is only across tasks.
Parallel or Independent Tasks:
• Tasks with no dependencies among them, which can therefore run in parallel on different processing elements.
Parallel Task Grain Size: The amount of computation in a task.
Process (thread):
• Abstract program entity that performs the computations assigned to a task
• Processes communicate and synchronize to perform their tasks
Processor or (Processing Element):
• Physical computing engine on which a process executes sequentially
• Processes virtualize the machine to the programmer
• First write program in terms of processes, then map to processors
Communication-to-Computation Ratio (C-to-C Ratio): the amount of communication relative to computation between the tasks of a parallel program. (A small pthreads sketch of tasks, threads, and processing elements follows below.)
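A small pthreads sketch of this vocabulary (my illustration, with an assumed decomposition into four independent summation tasks): each task is executed sequentially by one thread, and the threads are what the OS maps onto processing elements:

#include <pthread.h>
#include <stdio.h>

#define NTASKS 4

/* Each task: sum one quarter of the range [0, 4e6). The tasks are
   independent, so they can run in parallel on different PEs. */
static long long partial[NTASKS];

static void *run_task(void *arg) {
    long id = (long)arg;
    long long lo = id * 1000000LL, hi = lo + 1000000LL, s = 0;
    for (long long i = lo; i < hi; i++)
        s += i;
    partial[id] = s;                 /* task executed sequentially by one thread */
    return NULL;
}

int main(void) {
    pthread_t th[NTASKS];
    for (long i = 0; i < NTASKS; i++)
        pthread_create(&th[i], NULL, run_task, (void *)i);

    long long total = 0;
    for (long i = 0; i < NTASKS; i++) {
        pthread_join(th[i], NULL);   /* synchronize: wait for each task */
        total += partial[i];
    }
    printf("total = %lld\n", total);
    return 0;
}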
Levels of parallelism
Levels of parallelism are decided based on the lumps of code (grain size) that are potential candidates for parallelism.
• The idea is to execute concurrently two or more single-threaded applications, such as compiling, text
formatting, database searching, and device simulation.
• Major levels of granularity:
• Coarse grain (or task level)
• Medium grain (or control level)
• Fine grain (data level)
According to task (grain) size:
• Level 5: Jobs or programs (multiprogramming, i.e., multi-tasking) — coarse grain
• Level 4: Subprograms, job steps, or related parts of a program — medium grain
• Level 3: Procedures, subroutines, or co-routines — medium grain
• Level 2: Non-recursive loops or unfolded iterations — fine grain (thread-level parallelism, TLP)
• Level 1: Instructions or statements — fine grain (instruction-level parallelism, ILP)
Moving from coarse grain toward fine grain increases the communication demand, the mapping/scheduling overheads, and the C-to-C ratio, while raising the degree of software parallelism (DOP) through more, smaller tasks.
Computational Parallelism and Grain Size
Task grain size (granularity) is a measure of the amount of computation involved in a task in a parallel computation.
Dependencies
Assume task S2 follows task S1 in sequential program order.
1. True data (flow) dependence: task S2 is data dependent on task S1 if an execution path exists from S1 to S2 and at least one output variable of S1 feeds in as an input operand used by S2. Represented by S1 → S2 in task dependency graphs.
2. Anti-dependence: task S2 is anti-dependent on S1 if S2 follows S1 in program order and the output of S2 overlaps the input of S1. Represented by S1 → S2 in dependency graphs.
3. Output dependence: two tasks S1 and S2 are output dependent if they produce the same output variables (or their outputs overlap). Represented by S1 → S2 in task dependency graphs.
Anti-dependence and output dependence are name dependencies. (A small code fragment illustrating all three follows below.)
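A tiny C fragment of my own (not the slides') annotating the three dependence types among statements S1–S4:

#include <stdio.h>

int main(void) {
    int a = 1, b = 2, c, d;

    c = a + b;   /* S1: writes c                                              */
    d = c * 2;   /* S2: reads c            -> true (flow) dependence S1 -> S2 */
    a = d - 1;   /* S3: writes a read by S1 -> anti-dependence       S1 -> S3 */
    c = a + 5;   /* S4: writes c like S1    -> output dependence     S1 -> S4 */

    printf("%d %d %d %d\n", a, b, c, d);
    return 0;
}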
Dependency Graph Example
Here assume each instruction is treated as a task.
[Figure: code fragment and its dependency graph; only the annotation of an anti-dependence between S2 and S3 (S2 → S3, where S3 assigns R1 ← R3) is recoverable.]
Dependency Graph Example
Here assume each instruction is treated as a task. MIPS code, in program order:
1: ADD.D F2, F1, F0
2: ADD.D F4, F2, F3
3: ADD.D F2, F2, F4
4: ADD.D F4, F2, F6
Output dependence: (1, 3) and (2, 4), i.e., 1 → 3 and 2 → 4 (each pair writes the same destination register).
Anti-dependence: (2, 3) and (3, 4), i.e., 2 → 3 and 3 → 4 (a later task overwrites a register read by an earlier one).
(The true data dependences implied by the code are 1 → 2, 1 → 3, 2 → 3, and 3 → 4.)
[Figure: task dependency graph for the four instructions.]
Conditions of Parallelism
• Control Dependence: Algorithm & Program Related
• Ideal speedup is very hard to achieve: it implies no parallelization overheads and a perfect load balance among all processors.
• Maximize parallel speedup by:
• Balancing the computation across processors (every processor does the same amount of work) with the same amount of overhead.
• Minimizing communication cost and the other overheads associated with each step of parallel program creation and execution.
• Performance Scalability:
Achieve a good speedup for the parallel application on the parallel architecture as the problem size and machine size (number of processors) are increased.
Parallel Performance Metrics
Efficiency, Utilization, Redundancy, Quality of Parallelism
System Efficiency: Let O(n) be the total number of unit operations performed by an n-processor system and T(n) be
the parallel execution time in unit time steps:
• In general, T(n) < O(n), since more than one operation can be performed by the n processors per unit time.
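The metric formulas themselves did not survive extraction; the standard textbook definitions in this notation (stated here as a reference, not as a transcript of the slide, with $O(1) = T(1)$ on a uniprocessor) are:
\[
S(n) = \frac{T(1)}{T(n)}, \qquad
E(n) = \frac{S(n)}{n} = \frac{T(1)}{n\,T(n)}, \qquad
R(n) = \frac{O(n)}{O(1)},
\]
\[
U(n) = R(n)\,E(n) = \frac{O(n)}{n\,T(n)}, \qquad
Q(n) = \frac{S(n)\,E(n)}{R(n)} = \frac{T^{3}(1)}{n\,T^{2}(n)\,O(n)} .
\]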
Let $t_s$ be the execution time on a single processor and $f$ the fraction of the computation that is inherently sequential. With $p$ processors the parallelizable part takes $(1 - f)\,t_s / p$, so the parallel time is $t_p = f\,t_s + (1 - f)\,t_s / p$.
Another way: speedup (Amdahl's law)
\[
S(p) = \frac{t_s}{f\,t_s + (1 - f)\,t_s / p} = \frac{1}{f + (1 - f)/p},
\qquad
\lim_{p \to \infty} S(p) = \frac{1}{f} .
\]
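A small C helper of my own that evaluates the fixed-workload (Amdahl) speedup above for the serial fractions shown in the figure that follows:

#include <stdio.h>

/* Amdahl speedup for serial fraction f on p processors. */
static double amdahl(double f, int p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void) {
    double fractions[] = {0.01, 0.1, 0.85};
    int procs[] = {2, 8, 64, 1024};

    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 4; j++)
            printf("f=%.2f p=%4d  S=%6.2f\n",
                   fractions[i], procs[j], amdahl(fractions[i], procs[j]));
        printf("f=%.2f limit  S=%6.2f\n", fractions[i], 1.0 / fractions[i]);
    }
    return 0;
}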
[Figure: speedup vs. number of processors for f = 0.01, f = 0.1, and f = 0.85.]
Super-linear Speedup (S(n) > n)
Consider a search space divided into p sub-spaces, each taking $t_s/p$ to scan.
(a) Searching the sub-spaces sequentially: if the solution lies in sub-space $x$ ($x$ indeterminate), the sequential time is $x \cdot t_s/p + \Delta t$, where $\Delta t$ is the extra time needed to reach the solution inside that sub-space.
(b) Searching each sub-space in parallel: the solution is found after time $\Delta t$, so
\[
S(p) = \frac{x \cdot t_s / p + \Delta t}{\Delta t} .
\]
The worst case for the sequential search is when the solution is found in the last sub-space ($x = p - 1$); the parallel version then offers the greatest benefit:
\[
S(p) = \frac{\frac{p-1}{p}\, t_s + \Delta t}{\Delta t} \to \infty
\quad \text{as } \Delta t \text{ tends to zero.}
\]
The least advantage for the parallel version is when the solution is found in the first sub-space of the sequential search:
\[
S(p) = \frac{\Delta t}{\Delta t} = 1 .
\]
The actual speedup depends on which sub-space holds the solution but could be extremely large.
Examples
• 95% of a program’s execution time occurs inside a loop that can be executed in parallel. What is the maximum
speedup we should expect from a parallel version of the program executing on 8 CPUs?
\[
S \le \frac{1}{0.05 + (1 - 0.05)/8} \approx 5.9
\]
• An oceanographer gives you a serial program and asks you how much faster it might run on 8 processors. You
can only find one function amenable to a parallel solution. Benchmarking on a single processor reveals 80% of
the execution time is spent inside this function. What is the best speedup a parallel version is likely to achieve on
8 processors?
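Working this one out with the same formula (my arithmetic; the slide does not show the answer): the serial fraction is $1 - 0.8 = 0.2$, so
\[
S \le \frac{1}{0.2 + 0.8/8} = \frac{1}{0.3} \approx 3.3 .
\]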
Speedup (Gustafson’s Law)
• As the machine size increases, the workload (or problem size) also increases.
• Let Ts be the time to perform the sequential operations and Tp(n, W) the time to perform the parallel operations for a problem of size (workload) W on n processors.
• The speedup with n processors follows from these two times (see the reconstruction below).
• Assume the parallel operations achieve a linear speedup (i.e., these operations take 1/n of the time needed on one processor), so that Tp(1, W) = n · Tp(n, W).
• Let α be the fraction of the workload that is sequential; the speedup can then be expressed as shown below.
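The two formulas referenced above were lost in extraction; a hedged reconstruction in the slide's notation, following the standard Gustafson-style scaled-workload derivation, is:
\[
S(n) = \frac{T_s + T_p(1, W)}{T_s + T_p(n, W)} .
\]
With the linear-speedup assumption $T_p(1, W) = n\,T_p(n, W)$ and the sequential fraction
\[
\alpha = \frac{T_s}{T_s + T_p(n, W)},
\]
this becomes
\[
S(n) = \frac{T_s + n\,T_p(n, W)}{T_s + T_p(n, W)} = \alpha + n\,(1 - \alpha) = n - \alpha\,(n - 1),
\]
which grows nearly linearly with $n$ when the sequential fraction $\alpha$ is small.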
Cost Optimal