HPC Final PPTs
A bitonic merging network for n = 16. The input wires are numbered 0,1,…,
n - 1, and the binary representation of these numbers is shown. Each
column of comparators is drawn separately; the entire figure represents
a BM[16] bitonic merging network. The network takes a bitonic
sequence and outputs it in sorted order.
Sorting Networks: Bitonic Sort
How do we sort an unsorted sequence using a bitonic merge?
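A compact serial sketch of the idea, assuming the input length is a power of two: recursively build a bitonic sequence (an ascending half followed by a descending half) and clean it with a bitonic merge, mirroring the BM[n] network above. The function names are illustrative, not from the slides.

    def bitonic_merge(a, ascending):
        """Sort a bitonic sequence a into the requested order (the BM network)."""
        n = len(a)
        if n == 1:
            return a
        half = n // 2
        for i in range(half):                      # one column of comparators
            if (a[i] > a[i + half]) == ascending:
                a[i], a[i + half] = a[i + half], a[i]
        return bitonic_merge(a[:half], ascending) + bitonic_merge(a[half:], ascending)

    def bitonic_sort(a, ascending=True):
        """Sort an arbitrary sequence by forming a bitonic sequence and merging it."""
        n = len(a)
        if n == 1:
            return a
        first = bitonic_sort(a[:n // 2], True)
        second = bitonic_sort(a[n // 2:], False)
        return bitonic_merge(first + second, ascending)

    print(bitonic_sort([10, 3, 7, 1, 15, 4, 9, 2]))   # [1, 2, 3, 4, 7, 9, 10, 15]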
(a) Communication patterns used in the 2-D block mapping. When computing d(k)[i, j],
information must be sent to the highlighted process from two other processes along
the same row and column. (b) The row and column of √p processes that contain the
kth row and column send them along process columns and rows.
Floyd's Algorithm: Parallel Formulation
Using 2-D Block Mapping
Floyd's parallel formulation using the 2-D block mapping. P*,j denotes
all the processes in the jth column, and Pi,* denotes all the processes
in the ith row. The matrix D(0) is the adjacency matrix.
Floyd's Algorithm: Parallel Formulation
Using 2-D Block Mapping
• During each iteration of the algorithm, the kth row and kth
column of processors perform a one-to-all broadcast
along their rows/columns.
• The size of this broadcast is n/√p elements, taking time
Θ((n log p)/ √p).
• The synchronization step takes time Θ(log p).
• The computation time per iteration is Θ(n^2/p).
• The parallel run time of the 2-D block mapping formulation of Floyd's algorithm is
  Tp = Θ(n^3/p) + Θ((n^2 log p)/√p).
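To make the structure being parallelized concrete, here is a minimal serial sketch of Floyd's all-pairs shortest-path recurrence; the 2-D formulation distributes this d matrix in √p × √p blocks and broadcasts the kth row and column in each iteration. The function name and the small example graph are illustrative, not from the slides.

    import math

    def floyd_all_pairs(dist):
        """In-place Floyd's algorithm: dist[i][j] is the adjacency matrix, with
        math.inf where no edge exists. This is the D(k) recurrence that the
        slides parallelize with 2-D block mapping."""
        n = len(dist)
        for k in range(n):                 # the kth row/column is what gets broadcast
            for i in range(n):
                for j in range(n):
                    dist[i][j] = min(dist[i][j], dist[i][k] + dist[k][j])
        return dist

    inf = math.inf
    d = [[0, 3, inf, 7],
         [8, 0, 2, inf],
         [5, inf, 0, 1],
         [2, inf, inf, 0]]
    print(floyd_all_pairs(d))    # [[0, 3, 5, 6], [5, 0, 2, 3], [3, 6, 0, 1], [2, 5, 7, 0]]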
Floyd's Algorithm: Parallel Formulation
Using 2-D Block Mapping
• The above formulation can use O(n^2 / log^2 n) processors
cost-optimally.
• The isoefficiency of this formulation is Θ(p^1.5 log^3 p).
• This algorithm can be further improved by relaxing the
strict synchronization after each iteration.
Floyd's Algorithm: Speeding Things Up
by Pipelining
• The synchronization step in parallel Floyd's algorithm
can be removed without affecting the correctness of the
algorithm.
• A process starts working on the kth iteration as soon as it
has computed the (k-1)th iteration and has the relevant
parts of the D(k-1) matrix.
Floyd's Algorithm: Speeding Things Up
by Pipelining
PE HPCC
2021-22
BTech IT PE HPC 1
Topic Overview
BTech IT PE HPC 2
Analytical Modeling -
Basics
A sequential algorithm is evaluated by its runtime (in general,
asymptotic runtime as a function of input size).
BTech IT PE HPC 3
Analytical Modeling -
Basics
Wall clock time - the time from the start of the first processor to the
stopping time of the last processor in a parallel ensemble. But how
does this scale when the number of processors is changed or the
program is ported to another machine altogether?
How much faster is the parallel version? This begs the obvious
follow-up question - what is the baseline serial version with which we
compare? Can we use a suboptimal serial program to make our
parallel program look good?
BTech IT PE HPC 4
Sources of Overhead in Parallel Programs
If I use two processors, shouldn't my program run twice as fast?
No - a number of overheads, including wasted computation,
communication, idling, and contention, cause degradation in
performance.
Execution Time
BTech IT PE HPC 5
Sources of Overheads in Parallel Programs
BTech IT PE HPC 6
Performance Metrics for Parallel Systems: Execution
Time
Serial runtime of a program is the time elapsed between the
beginning and the end of its execution on a sequential computer.
The parallel runtime is the time that elapses from the moment the
first processor starts to the moment the last processor finishes
execution.
BTech IT PE HPC 1
Performance Metrics for Parallel Systems: Total Parallel
Overhead
Let Tall be the total time collectively spent by all the processing
elements.
Observe that Tall - Ts is then the total time spent by all processors
combined in non-useful work. This is called the total overhead.
To = p Tp - Ts (1)
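A small sketch that evaluates these definitions numerically; the measured times used below are made-up values for illustration only.

    def parallel_metrics(t_serial, t_parallel, p):
        """Return (total overhead To, speedup S, efficiency E) from the
        definitions To = p*Tp - Ts, S = Ts/Tp, E = S/p."""
        overhead = p * t_parallel - t_serial
        speedup = t_serial / t_parallel
        efficiency = speedup / p
        return overhead, speedup, efficiency

    # Example: Ts = 100 s, Tp = 15 s on p = 8 processors.
    To, S, E = parallel_metrics(100.0, 15.0, 8)
    print(To, S, E)   # 20.0, ~6.67, ~0.83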
BTech IT PE HPC
Performance Metrics for Parallel Systems: Speedup
BTech IT PE HPC 9
Performance Metrics: Example
BTech IT PE HPC 10
Performance Metrics: Example
(Figure: (e) accumulation of the sum at processing element 0 after the final communication step, for adding 16 numbers on 16 processing elements.)
BTech IT PE HPC 12
Performance Metrics: Speedup
BTech IT PE HPC 13
Performance Metrics: Speedup Example
BTech IT PE HPC 14
Performance Metrics: Speedup Bounds
BTech IT PE HPC 15
Performance Metrics: Superlinear Speedups
BTech IT PE HPC 16
Performance Metrics: Superlinear Speedups
BTech IT PE HPC 17
Performance Metrics: Efficiency
Mathematically, it is given by
E = S / p (2)
BTech IT PE HPC 18
Performance Metrics: Efficiency Example
Efficiency is given by
E = Θ(n / log n) / n
  = Θ(1 / log n)
BTech IT PE HPC 19
Parallel Time, Speedup, and Efficiency Example
(Figure: panels (a)-(c) of the edge-detection example, including the 3 × 3 convolution templates.)
BTech IT PE HPC 20
Parallel Time, Speedup, and Efficiency Example
(Continued)
BTech IT PE HPC 21
Parallel Time, Speedup, and Efficiency Example
(continued)
The total time for the algorithm is therefore given by:
Tp = 9 tc n^2 / p + 2 (ts + tw n)
The corresponding speedup is
S = 9 tc n^2 / (9 tc n^2 / p + 2 (ts + tw n))
and the efficiency is
E = 1 / (1 + 2 p (ts + tw n) / (9 tc n^2))
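A quick numeric check of these expressions; the machine parameters tc, ts, tw and the problem size below are illustrative assumptions, not values from the slides.

    def edge_detect_metrics(n, p, tc, ts, tw):
        """Evaluate Tp, S and E for the edge-detection example:
        Tp = 9*tc*n^2/p + 2*(ts + tw*n)."""
        tp = 9 * tc * n * n / p + 2 * (ts + tw * n)
        ts_serial = 9 * tc * n * n
        s = ts_serial / tp
        return tp, s, s / p

    print(edge_detect_metrics(n=1024, p=16, tc=1e-9, ts=1e-6, tw=1e-8))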
BTech IT PE HPC 22
Cost of a Parallel System
Cost reflects the sum of the time that each processing element
spends solving the problem.
BTech IT PE HPC 23
Cost of a Parallel System: Example
Since the serial runtime of this operation is Θ(n), the algorithm is not
cost optimal.
BTech IT PE HPC 24
Impact of Non-Cost Optimality
BTech IT PE HPC 25
Effect of Granularity on Performance
The first log p of the log n steps of the original algorithm are
simulated in (n/ p) log p steps on p processing elements.
BTech IT PE HPC 27
Building Granularity: Example (continued)
The cost is Θ(n log p), which is asymptotically higher than the Θ(n)
cost of adding n numbers sequentially. Therefore, the parallel
system is not cost-optimal.
BTech IT PE HPC 28
Building Granularity: Example (continued)
(Figure: computing the sum of 16 numbers using four processing elements; panels (a)-(d) show the successive steps.)
BTech IT PE HPC 29
Building Granularity: Example (continued)
BTech IT PE HPC 30
BTech IT Sem I
PE HPCC
2021-22
Dr. D. B. Kulkarni
BTech IT PE HPC 1
Topic Overview
BTech IT PE HPC 2
Scalability of Parallel Systems
(Figure: a comparison of the speedups obtained by the binary-exchange, 2-D transpose, and 3-D transpose algorithms.)
The efficiency of a parallel program can be written as
E = S / p = Ts / (p Tp), or
E = 1 / (1 + To / Ts) (4)
The total overhead function To is an increasing function of p.
BTech IT PE HPC 4
Scaling Characteristics of Parallel Programs
Ts depends on
Initialization
Distribution time for inputs
BTech IT PE HPC 5
Scaling Characteristics of Parallel Programs: Example
Tp = n / p + 2 log p (5)
S = n / (n / p + 2 log p) (6)
E = 1 / (1 + (2 p log p) / n) (7)
BTech IT PE HPC 6
Scaling Characteristics of Parallel Programs: Example
(continued)
Plotting the speedup for various input sizes gives us:
(Figure: speedup S versus number of processing elements for adding a list of numbers, for n = 64, 192, 320, and 512, compared with the linear speedup line.)
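A minimal sketch that regenerates the data behind such a plot from Equation 6; the n values mirror the figure, and plotting is left out to keep the snippet dependency-free.

    import math

    def speedup(n, p):
        """Speedup of adding n numbers on p processing elements,
        S = n / (n/p + 2 log p), from Equation 6."""
        return n / (n / p + 2 * math.log2(p))

    for n in (64, 192, 320, 512):
        row = [round(speedup(n, p), 1) for p in (1, 4, 8, 16, 32)]
        print(n, row)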
BTech IT PE HPC 1
Scaling Characteristics of Parallel Programs
BTech IT PE HPC
Scaling Characteristics of Parallel Programs
BTech IT PE HPC 9
Isoefficiency Metric of Scalability
(Figure: (a) variation of efficiency E as the number of processing elements p is increased for a fixed problem size W; (b) variation of E as W is increased for a fixed p.)
BTech IT PE HPC 10
Isoefficiency Metric of Scalability
What is the rate at which the problem size must increase with
respect to the number of processing elements to keep the efficiencyy
fixed?
This rate determines the scalability of the system. The slower this
rate, the better.
BTech IT PE HPC 11
Isoefficiency Metric of Scalability
Parallel runtime can be written as:
Tp = (W + To(W, p)) / p (8)
The resulting expression for speedup is
S = W / Tp = W p / (W + To(W, p)) (9)
So the efficiency is
E = S / p = W / (W + To(W, p)) = 1 / (1 + To(W, p) / W) (10)
BTech IT PE HPC 12
Isoefficiency Metric of Scalability
For a fixed efficiency E, To(W, p) / W must stay constant; writing K = E / (1 - E), this gives the isoefficiency relation
W = K To(W, p). (12)
BTech IT PE HPC 13
Isoefficiency Metric of Scalability
This function determines the ease with which a parallel system can
maintain a constant efficiency and hence achieve speedups
increasing in proportion to the number of processing elements
BTech IT PE HPC 14
Isoefficiency Metric: Example
The overhead function for the problem of adding n numbers on
p processing elements is approximately 2 p log p.
Substituting To by 2 p log p, we get W = K 2 p log p. (13)
Thus, the asymptotic isoefficiency function for this parallel system is Θ(p log p).
BTech IT PE HPC 15
Isoefficiency Metric: Example
The overhead function for the problem of adding n numbers on
p processing elements is approximately 2 p log p.
BTech IT PE HPC 16
Isoefficiency Metric: Example
Consider a more complex example where To = p^(3/2) + p^(3/4) W^(3/4).
Using only the first term of To in Equation 12, we get
W = K p^(3/2) (14)
Using only the second term, Equation 12 yields
W = K p^(3/4) W^(3/4)
W^(1/4) = K p^(3/4)
W = K^4 p^3 (15)
The larger of the two rates, Θ(p^3), is the asymptotic isoefficiency function of this system.
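A small sketch that checks this result numerically: if W grows as p^3, the efficiency implied by E = 1 / (1 + To/W) should stay roughly constant for To = p^(3/2) + p^(3/4) W^(3/4). The constant c below is an arbitrary illustrative scaling factor.

    def efficiency(W, p):
        """E = 1 / (1 + To(W, p)/W) for To = p^1.5 + p^0.75 * W^0.75."""
        To = p ** 1.5 + p ** 0.75 * W ** 0.75
        return 1.0 / (1.0 + To / W)

    c = 10.0                      # arbitrary constant in W = c * p^3
    for p in (8, 64, 512, 4096):
        print(p, round(efficiency(c * p ** 3, p), 3))
    # The printed efficiencies approach a fixed value as p grows.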
BTech IT PE HPC 17
Cost-Optimality and the Isoefficiency Function
A parallel system is cost-optimal if and only if p Tp = Θ(W). (16)
Equivalently, W + To(W, p) = Θ(W), (17)
i.e., To(W, p) = O(W), or W = Ω(To(W, p)). (18)
BTech IT PE HPC 18
Lower Bound on the Isoefficiency Function
BTech IT PE HPC 19
Degree of Concurrency and the Isoefficiency Function
BTech IT PE HPC 20
Degree of Concurrency and the Isoefficiency Function: Example
BTech IT PE HPC 21
Minimum Execution Time and Minimum Cost-Optimal
Execution Time
Often, we are interested in the minimum time to solution.
d Tp / d p = 0 (19)
If p0 is the value of p determined by this equation, Tp(p0) is the
minimum parallel time.
BTech IT PE HPC 22
Minimum Execution Time: Example
Tp = n / p + 2 log p. (20)
Setting d Tp / d p = 0 gives p0 = n / 2, and the corresponding minimum parallel time is
Tp^min = 2 log n. (21)
(One may verify that this is indeed a min by verifying that the second
derivative is positive).
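A quick check of Equation 21 with base-2 logarithms: at p0 = n/2 the expression Tp = n/p + 2 log p evaluates exactly to 2 log n (since 2 + 2 log(n/2) = 2 log n), and nearby processor counts give essentially the same Θ(log n) time.

    import math

    def tp(n, p):
        """Tp = n/p + 2 log2(p) for adding n numbers on p processing elements."""
        return n / p + 2 * math.log2(p)

    n = 1024
    print(tp(n, n // 2), 2 * math.log2(n))       # both 20.0: Tp(n/2) = 2 log n
    for p in (n // 4, n // 2, n):                # nearby p gives similar Theta(log n) time
        print(p, tp(n, p))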
BTech IT PE HPC 23
Minimum Cost-Optimal Parallel Time
Tp^cost_opt = Θ(W / f^-1(W)) (22)
where f(p) is the isoefficiency function and f^-1(W) is the largest number of processing elements that can be used cost-optimally.
BTech IT PE HPC 24
Minimum Cost-Optimal Parallel Time: Example
Note that both Tp^min and Tp^cost_opt for adding n numbers are
Θ(log n). This may not always be the case.
BTech IT PE HPC 25
Asymptotic Analysis of Parallel Programs
Problem: sorting a list of n numbers.
The fastest serial programs for this problem run in time O(n log n).
Consider four parallel algorithms A1-A4:
       A1          A2       A3            A4
p      n^2         log n    n             √n
Tp     1           n        √n            √n log n
E      (log n)/n   1        (log n)/√n    1
BTech IT PE HPC 27
Other Scalability Metrics
BTech IT PE HPC 28
Dense Matrix Algorithms
Matrix-Vector Multiplication
Matrix-Matrix Multiplication
Solving a System of Linear Equations
Matrix Algorithms: Introduction
A[n × n] x[n × 1] = y[n × 1]
The serial algorithm requires n^2 multiplications and
additions.
W = n^2
Matrix-Vector Multiplication:
Row-wise 1-D Partitioning
The n × n matrix is partitioned among n processes, with
each process storing one complete row of the matrix.
The n × 1 vector x is distributed such that each process
owns one of its elements.
Matrix-Vector Multiplication:
Row-wise 1-D Partitioning
(Figure: multiplication of an n × n matrix with an n × 1 vector using rowwise block 1-D partitioning: (a) initial partitioning of the matrix and the starting vector x; (b) distribution of the full vector among all the processes by all-to-all broadcast; the result is y = A x.)
Matrix-Vector Multiplication:
Row-wise 1-D Partitioning
Since each process starts with only one element of x,
an all-to-all broadcast is required to distribute all the
elements to all the processes.
Process Pi now computes y[i] = Σ (A[i, j] × x[j]) over j = 0, …, n - 1.
The all-to-all broadcast and the computation of y[i] both
take time Θ(n). Therefore, the parallel time is Θ(n).
Matrix-Vector Multiplication:
Row-wise 1-D Partitioning
Consider now the case when p <n and we use block 1D
partitioning.
Each process initially stores n/p complete rows of the
matrix and a portion of the vector of size n/p.
The all-to-all broadcast takes place among p processes
and involves messages of size n/p.
This is followed by n/p local dot products.
Thus, the parallel run time of this procedure is
Tp = n^2 / p + ts log p + tw n.
This is cost-optimal.
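A serial sketch of the rowwise 1-D formulation: each "process" holds n/p rows of A and n/p entries of x, the all-to-all broadcast is modeled by assembling the full vector, and each process then performs its n/p local dot products. This is an illustration (assuming n is a multiple of p), not an MPI implementation.

    def matvec_rowwise_1d(A, x, p):
        """y = A x using rowwise block 1-D partitioning among p processes."""
        n = len(A)
        rows = n // p
        # Each process owns x[r*rows:(r+1)*rows]; an all-to-all broadcast
        # gives every process the full vector.
        full_x = [xi for r in range(p) for xi in x[r * rows:(r + 1) * rows]]
        y = []
        for r in range(p):                       # local dot products on each process
            for i in range(r * rows, (r + 1) * rows):
                y.append(sum(A[i][j] * full_x[j] for j in range(n)))
        return y

    A = [[1, 2, 0, 1], [0, 1, 1, 0], [2, 0, 1, 1], [1, 1, 1, 1]]
    x = [1, 2, 3, 4]
    print(matvec_rowwise_1d(A, x, p=2))          # [9, 5, 9, 10]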
Matrix-Vector Multiplication:
Row-wise 1-D Partitioning
Scalability Analysis:
(Figure: matrix-vector multiplication with 2-D partitioning: (a) initial data distribution and communication steps to align the vector along the diagonal; (b) one-to-all broadcast of portions of the vector along process columns; (c) all-to-one reduction of partial results; (d) final distribution of the result vector.)
Matrix-Matrix Multiplication
(Figure: the communication steps in Cannon's algorithm on 16 processes: A and B after the initial alignment, and the submatrix locations after the first, second, and third shifts.)
In the alignment step, since the maximum distance over which a block shifts is √p - 1, the two shift operations require a total of 2(ts + tw n^2/p) time.
Each of the √p single-step shifts in the compute-and-shift phase of the algorithm takes ts + tw n^2/p time.
The computation time for multiplying √p matrices of size (n/√p) × (n/√p) is n^3/p.
The parallel time of this (Cannon's) algorithm is approximately
Tp = n^3/p + 2 √p ts + 2 tw n^2/√p.
The cost-efficiency and isoefficiency of the algorithm are identical
to the first algorithm, except that this algorithm is memory optimal.
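A serial sketch of Cannon's memory-optimal scheme on a √p × √p grid of blocks: align A by shifting block-row i left by i and B by shifting block-column j up by j, then repeat "multiply local blocks, shift A left and B up by one" √p times. Plain Python lists are used, under the assumption that n is a multiple of √p; this is an illustration, not a message-passing implementation.

    def cannon_matmul(A, B, q):
        """Multiply n x n matrices A and B using Cannon's algorithm on a
        q x q logical process grid (q = sqrt(p)); returns C = A B."""
        n = len(A)
        b = n // q                                        # block size

        def block(M, i, j):
            return [row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]

        # Initial alignment: block-row i of A shifts left by i, block-column j of B up by j.
        Ab = [[block(A, i, (j + i) % q) for j in range(q)] for i in range(q)]
        Bb = [[block(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
        Cb = [[[[0] * b for _ in range(b)] for _ in range(q)] for _ in range(q)]

        for _ in range(q):                                # compute-and-shift phase
            for i in range(q):
                for j in range(q):
                    X, Y, Z = Ab[i][j], Bb[i][j], Cb[i][j]
                    for r in range(b):
                        for c in range(b):
                            Z[r][c] += sum(X[r][k] * Y[k][c] for k in range(b))
            Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]   # shift A left
            Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]   # shift B up

        # Reassemble C from its blocks.
        return [[Cb[i // b][j // b][i % b][j % b] for j in range(n)] for i in range(n)]

    A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    ident = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
    print(cannon_matmul(A, ident, q=2) == A)              # True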
Matrix-Matrix Multiplication:
DNS Algorithm
Uses a 3-D partitioning.
Visualize the matrix multiplication algorithm as a cube.
matrices A and B come in two orthogonal faces and
result C comes out the other orthogonal face.
Each internal node in the cube represents a single add-
multiply operation (and thus the complexity).
DNS algorithm partitions this cube using a 3-D block
scheme.
Matrix-Matrix Multiplication:
DNS Algorithm
Assume an n × n × n mesh of processors.
Move the columns of A and rows of B and perform
broadcast.
Each processor computes a single add-multiply.
This is followed by an accumulation along the C
dimension.
Since each add-multiply takes constant time and
accumulation and broadcast takes log n time, the total
runtime is log n.
This is not cost optimal. It can be made cost optimal by
using n / log n processors along the direction of
accumulation.
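A serial sketch of the DNS idea with one element per (i, j, k) cell: A[i, k] is "broadcast" along the j dimension, B[k, j] along the i dimension, every cell performs one multiply, and the results are accumulated along the k dimension. The helper name is illustrative.

    def dns_matmul(A, B):
        """C = A B computed DNS-style: one multiply per (i, j, k) cell of an
        n x n x n cube, followed by accumulation along the k dimension."""
        n = len(A)
        # Cell (i, j, k) receives A[i][k] (broadcast along j) and B[k][j]
        # (broadcast along i) and forms a single product.
        cube = [[[A[i][k] * B[k][j] for k in range(n)] for j in range(n)] for i in range(n)]
        # All-to-one reduction along k gives C[i][j].
        return [[sum(cube[i][j]) for j in range(n)] for i in range(n)]

    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(dns_matmul(A, B))      # [[19, 22], [43, 50]]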
Matrix-Matrix Multiplication:
DNS Algorithm
(Figure: the communication steps in the DNS algorithm while multiplying 4 × 4 matrices A and B on 64 processes: (a) the initial distribution of A and B; (b) after moving A[i, j] from P(i, j, 0) to P(i, j, j).)
Tp = n^3/p + ts log p + tw (n^2 / p^(2/3)) log p.
13.     b[i] := b[i] - A[i, k] × y[k];
14.     A[i, k] := 0;
15.   endfor;    /* Line 9 */
16. endfor;      /* Line 3 */
17. end GAUSSIAN_ELIMINATION
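Only the tail of the serial pseudocode survives above; a compact runnable equivalent of the whole procedure (without pivoting, as in the pseudocode) is sketched below.

    def gaussian_elimination(A, b):
        """Reduce A to unit upper-triangular form and return (U, y) such that
        U x = y can then be solved by back-substitution (no pivoting)."""
        n = len(A)
        y = [0.0] * n
        for k in range(n):
            for j in range(k + 1, n):            # division step (row k)
                A[k][j] = A[k][j] / A[k][k]
            y[k] = b[k] / A[k][k]
            A[k][k] = 1.0
            for i in range(k + 1, n):            # elimination step (rows below k)
                for j in range(k + 1, n):
                    A[i][j] = A[i][j] - A[i][k] * A[k][j]
                b[i] = b[i] - A[i][k] * y[k]
                A[i][k] = 0.0
        return A, y

    A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
    b = [5.0, 11.0, 27.0]
    U, y = gaussian_elimination(A, b)
    print(y)      # right-hand side after elimination; solve U x = y by back-substitution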
(Figure: a typical computation in Gaussian elimination. In iteration k, the active part of the matrix is updated as A[k, j] := A[k, j] / A[k, k] and A[i, j] := A[i, j] - A[i, k] × A[k, j]; rows above row k are inactive.)
Parallel Gaussian Elimination
(Figure: Gaussian elimination steps in the iteration corresponding to row k with rowwise 1-D partitioning:
(a) Computation: (i) A[k, j] := A[k, j] / A[k, k] for k < j < n; (ii) A[k, k] := 1.
(b) Communication: one-to-all broadcast of row A[k, *].
(c) Computation: (i) A[i, j] := A[i, j] - A[i, k] × A[k, j] for k < i < n and k < j < n; (ii) A[i, k] := 0 for k < i < n.)
(Figure: snapshots of pipelined Gaussian elimination; (e) iteration k = 1 starts; (f) iteration k = 0 ends.)
The computation and communication for each iteration move
through the mesh from top-left to bottom-right as a "front."
After the front corresponding to a certain iteration passes through a
process, the process is free to perform subsequent iterations.
Multiple fronts that correspond to different iterations are active
simultaneously.
Parallel Gaussian Elimination:
2-D Mapping with Pipelining
If each step (division, elimination, or communication) is
assumed to take constant time, the front moves a single
step in this time. The front takes Θ(n) time to reach
P(n-1, n-1).
8. endfor;
end BACK-SUBSTITUTION
BTech IT PE HPC 1
Topic Overview
BTech IT PE HPC 2
Analytical Modeling - Basics
BTech IT PE HPC 3
Analytical Modeling - Basics
• Wall clock time - the time from the start of the first processor to the
stopping time of the last processor in a parallel ensemble. But how
does this scale when the number of processors is changed or the
program is ported to another machine altogether?
• How much faster is the parallel version? This begs the obvious
follow-up question - what is the baseline serial version with which we
compare? Can we use a suboptimal serial program to make our
parallel program look good?
• Raw FLOP count - What good are FLOP counts when they don't
solve a problem?
BTech IT PE HPC 4
Sources of Overhead in Parallel Programs
BTech IT PE HPC 5
Sources of Overheads in Parallel Programs
BTech IT PE HPC 6
Performance Metrics for Parallel Systems: Execution
Time
• Serial runtime of a program is the time elapsed between the
beginning and the end of its execution on a sequential computer.
• The parallel runtime is the time that elapses from the moment the
first processor starts to the moment the last processor finishes
execution.
BTech IT PE HPC 7
Performance Metrics for Parallel Systems: Total Parallel
Overhead
• Let Tall be the total time collectively spent by all the processing
elements.
• Observe that Tall - TS is then the total time spent by all processors
combined in non-useful work. This is called the total overhead.
To = p TP - TS (1)
BTech IT PE HPC 8
Performance Metrics for Parallel Systems: Speedup
BTech IT PE HPC 9
Performance Metrics: Example
BTech IT PE HPC 10
Performance Metrics: Example
BTech IT PE HPC 12
Performance Metrics: Speedup
BTech IT PE HPC 13
Performance Metrics: Speedup Example
BTech IT PE HPC 14
Performance Metrics: Speedup Bounds
BTech IT PE HPC 15
Performance Metrics: Superlinear Speedups
BTech IT PE HPC 16
Performance Metrics: Superlinear Speedups
If DRAM access time is 100 ns, cache access time is 2 ns, and
remote memory access time is 400ns, this corresponds to a
speedup of 2.43!
BTech IT PE HPC 17
Performance Metrics: Efficiency
• Mathematically, it is given by
E = S / p (2)
BTech IT PE HPC 18
Performance Metrics: Efficiency Example
• Efficiency is given by E = Θ(n / log n) / n = Θ(1 / log n).
BTech IT PE HPC 19
Parallel Time, Speedup, and Efficiency Example
BTech IT PE HPC 20
Parallel Time, Speedup, and Efficiency Example
(continued)
BTech IT PE HPC 21
Parallel Time, Speedup, and Efficiency Example
(continued)
• The total time for the algorithm is therefore given by:
Tp = 9 tc n^2 / p + 2 (ts + tw n)
• The corresponding speedup is S = 9 tc n^2 / (9 tc n^2 / p + 2 (ts + tw n)), and the efficiency is
E = 1 / (1 + 2 p (ts + tw n) / (9 tc n^2))
BTech IT PE HPC 22
Cost of a Parallel System
• Cost reflects the sum of the time that each processing element
spends solving the problem.
BTech IT PE HPC 23
Cost of a Parallel System: Example
• Since the serial runtime of this operation is Θ(n), the algorithm is not
cost optimal.
BTech IT PE HPC 24
Impact of Non-Cost Optimality
• The first log p of the log n steps of the original algorithm are
simulated in (n / p) log p steps on p processing elements.
BTech IT PE HPC 27
Building Granularity: Example (continued)
• The cost is Θ (n log p), which is asymptotically higher than the Θ (n)
cost of adding n numbers sequentially. Therefore, the parallel
system is not cost-optimal.
BTech IT PE HPC 28
Building Granularity: Example (continued)
BTech IT PE HPC 29
Building Granularity: Example (continued)
• The parallel runtime of this algorithm is Tp = n / p + 2 log p. (3)
• The cost is Θ(n + p log p), which is cost-optimal as long as n = Ω(p log p).
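A serial sketch of this cost-optimal scheme: each of the p processing elements adds its n/p local numbers, and the p partial sums are then combined in log p tree steps (assumes n is a multiple of p and p is a power of two).

    def add_with_granularity(nums, p):
        """Add len(nums) numbers as if on p processing elements:
        local sums first (n/p work each), then a log p tree reduction."""
        n = len(nums)
        chunk = n // p
        partial = [sum(nums[i * chunk:(i + 1) * chunk]) for i in range(p)]  # local phase
        step = 1
        while step < p:                       # log p combining steps
            for i in range(0, p, 2 * step):
                partial[i] += partial[i + step]
            step *= 2
        return partial[0]

    print(add_with_granularity(list(range(64)), 8))   # 2016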
BTech IT PE HPC 30
Scalability of Parallel Systems
BTech IT PE HPC 31
Scaling Characteristics of Parallel Programs
E = S / p = Ts / (p Tp), or E = 1 / (1 + To / Ts). (4)
BTech IT PE HPC 32
Scaling Characteristics of Parallel Programs
• The overall efficiency of the parallel program goes down. This is the
case for all parallel programs.
BTech IT PE HPC 33
Scaling Characteristics of Parallel Programs: Example
Tp = n / p + 2 log p (5)
S = n / (n / p + 2 log p) (6)
E = 1 / (1 + (2 p log p) / n) (7)
BTech IT PE HPC 34
Scaling Characteristics of Parallel Programs: Example
(continued)
Plotting the speedup for various input sizes gives us:
BTech IT PE HPC 35
Scaling Characteristics of Parallel Programs
BTech IT PE HPC 36
Scaling Characteristics of Parallel Programs
BTech IT PE HPC 37
Isoefficiency Metric of Scalability
BTech IT PE HPC 38
Isoefficiency Metric of Scalability
BTech IT PE HPC 39
Isoefficiency Metric of Scalability
• What is the rate at which the problem size must increase with
respect to the number of processing elements to keep the efficiency
fixed?
• This rate determines the scalability of the system. The slower this
rate, the better.
BTech IT PE HPC 40
Isoefficiency Metric of Scalability
• We can write parallel runtime as:
Tp = (W + To(W, p)) / p (8)
S = W / Tp = W p / (W + To(W, p)) (9)
BTech IT PE HPC 41
Isoefficiency Metric of Scalability
For constant efficiency E, To(W, p) / W must remain constant, i.e., W = (E / (1 - E)) To(W, p). (11)
Letting K = E / (1 - E), the isoefficiency function is W = K To(W, p). (12)
BTech IT PE HPC 42
Isoefficiency Metric of Scalability
• This function determines the ease with which a parallel system can
maintain a constant efficiency and hence achieve speedups
increasing in proportion to the number of processing elements
BTech IT PE HPC 43
Isoefficiency Metric: Example
• For adding n numbers on p processing elements, To ≈ 2 p log p; substituting in Equation 12 gives W = K 2 p log p. (13) The asymptotic isoefficiency function is therefore Θ(p log p).
BTech IT PE HPC 44
Isoefficiency Metric: Example
Consider a more complex example where To = p^(3/2) + p^(3/4) W^(3/4).
• Using only the first term of To in Equation 12, we get
W = K p^(3/2) (14)
• Using only the second term, W = K p^(3/4) W^(3/4), i.e., W = K^4 p^3. (15)
BTech IT PE HPC 45
Cost-Optimality and the Isoefficiency Function
• A parallel system is cost-optimal if and only if p Tp = Θ(W). (16)
• Equivalently, W + To(W, p) = Θ(W), (17)
i.e., To(W, p) = O(W), or W = Ω(To(W, p)). (18)
BTech IT PE HPC 46
Lower Bound on the Isoefficiency Function
BTech IT PE HPC 47
Degree of Concurrency and the Isoefficiency Function
BTech IT PE HPC 48
Degree of Concurrency and the Isoefficiency Function: Example
BTech IT PE HPC 49
Minimum Execution Time and Minimum Cost-Optimal
Execution Time
Often, we are interested in the minimum time to solution.
d Tp / d p = 0 (19)
BTech IT PE HPC 50
Minimum Execution Time: Example
Tp = n / p + 2 log p (20)
Tp^min = 2 log n, obtained at p0 = n / 2. (21)
(One may verify that this is indeed a min by verifying that the second
derivative is positive).
BTech IT PE HPC 51
Minimum Cost-Optimal Parallel Time
• If the isoefficiency function of a parallel system is Θ(f(p)), then a problem of size W can be solved cost-optimally only if p = O(f^-1(W)), and the minimum cost-optimal parallel time is
Tp^cost_opt = Θ(W / f^-1(W)) (22)
BTech IT PE HPC 52
Minimum Cost-Optimal Parallel Time: Example
• The isoefficiency function f(p) of this parallel system is Θ(p log p).
• With W = n, f^-1(W) = Θ(n / log n), so Tp^cost_opt = Θ(log n). (23)
• Note that both TPmin and TPcost_opt for adding n numbers are
Θ(log n). This may not always be the case.
BTech IT PE HPC 53
Asymptotic Analysis of Parallel Programs
BTech IT PE HPC 54
Asymptotic Analysis of Parallel Programs
BTech IT PE HPC 55
Other Scalability Metrics
BTech IT PE HPC 56
Other Scalability Metrics: Scaled Speedup
BTech IT PE HPC 57
Scaled Speedup: Example
(24)
BTech IT PE HPC 58
Scaled Speedup: Example (continued)
or
BTech IT PE HPC 59
Scaled Speedup: Example (continued)
• We have TP = O(n^2).
• This is not surprising, since the memory and time complexity of the
operation are identical.
BTech IT PE HPC 60
Scaled Speedup: Example
(25)
BTech IT PE HPC 61
Scaled Speedup: Example (continued)
BTech IT PE HPC 62
Scaled Speedup: Example (continued)
BTech IT PE HPC 63
Serial Fraction f
• If the serial work W is split into a totally serial part Tser and a perfectly parallelizable part Tpar (W = Tser + Tpar), then Tp = Tser + Tpar / p. (26)
BTech IT PE HPC 64
Serial Fraction f
• The serial fraction is defined as f = Tser / W. Therefore, we have: Tp = f W + (1 - f) W / p.
BTech IT PE HPC 65
Serial Fraction
• Since S = W / TP , we have
1 / S = f + (1 - f) / p, which gives f = (1 / S - 1 / p) / (1 - 1 / p). (27)
BTech IT PE HPC 66
Serial Fraction: Example
We have:
(28)
or
Here, the denominator is the serial runtime and the numerator is the
overhead.
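A minimal sketch of using Equation 27 as a diagnostic (the experimentally determined serial fraction, often called the Karp-Flatt metric); the speedup values below are made-up measurements for illustration.

    def serial_fraction(speedup, p):
        """Experimentally determined serial fraction, f = (1/S - 1/p) / (1 - 1/p)."""
        return (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)

    # Hypothetical measured speedups on 2, 4, 8, 16 processors.
    for p, s in [(2, 1.9), (4, 3.5), (8, 5.9), (16, 9.1)]:
        print(p, round(serial_fraction(s, p), 3))
    # Here f stays near 0.05; a value that grows with p would signal growing overhead.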
BTech IT PE HPC 67
Physical Organization
of Parallel Platforms
An ideal parallel machine called Parallel Random Access
Machine, or PRAM.
Architecture of an
Ideal Parallel Computer
A natural extension of the Random Access Machine
(RAM) serial architecture is the Parallel Random Access
Machine, or PRAM.
PRAMs consist of p processors and a global memory of
unbounded size that is uniformly accessible to all
processors.
Processors share a common clock but may execute
different instructions in each cycle.
Physical Complexity of an
Ideal Parallel Computer
Processors and memories are connected via switches.
Since these switches must operate in O(1) time at the
level of words, for a system of p processors and m
words, the switch complexity is O(mp).
Clearly, for meaningful values of p and m, a true PRAM
is not realizable.
Network Topologies: Crossbars
Each stage connects input i to output j according to the perfect shuffle: j = 2i for 0 ≤ i ≤ p/2 - 1, and j = 2i + 1 - p for p/2 ≤ i ≤ p - 1.
Network Topologies:
Linear Arrays, Meshes, and k-d Meshes
Network                    Diameter           Bisection width   Arc connectivity   Cost (no. of links)
Completely-connected       1                  p^2/4             p - 1              p(p - 1)/2
Star                       2                  1                 1                  p - 1
Complete binary tree       2 log((p + 1)/2)   1                 1                  p - 1
Linear array               p - 1              1                 1                  p - 1
2-D mesh, no wraparound    2(√p - 1)          √p                2                  2(p - √p)
2-D wraparound mesh        2⌊√p/2⌋            2√p               4                  2p
Hypercube                  log p              p/2               log p              (p log p)/2
Wraparound k-ary d-cube    d⌊k/2⌋             2k^(d-1)          2d                 dp
Network Topologies: Linear Arrays
Complete binary tree networks: (a) a static tree network; and (b)
a dynamic tree network.
Network Topologies:
Linear Arrays, Meshes, and k-d Meshes
Network Topologies: Buses
Network Topologies: Buses
Some of the simplest and earliest parallel machines
used buses.
All processors access a common bus for exchanging
data.
The distance between any two nodes is O(1) in a bus.
The bus also provides a convenient broadcast medium.
However, the bandwidth of the shared bus is a major
bottleneck.
Typical bus based machines are limited to dozens of
nodes. Sun Enterprise servers and Intel Pentium based
shared-bus multiprocessors are examples of such
architectures.
Network Topologies
A variety of network topologies have been proposed and
implemented.
These topologies tradeoff performance for cost.
Commercial machines often implement hybrids of
multiple topologies for reasons of packaging, cost, and
available components.
Interconnection Networks:
Network lInterfaces
Processors talk to the network via a network interface.
The network interface may hang off the I/O bus or the
memory bus.
In a physical sense, this distinguishes a cluster from a
tightly coupled multicomputer.
The relative speeds of the I/O and memory buses impact
the performance of the network.
Static and Dynamic
Interconnection Networks
(Figure: a static (direct) network and a dynamic (indirect) network.)
Interconnection Networks
Switches map a fixed number of inputs to outputs.
The total number of ports on a switch is the degree of
the switch.
The cost of a switch grows as the square of the degree
of the switch, the peripheral hardware linearly as the
degree, and the packaging costs linearly as the number
of pins.
Interconnection Networks
for Parallel Computers
Interconnection networks carry data between processors
and to memory.
Interconnects are made of switches and links (wires,
fiber).
Interconnects are classified as static or dynamic.
Static networks consist of point-to-point communication
links among processing nodes and are also referred to
as direct networks.
Dynamic networks are built using switches and
communication links. Dynamic networks are also
referred to as indirect networks.
Architecture of an
Ideal Parallel Computer
Depending on how simultaneous memory accesses are
handled, PRAMs can be divided into four subclasses.
Exclusive-read, exclusive-write (EREW) PRAM.
Concurrent-read, exclusive-write (CREW) PRAM.
Exclusive-read, concurrent-write (ERCW) PRAM.
Concurrent-read, concurrent-write (CRCW) PRAM.
Architecture of an
Ideal Parallel Computer
What does concurrent write mean, anyway?
Common: write only if all values are identical.
Arbitrary: write the data from a randomly selected processor.
Priority: follow a predetermined priority order.
Sum: Write the sum of all data items.
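These write-resolution protocols are easy to state in code; a tiny sketch that resolves a set of simultaneous writes to one memory cell under each rule. The priority order is taken to be the processor index, which is an illustrative assumption.

    import random

    def resolve_crcw(writes, protocol):
        """writes: list of (processor_id, value) pairs aimed at one cell.
        Returns the value stored under the given CRCW protocol."""
        values = [v for _, v in writes]
        if protocol == "common":
            if len(set(values)) != 1:
                raise ValueError("Common protocol requires identical values")
            return values[0]
        if protocol == "arbitrary":
            return random.choice(values)
        if protocol == "priority":
            return min(writes)[1]          # lowest processor id wins (assumed order)
        if protocol == "sum":
            return sum(values)
        raise ValueError("unknown protocol")

    writes = [(0, 4), (2, 7), (5, 1)]
    print(resolve_crcw(writes, "priority"), resolve_crcw(writes, "sum"))   # 4 12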
Basic Communication Operations
Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar
• Circular Shift
• Assume that the source processor is the root of this tree. In the first
step, the source sends the data to the right child (assuming
the source is also the left child). The problem has now
been decomposed into two problems with half the number of
processors.
Broadcast and Reduction on a Balanced Binary Tree
• Each node first sends to one of its neighbors the data it needs
to broadcast.
(Figure: all-to-all broadcast on an eight-node ring; the label on each arrow shows the time step and, in parentheses, the message being transferred, and the set next to each node lists the messages it has accumulated.)
(Figure: (a) initial data distribution and (b) data distribution after rowwise broadcast, for all-to-all broadcast on a 3 × 3 mesh.)
(Figure: all-to-all broadcast on an eight-node hypercube: (a) initial distribution of messages; (b) distribution before the second step; (c) distribution before the third step; (d) final distribution of messages.)
• On a hypercube, we have:
T = Σ (i = 1 to log p) (ts + 2^(i-1) tw m)
  = ts log p + tw m (p - 1). (2)
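A serial sketch of the hypercube all-to-all broadcast pattern behind Equation 2: in step i, partners differing in bit i - 1 exchange everything they have accumulated so far, so the message size doubles each step. Assumes p is a power of two.

    def all_to_all_broadcast_hypercube(p):
        """Simulate all-to-all broadcast on a p-node hypercube.
        Each node starts with its own message {node id}; after log p
        pairwise exchange steps every node holds all p messages."""
        have = [{node} for node in range(p)]
        d = 1
        while d < p:                          # log p steps; partner differs in one bit
            nxt = [set(s) for s in have]
            for node in range(p):
                nxt[node] |= have[node ^ d]   # exchange accumulated messages
            have = nxt
            d <<= 1
        return have

    result = all_to_all_broadcast_hypercube(8)
    print(all(s == set(range(8)) for s in result))   # True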
All-to-all broadcast: Notes
(Figure: computing prefix sums on an eight-node hypercube: (a) initial distribution of values; (b) distribution of sums before the second step; (c) distribution of sums before the third step; (d) final distribution of prefix sums. At each node, the number in square brackets is the local prefix-sum result and the number in parentheses is the running sum that is forwarded.)
• We must account for the fact that in prefix sums the node with
label k uses information from only the k-node subset whose
labels are less than or equal to k.
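A serial sketch of the hypercube prefix-sums pattern: it mirrors all-to-all broadcast, but a node adds an incoming partial sum into its result only when the sender's label is smaller than its own. Assumes p is a power of two.

    def prefix_sums_hypercube(values):
        """Prefix sums on a hypercube of p = len(values) nodes (p a power of 2).
        result[k] ends up as values[0] + ... + values[k]."""
        p = len(values)
        result = list(values)     # the bracketed quantity in the figure
        msg = list(values)        # the running sum that gets forwarded
        d = 1
        while d < p:
            incoming = [msg[node ^ d] for node in range(p)]
            for node in range(p):
                partner = node ^ d
                msg[node] += incoming[node]
                if partner < node:            # only lower-labelled senders contribute
                    result[node] += incoming[node]
            d <<= 1
        return result

    print(prefix_sums_hypercube([3, 1, 4, 1, 5, 9, 2, 6]))   # [3, 4, 8, 9, 14, 23, 25, 31]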
(Figure: the scatter operation on an eight-node hypercube: (a) initial distribution of messages; (b) distribution before the second step; (c) distribution before the third step; (d) final distribution of messages.)
• There are log p steps; in each step, the machine size halves and
the data size halves.
• Each node extracts the information meant for it from the data
received, and forwards the remaining (p − 2) pieces of size m
each to the next node.
T = Σ (i = 1 to p-1) (ts + tw m (p - i))
  = ts (p - 1) + Σ (i = 1 to p-1) i tw m
  = (ts + tw m p/2)(p - 1). (4)
(Figure: the distribution of messages at the beginning of the first and second phases of all-to-all personalized communication on a 3 × 3 mesh; {x, y} denotes the message that starts at node x and is destined for node y.)
• Time for the first phase is identical to that in a ring with √p
processors, i.e., (ts + tw m p/2)(√p - 1).
(Figure: an all-to-all personalized communication algorithm on an eight-node hypercube: (a) initial distribution of messages; (b) distribution before the second step; (c) distribution before the third step; (d) final distribution of messages.)
Seven steps in all-to-all personalized communication on an
eight-node hypercube.
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm
(Figure: the communication steps in a circular shift on a mesh: (a) initial data distribution and the first communication step; (b) step to compensate for backward row shifts; (c) column shifts in the third communication step; (d) final distribution of the data.)
T = ts + tw m. (9)
Circular Shift on a Hypercube
(Figure: circular shift on an eight-node hypercube, performing a 5-shift as a 4-shift followed by a 1-shift: (a) first and second communication steps of the 4-shift; (b) the second phase (a 1-shift); (c) final data distribution after the 5-shift.)
(Figure: circular q-shifts on an eight-node hypercube for 1 ≤ q < 8: (a) 1-shift, (b) 2-shift, (c) 3-shift, (d) 4-shift, (e) 5-shift, (f) 6-shift, (g) 7-shift.)
• Splitting and routing messages into parts: If the message can be split into p
parts, a one-to-all broadcast can be implemented as a scatter operation
followed by an all-to-all broadcast operation. The time for this is:
T = 2 × (ts log p + tw (p - 1) m / p)
  ≈ 2 × (ts log p + tw m). (10)
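A short sketch comparing this cost with the plain one-to-all broadcast time (ts + tw m) log p, to show when splitting the message wins; the ts, tw, p, and m values below are illustrative assumptions.

    import math

    def simple_broadcast(p, m, ts, tw):
        """One-to-all broadcast of an m-word message: (ts + tw*m) * log p."""
        return (ts + tw * m) * math.log2(p)

    def split_broadcast(p, m, ts, tw):
        """Scatter followed by all-to-all broadcast, Equation 10."""
        return 2 * (ts * math.log2(p) + tw * (p - 1) * m / p)

    p, ts, tw = 64, 10.0, 1.0
    for m in (4, 64, 1024):
        print(m, simple_broadcast(p, m, ts, tw), split_broadcast(p, m, ts, tw))
    # For large messages the split version approaches 2*tw*m, well below (tw*m)*log p;
    # for small messages the extra startup cost makes the simple broadcast cheaper.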