HPC Final PPTs

Sorting Algorithms

Ref: "Introduction to Parallel Computing", by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison Wesley, 2003.
Topic Overview

• Issues in Sorting on Parallel Computers


• Sorting Networks
• Bubble Sort and its Variants
• Quicksort
• Bucket and Sample Sort
• Other Sorting Algorithms
Sorting: Overview

• One of the most commonly used and well-studied kernels.


• Sorting can be comparison-based or noncomparison-based.
• The fundamental operation of comparison-based sorting is
compare-exchange.
• The lower bound on any comparison-based sort of n
numbers is Θ(n log n).
• We focus here on comparison-based sorting algorithms.
Sorting: Basics

What is a parallel sorted sequence? Where are the input and


output lists stored?

• We assume that the input and output lists are distributed.


• The sorted list is partitioned with the property that each
partitioned list is sorted and each element in processor Pi's
list is less than that in Pj's list if i < j.
Sorting: Parallel Compare Exchange Operation

A parallel compare-exchange operation. Processes Pi and Pj


send their elements to each other. Process Pi keeps
min{ai,aj}, and Pj keeps max{ai, aj}.
Sorting: Basics
What is the parallel counterpart to a sequential comparator?

• If each processor has one element, the compare exchange


operation stores the smaller element at the processor with
smaller id. This can be done in ts + tw time.
• If we have more than one element per processor, we call
this operation a compare-split. Assume each of the two
processors has n/p elements.
• After the compare-split operation, the smaller n/p elements
are at processor Pi and the larger n/p elements at Pj, where
i < j.
• The time for a compare-split operation is Θ(ts + tw n/p),
assuming that the two partial lists were initially sorted.
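As a concrete illustration (not from the slides), a minimal sequential Python sketch of the compare-split step; it assumes both blocks are already sorted and uses Python's built-in sorted() in place of an explicit O(n/p) merge:

def compare_split(block_i, block_j):
    # Compare-split between a lower-ranked process Pi and a higher-ranked
    # process Pj: merge the two sorted blocks and split the result so that
    # Pi keeps the smaller half and Pj keeps the larger half.
    merged = sorted(block_i + block_j)      # merge step (O(n/p) with a real merge)
    half = len(block_i)
    return merged[:half], merged[half:]     # (new block for Pi, new block for Pj)

# Example: two sorted blocks of size n/p = 4
lo, hi = compare_split([1, 4, 6, 9], [2, 3, 7, 8])
print(lo, hi)   # [1, 2, 3, 4] [6, 7, 8, 9]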
Sorting: Parallel Compare Split Operation

A compare-split operation. Each process sends its block of size
n/p to the other process. Each process merges the received
block with its own block and retains only the appropriate half of
the merged block. In this example, process Pi retains the
smaller elements and process Pj retains the larger elements.
Sorting Networks

• Networks of comparators designed specifically for sorting.


• A comparator is a device with two inputs x and y and two
outputs x' and y'. For an increasing comparator, x' = min{x,y}
and y' = max{x,y}; for a decreasing comparator, x' = max{x,y}
and y' = min{x,y}.
• We denote an increasing comparator by ⊕ and a decreasing
comparator by Ө.
• The speed of the network is proportional to its depth.
Sorting Networks: Comparators

A schematic representation of comparators: (a) an increasing


comparator, and (b) a decreasing comparator.
Sorting Networks

A typical sorting network. Every sorting network is made up of


a series of columns, and each column contains a number of
comparators connected in parallel.
Sorting Networks: Bitonic Sort

• A bitonic sorting network sorts n elements in Θ(log² n) time.
• A bitonic sequence has two tones - increasing and
decreasing, or vice versa. Any cyclic rotation of such a
sequence is also considered bitonic.
 1,2,4,7,6,0 is a bitonic sequence, because it first increases
and then decreases. 8,9,2,1,0,4 is another bitonic
sequence, because it is a cyclic shift of 0,4,8,9,2,1.
• The kernel of the network is the rearrangement of a bitonic
sequence into a sorted sequence.
Sorting Networks: Bitonic Sort
• Let s = a0,a1,…,an-1 be a bitonic sequence such that a0 ≤
a1 ≤ ··· ≤ an/2-1 and an/2 ≥ an/2+1 ≥ ··· ≥ an-1.

• Consider the following subsequences of s:


s1 = min{a0,an/2},min{a1,an/2+1},…,min{an/2-1,an-1}
s2 = max{a0,an/2},max{a1,an/2+1},…,max{an/2-1,an-1}
(1)

• Note that s1 and s2 are both bitonic and each element of s1


is less than every element in s2.
• We can apply the procedure recursively on s1 and s2 to get
the sorted sequence.
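A small Python sketch of a single bitonic split, assuming a bitonic input of even length (the function name and example values are illustrative only):

def bitonic_split(s):
    # One bitonic split of a bitonic sequence s (even length n):
    # s1[i] = min(s[i], s[i + n/2]), s2[i] = max(s[i], s[i + n/2]).
    # Both halves are bitonic and every element of s1 is <= every element of s2.
    half = len(s) // 2
    s1 = [min(s[i], s[i + half]) for i in range(half)]
    s2 = [max(s[i], s[i + half]) for i in range(half)]
    return s1, s2

# Example: a bitonic sequence that first increases and then decreases
s1, s2 = bitonic_split([2, 5, 7, 9, 8, 6, 3, 1])
print(s1, s2)   # [2, 5, 3, 1] [8, 6, 7, 9]  -- both bitonic, max(s1) <= min(s2)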
Sorting Networks: Bitonic Sort

Merging a 16-element bitonic sequence through a series of log 16


bitonic splits.
Sorting Networks: Bitonic Sort
• We can easily build a sorting network to implement this
bitonic merge algorithm.
• Such a network is called a bitonic merging network.
• The network contains log n columns. Each column
contains n/2 comparators and performs one step of the
bitonic merge.
• We denote a bitonic merging network with n inputs by
BM[n].
• Replacing the ⊕ comparators by Ө comparators results
in a decreasing output sequence; such a network is
denoted by ӨBM[n].
Sorting Networks: Bitonic Sort

A bitonic merging network for n = 16. The input wires are numbered 0,1,…,
n - 1, and the binary representation of these numbers is shown. Each
column of comparators is drawn separately; the entire figure represents
a BM[16] bitonic merging network. The network takes a bitonic
sequence and outputs it in sorted order.
Sorting Networks: Bitonic Sort
How do we sort an unsorted sequence using a bitonic merge?

• We must first build a single bitonic sequence from the given


sequence.
• A sequence of length 2 is a bitonic sequence.
• A bitonic sequence of length 4 can be built by sorting the
first two elements using ⊕BM[2] and the next two using ӨBM[2].
• This process can be repeated to generate larger bitonic
sequences.
Sorting Networks: Bitonic Sort

A schematic representation of a network that converts an input


sequence into a bitonic sequence. In this example, ⊕BM[k]
and ӨBM[k] denote bitonic merging networks of input size k
that use ⊕ and Ө comparators, respectively. The last merging
network (⊕BM[16]) sorts the input. In this example, n = 16.
Sorting Networks: Bitonic Sort

The comparator network that transforms an input sequence


of 16 unordered numbers into a bitonic sequence.
Sorting Networks: Bitonic Sort
• The depth of the network is Θ(log² n).
• Each stage of the network contains n/2 comparators. A
serial implementation of the network would have
complexity Θ(n log² n).
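The whole network can be emulated serially in a few lines. A Python sketch (not the book's code), assuming the input length is a power of 2:

def bitonic_merge(seq, ascending=True):
    # Sort a bitonic sequence (length a power of 2) by recursive bitonic splits.
    if len(seq) == 1:
        return list(seq)
    half = len(seq) // 2
    s1, s2 = list(seq[:half]), list(seq[half:])
    for i in range(half):                      # one column of comparators
        if (s1[i] > s2[i]) == ascending:
            s1[i], s2[i] = s2[i], s1[i]
    return bitonic_merge(s1, ascending) + bitonic_merge(s2, ascending)

def bitonic_sort(seq, ascending=True):
    # Serial emulation of the full bitonic sorting network: sort the two halves
    # in opposite directions to form a bitonic sequence, then merge it.
    if len(seq) <= 1:
        return list(seq)
    half = len(seq) // 2
    first = bitonic_sort(seq[:half], True)     # increasing half
    second = bitonic_sort(seq[half:], False)   # decreasing half
    return bitonic_merge(first + second, ascending)

print(bitonic_sort([10, 30, 11, 20, 4, 330, 21, 110]))
# [4, 10, 11, 20, 21, 30, 110, 330]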
Mapping Bitonic Sort to Hypercubes
• Consider the case of one item per processor. The question
becomes one of how the wires in the bitonic network should
be mapped to the hypercube interconnect.
• Note from our earlier examples that the compare-exchange
operation is performed between two wires only if their labels
differ in exactly one bit!
• This implies a direct mapping of wires to processors. All
communication is nearest neighbor!
Mapping Bitonic Sort to Hypercubes

Communication during the last stage of bitonic sort. Each


wire is mapped to a hypercube process; each connection
represents a compare-exchange between processes.
Mapping Bitonic Sort to Hypercubes

Communication characteristics of bitonic sort on a hypercube.


During each stage of the algorithm, processes communicate
along the dimensions shown.
Mapping Bitonic Sort to Hypercubes

Parallel formulation of bitonic sort on a hypercube with n = 2d processes.


Mapping Bitonic Sort to Hypercubes

• During each step of the algorithm, every process


performs a compare-exchange operation (single nearest
neighbor communication of one word).
• Since each step takes Θ(1) time, the parallel time is
Tp = Θ(log² n) (2)

• This algorithm is cost optimal w.r.t. its serial counterpart,


but not w.r.t. the best sorting algorithm.
Mapping Bitonic Sort to Meshes

• The connectivity of a mesh is lower than that of a


hypercube, so we must expect some overhead in this
mapping.
• Consider the row-major shuffled mapping of wires to
processors.
Mapping Bitonic Sort to Meshes

Different ways of mapping the input wires of the bitonic


sorting network to a mesh of processes: (a) row-major
mapping, (b) row-major snakelike mapping, and (c) row-
major shuffled mapping.
Mapping Bitonic Sort to Meshes

The last stage of the bitonic sort algorithm for n = 16 on a


mesh, using the row-major shuffled mapping. During
each step, process pairs compare-exchange their
elements. Arrows indicate the pairs of processes that
perform compare-exchange operations.
Mapping Bitonic Sort to Meshes
• In the row-major shuffled mapping, wires that differ at the
ith least-significant bit are mapped onto mesh processes
that are 2^⌊(i-1)/2⌋ communication links away.
• The total amount of communication performed by each
process is Σ(i=1..log n) Σ(j=1..i) 2^⌊(j-1)/2⌋ ≈ 7√n, or Θ(√n). The total
computation performed by each process is Θ(log² n).
• The parallel runtime is therefore Tp = Θ(log² n) + Θ(√n) = Θ(√n).
• This is not cost optimal.


Block of Elements Per Processor

• Each process is assigned a block of n/p elements.


• The first step is a local sort of the local block.
• Each subsequent compare-exchange operation is
replaced by a compare-split operation.
• We can effectively view the bitonic network as having
(1 + log p)(log p)/2 steps.
Block of Elements Per Processor: Hypercube

• Initially the processes sort their n/p elements (using
merge sort) in time Θ((n/p) log(n/p)) and then perform
Θ(log² p) compare-split steps.
• The parallel run time of this formulation is
Tp = Θ((n/p) log(n/p)) + Θ((n/p) log² p) (computation) + Θ((n/p) log² p) (communication).
• Comparing to an optimal sort, the algorithm can
efficiently use up to p = Θ(2^√(log n)) processes.
• The isoefficiency function due to both communication
and extra work is Θ(p log p log² p).
Block of Elements Per Processor: Mesh

• The parallel runtime in this case is given by
Tp = Θ((n/p) log(n/p)) + Θ((n/p) log² p) + Θ(n/√p).
• This formulation can efficiently use up to p = Θ(log² n)
processes.
• The isoefficiency function is exponential in √p, because the
Θ(n/√p) communication term forces n to grow as 2^Θ(√p).
Performance of Parallel Bitonic Sort

The performance of parallel formulations of bitonic sort for


n elements on p processes.
Bubble Sort and its Variants
The sequential bubble sort algorithm compares and exchanges
adjacent elements in the sequence to be sorted:

Sequential bubble sort algorithm.


Bubble Sort and its Variants

• The complexity of bubble sort is Θ(n²).


• Bubble sort is difficult to parallelize since the algorithm
has no concurrency.
• A simple variant, though, uncovers the concurrency.
Odd-Even Transposition

Sequential odd-even transposition sort algorithm.


Odd-Even Transposition

Sorting n = 8 elements, using the odd-even transposition sort


algorithm. During each phase, n = 8 elements are compared.
Odd-Even Transposition

• After n phases of odd-even exchanges, the sequence is
sorted.
• Each phase of the algorithm (either odd or even)
requires Θ(n) comparisons.
• Serial complexity is Θ(n²).
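A minimal sequential Python sketch of odd-even transposition sort (illustrative only; whether the odd or the even phase comes first does not affect correctness after n phases):

def odd_even_transposition_sort(a):
    # Sequential odd-even transposition sort: n phases alternating between
    # odd and even pairs of adjacent elements.
    a = list(a)
    n = len(a)
    for phase in range(n):
        start = 1 if phase % 2 == 0 else 0     # odd phase, then even phase
        for i in range(start, n - 1, 2):       # compare-exchange adjacent pairs
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([3, 2, 3, 8, 5, 6, 4, 1]))  # [1, 2, 3, 3, 4, 5, 6, 8]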
Parallel Odd-Even Transposition

• Consider the one item per processor case.


• There are n iterations; in each iteration, each processor
does one compare-exchange.
• The parallel run time of this formulation is Θ(n).
• This is cost optimal with respect to the base serial
algorithm but not the optimal one.
Parallel Odd-Even Transposition

Parallel formulation of odd-even transposition.


Parallel Odd-Even Transposition

• Consider a block of n/p elements per processor.
• The first step is a local sort.
• In each subsequent step, the compare-exchange
operation is replaced by the compare-split operation.
• The parallel run time of the formulation is
Tp = Θ((n/p) log(n/p)) + Θ(n) (computation) + Θ(n) (communication).
Parallel Odd-Even Transposition

• The parallel formulation is cost-optimal for p = O(log n).


• The isoefficiency function of this parallel formulation is
Θ(p 2^p).
Graph Algorithms

Ananth Grama, Anshul Gupta, George


Karypis, and Vipin Kumar

To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003


Topic Overview

• Definitions and Representation


• Minimum Spanning Tree: Prim's Algorithm
• Single-Source Shortest Paths: Dijkstra's Algorithm
• All-Pairs Shortest Paths
Definitions and Representation

• An undirected graph G is a pair (V,E), where V is a finite


set of points called vertices and E is a finite set of edges.
• An edge e ∈ E is an unordered pair (u,v), where u,v ∈ V.
• In a directed graph, the edge e is an ordered pair (u,v).
An edge (u,v) is incident from vertex u and is incident to
vertex v.
• A path from a vertex v to a vertex u is a sequence
<v0,v1,v2,…,vk> of vertices where v0 = v, vk = u, and (vi,
vi+1) ∈ E for i = 0, 1, …, k-1.
• The length of a path is defined as the number of edges
in the path.
Definitions and Representation

a) An undirected graph and (b) a directed graph.


Definitions and Representation

• An undirected graph is connected if every pair of vertices


is connected by a path.
• A forest is an acyclic graph, and a tree is a connected
acyclic graph.
• A graph that has weights associated with each edge is
called a weighted graph.
Definitions and Representation

• Graphs can be represented by their adjacency matrix or


an edge (or vertex) list.
• Adjacency matrices have a value ai,j = 1 if nodes i and j
share an edge; 0 otherwise. In case of a weighted graph,
ai,j = wi,j, the weight of the edge.
• The adjacency list representation of a graph G = (V,E)
consists of an array Adj[1..|V|] of lists. Each list Adj[v] is
a list of all vertices adjacent to v.
• For a graph with n nodes, adjacency matrices take Θ(n²)
space and adjacency lists take Θ(|E|) space.
Definitions and Representation

An undirected graph and its adjacency matrix representation.

An undirected graph and its adjacency list representation.


Minimum Spanning Tree

• A spanning tree of an undirected graph G is a subgraph


of G that is a tree containing all the vertices of G.
• In a weighted graph, the weight of a subgraph is the sum
of the weights of the edges in the subgraph.
• A minimum spanning tree (MST) for a weighted
undirected graph is a spanning tree with minimum
weight.
Minimum Spanning Tree

An undirected graph and its minimum spanning tree.


Minimum Spanning Tree: Prim's
Algorithm
• Prim's algorithm for finding an MST is a greedy
algorithm.
• Start by selecting an arbitrary vertex, include it into the
current MST.
• Grow the current MST by inserting into it the vertex
closest to one of the vertices already in current MST.
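A sequential Python sketch of this greedy procedure, assuming the graph is given as a weighted adjacency matrix with None for missing edges (the d vector below is the array that the parallel formulation partitions across processes):

def prim_mst(adj, source=0):
    # Prim's algorithm on a weighted adjacency matrix (None = no edge).
    # Returns the list of MST edges. O(n^2), using the d-vector formulation.
    n = len(adj)
    INF = float("inf")
    d = [INF] * n            # d[v]: lightest edge connecting v to the current MST
    parent = [None] * n
    in_mst = [False] * n
    d[source] = 0
    edges = []
    for _ in range(n):
        u = min((v for v in range(n) if not in_mst[v]), key=lambda v: d[v])
        in_mst[u] = True
        if parent[u] is not None:
            edges.append((parent[u], u, d[u]))
        for v in range(n):                       # update the d vector
            w = adj[u][v]
            if w is not None and not in_mst[v] and w < d[v]:
                d[v], parent[v] = w, u
    return edges

A = [[None, 1, 3, None],
     [1, None, 1, 4],
     [3, 1, None, 2],
     [None, 4, 2, None]]
print(prim_mst(A))   # [(0, 1, 1), (1, 2, 1), (2, 3, 2)]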
Minimum Spanning Tree: Prim's Algorithm

Prim's minimum spanning tree algorithm.


Minimum Spanning Tree: Prim's
Algorithm

Prim's sequential minimum spanning tree algorithm.


Prim's Algorithm: Parallel Formulation

• The algorithm works in n outer iterations - it is hard to execute these


iterations concurrently.
• The inner loop is relatively easy to parallelize. Let p be the number
of processes, and let n be the number of vertices.
• The adjacency matrix is partitioned in a 1-D block fashion, with
distance vector d partitioned accordingly.
• In each step, a processor selects the locally closest node, followed
by a global reduction to select globally closest node.
• This node is inserted into MST, and the choice broadcast to all
processors.
• Each processor updates its part of the d vector locally.
Prim's Algorithm: Parallel Formulation

The partitioning of the distance array d and the adjacency matrix A


among p processes.
Prim's Algorithm: Parallel Formulation

• The cost to select the minimum entry is O(n/p + log p).


• The cost of a broadcast is O(log p).
• The cost of the local update of the d vector is O(n/p).
• The parallel time per iteration is O(n/p + log p).
• The total parallel time is given by O(n²/p + n log p).
• The corresponding isoefficiency is O(p² log² p).
Single-Source Shortest Paths

• For a weighted graph G = (V,E,w), the single-source


shortest paths problem is to find the shortest paths from
a vertex v ∈ V to all other vertices in V.
• Dijkstra's algorithm is similar to Prim's algorithm. It
maintains a set of nodes for which the shortest paths are
known.
• It grows this set by adding, at each step, the node closest
to the source that is reached via a node already in the set.
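A sequential Python sketch of Dijkstra's algorithm in the same adjacency-matrix style (illustrative; non-negative edge weights assumed):

def dijkstra(adj, source=0):
    # Dijkstra's single-source shortest paths on a weighted adjacency matrix
    # (None = no edge). Same O(n^2) structure as Prim's: repeatedly pick the
    # unfinished vertex with the smallest tentative distance, then relax.
    n = len(adj)
    INF = float("inf")
    dist = [INF] * n
    done = [False] * n
    dist[source] = 0
    for _ in range(n):
        u = min((v for v in range(n) if not done[v]), key=lambda v: dist[v])
        done[u] = True
        for v in range(n):
            w = adj[u][v]
            if w is not None and dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    return dist

A = [[None, 2, 5, None],
     [2, None, 1, 6],
     [5, 1, None, 2],
     [None, 6, 2, None]]
print(dijkstra(A))   # [0, 2, 3, 5]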
Single-Source Shortest Paths: Dijkstra's
Algorithm

Dijkstra's sequential single-source shortest paths algorithm.


Dijkstra's Algorithm: Parallel Formulation

• Very similar to the parallel formulation of Prim's algorithm


for minimum spanning trees.
• The weighted adjacency matrix is partitioned using the 1-
D block mapping.
• Each process selects, locally, the node closest to the
source, followed by a global reduction to select next
node.
• The node is broadcast to all processors and the l-vector
updated.
• The parallel performance of Dijkstra's algorithm is
identical to that of Prim's algorithm.
All-Pairs Shortest Paths

• Given a weighted graph G(V,E,w), the all-pairs shortest


paths problem is to find the shortest paths between all
pairs of vertices vi, vj ∈ V.
• A number of algorithms are known for solving this
problem.
All-Pairs Shortest Paths: Matrix-
Multiplication Based Algorithm
• Consider the multiplication of the weighted adjacency
matrix with itself - except, in this case, we replace the
multiplication operation in matrix multiplication by
addition, and the addition operation by minimization.
• Notice that the product of the weighted adjacency matrix
with itself returns a matrix that contains shortest paths of
length 2 between any pair of nodes.
• It follows from this argument that Aⁿ contains all shortest
paths.
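A Python sketch of this min-plus ("multiply by adding, add by minimizing") product and the repeated-squaring procedure, assuming the usual convention of 0 on the diagonal and infinity for missing edges:

INF = float("inf")

def min_plus(A, B):
    # 'Multiply' two distance matrices, replacing multiply by + and add by min.
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def all_pairs_by_squaring(W):
    # All-pairs shortest paths by repeated squaring of the weighted adjacency
    # matrix: A, A^2, A^4, ... (about log n min-plus products).
    n = len(W)
    D = [[0 if i == j else W[i][j] for j in range(n)] for i in range(n)]
    k = 1
    while k < n - 1:          # paths need at most n-1 edges
        D = min_plus(D, D)
        k *= 2
    return D

W = [[0, 3, INF, 7],
     [8, 0, 2, INF],
     [5, INF, 0, 1],
     [2, INF, INF, 0]]
print(all_pairs_by_squaring(W))   # full shortest-distance matrix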
Matrix-Multiplication Based Algorithm

• Aⁿ is computed by doubling powers - i.e., as A, A², A⁴,
A⁸, and so on.
• We need log n matrix multiplications, each taking time
O(n³).
• The serial complexity of this procedure is O(n³ log n).
• This algorithm is not optimal, since the best known
algorithms have complexity O(n³).
Matrix-Multiplication Based Algorithm:
Parallel Formulation
• Each of the log n matrix multiplications can be performed
in parallel.
• We can use n³/log n processors to compute each matrix-
matrix product in time log n.
• The entire process takes O(log² n) time.
Dijkstra's Algorithm

• Execute n instances of the single-source shortest path


problem, one for each of the n source vertices.
• Complexity is O(n³).
Dijkstra's Algorithm: Parallel Formulation

• Two parallelization strategies - execute each of the n


shortest path problems on a different processor (source
partitioned), or use a parallel formulation of the shortest
path problem to increase concurrency (source parallel).
Dijkstra's Algorithm: Source Partitioned
Formulation
• Use n processors, each processor Pi finds the shortest
paths from vertex vi to all other vertices by executing
Dijkstra's sequential single-source shortest paths
algorithm.
• It requires no interprocess communication (provided that
the adjacency matrix is replicated at all processes).
• The parallel run time of this formulation is: Θ(n²).
• While the algorithm is cost optimal, it can only use n
processors. Therefore, the isoefficiency due to
concurrency is Θ(p³).
Dijkstra's Algorithm: Source Parallel
Formulation
• In this case, each of the shortest path problems is further
executed in parallel. We can therefore use up to n²
processors.
• Given p processors (p > n), each single-source shortest
path problem is executed by p/n processors.
• Using previous results, this takes time
Tp = Θ(n³/p) (computation) + Θ(n log p) (communication).
• For cost optimality, we have p = O(n²/log n) and the
isoefficiency is Θ((p log p)^1.5).
Floyd's Algorithm

• For any pair of vertices vi, vj ∈ V, consider all paths from
vi to vj whose intermediate vertices belong to the set
{v1, v2, …, vk}. Let pi,j(k) (of weight di,j(k)) be the minimum-
weight path among them.
• If vertex vk is not in the shortest path from vi to vj, then
pi,j(k) is the same as pi,j(k-1).
• If vk is in pi,j(k), then we can break pi,j(k) into two paths -
one from vi to vk and one from vk to vj. Each of these
paths uses vertices from {v1, v2, …, vk-1}.
Floyd's Algorithm

From our observations, the following recurrence relation
follows:

di,j(k) = min( di,j(k-1), di,k(k-1) + dk,j(k-1) ) for k ≥ 1, with di,j(0) = w(vi, vj).

This recurrence must be computed for each pair of nodes
and for k = 1, 2, …, n. The serial complexity is O(n³).
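A direct Python sketch of the recurrence (illustrative; the in-place update over k is the standard Floyd-Warshall formulation):

def floyd_warshall(W):
    # Floyd's all-pairs shortest paths: d[i][j] is updated for k = 0..n-1
    # using d_ij(k) = min(d_ij(k-1), d_ik(k-1) + d_kj(k-1)).
    n = len(W)
    d = [row[:] for row in W]          # start from the adjacency matrix D(0)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

INF = float("inf")
W = [[0, 3, INF, 7],
     [8, 0, 2, INF],
     [5, INF, 0, 1],
     [2, INF, INF, 0]]
print(floyd_warshall(W))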
Floyd's Algorithm

Floyd's all-pairs shortest paths algorithm. This program


computes the all-pairs shortest paths of the graph G =
(V,E) with adjacency matrix A.
Floyd's Algorithm: Parallel Formulation
Using 2-D Block Mapping
• Matrix D(k) is divided into p blocks of size (n/√p) x (n/√p).
• Each process updates its part of the matrix during
each iteration.
• To compute dl,r(k), process Pi,j must get dl,k(k-1) and
dk,r(k-1).
• In general, during the kth iteration, each of the √p
processes containing part of the kth row sends it to the √p
- 1 processes in the same column.
• Similarly, each of the √p processes containing part of the
kth column sends it to the √p - 1 processes in the same
row.
Floyd's Algorithm: Parallel Formulation
Using 2-D Block Mapping

(a) Matrix D(k) distributed by 2-D block mapping into √p x √p subblocks,


and (b) the subblock of D(k) assigned to process Pi,j.
Floyd's Algorithm: Parallel Formulation
Using 2-D Block Mapping

(a) Communication patterns used in the 2-D block mapping. When computing di,j(k),
information must be sent to the highlighted process from two other processes along
the same row and column. (b) The row and column of √p processes that contain the
kth row and column send them along process columns and rows.
Floyd's Algorithm: Parallel Formulation
Using 2-D Block Mapping

Floyd's parallel formulation using the 2-D block mapping. P*,j denotes
all the processes in the jth column, and Pi,* denotes all the processes
in the ith row. The matrix D(0) is the adjacency matrix.
Floyd's Algorithm: Parallel Formulation
Using 2-D Block Mapping
• During each iteration of the algorithm, the kth row and kth
column of processes perform a one-to-all broadcast
along their rows/columns.
• The size of this broadcast is n/√p elements, taking time
Θ((n log p)/√p).
• The synchronization step takes time Θ(log p).
• The computation time is Θ(n²/p).
• The parallel run time of the 2-D block mapping
formulation of Floyd's algorithm is
Tp = Θ(n³/p) + Θ((n²/√p) log p).
Floyd's Algorithm: Parallel Formulation
Using 2-D Block Mapping
• The above formulation can use O(n²/log² n) processors
cost-optimally.
• The isoefficiency of this formulation is Θ(p^1.5 log³ p).
• This algorithm can be further improved by relaxing the
strict synchronization after each iteration.
Floyd's Algorithm: Speeding Things Up
by Pipelining
• The synchronization step in parallel Floyd's algorithm
can be removed without affecting the correctness of the
algorithm.
• A process starts working on the kth iteration as soon as it
has computed the (k-1)th iteration and has the relevant
parts of the D(k-1) matrix.
Floyd's Algorithm: Speeding Things Up
by Pipelining

Communication protocol followed in the pipelined 2-D block mapping formulation of


Floyd's algorithm. Assume that process 4 at time t has just computed a segment of
the kth column of the D(k-1) matrix. It sends the segment to processes 3 and 5. These
processes receive the segment at time t + 1 (where the time unit is the time it takes
for a matrix segment to travel over the communication link between adjacent
processes). Similarly, processes farther away from process 4 receive the segment
later. Process 1 (at the boundary) does not forward the segment after receiving it.
Floyd's Algorithm: Speeding Things Up
by Pipelining
• In each step, n/√p elements of the first row are sent from process Pi,j
to Pi+1,j.
• Similarly, elements of the first column are sent from process Pi,j to
process Pi,j+1.
• Each such step takes time Θ(n/√p).
• After Θ(√p) steps, process P√p,√p gets the relevant elements of the
first row and first column in time Θ(n).
• The values of successive rows and columns follow after time Θ(n²/p)
in a pipelined mode.
• Process P√p,√p finishes its share of the shortest path computation in
time Θ(n³/p) + Θ(n).
• When process P√p,√p has finished the (n-1)th iteration, it sends the
relevant values of the nth row and column to the other processes.
Floyd's Algorithm: Speeding Things Up
by Pipelining
• The overall parallel run time of this formulation is
Tp = Θ(n³/p) + Θ(n).
• The pipelined formulation of Floyd's algorithm uses up to
O(n²) processes efficiently.
• The corresponding isoefficiency is Θ(p^1.5).
All-pairs Shortest Path: Comparison

• The performance and scalability of the all-pairs shortest
paths algorithms on various architectures with O(p) bisection
bandwidth. Similar run times apply to all k-d cube
architectures, provided that processes are properly
mapped to the underlying processors.
Analytical Modeling of Parallel Systems

BTech IT Sem I, PE HPCC, 2021-22

Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar
Topic Overview

Sources of Overhead in Parallel Programs

Performance Metrics for Parallel Systems

Effect of Granularity on Performance

Analytical Modeling -
Basics
A sequential algorithm is evaluated by its runtime (in general,
asymptotic runtime as a function of input size).

The asymptotic runtime of a sequential program is identical on any


serial platform.

The parallel runtime of a program depends on the input size, the


number of processors, and the communication parameters of the
machine.

An algorithm must therefore be analyzed in the context of the


underlying platform.

A parallel system is a combination of a parallel algorithm and an


underlying platform.

Analytical Modeling -
Basics

A number of performance measures are intuitive.

Wall clock time - the time from the start of the first processor to the
stopping time of the last processor in a parallel ensemble. But how
does this scale when the number of processors is changed or the
program is ported to another machine altogether?

How much faster is the parallel version? This begs the obvious
follow-up question - what is the baseline serial version with which we
compare? Can we use a suboptimal serial program to make our
parallel program look better?

Raw FLOP count - what good are FLOP counts when they don't
solve a problem?
Sources of Overhead in Parallel Programs
If I use two processors, shouldn't my program run twice as fast?
No - a number of overheads, including wasted computation,
communication, idling, and contention, cause degradation in
performance.

The execution profile of a hypothetical parallel program
executing on eight processing elements. The profile indicates time spent
performing computation (both essential and excess), interprocessor
communication, and idling.
Sources of Overheads in Parallel Programs

Interprocess interactions: Processors working on any non-trivial
parallel problem will need to talk to each other.

Idling: Processes may idle because of load imbalance,
synchronization, or serial components.

Excess computation: This is computation not performed by the
serial version. This might be because the serial algorithm is difficult
to parallelize, or because some computations are repeated across
processors to minimize communication.
Performance Metrics for Parallel Systems: Execution
Time
Serial runtime of a program is the time elapsed between the
beginning and the end of its execution on a sequential computer.

The parallel runtime is the time that elapses from the moment the
first processor starts to the moment the last processor finishes
execution.

We denote the serial runtime by Ts and the parallel runtime by Tp.
Performance Metrics for Parallel Systems: Total Parallel
Overhead
Let Tall be the total time collectively spent by all the processing
elements.

Ts is the serial time.

Observe that Tall - Ts is the total time spent by all processors
combined in non-useful work. This is called the total overhead.

The total time collectively spent by all the processing elements is
Tall = p Tp (p is the number of processing elements).

The overhead function (To) is therefore given by

To = p Tp - Ts (1)
Performance Metrics for Parallel Systems: Speedup

What is the benefit from parallelism?

Speedup (S) is the ratio of the time taken to solve a problem on a
single processor to the time required to solve the same problem on
a parallel computer with p identical processing elements.
Performance Metrics: Example

Consider the problem of adding n numbers by using n processing


elements.

If n is a power of two, we can perform this operation in log n steps
by propagating partial sums up a logical binary tree of processors.
Performance Metrics: Example
(Figure: computing the global sum of 16 partial sums using 16 processing
elements; Σ(i..j) denotes the sum of the numbers with consecutive labels from
i to j. Panels: (a) initial data distribution and the first communication step;
(b) second communication step; (c) third communication step; (d) fourth
communication step; (e) accumulation of the sum at processing element 0 after
the final communication step.)
Performance Metrics: Example (continued)
If an addition takes constant time, say tc, and communication
of a single word takes time ts + tw, we have the parallel time
Tp = Θ(log n).

We know that Ts = Θ(n).

Speedup S is given by S = Θ(n / log n).
Performance Metrics: Speedup

For a given problem, there might be many serial algorithms


available. These algorithms may have different asymptotic runtimes
and may be parallelizable to different degrees.

For the purpose of computing speedup, we always consider the best


sequential program as the baseline.

Performance Metrics: Speedup Example

Consider the problem of parallel bubble sort.

The serial time for bubblesort is 150 seconds.

The parallel time for odd-even sort (efficient parallelization of bubble


sort) is 40 seconds.

The speedup would appear to be 150/40 = 3.75.

But is this really a fair assessment of the system?

What if serial quicksort only took 30 seconds? In this case, the


speedup is 30/40 = 0.75. This is a more realistic assessment of the
system.

Performance Metrics: Speedup Bounds

Speedup can be as low as 0 (the parallel program never
terminates).

Speedup, in theory, should be upper bounded by p - after all, we can
only expect a p-fold speedup if we use p times as many resources.

A speedup greater than p is possible only if each processing
element spends less than time Ts/p solving the problem.

In this case, a single processor could be time-sliced to achieve a
faster serial program, which contradicts our assumption of the fastest
serial program as the basis for speedup.
Performance Metrics: Superlinear Speedups

One reason for superlinearity is that the parallel version does
less work than the corresponding serial algorithm.
Processing element 0 Processing element 1

Searching an unstructured tree for a node with a given label,


S', on two processing elements using depth-first traversal. The two-
processor version with processor 0 searching the left subtree and
processor 1 searching the right subtree expands only the shaded
nodes before the solution is found. The corresponding serial
formulation expands the entire tree. It is clear that the serial
algorithm does more work than the parallel algorithm.

Performance Metrics: Superlinear Speedups

Resource-based superlinearity - the higher aggregate
cache/memory bandwidth can result in better cache-hit ratios, and
therefore superlinearity.

Example: A processor with 64 KB of cache yields an 80% hit
ratio. If two processors are used, since the problem size per processor
is smaller, the hit ratio goes up to 90%. Of the remaining 10% of
accesses, 8% come from local memory and 2% from remote memory.

If DRAM access time is 100 ns, cache access time is 2 ns, and
remote memory access time is 400 ns, this corresponds to a
speedup of 2.43!
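A quick back-of-the-envelope check of the 2.43 figure, using the access times and hit ratios from the slide (the way the two-processor case is weighted is an assumption of this sketch):

# Average memory-access time per reference, single processor:
#   80% cache hits (2 ns), 20% DRAM accesses (100 ns)
t_one = 0.80 * 2 + 0.20 * 100                # = 21.6 ns

# With two processors the hit ratio rises to 90%; of the remaining 10%,
# 8% go to local DRAM (100 ns) and 2% to remote memory (400 ns)
t_two = 0.90 * 2 + 0.08 * 100 + 0.02 * 400   # = 17.8 ns

# Two processors each work on half the references at the faster per-reference rate
speedup = 2 * t_one / t_two
print(round(speedup, 2))                     # ~2.43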

Performance Metrics: Efficiency

Efficiency is a measure of the fraction of time for which a


processing element is usefully employed

Mathematically, it is given by

E = S / p (2)

Following the bounds on speedup, efficiency can be as low as 0
and as high as 1.
Performance Metrics: Efficiency Example

The speedup of adding n numbers on n processing elements is given by
S = n / log n.

Efficiency is given by E = S / n = 1 / log n.

(Table: speedup S and efficiency E of adding n numbers, tabulated for
p = 4 and p = 16 processing elements and problem sizes n = 32 and n = 64.)
Parallel Time, Speedup, and Efficiency Example

Consider the problem of edge-detection in images. The
problem requires us to apply a 3 x 3 template to each pixel. If each
multiply-add operation takes time tc, the serial time for an n x n
image is given by Ts = 9 tc n².

(Figure: example of edge detection: (a) an 8 x 8 image; (b) typical
3 x 3 templates for detecting edges; and (c) partitioning of the image
across four processors, with shaded regions indicating image data
that must be communicated from neighboring processors to
processor 1.)
Parallel Time, Speedup, and Efficiency Example
(Continued)

One possible parallelization partitions the image equally into vertical
segments, each with n²/p pixels.

The boundary of each segment is 2n pixels. This is also the number
of pixel values that will have to be communicated. This takes time
Tcomm = 2(ts + tw n).

Templates may now be applied to all n²/p pixels in time
Tcomp = 9 tc n²/p.
Parallel Time, Speedup, and Efficiency Example
(continued)
The total time for the algorithm is therefore given by:

Tp = 9 tc n²/p + 2(ts + tw n)

The corresponding values of speedup and efficiency are given by:

S = 9 tc n² / (9 tc n²/p + 2(ts + tw n))

and

E = 1 / (1 + 2p(ts + tw n) / (9 tc n²))
Cost of a Parallel System

Cost is the product of parallel runtime and the number of processing
elements used (p x Tp).

Cost reflects the sum of the time that each processing element
spends solving the problem.

A parallel system is said to be cost-optimal if the cost of solving a
problem on a parallel computer is asymptotically identical to the serial
cost.

Since E = Ts / (p Tp), for cost-optimal systems E = Θ(1).

Cost is sometimes referred to as work or processor-time product.
Cost of a Parallel System: Example

Consider the problem of adding n numbers on n processors.

We have Tp = Θ(log n) (for p = n).

The cost of this system is given by p Tp = Θ(n log n).

Since the serial runtime of this operation is Θ(n), the algorithm is not
cost optimal.
Impact of Non-Cost Optimality

Consider a sorting algorithm that uses n processing elements
to sort the list in time (log n)².

Since the serial runtime of a (comparison-based) sort is n log n, the
speedup and efficiency of this algorithm are given by n / log n and
1 / log n, respectively.

The p Tp product of this algorithm is n (log n)².

This algorithm is not cost optimal, but only by a factor of log n.

If p < n, assigning n tasks to p processors gives Tp = n (log n)² / p.

The corresponding speedup of this formulation is p / log n.

This speedup goes down as the problem size n is increased for a
given p!
Effect of Granularity on Performance

Often, using fewer processors improves the performance of parallel
systems.

Using fewer than the maximum possible number of processing
elements to execute a parallel algorithm is called scaling down a
parallel system.

A naive way of scaling down is to think of each processor in the
original case as a virtual processor and to assign virtual processors
equally to scaled-down processors.

Since the number of processing elements decreases by a factor of
n/p, the computation at each processing element increases by a
factor of n/p.

The communication cost should not increase by this factor, since
some of the virtual processors assigned to a physical processor
might talk to each other. This is the basic reason for the
improvement from building granularity.
Building Granularity: Example

Consider the problem of adding n numbers on p processing
elements such that p < n and both n and p are powers of 2.

Use the parallel algorithm for n processors, except, in this case, we
think of them as virtual processors.

Each of the p processors is now assigned n/p virtual processors.

The first log p of the log n steps of the original algorithm are
simulated in (n/p) log p steps on p processing elements.

The subsequent log n - log p steps do not require any communication.
Building Granularity: Example (continued)

The overall parallel execution time of this parallel system is
Θ((n/p) log p).

The cost is Θ(n log p), which is asymptotically higher than the Θ(n)
cost of adding n numbers sequentially. Therefore, the parallel
system is not cost-optimal.
Building Granularity: Example (continued)

Can we build granularity in the example in a cost-optimal


fashion?
Each processing element locally adds its n/p numbers in
time Θ(n/p).

The p partial sums on p processing elements can be added in time
Θ(log p).

(Figure: a cost-optimal way of computing the sum of 16 numbers using four
processing elements.)
Building Granularity: Example (continued)

The parallel runtime of this algorithm is

Tp = Θ(n/p + log p) (3)

The cost is Θ(n + p log p).

This is cost-optimal, so long as n = Ω(p log p)!
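A Python sketch that emulates this cost-optimal two-phase scheme sequentially (illustrative; p is assumed to divide n and to be a power of 2):

def parallel_sum(numbers, p):
    # Cost-optimal two-phase sum of n numbers on p (virtual) processing
    # elements: each PE adds its n/p local numbers, then the p partial sums
    # are combined in a log p tree reduction.
    n = len(numbers)
    chunk = n // p
    partial = [sum(numbers[i * chunk:(i + 1) * chunk]) for i in range(p)]  # Theta(n/p) each
    steps = 0
    while len(partial) > 1:                      # Theta(log p) reduction steps
        partial = [partial[i] + partial[i + 1] for i in range(0, len(partial), 2)]
        steps += 1
    return partial[0], steps

total, steps = parallel_sum(list(range(16)), 4)
print(total, steps)    # 120 2  (n/p = 4 local additions, log p = 2 reduction steps)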

M3: Analytical Modeling of Parallel Systems - Part II

Dr. D. B. Kulkarni

BTech IT Sem I, PE HPCC, 2021-22
Topic Overview

Scalability of Parallel Systems

Minimum Execution Time and Minimum Cost-Optimal Execution


Time

Asymptotic Analysis of Parallel Programs

Scalability of Parallel Systems

How to extrapolate performance from small problems


and small systems to larger problems on larger configurations?
Consider three parallel algorithms for computing an n-point Fast
Fourier Transform (FFT) on 64 processing elements.
(Figure: a comparison of the speedups obtained by the binary-exchange, 2-D
transpose, and 3-D transpose algorithms on 64 processing elements with
tc = 2, tw = 4, ts = 25, and th = 2, for problem sizes n up to about 18000.)

Clearly, it is difficult to infer scaling characteristics from
observations on small data sets and small machines.
Scaling Characteristics of Parallel Programs

The efficiency of a parallel program can be written as:

E = S / p = Ts / (p Tp)

or

E = 1 / (1 + To / Ts) (4)

The total overhead function To is an increasing function of p.
Scaling Characteristics of Parallel Programs

To depends on overheads such as initialization and the distribution
time for inputs.

For a given problem size (the value of Ts remains constant),
as we increase the number of processing elements, To increases,
and the overall efficiency of the parallel program goes down.

This is applicable to all parallel programs.
Scaling Characteristics of Parallel Programs: Example

Consider the problem of adding n numbers on p processing
elements.

We have seen that:

Tp = n/p + 2 log p (5)

S = n / (n/p + 2 log p) (6)

E = 1 / (1 + (2p log p) / n) (7)
Scaling Characteristics of Parallel Programs: Example
(continued)
Plotting the speedup for various input sizes gives us:

(Figure: speedup versus the number of processing elements for adding a
list of numbers, for several input sizes between n = 64 and n = 512, together
with the linear-speedup line.)

Speedup tends to saturate and efficiency drops as a
consequence of Amdahl's law.
Scaling Characteristics of Parallel Programs

The total overhead function To is a function of both problem size n and
the number of processing elements p.

In many cases, To grows sub-linearly with respect to n.

In such cases, the efficiency increases if the problem size is increased while
keeping the number of processing elements constant.

The problem size and the number of processors can then be simultaneously
increased to keep the efficiency constant.

Such systems are called scalable parallel systems.
Scaling Characteristics of Parallel Programs

Recall that cost-optimal parallel systems have an efficiency of Θ(1).

Scalability and cost-optimality are therefore related.

A scalable parallel system can always be made cost-optimal
if the number of processing elements and the size of the
computation are chosen appropriately.
Isoefficiency Metric of Scalability

For a given problem size, as we increase the number of processing
elements, the overall efficiency of the parallel system goes down for
all systems.

For some systems, the efficiency of a parallel system increases if
the problem size is increased while keeping the number of
processing elements constant.

(Figure: (a) efficiency E as a function of p for a fixed problem size W;
(b) efficiency E as a function of W for a fixed number of processors p.)
Isoefficiency Metric of Scalability

What is the rate at which the problem size must increase with
respect to the number of processing elements to keep the efficiency
fixed?

This rate determines the scalability of the system. The slower this
rate, the better.

Before we formalize this rate, we define the problem size W as the
asymptotic number of operations associated with the best serial
algorithm to solve the problem.
Isoefficiency Metric of Scalability
Parallel runtime can be written as:

Tp = (W + To(W, p)) / p (8)

The resulting expression for speedup is

S = W / Tp = W p / (W + To(W, p)) (9)

So efficiency is

E = S / p = W / (W + To(W, p)) = 1 / (1 + To(W, p) / W) (10)
Isoefficiency Metric of Scalability

For scalable parallel systems, efficiency can be maintained at a
fixed value (between 0 and 1) if the ratio To / W is maintained at a
constant value.

For a desired value E of efficiency,

E = 1 / (1 + To(W, p) / W),
To(W, p) / W = (1 - E) / E,
W = (E / (1 - E)) To(W, p). (11)

If K = E / (1 - E) is a constant depending on the efficiency to be
maintained, then since To is a function of W and p, we have

W = K To(W, p). (12)
Isoefficiency Metric of Scalability

The problem size W can usually be obtained as a function of p by


algebraic manipulations to keep efficiency constant.

This function is called the isoefficiency function.

This function determines the ease with which a parallel system can
maintain a constant efficiency and hence achieve speedups
increasing in proportion to the number of processing elements

Isoefficiency Metric: Example
The overhead function for the problem of adding n numbers on
p processing elements is approximately 2p log p.

Substituting To by 2p log p, we get

W = K 2p log p. (13)

Thus, the asymptotic isoefficiency function for this parallel system is Θ(p log p).

If the number of processing elements is increased from p to p', the
problem size (in this case, n) must be increased by a factor
of (p' log p') / (p log p) to get the same efficiency as on p
processing elements.

For example, going from p to p² processors: the speedup with p processors is
S = Wp / (W + To) = np / (n + 2p log p); with p² processors it is
np² / (n + 4p² log p). Substituting n by n (p² log p²) / (p log p) = 2np gives
S = 2np³ / (2np + 4p² log p) = np² / (n + 2p log p), i.e., the efficiency S / p²
is unchanged.
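A quick numerical check in Python (a sketch, assuming W = n and To = 2p log p as above, with an arbitrary constant K): growing n in proportion to p log p keeps the efficiency fixed.

from math import log2

def efficiency(n, p):
    # E = 1 / (1 + To/W) with W = n and To = 2 p log p for this example
    return 1.0 / (1.0 + 2 * p * log2(p) / n)

# Scale the problem size as n ~ p log p and watch the efficiency stay fixed
for p in (4, 16, 64, 256):
    n = 8 * p * log2(p)          # K = 8 chosen arbitrarily for illustration
    print(p, int(n), round(efficiency(n, p), 3))   # efficiency stays at 0.8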
Isoefficiency Metric: Example
Consider a more complex example where To = p^(3/2) + p^(3/4) W^(3/4).

Using only the first term of To in Equation 12, we get

W = K p^(3/2) (14)

Using only the second term, Equation 12 yields the following
relation between W and p:

W = K p^(3/4) W^(3/4)
W^(1/4) = K p^(3/4)
W = K⁴ p³ (15)

The larger of these two asymptotic rates determines the
isoefficiency. This is given by Θ(p³).
Cost-Optimality and the Isoefficiency Function

A parallel system is cost-optimal if and only if

p Tp = Θ(W). (16)

From this, we have:

W + To(W, p) = Θ(W), (17)
To(W, p) = O(W),
W = Ω(To(W, p)). (18)

If we have an isoefficiency function f(p), then it follows that the
relation W = Ω(f(p)) must be satisfied to ensure the cost-optimality of
a parallel system as it is scaled up.
Lower Bound on the Isoefficiency Function

For a problem consisting of W units of work, no more than W
processing elements can be used cost-optimally.

The problem size must increase at least as fast as Θ(p) to maintain
fixed efficiency; hence, Ω(p) is the asymptotic lower bound on the
isoefficiency function.
Degree of Concurrency and the lsoefficiency Function

The maximum number of tasks that can be executed simultaneously


at any time in a parallel algorithm is called its degree of concurrency.

If C(W) is the degree of concurrency of a parallel algorithm, then for
a problem of size W, no more than C(W) processing elements can
be employed effectively.
Degree of Concurrency and the lsoefficiency Function: Example

Consider solving a system of n equations in n variables by using
Gaussian elimination (W = Θ(n³)).

The n variables must be eliminated one after the other, and
eliminating each variable requires Θ(n²) computations.

At most Θ(n²) processing elements can be kept busy at any time.

Since W = Θ(n³) for this problem, the degree of concurrency C(W) is
Θ(W^(2/3)).

Given p processing elements, the problem size should be at least
Ω(p^(3/2)) to use them all.
Minimum Execution Time and Minimum Cost-Optimal
Execution Time
Often, we are interested in the minimum time to solution.

We can determine the minimum parallel runtime Tp^min for a given W
by differentiating the expression for Tp w.r.t. p and equating it to
zero.

d Tp / d p = 0 (19)

If p0 is the value of p determined by this equation, Tp(p0) is the
minimum parallel time.
Minimum Execution Time: Example

Consider the minimum execution time for adding n numbers.

Tp = n/p + 2 log p. (20)

Setting the derivative w.r.t. p to zero gives n/p² = 2/p (ignoring constant
factors in the logarithm), so p = n/2. The corresponding runtime is

Tp^min = 2 log n. (21)

(One may verify that this is indeed a minimum by verifying that the second
derivative is positive.)

Note that at this point, the formulation is not cost-optimal.
Minimum Cost-Optimal Parallel Time

Let Tp^cost_opt be the minimum cost-optimal parallel time.

If the isoefficiency function of a parallel system is Θ(f(p)), then a
problem of size W can be solved cost-optimally if and only if
W = Ω(f(p)).

In other words, for cost optimality, p = O(f⁻¹(W)).

For cost-optimal systems, Tp = Θ(W/p); therefore,

Tp^cost_opt = Θ( W / f⁻¹(W) ) (22)
Minimum Cost-Optimal Parallel Time: Example

Consider the problem of adding n numbers.

The isoefficiency function f(p) of this parallel system is Θ(p log p).

From this, we have p ≈ n / log n.

At this processor count, the parallel runtime is:

Tp^cost_opt = log n + log(n / log n) = 2 log n - log log n. (23)

Note that both Tp^min and Tp^cost_opt for adding n numbers are
Θ(log n). This may not always be the case.
Asymptotic Analysis of Parallel Programs
Problem: sorting a list of n numbers.
The fastest serial programs for this problem run in time Θ(n log n).

Consider four parallel algorithms, A1, A2, A3, and A4, as follows.
The comparison table shows the number of processing elements, parallel
runtime, speedup, efficiency and the p Tp product.

Algorithm:  A1          A2         A3            A4
p:          n²          log n      n             √n
Tp:         1           n          √n            √n log n
S:          n log n     log n      √n log n      √n
E:          (log n)/n   1          (log n)/√n    1
p Tp:       n²          n log n    n^1.5         n log n
Asymptotic Analysis of Parallel Programs

If the metric is speed, algorithm A1 is the best, followed by A3, A4,
and A2 (in order of increasing Tp).

In terms of efficiency, A2 and A4 are the best, followed by A3 and
A1.

In terms of cost, algorithms A2 and A4 are cost optimal; A1 and A3
are not.

It is important to identify the objectives of analysis and to use
appropriate metrics!
Other Scalability Metrics

A number of other metrics have been proposed, dictated by specific


needs of applications.

For real-time applications, the objective is to scale up a system to


accomplish a task in a specified time bound.

In memory-constrained environments, metrics operate at the limit of
memory and estimate performance under this problem growth rate.
Dense Matrix Algorithms

Ref: Introduction to Parallel Computing


By: Grama, Gupta, Karypis, Kumar
Topic Overview

Matrix-Vector Multiplication
Matrix-Matrix Multiplication
Solving a System of Linear Equations
Matrix Algorithms: Introduction

Due to their regular structure, parallel computations
involving matrices and vectors readily lend themselves to
data-decomposition.
Typical algorithms rely on input, output, or intermediate
data decomposition.
Most algorithms use one- and two-dimensional
block, cyclic, and block-cyclic partitioning.
Matrix-Vector Multiplication

To multiply a dense n x n matrix A with an n x 1 vector x to
yield the n x 1 result vector y:

A[n x n] . x[n x 1] = y[n x 1]

The serial algorithm requires n² multiplications and
additions.
W = n²
Matrix-Vector Multiplication:
Row-wise 1-D Partitioning
The n x n matrix is partitioned among n processors, with
each processor storing one complete row of the matrix.
The n x 1 vector x is distributed such that each process
owns one of its elements.
Matrix-Vector Multiplication:
Row-wise 1-D Partitioning
(Figure: multiplication of an n x n matrix with an n x 1 vector using
rowwise block 1-D partitioning; for the one-row-per-process case, p = n.
Panels: (a) initial partitioning of the matrix and the starting vector x;
(b) distribution of the full vector among all the processes by all-to-all
broadcast; (c) entire vector distributed to each process after the broadcast;
(d) final distribution of the matrix and the result vector y.)
Matrix(4*4)-Vector(4*1) Multiplication:
Row-wise 1-D Partitioning- 4 Processes
(Figure: a 4 x 4 example, A[4 x 4] . X[4 x 1] = Y[4 x 1], with one row of A
and one element of x assigned to each of the four processes P1-P4.)
Matrix-Vector Multiplication:
Row-wise 1-D Partitioning
Since each process starts with only one element of x,
an all-to-all broadcast is required to distribute all the
elements to all the processes.
Process Pi then computes y[i] = Σj (A[i, j] x x[j]).
The all-to-all broadcast and the computation of y[i] both
take time Θ(n). Therefore, the parallel time is Θ(n).
Matrix-Vector Multiplication:
Row-wise 1-D Partitioning
Consider now the case when p < n and we use block 1-D
partitioning.
Each process initially stores n/p complete rows of the
matrix and a portion of the vector of size n/p.
The all-to-all broadcast takes place among p processes
and involves messages of size n/p.
This is followed by n/p local dot products.
Thus, the parallel run time of this procedure is
Tp = n²/p + ts log p + tw n.
This is cost-optimal.
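A sequential Python sketch of the row-wise 1-D formulation after the all-to-all broadcast has completed (illustrative; p is assumed to divide n):

def matvec_rowwise(A, x, p):
    # Row-wise 1-D partitioned matrix-vector product y = A x, emulated
    # sequentially: each of the p processes owns n/p complete rows of A and,
    # after the all-to-all broadcast, the full vector x.
    n = len(A)
    rows_per_proc = n // p
    y = []
    for proc in range(p):                       # each process works on its row block
        for i in range(proc * rows_per_proc, (proc + 1) * rows_per_proc):
            y.append(sum(A[i][j] * x[j] for j in range(n)))   # n/p local dot products
    return y

A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
x = [1, 1, 1, 1]
print(matvec_rowwise(A, x, p=2))    # [10, 26, 42, 58]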
Matrix-Vector Multiplication:
Row-wise 1-D Partitioning
Scalability Analysis:

We know that To = p Tp - W; therefore,
To = ts p log p + tw n p.
For isoefficiency, we have W = K To, where K = E / (1 - E)
for the desired efficiency E.
From this, we have W = O(p²) (from the tw term).
There is also a bound on isoefficiency because of
concurrency. In this case, p < n; therefore, W = n² = Ω(p²).
The overall isoefficiency is W = O(p²).
Matrix-Vector Multiplication:
2-D Partitioning
The n x n matrix is partitioned among n² processors such
that each processor owns a single element.
The n x 1 vector x is distributed only in the last column of
n processors.
Matrix-Vector Multiplication: 2-D Partitioning
(Figure: matrix-vector multiplication with block 2-D partitioning; for the
one-element-per-process case, p = n² if the matrix size is n x n. Panels:
(a) initial data distribution and communication steps to align the vector
along the diagonal; (b) one-to-all broadcast of portions of the vector along
process columns; (c) all-to-one reduction of partial results; (d) final
distribution of the result vector.)
Matrix-Vector Multiplication:
2-D Partitioning
We must first align the vector with the matrix
appropriately.
The first communication step for the 2-D partitioning
aligns the vector x along the principal diagonal of the
matrix.
The second step copies the vector elements from each
diagonal process to all the processes in the
corresponding column using n simultaneous broadcasts
among all processors in the column.
Finally, the result vector is computed by performing an
all-to-one reduction along the columns.
Matrix-Vector Multiplication:
2-D Partitioning
Three basic communication operations are used in this
algorithm: one-to-one communication to align the vector
along the main diagonal, one-to-all broadcast of each
vector element among the n processes of each column,
and all-to-one reduction in each row.
Each of these operations takes Θ(log n) time, and the
parallel time is Θ(log n).
The cost (process-time product) is Θ(n² log n); hence,
the algorithm is not cost-optimal.

Matrix-Vector Multiplication:
2-D Partitioning
When using fewer than n² processors, each process
owns an (n/√p) x (n/√p) block of the matrix.
The vector is distributed in portions of n/√p elements in
the last process-column only.
In this case, the message sizes for the alignment,
broadcast, and reduction are all n/√p.
The computation is a product of an (n/√p) x (n/√p)
submatrix with a vector of length n/√p.
Matrix-Vector Multiplication:
2-D Partitioning
The first alignment step takes time
ts + tw n/√p
The broadcast and reductions take time
(ts + tw n/√p) log(√p)
Local matrix-vector products take time
tc n²/p
Total time is approximately
Tp ≈ n²/p + ts log p + tw (n/√p) log p
Matrix-Vector Multiplication:
2-D Partitioning
Scalability Analysis:

To = p Tp - W = ts p log p + tw n √p log p

Equating To with W, term by term, for isoefficiency, we
have W = K² tw² p log² p as the dominant term.
The isoefficiency due to concurrency is O(p).
The overall isoefficiency is O(p log² p) (due to the
network bandwidth).
For cost optimality, we have W = n² = O(p log² p). For this,
we have p = O(n² / log² n).
Matrix-Matrix Multiplicatioon
Consider the problem of multiplying two n x n dense, square matrices A
and B to yield the product matrix C = A x B.
The serial complexity is O(n³).
We do not consider better serial algorithms (Strassen's method),
although these can be used as serial kernels in the parallel algorithms.
A useful concept in this case is called block operations. In this view, an
n x n matrix A can be regarded as a q x q array of blocks Ai,j (0 ≤ i, j < q)
such that each block is an (n/q) x (n/q) submatrix.
In this view, we perform q³ matrix multiplications, each involving (n/q) x
(n/q) matrices.
MM-M: Input and Output

(Figure: input matrices A and B and the output matrix C, each partitioned into blocks.)

MM-M: Processes

(Figure: the grid of processes among which the blocks are distributed.)

Matrix-Matrix Multiplication

(Figure: a 2 x 2 block partitioning of A and B among four processes.)

MM-M: Processes, I/Ps & O/P

(Figure: the blocks of A and B that each process needs in order to compute its block of C.)
Matrix-Matrix Multiplication

Consider two n x n matrices A and B partitioned into p
blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size (n/√p) x (n/√p)
each.
Process Pi,j initially stores Ai,j and Bi,j and computes block
Ci,j of the result matrix.
Computing submatrix Ci,j requires all submatrices Ai,k and
Bk,j for 0 ≤ k < √p.
All-to-all broadcast blocks of A along rows and blocks of B along
columns.
Perform local submatrix multiplications.
Matrix-Matrix Multiplication

The two broadcasts take time


2t los(yP) +tu(n*/p)(VP 1))
-

The computation requires yp multiplications of


(n/P) x (n//P) sized submatrices.
The parallel run time is approximately
Tp=+ts logp +2tw
n
P
V
O(p3) due to bandwidth term ,
The algorithm is cost optimal and the isoefficiency is
and concurrency.
Major drawback of the algorithm is that it is not memory
optimal.
Matrix-Matrix Multiplication:
Cannon's Algorithm
In this algorithm, we schedule the computations of the
processes of the ith row such that, at any given time,
each process is using a different block Ai,k.
These blocks can be systematically rotated among the
processes after every submatrix multiplication so that
every process gets a fresh Ai,k after each rotation.
Matrix-Matrix Multiplication:
Cannon's Algorithm
(Figure) Communication steps in Cannon's algorithm on 16 processes:
(a) initial alignment of A; (b) initial alignment of B;
(c) A and B after initial alignment; (d) submatrix locations after the first shift;
(e) submatrix locations after the second shift; (f) submatrix locations after the third shift.


Matrix-Matrix Multiplication:
Cannon's Algorithm
Align the blocks of A and B in such a way that each
process multiplies its local submatrices. This is done by
shifting all submatrices Ai,j to the left (with wraparound)
by i steps and all submatrices Bi,j up (with wraparound)
by j steps.
Perform local block multiplication.
Each block of A moves one step left and each block of B
moves one step up (again with wraparound).
Perform next block multiplication, add to partial result,
repeat until all √p blocks have been multiplied.
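A serial NumPy simulation of these steps (an illustrative sketch, not the textbook's code; cannon_matmul is a hypothetical name and √p is assumed to divide n) shows that the alignment plus √p shift-and-multiply rounds reproduce C = A × B.

import numpy as np

def cannon_matmul(A, B, q):
    # Simulate Cannon's algorithm on a q x q logical process grid (q = sqrt(p)).
    n = A.shape[0]; s = n // q
    blk = lambda M, i, j: M[i*s:(i+1)*s, j*s:(j+1)*s]
    # Initial alignment: A[i,j] shifted left by i steps, B[i,j] shifted up by j steps.
    Ab = [[blk(A, i, (j + i) % q) for j in range(q)] for i in range(q)]
    Bb = [[blk(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
    Cb = [[np.zeros((s, s)) for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        for i in range(q):                       # local multiply-accumulate on every process
            for j in range(q):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]
        # Single-step rotation: every A block one step left, every B block one step up.
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return np.block(Cb)

A = np.random.rand(6, 6); B = np.random.rand(6, 6)
assert np.allclose(cannon_matmul(A, B, 3), A @ B)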
Matrix-Matrix Multiplication:
Cannon's Algorithm
In the alignment step, since the maximum distance over which a
block shifts is √p - 1, the two shift operations require a total of
2(ts + tw n²/p) time.
Each of the single-step shifts in the compute-and-shift phase
of the algorithm takes ts + tw n²/p time.
The computation time for the √p multiplications of (n/√p) × (n/√p)
submatrices is n³/p.
The parallel time is approximately
TP = n³/p + 2√p ts + 2 tw n²/√p.
The cost-optimality and isoefficiency of the algorithm are identical
to the first algorithm, except, this is memory optimal.
Matrix-Matrix Multiplication:
DNS Algorithm
Uses a 3-D partitioning.
Visualize the matrix multiplication algorithm as a cube.
matrices A and B come in two orthogonal faces and
result C comes out the other orthogonal face.
Each internal node in the cube represents a single add-
multiply operation (and thus the complexity).
DNS algorithm partitions this cube using a 3-D block
Scheme.
Matrix-Matrix Multiplication:
DNS Algorithm
Assume an n X n X n mesh of processors.
Move the columns of A and rows of B and perform
broadcast.
Each processor computes a single add-multiply.
This is followed by an accumulation along the C
dimension.
Since each add-multiply takes constant time and
accumulation and broadcast takes log n time, the total
runtime is log n.
This is not cost optimal. It can be made cost optimal by
using n / log n processors along the direction of
accumulation.
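The cube view can be written out directly. The sketch below (an illustrative NumPy example, not the DNS communication code; dns_view is a hypothetical name) materialises the n³ scalar products A[i, k]·B[k, j] and then performs the accumulation along the k dimension.

import numpy as np

def dns_view(A, B):
    # cube[i, j, k] = A[i, k] * B[k, j]; C is the all-to-one reduction (sum) along k.
    cube = A[:, None, :] * B.T[None, :, :]       # shape (n, n, n)
    return cube.sum(axis=2)

A = np.random.rand(4, 4); B = np.random.rand(4, 4)
assert np.allclose(dns_view(A, B), A @ B)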
Matrix-Matrix Multiplication:
DNS Algorithm
(Figure) The communication steps in the DNS algorithm while multiplying
4 × 4 matrices A and B on 64 processes: (a) initial distribution of A and B;
(b) after moving A[i, j] from Pi,j,0 to Pi,j,j; (c) after broadcasting A[i, j]
along the j axis; (d) the corresponding distribution of B.
Matrix-Matrix Multiplication:
DNS Algorithm
Using fewer than n³ processes.
Assume that the number of processes p is equal to q³ for
some q < n.
The two matrices are partitioned into blocks of size (n/q)
X(n/q).
Each matrix can thus be regarded as a qxq two
dimensional square array of blocks.
The algorithm follows from the previous one, except, in
this case, we operate on blocks rather than on individual
elements.
Matrix-Matrix Multiplication:
DNS Algorithm
Using fewer than n³ processes.
The first one-to-one communication step is performed for
both A and B, and takes ts + tw (n/q)² time for each matrix.
The two one-to-all broadcasts take 2(ts log q + tw (n/q)² log q)
time for each matrix.
The reduction takes time ts log q + tw (n/q)² log q.
Multiplication of (n/q) × (n/q) submatrices takes (n/q)³ time.
The parallel time is approximated by:
TP = n³/p + ts log p + tw (n²/p^(2/3)) log p.
The isoefficiency function is Θ(p (log p)³).
Solving a System of Linear Equations

Consider the problem of solving linear equations of the


kind:
a0,0 x0 + a0,1 x1 + ... + a0,n-1 xn-1 = b0,
a1,0 x0 + a1,1 x1 + ... + a1,n-1 xn-1 = b1,
...
an-1,0 x0 + an-1,1 x1 + ... + an-1,n-1 xn-1 = bn-1.
This is written as Ax = b, where A is an n × n matrix with
A[i, j] = ai,j, b is an n × 1 vector [b0, b1, ..., bn-1]^T, and x is
the solution vector.
Solving a System of Linear Equations

Two steps in solution are: reduction to triangular form, and


back-substitution. The triangular form is as:
x0 + u0,1 x1 + u0,2 x2 + ... + u0,n-1 xn-1 = y0,
x1 + u1,2 x2 + ... + u1,n-1 xn-1 = y1,
...
xn-1 = yn-1.
We write this as: Ux = y.


A commonly used method for transforming a given matrix
into an upper-triangular matrix is Gaussian Elimination.
Gaussian Elimination
1. procedure GAUSSIAN_ELIMINATION (A, b, y)
2. begin
3.    for k := 0 to n - 1 do            /* Outer loop */
4.    begin
5.       for j := k + 1 to n - 1 do
6.          A[k, j] := A[k, j] / A[k, k];          /* Division step */
7.       y[k] := b[k] / A[k, k];
8.       A[k, k] := 1;
9.       for i := k + 1 to n - 1 do
10.      begin
11.         for j := k + 1 to n - 1 do
12.            A[i, j] := A[i, j] - A[i, k] × A[k, j];   /* Elimination step */
13.         b[i] := b[i] - A[i, k] × y[k];
14.         A[i, k] := 0;
15.      endfor;          /* Line 9 */
16.   endfor;             /* Line 3 */
17. end GAUSSIAN_ELIMINATION

Serial Gaussian Elimination
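A runnable Python translation of the serial procedure (a sketch under the assumption that no pivoting is needed, i.e., every A[k, k] is nonzero; gaussian_elimination is a hypothetical name, not code from the text):

import numpy as np

def gaussian_elimination(A, b):
    # Reduce A to unit upper-triangular form U (overwriting A) and return (U, y),
    # mirroring the division and elimination steps of the serial procedure above.
    A = A.astype(float).copy(); b = b.astype(float).copy()
    n = A.shape[0]
    y = np.zeros(n)
    for k in range(n):
        A[k, k+1:] /= A[k, k]                    # division step
        y[k] = b[k] / A[k, k]
        A[k, k] = 1.0
        for i in range(k + 1, n):
            A[i, k+1:] -= A[i, k] * A[k, k+1:]   # elimination step
            b[i] -= A[i, k] * y[k]
            A[i, k] = 0.0
    return A, y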


Gaussian Elimination

The computation has three nested loops. In the kth
iteration of the outer loop, the algorithm performs (n - k)²
computations. Summing over all k, we have roughly
n³/3 multiplication-subtraction pairs.

(Figure) A typical computation in Gaussian elimination: rows above k form the
inactive part; the division step A[k, j] := A[k, j]/A[k, k] updates the pivot row;
the elimination step A[i, j] := A[i, j] - A[i, k] × A[k, j] updates the active part.


Parallel Gaussian Elimination
Assume p = n, with each row assigned to a processor.
The first step of the algorithm normalizes the row. This is a
serial operation and takes time (n - k) in the kth iteration.
In the second step, the normalized row is broadcast to all the
processors. This takes time (ts + tw(n - k - 1)) log n.
Each processor can then independently eliminate this row from its
own row. This requires (n - k - 1) multiplications and subtractions.
The total parallel time can be computed by summing over k
= 1 ... n - 1 as
TP = (3/2) n(n - 1) + ts n log n + (1/2) tw n(n - 1) log n.
The formulation is not cost optimal because of the tw term.
Parallel Gaussian Elimination
(Figure) Gaussian elimination steps during the iteration corresponding to k = 3
for an 8 × 8 matrix partitioned rowwise among eight processes:
(a) computation: A[k, j] := A[k, j]/A[k, k] for k < j < n, and A[k, k] := 1;
(b) communication: one-to-all broadcast of row A[k, *];
(c) computation: A[i, j] := A[i, j] - A[i, k] × A[k, j] for k < i < n and k < j < n,
and A[i, k] := 0 for k < i < n.
Parallel Gaussian Elimination:
Pipelined Executioon
In the previous formulation, the (k+1)st iteration starts
only after all the computation and communication for the
kth iteration is complete.
In the pipelined version, there are three steps:
normalization of a row, communication, and elimination.
These steps are performed in an asynchronous fashion.
A processor Pk waits to receive and eliminate all rows
prior to k.
Once it has done this, it forwards its own row to
processor Pk+1.
Parallel Gaussian Elimination:
Pipelined Execution
(Figure) Pipelined Gaussian elimination on a 5 × 5 matrix partitioned with one
row per process: successive snapshots show iterations k = 0 through k = 4
starting and ending, with the communication and computation for different
values of k overlapped across processes.
Parallel Gaussian Elimination:
Pipelined Execution
The total number of steps in the entire pipelined
procedure is O(n).
In any step, either O(n) elements are communicated
between directly-connected processes, or a division step
is performed on O(n) elements of a row, or an elimination
step is performed on O(n) elements of a row.
The parallel time is therefore O(n²).
This is cost optimal.
Parallel Gaussian Elimination:
Pipelined Execution
(Figure) The communication in the Gaussian elimination iteration
corresponding to k = 3 for an 8 × 8 matrix distributed among
four processes using block 1-D partitioning.
Parallel Gaussian Elimination:
Block 1D with p <n
The above algorithm can be easily adapted to the case
when p < n.
In the kth iteration, a processor with all rows belonging to
the active part of the matrix performs (n - k - 1) n/p
multiplications and subtractions.
In the pipelined version, for n >> p, computation dominates
communication.
The parallel time is given by: 2(n/p) Σk (n - k - 1),
or approximately n³/p.
While the algorithm is cost optimal, the cost of the parallel
algorithm is higher than the sequential run time by a factor
of 3/2.
Parallel Gaussian Elimination:
Block 1D with p <n
(Figure) Computation load on different processes in (a) block and (b) cyclic
1-D partitioning of an 8 × 8 matrix on four processes during the
Gaussian elimination iteration corresponding to k = 3.
Parallel Gaussian Elimination:
Block 1D with p < n
The load imbalance problem can be alleviated by using a
cyclic mapping.
In this case, other than processing of the last p rows,
there is no load imbalance.
This corresponds to a cumulative load imbalance
overhead of O(n²p) (instead of O(n³) in the previous
case).
Parallel Gaussian Elimination:
2-D Mapping
Assume an n X n matrix A mapped onto an n X n mesh
of processors.
Each update of the partial matrix can be thought of as a
Scaled rank-one update (scaling by the pivot element).
In the first step, the pivot is broadcast to the row of
procesSors.
In the second step, each processor locally updates its
value. For this it needs the corresponding value from the
pivot row, and the scaling value from its Own row.
This requires two broadcasts, each of which takes log n
time.
This results in a non-cost-optimal algorithm.
Parallel Gaussian Elimination:
2-D Mapping
(Figure) Various steps in the Gaussian elimination iteration corresponding to
k = 3 for an 8 × 8 matrix on 64 processes arranged in a logical two-dimensional
mesh: (a) rowwise broadcast of A[i, k] for (k - 1) < i < n;
(b) A[k, j] := A[k, j]/A[k, k] for k < j < n;
(c) columnwise broadcast of A[k, j] for k < j < n;
(d) A[i, j] := A[i, j] - A[i, k] × A[k, j] for k < i < n and k < j < n.
Parallel Gaussian Elimination:
2-D Mapping with Pipelining
We pipeline along two dimensions. First, the pivot value is pipelined
along the row. Then the scaled pivot row is pipelined down.
Process Pi,j (not on the pivot row) performs the elimination step
A[i, j] := A[i, j] - A[i, k] × A[k, j] as soon as A[i, k] and A[k, j] are
available.
The computation and communication for each iteration moves
through the mesh from top-left to bottom-right as a "front."
After the front corresponding to a certain iteration passes through a
process, the process is free to perform subsequent iterations.
Multiple fronts that correspond to different iterations are active
simultaneously.
Parallel Gaussian Elimination:
2-D Mapping with Pipelining
If each step (division, elimination, or communication) is
assumed to take constant time, the front moves a single
step in this time. The front takes O(n) time to reach process
Pn-1,n-1.
Once the front has progressed past a diagonal
process, the next front can be initiated. In this way, the
last front passes the bottom-right corner of the matrix
Θ(n) steps after the first one.
The parallel time is therefore O(n), which is cost-optimal.
2-D Mapping with Pipelining

(Figure) Pipelined Gaussian elimination for a 5 × 5 matrix with 25 processes:
the communication and computation fronts for iterations k = 0, 1, 2 move from
the top-left toward the bottom-right of the mesh.


Parallel Gaussian Elimination:
2-D Mapping with Pipelining and p <n
In this case, a process containing a completely active
part of the matrix performs n²/p multiplications and
subtractions, and communicates n/√p words along its
row and its column.
The computation dominates communication for n >> p.
The total parallel run time of this algorithm is (2n²/p) × n,
since there are n iterations. This is equal to 2n³/p.
The resulting cost of 2n³ is three times the serial operation count!
Parallel Gaussian Elimination:
2-D Mapping with Pipelining and p <n
(Figure) The communication steps in the Gaussian elimination iteration
corresponding to k = 3 for an 8 × 8 matrix on 16 processes of a
two-dimensional mesh: (a) rowwise broadcast of A[i, k] for i = k to (n - 1);
(b) columnwise broadcast of A[k, j] for j = (k + 1) to (n - 1).
Parallel Gaussian Elimination:
2-D Mapping with Pipelining and p <n
(Figure) Computational load on different processes in (a) block-checkerboard
and (b) cyclic-checkerboard 2-D mappings of an 8 × 8 matrix onto 16 processes
during the Gaussian elimination iteration corresponding to k = 3.
Parallel Gaussian Elimination:
2-D Cyclic Mapping
The idling in the block mapping can be alleviated using a
cyclic mapping.
The maximum difference in computational load between
any two processes in any iteration is that of one row and
one column update.
This contributes Θ(n√p) to the overhead function. Since
there are n iterations, the total overhead is Θ(n²√p).
Gaussian Elimination
with Partial Pivoting
For numerical stability, one generally uses partial
pivoting.
In the kth iteration, we select a column i (called the pivot
column) such that A[k, i] is the largest in magnitude
among all A[k, j] such that k ≤ j < n.
The kth and the ith columns are interchanged.
Simple to implement with row-partitioning and does not
add overhead since the division step takes the same
time as computing the max.
Column-partitioning, however, requires a global
reduction, adding a logp term to the overhead.
Pivoting precludes the use of pipelining.
Gaussian Elimination with Partial
Pivoting: 2-D Partitioning
Partial pivoting restricts use of pipelining, resulting in
performance loss.
This loss can be alleviated by restricting pivoting to
specific columns.
Alternately, we can use faster algorithms for broadcast.
Solving a Triangular System:
Back-Substitution
The upper triangular matrix U undergoes back-
substitution to determine the vector x.
1. procedure BACK_SUBSTITUTION (U, x, y)
2. begin
3.    for k := n - 1 downto 0 do        /* Main loop */
4.    begin
5.       x[k] := y[k];
6.       for i := k - 1 downto 0 do
7.          y[i] := y[i] - x[k] × U[i, k];
8.    endfor;
9. end BACK_SUBSTITUTION

A serial algorithm for back-substitution.
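A runnable Python version of the back-substitution loop (a sketch; U is assumed to be the unit upper-triangular matrix produced by the elimination sketch shown earlier, and back_substitution is a hypothetical name):

import numpy as np

def back_substitution(U, y):
    # Solve U x = y for unit upper-triangular U, following the serial loop above.
    n = U.shape[0]
    x = np.zeros(n)
    y = y.astype(float).copy()
    for k in range(n - 1, -1, -1):
        x[k] = y[k]
        for i in range(k - 1, -1, -1):
            y[i] -= x[k] * U[i, k]
    return x

A = np.array([[2.0, 1.0], [1.0, 3.0]]); b = np.array([3.0, 5.0])
U, y = gaussian_elimination(A, b)     # elimination sketch shown earlier
assert np.allclose(A @ back_substitution(U, y), b)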


Solving a Triangular System:
Back-Substitution
The algorithm performs approximately n²/2 multiplications and
subtractions.
Since complexity of this part is asymptotically lower, we should
optimize the data distribution for the factorization part.
Consider a rowwise block 1-D mapping of the n Xn matrix U with
vector y distributed uniformly.
The value of the variable solved at a step can be pipelined back.
Each step of a pipelined implementation requires a constant amount
of time for communication and O(n/p) time for computation.
The parallel run time of the entire algorithm is O(n/p).
Solving a Triangular System:
Back-Substitution
If the matrix is partitioned by using 2-D partitioning on a
logical mesh of √p × √p processes, and the elements of
the vector are distributed along one of the columns of the
process mesh, then only the √p processes containing
the vector perform any computation.
Using pipelining to communicate the appropriate
elements of U to the process containing the
corresponding elements of y for the substitution step
(line 7), the algorithm can be executed in Θ(n²/√p) time.
While this is not cost optimal, since this does not
dominate the overall computation, the cost optimality is
determined by the factorization.
BTech IT Sem I
PE HPC
2021-22

Analytical Modeling of Parallel Systems


Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar

Topic Overview

• Sources of Overhead in Parallel Programs

• Performance Metrics for Parallel Systems

• Effect of Granularity on Performance

• Scalability of Parallel Systems

• Minimum Execution Time and Minimum Cost-Optimal Execution


Time

• Asymptotic Analysis of Parallel Programs

• Other Scalability Metrics

Analytical Modeling - Basics

• A sequential algorithm is evaluated by its runtime (in general,


asymptotic runtime as a function of input size).

• The asymptotic runtime of a sequential program is identical on any


serial platform.

• The parallel runtime of a program depends on the input size, the


number of processors, and the communication parameters of the
machine.

• An algorithm must therefore be analyzed in the context of the


underlying platform.

• A parallel system is a combination of a parallel algorithm and an


underlying platform.

Analytical Modeling - Basics

• A number of performance measures are intuitive.

• Wall clock time - the time from the start of the first processor to the
stopping time of the last processor in a parallel ensemble. But how
does this scale when the number of processors is changed or the
program is ported to another machine altogether?

• How much faster is the parallel version? This begs the obvious
followup question - what's the baseline serial version with which we
compare? Can we use a suboptimal serial program to make our
parallel program look better?

• Raw FLOP count - What good are FLOP counts when they don't
solve a problem?

Sources of Overhead in Parallel Programs

• If I use two processors, shouldn't my program run twice as fast?

• No - a number of overheads, including wasted computation,


communication, idling, and contention cause degradation in
performance.

The execution profile of a hypothetical parallel program


executing on eight processing elements. Profile indicates times spent
performing computation (both essential and excess), communication,
and idling.

Sources of Overheads in Parallel Programs

• Interprocess interactions: Processors working on any non-trivial


parallel problem will need to talk to each other.

• Idling: Processes may idle because of load imbalance,


synchronization, or serial components.

• Excess Computation: This is computation not performed by the


serial version. This might be because the serial algorithm is difficult
to parallelize, or that some computations are repeated across
processors to minimize communication.

Performance Metrics for Parallel Systems: Execution
Time
• Serial runtime of a program is the time elapsed between the
beginning and the end of its execution on a sequential computer.

• The parallel runtime is the time that elapses from the moment the
first processor starts to the moment the last processor finishes
execution.

• We denote the serial runtime by TS and the parallel runtime by TP .

Performance Metrics for Parallel Systems: Total Parallel
Overhead
• Let Tall be the total time collectively spent by all the processing
elements.

• TS is the serial time.

• Observe that Tall - TS is then the total time spend by all processors
combined in non-useful work. This is called the total overhead.

• The total time collectively spent by all the processing elements


Tall = p TP (p is the number of processors).

• The overhead function (To) is therefore given by

To = p TP - TS (1)
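A small helper (an illustrative sketch, not from the text) that evaluates the overhead together with the speedup, efficiency, and cost metrics defined in the following slides, from measured TS, TP, and p:

def parallel_metrics(t_serial, t_parallel, p):
    # To = p*TP - TS, S = TS/TP, E = S/p, cost = p*TP.
    overhead = p * t_parallel - t_serial
    speedup = t_serial / t_parallel
    return {"To": overhead, "S": speedup, "E": speedup / p, "cost": p * t_parallel}

print(parallel_metrics(t_serial=150.0, t_parallel=40.0, p=4))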

Performance Metrics for Parallel Systems: Speedup

• What is the benefit from parallelism?

• Speedup (S) is the ratio of the time taken to solve a problem on a


single processor to the time required to solve the same problem on
a parallel computer with p identical processing elements.

Performance Metrics: Example

• Consider the problem of adding n numbers by using n processing


elements.

• If n is a power of two, we can perform this operation in log n steps


by propagating partial sums up a logical binary tree of processors.

Performance Metrics: Example

Computing the global sum of 16 partial sums using 16
processing elements. Σi:j denotes the sum of the numbers with
consecutive labels from i to j.
Performance Metrics: Example (continued)

• If an addition takes constant time, say, tc and communication


of a single word takes time ts + tw, we have the parallel time
TP = Θ (log n)

• We know that TS = Θ (n)

• Speedup S is given by S = Θ (n / log n)

Performance Metrics: Speedup

• For a given problem, there might be many serial algorithms


available. These algorithms may have different asymptotic runtimes
and may be parallelizable to different degrees.

• For the purpose of computing speedup, we always consider the best


sequential program as the baseline.

Performance Metrics: Speedup Example

• Consider the problem of parallel bubble sort.

• The serial time for bubblesort is 150 seconds.

• The parallel time for odd-even sort (efficient parallelization of bubble


sort) is 40 seconds.

• The speedup would appear to be 150/40 = 3.75.

• But is this really a fair assessment of the system?

• What if serial quicksort only took 30 seconds? In this case, the


speedup is 30/40 = 0.75. This is a more realistic assessment of the
system.

Performance Metrics: Speedup Bounds

• Speedup can be as low as 0 (the parallel program never


terminates).

• Speedup, in theory, should be upper bounded by p - after all, we can


only expect a p-fold speedup if we use p times as many resources.

• A speedup greater than p is possible only if each processing


element spends less than time TS / p solving the problem.

• In this case, a single processor could be timesliced to achieve a


faster serial program, which contradicts our assumption of fastest
serial program as basis for speedup.

Performance Metrics: Superlinear Speedups

One reason for superlinearity is that the parallel version does


less work than corresponding serial algorithm.

Searching an unstructured tree for a node with a given label,


`S', on two processing elements using depth-first traversal. The two-
processor version with processor 0 searching the left subtree and
processor 1 searching the right subtree expands only the shaded
nodes before the solution is found. The corresponding serial
formulation expands the entire tree. It is clear that the serial
algorithm does more work than the parallel algorithm.

Performance Metrics: Superlinear Speedups

Resource-based superlinearity: The higher aggregate


cache/memory bandwidth can result in better cache-hit ratios, and
therefore superlinearity.

Example: A processor with 64KB of cache yields an 80% hit


ratio. If two processors are used, since the problem size/processor
is smaller, the hit ratio goes up to 90%. Of the remaining 10%
access, 8% come from local memory and 2% from remote memory.

If DRAM access time is 100 ns, cache access time is 2 ns, and
remote memory access time is 400ns, this corresponds to a
speedup of 2.43!
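The 2.43 figure follows from the average memory-access times (a worked check under the access mixes stated above; the factor of two accounts for the two processors working concurrently):

t1 = 0.8 * 2 + 0.2 * 100                  # single processor: 21.6 ns per access
t2 = 0.9 * 2 + 0.08 * 100 + 0.02 * 400    # per processor with two processors: 17.8 ns
print(round(2 * t1 / t2, 2))              # ~2.43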

Performance Metrics: Efficiency

• Efficiency is a measure of the fraction of time for which a


processing element is usefully employed

• Mathematically, it is given by

E = S / p (2)

• Following the bounds on speedup, efficiency can be as low as 0


and as high as 1.

Performance Metrics: Efficiency Example

• The speedup of adding n numbers on n processing elements is given by
S = n / log n.
• Efficiency is given by E = S / n = 1 / log n.

Parallel Time, Speedup, and Efficiency Example

Consider the problem of edge-detection in images. The


problem requires us to apply a 3 x 3 template to each pixel. If each
multiply-add operation takes time tc, the serial time for an n x n
image is given by TS= tc n2.

Example of edge detection: (a) an 8 x 8 image; (b) typical


templates for detecting edges; and (c) partitioning of the image
across four processors with shaded regions indicating image data
that must be communicated from neighboring processors to
processor 1.

Parallel Time, Speedup, and Efficiency Example
(continued)

• One possible parallelization partitions the image equally into vertical


segments, each with n2 / p pixels.

• The boundary of each segment is 2n pixels. This is also the number


of pixel values that will have to be communicated. This takes time
2(ts + twn).

• Templates may now be applied to all n2 / p pixels in time
9 tc n2 / p.

Parallel Time, Speedup, and Efficiency Example
(continued)
• The total time for the algorithm is therefore given by:
TP = 9 tc n2 / p + 2(ts + tw n)
• The corresponding values of speedup and efficiency are given by:
S = 9 tc n2 / (9 tc n2 / p + 2(ts + tw n))
and
E = 1 / (1 + 2 p (ts + tw n) / (9 tc n2))

Cost of a Parallel System

• Cost is the product of parallel runtime and the number of processing


elements used (p x TP ).

• Cost reflects the sum of the time that each processing element
spends solving the problem.

• A parallel system is said to be cost-optimal if the cost of solving a


problem on a parallel computer is asymptotically identical to serial
cost.

• Since E = TS / p TP, for cost optimal systems, E = O(1).

• Cost is sometimes referred to as work or processor-time product.

Cost of a Parallel System: Example

Consider the problem of adding n numbers on n processors.

• We have, TP = log n (for p = n).

• The cost of this system is given by p TP = n log n.

• Since the serial runtime of this operation is Θ(n), the algorithm is not
cost optimal.

Impact of Non-Cost Optimality

Consider a sorting algorithm that uses n processing elements


to sort the list in time (log n)2.
• Since the serial runtime of a (comparison-based) sort is n log n, the
speedup and efficiency of this algorithm are given by n / log n and 1
/ log n, respectively.

• The p TP product of this algorithm is n (log n)2.

• This algorithm is not cost optimal but only by a factor of log n.

• If p < n, assigning n tasks to p processors gives TP = n (log n)2 / p .

• The corresponding speedup of this formulation is p / log n.

• This speedup goes down as the problem size n is increased for a


given p !
Effect of Granularity on Performance

• Often, using fewer processors improves performance of parallel


systems.

• Using fewer than the maximum possible number of processing


elements to execute a parallel algorithm is called scaling down a
parallel system.

• A naive way of scaling down is to think of each processor in the


original case as a virtual processor and to assign virtual processors
equally to scaled down processors.

• Since the number of processing elements decreases by a factor of


n / p, the computation at each processing element increases by a
factor of n / p.

• The communication cost should not increase by this factor since


some of the virtual processors assigned to a physical processors
might talk to each other. This is the basic reason for the
improvement from building granularity.
Building Granularity: Example

• Consider the problem of adding n numbers on p processing


elements such that p < n and both n and p are powers of 2.

• Use the parallel algorithm for n processors, except, in this case, we


think of them as virtual processors.

• Each of the p processors is now assigned n / p virtual processors.

• The first log p of the log n steps of the original algorithm are
simulated in (n / p) log p steps on p processing elements.

• Subsequent log n - log p steps do not require any communication.

Building Granularity: Example (continued)

• The overall parallel execution time of this parallel system is


Θ ( (n / p) log p).

• The cost is Θ (n log p), which is asymptotically higher than the Θ (n)
cost of adding n numbers sequentially. Therefore, the parallel
system is not cost-optimal.

Building Granularity: Example (continued)

Can we build granularity in the example in a cost-optimal


fashion?
• Each processing element locally adds its n / p numbers in time
Θ (n / p).
• The p partial sums on p processing elements can be added in time
Θ(log p).

A cost-optimal way of computing the sum of 16 numbers using four


processing elements.

Building Granularity: Example (continued)

• The parallel runtime of this algorithm is
TP = Θ(n / p + log p) (3)
• The cost is Θ(n + p log p).
• This is cost-optimal, so long as n = Ω(p log p)!
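The cost-optimal scheme is easy to simulate serially (an illustrative sketch, not the book's code; it assumes p is a power of two and divides n): each "process" first adds its n/p local numbers, and the p partial sums are then combined in log p tree steps.

def cost_optimal_sum(values, p):
    # Phase 1: each of the p processes adds its n/p local values.
    n = len(values)
    chunk = n // p
    partial = [sum(values[i*chunk:(i+1)*chunk]) for i in range(p)]
    # Phase 2: log p rounds of pairwise (tree) reduction among the p partial sums.
    step = 1
    while step < p:
        for i in range(0, p, 2 * step):
            partial[i] += partial[i + step]
        step *= 2
    return partial[0]

vals = list(range(16))
assert cost_optimal_sum(vals, 4) == sum(vals)     # n = 16, p = 4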

Scalability of Parallel Systems

How do we extrapolate performance from small problems and


small systems to larger problems on larger configurations?
Consider three parallel algorithms for computing an n-point Fast
Fourier Transform (FFT) on 64 processing elements.

A comparison of the speedups obtained by the binary-exchange, 2-D


transpose and 3-D transpose algorithms on 64 processing elements
with tc = 2, tw = 4, ts = 25, and th = 2.
Clearly, it is difficult to infer scaling characteristics from
observations on small datasets on small machines.

Scaling Characteristics of Parallel Programs

• The efficiency of a parallel program can be written as:
E = S / p = TS / (p TP), or E = 1 / (1 + To / TS) (4)

• The total overhead function To is an increasing function of p .

Scaling Characteristics of Parallel Programs

• For a given problem size (i.e., the value of TS remains constant), as


we increase the number of processing elements, To increases.

• The overall efficiency of the parallel program goes down. This is the
case for all parallel programs.

Scaling Characteristics of Parallel Programs: Example

• Consider the problem of adding n numbers on p processing
elements.

• We have seen that:

TP = n / p + 2 log p (5)

S = n / (n / p + 2 log p) (6)

E = 1 / (1 + (2 p log p) / n) (7)

Scaling Characteristics of Parallel Programs: Example
(continued)
Plotting the speedup for various input sizes gives us:

Speedup versus the number of processing elements for


adding a list of numbers.

Speedup tends to saturate and efficiency drops as a


consequence of Amdahl's law

Scaling Characteristics of Parallel Programs

• Total overhead function To is a function of both problem size Ts and


the number of processing elements p.

• In many cases, To grows sublinearly with respect to Ts.

• In such cases, the efficiency increases if the problem size is


increased keeping the number of processing elements constant.

• For such systems, we can simultaneously increase the problem size


and number of processors to keep efficiency constant.

• We call such systems scalable parallel systems.

Scaling Characteristics of Parallel Programs

• Recall that cost-optimal parallel systems have an efficiency of Θ(1).

• Scalability and cost-optimality are therefore related.

• A scalable parallel system can always be made cost-optimal if the


number of processing elements and the size of the computation are
chosen appropriately.

Isoefficiency Metric of Scalability

• For a given problem size, as we increase the number of processing


elements, the overall efficiency of the parallel system goes down for
all systems.

• For some systems, the efficiency of a parallel system increases if


the problem size is increased while keeping the number of
processing elements constant.

Isoefficiency Metric of Scalability

Variation of efficiency: (a) as the number of processing elements is


increased for a given problem size; and (b) as the problem size is
increased for a given number of processing elements. The
phenomenon illustrated in graph (b) is not common to all parallel
systems.

Isoefficiency Metric of Scalability

• What is the rate at which the problem size must increase with
respect to the number of processing elements to keep the efficiency
fixed?

• This rate determines the scalability of the system. The slower this
rate, the better.

• Before we formalize this rate, we define the problem size W as the


asymptotic number of operations associated with the best serial
algorithm to solve the problem.

Isoefficiency Metric of Scalability
• We can write parallel runtime as:

TP = (W + To(W, p)) / p (8)

• The resulting expression for speedup is

S = W / TP = W p / (W + To(W, p)) (9)

• Finally, we write the expression for efficiency as

E = S / p = W / (W + To(W, p)) = 1 / (1 + To(W, p) / W) (10)
Isoefficiency Metric of Scalability

• For scalable parallel systems, efficiency can be maintained at a


fixed value (between 0 and 1) if the ratio To / W is maintained at a
constant value.
• For a desired value E of efficiency,

E = 1 / (1 + To(W, p) / W), i.e., To(W, p) / W = (1 - E) / E (11)

• If K = E / (1 - E) is a constant depending on the efficiency to be
maintained, since To is a function of W and p, we have

W = K To(W, p) (12)
Isoefficiency Metric of Scalability

• The problem size W can usually be obtained as a function of p by


algebraic manipulations to keep efficiency constant.

• This function is called the isoefficiency function.

• This function determines the ease with which a parallel system can
maintain a constant efficiency and hence achieve speedups
increasing in proportion to the number of processing elements

Isoefficiency Metric: Example

• The overhead function for the problem of adding n numbers on p


processing elements is approximately 2p log p .

• Substituting To by 2p log p , we get

W = K 2p log p (13)

Thus, the asymptotic isoefficiency function for this parallel system is
Θ(p log p).

• If the number of processing elements is increased from p to p’, the


problem size (in this case, n ) must be increased by a factor of
(p’ log p’) / (p log p) to get the same efficiency as on p processing
elements.
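The required growth can be tabulated directly (an illustrative sketch): keeping efficiency at a target E requires W = K · 2p log p with K = E / (1 - E).

import math

def required_problem_size(p, E=0.8):
    # W = K * To(W, p) with To = 2p log p for the adding-n-numbers system.
    K = E / (1.0 - E)
    return K * 2 * p * math.log2(p)

for p in (4, 16, 64, 256):
    print(p, round(required_problem_size(p)))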

Isoefficiency Metric: Example
Consider a more complex example where To = p^(3/2) + p^(3/4) W^(3/4).
• Using only the first term of To in Equation 12, we get

W = K p^(3/2) (14)

• Using only the second term, Equation 12 yields the following
relation between W and p:

W = K p^(3/4) W^(3/4), i.e., W = K^4 p^3 (15)

• The larger of these two asymptotic rates determines the
isoefficiency. This is given by Θ(p^3)

Cost-Optimality and the Isoefficiency Function

• A parallel system is cost-optimal if and only if

p TP = Θ(W) (16)

• From this, we have:

W + To(W, p) = Θ(W) (17)

To(W, p) = O(W), i.e., W = Ω(To(W, p)) (18)

• If we have an isoefficiency function f(p), then it follows that the


relation W = Ω(f(p)) must be satisfied to ensure the cost-optimality of
a parallel system as it is scaled up.

Lower Bound on the Isoefficiency Function

• For a problem consisting of W units of work, no more than W


processing elements can be used cost-optimally.

• The problem size must increase at least as fast as Θ(p) to maintain


fixed efficiency; hence, Ω(p) is the asymptotic lower bound on the
isoefficiency function.

Degree of Concurrency and the Isoefficiency Function

• The maximum number of tasks that can be executed simultaneously


at any time in a parallel algorithm is called its degree of concurrency.

• If C(W) is the degree of concurrency of a parallel algorithm, then for


a problem of size W, no more than C(W) processing elements can
be employed effectively.

Degree of Concurrency and the Isoefficiency Function: Example

Consider solving a system of n equations in n variables by using
Gaussian elimination (W = Θ(n3)).

• The n variables must be eliminated one after the other, and


eliminating each variable requires Θ(n2) computations.

• At most Θ(n2) processing elements can be kept busy at any time.

• Since W = Θ(n3) for this problem, the degree of concurrency C(W) is


Θ(W2/3) .

• Given p processing elements, the problem size should be at least


Ω(p3/2) to use them all.

Minimum Execution Time and Minimum Cost-Optimal
Execution Time
Often, we are interested in the minimum time to solution.

• We can determine the minimum parallel runtime TPmin for a given W


by differentiating the expression for TP w.r.t. p and equating it to
zero.

d TP / d p = 0 (19)

• If p0 is the value of p as determined by this equation, TP(p0) is the


minimum parallel time.

Minimum Execution Time: Example

Consider the minimum execution time for adding n numbers.

TP = n / p + 2 log p (20)

Setting the derivative w.r.t. p to zero, we have p = n/2. The
corresponding runtime is

TPmin = 2 log n (21)

(One may verify that this is indeed a min by verifying that the second
derivative is positive).

Note that at this point, the formulation is not cost-optimal.
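The p = n/2 stationary point can be confirmed symbolically (an illustrative sketch assuming the sympy package is available; the logarithm is treated as the natural logarithm so that the derivative takes the simple form used above):

import sympy as sp

n, p = sp.symbols('n p', positive=True)
TP = n / p + 2 * sp.log(p)
p0 = sp.solve(sp.diff(TP, p), p)[0]       # stationary point of TP with respect to p
print(p0)                                 # -> n/2
print(sp.simplify(TP.subs(p, p0)))        # -> 2*log(n/2) + 2, i.e. Theta(log n)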

Minimum Cost-Optimal Parallel Time

• Let TPcost_opt be the minimum cost-optimal parallel time.

• If the isoefficiency function of a parallel system is Θ(f(p)) , then a


problem of size W can be solved cost-optimally if and only if
W= Ω(f(p)) .

• In other words, for cost optimality, p = O(f^-1(W)) .

• For cost-optimal systems, TP = Θ(W/p) , therefore,

TPcost_opt = Θ(W / f^-1(W)) (22)

Minimum Cost-Optimal Parallel Time: Example

Consider the problem of adding n numbers.

• The isoefficiency function f(p) of this parallel system is Θ(p log p).

• From this, we have p ≈ n /log n .

• At this processor count, the parallel runtime is:

TPcost_opt = n/p + 2 log p = log n + 2 log(n / log n) = Θ(log n) (23)

• Note that both TPmin and TPcost_opt for adding n numbers are
Θ(log n). This may not always be the case.

Asymptotic Analysis of Parallel Programs

Consider the problem of sorting a list of n numbers. The fastest


serial programs for this problem run in time Θ(n log n). Consider
four parallel algorithms, A1, A2, A3, and A4 as follows:

A1: p = n2, TP = 1, S = n log n, E = (log n)/n, pTP = n2
A2: p = log n, TP = n, S = log n, E = 1, pTP = n log n
A3: p = n, TP = √n, S = √n log n, E = (log n)/√n, pTP = n3/2
A4: p = √n, TP = √n log n, S = √n, E = 1, pTP = n log n

Comparison of four different algorithms for sorting a given list of
numbers. The table shows the number of processing elements, parallel
runtime, speedup, efficiency, and the pTP product.

Asymptotic Analysis of Parallel Programs

• If the metric is speed, algorithm A1 is the best, followed by A3, A4,


and A2 (in order of increasing TP).

• In terms of efficiency, A2 and A4 are the best, followed by A3 and


A1.

• In terms of cost, algorithms A2 and A4 are cost optimal, A1 and A3


are not.

• It is important to identify the objectives of analysis and to use


appropriate metrics!

Other Scalability Metrics

• A number of other metrics have been proposed, dictated by specific


needs of applications.

• For real-time applications, the objective is to scale up a system to


accomplish a task in a specified time bound.

• In memory constrained environments, metrics operate at the limit of


memory and estimate performance under this problem growth rate.

Other Scalability Metrics: Scaled Speedup

• Speedup obtained when the problem size is increased linearly with


the number of processing elements.

• If scaled speedup is close to linear, the system is considered


scalable.

• If the isoefficiency is near linear, scaled speedup curve is close to


linear as well.

• If the aggregate memory grows linearly in p, scaled speedup


increases problem size to fill memory.

• Alternately, the size of the problem is increased subject to an upper-


bound on parallel execution time.

Scaled Speedup: Example

• The serial runtime of multiplying a matrix of dimension n x n with a


vector is tcn2 .

• For a given parallel algorithm,

TP = tc n2 / p + ts log p + tw n (24)

• Total memory requirement of this algorithm is Θ(n2) .

Scaled Speedup: Example (continued)

Consider the case of memory-constrained scaling.

• We have m= Θ(n2) = Θ(p).

• Memory constrained scaled speedup is given by

or

• This is not a particularly scalable system

Scaled Speedup: Example (continued)

Consider the case of time-constrained scaling.

• We have TP = O(n2) .

• Since this is constrained to be constant, n2= O(p) .

• Note that in this case, time-constrained speedup is identical to


memory constrained speedup.

• This is not surprising, since the memory and time complexity of the
operation are identical.

Scaled Speedup: Example

• The serial runtime of multiplying two matrices of dimension n x n is


tcn3.

• The parallel runtime of a given algorithm is:
TP = tc n3 / p + ts log p + 2 tw n2 / √p

• The speedup S is given by:

S = tc n3 / (tc n3 / p + ts log p + 2 tw n2 / √p) (25)

Scaled Speedup: Example (continued)

Consider memory-constrained scaled speedup.


• We have memory complexity m= Θ(n2) = Θ(p), or n2=c x p .

• At this growth rate, scaled speedup S’ is given by:

• Note that this is scalable.

Scaled Speedup: Example (continued)

Consider time-constrained scaled speedup.

• We have TP = O(1) = O(n3 / p) , or n3=c x p .

• Time-constrained speedup S’’ is given by:

• Memory constrained scaling yields better performance.

Serial Fraction f

• If the serial runtime of a computation can be divided into a totally


parallel and a totally serial component, we have: W = Tser + Tpar.

• From this, we have,

TP = Tser + Tpar / p (26)

Serial Fraction f

• The serial fraction f of a parallel program is defined as:
f = Tser / W

• Therefore, we have:
TP = f W + (1 - f) W / p
Serial Fraction

• Since S = W / TP , we have 1 / S = f + (1 - f) / p.

• From this, we have:

f = (1 / S - 1 / p) / (1 - 1 / p) (27)

• If f increases with the number of processors, this is an indicator of


rising overhead, and thus an indicator of poor scalability.
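Computing f from measured speedups is a one-liner (an illustrative sketch of equation (27); the speedup figures below are hypothetical):

def serial_fraction(speedup, p):
    # f = (1/S - 1/p) / (1 - 1/p); growth of f with p signals rising overhead.
    return (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)

for p, s in [(2, 1.9), (4, 3.5), (8, 6.0), (16, 9.0)]:
    print(p, round(serial_fraction(s, p), 3))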

Serial Fraction: Example

Consider the problem of estimating the serial component of the
matrix-vector product.

We have:

f = (1 / S - 1 / p) / (1 - 1 / p) (28)

or

f = (p TP - TS) / (TS (p - 1)) = To / (TS (p - 1)).

Here, the denominator is the serial runtime and the numerator is the
overhead.
Physical Organization
of Parallel Platforms
An ideal parallel machine called Parallel Random Access
Machine, or PRAM.
Architecture of an
Ideal Parallel Computer
A natural extension of the Random Access Machine
(RAM) serial architecture is the Parallel Random Access
Machine, or PRAM.
PRAMs consist of p processors and a global memory of
unbounded size that is uniformly accessible to all
processors.
Processors share a common clock but may execute
different instructions in each cycle.
Physical Complexity of an
Ideal Parallel Computer
Processors and memories are connected via switches.
Since these switches must operate in O(1) time at the
level of words, for a system of p processors and m
words, the switch complexity is O(mp).
Clearly, for meaningful values of p and m, a true PRAM
is not realizable.
Network Topologies: Crossbars

The cost of a crossbar of p processors grows as O(p²).


This is generally difficult to scale for large values of p.
Examples of machines that employ crossbars include the
Sun Ultra HPC 10000 and the Fujitsu VPP500.
Network Topologies: Multistage Omega
Network

One of the most commonly used multistage


interconnects is the Omega network.
This network consists of log p stages, where p is
the number of inputs/outputs.
At each stage, input i is connected to output j if:
j = 2i,           for 0 ≤ i ≤ p/2 - 1
j = 2i + 1 - p,   for p/2 ≤ i ≤ p - 1
Network Topologies:
Linear Arrays, Meshes, and k-d Meshes

In a linear array, each node has two neighbors, one to its


left and one to its right. If the nodes at either end are
connected, we refer to it as a 1-D torus or a ring.
A generalization to 2 dimensions has nodes with 4
neighbors, to the north, south, east, and west.
A further generalization to d dimensions has nodes with
2d neighbors.
A special case of a d-dimensional mesh is a hypercube.
Here, d = log p, where p is the total number of nodes.
Network Topologies:
Properties of Hypercubes

The distance between any two nodes is at most log p.


Each node has log p neighbors.
The distance between two nodes is given by the number
of bit positions at which the two nodes differ.
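Since the labels differ exactly in the bit positions a message must cross, the distance is the Hamming distance of the two node labels (a one-line illustrative check, not code from the text):

def hypercube_distance(a, b):
    # Number of differing bit positions between the two node labels.
    return bin(a ^ b).count("1")

print(hypercube_distance(0b000, 0b111))   # 3 hops on a 3-dimensional hypercube
print(hypercube_distance(0b101, 0b100))   # 1 hop: the labels differ in one bit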
Evaluating
Static Interconnection Networks
Diameter: The distance between the farthest two nodes in the
network. The diameter of a linear array is p-1, that of a mesh
is 2(√p - 1), that of a tree and hypercube is log p, and that of
a completely connected network is O(1).
Bisection Width: The minimum number of wires you must cut
to divide the network into two equal parts. The bisection width
of a linear array and tree is 1, that of a mesh is √p, that of a
hypercube is p/2, and that of a completely connected network
is p²/4.
Cost: The number of links or switches (whichever is
asymptotically higher) is a meaningful measure of the cost.
However, a number of other factors, such as the ability to
layout the network, the length of wires, etc., also factor in to
the cost.
Network Topologies: Fat Trees

A fat tree network of 15 processing nodes.


Evaluating
Static Interconnection Networks

Network | Diameter | Bisection Width | Arc Connectivity | Cost (No. of links)
Completely-connected | 1 | p²/4 | p - 1 | p(p - 1)/2
Star | 2 | 1 | 1 | p - 1
Complete binary tree | 2 log((p + 1)/2) | 1 | 1 | p - 1
Linear array | p - 1 | 1 | 1 | p - 1
2-D mesh, no wraparound | 2(√p - 1) | √p | 2 | 2(p - √p)
2-D wraparound mesh | 2⌊√p/2⌋ | 2√p | 4 | 2p
Hypercube | log p | p/2 | log p | (p log p)/2
Wraparound k-ary d-cube | d⌊k/2⌋ | 2k^(d-1) | 2d | dp
Network Topologies: Linear Arrays

Linear arrays: (a) with no wraparound links; (b) with


Wraparound link.
Network Topologies:
Hypercubes and their Construction

Construction of hypercubes from hypercubes of lower


dimension.
Network Topologies: Tree-Based Networks


Complete binary tree networks: (a) a static tree network; and (b)
a dynamic tree network.
Network Topologies
Multistage Networks
(Figure) The schematic of a typical multistage interconnection network,
connecting processors to memory banks through several stages of switches.


Network Topologies:
Multistage Networks
Crossbars have excellent performance scalability but
poor cost scalability.
Buses have excellent cost scalability, but poor
performance scalability.
Multistage interconnects strike a compromise between
these extremes.

Network Topologies: Buses

Some of the simplest and earliest parallel machines
used buses.
All processors access a common bus for exchanging
data.
The distance between any two nodes is O(1) in a bus.
The bus also provides a convenient broadcast media.
However, the bandwidth of the shared bus is a major
bottleneck.
Typical bus-based machines are limited to dozens of
nodes. Sun Enterprise servers and Intel Pentium based
shared-bus multiprocessors are examples of such
architectures.

Network Topologies: Buses

(Figure) Bus-based interconnects: (a) with no local caches; (b) with local
memory/caches.
Since much of the data accessed by processors is local
to the processor, a local memory can improve the
performance of bus-based machines.


Network Topologies: Crossbars


A crossbar network uses a p × m grid of switches to
connect p inputs to m outputs in a non-blocking manner.
(Figure) A completely non-blocking crossbar network connecting p
processors to b memory banks; each grid point is a switching element.
Network Topologies
A variety of network topologies have been proposed and
implemented.
These topologies trade off performance for cost.
Commercial machines often implement hybrids of
multiple topologies for reasons of packaging, cost, and
available components.


Interconnection Networks:
Network Interfaces
Processors talk to the network via a network interface.
The network interface may hang off the I/O bus or the
memory bus.
In a physical sense, this distinguishes a cluster from a
tightly coupled multicomputer.
The relative speeds of the I/O and memory buses impact
the performance of the network.
Static and Dynamic
Interconnection Networks
(Figure) Classification of interconnection networks: (a) a static (direct)
network; and (b) a dynamic (indirect) network, showing processing nodes,
network interfaces/switches, and switching elements.

Interconnection Networks

Switches map a fixed number of inputs to outputs.
The total number of ports on a switch is the degree of
the switch.
The cost of a switch grows as the square of the degree
of the switch, the peripheral hardware linearly as the
degree, and the packaging costs linearly as the number
of pins.
Interconnection Networks
for Parallel Computers
Interconnection networks carry data between processors
and to memory.
Interconnects are made of switches and links (wires,
fiber).
Interconnects are classified as static or dynamic.
Static networks consist of point-to-point communication
links among processing nodes and are also referred to
as direct networks.
Dynamic networks are built using switches and
communication links. Dynamic networks are also
referred to as indirect networks.
Architecture of an
Ideal Parallel Computer
Depending on how simultaneous memory accesses are
handled, PRAMs can be divided into four subclasses.
Exclusive-read, exclusive-write (EREW) PRAM.
Concurrent-read, exclusive-write (CREW) PRAM.
Exclusive-read, concurrent-write (ERCW) PRAM.
Concurrent-read, concurrent-write (CRCW) PRAM.
Architecture of an
Ideal Parallel Computer
What does concurrent write mean, anyway?
Common: write only if all values are identical.
Arbitrary: write the data from a randomly selected processor.
Priority: follow a predetermined priority order.
Sum: Write the sum of all data items.
Basic Communication Operations
Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar

To accompany the text “Introduction to Parallel Computing”,


Addison Wesley, 2003.
Topic Overview

• One-to-All Broadcast and All-to-One Reduction

• All-to-All Broadcast and Reduction

• All-Reduce and Prefix-Sum Operations

• Scatter and Gather

• All-to-All Personalized Communication

• Circular Shift

• Improving the Speed of Some Communication Operations


Basic Communication Operations: Introduction

• Many interactions in practical parallel programs occur in well-


defined patterns involving groups of processors.

• Efficient implementations of these operations can improve


performance, reduce development effort and cost, and
improve software quality.

• Efficient implementations must leverage underlying architecture.


For this reason, we refer to specific architectures here.

• We select a descriptive set of architectures to illustrate the


process of algorithm design.
Basic Communication Operations: Introduction

• Group communication operations are built using point-to-point


messaging primitives.

• Recall from our discussion of architectures that communicating


a message of size m over an uncongested network takes time
ts + tw m.

• We use this as the basis for our analyses. Where necessary, we


take congestion into account explicitly by scaling the tw term.

• We assume that the network is bidirectional and that


communication is single-ported.
One-to-All Broadcast and All-to-One Reduction

• One processor has a piece of data (of size m) it needs to send


to everyone.

• The dual of one-to-all broadcast is all-to-one reduction.

• In all-to-one reduction, each processor has m units of data.


These data items must be combined piece-wise (using some
associative operator, such as addition or min), and the result
made available at a target processor.
One-to-All Broadcast and All-to-One Reduction
One-to-all Broadcast
M M M M
0 1 ... p-1 0 1 ... p-1
All-to-one Reduction

One-to-all broadcast and all-to-one reduction among p


processors.
One-to-All Broadcast and All-to-One Reduction on
Rings

• Simplest way is to send p − 1 messages from the source to the


other p − 1 processors – this is not very efficient.

• Use recursive doubling: source sends a message to a selected


processor. We now have two independent problems defined
over halves of the machine.

• Reduction can be performed in an identical fashion by


inverting the process.
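The recursive-doubling schedule on a ring can be sketched as follows (an illustration, not the text's procedure; ring_broadcast_schedule is a hypothetical name, p is assumed to be a power of two, and node 0 is the source):

def ring_broadcast_schedule(p, source=0):
    # In each step, every node that already has the message sends it to the node
    # half the remaining span away, so the broadcast finishes in log p steps.
    have = {source}
    steps = []
    span = p // 2
    while span >= 1:
        sends = [(u, (u + span) % p) for u in sorted(have)]
        have |= {dst for _, dst in sends}
        steps.append(sends)
        span //= 2
    return steps

for t, sends in enumerate(ring_broadcast_schedule(8), start=1):
    print("step", t, sends)   # matches the three steps of the eight-node ring figure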
One-to-All Broadcast

One-to-all broadcast on an eight-node ring. Node 0 is the


source of the broadcast. Each message transfer step is shown by
a numbered, dotted arrow from the source of the message to its
destination. The number on an arrow indicates the time step
during which the message is transferred.
All-to-One Reduction

Reduction on an eight-node ring with node 0 as the destination


of the reduction.
Broadcast and Reduction: Example

Consider the problem of multiplying a matrix with a vector.

• The n × n matrix is assigned to an n × n (virtual) processor grid.


The vector is assumed to be on the first row of processors.

• The first step of the product requires a one-to-all broadcast


of the vector element along the corresponding column of
processors. This can be done concurrently for all n columns.

• The processors compute local product of the vector element


and the local matrix entry.

• In the final step, the results of these products are accumulated


to the first row using n concurrent all-to-one reduction
operations along the columns (using the sum operation).
Broadcast and Reduction: Matrix-Vector
Multiplication Example

One-to-all broadcast and all-to-one reduction in the


multiplication of a 4 × 4 matrix with a 4 × 1 vector.
Broadcast and Reduction on a Mesh

• We can view each row and column of a square mesh of p nodes as a linear
array of √p nodes.

• Broadcast and reduction operations can be performed in two


steps – the first step does the operation along a row and the
second step along each column concurrently.

• This process generalizes to higher dimensions as well.


Broadcast and Reduction on a Mesh: Example

One-to-all broadcast on a 16-node mesh.


Broadcast and Reduction on a Hypercube

• A hypercube with 2^d nodes can be regarded as a d-


dimensional mesh with two nodes in each dimension.

• The mesh algorithm can be generalized to a hypercube and


the operation is carried out in d (= log p) steps.
Broadcast and Reduction on a Hypercube: Example

One-to-all broadcast on a three-dimensional hypercube. The


binary representations of node labels are shown in parentheses.
Broadcast and Reduction on a Balanced Binary Tree

• Consider a binary tree in which processors are (logically) at the


leaves and internal nodes are routing nodes.

• Assume that the source processor is the root of this tree. In the first
step, the source sends the data to the right child (assuming
the source is also the left child). The problem has now
been decomposed into two problems with half the number of
processors.
Broadcast and Reduction on a Balanced Binary Tree


One-to-all broadcast on an eight-node tree.


Broadcast and Reduction Algorithms

• All of the algorithms described above are adaptations of the


same algorithmic template.

• We illustrate the algorithm for a hypercube, but the algorithm,


as has been seen, can be adapted to other architectures.

• The hypercube has 2^d nodes and my id is the label for a node.

• X is the message to be broadcast, which initially resides at the


source node 0.
Broadcast and Reduction Algorithms

1. procedure GENERAL ONE TO ALL BC(d, my id, source, X )


2. begin
3. my virtual id := my id XOR source;
4. mask := 2^d − 1;
5. for i := d − 1 downto 0 do /* Outer loop */
6. mask := mask XOR 2^i; /* Set bit i of mask to 0 */
7. if (my virtual id AND mask) = 0 then
8. if (my virtual id AND 2^i) = 0 then
9. virtual dest := my virtual id XOR 2^i;
10. send X to (virtual dest XOR source);
/* Convert virtual dest to the label of the physical destination */
11. else
12. virtual source := my virtual id XOR 2^i;
13. receive X from (virtual source XOR source);
/* Convert virtual source to the label of the physical source */
14. endelse;
15. endfor;
16. end GENERAL ONE TO ALL BC

One-to-all broadcast of a message X from source on a hypercube.


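A hedged C/MPI transcription of this procedure is given below, using plain point-to-point sends and receives; it assumes p = 2^d processes in the communicator and a payload of m integers, and the function name is illustrative.

/* Sketch of GENERAL ONE TO ALL BC with MPI point-to-point calls.
   Assumes the number of processes is a power of two. */
#include <mpi.h>

void one_to_all_bc(int *X, int m, int source, MPI_Comm comm)
{
    int my_id, p;
    MPI_Comm_rank(comm, &my_id);
    MPI_Comm_size(comm, &p);

    int my_virtual_id = my_id ^ source;   /* relabel so the source looks like node 0 */
    int mask = p - 1;

    for (int bit = p >> 1; bit > 0; bit >>= 1) {   /* i = d-1 downto 0 */
        mask ^= bit;                               /* set bit i of mask to 0 */
        if ((my_virtual_id & mask) == 0) {
            if ((my_virtual_id & bit) == 0) {
                int virtual_dest = my_virtual_id ^ bit;
                MPI_Send(X, m, MPI_INT, virtual_dest ^ source, 0, comm);
            } else {
                int virtual_source = my_virtual_id ^ bit;
                MPI_Recv(X, m, MPI_INT, virtual_source ^ source, 0, comm,
                         MPI_STATUS_IGNORE);
            }
        }
    }
}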
Broadcast and Reduction Algorithms

1. procedure ALL TO ONE REDUCE(d, my id, m, X , sum)


2. begin
3. for j := 0 to m − 1 do sum[j] := X[j];
4. mask := 0;
5. for i := 0 to d − 1 do
/* Select nodes whose lower i bits are 0 */
6. if (my id AND mask) = 0 then
7. if (my id AND 2^i) ≠ 0 then
8. msg destination := my id XOR 2^i;
9. send sum to msg destination;
10. else
11. msg source := my id XOR 2^i;
12. receive X from msg source;
13. for j := 0 to m − 1 do
14. sum[j] := sum[j] + X[j];
15. endelse;
16. mask := mask XOR 2^i; /* Set bit i of mask to 1 */
17. endfor;
18. end ALL TO ONE REDUCE

Single-node accumulation on a d-dimensional hypercube. Each node


contributes a message X containing m words, and node 0 is the destination.
Cost Analysis

• The broadcast or reduction procedure involves log p point-to-


point simple message transfers, each at a time cost of ts + tw m.

• The total time is therefore given by:

T = (ts + tw m) log p. (1)


All-to-All Broadcast and Reduction

• Generalization of broadcast in which each processor is the


source as well as destination.

• A process sends the same m-word message to every other


process, but different processes may broadcast different
messages.
All-to-All Broadcast and Reduction

All-to-all broadcast and all-to-all reduction.


All-to-All Broadcast and Reduction on a Ring

• Simplest approach: perform p one-to-all broadcasts. This is not


the most efficient way, though.

• Each node first sends to one of its neighbors the data it needs
to broadcast.

• In subsequent steps, it forwards the data received from one of


its neighbors to its other neighbor.

• The algorithm terminates in p − 1 steps.


All-to-All Broadcast and Reduction on a Ring

All-to-all broadcast on an eight-node ring.


All-to-All Broadcast and Reduction on a Ring

1. procedure ALL TO ALL BC RING(my id, my msg , p, result)


2. begin
3. left := (my id − 1) mod p;
4. right := (my id + 1) mod p;
5. result := my msg ;
6. msg := result;
7. for i := 1 to p − 1 do
8. send msg to right;
9. receive msg from left;
10. result := result ∪ msg;
11. endfor;
12. end ALL TO ALL BC RING

All-to-all broadcast on a p-node ring.

All-to-all reduction is simply a dual of this operation and can be


performed in an identical fashion.
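A hedged C/MPI version of the ring procedure is sketched below; it keeps the p blocks in rank order inside result rather than forming a set union, and it uses MPI_Sendrecv so the simultaneous send and receive of each step cannot deadlock. The buffer layout is an assumption for illustration.

/* Sketch: all-to-all broadcast on a logical ring; each process contributes
   m ints, and result must have room for p*m ints. */
#include <mpi.h>
#include <string.h>

void all_to_all_bc_ring(const int *my_msg, int m, int *result, MPI_Comm comm)
{
    int my_id, p;
    MPI_Comm_rank(comm, &my_id);
    MPI_Comm_size(comm, &p);

    int left  = (my_id - 1 + p) % p;
    int right = (my_id + 1) % p;

    memcpy(result + my_id * m, my_msg, m * sizeof(int));   /* own block */

    int block = my_id;                  /* block forwarded in the next step */
    for (int i = 1; i < p; i++) {
        int recv_block = (block - 1 + p) % p;   /* what the left neighbor forwards now */
        MPI_Sendrecv(result + block * m, m, MPI_INT, right, 0,
                     result + recv_block * m, m, MPI_INT, left, 0,
                     comm, MPI_STATUS_IGNORE);
        block = recv_block;
    }
}

MPI_Allgather performs the same operation as a single collective call.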
All-to-all Broadcast on a Mesh

• Performed in two phases – in the first phase, each row of the


mesh performs an all-to-all broadcast using the procedure for
the linear array.

• In this phase, all nodes collect √p messages corresponding to the √p nodes
of their respective rows. Each node consolidates this information into a
single message of size m√p.

• The second communication phase is a columnwise all-to-all


broadcast of the consolidated messages.
All-to-all Broadcast on a Mesh

All-to-all broadcast on a 3 × 3 mesh. The groups of nodes


communicating with each other in each phase are enclosed by
dotted boundaries. By the end of the second phase, all nodes
get (0, 1, ..., 8) (that is, a message from each node).
All-to-all Broadcast on a Mesh
1. procedure ALL TO ALL BC MESH(my id, my msg , p, result)
2. begin
/* Communication along rows */
3. left := my id − (my id mod √p) + (my id − 1) mod √p;
4. right := my id − (my id mod √p) + (my id + 1) mod √p;
5. result := my msg ;
6. msg := result;
7. for i := 1 to √p − 1 do
8. send msg to right;
9. receive msg from left;
10. result := result ∪ msg;
11. endfor;
/* Communication along columns */
12. up := (my id − √p) mod p;
13. down := (my id + √p) mod p;
14. msg := result;
15. for i := 1 to √p − 1 do
16. send msg to down;
17. receive msg from up;
18. result := result ∪ msg;
19. endfor;
20. end ALL TO ALL BC MESH
All-to-all broadcast on a Hypercube

• Generalization of the mesh algorithm to log p dimensions.

• Message size doubles at each of the log p steps.


All-to-all broadcast on a Hypercube

All-to-all broadcast on an eight-node hypercube.


All-to-all broadcast on a Hypercube

1. procedure ALL TO ALL BC HCUBE(my id, my msg , d, result)


2. begin
3. result := my msg ;
4. for i := 0 to d − 1 do
5. partner := my id XOR 2^i;
6. send result to partner;
7. receive msg from partner;
8. result := result ∪ msg;
9. endfor;
10. end ALL TO ALL BC HCUBE

All-to-all broadcast on a d-dimensional hypercube.
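The sketch below is a hedged C/MPI transcription of this procedure; each step exchanges the accumulated buffer with the partner across one dimension, so the message size doubles as described above. It assumes p = 2^d processes and m ints per process, and the blocks end up in a hypercube-dependent order rather than rank order (MPI_Allgather would deliver rank order directly).

/* Sketch: all-to-all broadcast on a hypercube; result must hold p*m ints. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void all_to_all_bc_hcube(const int *my_msg, int m, int *result, MPI_Comm comm)
{
    int my_id, p;
    MPI_Comm_rank(comm, &my_id);
    MPI_Comm_size(comm, &p);        /* assumed to be a power of two */

    memcpy(result, my_msg, m * sizeof(int));
    int cur = m;                    /* words accumulated so far */
    int *recvbuf = malloc(p * m * sizeof(int));

    for (int dim = 1; dim < p; dim <<= 1) {    /* one step per dimension */
        int partner = my_id ^ dim;
        MPI_Sendrecv(result, cur, MPI_INT, partner, 0,
                     recvbuf, cur, MPI_INT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        memcpy(result + cur, recvbuf, cur * sizeof(int));  /* append partner's data */
        cur *= 2;                   /* message size doubles each step */
    }
    free(recvbuf);
}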


All-to-all Reduction

• Similar communication pattern to all-to-all broadcast, except


in the reverse order.

• On receiving a message, a node must combine it with the local


copy of the message that has the same destination as the
received message before forwarding the combined message
to the next neighbor.
Cost Analysis

• On a ring, the time is given by: (ts + tw m)(p − 1).



• On a mesh, the time is given by: 2ts(√p − 1) + tw m(p − 1).

• On a hypercube, we have:

T = Σ_{i=1}^{log p} (ts + 2^{i−1} tw m)
= ts log p + tw m(p − 1). (2)
All-to-all broadcast: Notes

• All of the algorithms presented above are asymptotically


optimal in message size.

• It is not possible to port algorithms for higher dimensional


networks (such as a hypercube) into a ring because this would
cause contention.
All-to-all broadcast: Notes

Contention for a channel when the hypercube is mapped onto


a ring.
All-Reduce and Prefix-Sum Operations

• In all-reduce, each node starts with a buffer of size m and the


final results of the operation are identical buffers of size m on
each node that are formed by combining the original p buffers
using an associative operator.

• Identical to an all-to-one reduction followed by a one-to-all broadcast, but
this formulation is not the most efficient. A better approach uses the
communication pattern of all-to-all broadcast instead; the only difference is
that the message size does not grow at each step. The time for this
operation is (ts + tw m) log p.

• Different from all-to-all reduction, in which p simultaneous all-to-


one reductions take place, each with a different destination for
the result.
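MPI exposes all-reduce directly as MPI_Allreduce; the minimal sketch below (an illustration with an assumed one-word buffer) sums one value per process and leaves the identical result everywhere.

/* Minimal sketch: all-reduce with MPI. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank;   /* each node starts with its own buffer */
    double global;
    /* Combine all p buffers with +; every node receives the same result. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: sum = %g\n", rank, global);

    MPI_Finalize();
    return 0;
}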
The Prefix-Sum Operation

• Given p numbers n_0, n_1, . . . , n_{p−1} (one on each node), the problem is
to compute the sums s_k = Σ_{i=0}^{k} n_i for all k between 0 and p − 1.

• Initially, n_k resides on the node labeled k, and at the end of the
procedure, the same node holds s_k.
The Prefix-Sum Operation

Computing prefix sums on an eight-node hypercube. At each


node, square brackets show the local prefix sum accumulated in
the result buffer and parentheses enclose the contents of the
outgoing message buffer for the next step.
The Prefix-Sum Operation

• The operation can be implemented using the all-to-all


broadcast kernel.

• We must account for the fact that in prefix sums the node with
label k uses information from only the k-node subset whose
labels are less than or equal to k.

• This is implemented using an additional result buffer. The


content of an incoming message is added to the result buffer
only if the message comes from a node with a smaller label
than the recipient node.

• The contents of the outgoing message (denoted by parentheses


in the figure) are updated with every incoming message.
The Prefix-Sum Operation

1. procedure PREFIX SUMS HCUBE(my id, my number, d, result)


2. begin
3. result := my number;
4. msg := result;
5. for i := 0 to d − 1 do
6. partner := my id XOR 2^i;
7. send msg to partner;
8. receive number from partner;
9. msg := msg + number;
10. if (partner < my id) then result := result + number;
11. endfor;
12. end PREFIX SUMS HCUBE

Prefix sums on a d-dimensional hypercube.
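A hedged C/MPI transcription of this procedure is given below for one integer per process (MPI also provides the same result through MPI_Scan); it assumes p = 2^d processes, and the names are illustrative.

/* Sketch of PREFIX SUMS HCUBE: returns the prefix sum s_k at node k. */
#include <mpi.h>

int prefix_sums_hcube(int my_number, MPI_Comm comm)
{
    int my_id, p;
    MPI_Comm_rank(comm, &my_id);
    MPI_Comm_size(comm, &p);       /* assumed to be a power of two */

    int result = my_number;        /* local prefix sum (the [..] buffer above) */
    int msg = result;              /* outgoing buffer (the (..) buffer above)  */

    for (int bit = 1; bit < p; bit <<= 1) {
        int partner = my_id ^ bit;
        int number;
        MPI_Sendrecv(&msg, 1, MPI_INT, partner, 0,
                     &number, 1, MPI_INT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        msg += number;                           /* outgoing buffer always accumulates */
        if (partner < my_id) result += number;   /* only smaller-labeled nodes contribute */
    }
    return result;
}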


Scatter and Gather

• In the scatter operation, a single node sends a unique message


of size m to every other node (also called a one-to-all
personalized communication).

• In the gather operation, a single node collects a unique


message from each node.

• While the scatter operation is fundamentally different from


broadcast, the algorithmic structure is similar, except for
differences in message sizes (messages get smaller in scatter
and stay constant in broadcast).

• The gather operation is exactly the inverse of the scatter


operation and can be executed as such.
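Both operations are single MPI calls; the sketch below (illustrative buffer sizes, root fixed at node 0) scatters one integer to each process and gathers one back.

/* Minimal sketch: scatter and gather with MPI, one int per process. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int *all = NULL;
    if (rank == 0) {                         /* only the source owns the full array */
        all = malloc(p * sizeof(int));
        for (int i = 0; i < p; i++) all[i] = 100 + i;
    }

    int mine;
    /* Scatter: node 0 sends a distinct element to every process. */
    MPI_Scatter(all, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);
    mine += rank;                            /* some local work */
    /* Gather: the exact inverse, collecting one element per process at node 0. */
    MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) free(all);
    MPI_Finalize();
    return 0;
}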
Gather and Scatter Operations

Scatter and gather operations.


Example of the Scatter Operation

The scatter operation on an eight-node hypercube.


Cost of Scatter and Gather

• There are log p steps; in each step, the machine size halves and
the data size halves.

• We have the time for this operation to be:

T = ts log p + tw m(p − 1). (3)

• This time holds for a linear array as well as a 2-D mesh.

• These times are asymptotically optimal in message size.


All-to-All Personalized Communication

• Each node has a distinct message of size m for every other


node.

• This is unlike all-to-all broadcast, in which each node sends the


same message to all other nodes.

• All-to-all personalized communication is also known as total


exchange.
All-to-All Personalized Communication

All-to-all personalized communication.


All-to-All Personalized Communication: Example

Consider the problem of transposing a matrix.

• Each processor contains one full row of the matrix.

• The transpose operation in this case is identical to an all-to-all


personalized communication operation.
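With one matrix row per process and p = n, the transpose is a single MPI_Alltoall; the sketch below is an illustration under that assumption, and the names are not from the text.

/* Hypothetical sketch: transpose an n x n matrix stored one row per process
   (p == n). On return, row[] holds row `rank` of the transposed matrix. */
#include <mpi.h>
#include <stdlib.h>

void transpose_rowwise(double *row, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    double *tmp = malloc(p * sizeof(double));
    /* Element j of my row goes to process j; block j received from process j
       is A[j][rank], i.e. element j of row `rank` of the transpose. */
    MPI_Alltoall(row, 1, MPI_DOUBLE, tmp, 1, MPI_DOUBLE, comm);
    for (int j = 0; j < p; j++) row[j] = tmp[j];
    free(tmp);
}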
All-to-All Personalized Communication: Example

All-to-all personalized communication in transposing a 4 × 4


matrix using four processes.
All-to-All Personalized Communication on a Ring

• Each node sends all pieces of data as one consolidated


message of size m(p − 1) to one of its neighbors.

• Each node extracts the information meant for it from the data
received, and forwards the remaining (p − 2) pieces of size m
each to the next node.

• The algorithm terminates in p − 1 steps.

• The size of the message reduces by m at each step.


All-to-All Personalized Communication on a Ring

All-to-all personalized communication on a six-node ring. The


label of each message is of the form {x, y}, where x is the label
of the node that originally owned the message, and y is the
label of the node that is the final destination of the message. The
label ({x1, y1}, {x2, y2}, . . . , {xn, yn}) indicates a message that is
formed by concatenating n individual messages.
All-to-All Personalized Communication on a Ring:
Cost

• We have p − 1 steps in all.

• In step i, the message size is m(p − i).

• The total time is given by:

T = Σ_{i=1}^{p−1} (ts + tw m(p − i))
= ts(p − 1) + tw m Σ_{i=1}^{p−1} i
= (ts + tw m p/2)(p − 1). (4)

• The tw term in this equation can be reduced by a factor of 2 by


communicating messages in both directions.
All-to-All Personalized Communication on a Mesh

• Each node first groups its p messages according to the columns


of their destination nodes.

• All-to-all personalized communication is performed independently in each
row with clustered messages of size m√p.

• Messages in each node are sorted again, this time according


to the rows of their destination nodes.

• All-to-all personalized communication is performed independently in each
column with clustered messages of size m√p.
All-to-All Personalized Communication on a Mesh

The distribution of messages at the beginning of each phase of


all-to-all personalized communication on a 3 × 3 mesh. At the
end of the second phase, node i has messages ({0,i}, . . . ,{8,i}),
where 0 ≤ i ≤ 8. The groups of nodes communicating together in
each phase are enclosed in dotted boundaries.
All-to-All Personalized Communication on a Mesh:
Cost


• Time for the first phase is identical to that in a ring with √p
processors, i.e., (ts + tw m p/2)(√p − 1).

• Time in the second phase is identical to the first phase.


Therefore, the total time is twice this, i.e.,
T = (2ts + tw m p)(√p − 1). (5)

• It can be shown that the time for the local rearrangement of messages is
much less than this communication time.
All-to-All Personalized Communication on a
Hypercube

• Generalize the mesh algorithm to log p steps.

• At any stage in all-to-all personalized communication, every


node holds p packets of size m each.

• While communicating in a particular dimension, every node


sends p/2 of these packets (consolidated as one message).

• A node must rearrange its messages locally before each of the


log p communication steps.
All-to-All Personalized Communication on a
Hypercube

An all-to-all personalized communication algorithm on a


three-dimensional hypercube.
All-to-All Personalized Communication on a
Hypercube: Cost

• We have log p iterations and mp/2 words are communicated in


each iteration. Therefore, the cost is:

T = (ts + tw mp/2) log p. (6)

• This is not optimal!


All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm

• Each node simply performs p − 1 communication steps,


exchanging m words of data with a different node in every
step.

• A node must choose its communication partner in each step


so that the hypercube links do not suffer congestion.

• In the jth communication step, node i exchanges data with


node (i XOR j).

• In this schedule, all paths in every communication step are


congestion-free, and none of the bidirectional links carry more
than one message in the same direction.
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm
Seven steps in all-to-all personalized communication on an
eight-node hypercube.
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm

1. procedure ALL TO ALL PERSONAL(d, my id)


2. begin
3. for i := 1 to 2^d − 1 do
4. begin
5. partner := my id XOR i;
6. send M_{my id, partner} to partner;
7. receive M_{partner, my id} from partner;
8. endfor;
9. end ALL TO ALL PERSONAL

A procedure to perform all-to-all personalized communication on a


d-dimensional hypercube. The message M_{i,j} initially resides on node i and is
destined for node j.
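A hedged C/MPI transcription of this pairwise-exchange schedule is shown below, with blocks of m ints per destination; the buffer layout is an assumption for illustration, and MPI_Alltoall provides the same operation as a single collective.

/* Sketch: optimal all-to-all personalized communication on a hypercube.
   sendbuf holds p blocks of m ints (block j destined for process j);
   recvbuf receives p blocks of m ints (block j coming from process j).
   Assumes p is a power of two. */
#include <mpi.h>
#include <string.h>

void all_to_all_personal(const int *sendbuf, int *recvbuf, int m, MPI_Comm comm)
{
    int my_id, p;
    MPI_Comm_rank(comm, &my_id);
    MPI_Comm_size(comm, &p);

    /* Keep the block addressed to myself. */
    memcpy(recvbuf + my_id * m, sendbuf + my_id * m, m * sizeof(int));

    for (int i = 1; i < p; i++) {          /* p - 1 congestion-free exchanges */
        int partner = my_id ^ i;           /* in step i, exchange with node (my_id XOR i) */
        MPI_Sendrecv(sendbuf + partner * m, m, MPI_INT, partner, 0,
                     recvbuf + partner * m, m, MPI_INT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}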
All-to-All Personalized Communication on a
Hypercube: Cost Analysis of Optimal Algorithm

• There are p − 1 steps and each step involves non-congesting message


transfer of m words.
• We have:
T = (ts + tw m)(p − 1). (7)
• This is asymptotically optimal in message size.
Circular Shift

• A special permutation in which node i sends a data packet to node (i + q)


mod p in a p-node ensemble (0 < q < p).
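On any topology, a circular q-shift can be expressed as one simultaneous send and receive per node; the sketch below (illustrative, one int per node, assuming 0 < q < p as above) uses MPI_Sendrecv so that all p transfers proceed without deadlock.

/* Minimal sketch: circular q-shift of one int per process, 0 < q < p. */
#include <mpi.h>

int circular_shift(int value, int q, MPI_Comm comm)
{
    int rank, p, shifted;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int dest = (rank + q) % p;          /* node i sends to node (i + q) mod p   */
    int src  = (rank - q + p) % p;      /* and receives from node (i - q) mod p */
    MPI_Sendrecv(&value, 1, MPI_INT, dest, 0,
                 &shifted, 1, MPI_INT, src, 0,
                 comm, MPI_STATUS_IGNORE);
    return shifted;
}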
Circular Shift on a Mesh

• The implementation on a ring is rather intuitive. It can be performed in


min{q , p − q } neighbor communications.
• Mesh algorithms follow from this as well. We shift in one direction (all
processors) followed by the next direction.
• The associated time has an upper bound of:

T = (ts + tw m)(√p + 1).
Circular Shift on a Mesh

The communication steps in a circular 5-shift on a 4 × 4 mesh.


Circular Shift on a Hypercube

• Map a linear array with 2^d nodes onto a d-dimensional hypercube.


• To perform a q-shift, we expand q as a sum of distinct powers of 2.
• If q is the sum of s distinct powers of 2, then the circular q-shift on a
hypercube is performed in s phases.
• The time for this is upper bounded by:

T = (ts + tw m)(2 log p − 1). (8)

• If E-cube routing is used, this time can be reduced to

T = ts + tw m. (9)
Circular Shift on a Hypercube

The mapping of an eight-node linear array onto a three-dimensional


hypercube to perform a circular 5-shift as a combination of a 4-shift and a
1-shift.
Circular Shift on a Hypercube

Circular q-shifts on an 8-node hypercube for 1 ≤ q < 8.


Improving Performance of Operations

• Splitting and routing messages into parts: If the message can be split into p
parts, a one-to-all broadcast can be implemented as a scatter operation
followed by an all-to-all broadcast operation (see the sketch after this list).
The time for this is:
T = 2 × (ts log p + tw (p − 1) m/p)
≈ 2 × (ts log p + tw m). (10)

• All-to-one reduction can be performed by performing all-to-all reduction


(dual of all-to-all broadcast) followed by a gather operation (dual of
scatter).
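The first bullet above can be written almost verbatim with MPI collectives: a scatter distributes the p pieces and an all-gather (all-to-all broadcast) reassembles the full message everywhere. The sketch is illustrative only; the int payload and the buffer layout are assumptions.

/* Sketch: one-to-all broadcast of p*k ints implemented as scatter + allgather.
   buf must have room for p*k ints on every process; only the root's contents
   matter on entry, and every process holds the full message on return. */
#include <mpi.h>
#include <stdlib.h>

void split_and_route_bcast(int *buf, int k, int root, MPI_Comm comm)
{
    int *piece = malloc(k * sizeof(int));
    MPI_Scatter(buf, k, MPI_INT, piece, k, MPI_INT, root, comm);  /* ~ ts log p + tw m */
    MPI_Allgather(piece, k, MPI_INT, buf, k, MPI_INT, comm);      /* ~ ts log p + tw m */
    free(piece);
}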
Improving Performance of Operations

• Since an all-reduce operation is semantically equivalent to an all-to-one


reduction followed by a one-to-all broadcast, the asymptotically optimal
algorithms for these two operations can be used to construct a similar
algorithm for the all-reduce operation.
• The intervening gather and scatter operations cancel each other.
Therefore, an all-reduce operation requires an all-to-all reduction and an
all-to-all broadcast.
