05 Notes
[Figure: processor/memory nodes attached to a common interconnect]
Graph-theoretic Properties:
Bisection width = minimum number of "wires" that must be removed to get two "halves"
Bounded degree = all networks in the class have vertex-degree bounded by a constant
Rules of Thumb: High bisection width and low diameter are good. Bounded degree networks are easier to build.
High symmetry - simpler code, complicated to build
Linear array
P0 P1 P2 P3 P4
Bisection width = 1
Diameter = N - 1
Convenient for VLSI signal processing and bit-level integer multiplication. The product h(x) = f(x)g(x) of the polynomials
f(x) = a_0 + a_1 x + a_2 x^2 + . . . + a_{n-1} x^{n-1} and g(x) = b_0 + b_1 x + b_2 x^2 + . . . + b_{n-1} x^{n-1} (the convolution of their
coefficient sequences) may be computed on a (2n-1)-node linear array by inputting a_{n-1}, a_{n-2}, a_{n-3}, . . ., a_0 into the left end and
b_0, b_1, b_2, . . ., b_{n-1} into the right end, both at the odd-numbered steps. When an a value and a b value arrive at a node together,
the values are multiplied and the product is added to the value which will be output by that node as a coefficient of h(x).
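The data movement can be checked with a small simulation. This is a minimal sketch, not the notes' own code: it assumes the a values enter at the left end and the b values at the right end, one node per step, and that node k accumulates coefficient h_k. The function name systolic_convolution is hypothetical.

```python
def systolic_convolution(a, b):
    """Simulate polynomial multiplication on a (2n-1)-node linear array."""
    n = len(a)
    assert len(b) == n and n >= 1
    nodes = 2 * n - 1
    acc = [0] * nodes      # node k accumulates coefficient h_k
    a_stream = {}          # position -> index of an a value, moving right
    b_stream = {}          # position -> index of a b value, moving left
    steps = 3 * n - 2      # the last pair (a_0, b_{n-1}) meets at step 3n-2
    for t in range(1, steps + 1):
        # Shift: a values move one node right, b values one node left.
        a_stream = {p + 1: i for p, i in a_stream.items() if p + 1 < nodes}
        b_stream = {p - 1: i for p, i in b_stream.items() if p - 1 >= 0}
        # Inject on odd steps: a_{n-1} first at the left, b_0 first at the right.
        j = (t - 1) // 2
        if t % 2 == 1 and j < n:
            a_stream[0] = n - 1 - j
            b_stream[nodes - 1] = j
        # Multiply-accumulate wherever an a and a b occupy the same node.
        for p, i in a_stream.items():
            if p in b_stream:
                acc[p] += a[i] * b[b_stream[p]]
    return acc


print(systolic_convolution([1, 2], [3, 4]))  # (1 + 2x)(3 + 4x)
```

Because the two streams are spaced two steps apart and move in opposite directions, every a_i meets every b_m exactly once, at node i + m.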
Trace for n = 4:
[Figure: steps 0 through 10 on nodes p0 through p6. At each step the sequence a0 • a1 • a2 • a3 is one node farther right and the sequence b0 • b1 • b2 • b3 is one node farther left.]
Ring - symmetric version of linear array. To avoid a long wraparound connection, the following arrangement may be used:
P0 P7 P1 P6 P2 P5 P3 P4
Mesh
[Figure: N-by-N mesh]
Routing is more difficult than on a linear array, and bottlenecks form around the center of the mesh.
Sorting can be used to handle routing (along with "perfect matching" in graphs). Can also use randomized approaches such
as Valiant-Brebner routing.
2-d Torus - adds wrap-around connections between P(i,N-1) and P(i,0), and also between P(N-1,j) and P(0,j).
An N-by-N torus (N > 4) has bisection width = if N even then 2N else 2(N+1),
diameter = if N even then N else N-1, 8N^2 automorphisms, and 1 vertex equivalence class (vec).
Complete binary tree
[Figure: complete binary tree of height h]
Bisection width = 1, diameter = 2*lg(N+1) - 2, automorphisms = 2^(2^h - 1), vecs = h + 1 (h = height, which is 3 in the example)
Leaf selection - number the leaves from 0 to 2^k - 1; the path to a leaf can be switched by sending the MSB of the leaf number first, followed by bits of decreasing
significance.
Butterfly
Processor P[i,j] (row i, column j) is connected to P[i,j+1] and also to the processor in the next column whose row address is i with the (j+1)-st most significant bit complemented.
[Figure: butterfly with rows 000 through 111 and columns 0 through 3]
Fat tree - generalization of the butterfly that has been used in several commercial systems.
Each non-root node has two parents, but each non-leaf node has degree children.
The depth of the fat tree indicates the number of levels past the root level (level 0).
Level i (0 ≤ i ≤ depth) has vertices labeled (i, j, k) where 0 ≤ j < degree^i and 0 ≤ k < 2^(depth-i).
Non-leaf vertex (i, j, k), i < depth, has children (i + 1, j•degree + p, ⌊k/2⌋) where 0 ≤ p < degree.
Hypercube
[Figure: 1-d, 2-d, 3-d, and 4-d hypercubes with numbered vertices]
Note: The cube-connected cycles network results from replacing each vertex of a k-d hypercube by a ring of k vertices.
Derivation of automorphisms:
Hamiltonian cycle of processors - use reflected Gray code (address = i xor (i >>1) trick)
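The Gray code trick can be checked directly. The sketch below (function names are illustrative, not from the notes) builds the sequence i xor (i >> 1) and verifies that consecutive addresses, including the wraparound from last to first, differ in exactly one bit, so the sequence is a Hamiltonian cycle of the hypercube.

```python
def gray(i):
    """Reflected Gray code of i."""
    return i ^ (i >> 1)


def hamiltonian_cycle(k):
    """Visit order of the 2**k hypercube addresses."""
    return [gray(i) for i in range(1 << k)]


def is_hypercube_cycle(cycle):
    """True if every address appears once and neighbors differ in one bit."""
    n = len(cycle)
    for idx in range(n):
        diff = cycle[idx] ^ cycle[(idx + 1) % n]
        # diff must be a nonzero power of two (exactly one bit set).
        if diff == 0 or diff & (diff - 1):
            return False
    return len(set(cycle)) == n


print([format(v, "03b") for v in hamiltonian_cycle(3)])
```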
Benes Network: A multi-stage switching network variation of the butterfly/hypercube with (2k+1)2^k binary switches. We look at this
network to get an initial understanding of static (permutation) routing and then examine the same concepts for hypercubes.
[Figure: Benes network on rows 0 through 7 shown as a forward butterfly joined to a backward butterfly (label: LSB complements), and the form in which it is commonly written]
Switches can be set to give an "edge-disjoint" path for any permutation. For the permutation
( 0 1 2 3 4 5 6 7
  5 3 4 7 0 1 2 6 )
we get:
[Figure: switch settings and paths on rows 0 through 7 for this permutation]
The existence of a switch setting for each permutation can be proven by viewing the Benes network as a recursive structure:
[Figure: recursive view of the Benes network: outer switch columns on the left and right enclosing an UPPER NETWORK and a LOWER NETWORK]
If we can successfully route the packets of an n-permutation problem to the upper and lower networks, we then have two n/2-permutation problems.
2-permutations are trivially routed. n-permutation problems are always solvable based on Hall’s Matching Theorem:
A 2N-node bipartite graph G=(U,V,E) with |U| = |V| = N has a perfect matching if and only if for all subsets S ⊆ U, |N(S)| ≥ |S|, where N(S) denotes the
nodes in V that are adjacent to a node in S.
Corollary: If all vertices in a bipartite graph are incident on k edges, then there are k edge-disjoint perfect matchings. (The set of k
matchings may not be unique. The number of perfect matchings equals the permanent of the binary adjacency matrix for the
graph.)
Translation:
bipartite: two-colorable
U: Nodes of color 1
V: Nodes of color 2
Perfect matching: Set of edges that will include all vertices, but no vertex is on two of the edges
Condition will be satisfied by routing graph, since each node is incident to two edges.
All "outside" switches have two packets to be routed to the other side.
Example:
( 0 1 2 3 4 5 6 7
  5 3 4 7 0 1 2 6 )
[Figure: routing graph with input switches 0/1, 2/3, 4/5, 6/7 on the left and output switches 0/1, 2/3, 4/5, 6/7 on the right, one edge per packet]
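This gives the following assignment (dark lines in the graph correspond to using the upper network):
[Figure: routing graph with the upper-network edges drawn dark, and the resulting paths through the outer switch columns]
The upper routing problem is
( 0 4 3 6
  5 0 7 2 )
and the lower routing problem is
( 1 5 2 7
  3 1 4 6 )
[Figure: routing graphs and paths for the two subproblems, and the completed paths on rows 0 through 7]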
Even though trial-and-error is usually sufficient for small problems, the notion of an alternating path facilitates the task. Hopcroft and
Karp developed an extremely efficient algorithm based on this concept (it combines breadth-first and depth-first search) that runs in O(|E| √(|V|)) time, but it is not suitable
for "hand tracing". (Even more remarkably, Micali and Vazirani obtained the same bound for matching in general graphs.)
The algorithm is based on incrementally increasing the size of the matching. It starts with an initial deficient
matching (a single edge is fine, or we may greedily attempt to insert each edge into the matching without backtracking, never removing a
vertex once it is matched). We then search for a path with the following properties:
1. The starting and terminating vertices are different and are not included in the previous matching.
2. The path alternates between k+1 edges that are not in the matching and k edges that are in the matching, so it begins and ends with an edge that is not in the matching.
The new matching is obtained by removing the edges from the previous matching that are on the alternating path and then
including the alternating path edges that were not in the previous matching. This enlarges the matching by one edge.
NOTE: Often a simple alternating path is just a single edge between two vertices not in the previous matching!
Example:
[Figure: routing graphs on switch pairs 0/1, 2/3, 4/5, 6/7 at successive stages as the matching is enlarged]
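The alternating-path idea can be sketched in a few lines. This is the simple one-path-at-a-time depth-first search (often attributed to Kuhn), not the Hopcroft-Karp algorithm; the function name and the adjacency-list input format are assumptions made for illustration.

```python
def max_bipartite_matching(adj, num_right):
    """adj[u] lists the right-side vertices adjacent to left vertex u."""
    match_right = [None] * num_right  # match_right[v] = left vertex matched to v

    def augment(u, visited):
        # Depth-first search for an alternating path that starts at u.
        for v in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            # Either v is unmatched, or the left vertex currently matched
            # to v can be moved to some other right vertex.
            if match_right[v] is None or augment(match_right[v], visited):
                match_right[v] = u  # swap matched/unmatched edges on the path
                return True
        return False

    size = 0
    for u in range(len(adj)):
        if augment(u, set()):
            size += 1
    return size, match_right


size, match_right = max_bipartite_matching([[0, 1], [0], [1, 2]], 3)
print(size, match_right)
```

In the usage line, left vertex 1 can only use right vertex 0, so the search re-routes left vertex 0 along an alternating path before matching it.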
The same ideas apply to a butterfly/hypercube, but packets will go both ways on an edge at different times:
[Figure: butterfly on rows 0 through 7]
Pairs of processors act as switches in each layer. First (leftmost) level: (0,4),(1,5),(2,6),(3,7) Second layer: (0,2),(1,3),(4,6),(5,7)
Interpretation:
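[Figure: each pair setting is read as an Upper/Lower choice left of the center column combined with an Upper/Lower choice right of the center column: Upper-Upper, Lower-Upper, Upper-Lower, Lower-Lower]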
Permutation routing is again based on perfect matching. In each pair, one processor takes upper and the other takes lower. This gives
vertex-disjoint paths.
Example: Route
( 0 1 2 3 4 5 6 7
  3 7 6 0 2 1 4 5 )
Routing graph:
[Figure: routing graph on processor pairs 0/4, 1/5, 2/6, 3/7, one edge per packet]
Paths:
[Figure: first-level paths on rows 0 through 7]
The upper routing problem is
( 0 5 2 3
  3 1 6 0 )
and the lower routing problem is
( 4 1 6 7
  2 7 4 5 )
Routing graphs:
[Figure: routing graphs for the upper and lower subproblems]
Corresponding paths:
[Figure: complete vertex-disjoint paths on rows 0 through 7]
Given an arbitrary permutation for packets on a 2-d mesh, perfect matching (as in hypercubic networks) may be used to achieve a
static routing. The algorithm gives a (3N - 3)-step strategy for routing a given permutation on an N x N mesh. The result can be
generalized to multidimensional meshes. The algorithm has three phases; only the first phase must be precomputed:
Phase 1: Permute the packets within each column so that at most one packet in each row is destined for each column via
perfect matching
Phase 2: Route each packet within its row to the correct column
Phase 3: Route each packet within its column to its final destination
Example:
Each entry is the row,column destination of the packet that starts at that position:
       0    1    2
0     2,1  1,2  0,2
1     2,2  0,0  1,0
2     0,1  2,0  1,1
The routing graph has one edge per packet, based on the starting column and the destination column for each:
[Figure: bipartite routing graph from start columns 0, 1, 2 to destination columns 0, 1, 2, with each edge labeled by its packet's destination]
Three perfect matchings are then derived. All packets in the ith matching are routed to row i in the first phase.
[Figure: the routing graph split into Match 1, Match 2, and Match 3]
After the first phase:
       0    1    2
0     2,1  1,2  1,0
1     0,1  0,0  0,2
2     2,2  2,0  1,1
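The three phases can be sketched on the 3 x 3 example. This is an illustrative simulation under stated assumptions: it tracks only where packets end up after each phase (not the step count), it finds each perfect matching with a simple augmenting-path search, and the names perfect_matching and route are hypothetical.

```python
def perfect_matching(edges, n):
    """edges: list of (start_col, dest_col, packet). Returns one edge per
    destination column so that all start columns are also distinct."""
    by_col = [[e for e in edges if e[0] == c] for c in range(n)]
    owner = [None] * n  # owner[dest_col] = edge currently chosen for it

    def augment(c, seen):
        for e in by_col[c]:
            d = e[1]
            if d in seen:
                continue
            seen.add(d)
            if owner[d] is None or augment(owner[d][0], seen):
                owner[d] = e
                return True
        return False

    for c in range(n):
        if not augment(c, set()):
            raise ValueError("no perfect matching")
    return owner


def route(dest):
    """dest[r][c] = (row, col) destination of the packet starting at (r, c)."""
    n = len(dest)
    edges = [(c, dest[r][c][1], dest[r][c]) for r in range(n) for c in range(n)]
    # Phase 1: packets of the i-th matching move within their columns to row i.
    grid = [[None] * n for _ in range(n)]
    for i in range(n):
        for e in perfect_matching(edges, n):
            grid[i][e[0]] = e[2]
            edges.remove(e)
    after_phase1 = [row[:] for row in grid]
    # Phase 2: within each row, every packet moves to its destination column.
    grid = [sorted(row, key=lambda p: p[1]) for row in grid]
    # Phase 3: within each column, every packet moves to its destination row.
    cols = [sorted((grid[r][c] for r in range(n)), key=lambda p: p[0])
            for c in range(n)]
    final = [[cols[c][r] for c in range(n)] for r in range(n)]
    return after_phase1, final


example = [[(2, 1), (1, 2), (0, 2)],
           [(2, 2), (0, 0), (1, 0)],
           [(0, 1), (2, 0), (1, 1)]]
phase1, final = route(example)
print(phase1)
print(final)
```

The matchings found by the search need not be the same three shown in the notes; any decomposition into perfect matchings gives rows whose destination columns are all distinct, which is all that phase 2 needs.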
Special case - each processor is the destination for a single packet, but a processor may be the source for multiple
packets
Inverse problem - each processor is the source for a single packet, but a processor may be the destination for
multiple packets
Modification - always choose the packet that has the farthest to go, still takes no more than n - 1 steps
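[Figure: trace on a six-node linear array (nodes 0 through 5), packets labeled by destination. Successive steps group the packets as 245 | 13 | 0, then 24 | 15 | 3 | 0, then 2 | 14 | 5 | 03, then 12 | 04 | 35, then 01 | 2 | 34 | 5, and finally 0 1 2 3 4 5 with every packet at its destination.]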
Mesh (n-by-n)
Routing for y-dimension takes n - 1 steps using result from linear arrays.
The linear array result also applies when each node starts with one packet, but
a node may receive multiple packets - like in the x-dimension
Takes 2n - 2 steps, but may queue almost 2n/3 packets, consider the situation:
Each of the three portions of the first two rows has about n/3 elements destined for column n/3. At mesh
node (2,n/3), for each element that leaves (out the bottom) two additional elements will be queued
Hypercube
Algorithm: Each processor scans from low-order to high-order bits, comparing its own address with the destination address
until a mismatch is found, then sends the packet out along the edge for that bit.
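A minimal sketch of this bit-scanning rule follows; the function names are illustrative. It also counts how many packets pass through node 0 when every source address uses only the low-order k bits and every destination uses only the high-order k bits of a 2k-bit address, which is one way to see how this rule can concentrate traffic.

```python
def ecube_path(src, dst, bits):
    """Nodes visited when mismatched bits are fixed from low order to high order."""
    path = [src]
    node = src
    for d in range(bits):
        if ((node ^ dst) >> d) & 1:  # bit d still differs from the destination
            node ^= 1 << d           # traverse the dimension-d edge
            path.append(node)
    return path


def packets_through_zero(k, perm):
    """Source x (low k bits) sends to perm[x] placed in the high k bits."""
    count = 0
    for x in range(1 << k):
        if 0 in ecube_path(x, perm[x] << k, 2 * k):
            count += 1
    return count


print(ecube_path(0b001, 0b110, 3))
print(packets_through_zero(3, [7, 6, 5, 4, 3, 2, 1, 0]))
```

Under this rule each such packet first clears its low-order bits, so it reaches address 0...0 before any high-order bit is set.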
Example: Suppose processors with addresses of form 0...0, <x> send to processors with addresses of form
<y>, 0...0 (0...0, <x>, and <y> each have k bits).
Switching Techniques
Circuit switching: Physical switches are set to reserve entire path for full bandwidth.
Store-and-forward (packet switching): A packet of a message is forwarded to next processor on path only after the entire
packet has been received. Path is not established up front.
Virtual cut-through: If the next router along the path has input buffer space, then the flits in a packet are pipelined through. Otherwise,
the current router has sufficient space to store the entire packet.
Wormhole routing: Uses pipelining, but has smaller buffers and uses flow-control to stall the flits of a packet, possibly
among several routers. Prone to deadlock, so physical channels are divided into virtual channels that allow sufficient
progress as long as cycle(s) of virtual channels are avoided.
Broadcasting Models:
rootWork = 0;
for (receiveDim = 0; receiveDim < k; receiveDim++)
{
    if (bit receiveDim of rank == 1)
        set bit receiveDim of rootWork;
    if (rootWork == rank)
        break;
}
/* receiveDim is now the position of the highest 1 bit of rank (0 for rank 0) */
for (i = 0; i < k; i++)
    if (i >= receiveDim)
        if (bit i of rank is 0)
            send data value to node rank + 2^i;
        else
            receive data value from node rank - 2^i;
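The pseudocode can be exercised with a sequential simulation. This is a sketch under the assumption that node 0 holds the value initially and that all nodes execute iteration i of the second loop in the same round; the names receive_dim and broadcast are hypothetical.

```python
def receive_dim(rank, k):
    """Dimension on which `rank` receives (first loop of the pseudocode)."""
    root_work = 0
    for d in range(k):
        if (rank >> d) & 1:
            root_work |= 1 << d
        if root_work == rank:
            return d
    return k  # only reached when k == 0


def broadcast(k, value):
    """Simulate the k rounds; returns the value held by each of the 2**k nodes."""
    n = 1 << k
    data = [None] * n
    data[0] = value  # assumption: node 0 is the root of the broadcast
    rdim = [receive_dim(r, k) for r in range(n)]
    for i in range(k):
        for rank in range(n):
            if i >= rdim[rank] and not (rank >> i) & 1:
                partner = rank + (1 << i)
                assert data[rank] is not None  # sender already has the value
                assert rdim[partner] == i      # partner receives in this round
                data[partner] = data[rank]
    return data


print(broadcast(3, "x"))
```

The two internal assertions check that every send is matched by a receive in the same round and that no node forwards a value it has not yet received.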
Lower bounds:
1. Diameter of hypercube = k
2. In a given round, a processor may receive up to k messages. Each processor needs to receive 2^k - 1 messages, so
at least ceiling((2^k - 1)/k) rounds are needed.
 k    ceiling((2^k - 1)/k)
 2    2
 3    3
 4    4
 5    7
 6    11
 7    19
 8    32
 9    57
10    103
11    187
12    342
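The table entries can be reproduced with integer arithmetic. A small sketch (the function name is illustrative):

```python
def rounds_lower_bound(k):
    """ceiling((2**k - 1) / k) computed without floating point."""
    return -(-(2 ** k - 1) // k)


for k in range(2, 13):
    print(k, rounds_lower_bound(k))
```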
Algorithm: Design a restricted one-to-all broadcast tree that can be easily replicated with any processor as the root of the
broadcast. The algorithm will use no more than one link in each dimension in each round of communication.
First, we construct a graph that uses the necklace notion to group together processors whose addresses are the same under
bitwise rotation. Two necklaces will be connected by an edge if some processor for one necklace is adjacent to some
processor for the other necklace. For example, we use k = 6:
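[Figure: necklace graph for k = 6. Each necklace is labeled representative/size, from 000000/1 and 000001/6 at the top to 011111/6 and 111111/1 at the bottom.]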
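The necklace grouping and the necklace graph can be computed by brute force. This sketch picks the smallest rotation of each address as the representative, which is an assumption about labeling that happens to agree with labels such as 000001/6; the function names are hypothetical.

```python
def rotations(addr, k):
    """All bitwise rotations of a k-bit address."""
    mask = (1 << k) - 1
    return {((addr << r) | (addr >> (k - r))) & mask for r in range(k)}


def necklaces(k):
    """Map each representative (smallest rotation) to its set of addresses."""
    groups = {}
    for addr in range(1 << k):
        groups.setdefault(min(rotations(addr, k)), set()).add(addr)
    return groups


def necklace_graph(k):
    """Necklaces are joined when some member of one is adjacent to a member of the other."""
    groups = necklaces(k)
    rep_of = {a: rep for rep, members in groups.items() for a in members}
    edges = set()
    for a in range(1 << k):
        for d in range(k):
            u, v = rep_of[a], rep_of[a ^ (1 << d)]
            edges.add((min(u, v), max(u, v)))
    return groups, edges


groups, edges = necklace_graph(6)
for rep in sorted(groups):
    print(format(rep, "06b") + "/" + str(len(groups[rep])))
```

For k = 6 this yields 14 necklaces, 9 of which contain 6 processors; the remaining ones have 1, 2, or 3 processors.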
A broadcast tree is designed for the value at processor 000000 by using depth-first search (breadth-first is also fine) to
navigate among necklaces with k processors:
[Figure: broadcast tree over the necklaces, from 000000/1 through 000001/6 down to 011111/6 and 111111/1, with tree edges numbered by round]
These 9 edges give the first 9 rounds of communication. The underlined bits emphasize that only one link for each
dimension is used in each round.
Round 1: 000000 → 000001
000000 → 000010
000000 → 000100
000000 → 001000
000000 → 010000
000000 → 100000
In rounds 10 and 11, the algorithm "cleans up" for the necklaces with < k processors:
Several details that guarantee the success of the clean-up rounds have been omitted. These rounds are only needed when k is
not prime. There is much flexibility in generating a solution, i.e. there are many alternate schemes.
The last detail is to replicate for broadcasts for other processors besides 000000. This is easily done by taking the bits in the
address of the broadcasting processor and exclusive or’ing this address onto the sending and receiving address for each of the
transmissions in all rounds.
The end result is that all links will be busy in every round except perhaps the last one.
Sorting Techniques:
Odd-even transposition sort uses n/2 rounds (each with two transposition steps) to sort n values on a linear array.
[Figure: eight-node linear array, positions 0 through 7]
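A sequential sketch of odd-even transposition sort on a linear array follows. Each inner pass models one parallel transposition step; the function name is illustrative, and for odd n the round count is rounded up.

```python
def odd_even_transposition_sort(values):
    """Sort with ceil(n/2) rounds of two compare-exchange steps each."""
    a = list(values)
    n = len(a)
    for _ in range((n + 1) // 2):
        for start in (0, 1):  # pairs (0,1),(2,3),... then pairs (1,2),(3,4),...
            for i in range(start, n - 1, 2):
                if a[i] > a[i + 1]:
                    a[i], a[i + 1] = a[i + 1], a[i]
    return a


print(odd_even_transposition_sort([7, 6, 5, 4, 3, 2, 1, 0]))
```

The compare-exchange pattern never depends on the data, so the algorithm is oblivious.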
Mesh Sorting
Unlike other topologies, there is no single "obvious" desirable way to place an ordered sequence onto a mesh for 2 dimensions, much
less higher dimensions. It is usually taken that given a good way to obtain a particular arrangement, it is not a problem to permute the
arrangement (based on static routing for meshes, discussed earlier). The following "shearsort" algorithm for 2-d meshes is practical
(uses odd-even transposition in rows or columns), but not optimal (misses by a logarithmic factor). The algorithm takes N (2log N +1)
steps where the mesh is N x N. The output is in "snakelike" ordering for rows.
Phases 1, 3, 5, ... , 2log(N) + 1 sort all rows (in Ο(N) parallel time for each phase)
Odd rows are sorted so that smaller numbers are at the left. Even rows are sorted so that smaller numbers are at the right.
Phases 2, 4, 6, ..., 2log(N) sort all columns (in Ο(N) parallel time for each phase)
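Example (each line shows the mesh before and after a row phase; a column phase takes each right-hand mesh to the next left-hand mesh):
10  2 12  8         2  8 10 12
16  5  1 14        16 14  5  1
 3  9  7 13         3  7  9 13
 6 15  4 11        15 11  6  4

 2  7  5  1         1  2  5  7
 3  8  6  4         8  6  4  3
15 11  9 12         9 11 12 15
16 14 10 13        16 14 13 10

 1  2  4  3         1  2  3  4
 8  6  5  7         8  7  6  5
 9 11 12 10         9 10 11 12
16 14 13 15        16 15 14 13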
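The phases can be sketched as follows. This is an illustrative sequential version: Python's built-in sort stands in for the odd-even transposition sort that each row or column would run in parallel, and the function name shearsort and the returned history list are assumptions for the example.

```python
import math


def shearsort(mesh):
    """Sort an N x N mesh into snakelike row order; returns (result, history)."""
    grid = [row[:] for row in mesh]
    n = len(grid)
    log_n = math.ceil(math.log2(n)) if n > 1 else 0
    history = []
    for phase in range(1, 2 * log_n + 2):
        if phase % 2 == 1:
            # Row phase: rows 1, 3, ... (1-based) ascend, rows 2, 4, ... descend.
            for r in range(n):
                grid[r].sort(reverse=(r % 2 == 1))
        else:
            # Column phase: every column is sorted with the smallest on top.
            for c in range(n):
                col = sorted(grid[r][c] for r in range(n))
                for r in range(n):
                    grid[r][c] = col[r]
        history.append([row[:] for row in grid])
    return grid, history


start = [[10, 2, 12, 8], [16, 5, 1, 14], [3, 9, 7, 13], [6, 15, 4, 11]]
result, history = shearsort(start)
for row in result:
    print(row)
```

For the 4 x 4 example, the five entries of history correspond to the meshes shown after each row and column phase.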
Why does it work? The algorithm is oblivious (since odd-even transposition is also oblivious). By the 0-1 sorting lemma, if an oblivious
algorithm will correctly sort any input sequence with 0s and 1s, then the algorithm will sort any sequence correctly. (The proof of the
0-1 sorting lemma shows that if an oblivious sort fails on some sequence, then there exists a sequence of 0s and 1s that will also make
the algorithm fail.)
Problems:
1. Route the following permutation on a Benes network and a hypercube (e.g. butterfly):
( 0 1 2 3 4 5 6 7
  4 5 6 7 0 1 2 3 )
2. Use Gray codes to show that a 64-node hypercube is isomorphic to a 4 x 4 x 4 torus.
3. Show how to achieve the following mesh permutation using perfect matching and linear array sorting.
4. Derive an all-to-all broadcast scheme for a 4-d hypercube similar to the one used for 6-d hypercubes.
5. How many automorphisms are there for a complete binary tree with h = 4?
6. How many vertex equivalence classes does a 4x5 torus have? A 4x5 mesh?
8. For purposes of this exercise only, let us define a class of networks to be scalable if a larger network in that class may be
constructed from smaller network(s) of that class by only including additional vertices and edges. In particular, the drastic measures
of removing edges or vertices are prohibited. Indicate which network classes are scalable and which are not.
9. Consider an r-dimensional array with N = N_1 = . . . = N_r and N odd. What is the bisection width?
1. Route the following permutation on a Benes network and a hypercube (e.g. butterfly):
( 0 1 2 3 4 5 6 7
  4 5 6 7 0 1 2 3 )
[Figure: Benes network solution: blank network on rows 0 through 7, routing graph on switch pairs 0/1, 2/3, 4/5, 6/7, and the resulting paths]
[Figure: butterfly/hypercube solution: routing graph on processor pairs 0/4, 1/5, 2/6, 3/7, second-level routing graphs, and the resulting paths on rows 0 through 7]
2. Use Gray codes to show that a 64-node hypercube is isomorphic to a 4 x 4 x 4 torus.
Each torus node has a three component address (x, y, z) where each component value is 0, 1, 2, or 3. To map to a hypercube,
each of the three components is mapped to two bits in the hypercube address using the 2-bit Gray code:
0 00
1 01
2 11
3 10
To see that adjacencies are preserved, consider torus node (1, 2, 3) and its neighbors:
(1, 2, 3) 011110
(0, 2, 3) 001110
(2, 2, 3) 111110
(1, 3, 3) 011010
(1, 1, 3) 010110
(1, 2, 2) 011111
(1, 2, 0) 011100
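The adjacency check can be done exhaustively for all 64 nodes. A small sketch (function names are illustrative) that maps each torus coordinate through the 2-bit Gray code and confirms that every torus edge lands on a hypercube edge:

```python
from itertools import product

GRAY2 = [0b00, 0b01, 0b11, 0b10]  # 2-bit Gray code for coordinate values 0..3


def to_hypercube(x, y, z):
    """6-bit hypercube address of torus node (x, y, z)."""
    return (GRAY2[x] << 4) | (GRAY2[y] << 2) | GRAY2[z]


def torus_neighbors(x, y, z):
    """The six neighbors of (x, y, z) in the 4 x 4 x 4 torus."""
    for axis in range(3):
        for step in (1, -1):
            p = [x, y, z]
            p[axis] = (p[axis] + step) % 4
            yield tuple(p)


def check_embedding():
    nodes = list(product(range(4), repeat=3))
    if len({to_hypercube(*p) for p in nodes}) != 64:
        return False  # the map must be one-to-one
    for p in nodes:
        for q in torus_neighbors(*p):
            diff = to_hypercube(*p) ^ to_hypercube(*q)
            if bin(diff).count("1") != 1:
                return False  # torus neighbors must differ in exactly one bit
    return True


print(format(to_hypercube(1, 2, 3), "06b"), check_embedding())
```

Since the map is a bijection, both graphs have 64 nodes of degree 6, and every torus edge maps to a hypercube edge, the two graphs are isomorphic.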
3. Show how to achieve the following mesh permutation using perfect matching and linear array sorting.
[Figure: the mesh permutation and its routing graph between start columns 1 through 4 and destination columns 1 through 4, decomposed into Matching 1, Matching 2, Matching 3, and Matching 4]

After routing within columns for the four matchings:
3,1  4,2  2,3  1,4
3,2  1,3  3,4  2,1
4,3  2,4  1,1  1,2
4,4  3,3  2,2  4,1

After sorting within rows based on column destination:
3,1  4,2  2,3  1,4
2,1  3,2  1,3  3,4
1,1  1,2  4,3  2,4
4,1  2,2  3,3  4,4

After sorting within columns based on row destination:
1,1  1,2  1,3  1,4
2,1  2,2  2,3  2,4
3,1  3,2  3,3  3,4
4,1  4,2  4,3  4,4
4. Derive an all-to-all broadcast scheme for a 4-d hypercube similar to the one used for 6-d hypercubes.
[Figure: broadcast tree over the necklaces for k = 4 (0000/1, 0001/4, 0011/4, 0101/2, 0111/4, 1111/1), with tree edges numbered by round 1 through 4]
5. How many automorphisms are there for a complete binary tree with h = 4?
Since each of the 15 parent nodes has two orientations for its children, there are 2^15 = 32768 automorphisms.
6. How many vertex equivalence classes does a 4x5 torus have? A 4x5 mesh?
Torus: 1
Mesh: 6
10
8. ... Indicate which network classes are scalable and which are not.
Scalable: linear array, mesh, binary tree, butterfly, fat tree, hypercube, Benes
9. Consider an r-dimensional array with N = N_1 = . . . = N_r and N odd. What is the bisection width?