0% found this document useful (0 votes)

8 views30 pages

05 Notes

The document discusses various approaches to processor/memory interconnects, highlighting indirect and direct methods. It details graph-theoretic properties such as bisection width, diameter, and symmetry, along with specific interconnection networks like linear arrays, rings, meshes, and hypercubes. Additionally, it covers routing techniques, including the Benes network and bipartite matching, emphasizing their significance in efficient data communication.

Uploaded by

engomma2025

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

8 views30 pages

05 Notes

Uploaded by

engomma2025

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 30

CSE 4351 Notes 5: Interconnection Networks and Routing

Two Major Approaches to Processor/Memory Interconnects:

Processor/
Processor Memory
Processor/ Memory Processor/
Memory
Memory

Processor Memory

Processor/ Processor/
Memory Memory

Processor Interconnect Memory

Processor/ Interconnect
. . Memory Processor/
Memory
. .
Processor/
. . Memory

Processor Memory Processor/ Processor/

Memory Memory

Indirect (Dynamic) Direct (Static)

Graph-theoretic Properties:

Bisection width = minimum number of "wires" that must be removed to get two "halves"

Diameter = maximum distance between any pair of processors

Bounded degree = all networks in the class have vertex-degree bounded by a constant

Symmetry = number of automorphisms, number of vertex equivalence classes (vecs)

Rules of Thumb: High bisection width and low diameter are good. Bounded degree networks are easier to build.
High symmetry - simpler code, complicated to build

Linear array

P0 P1 P2 P3 P4

Bisection width = 1

Diameter = N - 1

Automorphisms = 2, vecs = ceil(N/2)

Convenient for VLSI signal processing and bit-level integer multiplication. The convolution of two polynomials h(x) =
f(x)g(x) of f(x) = a0 + a1x + a2x2 + . . . + an-1xn-1 and g(x) = b0 + b1x + b2x2 + . . . + bn-1xn-1 may be computed on a 2n-1
node linear array by inputting an-1, an-2, an-3 . . . a0 into the left end at the odd numbered steps and b0, b1, b2 . . . bn-1 at the
odd numbered steps. When an a and a b value arrive at a node, the values are multiplied and added to the value which will be
output as a coefficient for h(x)
2
Trace for n = 4:

p0 p1 p2 p3 p4 p5 p6
----------------------------------------------------------------------------------------------------------------------------
a0 • a1 • a2 • a3
0 b0 • b1 • b2 • b3
----------------------------------------------------------------------------------------------------------------------------
a0 • a1 • a2 • a3
1 b0 • b1 • b2 • b3
----------------------------------------------------------------------------------------------------------------------------
a0 • a1 • a2 • a3
2 b0 • b1 • b2 • b3
----------------------------------------------------------------------------------------------------------------------------
a0 • a1 • a2 • a3
3 b0 • b1 • b2 • b3
----------------------------------------------------------------------------------------------------------------------------
a0 • a1 • a2 • a3
4 b0 • b1 • b2 • b3
----------------------------------------------------------------------------------------------------------------------------
a0 • a1 • a2 • a3
5 b0 • b1 • b2 • b3
----------------------------------------------------------------------------------------------------------------------------
a0 • a1 • a2 • a3
6 b0 • b1 • b2 • b3
----------------------------------------------------------------------------------------------------------------------------
a0 • a1 • a2 • a3
7 b0 • b1 • b2 • b3
----------------------------------------------------------------------------------------------------------------------------
a0 • a1 • a2 • a3
8 b0 • b1 • b2 • b3
----------------------------------------------------------------------------------------------------------------------------
a0 • a1 • a2 • a3
9 b0 • b1 • b2 • b3
----------------------------------------------------------------------------------------------------------------------------
a0 • a1 • a2 • a3
10 b0 • b1 • b2 • b3

INCREDIBLY EASY TO PERFORM ROUTING!!!

Ring - symmetric version of linear array. To avoid long wraparound connection, the following arrangement may be used:

P0 P7 P1 P6 P2 P5 P3 P4

Bisection width = 2, Diameter = N/2, Automorphisms = 2N, vecs = 1

Mesh
3
N

. . .

2-d most common, can have higher dimension

Bisection width = min(k,N) if max(k, N) is even and min(k, N) + 1, otherwise. Diameter = k + N - 2

Bisection width for higher dimension array:

Suppose dimensions are 2 ≤ N1 ≤ N2 ≤ . . . ≤ Nr.

If Nr is even, then bisection width is N1 • N2 • . . . • Nr-1

If Nr is odd, then bisection width is N1 • N2 • . . . • Nr-1 + bisection width of array N1 × N2 × . . . × Nr-1

Automorphisms = 8 and vecs ≈ N2/8 when N = k

ROUTING IS MORE DIFFICULT THAN LINEAR ARRAY. BOTTLENECKS AROUND CENTER OF MESH.

Sorting can be used to handle routing (along with "perfect matching" in graphs). Can also use randomized approaches such
as Valiant-Brebner routing.

2-d Torus - adds wrap-around connections between P(i,N-1) and P(i,0), also P(k-1,j) and P(0,j).
N-by-N torus (N > 4) has bisection width = if N even then 2N else 2(N+1),
diameter = if N even then N else N-1, 8N2 automorphisms, 1 vec

Complete Binary Tree

h
Bisection Width = 1 Diameter = 2*lg(N+1) - 2 Automorphisms = 22 -1, vecs = h + 1 (h=height, which is 3 in example)

Often the internal nodes are used just for communication

Leaf selection - number leaves from 0 to 2k - 1, can switch path to leaf by sending MSB first followed by decreasing
significance bits.

Butterfly

2k rows and k+1 columns

Processors in each row are connected as a linear array

Bisection width = 2k (just remove highest dimension edges)

k+1- 1
Automorphisms = 22 vecs = (k + 1)/2

Processor P[i,j] is also connected to processor in next column via complementing the (j+1)-st most significant bit
0 1 2 3
000

001

010

011

100

101

110

111

Fat tree - generalization of the butterfly that has been used in several commercial systems.

Each non-root node has two parents, but each non-leaf node has degree children.

The depth of the fat tree indicates the number of levels past the root level (level 0).

Level i (0 ≤ i ≤ depth) has vertices labeled (i, j, k) where 0 ≤ j < degreei and 0 ≤ k < 2depth-i.
5
Non-leaf vertex (i, j, k), i < depth has children (i + 1, j•degree + p, k/2) where 0 ≤ p < degree.

Example: depth = 3 and degree = 2:

0 0,0 0,1 0,2 0,3

1 0,0 0,1 1,0 1,1

2 0,0 1,0 2,0 3,0

Example: depth = 4 and degree = 2:

0 0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,7

1 0,0 0,1 0,2 0,3 1,0 1,1 1,2 1,3

2 0,0 0,1 1,0 1,1 2,0 2,1 3,0 3,1

3 0,0 1,0 2,0 3,0 4,0 5,0 6,0 7,0

Hypercube

Squashes together each row of butterfly, giving 2k processors

Each processor is connected to k others - by complementing exactly one bit

0 1 4 5
0 1
1-d 2 3 0 1
2-d
2 3

6 3-d 7
6
4 5
0 1 12 13
2 3
8 9
11
6 7 10
14 15
4-d
Note: Cube connection results from replacing each vertex by a ring of k vertices

Bisection width = 2k - 1 Automorphisms = k!2k vecs = 1

Derivation of automorphisms:

1. Any one of the 2k vertices may be mapped to vertex 0.

2. Any dimension is equivalent to any other, since there are k dimensions there are k! permutations.

Hamiltonian cycle of processors - use reflected Gray code (address = i xor (i >>1) trick)

1-bit: 2-bits: 3-bits: 4-bits:

0 00 000 0000
1 01 001 0001
---- 011 0011
11 010 0010
10 -------- 0110
110 0111
111 0101
101 0100
100 ----------
1100
1101
1111
1110
1010
1011
1001
1000

Benes Network: A multi-stage switching network variation of the butterfly/hypercube with (2k+1)2k binary switches. We look at this
network to get an initial understanding of static (permutation) routing and then examine the same concepts for hypercubes.
7
LSB Complements

Forward Backward
Butterfly Butterfly

MSB Shared MSB

Complement Nodes Complement

Commonly written as

0 0
1 1

2 2
3 3

4 4
5 5

6 6
7 7

Switches can be set to give an "edge-disjoint" path for any permutation. For permutation

(0 1 2 34 5 6 7

5 3 4 70 1 2 6 ) we get:

0 0
1 1

2 2
3 3

4 4
5 5

6 6
7 7

The existence of a switch setting for each permutation can be proven by viewing the Benes network as a recursive structure:
8

UPPER .
.
NETWORK .
.
.
.

. LOWER .
. NETWORK .
. .

If we can successfully route a n-permutation problem to the upper and lower networks, we then have two n/2-permutation problems.
2-permutations are trivially routed. n-permutation problems are always solvable based on Hall’s Matching Theorem:

A 2N-node bipartite graph G=(U,V,E) has a perfect matching if and only if for all subsets S ⊆ U, |N(S)| > |S|, where N(S) denotes the
nodes in V that are adjacent to a node in S.

Corollary: If all vertices in a bipartite graph are incident on k edges, then there are k disjoint perfect matchings. (The set of k
matchings may not be unique. The number of perfect matchings is known as the permanent of the binary adjacency matrix for the
graph.)

Translation:

bipartite: two-colorable
U: Nodes of color 1
V: Nodes of color 2
Perfect matching: Set of edges that will include all vertices, but no vertex is on two of the edges
Condition will be satisfied by routing graph, since each node is incident to two edges.

All "outside" switches have two packets to be routed to the other side.

Example:
(
01234567

53470126 )
0/1 1 0 0/1
1
0
3
2/3 2 2/3
2
3
4 5
4/5 5 4 4/5

6 7
6/7 6/7
7 6
9
This gives (dark lines in graph correspond to using upper network)
0
0
0 0
1 1
4 5

3 2
2 2
3 7 3
6
1
4 1 4
5 5
5 4
2
6 3 6
7 7
7 6

and an upper routing problem of

( ) 0436

5072
and a lower routing problem of
( )
1527

3146

The matching problems for these are:

0 5 5 1
0/4 0/5 1/5 1/4
4 0 1 4

3 7 2 3

3/6 2/7 2/7 3/6

6 2 7 6
The recursive structure views these two problems as separate, but there is no difficulty in solving them as a single bipartite matching
problem. We now have:
0 0 5 0
0 0
1 1
4 3 7 5

3 4 0 2
2 2
3 7 3
6 6 2

1 1 4
4 1 4
5 5
5 2 3 4
2 5 1
6 3 6
7 7
7 7 6 6

The remaining four subnetworks (single switches) are trivial.

10
FINDING A MAXIMUM BIPARTITE MATCHING

Even though trial-and-error is usually sufficient for small problems, the notion of an alternating path facilitates the task. Hopcroft and
Karp developed an extremely efficient depth-first-search algorithm of this concept that runs in O(|E| √(|V|)) time, but it is not suitable
for "hand tracing". (Even more remarkably, Micali and Vazirani obtained the same bound for matching in general graphs).

The algorithm is based on incrementally increasing the size of the matching. The algorithm starts by using an initial deficient
matching (a single edge is fine or we may greedily attempt to insert each edge in the matching without backtracking by removing a
vertex). We then search for a path with the following properties:

1. The starting and terminating vertices are different and are not included in the previous matching.
2. The path alternates between k+1 edges that are not in the matching and k edges that are in the matching, i.e.

...

= Edge not in previous matching

= Edge included in previous matching

3. The new matching is obtained by removing the edges from the previous matching that are on the alternating path and then
including the alternating path edges that were not in the previous matching.

NOTE: Often a simple alternating path is just a single edge between two vertices not in the previous matching!

Sequential code and sample input files are in files bipartiteMatch*.

Example:

0/1 1 0 0/1 0/1 1 0 0/1 0/1 1 0 0/1 0/1 1 0 0/1

1 1 1 1
0 0 0 0
3 3 3 3
2/3 2 2 2/3 2/3 2 2 2/3 2/3 2 2 2/3 2/3 2 2 2/3
3 3 3 3
4 5 4 5 4 5 4 5
4/5 5 4 4/5 4/5 5 4 4/5 4/5 5 4 4/5 4/5 5 4 4/5

6 7 6 7 6 7 6 7
6/7 6/7 6/7 6/7 6/7 6/7 6/7 6/7
7 6 7 6 7 6 7 6
11
0/1 1 0 0/1 0/1 1 0 0/1 0/1 0/1
1 0
1 1 1
0 0 0
3 3 3
2/3 2 2 2/3 2/3 2 2 2/3 2/3 2 2 2/3
3 3 3
4 5 4 5 4 5
4/5 5 4 4/5 4/5 5 4 4/5 4/5 5 4 4/5

6 7 6 7 6 7
6/7 6/7 6/7 6/7 6/7 6/7
7 6 7 6 7 6

The same ideas apply to a butterfly/hypercube, but will go both ways on an edge at different times:

0 0

1 1

2 2

3 3

4 4

5 5

6 6

7 7

Pairs of processors act as switches in each layer. First (leftmost) level: (0,4),(1,5),(2,6),(3,7) Second layer: (0,2),(1,3),(4,6),(5,7)
Interpretation:
12
Upper Upper

Lower Upper
Upper Lower

Lower Lower
Left of Center Column Right of Center Column

Permutation routing is again based on perfect matching. In each pair, one processor takes upper, the other takes lower. Gives a
vertex-disjoint path.

Example: Route
( 0 12 3 4 5 6 7

3 76 0 2 1 4 5 )
4 4
0/4 0/4

0 0

5 1
1/5 1/5
1 5
2
6
2/6 2/6
2 6

3 3
3/7 7 7 3/7
Routing Graph:

Paths:
13
0 0
0 0

5 1
1 1

2 6
2 2

3 3
3 3

4 4
4 4

1 5
5 5

6 2
6 6

7 7
7 7

Upper routing problem is:

( )
0523

3160
. Lower routing problem is:
( )4167

2745
.

2 6 6 4
0/2 0/6 4/6 2/4
0 0 4 2

3 3 7 5
3/5 1/3 1/7 5/7
Routing graphs: 5 1 1 7

Corresponding paths:
14
0 2
6 0
0 0

5 5 1 1
1 1

2 0 0 6
2 2

3 3 4 3
3 3

4 4 2 4
4 4

1 1 7 5
5 5

6 6 4 2
6 6

7 7 5 7
7 7

Final paths are trivial:

15
0 2 2 6 6 0
0 0

5 5 5 1 1 1
1 1

2 0 3 0 6
0
2 2

3 3 30 3 3
3 3

4 4 4 2 2 4
4 4

1 1 1 7 7 5
5 5

6 6 6 4 4 2
6 6

7 7 7 5 5 7
7 7

Static Routing for Meshes

Given an arbitrary permutation for packets on a 2-d mesh, perfect matching (as in hypercubic networks) may be used to achieve a
static routing. The algorithm will give a 3N - 3 step strategy for routing a given permutation on a N x N mesh. The result can be
generalized to multidimensional meshes. The algorithm has three phases, only the first phase must be precomputed:

Phase 1: Permute the packets within each column so that at most one packet in each row is destined for each column via
perfect matching

Phase 2: Route each packet within its row to the correct column

Phase 3: Route each packet within its column to its final destination

Example:

Packets are initially located as follows, with destination indicated:

0 1 2
0 2,1 1,2 0,2
1 2,2 0,0 1,0
2 0,1 2,0 1,1

The routing graph has one edge per packet, based on the starting column and the destination column for each:
16
Start Destination
0 0
2,1 0,0
1,0

2,0
1 0,1 1
1,2
1,1
2,2

2 2
0,2

Three perfect matchings are then derived. All packets in the ith matching are routed to row i in the first phase.
Start Destination Start Destination Start Destination
0 2,1 0 0 0,0 0 0 0

1,0

2,0
1 1 1 0,1 1 1 1
1,2
1,1
2,2

2 2 2 2 2 2
0,2
Match 1 Match 2 Match 3

Thus, phase 1 will give:

0 1 2
0 2,1 1,2 1,0
1 0,1 0,0 0,2
2 2,2 2,0 1,1

Phase 2 routes within rows according to column destinations:

0 1 2

0 1,0 2,1 1,2

1 0,0 0,1 0,2
2 2,0 1,1 2,2

Phase 3 routes within columns according to row destinations:

17
0 1 2
0 0,0 0,1 0,2
1 1,0 1,1 1,2
2 2,0 2,1 2,2

Greedy ("Shortest Path") Routing

Routing decisions are made on-the-fly in a simple way

Linear Array - Completely solved by greedy approach

Special case - each processor is the destination for a single packet, but a processor may be the source for multiple
packets

Inverse problem - each processor is the source for a single packet, but a processor may be the destination for
multiple packets

Modification - always choose the packet that has the farthest to go, still takes no more than n - 1 steps

0 1 2 3 4 5

245 13 0

0 1 2 3 4 5

24 15 3 0

0 1 2 3 4 5

2 14 5 03

0 1 2 3 4 5

12 04 35

0 1 2 3 4 5

01 2 34 5

0 1 2 3 4 5

0 1 2 3 4 5
Mesh (n-by-n)

Algorithm: Linear routing in x-dimension, followed by linear routing in y-dimension:

18
Queueing: For a given edge, choose the packet which must go the farthest in that dimension

Routing for y-dimension takes n - 1 steps using result from linear arrays.

The linear array result also applies when each node starts with one packet, but
a node may receive multiple packets - like in the x-dimension

Takes 2n - 2 steps, but may queue almost 2n/3 packets, consider the situation:

Each of the three portions of the first two rows has about n/3 elements destined for column n/3. At mesh
node (2,n/3), for each element that leaves (out the bottom) two additional elements will be queued

Hypercube

Algorithm: Processor scans low-order to high-order comparing processor address and the destination address
until the a mismatch is found, then send out that edge

Example: Suppose processors with addresses of form 0...0, <x> send to processors with addresses of form
<y>, 0...0 (0...0, <x>, and <y> each have k bits).

PROBLEM: All √N packets must traverse processor 0...0,0...0

General solution to congestion: Valiant-Brebner Routing

1. Packet is greedily routed to a random intermediate destination.

2. Packet is greedily routed from intermediate destination to the real destination.

Switching Techniques

Circuit switching: Physical switches are set to reserve entire path for full bandwidth.

Store-and-forward (packet switching): A packet of a message is forwarded to next processor on path only after the entire
packet has been received. Path is not established up front.

Virtual cut-through: If next router along path has input buffer space, then the flits in a packet will be pipelined. Otherwise,
there is sufficient space to store an entire packet.

Wormhole routing: Uses pipelining, but has smaller buffers and uses flow-control to stall the flits of a packet, possibly
among several routers. Prone to deadlock, so physical channels are divided into virtual channels that allow sufficient
progress as long as cycle(s) of virtual channels are avoided.

Broadcasting Models:

One-to-all (ordinary broadcast) /All-to-all (‘‘gossiping’’) / Personalized (individualized messages)

Single-port/Multi-port - degree of communication concurrency for a processor

Mono-directional/Bi-directional - meaning of an edge

19
Example: One-to-all broadcast on k-dimensional hypercube with node 0 as root.

Diameter is lower bound on number of rounds.

rootWork=0;
for (receiveDim=0; receiveDim < k; receiveDim++)
{
if bit receiveDim of rank == 1
Set bit receiveDim of rootWork
if rootWork==rank
break;
}
for (i=0; i < k; i++)
if i ≥ receiveDim
if bit i of processorRank is 0
Send data value to node rank + 2i
else
Receive data value from node rank - 2i

Easily adapted for arbitrary processor to be the root.

Example: All-to-all broadcast on k-dimensional hypercube using all links simultaneously

Lower bounds:

1. Diameter of hypercube = k

2. In a given round, a processor may receive up to k messages. Each processor needs to receive 2k - 1 messages, so
at least ceiling((2k - 1)/k )rounds are needed.

Bound 2 is more significant

k ceiling((2k - 1)/k )

2 2
3 3
4 4
5 7
6 11
7 19
8 32
9 57
10 103
11 187
12 342

Algorithm: Design a restricted one-to-all broadcast tree that can be easily replicated with any processor as the root of the
broadcast. The algorithm will use no more than one link in each dimension in each round of communication..

First, we construct a graph that uses the necklace notion to group together processors whose addresses are the same under
bitwise rotation. Two necklaces will be connected by an edge if some processor for one necklace is adjacent to some
processor for the other necklace. For example, we use k = 6:
20
000000/1

000001/6

000011/6 000101/6 001001/3

000111/6 010101/2 001101/6 001011/6

010111/6 001111/6 011011/3

011111/6

111111/1

A broadcast tree is designed for the value at processor 000000 by using depth-first search (breadth-first is also fine) to
navigate among necklaces with k processors:

000000/1
1
000001/6
2

000011/6 000101/6 001001/3

8
3 9
000111/6 010101/2 001101/6 001011/6
7
4

010111/6 001111/6 011011/3

6
5
011111/6

111111/1

These 9 edges give the first 9 rounds of communication. The underlined bits emphasize that only one link for each
dimension is used in each round.
21
Round 1: 000000 → 000001
000000 → 000010
000000 → 000100
000000 → 001000
000000 → 010000
000000 → 100000

Round 2: 000001 → 000011

000010 → 000110
000100 → 001100
001000 → 011000
010000 → 110000
100000 → 100001

Round 3: 000011 → 000111

000110 → 001110
001100 → 011100
011000 → 111000
110000 → 110001
100001 → 100011

Round 4: 000111 → 010111

001110 → 101110
011100 → 011101
111000 → 111010
110001 → 110101
100011 → 101011

Round 5: 010111 → 011111

101110 → 111110
011101 → 111101
111010 → 111011
110101 → 110111
101011 → 101111

Round 6: 011111 → 001111

111110 → 011110
111101 → 111100
111011 → 111001
110111 → 110011
101111 → 100111

Round 7: 001111 → 001101

011110 → 011010
111100 → 110100
111001 → 101001
110011 → 010011
100111 → 100110

Round 8: 001101 → 000101

011010 → 001010
110100 → 010100
101001 → 101000
22
010011 → 010001
100110 → 100010

Round 9: 000101 → 100101

001010 → 001011
010100 → 010110
101000 → 101100
010001 → 011001
100010 → 110010

In rounds 10 and 11, the algorithm ‘‘cleans up’’ for the necklaces with < k processors:

Round 10: 001011 → 001001 Necklace 001011 to necklace 001001

010110 → 010010
101100 → 100100

001011 → 011011 Necklace 001011 to necklace 011011

010110 → 110110
101100 → 101101

Round 11: 000101 → 010101 Necklace 000101 to necklace 010101

001010 → 101010

111110 → 111111 Necklace 011111 to necklace 111111

Several details that guarantee the success of the clean-up rounds have been omitted. These rounds are only needed when k is
not prime. There is much flexibility in generating a solution, i.e. there are many alternate schemes.

The last detail is to replicate for broadcasts for other processors besides 000000. This is easily done by taking the bits in the
address of the broadcasting processor and exclusive or’ing this address onto the sending and receiving address for each of the
transmissions in all rounds.

The end result is that all links will be busy in every round except perhaps the last one.

Sorting Techniques:

Odd-Even Transposition Sort

Uses n/2 rounds (each with two transposition steps) to sort n values on linear array.

1↔6 7↔5 3↔4 2↔0

1 6↔5 7↔3 4↔0 2

1↔5 6↔3 7↔0 4↔2

1 5↔3 6↔0 7↔2 4

1↔3 5↔0 6↔2 7↔4

1 3↔0 5↔2 6↔4 7

1↔0 3↔2 5↔4 6↔7

0 1↔2 3↔4 5↔6 7

0 1 2 3 4 5 6 7
23
Mesh Sorting

Unlike other topologies, there is no single "obvious" desirable way to place an ordered sequence onto a mesh for 2 dimensions, much
less higher dimensions. It is usually taken that given a good way to obtain a particular arrangement, it is not a problem to permute the
arrangement (based on static routing for meshes, discussed earlier). The following "shearsort" algorithm for 2-d meshes is practical
(uses odd-even transposition in rows or columns), but not optimal (misses by a logarithmic factor). The algorithm takes N (2log N +1)
steps where the mesh is N x N. The output is in "snakelike" ordering for rows.

Phases 1, 3, 5, ... , 2log(N) + 1 sort all rows (in Ο(N) parallel time for each phase)

Odd rows are sorted so that smaller numbers are at the left. Even rows are sorted so that smaller numbers are at the right.

Phases 2, 4, 6, ..., 2log(N) sort all columns (in Ο(N) parallel time for each phase)

Columns are sorted with smaller numbers at the top

Example:

10 2 12 8 2 8 10 12

16 5 1 14 16 14 5 1

3 9 7 13 3 7 9 13

6 15 4 11 15 11 6 4

2 7 5 1 1 2 5 7

3 8 6 4 8 6 4 3

15 11 9 12 9 11 12 15

16 14 10 13 16 14 13 10

1 2 4 3 1 2 3 4

8 6 5 7 8 7 6 5

9 11 12 10 9 10 11 12

16 14 13 15 16 15 14 13

Why does it work? The algorithm is oblivious (since odd-even transpose is also oblivious.). By the 0-1 sorting lemma, if an oblivious
algorithm will correctly sort any input sequence with 0s and 1s, then the algorithm will sort any sequence correctly. (The proof of the
0-1 sorting lemma shows that if an oblivious sort fails on some sequence, then there exists a sequence of 0s and 1s that will also make
the algorithm fail.)

Problems:

1. Route the following permutation on a Benes network and a hypercube (e.g. butterfly):
24

(
01234567
45670123 )
2. Use Gray codes to show that a 64-node hypercube is isomorphic to a 4 x 4 x 4 torus.

3. Show how to achieve the following mesh permutation using perfect matching and linear array sorting.

3,2 3,3 2,2 2,1

3,1 4,2 1,1 1,4

4,4 1,3 2,3 1,2

4,3 2,4 3,4 4,1

4. Derive an all-to-all broadcast scheme for a 4-d hypercube similar to the one used for 6-d hypercubes.

5. How many automorphisms are there for a complete binary tree with h = 4?

6. How many vertex equivalence classes does a 4x5 torus have? A 4x5 mesh?

7. How many automorphisms does a 5-node ring have?

8. For purposes of this exercise only, let us define a class of networks to be scalable if a larger network in that class may be
constructed from smaller network(s) of that class by only including additional vertices and edges. In particular, the drastic measures
of removing edges or vertices are prohibited. Indicate which network classes are scalable and which are not.

9. Consider an r-dimension array with N = N1 = Nr and N is odd. What is the bisection width?

10. Draw the fat tree with depth = 2 and degree = 3.

1. Route the following permutation on a Benes network and a hypercube (e.g. butterfly):

( 0 1 2 34 5 6 7

4 5 6 70 1 2 3 )
0 0
1 1

2 2
3 3

4 4
5 5

6 6
7 7
25
0/1
0 0 0/1

1 1

2/3 2 2 2/3

3 3
4 4
4/5 5 4/5
5
6 6
6/7 7 7 6/7

0 0
0 0
1 4 4 1

3 3
2 2
3 7 7 3

1 1
4 4
5 5
5 5

2 2
6 6
6 6
7 7

0 4
0/4 4 0 0/4

3 7
3/7 3/7
7 3

1 5
1/5 5 1 1/5

2 6
2/6 2/6
6 2

0 0 4 0
0 0
1 4 4 1
7 3

3 4 0 3
2 2
3 7
3 7 7 3

1 5 1 1
4 4
5 6 2 5
5 5

2 1 2
5
6 6
6 2 6 6
7 7
26
0 4
0/4 4 0 0/4

3 7
3/7 3/7
7 3

1 5
1/5 5 1 1/5

2 6
2/6 2/6
6 2
0 4
0 0

5 1
1 1

6 2
2 2

7 3
3 3

4 0
4 4

1 5
5 5

2 6
6 6

3 7
7 7
27
0 4
0/6 6 2 4/2

5 1
5/7 1/3
7 3

2 6
4/2 0/6
4 0

1 5
1/3 5/7
3 7
0 0 4 4
0 0

5 7 3 1
1 1

6 6 2 2
2 2

7 5 1 3
3 3

4 2 6 0
4 4

1 1 5 5
5 5

2 4 0 6
6 6

3 3 7 7
7 7
28
2. Use Gray codes to show that a 64-node hypercube is isomorphic to a 4 x 4 x 4 torus.

Each torus node has a three component address (x, y, z) where each component value is 0, 1, 2, or 3. To map to a hypercube,
each of the three components is mapped to two bits in the hypercube address using the 2-bit Gray code:

torus address hypercube bits

component

0 00
1 01
2 11
3 10

To see that adjacencies are preserved, consider torus node (1, 2, 3) and its neighbors:

torus address hypercube address

(1, 2, 3) 011110
(0, 2, 3) 001110
(2, 2, 3) 111110
(1, 3, 3) 011010
(1, 1, 3) 010110
(1, 2, 2) 011111
(1, 2, 0) 011100

3. Show how to achieve the following mesh permutation using perfect matching and linear array sorting.

3,2 3,3 2,2 2,1

3,1 4,2 1,1 1,4

4,4 1,3 2,3 1,2

4,3 2,4 3,4 4,1

Start Destination Start Destination

Column Column Column Column
Start Destination 3,1
Column Column 1 1 1 1
3,1 3,2 3,2
1 3,2 1 4,3 4,3
4,4 4,4
4,3
4,4
4,2

4,2 2 1,3 2 2 1,3 2

2,4 3,3 2,4 3,3
2 1,3 2
2,4 3,3
1,1 2,2 1,1 2,2

1,1 2,2 3 2,3 3 3 3

3 2,3 3 3,4 3,4

3,4
2,1 2,1
4,1 4,1
1,2 1,2
2,1 4 4 4 4
4,1
1,2 1,4
4 4
1,4 Matching 1 Matching 2
29
Start Destination Start Destination
Column Column Column Column

1 1 1 1

4,3
4,4 4,4

2 2 2 2
2,4 3,3 3,3

1,1 2,2 2,2

3 3 3 3

4,1 4,1
1,2
4 4 4 4

Matching 3 Matching 4

3,1 4,2 2,3 1,4 3,1 4,2 2,3 1,4 1,1 1,2 1,3 1,4

3,2 1,3 3,4 2,1 2,1 3,2 1,3 3,4 2,1 2,2 2,3 2,4

4,3 2,4 1,1 1,2 1,1 1,2 4,3 2,4 3,1 3,2 3,3 3,4

4,4 3,3 2,2 4,1 4,1 2,2 3,3 4,4 4,1 4,2 4,3 4,4

Routing within columns Sort within rows based Sort within columns
for four matchings on column destination based on column destination
4. Derive an all-to-all broadcast scheme for a 4-d hypercube similar to the one used for 6-d hypercubes.

Lower bounds indicate that 4 rounds are sufficient.

0000/1
1
0001/4
2
0011/4 0101/2
3 4
0111/4
4
1111/1

5. How many automorphisms are there for a complete binary tree with h = 4?
30

Since each parent node has two orientations for its children, there are 215 = 32768 automorphisms

6. How many vertex equivalence classes does a 4x5 torus have? A 4x5 mesh?

Torus: 1
Mesh: 6

7. How many automorphisms does a 5-node ring have?

8. ... Indicate which network classes are scalable and which are not.

Scalable: linear array, mesh, binary tree, butterfly, fat tree, hypercube, benes

Not scalable: ring, torus

9. Consider an r-dimension array with N = N1 = Nr and N is odd. What is the bisection width?

(Nr - 1)/(N - 1) by using a geometric sum (1 + N1 + N2 + . . . + Nr-1).

10. Draw the fat tree with depth = 2 and degree = 3.

0 0,0 0,1 0,2 0,3

1 0,0 0,1 1,0 1,1 2,0 2,1

2 0,0 1,0 2,0 3,0 4,0 5,0 6,0 7,0 8,0

CNC Router Essentials: The Basics for Mastering the Most Innovative Tool in Your Workshop
From Everand
CNC Router Essentials: The Basics for Mastering the Most Innovative Tool in Your Workshop
Neil Carr
5/5 (3)
Shimon Even Graph Algorithms Computer Software Engineering Series
100% (1)
Shimon Even Graph Algorithms Computer Software Engineering Series
260 pages
Introduction To Parallel Computing: Solution Manual
No ratings yet
Introduction To Parallel Computing: Solution Manual
70 pages
Merged en
No ratings yet
Merged en
34 pages
Sega Saturn Architecture: Architecture of Consoles: A Practical Analysis, #5
From Everand
Sega Saturn Architecture: Architecture of Consoles: A Practical Analysis, #5
Rodrigo Copetti
No ratings yet
MA252 - Combinatorial Optimisation
No ratings yet
MA252 - Combinatorial Optimisation
9 pages
Interconnection Networks: Crossbar Switch, Which Can Simultaneously Connect Any Set of
No ratings yet
Interconnection Networks: Crossbar Switch, Which Can Simultaneously Connect Any Set of
11 pages
Interconnection Networks: Crossbar Switch, Which Can Simultaneously Connect Any Set of
No ratings yet
Interconnection Networks: Crossbar Switch, Which Can Simultaneously Connect Any Set of
11 pages
Introduction
No ratings yet
Introduction
46 pages
Chapter 03
No ratings yet
Chapter 03
68 pages
Parallel Processors: Session 5 Interconnection Networks
No ratings yet
Parallel Processors: Session 5 Interconnection Networks
48 pages
10-Hypercube & Network
No ratings yet
10-Hypercube & Network
22 pages
Interconnect Network Topologies
No ratings yet
Interconnect Network Topologies
23 pages
Advance Computer Architecture: Unit:Ii System Interconnect Architectures
No ratings yet
Advance Computer Architecture: Unit:Ii System Interconnect Architectures
53 pages
Array Processors: SIMD Computer Organization
100% (1)
Array Processors: SIMD Computer Organization
45 pages
Parallel Architectures
No ratings yet
Parallel Architectures
160 pages
Interconnection Network Topology Design Trade-Offs
No ratings yet
Interconnection Network Topology Design Trade-Offs
29 pages
unit-3.2 static interconnection networks
No ratings yet
unit-3.2 static interconnection networks
10 pages
Static and Dynamic
No ratings yet
Static and Dynamic
43 pages
And Other Applications of Graphs and Networks: Jo Ellis-Monaghan
No ratings yet
And Other Applications of Graphs and Networks: Jo Ellis-Monaghan
34 pages
Distributed Memory Machines
No ratings yet
Distributed Memory Machines
10 pages
f32 Book Parallel Pres pt4
No ratings yet
f32 Book Parallel Pres pt4
106 pages
Parallel 2ndtweek Class2
No ratings yet
Parallel 2ndtweek Class2
18 pages
Lecture - 28
No ratings yet
Lecture - 28
24 pages
Parallel Algorithms: Peter Harrison and William Knottenbelt
No ratings yet
Parallel Algorithms: Peter Harrison and William Knottenbelt
65 pages
20_interconnectTopologiesFinal
No ratings yet
20_interconnectTopologiesFinal
47 pages
UBICC-Cameraready 481
No ratings yet
UBICC-Cameraready 481
9 pages
Channel Routing Quick Notes
No ratings yet
Channel Routing Quick Notes
7 pages
Interconnection Network Architectures
100% (1)
Interconnection Network Architectures
54 pages
Graph Algorithm 2
No ratings yet
Graph Algorithm 2
109 pages
Network On Chip
100% (1)
Network On Chip
62 pages
Matrix Transpose On Meshes: Theory and Practice
No ratings yet
Matrix Transpose On Meshes: Theory and Practice
5 pages
Advanced Computer Architecture CSE 8383
No ratings yet
Advanced Computer Architecture CSE 8383
56 pages
Solution 2-DD
No ratings yet
Solution 2-DD
70 pages
Chapter8 PDF
No ratings yet
Chapter8 PDF
4 pages
3RD UNIT HALF 2
No ratings yet
3RD UNIT HALF 2
8 pages
Multiprocessor Interconnection Networks Networks: CS 740 November 19, 2003
No ratings yet
Multiprocessor Interconnection Networks Networks: CS 740 November 19, 2003
8 pages
Interconnects
No ratings yet
Interconnects
25 pages
1 s2.0 0166218X9500046T Main
No ratings yet
1 s2.0 0166218X9500046T Main
15 pages
Parallel Processing Lecture3
No ratings yet
Parallel Processing Lecture3
54 pages
Lecture22 dr2
No ratings yet
Lecture22 dr2
37 pages
Intro To Communication: - Advantages
No ratings yet
Intro To Communication: - Advantages
13 pages
Chapter 05
No ratings yet
Chapter 05
32 pages
Lecture 4 Network Topologies For Parallel Architecture
No ratings yet
Lecture 4 Network Topologies For Parallel Architecture
34 pages
annotated-HW2_Basit (1)
No ratings yet
annotated-HW2_Basit (1)
13 pages
Mesh and Hypercube
No ratings yet
Mesh and Hypercube
16 pages
Supplnotes 020209
No ratings yet
Supplnotes 020209
201 pages
Chapter 2 - Parallel Programming Platforms
No ratings yet
Chapter 2 - Parallel Programming Platforms
33 pages
Ch.11 Graphs: Data Structures: A Pseudocode Approach With C
No ratings yet
Ch.11 Graphs: Data Structures: A Pseudocode Approach With C
65 pages
f32 Book Parallel Pres pt4
No ratings yet
f32 Book Parallel Pres pt4
94 pages
Lecture 3 - 3 Evaluating Static Interconnection Networks
No ratings yet
Lecture 3 - 3 Evaluating Static Interconnection Networks
41 pages
Lec3 InnerconnectionNetworks
No ratings yet
Lec3 InnerconnectionNetworks
28 pages
15CS72 IAT2 Solution
No ratings yet
15CS72 IAT2 Solution
13 pages
18 Interconnects
No ratings yet
18 Interconnects
38 pages
System Interconnect Network & Topologies
No ratings yet
System Interconnect Network & Topologies
48 pages
E Mbedding B Us A ND R Ing Into H Ex - C Ell I Nterconnection N Etwork
No ratings yet
E Mbedding B Us A ND R Ing Into H Ex - C Ell I Nterconnection N Etwork
12 pages
Greedy_Algorithms_notes
No ratings yet
Greedy_Algorithms_notes
116 pages
Lecture 5 Network Topologies for Parallel Architectures - Updated
No ratings yet
Lecture 5 Network Topologies for Parallel Architectures - Updated
46 pages
Elementary Graph Algorithms: Manoj Agnihotri M.Tech I.T Dept of CSE ACET Amritsar
No ratings yet
Elementary Graph Algorithms: Manoj Agnihotri M.Tech I.T Dept of CSE ACET Amritsar
58 pages
Switches, Routers and Networks
No ratings yet
Switches, Routers and Networks
104 pages
Neo Geo Architecture: Architecture of Consoles: A Practical Analysis, #23
From Everand
Neo Geo Architecture: Architecture of Consoles: A Practical Analysis, #23
Rodrigo Copetti
No ratings yet
Mega Drive Architecture: Architecture of Consoles: A Practical Analysis, #3
From Everand
Mega Drive Architecture: Architecture of Consoles: A Practical Analysis, #3
Rodrigo Copetti
No ratings yet
Graph Theory: A First IN
100% (2)
Graph Theory: A First IN
4 pages
Math Maze Puzzle Collection #1
No ratings yet
Math Maze Puzzle Collection #1
7 pages
Class Ix Holiday Home Work - 10052024 - 140748
No ratings yet
Class Ix Holiday Home Work - 10052024 - 140748
2 pages
DAA Question Bank
No ratings yet
DAA Question Bank
9 pages
Planar Graph Drawing Takao Nishizeki Dr Md Saidur Rahman instant download
No ratings yet
Planar Graph Drawing Takao Nishizeki Dr Md Saidur Rahman instant download
77 pages
Aoa Practicals
No ratings yet
Aoa Practicals
25 pages
TEMPLET To Give Attainment Values
No ratings yet
TEMPLET To Give Attainment Values
4 pages
Nitesh DAA File
No ratings yet
Nitesh DAA File
24 pages
STD 5 Maths Data Handling
No ratings yet
STD 5 Maths Data Handling
2 pages
Introduction To Beamer and Tikz: Christian Schnell
No ratings yet
Introduction To Beamer and Tikz: Christian Schnell
48 pages
DS (3rd) May2018
No ratings yet
DS (3rd) May2018
2 pages
LP-II - AI-Lab Manual
No ratings yet
LP-II - AI-Lab Manual
33 pages
15-16 Graph Coloring
No ratings yet
15-16 Graph Coloring
32 pages
On Self-Centeredness of Tensor Product of Some Graphs
No ratings yet
On Self-Centeredness of Tensor Product of Some Graphs
10 pages
CH 2 Practice Test 2019
No ratings yet
CH 2 Practice Test 2019
4 pages
AIM: Write A Program in Java To Analyze Time Complexity of Recursive
No ratings yet
AIM: Write A Program in Java To Analyze Time Complexity of Recursive
29 pages
AUN - IT153IU - Discrete Mathematics
No ratings yet
AUN - IT153IU - Discrete Mathematics
7 pages
Jensen: Unit 4 Pretest Review
0% (1)
Jensen: Unit 4 Pretest Review
11 pages
Facebook Profiles Clustering
No ratings yet
Facebook Profiles Clustering
5 pages
(COMP3711) (2018) (F) Final 97lpo 59937
No ratings yet
(COMP3711) (2018) (F) Final 97lpo 59937
12 pages
l34 MST Prims
No ratings yet
l34 MST Prims
22 pages
Shortest Path Algorithms: Discrete Mathematics
No ratings yet
Shortest Path Algorithms: Discrete Mathematics
28 pages
Module 2
No ratings yet
Module 2
105 pages
Practice 2.12 Logarithmic Function Manipulation
No ratings yet
Practice 2.12 Logarithmic Function Manipulation
2 pages
Download Full Cohesive Subgraph Computation over Large Sparse Graphs Algorithms Data Structures and Programming Techniques Lijun Chang PDF All Chapters
100% (1)
Download Full Cohesive Subgraph Computation over Large Sparse Graphs Algorithms Data Structures and Programming Techniques Lijun Chang PDF All Chapters
65 pages
Srimaan: PG-TRB
No ratings yet
Srimaan: PG-TRB
24 pages
Definitive Guide Graph Databases For RDBMS Developer
100% (1)
Definitive Guide Graph Databases For RDBMS Developer
35 pages