VII. Cache Coherence. Interconnection Networks (1) : March 16, 2009
Content
classification
topologies
evaluating static and dynamic interconnection networks
The problem
Protocols
False sharing
snoopy systems,
directory based systems, or
combinations thereof.
Since each memory block has an owner, its directory location is implicitly known to all
processors.
When a processor attempts to read a block for the first time, it requests the block from
the owner.
The owner directs this request based on the presence and state information available
locally.
When a processor writes into a memory block, it sends an invalidate to the owner,
which in turn forwards the invalidate to all processors that have a cached copy of the block.
Note that the communication overhead associated with state update messages is
not reduced.
Distributed directories permit O(p) simultaneous coherence operations, provided
the underlying network can sustain the associated state update messages.
From this point of view, distributed directories are inherently more scalable than snoopy
systems or centralized directory systems.
The latency and bandwidth of the network become fundamental performance
bottlenecks for such systems.
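The read and write steps of a distributed directory can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and method names are hypothetical, the owner of a block is assumed to be `block % p` (the slides only say the owner is implicitly known), and every request is counted as one network message even when the requester happens to be the owner.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    """Presence and state information kept by a block's owner."""
    sharers: set = field(default_factory=set)  # processors holding a cached copy
    dirty: bool = False                        # True while one writer holds the block


class DistributedDirectory:
    """Hypothetical model: one directory per processor, no real caches."""

    def __init__(self, num_procs):
        self.p = num_procs
        self.dirs = [dict() for _ in range(num_procs)]  # block id -> DirectoryEntry
        self.messages = 0  # coherence messages sent over the network

    def owner(self, block):
        # Assumption: blocks are interleaved across processors, so any
        # processor can compute the owner without a lookup.
        return block % self.p

    def _entry(self, block):
        return self.dirs[self.owner(block)].setdefault(block, DirectoryEntry())

    def read(self, proc, block):
        """proc reads block: one request goes to the owner."""
        entry = self._entry(block)
        self.messages += 1
        entry.dirty = False          # simplification: a read downgrades to shared
        entry.sharers.add(proc)

    def write(self, proc, block):
        """proc writes block: the owner forwards invalidates to other sharers."""
        entry = self._entry(block)
        self.messages += 1                   # invalidate sent to the owner
        invalidated = entry.sharers - {proc}
        self.messages += len(invalidated)    # one forwarded invalidate per sharer
        entry.sharers = {proc}
        entry.dirty = True
        return invalidated
```

Because each owner handles only its own blocks, operations on blocks with different owners touch different directories, which is where the O(p) simultaneous coherence operations come from. The message counter shows that the per-operation communication itself is not reduced.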
Interconnection networks
Switch
Network interface
Provides the connectivity between the nodes and the network
The network interface has input and output ports that pipe data into and
out of the network
Responsible for:
packetizing data,
computing routing information,
buffering incoming and outgoing data to match the speeds of the network
and the processing elements,
error checking.
Conventional network interfaces hang off the I/O bus.
Interfaces in tightly coupled parallel machines hang off the memory bus.
Since I/O buses are typically slower than memory buses, the latter can
support higher bandwidth.
Static network
Disadvantages:
Figure (a) :
Figure (b):
Assume 50% of the memory accesses (0.5k) are made to local data.
Assume the access time to the private memory is identical to that of the global memory, tcycle.
The total execution time is then lower bounded by 0.5 x tcycle x k + 0.5 x tcycle x k x p.
For large p, the organization of Figure (b) results in a lower bound that approaches
0.5 x tcycle x k x p.
This is a 50% improvement in the lower bound on execution time compared
to the organization of Figure (a).
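The bound can be checked numerically. The sketch below assumes that k is the number of accesses per processor and p the number of processors, and that the baseline of Figure (a) is tcycle x k x p (all accesses serialized on the shared bus). That baseline is not stated in the text above, but it is the value implied by the 50% figure. The function names are hypothetical.

```python
def bound_shared_only(p, k, t_cycle):
    """Assumed baseline for Figure (a): all p*k accesses serialize on the bus."""
    return t_cycle * k * p


def bound_with_local(p, k, t_cycle, local_fraction=0.5):
    """Lower bound for Figure (b): local accesses proceed in parallel,
    the remaining accesses still serialize on the shared bus."""
    local = local_fraction * t_cycle * k
    shared = (1 - local_fraction) * t_cycle * k * p
    return local + shared


if __name__ == "__main__":
    for p in (2, 16, 1024):
        a = bound_shared_only(p, k=1000, t_cycle=1.0)
        b = bound_with_local(p, k=1000, t_cycle=1.0)
        print(f"p={p:5d}  bound(b)/bound(a) = {b / a:.4f}")
```

The ratio equals 0.5 + 0.5/p, so it tends to 0.5 as p grows, which is the 50% improvement quoted above.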
Crossbar Networks
Multistage networks
Remarks:
The crossbar interconnection network
Omega network
An omega network has p/2 x log p switching nodes, and the cost of
such a network grows as O(p log p).
This cost is less than the Θ(p^2) cost of a complete crossbar network.
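To make the cost comparison concrete, the short sketch below counts switching nodes for both networks. It assumes p is a power of two and that the crossbar connects p processors to p memory banks; the function names are illustrative only.

```python
def omega_switches(p):
    """Number of 2x2 switching nodes in a p-input omega network: (p/2) * log2(p)."""
    if p < 2 or p & (p - 1):
        raise ValueError("p must be a power of two, at least 2")
    stages = p.bit_length() - 1   # log2(p) for a power of two
    return (p // 2) * stages


def crossbar_switches(p):
    """Switching nodes in a p x p crossbar (assumes as many banks as processors)."""
    return p * p


if __name__ == "__main__":
    for p in (8, 64, 1024):
        print(p, omega_switches(p), crossbar_switches(p))
```

For p = 1024 this gives 5120 switching nodes for the omega network against 1048576 for the crossbar.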
Routing data in an omega network is accomplished as follows:
Let s be the binary representation of a processor that needs to write
some data into memory bank t.
The data traverses the link to the first switching node.
If the most significant bits of s and t are the same, then the data is
routed in pass-through mode by the switch.
If these bits are different, the data is routed through in crossover mode.
This scheme is repeated at the next switching stage using the next
most significant bit.
Traversing log p stages uses all log p bits in the binary representations
of s and t.
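The routing rule above reduces to a bitwise comparison of s and t, one bit per stage, starting from the most significant bit. A minimal sketch follows, assuming p is a power of two. The function name is hypothetical, and the code computes only the switch settings, not the physical wiring between stages.

```python
def omega_route(s, t, p):
    """Return the switch setting used at each of the log2(p) stages
    when source s sends data to memory bank t."""
    if p < 2 or p & (p - 1):
        raise ValueError("p must be a power of two, at least 2")
    if not (0 <= s < p and 0 <= t < p):
        raise ValueError("s and t must be in the range 0..p-1")
    stages = p.bit_length() - 1
    settings = []
    for stage in range(stages):
        bit = stages - 1 - stage            # most significant bit first
        s_bit = (s >> bit) & 1
        t_bit = (t >> bit) & 1
        settings.append("pass-through" if s_bit == t_bit else "crossover")
    return settings


if __name__ == "__main__":
    # Processor 010 writing to memory bank 111 in an 8-input network.
    print(omega_route(0b010, 0b111, 8))
```

For s = 010 and t = 111 the bits differ, match, and differ again, so the three stages are set to crossover, pass-through, crossover.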
Examples
Completely-connected network:
Star-connected network:
Communication between any pair of processors is routed through the
central processor.
Linear arrays
A two-dimensional mesh:
Each node in a 3-D cube (except those on the periphery) is connected to six other
nodes, two along each of the three dimensions.
Numbering scheme:
Tree-based networks
Tree network: one in which there is only one path between any pair of nodes.
Linear arrays and star-connected networks are special cases of tree networks.
Figure: networks based on complete binary trees.
Static tree networks have a processing element at each node of the
tree (in (a)).
In a dynamic tree network, nodes at intermediate levels are switching
nodes and the leaf nodes are processing elements (in (b)).
To route a message in a tree:
the source node sends the message up the tree until it reaches the
node at the root of the smallest subtree containing both the source and
destination nodes.
Then the message is routed down the tree towards the destination node.
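The up-then-down routing rule can be sketched for a static complete binary tree. The numbering is an assumption made for this example: the root is node 1 and node i has children 2i and 2i+1. The function name is hypothetical.

```python
def tree_route(src, dst):
    """Path from src to dst in a complete binary tree numbered heap-style
    (root = 1, children of i are 2*i and 2*i + 1)."""
    if src < 1 or dst < 1:
        raise ValueError("node numbers start at 1")
    up = [src]      # nodes visited while climbing from the source
    down = [dst]    # ancestors of the destination, collected bottom-up
    a, b = src, dst
    while a != b:
        # The larger number is never closer to the root, so move it up.
        if a > b:
            a //= 2
            up.append(a)
        else:
            b //= 2
            down.append(b)
    # a == b is the root of the smallest subtree containing both nodes.
    return up + list(reversed(down[:-1]))


if __name__ == "__main__":
    print(tree_route(4, 5))
    print(tree_route(4, 7))
```

With this numbering, nodes 4 and 5 share parent 2, so the message only climbs one level. Nodes 4 and 7 sit in different halves of the tree, so the message must pass through the root.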
Fat tree
Diameter
Connectivity
Bisection width
Channel capacity
Bisection bandwidth
Cost
1. Diameter
2. Connectivity
3. Bisection width
Bisection width of a:
ring is two,
two-dimensional p-node mesh without wraparound connections is sqrt(p),
and with wraparound connections is 2 sqrt(p),
tree and a star is one,
completely-connected network of p nodes is p^2/4,
hypercube is p/2 (from its construction).
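For small networks these values can be verified by brute force: try every split of the nodes into two equal halves and keep the smallest number of cut links. The sketch below is exponential in the number of nodes and is meant only as a check on small examples. The helper names are hypothetical and an even node count is assumed.

```python
from itertools import combinations


def bisection_width(num_nodes, edges):
    """Minimum number of edges cut over all splits into two equal halves."""
    if num_nodes < 2 or num_nodes % 2:
        raise ValueError("this sketch assumes an even number of nodes")
    best = len(edges)
    # Node 0 is fixed in the first half; the mirrored splits are equivalent.
    for rest in combinations(range(1, num_nodes), num_nodes // 2 - 1):
        half = {0, *rest}
        cut = sum((u in half) != (v in half) for u, v in edges)
        best = min(best, cut)
    return best


def ring(p):
    return [(i, (i + 1) % p) for i in range(p)]


def hypercube(d):
    p = 1 << d
    return [(u, u ^ (1 << b)) for u in range(p) for b in range(d)
            if u < (u ^ (1 << b))]


def complete(p):
    return list(combinations(range(p), 2))


def mesh(side):
    """side x side mesh without wraparound connections."""
    edges = []
    for r in range(side):
        for c in range(side):
            u = r * side + c
            if c + 1 < side:
                edges.append((u, u + 1))
            if r + 1 < side:
                edges.append((u, u + side))
    return edges


if __name__ == "__main__":
    print("ring, p=8:", bisection_width(8, ring(8)))
    print("hypercube, p=8:", bisection_width(8, hypercube(3)))
    print("complete, p=6:", bisection_width(6, complete(6)))
    print("mesh, p=16:", bisection_width(16, mesh(4)))
```

The four cases correspond to the formulas 2, p/2, p^2/4 and sqrt(p) listed above.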
Channel width:
Channel rate:
Channel bandwidth:
5. Bisection bandwidth
in a 2D packaging is Θ(w^2),
in a 3D packaging is Θ(w^(3/2)).
6. Cost
Example:
Three bisections, A, B, and C, each of which partitions the network into
two groups of two processing nodes each.
Each partition results in an edge cut of four.
=> We conclude that the bisection width of this graph is four.