15CS72_ACA_Module1_Chapter2FinalCopy
3. Output Dependence: Two statements are output-dependent if they write to (produce) the same
output variable.
4. I/O Dependence: Read and write statements are I/O statements. I/O dependence
occurs when the same file is referenced by both I/O statements.
5. Unknown dependence: The dependence relation between two statements cannot be
determined in situations such as the following:
● The subscript of a variable is itself subscripted (indirect addressing mode), for example
LOAD R1, @100
where the operand address must itself be fetched from memory location 100, so the actual
dependence cannot be resolved at compile time.
S2 is flow-dependent on S1 because the value of A is passed to R1, which is then given as input to S2.
S3 is antidependent on S2 because S2 reads R1 before S3 overwrites it. S2 and S4 are independent.
Consider another code fragment, given below.
The read and write statements S1 and S3 are I/O-dependent on each other because they both access
the same file. Program order must therefore be preserved during execution; otherwise the results
may be erroneous. The dependence graphs for both code fragments are shown below.
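As a small, hypothetical Python fragment (not the book's example), the three kinds of data dependence defined above can be read off directly from which statements read and write the variable a:

b, c, e = 1, 2, 3
a = b + c      # S1: writes a
d = a * 2      # S2: flow-dependent on S1 (it reads the a that S1 wrote)
a = e - 1      # S3: antidependent on S2 (S2 must read a before S3 overwrites it)
               #     and output-dependent on S1 (both statements write a)
print(d, a)    # 6 2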
Control Dependence
1. Conditional statements are evaluated at run time, so the execution path that will be
followed cannot be determined in advance.
2. Different paths taken after a conditional branch may introduce or eliminate data
dependencies among instructions.
3. Dependence may also exist between operations performed in successive iterations of a
loop. In the following, we show one loop example with control-dependent iterations and
another without them (see the sketches after this list).
4. The following loop has independent iterations.
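The following are hedged Python sketches of the two kinds of loop referred to in items 3 and 4 above (hypothetical loops, not the book's code). In the first loop, the branch tested in iteration i reads A[i-1], which iteration i-1 may have just written, so successive iterations are control-dependent. In the second loop, each iteration reads and writes only its own elements, so the iterations are independent and could run in parallel.

N = 6
A = [2, 0, 5, 0, 7, 1]
C = [-1, 3, -4, 2, 0, -6]

# Loop with control-dependent iterations: whether A[i] is cleared depends on A[i-1],
# possibly set by the previous iteration.
for i in range(1, N):
    if A[i - 1] == 0:
        A[i] = 0

# Loop with independent iterations: each iteration touches only A[i] and C[i].
for i in range(N):
    A[i] = C[i]
    if A[i] < 0:
        A[i] = 1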
Resource Dependence
1. Resource dependence occurs due to conflicts in using shared resources such as integer
units, floating-point units, registers, or memory areas.
2. When the conflict involves an ALU it is called ALU dependence, and when the
conflict involves storage it is called storage dependence.
Bernstein’s Conditions
Bernstein's conditions specify when two processes can be executed in parallel. Ii (the input set) is the
set of input variables of process Pi, and Oi (the output set) consists of all output variables generated
after the execution of Pi. Now consider two processes P1 and P2 with their input sets I1
and I2 and output sets O1 and O2, respectively. These two processes can execute in parallel,
denoted P1 || P2, if they are independent, i.e. if they satisfy the three Bernstein's conditions
given below:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
Bernstein’s conditions simply imply that two processes can execute in parallel if they are
flow-independent, anti-independent, and output-independent.
In general, a set of processes, P1, P2,.....,PK can execute in parallel if Bernstein’s conditions are
satisfied on a pairwise basis; that is, P1 || P2 || P3 || P4 ||.....
|| PK if and only if Pi || Pj for all i≠ j
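A minimal Python sketch of the pairwise test, assuming each process Pi is summarized only by its input set Ii and output set Oi (the statements and variable names in the example are hypothetical):

def bernstein_parallel(I1, O1, I2, O2):
    # P1 || P2 holds when I1 ∩ O2, I2 ∩ O1 and O1 ∩ O2 are all empty.
    return not (I1 & O2) and not (I2 & O1) and not (O1 & O2)

def all_parallel(procs):
    # procs: list of (input_set, output_set) pairs; Bernstein's conditions checked pairwise.
    return all(bernstein_parallel(*procs[i], *procs[j])
               for i in range(len(procs)) for j in range(i + 1, len(procs)))

# Example: P1: C = D * E and P2: M = G + C share the variable C (C is in O1 and in I2),
# so the second condition fails and P1 ∦ P2.
P1 = ({"D", "E"}, {"C"})
P2 = ({"G", "C"}, {"M"})
print(bernstein_parallel(*P1, *P2))   # False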
The dependence graph is shown below. There is a resource dependence between P2, P3 and P4
because it is assumed that there is only one adder unit.
Sequential execution requires five steps, whereas parallel execution requires only three steps if
two adders are available, as shown in Figure 2.2. Only five pairs, P1||P5, P2||P3, P2||P5, P5||P3, and
P4||P5, can execute in parallel if there are no resource conflicts.
In general, the parallelism relation || is commutative, i.e. Pi || Pj implies Pj || Pi. But the relation is not
transitive, i.e. Pi || Pj and Pj || Pk do not imply Pi || Pk. For example, we have P1||P5 and P5||P2
but P1 ∦ P2, which means P1 and P2 cannot be executed in parallel.
However, Pi || Pj || Pk implies associativity, i.e. (Pi || Pj) || Pk = Pi || (Pj || Pk), because the parallel
executable processes can be executed in any order. Violation of any one or more of the three Bernstein's
conditions prohibits parallelism between two processes.
Consider the execution of the same instructions by a two-issue processor which can execute one
memory access (load or store) and one arithmetic operation simultaneously. With this hardware
restriction, the program must execute in seven cycles as shown in Figure 2.3b. Therefore
hardware parallelism displays an average value of 8/7 ≈ 1.14 instructions executed per cycle.
This demonstrates a mismatch between the software parallelism and the hardware parallelism.
Now consider a dual-processor system where each processor is a single-issue processor. The hardware
parallelism is shown in Figure 2.4 below for 12 instructions executed by the two processors A
and B. S1, S2, L5 and L6 are instructions added for interprocessor communication.
Control Parallelism: Two or more operations are performed simultaneously, for example through
pipelining or multiple functional units. It is exploited mainly with hardware support, and
programmers need take no special action to invoke it.
Data-level parallelism: The same operation is performed over many data elements by many
processors simultaneously. It is practiced in both SIMD and MIMD modes on MPP systems. The
programmer writes data-parallel code, which is easier to write and debug than control-parallel
code. Synchronization in SIMD data parallelism is handled by the hardware.
To solve the mismatch problem between software parallelism and hardware parallelism, one
approach is to develop compilation support, and the other is through hardware redesign for more
efficient exploitation of parallelism.
The node (A, 8) is obtained by combining the nodes (1,1), (2,1), (3,1), (4,1), (5,1), (6,1) and
(11,2). The grain size of node A, 8, is the sum of all the grain sizes being combined
(1 + 1 + 1 + 1 + 1 + 1 + 2 = 8).
2.2.3 Static Multiprocessor Scheduling
Grain packing does not always reduce the schedule length (duration). A static multiprocessor
scheduling technique called node duplication helps to further reduce the schedule time.
Node Duplication: In order to eliminate idle time and to further reduce the communication
delays among processors, one can duplicate some of the nodes in more than one processor.
Figure 2.8a shows a schedule without duplicating any of the five nodes. This schedule contains
idle time as well as long interprocessor delays (8 units) between P1 and P2. In Fig. 2.8b, node A
is duplicated into A' and assigned to P2 while the original copy A is retained in P1. Similarly, a
duplicated node C' is placed in P1 besides the original node C in P2. The new schedule shown
in Fig. 2.8b is almost 50% shorter than that in Fig. 2.8a. Thus grain packing and node
duplication together help to determine the best grain size and the corresponding schedule. Four
major steps are involved in grain determination and the process of scheduling optimization
(a small scheduling sketch follows the list of steps):
Step 1. Construct a fine-grain program graph.
Step 2. Schedule the fine-grain computation.
Step 3. Perform grain packing to produce the coarse grains.
Step 4. Generate a parallel schedule based on the packed graph.
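A minimal Python sketch of steps 2 and 4, using a greedy list schedule on two processors with a fixed interprocessor communication delay. The node names, execution times and the delay value are assumptions for illustration, not the data of Fig. 2.8; the point is only that duplicating the shared predecessor A onto the second processor removes the long interprocessor delay.

from graphlib import TopologicalSorter

COMM = 8  # assumed interprocessor communication delay (time units)

def schedule_length(preds, weight, proc):
    # preds: node -> list of predecessors; weight: node -> execution time;
    # proc: node -> processor id. Greedy list schedule; returns the overall finish time.
    finish, avail = {}, {}
    for n in TopologicalSorter(preds).static_order():
        p = proc[n]
        start = avail.get(p, 0)
        for q in preds[n]:
            start = max(start, finish[q] + (0 if proc[q] == p else COMM))
        finish[n] = start + weight[n]
        avail[p] = finish[n]
    return max(finish.values())

# Without duplication: A runs on P1 and feeds C on P2, paying the communication delay.
preds  = {"A": [], "B": ["A"], "C": ["A"]}
weight = {"A": 4, "B": 1, "C": 1}
print(schedule_length(preds, weight, {"A": 1, "B": 1, "C": 2}))             # 13

# With duplication: a copy A2 of A also runs on P2, so C no longer waits on the link.
preds2  = {"A": [], "A2": [], "B": ["A"], "C": ["A2"]}
weight2 = {"A": 4, "A2": 4, "B": 1, "C": 1}
print(schedule_length(preds2, weight2, {"A": 1, "A2": 2, "B": 1, "C": 2}))  # 5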
Bisection Width.
When a network is cut into two equal halves, the minimum number of edges along the cut is called the
channel bisection width b. Each edge corresponds to a channel with w bit wires. Hence the wire
bisection width is B = bw, which represents the wiring density of a network, and the channel width
is w = B/b. The wire length affects signal latency, clock skew, and power requirements. We
label a network as symmetric if the topology looks the same from every node.
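A small numeric illustration under assumed values: for an 8 × 8 mesh, cutting the network into two equal halves severs one channel per row, so b = n; with an assumed channel width of 16 bits, B = bw.

n, w = 8, 16            # assumed: 8 x 8 mesh, 16-bit channels
b = n                   # channel bisection width of an n x n mesh
B = b * w               # wire bisection width, B = b * w
print(b, B, B // b)     # 8 128 16  (w is recovered as B / b)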
Perfect Shuffle and Exchange: The mapping is shown below, with its inverse shown on the right
side. In general, to shuffle n = 2^k objects, each object is represented by a k-bit binary number.
If x and y are k-bit binary numbers, the perfect shuffle maps x to y, where y is obtained by
cyclically shifting the bits of x left by one position, i.e. by moving the most significant bit of x
to the least significant position.
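A small Python sketch of the mapping for n = 2^k objects, treating each object's address as a k-bit number and rotating its bits left by one position (the most significant bit moves to the least significant position); the inverse shuffle rotates right, and the exchange connects addresses that differ only in the least significant bit:

def perfect_shuffle(x, k):
    # Rotate the k-bit address x left by one bit.
    msb = (x >> (k - 1)) & 1
    return ((x << 1) & ((1 << k) - 1)) | msb

def inverse_shuffle(y, k):
    # Rotate the k-bit address y right by one bit.
    return (y >> 1) | ((y & 1) << (k - 1))

def exchange(x):
    # Complement the least significant bit.
    return x ^ 1

# For n = 8 objects (k = 3): 0..7 shuffle to 0, 2, 4, 6, 1, 3, 5, 7.
print([perfect_shuffle(x, 3) for x in range(8)])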
Star
The star is a two-level tree with a high node degree d = N - 1 at the central node and a small
constant diameter of 2. It is generally used in systems with a centralized supervisor node, as shown
below in the figure.
Fat Tree.
A binary fat tree is shown in the figure given below. The channel width of a fat tree increases as we
ascend from the leaves to the root, so the fat tree is more like a real tree in that its branches get
thicker toward the root. The higher channel width near the root relieves the traffic bottleneck that
would otherwise occur there. The idea of a fat tree was applied in the Connection Machine CM-5.
The Illiac IV used an 8 × 8 mesh with a constant node degree of 4 and a diameter of 7. In general, an
n × n Illiac mesh is formed with the wraparound connections shown below, and its diameter is
d = n - 1, which is only half the diameter of a pure mesh.
The torus is shown in the figure below. The torus has ring connections along each row and along
each column of the array. In general, an n × n binary torus has a node degree of 4 and a diameter
of 2⌊n/2⌋. The torus is a symmetric topology, and the added wraparound connections reduce the
diameter to half that of the mesh.
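A quick check of the diameters quoted above for n × n arrays; with n = 8 this reproduces the numbers in the text (Illiac diameter 7, torus diameter 8, and a pure-mesh diameter of 14, i.e. twice the Illiac value):

def mesh_diameter(n):   return 2 * (n - 1)   # pure n x n mesh, no wraparound
def illiac_diameter(n): return n - 1         # Illiac-style wraparound mesh
def torus_diameter(n):  return 2 * (n // 2)  # ring connections along rows and columns

n = 8
print(mesh_diameter(n), illiac_diameter(n), torus_diameter(n))   # 14 7 8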
Systolic arrays are designed for implementing fixed algorithms. The systolic array shown in the
figure below is designed for matrix multiplication; the interior node degree is 6 in this example.
The commercial Intel iWarp system was designed with a systolic architecture. For special
applications such as image or signal processing, systolic arrays may offer a better
performance/cost ratio.
Hypercubes
In general, an n-cube consists of N = 2^n nodes spanning n dimensions. A 3-cube with 8
nodes is shown in the figure below, and a 4-cube is formed by interconnecting the corresponding
nodes of two 3-cubes as shown in the next figure. Both the network diameter and the node degree
are n. The hypercube has poor scalability and is difficult to package for higher dimensions. The
Intel iPSC/1 and iPSC/2 and the nCUBE machines were built with the hypercube architecture.
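A small sketch of the n-cube parameters stated above. Since two nodes are adjacent exactly when their binary addresses differ in one bit, the routing distance between two addresses is their Hamming distance:

def hypercube_nodes(n):    return 2 ** n   # N = 2^n
def hypercube_degree(n):   return n        # one neighbour per dimension
def hypercube_diameter(n): return n        # the farthest pair differs in all n bits

def hypercube_distance(a, b):
    # Hamming distance between two node addresses.
    return bin(a ^ b).count("1")

print(hypercube_nodes(3), hypercube_diameter(3), hypercube_distance(0b000, 0b101))   # 8 3 2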
In general, a k-cube-connected-cycles (k-CCC) network can be formed from a k-cube by replacing
each of its n = 2^k vertices with a cycle of k nodes, so the k-cube is transformed into a k-CCC with
k · 2^k nodes. The network diameter of the k-CCC is 2k. The major improvement of the CCC lies in
its constant node degree of 3, which is independent of the dimension of the underlying hypercube.
The CCC is therefore a better architecture for building scalable systems, provided the longer latency
can be tolerated in some way.
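The k-CCC parameters stated above, as a quick calculation; for k = 3, the 3-cube's 8 corners each become a 3-node cycle, giving 24 nodes, constant degree 3 and a diameter of 6:

def ccc_nodes(k):    return k * 2 ** k   # each of the 2^k corners becomes a k-node cycle
def ccc_degree(k):   return 3            # constant, independent of k
def ccc_diameter(k): return 2 * k        # as stated above

print(ccc_nodes(3), ccc_degree(3), ccc_diameter(3))   # 24 3 6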
K-ary n-cube network
The 4-ary 3-cube network is shown below; here k = 4 and n = 3. The parameter n is the dimension
of the cube and k is the radix, i.e. the number of nodes along each dimension. The
number of nodes in the network is N = k^n.
Every node in a k-ary n-cube network is identified by an n-digit radix-k address A = a1a2…an.
Low-dimensional k-ary n-cubes are called tori, and high-dimensional ones are called hypercubes. The
traditional torus (the 4-ary 2-cube) is shown in the figure given below, but the wire length between
the nodes is uneven. The wire length can be made equal by folding the network, as shown in the
next figure.
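A quick check of the node count N = k^n: the 4-ary 3-cube shown above (k = 4, n = 3) has 64 nodes, and the traditional torus, the 4-ary 2-cube, has 16:

def kary_ncube_nodes(k, n):
    return k ** n   # k nodes along each of the n dimensions

print(kary_ncube_nodes(4, 3), kary_ncube_nodes(4, 2))   # 64 16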
Network Throughput
The network throughput is defined as the total number of messages the network can handle per
unit time.
A hot spot is a pair of nodes that accounts for a disproportionately large portion of the total
network traffic. Hot-spot traffic can degrade the performance of the entire network by causing
congestion. The hot-spot throughput of a network is the maximum rate at which messages can
be sent from one specific node Pi to another specific node Pj.
2.4.3 Dynamic Connection Networks
Instead of fixed connections, switches or arbiters are used along the connecting paths to provide
dynamic connectivity.
Digital buses
A bus system is essentially a collection of wires and connectors for data transactions among
processors, memory modules, and peripheral devices attached to the bus. The bus is used for
only one transaction at a time between source and destination. In case of multiple requests, the
bus arbitration logic must be able to allocate or deallocate the bus, servicing the requests one at a
time. For this reason, the digital bus has been called a contention bus or a time-sharing bus among
multiple functional modules. The figure given below shows a bus-connected multiprocessor system.
The system bus provides a common communication path between the processors, the I/O subsystem,
the memory modules, secondary storage devices, network adaptors, etc. The active or
master devices (processors or the I/O subsystem) generate requests to address the memory. The
passive or slave devices (memories or peripherals) respond to the requests. The common bus is
used on a time-sharing basis, and important issues include bus arbitration, interrupt
handling, coherence protocols, and transaction processing.
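A minimal round-robin arbiter sketch; the policy itself is an assumption (the text only requires that requests be serviced one at a time). Given the set of masters currently requesting the bus and the identity of the last master granted, the arbiter picks the next requester in cyclic order:

def round_robin_grant(requests, n_masters, last_grant):
    # Grant the bus to the next requesting master after last_grant, or None.
    for offset in range(1, n_masters + 1):
        candidate = (last_grant + offset) % n_masters
        if candidate in requests:
            return candidate
    return None   # no outstanding requests

print(round_robin_grant({0, 2}, n_masters=4, last_grant=0))   # 2 (master 2 is granted next)
print(round_robin_grant(set(), n_masters=4, last_grant=0))    # None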
Switch Modules
An a × b switch module has a inputs and b outputs. A binary switch is a 2 × 2 switch module, with
a = b = 2. In theory a and b do not need to be equal. The table given below lists several commonly
used switch module sizes: 2 × 2, 4 × 4, and 8 × 8. Each input can be connected to one or more
outputs; however, conflicts must be avoided at the output terminals. In other words, one-to-one
and one-to-many mappings are allowed, but many-to-one mappings are not allowed because of
conflicts at the output terminal.
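A small legality check for a switch-module setting, represented as a set of (input, output) connections (a sketch, not any particular product's control logic): one-to-one and one-to-many (broadcast) settings pass, while any many-to-one setting fails because two inputs would collide at the same output terminal.

def legal_setting(connections):
    # connections: set of (input, output) pairs for one a x b switch module.
    outputs = [out for _inp, out in connections]
    return len(outputs) == len(set(outputs))   # no output driven by more than one input

print(legal_setting({(0, 0), (1, 1)}))   # True  (straight, one-to-one)
print(legal_setting({(0, 0), (0, 1)}))   # True  (broadcast from input 0)
print(legal_setting({(0, 1), (1, 1)}))   # False (conflict at output 1)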