
Module 1: Chapter 2

2.1 Conditions of Parallelism


To achieve parallel computing, parallelism must first be detected and then exploited. Significant progress is needed in the following areas:
1. Development of computational models for parallel computers (e.g. the PRAM model)
2. Interprocessor communication in parallel architectures
3. System integration of parallel architectures into general computing environments.
2.1.1 Data and Resource Dependencies
A segment is a collection of program instructions. Several segments can be executed in parallel only when they are independent of each other. The dependence relations among the instructions of a program are shown using a dependence graph. The nodes of the dependence graph correspond to the program statements (instructions), and the directed edges with different labels show the ordered relations among the statements. Analysis of the dependence graph shows where opportunities exist for parallelization and vectorization.
Data Dependence: There are five types of data dependence, as listed below.
1. Flow Dependence: A statement S2 is flow-dependent on S1 if at least one output of S1 feeds in as an input to S2. Flow dependence is denoted S1 → S2.
2. Antidependence: Statement S2 is antidependent on statement S1 if S2 follows S1 in program order and the output of S2 overlaps the input of S1. It is denoted as follows

3. Output Dependence: Two statements are output-dependent if they produce (write) the same output variable. It is denoted as follows

4. I/O Dependence: The read and write statements are I/O statements. The I/O dependence
occurs when the same file is referenced by both I/O statements.
5. Unknown dependence: The dependence relation between two statements cannot be
determined in the following situations:
● The subscript of a variable is itself subscripted (indirect addressing), for example
LOAD R1, @100

● The subscript does not contain the loop index variable.


● A variable appears more than once with subscripts having different coefficients of
the loop variable.
● The subscript is nonlinear in the loop index variable.
Examples:
Consider the following code fragment of four instructions

S2 is flow-dependent on S1 because the value of A is passed to R1, which is then used as an input by S2. S3 is antidependent on S2 because S3 overwrites R1 after S2 has read it. S2 and S4 are independent.
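The fragment itself appears only as a figure in the original; a hypothetical Python reconstruction consistent with the dependences just described is sketched below (the variables A, B and the "registers" R1 to R3 are assumed names, not the original code).

```python
# Hypothetical reconstruction of the four-instruction fragment S1-S4.
A, B, R2, R3 = 10, 0, 1, 5

R1 = A            # S1: load A into R1
R2 = R2 + R1      # S2: reads R1 -> flow-dependent on S1
R1 = R3           # S3: overwrites R1 after S2 has read it -> antidependent on S2
B = R1            # S4: stores R1; no conflict with S2 -> independent of S2
```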
Consider another code fragment given below

The read and write statements S1 and S3 are I/O-dependent on each other because they both access the same file. Program order must therefore be preserved during execution; otherwise the results may be erroneous. The dependence graphs for both code fragments are shown below.
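A minimal sketch of such an I/O-dependent pair of statements is given below; the file name and the choice of which statement reads and which writes are assumptions for illustration only.

```python
# Hypothetical I/O-dependent fragment: S1 and S3 reference the same file,
# so their relative order must be preserved.  "data.txt" is an assumed name.
with open("data.txt", "w") as f:     # S1: write to the file
    f.write("A")
x = 42                               # S2: unrelated computation
with open("data.txt") as f:          # S3: read the same file -> I/O-dependent on S1
    value = f.read()
```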
Control Dependence
1. Conditional statements are evaluated at run time, so the execution path followed may differ from one run to another.
2. Different paths taken after a conditional branch may introduce or eliminate data
dependencies among instructions.
3. Dependence may also exist between operations performed in successive iterations of a
looping procedure. In the following, we show one loop example with and another without
control-dependent iterations.
4. The first loop in the sketch below has independent iterations.
5. The second loop in the sketch below has control-dependent iterations.
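The two loops are shown only as figures in the original text; the sketch below gives hypothetical Python equivalents (the array names and sizes are assumptions).

```python
# Hypothetical loops illustrating the two cases.
N = 8
A = [0.0] * N
C = [float(i - 3) for i in range(N)]

# Loop 1: each iteration reads only C[i] and writes only A[i]; the conditional
# never couples one iteration to another, so all iterations can run in parallel.
for i in range(N):
    A[i] = C[i]
    if A[i] < 0:
        A[i] = 1.0

# Loop 2: the condition tested in iteration i depends on A[i-1], which may have
# been written by iteration i-1, so the iterations are control-dependent and
# cannot be freely parallelized.
for i in range(1, N):
    if A[i - 1] == 0:
        A[i] = 0.0
```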

Resource Dependence
1. Resource dependence occurs due to conflicts in using shared resources such as integer units, floating-point units, registers or memory areas.
2. When the conflict is over an ALU it is called ALU dependence, and when the conflict is over storage it is called storage dependence.
Bernstein’s Conditions
Two processes can be executed in parallel only if certain conditions hold. Let Ii (the input set) be the set of input variables read by process Pi, and let Oi (the output set) be the set of output variables written by Pi. Now consider two processes P1 and P2 with their input sets I1 and I2 and output sets O1 and O2, respectively. These two processes can execute in parallel, denoted P1 || P2, if they are independent and satisfy the three Bernstein's conditions given below:
I1 ∩ O2 = ɸ
I2 ∩ O1 = ɸ
O1 ∩ O2 = ɸ
Bernstein's conditions simply imply that two processes can execute in parallel if they are flow-independent, anti-independent, and output-independent.
In general, a set of processes P1, P2, ..., PK can execute in parallel if Bernstein's conditions are satisfied on a pairwise basis; that is, P1 || P2 || P3 || ... || PK if and only if Pi || Pj for all i ≠ j.

Detection of Parallelism using Bernstein’s conditions : Example


Consider the five statements given below.
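The five statements appear only as a figure in the original. A minimal sketch of the pairwise Bernstein check is given below, assuming a commonly used set of statements consistent with the parallel pairs quoted afterwards (P1: C = D × E, P2: M = G + C, P3: A = B + C, P4: C = L + M, P5: F = G ÷ E); the statement set is an assumption, not a reproduction of the figure.

```python
# Pairwise Bernstein check on an assumed set of five statements.
procs = {
    "P1": ({"D", "E"}, {"C"}),   # (input set Ii, output set Oi)
    "P2": ({"G", "C"}, {"M"}),
    "P3": ({"B", "C"}, {"A"}),
    "P4": ({"L", "M"}, {"C"}),
    "P5": ({"G", "E"}, {"F"}),
}

def bernstein_parallel(a, b):
    """True if Pa || Pb: flow-, anti-, and output-independent."""
    (Ia, Oa), (Ib, Ob) = procs[a], procs[b]
    return not (Ia & Ob) and not (Ib & Oa) and not (Oa & Ob)

names = sorted(procs)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
         if bernstein_parallel(a, b)]
print(pairs)   # five parallel pairs, matching the list quoted below
```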

The dependence graph is shown below. There is a resource dependence among P2, P3 and P4 because it is assumed that there is only one adder unit. Sequential execution requires five steps, whereas parallel execution requires only three steps if two adders are available, as shown in Figure 2.2. Only five pairs, P1||P5, P2||P3, P2||P5, P5||P3, and P4||P5, can execute in parallel if there are no resource conflicts.
In general the parallelism relation || is commutative, i.e. Pi || Pj implies Pj || Pi. But the relation is not transitive, i.e. Pi || Pj and Pj || Pk does not imply Pi || Pk. For example, we have P1||P5 and P5||P2, but P1 ∦ P2, which means P1 and P2 cannot be executed in parallel.

However, Pi || Pj || Pk implies associativity, i.e. (Pi || Pj) || Pk = Pi || (Pj || Pk), because parallel executable processes can be executed out of order. Violation of any one or more of the three Bernstein's conditions prohibits parallelism between two processes.

2.1.2 Hardware and Software Parallelism


In this section we discuss the hardware and software support needed for parallelism.
Example for Software Parallelism and Hardware Parallelism
Consider eight instructions (four loads and four arithmetic operations) to be executed in three consecutive machine cycles: four load operations in the first cycle, followed by two multiply operations in the second cycle and two add/subtract operations in the third cycle. The parallelism therefore varies from 4 to 2 over the three cycles, and the average software parallelism is 8/3 = 2.67 instructions per cycle, as shown in Figure 2.3a.

Now consider the execution of the same instructions by a two-issue processor which can execute one memory access (load or store) and one arithmetic operation simultaneously. With this hardware restriction, the program must execute in seven cycles, as shown in Figure 2.3b. The hardware parallelism therefore displays an average value of 8/7 = 1.14 instructions executed per cycle. This demonstrates a mismatch between the software parallelism and the hardware parallelism.

Finally, consider a dual-processor system where each processor is a single-issue processor. The hardware parallelism for 12 instructions executed by the two processors A and B is shown in Figure 2.4 below. S1, S2, L5 and L6 are instructions added for interprocessor communication.
Control Parallelism: Two or more operations are handled simultaneously, for example in pipelining or with multiple functional units. It is achieved mainly with hardware support, and programmers need take no special action to invoke it.
Data-level Parallelism: The same operation is performed over many data elements by many processors simultaneously. It is practiced in both SIMD and MIMD modes on MPP systems. The programmer writes data-parallel code, which is easier to write and to debug than control-parallel code. Synchronization in SIMD data parallelism is handled by the hardware.

To solve the mismatch problem between software parallelism and hardware parallelism, one
approach is to develop compilation support, and the other is through hardware redesign for more
efficient exploitation of parallelism.

2.1.3 Role of Compilers


● Compilers should be optimized so that more parallelism can be exploited. Early optimizing compilers for exploiting parallelism included the CDC STACKLIB, Cray CFT, Illinois Parafrase, Rice PFC, Yale Bulldog, and Illinois IMPACT. Techniques used in such optimizing compilers include loop transformation and software pipelining.
● Another solution is to design the compiler and the hardware jointly, so that the mismatch between software and hardware parallelism can be resolved.

2.2 Program Partitioning and Scheduling


2.2.1 Grain Sizes and Latency
Grain size is a measure of the amount of computation involved in a software process; it is measured by the number of instructions in a grain (program segment). Grain size can be fine, medium or coarse, depending on the processing level. Latency is a measure of communication overhead; for example, memory latency is the time required by a processor to access memory, and the time required for two processes to synchronize with each other is called synchronization latency.
Parallelism can be exploited at each of the processing levels shown in Figure 2.5. The lower the level, the finer the grain size.
Instruction Level Parallelism: Typical grain size is less than 20 instructions. Parallelism is detected, and source code is transformed into parallel code, by an optimizing compiler. The grain size is fine.
Loop Level Parallelism: If loop iterations are independent they can be executed in parallel, for example by a vector processor. Recursive loops, however, are rather difficult to parallelize. A typical loop contains less than 500 instructions. The grain size is fine.
Procedure Level Parallelism: Here the grain size is medium, typically corresponding to a procedure or subroutine with less than 2000 instructions. Detection of parallelism at this level is much more difficult than at the finer-grain levels.
Subprogram Level Parallelism: The grain size may typically contain tens or hundreds of thousands of instructions. Traditionally, parallelism at this level has been exploited by algorithm designers or programmers rather than by compilers.
Job Level Parallelism: This corresponds to the parallel execution of essentially independent jobs (programs) on a parallel computer. The grain size can be as high as millions of instructions in a single program. Job-level parallelism is handled by the OS and the program loader. The grain size is coarse.
Communication Latency: The latency due to interprocessor communication should be minimized. In general, n tasks communicating with each other may require n(n-1)/2 communication links among them; this leads to a communication bound on the number of processors allowed in a large computer system. Interprocessor communication is also affected by the communication patterns involved, such as permutations, broadcast, multicast and conference (many-to-many) communication.

2.2.2 Grain Packing and Scheduling


Here the main problem is to identify the number of grains and the size of each grain in a parallel program. The solution is, of course, both problem-dependent and machine-dependent.

Example : Program graph before and after grain packing


A program graph shows the structure of a program. Each node of the program graph is a computational unit, denoted (n,s), where n is the node id and s is the grain size of the node. Fine-grain nodes have a smaller grain size and coarse-grain nodes have a larger grain size. The edge label (v,d) between two end nodes specifies v as the value passed as input to the destination node and d as the delay (memory latency or other delay). There are 17 nodes in the fine-grain program graph and 5 in the coarse-grain program graph shown below. A coarse-grain node is obtained by combining (grouping) multiple fine-grain nodes, which helps to reduce the communication delays and the overall scheduling overhead (i.e. schedule duration). This procedure is called grain packing. Usually the schedule (time duration) for the fine-grain graph is longer than the coarse-grain schedule, as shown in Figure 2.7 below.

The node (A,8) is obtained by combining the nodes (1,1), (2,1), (3,1), (4,1), (5,1), (6,1) and (11,2). The grain size 8 of node A is the sum of the grain sizes being combined (1 + 1 + 1 + 1 + 1 + 1 + 2 = 8).
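A minimal sketch of this packing step, using the node sizes just listed, is shown below; it simply sums the grain sizes of the fine-grain nodes grouped into the coarse node.

```python
# Grain packing: the grain size of a packed (coarse) node is the sum of the
# grain sizes of the fine-grain nodes it groups together.
fine_grain_size = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 11: 2}   # node id -> size

def pack(node_ids, sizes):
    """Return the grain size of the coarse node grouping the given fine nodes."""
    return sum(sizes[n] for n in node_ids)

print(pack([1, 2, 3, 4, 5, 6, 11], fine_grain_size))   # -> 8, i.e. node (A,8)
```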
2.2.3 Static Multiprocessor Scheduling
Grain packing does not always reduce the schedule (duration). The static multiprocessor scheduling technique called node duplication helps to reduce the schedule time further.
Node Duplication: In order to eliminate idle time and to further reduce the communication delays among processors, some of the nodes can be duplicated onto more than one processor. Figure 2.8a shows a schedule without duplicating any of the five nodes. This schedule contains idle time as well as a long interprocessor delay (8 units) between P1 and P2. In Fig. 2.8b, node A is duplicated as A' and assigned to P2, besides retaining the original copy A in P1. Similarly, a duplicated node C' is copied into P1 besides the original node C in P2. The new schedule shown in Fig. 2.8b is almost 50% shorter than that in Fig. 2.8a. Thus grain packing and node duplication together help to determine the grain size and the corresponding schedule. Four major steps are involved in grain determination and the process of scheduling optimization:
Step 1. Construct a fine-grain program graph.
Step 2. Schedule the fine-grain computation.
Step 3. Perform grain packing to produce the coarse grains.
Step 4. Generate a parallel schedule based on the packed graph.

2.3 : Notes for Program Flow Mechanism is given in Module 1: Chapter 3


2.4 System Interconnect Architectures
Static and dynamic networks can be used to interconnect computer subsystems or to build multiprocessors and multicomputers. Various topologies exist for these networks, and they can connect multiple processors, memory modules, and I/O adapters in a centralized or distributed system. The goal is to build a network with low latency, a high data transfer rate and wide communication bandwidth.
2.4.1 Network properties and Routing
1. Static networks are formed by point-to-point fixed direct connections which do not change during program execution. Dynamic networks provide reconfigurable connections between nodes which can change during program execution. The switch box is the basic component of a dynamic network; connections between nodes are established by setting a set of interconnected switch boxes.
2. The network is represented by a graph with a finite number of nodes linked by undirected or directed edges. The number of nodes in the graph is called the network size.
Some parameters which affect the complexity, communication efficiency and cost of a network are given below.
Node Degree and Network Diameter
The number of edges incident on a node is its node degree. If the network contains directed edges, then the number of incoming edges is the in-degree and the number of outgoing edges is the out-degree; the node degree is the sum of the two. The node degree influences the number of I/O ports required and thus the cost of a node, so it should be kept small (ideally constant) to reduce cost.
The diameter D of the network is the maximum, over all node pairs, of the shortest path length between them. From a communication point of view, the network diameter should be as small as possible.
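A minimal sketch of computing the diameter exactly as defined above (the maximum over all shortest-path lengths) is given below; the 4-node ring used as input is only an illustrative assumption.

```python
# Network diameter = max over all node pairs of the shortest-path length.
from collections import deque

def diameter(adj):
    """adj: dict mapping node -> iterable of neighbour nodes (undirected)."""
    best = 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:                          # breadth-first search from src
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

ring4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(diameter(ring4))   # -> 2, i.e. N/2 for a 4-node bidirectional ring
```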

Bisection Width.
When the network is cut into two equal halves, the minimum number of edges along the cut is called the channel bisection width b. Each edge corresponds to a channel of w bit wires, so the wire bisection width is B = bw, which represents the wiring density of the network; the channel width is w = B/b. Wire length affects signal latency, clock skew and power requirements. A network is labelled symmetric if the topology looks the same from every node.

Data Routing Functions


Data routing functions are used for data exchange among the multiple processors or PEs of a network. In a distributed multicomputer, data routing is done by message passing through hardware routers. Commonly used data routing functions include shifting, rotation, permutation (one-to-one), multicast (one-to-many), shuffle, exchange, etc. These routing functions can be implemented on ring, mesh, hypercube or multistage networks.

Perfect Shuffle and Exchange: The mapping is shown below, with its inverse on the right. In general, to shuffle n = 2^k objects, each object is represented by a k-bit binary number. If x and y are k-bit binary numbers, the perfect shuffle maps x to y, where y is obtained by cyclically shifting the bits of x left by one position, i.e. the most significant bit of x moves to the least significant position.
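A minimal sketch of the perfect-shuffle mapping (and its inverse, the rotate in the opposite direction) on k-bit addresses is given below.

```python
# Perfect shuffle on n = 2**k objects: rotate the k-bit address left by one
# bit (the MSB moves to the LSB position).  The inverse rotates right by one.
def perfect_shuffle(x, k):
    msb = (x >> (k - 1)) & 1
    return ((x << 1) & ((1 << k) - 1)) | msb

def inverse_shuffle(x, k):
    lsb = x & 1
    return (x >> 1) | (lsb << (k - 1))

k = 3                                           # 8 objects, 3-bit addresses
print([perfect_shuffle(x, k) for x in range(8)])
# -> [0, 2, 4, 6, 1, 3, 5, 7], e.g. 001 -> 010 and 100 -> 001
```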

Hypercube Routing Functions


A three-dimensional binary cube network is shown in Fig. 2.15 with three routing functions. In general, an n-dimensional hypercube has n routing functions, defined by each bit of the n-bit node address. For example, data can be exchanged between adjacent nodes which differ in the least significant bit C0, as shown in Fig. 2.15b. Similarly, two other routing patterns are obtained by checking the middle bit C1 and the most significant bit C2, respectively, as shown in the figure.
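A minimal sketch of these routing functions, where routing function Ci simply complements bit i of the node address, is given below.

```python
# Hypercube routing function Ci: pair each node with the node whose address
# differs only in bit position i (C0 = least significant bit).
def hypercube_route(node, i):
    return node ^ (1 << i)       # flip bit i of the node address

n = 3                             # 3-cube, nodes 0..7
print([(x, hypercube_route(x, 0)) for x in range(2 ** n)])
# C0 exchanges nodes differing in the least significant bit: (0,1), (2,3), ...
```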
Broadcast and Multicast: Broadcast is a one-to-all mapping. This can be achieved easily in an SIMD computer using a broadcast bus extending from the array controller to all PEs. A message-passing multicomputer also has mechanisms to broadcast messages. Multicast corresponds to a mapping from one PE to other PEs (one-to-many).
Network Performance: The performance of an interconnection network is affected by the
following factors:
Functionality​—This refers to how the network supports data routing, interrupt handling,
synchronization, request/message combining, and coherence.
Network Latency —This refers to the worst-case time delay for a unit message to be transferred through the network.
Bandwidth —This refers to the maximum rate at which data can be transferred through the network.
Hardware Complexity​ —This refers to implementation costs such as those for wires, switches,
connectors, arbitration, and interface logic.
Scalability ​—This refers to the ability of a network to be modularly expandable with a scalable
performance with increasing machine resources.

2.4.2 Static Connection Networks.


The topologies are given below.
Linear Array
Here N nodes are connected by N-1 links. Internal nodes have node degree 2 and the terminal nodes have node degree 1. The diameter is N-1 and the bisection width is 1. It is the simplest connection topology and is usually not suitable for large N. A linear array does, however, allow the concurrent use of different sections of the structure by different source and destination pairs.

Ring and Chordal Ring


A ring is obtained by connecting the two terminal nodes of a linear array with one extra link, as shown in part (b) of the figure below. A ring can be unidirectional or bidirectional. The node degree is 2, and the diameter is N/2 for a bidirectional ring and N for a unidirectional ring.
By increasing the node degree from 2 to 3 or 4, we obtain chordal rings, as shown in the figure below. In general, the more links added, the higher the node degree and the shorter the network diameter.
In the extreme, the completely connected network in part (f), with 16 nodes, has a node degree of 15 and a diameter of 1.
Barrel Shifter
Consider the number of nodes to be N = 2^n. A barrel shifter has node degree d = 2n-1 and diameter D = n/2. Node i is connected to node j if |j-i| = 2^r (mod N) for some r in {0, 1, ..., n-1}. For N = 16, the barrel shifter has a node degree of 7 with a diameter of 2. The barrel shifter's complexity is still much lower than that of the completely connected network.
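A minimal sketch of barrel-shifter connectivity under the definition above is given below; it reproduces the node degree of 7 for N = 16.

```python
# Barrel shifter with N = 2**n nodes: node i is linked to node j whenever
# |j - i| = 2**r (mod N) for some r in 0..n-1.
def barrel_neighbours(i, n):
    N = 2 ** n
    nbrs = set()
    for r in range(n):
        nbrs.add((i + 2 ** r) % N)
        nbrs.add((i - 2 ** r) % N)
    return sorted(nbrs)

print(barrel_neighbours(0, 4))        # neighbours of node 0 for N = 16
print(len(barrel_neighbours(0, 4)))   # -> 7, i.e. node degree d = 2n - 1
```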
Tree
A binary tree with 31 nodes and 5 levels is shown below. A k-level binary tree has N = 2^k - 1 nodes. The maximum node degree is 3 and the diameter is 2(k-1). With its constant node degree, the binary tree is scalable. The DADO multiprocessor was built at Columbia University (1987) with a 10-level binary tree of 1023 nodes. The only drawback is that there is a lot of traffic toward the root.

Star
The star is a two-level tree with a high node degree of d = N-1 at the central node and a small constant diameter of 2. It is generally used in systems with a centralized supervisor node, and is shown in the figure below.
Fat Tree.
A binary fat tree is shown in the figure below. The channel width of a fat tree increases as we ascend from the leaves to the root; the fat tree is more like a real tree in that the branches get thicker toward the root. Traffic at the root is lowered thanks to the higher channel width there. The idea of a fat tree was applied in the Connection Machine CM-5.

Mesh and Torus


The mesh is a frequently used architecture which has been implemented, with variations, in the Illiac IV, MPP, DAP, and Intel Paragon. It is shown in the figure below. In general, a k-dimensional mesh with N = n^k nodes has an interior node degree of 2k and a network diameter of k(n-1). The node degree at a boundary or corner node is 3 or 2.

The Illiac IV used an 8*8 Illiac mesh with a constant node degree of 4 and a diameter of 7. In general, the Illiac mesh is formed with wraparound connections as shown below, and its diameter is d = n-1, which is only half the diameter of a pure mesh.
The torus, shown in the figure below, has ring connections along each row and along each column of the array. In general, an n*n binary torus has a node degree of 4 and a diameter of 2⌊n/2⌋. The torus is a symmetric topology, and the added wraparound connections reduce the diameter by half compared with the mesh.

Systolic arrays are designed to implement fixed algorithms. The systolic array shown in the figure below is designed for matrix multiplication, and the interior node degree is 6 in this example. The commercial Intel iWarp system was designed with a systolic architecture. For special applications such as image/signal processing, systolic arrays may offer a better performance/cost ratio.
Hypercubes
In general, an n-cube consists of N = 2^n nodes spanning n dimensions. A 3-cube with 8 nodes is shown in the figure below, and a 4-cube is formed by interconnecting the corresponding nodes of two 3-cubes. Both the network diameter and the node degree equal n. The hypercube has poor scalability and is difficult to package. The Intel iPSC/1, iPSC/2 and nCUBE machines were built with the hypercube architecture.

Cube Connected Cycles


This architecture is a modification of the hypercube. A 3-cube is modified to form a 3-cube-connected-cycles (3-CCC) network with a network diameter of 6. The idea is to replace each corner vertex of the 3-cube with a ring (cycle) of 3 nodes, as shown in the figure below.

In general, a k-cube-connected-cycles network can be formed from a k-cube, which has 2^k vertices, by replacing each vertex with a cycle of k nodes. Thus a k-cube is transformed into a k-CCC with k * 2^k nodes, and the network diameter of the k-CCC is 2k. The major improvement of the CCC lies in its constant node degree of 3, which is independent of the dimension of the underlying hypercube. The CCC is therefore a better architecture for building scalable systems, provided the extra latency can be tolerated in some way.
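A minimal sketch tabulating the k-CCC parameters quoted above (k * 2^k nodes, constant node degree 3, diameter 2k) is given below.

```python
# k-CCC parameters as stated in the text: each of the 2**k hypercube vertices
# is replaced by a k-node cycle.
def ccc_parameters(k):
    return {"nodes": k * 2 ** k, "degree": 3, "diameter": 2 * k}

print(ccc_parameters(3))   # -> {'nodes': 24, 'degree': 3, 'diameter': 6}
```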
K-ary n-cube network
The 4-ary 3-cube network, with k = 4 and n = 3, is shown below. The parameter n is the dimension of the cube and k is the radix, i.e. the number of nodes along each dimension. The number of nodes in the network is N = k^n.

Every node in a k-ary n-cube network is identified by an n-digit radix-k address A = a1 a2 ... an. Low-dimensional k-ary n-cubes are called tori and high-dimensional ones are called hypercubes. The traditional torus (a 4-ary 2-cube) is shown in the figure below, but the wire length between nodes is uneven; the wire lengths are equalized by folding the network, as shown in the next figure.

Network Throughput
The network throughput is defined as the total number of messages the network can handle per
unit time.
A hot spot is a pair of nodes that accounts for a disproportionately large portion of the total
network traffic. Hot-spot traffic can degrade the performance of the entire network by causing
congestion. The hot-spot throughput of a network is the maximum rate at which messages can be sent from one specific node Pi to another specific node Pj.
2.4.3 Dynamic Connection Networks
Here, instead of fixed connections, switches or arbiters are used along the connecting paths to provide dynamic connectivity.
Digital buses
A bus system is essentially a collection of wires and connectors for data transactions among the processors, memory modules, and peripheral devices attached to the bus. The bus is used for only one transaction at a time between a source and a destination. In the case of multiple requests, the bus arbitration logic must allocate or deallocate the bus, servicing the requests one at a time. For this reason the digital bus is also called a contention bus or a time-sharing bus among multiple functional modules. The figure below shows a bus-connected multiprocessor system. The system bus provides a common communication path between the processors, the I/O subsystem, the memory modules, secondary storage devices, network adapters, etc. The active or master devices (processors or the I/O subsystem) generate requests to address the memory; the passive or slave devices (memories or peripherals) respond to the requests. The common bus is used on a time-sharing basis, and the important issues include bus arbitration, interrupt handling, coherence protocols, and transaction processing.
Switch Modules
An a*b switch module has a inputs and b outputs. A binary switch is a 2*2 switch module with a = b = 2; in theory a and b need not be equal. The table below lists several commonly used switch module sizes: 2*2, 4*4, and 8*8. Each input can be connected to one or more outputs, but conflicts must be avoided at the output terminals. In other words, one-to-one and one-to-many mappings are allowed, but many-to-one mappings are not allowed because of conflicts at the output terminal.

Multistage Interconnection Networks(MIN)


MINs have been used in both MIMD and SIMD computers. The general multistage network is shown in the figure below; multiple a*b switches are used in each stage, and the switches can be set dynamically to establish the desired connections between the inputs and outputs.
Different classes of MINs differ in the switch modules used and in the kind of interstage connection (ISC) patterns used. The simplest switch module is the 2*2 switch. ISC patterns often used include the perfect shuffle, butterfly, multiway shuffle, crossbar, cube connection, etc.
Omega Network
The figure below shows the four possible connection states of the 2*2 switch modules used for constructing the Omega network, together with a 16*16 Omega network. Four stages of 2*2 switches are needed: there are 16 inputs on the left and 16 outputs on the right, and the ISC pattern is the perfect shuffle over 16 objects. The outputs of each stage are connected to the inputs of the next stage using a perfect shuffle connection.
In general, an n-input Omega network requires log2 n stages of 2*2 switches, with n/2 switch modules per stage, so the network uses (n/2) log2 n switches in total. Each switch module is individually controlled.
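A small worked check of these switch counts is sketched below.

```python
# Omega-network switch count for n inputs: log2(n) stages of n/2 two-by-two
# switches, i.e. (n/2) * log2(n) switches in total.
from math import log2

def omega_switches(n):
    stages = int(log2(n))          # number of stages of 2x2 switches
    per_stage = n // 2             # switch modules per stage
    return stages, per_stage, stages * per_stage

print(omega_switches(16))          # -> (4, 8, 32) for the 16*16 network
```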
Baseline Network
The first stage contains one N*N block, and the second stage contains two N/2*N/2 subblocks, labelled C0 and C1. The construction process is applied recursively to the subblocks until the subblock size reduces to 2*2. The ultimate building blocks are thus 2*2 switches, each with two legitimate connection states: straight and crossover between the two inputs and two outputs. A 16*16 baseline network is shown below.
CrossBar Network
Crossbar networks provide the highest bandwidth and interconnection capability. A crossbar network can be visualized as a single-stage switch network in which each crosspoint switch can provide a dedicated connection path between a source/destination pair; the switch can be set on or off dynamically upon program demand. Two types of crossbar networks are shown in the figure below.
Interprocessor-Memory Crossbar Network: The pioneering C.mmp implemented a 16*16 crossbar network which connected 16 PDP-11 processors to 16 memory modules, each with a capacity of one million words. The 16 memory modules could be accessed by the processors in parallel, but each memory module can satisfy only one processor request at a time, so only one crosspoint switch can be set on in each column. However, several crosspoint switches can be set on simultaneously in order to support parallel memory accesses.
Interprocessor Crossbar Network
A large 224*224 crossbar of this type was actually built in the VPP500 vector parallel processor by Fujitsu Inc. (1992). The PEs are processors with attached memory, and the CPs are control processors used to supervise the entire system operation, including the crossbar networks. In this crossbar, only one crosspoint switch can be set on in each row and in each column at a time. The interprocessor crossbar provides permutation connections among the processors; only one-to-one connections are provided, so an n*n crossbar connects at most n (source, destination) pairs at a time.
