
15CS72 ACA Module1 Chapter1FinalCopy

The document discusses various architectures of parallel computers, including shared memory multiprocessors (UMA, NUMA, COMA) and distributed memory multicomputers. It also covers vector and SIMD supercomputers, detailing their operational models and examples of early commercial systems. Additionally, it introduces theoretical models like PRAM and VLSI complexity models for analyzing parallel algorithms and their performance.


1.2 Multiprocessors and Multicomputers

Parallel computers fall into two architectural classes: shared-memory
multiprocessors and unshared, distributed-memory multicomputers.

1.2.1 Shared-Memory Multiprocessors

There are three models of shared-memory multiprocessors:

1. UMA Model
2. NUMA Model
3. COMA Model

UMA (Uniform Memory Access) Model


1. Here the physical memory is uniformly shared by all the processors.
2. All processors have equal access time to every memory word, which is why it is
called uniform memory access.

3. The system interconnect is a common bus, a crossbar switch, or a multistage
network.


4. When all processors have equal access to all peripheral devices, the system is
called a symmetric multiprocessor. In this case, all the processors are equally
capable of running the executive programs, such as OS kernel and I/O service
routines.
5. In an asymmetric multiprocessor, only the master processor can execute the
operating system and handle I/O. The remaining processors have no I/O capability
and are therefore called attached processors; they execute user code under the
supervision of the master processor.
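To make the shared-address-space idea concrete, here is a minimal sketch using POSIX threads (the thread count and array size are illustrative assumptions, not from the original notes); every thread reads and writes the same array at uniform cost, which is exactly the programming model a UMA machine supports:

```c
/* Minimal UMA-style sketch: one shared physical memory, many processors.
 * Compile with: cc uma.c -lpthread. Sizes are illustrative. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define CHUNK    1000

static double shared_mem[NTHREADS * CHUNK];   /* uniformly shared by all threads */

static void *worker(void *arg) {
    long id = (long)arg;
    /* Any thread may touch any word of shared_mem at the same cost. */
    for (long i = id * CHUNK; i < (id + 1) * CHUNK; i++)
        shared_mem[i] = 2.0 * (double)i;
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("shared_mem[42] = %.1f\n", shared_mem[42]);   /* prints 84.0 */
    return 0;
}
```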
NUMA (Non-Uniform Memory Access) Model

Here the memory access time varies with the location of the memory word relative to
the processor, hence the name non-uniform memory access. The two NUMA models are
given below.
1. Shared Local-Memory NUMA Model. Example: the BBN TC-2000 Butterfly.
● Each processor has a local memory that is also shared with the other
processors; the collection of all local memories forms a global address
space accessible by all processors.
● A processor accesses its own local memory fastest. Access to remote
memory attached to another processor takes longer because of the added
delay through the interconnection network.
2. Hierarchical Cluster Model. Example: the Cedar multiprocessor, built at the
University of Illinois.
● Each cluster is a collection of multiple processors.
● All processors belonging to the same cluster uniformly access the
cluster shared memory (CSM).
● All clusters have equal access to the global memory, but the access
time to the cluster memory is shorter than that to the global memory.
● Local memory access is fastest, global memory access is next, and
access to remote (other-cluster) memory is slowest.
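As a hedged illustration of why locality matters on NUMA hardware, the sketch below uses Linux's libnuma to place a buffer on a specific node (the node number and buffer size are illustrative; link with -lnuma). A processor on node 0 then enjoys local-memory latency, while processors on other nodes pay the remote-access penalty described above:

```c
/* Sketch of NUMA-aware allocation with libnuma (Linux; link with -lnuma). */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {               /* no NUMA support on this system */
        fprintf(stderr, "NUMA not supported\n");
        return 1;
    }
    size_t size = 1 << 20;                    /* 1 MiB, illustrative */
    /* Memory placed on node 0: local (fast) for node-0 processors,
     * remote (slower, via the interconnect) for all other nodes. */
    double *buf = numa_alloc_onnode(size, 0);
    if (buf != NULL) {
        buf[0] = 3.14;
        printf("allocated %zu bytes on node 0\n", size);
        numa_free(buf, size);
    }
    return 0;
}
```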

COMA (Cache-Only Memory Architecture) Model


1. Here each processor node has a cache memory; there is no memory hierarchy
at the node.
2. All the caches together form a global address space.
3. Remote access to any cache is assisted by a distributed cache directory.
4. Whenever data in a remote cache is accessed, it migrates to the node where it
will be used. This reduces the number of redundant copies and allows more
efficient use of memory resources.
5. Example: Kendall Square Research's KSR-1 machine.
Some Early Commercial Multiprocessor Systems
1. Sequent Symmetry S81
2. IBM ES/9000
3. BBN TC-2000
1.2.2 Distributed Memory Multicomputers
1. The system consists of multiple computers, often called nodes, which are
interconnected by a message-passing network.
2. Each node is an autonomous computer consisting of a processor, local memory,
and sometimes attached disks or I/O peripherals.
3. Each node has private local memory that is not accessible by other nodes;
hence such systems are also called no-remote-memory-access (NORMA) machines.
4. Internode communication is carried out by passing messages through the static
connection network (a minimal message-passing sketch follows the examples below).
5. The nodes are connected through various static network topologies such as rings,
trees, meshes, tori, hypercubes, and cube-connected cycles.
6. Various communication patterns are demanded among the nodes, such as
one-to-one, broadcasting, permutations, and multicast patterns.
Examples:
1. The Caltech Cosmic Cube and the Intel iPSC/1 used hypercube architectures
with software-controlled message switching.
2. The Intel Paragon and the Parsys SuperNode 1000 used mesh architectures
with hardware-controlled message routing.
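Since NORMA nodes share no memory, all cooperation happens through explicit messages. The following minimal MPI sketch (MPI is the modern descendant of such message-passing systems; the token value is illustrative) passes a token around a ring of nodes, a simple instance of the one-to-one communication pattern mentioned above:

```c
/* Ring-style message passing over a multicomputer's network using MPI.
 * Run with at least 2 processes, e.g.: mpirun -np 4 ./ring */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this node's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of nodes */

    int token;
    if (rank == 0) {
        token = 42;                          /* illustrative payload */
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("token came back to node 0: %d\n", token);
    } else {
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```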

Some Early Commercial Multicomputer Systems


1. Intel Paragon XP/S
2. nCUBE/2 6480
3. Parsys SuperNode 1000
Gordon Bell (1992) provided a taxonomy of MIMD machines. Multiprocessors have a
single address space; multiprocessors using centrally shared memory have limited
scalability. Multicomputers use distributed memories with multiple address spaces
and are scalable with distributed memory.
1.3 MultiVector and SIMD Computers
1.3.1 Vector supercomputers
1. The program and data are loaded into main memory from the host computer.
2. All instructions are first decoded by the scalar control unit. If the decoded
instruction is a scalar operation, it is executed directly by the scalar
processor using the scalar functional pipelines.
3. If the instruction decodes as a vector operation, it is sent to the vector
control unit.
4. The vector control unit manages the flow of vector data between the vector
functional units and main memory.
5. There are multiple pipelined vector functional units. Forwarding results from
one vector functional unit directly into another is called vector chaining
(sketched below).
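As a plain-C sketch of what chaining buys (the loop stands in for hardware pipelines; the function name and vector length are illustrative), consider Y = A × B + C: the multiply pipeline's result streams straight into the add pipeline instead of making a round trip through memory:

```c
/* Illustrative semantics of vector chaining for Y = A*B + C.
 * Real hardware overlaps the two pipelines element by element. */
#define VLEN 64   /* a common vector register length, e.g. the Cray series */

void chained_multiply_add(double y[VLEN], const double a[VLEN],
                          const double b[VLEN], const double c[VLEN]) {
    for (int i = 0; i < VLEN; i++) {
        double t = a[i] * b[i];   /* output of the multiply pipeline ... */
        y[i] = t + c[i];          /* ... chained directly into the add pipeline */
    }
}
```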
There are two types of vector processors:
1. Register-to-register vector processors
2. Memory-to-memory vector processors
Register-to-Register Vector Processor
1. Vector registers are used to hold the vector operands, intermediate and final
vector results.
2. The vector functional pipelines retrieve operands from and put results into the
vector registers. All vector registers are programmable for user instructions.
3. The length of each vector register is usually fixed; for example, a vector
register in a Cray Series supercomputer holds sixty-four 64-bit components.

Memory-to-Memory Vector Processor


1. Vector operands and results are retrieved directly from and stored directly into
main memory in superwords of, say, 512 bits, as in the Cyber 205.
Some Early Commercial Vector Supercomputers
1. The DEC VAX 9000 was Digital's largest mainframe system, providing concurrent
scalar/vector and multiprocessing capabilities.
2. The Cray Y-MP family offered both vector and multiprocessing capabilities.

1.3.2 SIMD Supercomputers


The operational model of an SIMD machine is specified by a 5-tuple:
M = (N, C, I, M, R)

1. N is the number of processing elements (PEs) in the machine. For example, the
Illiac IV had 64 PEs and the Connection Machine CM-2 had 65,536 PEs.
2. C is the set of instructions directly executed by the control unit (CU).
3. I is the set of instructions broadcast by the CU to all PEs for parallel execution.
These include arithmetic, logic, data routing, masking, and other local operations
executed by each active PE over data within that PE.
4. M is the set of masking schemes, where each mask partitions the set of PEs into
enabled and disabled subsets.
5. R specifies the data-routing schemes to be followed during inter-PE
communication.
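The role of the masking schemes M and the broadcast instruction set I can be sketched in plain C (the structure names and PE count below are illustrative, not from any real machine): the CU issues one instruction, and only the PEs enabled by the current mask execute it:

```c
/* Sketch of one SIMD broadcast step with masking. The loop stands in
 * for N processing elements acting simultaneously. */
#include <stdio.h>

#define N 8                /* number of PEs; illustrative */

typedef struct {
    int reg;               /* a local register inside the PE */
    int enabled;           /* set by the current masking scheme (M) */
} PE;

/* The CU broadcasts one instruction from I; only enabled PEs execute it. */
void broadcast_add(PE pe[], int operand) {
    for (int i = 0; i < N; i++)
        if (pe[i].enabled)
            pe[i].reg += operand;
}

int main(void) {
    PE pe[N];
    for (int i = 0; i < N; i++) {
        pe[i].reg = i;
        pe[i].enabled = (i % 2 == 0);   /* mask: enable even-numbered PEs */
    }
    broadcast_add(pe, 100);             /* odd-numbered PEs sit this one out */
    for (int i = 0; i < N; i++)
        printf("PE%d: %d\n", i, pe[i].reg);
    return 0;
}
```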
Operational Specification of the MasPar MP-1 Computer
1. The MP-1 had 1,024 to 16,384 PEs.
2. The control unit executed scalar instructions, broadcast vector instructions
to the PE array, and controlled inter-PE communication.
3. Each PE was a register-based load/store RISC processor capable of handling
integer and floating-point computations.
4. The masking scheme was built into each PE and continuously monitored by the
CU, which could set and reset the status of each PE dynamically at run time.
5. The MP-1 had an X-Net mesh network connecting each PE to its 8 nearest
neighbours, plus a global multistage crossbar router for inter-CU-PE
communication.
Some Early Commercial SIMD Supercomputers
1. MasPar Computer Corporation MP-1 family
2. Thinking Machines Corporation CM-2
3. DAP600

1.4 PRAM and VLSI Complexity Models


These are theoretical models of parallel computers that help in developing parallel
algorithms and in analyzing scalability and programmability. No real computer system
can behave exactly like the PRAM, but the model provides a basis for studying
parallel algorithms and their performance in terms of time and space complexity.

1.4.1 Parallel Random Access Machines


An n-processor PRAM has a globally addressable memory, which may be centralized
or distributed among the processors. The n processors operate on a synchronized
read-memory, compute, and write-memory cycle. Four memory-update options are
possible:
Exclusive Read (ER) — allows at most one processor to read from any memory
location in each cycle.
Exclusive Write (EW) — allows at most one processor to write into a memory
location at a time.
Concurrent Read (CR) — allows multiple processors to read the same information
from the same memory cell in the same cycle.
Concurrent Write (CW) — allows simultaneous writes to the same memory location;
some policy must therefore be set up to resolve write conflicts.
Various combinations of the above options lead to several variants of the PRAM
model. Since CR does not create a conflict problem, the variants differ mainly in
how they handle CW conflicts. The four variants below are distinguished by how
memory reads and writes are handled.
EREW-PRAM model → Forbids more than one processor from reading or writing the
same memory cell simultaneously.
CREW-PRAM model → Concurrent reads of the same memory location are allowed;
write conflicts are avoided by mutual exclusion.
ERCW-PRAM model → Allows exclusive reads but concurrent writes to the same
memory location.
CRCW-PRAM model → Allows both concurrent reads and concurrent writes to the
same memory location.
In reality such idealized machines do not exist; the models are used by computer
scientists for complexity, performance, and scalability analysis. An algorithm for
a stronger model can run faster: a CREW algorithm may outperform an equivalent EREW
algorithm, and it has been proved that the best n-processor EREW algorithm can be
at most O(log n) times slower than any n-processor CRCW algorithm.
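A classic illustration of the gap between the models is computing the logical OR of n bits: on a common-CRCW PRAM it takes a single cycle, because every processor holding a 1 may write the same value to one shared cell, while an EREW machine must combine the bits pairwise over O(log n) cycles. The sequential loop below is a sketch that stands in for one parallel cycle:

```c
/* CRCW-PRAM sketch: OR of N bits in one conceptual cycle.
 * Concurrent writes are safe here because all writers write the value 1. */
#include <stdio.h>

#define N 8

int crcw_or(const int bit[N]) {
    int cell = 0;                 /* shared memory cell, initially 0 */
    for (int p = 0; p < N; p++)   /* stands in for N simultaneous processors */
        if (bit[p])
            cell = 1;             /* concurrent write of a common value */
    return cell;
}

int main(void) {
    int bits[N] = {0, 0, 1, 0, 0, 0, 0, 0};
    printf("OR = %d\n", crcw_or(bits));   /* prints OR = 1 */
    return 0;
}
```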

1.4.2 VLSI Complexity Model


Parallel computers use VLSI chips to fabricate processor arrays, memory arrays,
and so on.

AT² Model
Let A be the chip area, and let the latency T be the time required from when the
inputs are applied until all outputs are produced for a single problem instance.
Let s be the size of the problem. Then there exists a lower bound f(s) such that

A · T² ≥ O(f(s))

The chip is represented by the base area in the two horizontal dimensions, and the
vertical dimension corresponds to time. The three-dimensional solid therefore
represents the history of the computation performed by the chip, as shown in
figure 1.15.

Three bounds on VLSI circuits are described below. They are obtained by setting
limits on memory, I/O, and communication when implementing parallel algorithms
with VLSI chips.
Memory Bound on Chip Area A
The amount of memory a chip can hold is limited by how densely information can be
placed on it. As depicted in Fig. 1.15, the memory requirement of a computation
therefore sets a lower bound on the chip area A.
I/O Bound on Volume AT
The product AT represents the volume of the rectangular solid. As information flows
through the chip for a period of time T, the number of input bits cannot exceed this
volume, which represents the amount of information flowing through the chip during
the entire course of the computation.
Bisection Communication Bound on √A · T
The bisection is represented by a vertical slice through the solid, of width √A and
height T. Its cross-sectional area √A · T bounds the maximum amount of information
that can be exchanged between the two halves of the chip circuit during the time
period T.
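The three bounds can be restated compactly as follows (a summary of the statements above, with the problem-dependent quantities on the right):

```latex
% A = chip area, T = latency of one problem instance.
\begin{align*}
A           &\ge \Omega(\text{memory required by the computation}) \\
A\,T        &\ge \Omega(\text{number of bits flowing through the chip}) \\
\sqrt{A}\,T &\ge \Omega(\text{bits exchanged across the bisection})
\end{align*}
```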
Note
The efficiency of an algorithm is measured through its time complexity and space
complexity. The time complexity is a measure of execution time as a function of the
problem size s: a time complexity g(s) is said to be O(f(s)) if there exist positive
constants c1, c2 and s0 such that c1 f(s) ≤ g(s) ≤ c2 f(s) for all non-negative
values of s > s0. The space complexity is likewise defined as a function of the
problem size s.
A deterministic algorithm produces the same output every time the program is run,
whereas a nondeterministic algorithm contains operations whose result is one
outcome from a set of possible outcomes. The set of problems solvable in polynomial
time by deterministic algorithms is called the P class, and the set of problems
solvable by nondeterministic algorithms in polynomial time is called the NP class.
Most computer scientists believe that P ≠ NP, which leads to the conjecture that
there exists a subclass of the hardest problems in NP, called NP-complete (NPC)
problems. Only approximation algorithms can be derived to solve NP-complete
problems in polynomial time.
