15CS72 ACA Module 1 Chapter 1
1.2 Multiprocessors and Multicomputers
Parallel computers come in two architectural classes: shared-memory multiprocessors and unshared, distributed-memory multicomputers.
In a NUMA multiprocessor, each processor may see a different access time for a particular memory word, hence the name non-uniform memory access (NUMA). Two NUMA models are given below.
1. Shared local-memory NUMA model (example: BBN TC-2000 Butterfly)
● Each processor has a local memory that is also shared with the other processors. Hence the collection of all local memories forms a global address space accessible by all processors, as shown in the figure below (a small sketch of this organization also follows this item).
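A minimal sketch of this organization, assuming hypothetical latency values and a simple division of the global address space across the local memories (the numbers and class names below are illustrative, not taken from the BBN TC-2000):

# Hypothetical sketch of a shared-local-memory NUMA machine.
class NUMAMachine:
    LOCAL_LATENCY = 1      # cycles to reach a word in the PE's own local memory (assumed)
    REMOTE_LATENCY = 10    # cycles to reach a word in another PE's local memory (assumed)

    def __init__(self, num_processors, words_per_local_memory):
        self.words_per_local = words_per_local_memory
        # The union of all local memories forms the global address space.
        self.memory = [[0] * words_per_local_memory for _ in range(num_processors)]

    def read(self, processor_id, global_address):
        owner, offset = divmod(global_address, self.words_per_local)
        latency = self.LOCAL_LATENCY if owner == processor_id else self.REMOTE_LATENCY
        return self.memory[owner][offset], latency

machine = NUMAMachine(num_processors=4, words_per_local_memory=1024)
value, cycles = machine.read(processor_id=0, global_address=2500)
print(cycles)  # 10: the word belongs to PE 2's local memory, so this is a remote access

Reading a word held in the processor's own local memory is fast, while reading a word owned by another processor goes through the interconnection network and costs more; that difference is exactly the non-uniformity that gives NUMA its name.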
Operational Model of SIMD Computers
An operational model of an SIMD computer is specified by a 5-tuple M = (N, C, I, M, R), where:
1. N is the number of processing elements (PEs) in the machine. For example, the Illiac IV had 64 PEs and the Connection Machine CM-2 had 65,536 PEs.
2. C is the set of instructions directly executed by the control unit (CU).
3. I is the set of instructions broadcast by the CU to all PEs for parallel execution.
These include arithmetic, logic, data routing, masking, and other local operations
executed by each active PE over data within that PE.
4. M is the set of masking schemes, where each mask partitions the set of PEs into
enabled and disabled subsets.
5. R specifies the data-routing schemes to be followed during inter-PE communication (a minimal sketch of this five-tuple model follows this list).
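As a rough illustration only (the class and method names below are invented for this sketch, and the scalar instruction set C executed by the CU is simply ordinary host Python here), the five-tuple can be mimicked in a few lines:

# Illustrative sketch of the SIMD machine model (N, C, I, M, R); names invented for this example.
class SIMDMachine:
    def __init__(self, n_pes):
        self.n = n_pes                     # N: number of processing elements
        self.registers = [0] * n_pes       # one local register per PE
        self.mask = [True] * n_pes         # M: current mask (enabled/disabled PEs)

    def set_mask(self, mask):
        # M: partition the PEs into enabled and disabled subsets.
        self.mask = list(mask)

    def broadcast(self, op, operand):
        # I: an instruction broadcast by the CU, executed only by enabled PEs.
        for pe in range(self.n):
            if self.mask[pe]:
                if op == "load":
                    self.registers[pe] = operand
                elif op == "add":
                    self.registers[pe] += operand

    def route(self, distance):
        # R: a simple routing scheme -- circular shift of data among the PEs.
        self.registers = [self.registers[(pe - distance) % self.n] for pe in range(self.n)]

m = SIMDMachine(n_pes=8)
m.broadcast("load", 5)                          # every PE loads 5
m.set_mask([pe % 2 == 0 for pe in range(8)])    # enable only the even-numbered PEs
m.broadcast("add", 1)                           # masked add: only enabled PEs execute it
print(m.registers)                              # [6, 5, 6, 5, 6, 5, 6, 5]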
Operational Specification of the MasPar MP-1 Computer
1. The MP-1 had 1,024 to 16,384 PEs.
2. The control unit (CU) executed scalar instructions, broadcast vector instructions to the PE array, and controlled inter-PE communication.
3. Each PE was a register-based load/store RISC processor capable of handling integer and floating-point computations.
4. The masking scheme was built into each PE and continuously monitored by the CU, which could set and reset the status of each PE dynamically at run time.
5. The MP-1 had an X-Net mesh connecting each PE to its eight nearest neighbours, plus a global multistage crossbar router, for inter-PE communication (see the sketch after this list).
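The X-Net neighbourhood can be pictured by indexing each PE's eight nearest neighbours on a rows × cols mesh. The wraparound at the mesh edges below is an assumption made to keep the sketch short, not a claim about how the MP-1 wires its boundary PEs:

# Hypothetical sketch: the eight X-Net-style neighbours of a PE on a rows x cols mesh.
def xnet_neighbours(pe, rows, cols):
    r, c = divmod(pe, cols)
    neighbours = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue                              # skip the PE itself
            nr, nc = (r + dr) % rows, (c + dc) % cols  # assumed toroidal wraparound
            neighbours.append(nr * cols + nc)
    return neighbours

print(xnet_neighbours(pe=5, rows=4, cols=4))  # the 8 mesh neighbours of PE 5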
Some Early Commercial SIMD Supercomputers
1. MasPar Computer Corporation MP-1 family
2. Thinking Machines Corporation CM-2
3. DAP600
PRAM Variants: Described below are four variants of the PRAM model, depending on
how the memory reads and writes are handled.
EREW-PRAM model→ This model forbids more than one processor from reading or
writing the same memory cell simultaneously.
CREW-PRAM model→ Concurrent reads of the same memory location are allowed, but write conflicts are avoided by mutual exclusion.
ERCW-PRAM model→ Reads must be exclusive, but concurrent writes to the same memory location are allowed.
CRCW-PRAM model→ This model allows both concurrent reads and concurrent writes to the same memory location.
In reality, such idealized parallel machines do not exist. An algorithm written for the CREW model can run faster than an equivalent EREW algorithm, since concurrent reads are permitted. It has been proved that the best n-processor EREW algorithm can be no more than O(log n) times slower than any n-processor CRCW algorithm. These models are used by computer scientists for complexity, performance, and scalability analysis.
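A classic illustration of this gap is computing the logical OR of n bits: on a common-write CRCW-PRAM every processor holding a 1 may write 1 into the same cell in a single step, while an EREW-PRAM must combine the bits pairwise in a binary tree, taking about log2 n steps. The sketch below only simulates the step counts; it is not a real PRAM, and the function names are invented for this example.

# Illustrative step-count simulation of OR on CRCW vs EREW PRAMs (names invented).

def crcw_or(bits):
    # Common-write CRCW: all processors holding a 1 write 1 to one cell in a single step.
    result = 1 if any(bits) else 0
    return result, 1                      # one concurrent-write step

def erew_or(bits):
    # EREW: pairwise binary-tree reduction; no cell is read or written by two processors at once.
    values, steps = list(bits), 0
    while len(values) > 1:
        values = [values[i] | values[i + 1] if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
        steps += 1                        # one parallel step per level of the tree
    return values[0], steps

bits = [0] * 1000 + [1]                   # n = 1001 input bits
print(crcw_or(bits))                      # (1, 1)
print(erew_or(bits))                      # (1, 10) -- about log2(1001) parallel steps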
Three bounds on VLSI circuits are given below. They are obtained by setting limits on memory, I/O, and communication for implementing parallel algorithms with VLSI chips.
Memory Bound on Chip Area A
The amount of memory available to a computation is limited by the chip area A, since it depends on how densely information can be placed on the chip. As depicted in Fig. 1.15, the memory requirement of a computation therefore sets a lower bound on the chip area A.
I/O Bound on Volume AT
The computation is represented as a rectangular volume with base area A (the chip) and height T (the execution time), so the volume is the product AT. As information flows through the chip for a period of time T, the number of input/output bits cannot exceed this volume, which represents the total amount of information flowing through the chip during the entire course of the computation.
Bisection Communication Bound √A·T
The bisection is represented by a vertical slice through the cube. The width of this vertical slice is √A and its height is T, so the bisection area is √A·T. This area limits the maximum amount of information that can be exchanged between the two halves of the chip circuit during the time period T.
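The three bounds can be summarized by the following inequalities, written in notation of my own choosing: M is the memory requirement of the computation, I_io the number of I/O bits, I_bis the number of bits that must cross the bisection, and the c_i are implementation-dependent constants.

\[
A \;\ge\; c_1\,M, \qquad
A\,T \;\ge\; c_2\,I_{\mathrm{io}}, \qquad
\sqrt{A}\,T \;\ge\; c_3\,I_{\mathrm{bis}} .
\]

Squaring the bisection bound gives the familiar AT^2 form of the VLSI complexity bound:

\[
A\,T^{2} \;\ge\; c_3^{2}\, I_{\mathrm{bis}}^{2}.
\]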
Note
The efficiency of an algorithm is measured through its time complexity and space complexity. The time complexity is a measure of execution time and is a function of the problem size s. For example, the time complexity g(s) is said to be O(f(s)) if there exist positive constants c1, c2, and s0 such that c1 f(s) ≤ g(s) ≤ c2 f(s) for all nonnegative values of s > s0.
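For instance, with g(s) = 3s^2 + 5s the definition above is satisfied by f(s) = s^2; the constants below are chosen purely for illustration:

\[
s^{2} \;\le\; 3s^{2} + 5s \;\le\; 4s^{2} \quad \text{for all } s \ge 5,
\]

so one may take c1 = 1, c2 = 4, and s0 = 5, giving g(s) = O(s^2).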
The space complexity can similarly be defined as a function of the problem size s. A deterministic algorithm is one that produces the same output every time it is run on the same input; a nondeterministic algorithm contains operations whose outcome is selected from a set of possible outcomes. The set of problems solvable in polynomial time by deterministic algorithms is called the class P, and the set of problems solvable in polynomial time by nondeterministic algorithms is called the class NP. Most computer scientists believe that P != NP. This leads to the conjecture that there exists a subclass of NP, called the NP-complete (NPC) problems, which are considered the hardest ones to solve. Only approximation algorithms can be derived for solving NP-complete problems in polynomial time.