Unit 1: Introduction to Computer Models
Structure
1.1 Introduction
1.2 The State of Computing
1.2.1 Evolution of Computer System
1.2.2 Elements of Modern Computers
1.2.3 Flynn's Classical Taxonomy
1.2.4 Performance Attributes
1.3 Multiprocessor and Multicomputers
1.3.1 Shared Memory Multiprocessors
1.3.2 Distributed Memory
1.4 Multivector and SIMD Computers
1.4.1 Vector Hardware
1.4.2 SIMD Array Processors
1.5 Architectural Development Tracks
1.5.1 Multiple-Processor Tracks
1.6 Summary
1.7 Check Your Progress
1.8 Questions and Exercises
1.9 Key Terms
1.10 Further Readings
Objectives
After studying this unit, you should be able to:
z Understand the state of computing
z Know about multiprocessors and multicomputers
z Understand the concept of multivector and SIMD computers
1.1 Introduction
From an application point of view, mainstream computer usage is moving through four ascending levels of sophistication:
z Data processing
z Information processing
z Knowledge processing
z Intelligence processing
As more and more data structures have been developed, many users are shifting their use of computers from pure data processing to information processing. A high degree of parallelism has been found at these levels. As accumulated knowledge bases have expanded rapidly in recent years, a strong demand has grown to use computers for
knowledge processing. Intelligence is very difficult to create, and its processing is even more difficult. Today's computers are very fast and obedient and have enough reliable memory cells to be qualified for data, information, and knowledge processing.
Parallel processing is emerging as one of the key technologies of modern computing. Parallelism appears in various forms, such as lookahead, vectorization, concurrency, simultaneity, data parallelism, interleaving, overlapping, multiplicity, replication, multiprogramming, multithreading and distributed computing, at different processing levels.
1.2 The State of Computing
Modern computers are equipped with powerful hardware technology and, at the same time, loaded with sophisticated software packages. To assess the state of the art of computing, we first review the history of computers and then study the attributes used to analyse computer performance.
1.2.1 Evolution of Computer System
Presently, the technology used to design the hardware components of computers and their overall architecture is changing very rapidly. For example, processor clock rates increase by about 20% a year and logic capacity improves by about 30% a year; memory speed increases by about 10% a year while memory capacity grows by about 60% a year; disk capacity also increases by about 60% a year; and the overall cost per bit improves by about 25% a year.
But before we go further into the design and organization issues of parallel computer architecture, it is necessary to understand how computers have evolved. Initially, people used simple mechanical devices for computation: the abacus (about 500 BC), knotted strings, and the slide rule. Early computing machines were entirely mechanical or electromechanical: the mechanical adder/subtracter (Pascal, 1642), the difference engine design (Babbage, 1827), the binary mechanical computer (Zuse, 1941), and the electromechanical decimal machine (Aiken, 1944). Some of these machines used the idea of a stored program; famous examples are the Jacquard Loom and Babbage's Analytical Engine, which is also often considered the first real computer. Mechanical and electromechanical machines have limited speed and reliability because of their many moving parts. Modern machines use electronics for most information transmission.
Computing is normally thought of as being divided into generations. Each
successive generation is marked by sharp changes in hardware and software
technologies. With some exceptions, most of the advances introduced in one
generation are carried through to later generations. We are currently in the fifth
generation.
First Generation of Computers (1945-54)
The first generation of computers was based on vacuum tube technology. The first large electronic computer was ENIAC (Electronic Numerical Integrator and Calculator), which used high-speed vacuum tube technology and was designed primarily to calculate the trajectories of missiles. These machines used separate memory blocks for program and data. In 1946 John von Neumann introduced the concept of the stored program, in which data and program are stored in the same memory block. Based on this concept, EDVAC (Electronic Discrete Variable Automatic Computer) was built in 1951. The IAS (Institute for Advanced Study, Princeton) computer was also built on this concept; its main characteristic was a CPU consisting of two units (a program flow control unit and an execution unit).
In general, the key features of this generation of computers were:
1. The switching devices used were vacuum tubes, with switching times between 0.1 and 1 millisecond.
2. A major concern for computer manufacturers of this era was that each computer had a unique design, so one could not upgrade a machine or replace a component with one from another computer. Programs written for one machine could not execute on another machine, even when both machines were designed by the same company. This was a major concern for designers, as there were no upward-compatible machines or computer architectures with multiple, differing implementations, and designers always tried to make each new machine upward compatible with the older ones.
3. The concept of specialized registers was introduced: for example, index registers were introduced in the Ferranti Mark I, a register that saves the return address was introduced in the UNIVAC I, immediate operands appeared in the IBM 704, and the detection of invalid operations was introduced in the IBM 650.
4. Punched cards and paper tape were the devices used at that time for storing programs. By the end of the 1950s the IBM 650 had become one of the popular computers of the time; it used a drum memory onto which programs were loaded from punched cards or paper tape. Some high-end machines also introduced core memory, which provided higher speeds, and hard disks started becoming popular.
5. As noted earlier, computers of the early 1950s were design-specific, and most were built for particular numerical processing tasks. Many of them even used decimal numbers as the base number system of their instruction sets; in such machines there were actually ten vacuum tubes per digit in each register.
6. The software used was machine language and assembly language.
7. These machines were mostly designed for scientific calculation; later, some systems were developed for simple business applications.
8. Architecture features
Vacuum tubes and relay memories
CPU driven by a program counter (PC) and accumulator
Machines had only fixed-point arithmetic
9. Software and Applications
Machine and assembly language
Single user at a time
No subroutine linkage mechanisms
Programmed I/O required continuous use of CPU
10. Examples: ENIAC, Princeton IAS, IBM 701
Second Generation of Computers (1954-64)
The transistor was invented by Bardeen, Brattain and Shockley in 1947 at Bell Labs, and by the 1950s transistors had started an electronic revolution, as a transistor is smaller and cheaper and dissipates less heat than a vacuum tube. Transistors were now used instead of vacuum tubes to construct computers. Another major invention was that of magnetic cores for storage; these cores were used to build large random access memories. Computers of this generation had better processing speed, larger memory capacity and smaller size compared with the previous generation.
The key features of this generation of computers were:
1. Second-generation computers were designed using germanium transistors, a technology much more reliable than vacuum tube technology.
2. The use of transistor technology reduced switching times to 1 to 10 microseconds, thus providing an overall speedup.
3. Magnetic cores were used for main memory, with capacities of about 100 KB. Tapes and disks were used as secondary (peripheral) memory.
4. The concept of an instruction set was introduced, so that the same program could be executed on different systems.
5. High-level languages such as FORTRAN, COBOL and ALGOL appeared, along with batch operating systems.
6. Computers were now used for extensive business applications, engineering design, optimization using linear programming, and scientific research.
7. The binary number system was widely used.
8. Technology and Architecture
Discrete transistors and core memories
I/O processors, multiplexed memory access
Floating-point arithmetic available
Register Transfer Language (RTL) developed
9. Software and Applications
High-level languages (HLL): FORTRAN, COBOL, ALGOL with compilers and
subroutine libraries. Batch operating system was used although mostly single user
at a time .
10. Examples: CDC 1604, UNIVAC LARC, IBM 7090
Third Generation of Computers (1965-74)
In the 1950s and 1960s, discrete components (transistors, resistors, capacitors) were manufactured and packaged in separate containers. To design a computer, these discrete units were soldered or wired together on circuit boards. Another revolution in computer design came in the 1960s, when programs such as the Apollo guidance computer and the Minuteman missile drove the development and use of integrated circuits (commonly called ICs). These ICs made circuit design more economical and practical, and IC-based computers are called third generation computers. Because an integrated circuit places transistors, resistors and capacitors on a single chip, eliminating wired interconnections, the space required for a computer was greatly reduced.
By the mid-1970s, the use of ICs in computers had become very common and the price of transistors had dropped greatly. It was now possible to put all the components required for a CPU on a single printed circuit board. This advancement of technology resulted in the development of minicomputers, usually with a 16-bit word size and memory in the range of 4K to [Link]. This began a new era of microelectronics, in which it became possible to design small identical chips (thin wafers of silicon), each containing many gates plus a number of input/output pins.
Key Features of Third Generation Computers
1. The use of silicon-based ICs led to major improvements in computer systems. Transistor switching speed improved by a factor of 10, size was reduced by a factor of 10, reliability increased by a factor of 10, and power dissipation was reduced by a factor of 10. The cumulative effect was the emergence of extremely powerful CPUs capable of carrying out about 1 million instructions per second.
2. The size of main memory reached about 4 MB through improved magnetic core memory designs, and hard disks of 100 MB became feasible.
3. Online systems became feasible. In particular, dynamic production control systems, airline reservation systems, interactive query systems, and real-time closed-loop process control systems were implemented.
4. The concept of integrated database management systems emerged.
5. 32-bit instruction formats were introduced.
6. Time-sharing operating systems appeared.
7. Technology and Architecture features
Integrated circuits (SSI/MSI)
Microprogramming
Pipelining, cache memories, lookahead processing
8. Software and Applications
Multiprogramming and time-sharing operating systems
Multi-user applications
9. Examples: IBM 360/370, CDC 6600, TI ASC, DEC PDP-8
Fourth Generation of Computers (1975-90)
The microprocessor was invented: a complete CPU on a single VLSI (Very Large Scale Integration) chip. Main memory chips of 1 MB and beyond were also introduced as single VLSI chips. Caches were introduced and placed between the main memory and the microprocessor. These VLSI devices greatly reduced the space required by a computer and significantly increased computational speed.
z Technology and Architecture features
LSI/VLSI circuits, semiconductor memory
Multiprocessors, vector supercomputers, multicomputers
Shared or distributed memory
Vector processors
z Software and Applications
Multiprocessor operating systems, languages, compilers, parallel software tools
z Examples: VAX 9000, Cray X-MP, IBM 3090, BBN TC2000
Fifth Generation of Computers (1990 onwards)
In the mid-to-late 1980s, in order to further improve system performance, designers started using a technique known as "instruction pipelining". The idea is to break instruction execution into small steps so that the processor can work on several instructions in different stages of completion.
Example: While calculating the result of the current instruction, the processor also retrieves the operands for the next instruction. Based on this concept, superscalar processors were later designed; to execute multiple instructions in parallel they have multiple execution units, i.e., separate arithmetic-logic units (ALUs).
Instead of executing a single instruction at a time, the system divides the program into several independent instructions, and the CPU looks for instructions that do not depend on each other and executes them in parallel. Examples of this design approach are VLIW and EPIC.
1. Technology and Architecture features
ULSI/VHSIC processors, memory, and switches
High-density packaging
Scalable architecture
Vector processors
2. Software and Applications
Massively parallel processing
Grand challenge applications
Heterogeneous processing
3. Examples: Fujitsu VPP500, Cray MPP, TMC CM-5, Intel Paragon
1.2.2 Elements of Modern Computers
The hardware, software, and programming elements of modern computer systems can be characterized by looking at a variety of factors. In the context of parallel computing, these factors are:
z Computing problems
z Algorithms and data structures
z Hardware resources
z Operating systems
z System software support
z Compiler support
Computing Problems
z Numerical computing: complex mathematical formulations, tedious integer or floating-point computation
z Transaction processing: accurate transactions, large database management, information retrieval
z Logical reasoning: logic inference, symbolic manipulation
Algorithms and Data Structures
z Traditional algorithms and data structures are designed for sequential machines.
z New, specialized algorithms and data structures are needed to exploit the
capabilities of parallel architectures.
z These often require interdisciplinary interactions among theoreticians,
experimentalists, and programmers.
Hardware Resources
z The architecture of a system is shaped only partly by the hardware resources.
z The operating system and applications also significantly influence the overall
architecture.
z Not only must the processor and memory architectures be considered, but also the
architecture of the device interfaces (which often include their advanced
processors).
Operating System
z Operating systems manage the allocation and deallocation of resources during user
program execution.
z UNIX, Mach, and OSF/1 provide support for multiprocessors and multicomputers
z This support includes multithreaded kernel functions, virtual memory management, file subsystems, and network communication services.
z An OS plays a significant role in mapping hardware resources to algorithms and data structures.
System Software Support
z Compilers, assemblers, and loaders are traditional tools for developing programs in high-level languages. Together with the operating system, these tools determine the binding of resources to applications, and the effectiveness of this binding determines the efficiency of hardware utilization and the system's programmability.
z Most programmers still employ a sequential mindset, abetted by a lack of popular parallel software support.
z Parallel software can be developed using entirely new languages designed specifically with parallel support as their goal, or by using extensions to existing sequential languages.
z New languages have obvious advantages (like new constructs specifically for
parallelism), but require additional programmer education and system software.
z The most common approach is to extend an existing language.
Compiler Support
z Preprocessors use existing sequential compilers and specialized libraries to
implement parallel constructs
z Parallelizing compilers require full detection of parallelism in source code and transformation of sequential code into parallel constructs
z Precompilers perform some program flow analysis, dependence checking, and limited parallel optimizations
z Compiler directives are often inserted into source code to aid compiler parallelizing
efforts
1.2.3 Flynn's Classical Taxonomy
Among the classification schemes proposed, the one widely used since 1966 is Flynn's taxonomy. This taxonomy distinguishes multiprocessor computer architectures along two independent dimensions: instruction stream and data stream. An instruction stream is a sequence of instructions executed by the machine, and a data stream is a sequence of data, including input and partial or temporary results, used by the instruction stream. Each of these dimensions can have only one of two possible states: single or multiple. Flynn's classification depends on the distinction between the activity of the control unit and the data processing unit rather than on operational and structural interconnections. The four categories of Flynn's classification and the characteristic features of each are given below.
Single Instruction Stream, Single Data Stream (SISD)
Figure 1.1: Execution of Instruction in SISD Processors
Figure 1.1 represents the organization of a simple SISD computer having one control unit, one processing unit and a single memory unit.
Figure 1.2: SISD processor organization
z They are also called scalar processors, i.e., they execute one instruction at a time, and each instruction has only one set of operands.
z Single instruction: Only one instruction stream is being acted on by the CPU during
any one clock cycle
z Single data: Only one data stream is being used as input during any one clock cycle
z Deterministic execution
z Instructions are executed sequentially.
z This is the oldest and until recently, the most prevalent form of computer
z Examples: most PCs, single CPU workstations and mainframes
Single instruction stream, multiple data stream (SIMD) processors
z A type of parallel computer
z Single instruction: all processing units execute the same instruction, issued by the control unit, at any given clock cycle, as shown in figure 1.3, where multiple processors execute the instruction given by one control unit.
z Multiple data: each processing unit can operate on a different data element; as shown in the figure below, the processors are connected to shared memory or an interconnection network that provides multiple data to the processing units.
Figure 1.3: SIMD Processor Organization
z This type of machine typically has an instruction dispatcher, a very high-bandwidth
internal network, and a very large array of very small-capacity instruction units.
z Thus a single instruction is executed by different processing units on different sets of data, as shown in figure 1.3.
z Best suited for specialized problems characterized by a high degree of regularity,
such as image processing and vector computation.
z Synchronous (lockstep) and deterministic execution
z Two varieties: processor arrays, e.g., Connection Machine CM-2, MasPar MP-1, MP-2; and vector pipeline processors, e.g., IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820
Figure 1.4: Execution of instructions in SIMD Processors
Multiple instruction stream, single data stream (MISD)
z A single data stream is fed into multiple processing units.
z Each processing unit operates on the data independently via independent
instruction streams as shown in figure 1.5 a single data stream is forwarded to
different processing unit which are connected to different control unit and execute
instruction given to it by control unit to which it is attached.
Figure 1.5: MISD Processor Organization
z Thus in these computers the same data flows through a linear array of processors executing different instruction streams, as shown in figure 1.6.
z This architecture is also known as a systolic array, used for pipelined execution of specific instructions.
z Few actual examples of this class of parallel computer have ever existed. One is
the experimental Carnegie-Mellon [Link] computer (1971).
z Some conceivable uses might be:
multiple frequency filters operating on a single signal stream
multiple cryptography algorithms attempting to crack a single coded message.
Figure 1.6: Execution of Instructions in MISD Processors
Multiple instruction stream, multiple data stream (MIMD)
z Multiple Instruction: every processor may be executing a different instruction
stream.
z Multiple Data: every processor may be working with a different data stream; as shown in figure 1.7, the multiple data streams are provided by shared memory.
z MIMD machines can be categorized as loosely coupled or tightly coupled depending on the sharing of data and control.
z Execution can be synchronous or asynchronous, deterministic or non-deterministic
Figure 1.7: MIMD Processor Organizations
z As shown in figure 1.8, there are different processors, each processing a different task.
z Examples: most current supercomputers, networked parallel computer "grids" and
multi-processor SMP computers - including some types of PCs.
Figure 1.8: Execution of Instructions MIMD Processors
Here are some popular computer architectures and their types:
SISD: IBM 701, IBM 1620, IBM 7090, PDP VAX 11/780
SISD (with multiple functional units): IBM 360/91, IBM 370/168 UP
SIMD (word-slice processing): Illiac-IV, PEPE
SIMD (bit-slice processing): STARAN, MPP, DAP
MIMD (loosely coupled): IBM 370/168 MP, Univac 1100/80
MIMD (tightly coupled): Burroughs D-825
1.2.4 Performance Attributes
Performance of a system depends on
z hardware technology
z architectural features
z efficient resource management
z algorithm design
z data structures
z language efficiency
z programmer skill
z compiler technology
When we talk about the performance of a computer system, we are describing how quickly a given system can execute a program or programs. Thus we are interested in knowing the turnaround time. Turnaround time depends on:
z disk and memory accesses
z input and output
z compilation time
z operating system overhead
z CPU time
Ideal performance of a computer system means a perfect match between machine capability and program behavior. Machine capability can be improved by using better hardware technology and efficient resource management. Program behavior, however, depends on the code used, the compiler, and other run-time conditions. A machine's performance may also vary from program to program. Because there are too many programs, and it is impractical to test a CPU's speed on all of them, benchmarks were developed. Computer architects have come up with a variety of metrics to describe computer performance.
Clock rate and CPI/IPC: Since I/O and system overhead frequently overlap with processing by other programs, it is fair to consider only the CPU time used by a program, and user CPU time is the most important factor. The CPU is driven by a clock with a constant cycle time τ (usually measured in nanoseconds), which controls the rate of internal operations in the CPU. The inverse of the cycle time is the clock rate (f = 1/τ, measured in megahertz).
A shorter clock cycle time, or equivalently a higher clock rate, implies that more operations can be performed per unit time. The size of a program is determined by its instruction count, Ic, the number of machine instructions to be executed by the program. Different machine instructions require different numbers of clock cycles to execute, so CPI (cycles per instruction) is an important parameter.
Average CPI
It is easy to determine the average number of cycles per instruction for a particular
processor if we know the frequency of occurrence of each instruction type.
Any estimate is valid only for a specific set of programs (which defines the instruction mix), and then only if there is a sufficiently large number of instructions.
In general, the term CPI is used with respect to a particular instruction set and a given program mix. The time required to execute a program containing Ic instructions is just T = Ic × CPI × τ.
Each instruction must be fetched from memory, decoded, then operands fetched
from memory, the instruction executed, and the results stored.
The time required to access memory is called the memory cycle time, which is
usually k times the processor cycle time τ. The value of k depends on the memory
technology and the processor-memory interconnection scheme. The processor cycles
required for each instruction (CPI) can be attributed to cycles needed for instruction decode and execution (p) and cycles needed for memory references (m × k, where m is the number of memory references per instruction).
The total time needed to execute a program can then be rewritten as
T = Ic × (p + m × k) × τ.
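To make these formulas concrete, the short C sketch below computes the average CPI from a hypothetical instruction mix and then the execution time T = Ic × CPI × τ. The instruction counts, cycle counts and clock rate are illustrative assumptions, not measurements of any particular machine; the final line also evaluates Ic/(T × 10^6), i.e., the MIPS rate introduced below.

#include <stdio.h>

int main(void)
{
    /* Assumed instruction mix: counts and cycles per instruction type */
    double counts[] = { 50000, 30000, 20000 };
    double cycles[] = { 1, 2, 4 };
    double f = 100e6;                       /* assumed clock rate: 100 MHz, tau = 1/f */
    double Ic = 0.0, total_cycles = 0.0;

    for (int i = 0; i < 3; i++) {
        Ic += counts[i];
        total_cycles += counts[i] * cycles[i];
    }

    double cpi  = total_cycles / Ic;        /* average cycles per instruction   */
    double T    = Ic * cpi / f;             /* T = Ic * CPI * tau, in seconds   */
    double mips = Ic / (T * 1e6);           /* equivalently f / (CPI * 10^6)    */

    printf("CPI = %.2f, T = %.6f s, MIPS = %.2f\n", cpi, T, mips);
    return 0;
}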
MIPS: Millions of instructions per second. This is calculated by dividing the number of instructions executed in a running program (in millions) by the time required to run the program, i.e., MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6). The MIPS rate is directly proportional to the clock rate and inversely proportional to the CPI. All four system attributes (instruction set, compiler, processor, and memory technologies) affect the MIPS rate, which also varies from program to program.
MIPS has not proved to be an effective measure, as it does not account for the fact that different systems often require different numbers of instructions to implement the same program; it says nothing about how many instructions are required to perform a given task. With variations in instruction styles, internal organization, and number of processors per system, it is almost meaningless for comparing two systems.
MFLOPS (pronounced "megaflops") stands for "millions of floating point operations per second." This is often used as a "bottom-line" figure. If one knows ahead of time how many operations a program needs to perform, one can divide the number of operations by the execution time to come up with a MFLOPS rating.
For example, the standard algorithm for multiplying n × n matrices requires 2n^3 - n operations (n^2 inner products, with n multiplications and n - 1 additions in each product). Suppose you compute the product of two 100 × 100 matrices in 0.35 seconds. Then the computer achieves
(2(100)^3 - 100)/0.35 = 5,714,000 ops/sec = 5.714 MFLOPS
The term ``theoretical peak MFLOPS'' refers to how many operations per second
would be possible if the machine did nothing but numerical operations. It is obtained by
calculating the time it takes to perform one operation and then computing how many of
them could be done in one second.
For example, suppose it takes 8 cycles to do one floating point multiplication, the cycle time on the machine is 20 nanoseconds, and arithmetic operations are not overlapped with one another. Then one multiplication takes 160 ns, and (10^9 ns/1 s) × (1 multiplication / 160 ns) = 6.25 × 10^6 multiplications/sec, so the theoretical peak performance is 6.25 MFLOPS. Of course, programs are not just long sequences of multiply and add instructions, so a machine rarely comes close to this level of performance on any real program. Most machines will achieve less than 10% of their peak rating, but vector processors or other machines with internal pipelines that have an effective CPI near 1.0 can often achieve 70% or more of their theoretical peak on small programs.
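As a minimal sketch of these two ratings, the C fragment below recomputes the actual MFLOPS of the 100 × 100 matrix-multiply example and the theoretical peak MFLOPS of the hypothetical 20 ns, 8-cycle-multiply machine; all the numbers are taken from the text above, not from a real machine.

#include <stdio.h>

int main(void)
{
    /* Actual MFLOPS: operations performed divided by measured run time */
    double n = 100.0, time_s = 0.35;                  /* from the example above */
    double ops = 2.0 * n * n * n - n;                 /* 2n^3 - n operations    */
    double mflops = ops / time_s / 1e6;

    /* Theoretical peak: one multiply every 8 cycles at a 20 ns cycle time */
    double cycle_ns = 20.0, cycles_per_mul = 8.0;
    double peak_mflops = 1e3 / (cycle_ns * cycles_per_mul);  /* 10^9 ns per second / 160 ns, in millions */

    printf("actual = %.3f MFLOPS, theoretical peak = %.2f MFLOPS\n", mflops, peak_mflops);
    return 0;
}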
Throughput rate: Another important factor by which a system's performance is measured is the throughput of the system, Ws, which is basically how many programs the system can execute per unit time. In multiprogramming, the system throughput is often lower than the CPU throughput Wp, which is defined as
Wp = f/(Ic × CPI)
The unit of Wp is programs/second.
Ws < Wp because, in a multiprogramming environment, there are always additional overheads such as the time-sharing operating system. Ideal behavior is not achieved in parallel computers because, while executing a parallel algorithm, the processing elements cannot devote 100% of their time to the computations of the algorithm.
Efficiency is a measure of the fraction of time for which a PE is usefully employed. In an ideal parallel system, efficiency is equal to one. In practice, efficiency is between zero and one, depending on the overhead associated with parallel execution.
Speed or throughput (W/Tn): the execution rate on an n-processor system, measured in FLOPs/unit time or instructions/unit time.
Speedup (Sn = T1/Tn): how much faster an actual machine with n processors performs the workload compared with one processor. The ratio T1/T∞ is called the asymptotic speedup.
Efficiency (En = Sn/n): the fraction of the theoretical maximum speedup achieved by n processors.
Degree of Parallelism (DOP): for a given piece of the workload, the number of
processors that can be kept busy sharing that piece of computation equally. Neglecting
overhead, we assume that if k processors work together on any workload, the workload
gets done k times as fast as a sequential execution.
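The sketch below ties these definitions together for a hypothetical workload: it computes the CPU throughput Wp = f/(Ic × CPI), then the speedup Sn = T1/Tn and the efficiency En = Sn/n for assumed single-processor and n-processor run times. All numbers are illustrative assumptions, not measurements.

#include <stdio.h>

int main(void)
{
    /* CPU throughput Wp = f / (Ic * CPI), in programs per second */
    double f = 200e6, Ic = 5e6, cpi = 1.5;           /* assumed values          */
    double Wp = f / (Ic * cpi);

    /* Speedup and efficiency for an assumed parallel run */
    int    n  = 8;                                   /* number of processors    */
    double T1 = 64.0, Tn = 10.0;                     /* assumed run times (s)   */
    double Sn = T1 / Tn;                             /* speedup                 */
    double En = Sn / n;                              /* efficiency, 0 < En <= 1 */

    printf("Wp = %.2f programs/s, S%d = %.2f, E%d = %.2f\n", Wp, n, Sn, n, En);
    return 0;
}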
Scalability: the attributes of a computer system which allow it to be gracefully and linearly scaled up or down in size, to handle smaller or larger workloads, or to obtain proportional decreases or increases in speed on a given application. The applications run on a scalable machine may not scale well; good scalability requires both the algorithm and the machine to have the right properties.
Thus, in general, there are five performance factors (Ic, p, m, k, τ) which are influenced by four system attributes:
z instruction-set architecture (affects Ic and p)
z compiler technology (affects Ic, p and m)
z CPU implementation and control (affects p × τ)
z cache and memory hierarchy (affects the memory access latency, k × τ)
Total CPU time can be used as a basis for estimating the execution rate of a processor.
Programming Environments
Programmability depends on the programming environment provided to the users.
Conventional computers are used in a sequential programming environment with
tools developed for a uniprocessor computer. Parallel computers need parallel tools that
allow specification or easy detection of parallelism and operating systems that can
perform parallel scheduling of concurrent events, shared memory allocation, and shared
peripheral and communication links.
Implicit Parallelism
Use a conventional language (like C, Fortran, Lisp, or Pascal) to write the program.
Use a parallelizing compiler to translate the source code into parallel code. The
compiler must detect parallelism and assign target machine resources. Success relies
heavily on the quality of the compiler.
Explicit Parallelism
Programmer writes explicit parallel code using parallel dialects of common languages.
Compiler has reduced need to detect parallelism, but must still preserve existing
parallelism and assign target machine resources.
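As one illustration of the explicit approach, the loop below uses an OpenMP directive, a widely used extension of C; OpenMP is chosen here only as an example of a parallel dialect, since the text does not name a specific one. The same loop without the pragma is exactly what an implicit (parallelizing) compiler would have to analyse and parallelize on its own.

#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

    /* Explicit parallelism: the programmer asserts the iterations are independent. */
    /* Compile with -fopenmp (gcc/clang) to enable the directive; it is ignored     */
    /* otherwise and the loop runs sequentially.                                    */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}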
Needed Software Tools
Parallel extensions of conventional high-level languages.
Integrated environments providing different levels of program abstraction; validation, testing and debugging; performance prediction and monitoring; visualization support to aid program development and performance measurement; and graphics display and animation of computational results.
1.3 Multiprocessor and Multicomputers
Two categories of parallel computers are discussed below, namely shared-memory multiprocessors and unshared distributed-memory multicomputers.
1.3.1 Shared Memory Multiprocessors
z Shared memory parallel computers vary widely, but they generally have in common the ability for all processors to access all memory as a global address space.
z Multiple processors can operate independently but share the same memory
resources.
z Changes in a memory location effected by one processor are visible to all other
processors.
z Shared memory machines can be divided into classes based upon memory access times: UMA, NUMA and COMA.
Uniform Memory Access (UMA)
z Most commonly represented today by Symmetric Multiprocessor (SMP) machines
z Identical processors
z Equal access and access times to memory
z Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means if one
processor updates a location in shared memory, all the other processors know
about the update. Cache coherency is accomplished at the hardware level.
Figure 1.9: Shared Memory (UMA)
Non-Uniform Memory Access (NUMA)
z Often made by physically linking two or more SMPs
z One SMP can directly access memory of another SMP
z Not all processors have equal access time to all memories
z Memory access across link is slower
If cache coherency is maintained, such a machine may also be called CC-NUMA (Cache Coherent NUMA).
Figure 1.10: Shared Memory (NUMA)
The COMA model: The COMA model is a special case of NUMA machine in which
the distributed main memories are converted to caches. All caches form a global
address space and there is no memory hierarchy at each processor node.
Advantages:
z Global address space provides a user-friendly programming perspective to memory
z Data sharing between tasks is both fast and uniform due to the proximity of memory
to CPUs
Disadvantages:
z The primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can geometrically increase traffic on the shared memory-CPU path and, for cache coherent systems, geometrically increase the traffic associated with cache/memory management.
z The programmer is responsible for the synchronization constructs that ensure "correct" access to global memory (a small sketch follows this list).
z Expense: it becomes increasingly difficult and expensive to design and produce
shared memory machines with ever increasing numbers of processors.
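The sketch below illustrates the synchronization burden mentioned above: two threads update a shared counter in one global address space, and a mutex is needed to ensure correct access. POSIX threads are used here purely as a familiar example; the text does not prescribe any particular threading API. Compile with -pthread.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                    /* shared memory: one global address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);          /* synchronization construct supplied by  */
        counter++;                          /* the programmer to ensure correctness   */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);     /* 200000 with the mutex; unpredictable without */
    return 0;
}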
1.3.2 Distributed Memory
z Like shared memory systems, distributed memory systems vary widely but share a
common characteristic. Distributed memory systems require a communication
network to connect inter-processor memory.
Figure 1.11: Distributed Memory Systems
z Processors have their own local memory. Memory addresses in one processor do
not map to another processor, so there is no concept of global address space
across all processors.
z Because each processor has its own local memory, it operates independently.
Changes it makes to its local memory have no effect on the memory of other
processors. Hence, the concept of cache coherency does not apply.
z When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility (a message-passing sketch is given at the end of this subsection).
z Modern multicomputers use hardware routers to pass messages. Based on the interconnection network, routers and channels used, multicomputers are divided into generations:
1st generation: based on board technology, using hypercube architecture and software-controlled message switching.
2nd generation: implemented with mesh-connected architecture, hardware message routing, and a software environment for medium-grained distributed computing.
3rd generation: fine-grained multicomputers such as the MIT J-Machine.
z The network "fabric" used for data transfer varies widely, though it can be as simple
as Ethernet.
Advantages:
z Memory is scalable with the number of processors. Increase the number of processors and the size of memory increases proportionately.
z Each processor can rapidly access its own memory without interference and without
the overhead incurred with trying to maintain cache coherency.
z Cost effectiveness: can use commodity, off-the-shelf processors and networking.
Disadvantages:
z The programmer is responsible for many of the details associated with data
communication between processors.
z It may be difficult to map existing data structures, based on global memory, to this
memory organization.
z Non-uniform memory access (NUMA) times
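To show what "explicitly defining how and when data is communicated" looks like in practice, the sketch below passes a single value between two processes using MPI. MPI is only one possible message-passing library; it is used here as an assumption, since the text does not name a specific communication interface. Run with, for example, mpirun -np 2.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                         /* data lives in process 0's local memory  */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);   /* explicit communication, no shared memory */
    }

    MPI_Finalize();
    return 0;
}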
1.4 Multivector and SIMD Computers
A vector operand contains an ordered set of n elements, where n is called the length of
the vector. Each element in a vector is a scalar quantity, which may be a floating point
number, an integer, a logical value or a character.
A vector processor consists of a scalar processor and a vector unit, which could be
thought of as an independent functional unit capable of efficient vector operations.
1.4.1 Vector Hardware
Vector computers have hardware to perform the vector operations efficiently. Operands
cannot be used directly from memory but rather are loaded into registers and are put
back in registers after the operation. Vector hardware has the special ability to overlap
or pipeline operand processing.
Figure 1.12: Vector Hardware
Vector functional units are pipelined and fully segmented: each stage of the pipeline performs a step of the function on different operand(s), and once the pipeline is full, a new result is produced each clock period (cp).
Pipelining
The pipeline is divided up into individual segments, each of which is completely
independent and involves no hardware sharing. This means that the machine can be
working on separate operands at the same time. This ability enables it to produce one
result per clock period as soon as the pipeline is full.
The same instruction is obeyed repeatedly using the pipeline technique so the
vector processor processes all the elements of a vector in exactly the same way. The
pipeline segments an arithmetic operation, such as a floating point multiply, into stages, passing the output of one stage to the next stage as input. The next pair of operands may enter the pipeline after the first stage has processed the previous pair of operands. The processing of a number of operands may therefore be carried out simultaneously.
The loading of a vector register is itself a pipelined operation, with the ability to load
one element each clock period after some initial startup overhead.
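A simple way to see the benefit of a full pipeline is to model the time for a vector operation as a startup (fill) time plus one result per clock period, and compare it with a unit that pays the full latency for every element. The C sketch below does exactly that; the clock period, stage count and vector length are assumed values chosen only for illustration.

#include <stdio.h>

int main(void)
{
    double clock_ns = 10.0;     /* assumed clock period               */
    int    stages   = 6;        /* assumed pipeline depth (fill time) */
    int    n        = 64;       /* vector length                      */

    /* Pipelined: after 'stages' cycles of startup, one result per clock */
    double t_pipe = (stages + (n - 1)) * clock_ns;

    /* Non-pipelined: every element pays the full latency of all stages */
    double t_serial = (double)n * stages * clock_ns;

    printf("pipelined = %.0f ns, non-pipelined = %.0f ns, ratio = %.2f\n",
           t_pipe, t_serial, t_serial / t_pipe);
    return 0;
}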
1.4.2 SIMD Array Processors
Synchronous parallel architectures coordinate concurrent operations in lockstep through global clocks, central control units, or vector unit controllers. A synchronous array of parallel processors is called an array processor. These processors are composed of N identical processing elements (PEs) under the supervision of a single control unit (CU). The control unit is a computer with high-speed registers, local memory and an arithmetic logic unit. An array processor is basically a single instruction, multiple data (SIMD) computer. There are N data streams, one per processor, so different data can be used in each processor. The figure below shows a typical SIMD or array processor.
Figure 1.13: Configuration of SIMD Array Processor
These processors use a number of memory modules, which can be either global or dedicated to each processor; thus the main memory is the aggregate of the memory modules. The processing elements and memory units communicate with each other through an interconnection network. SIMD processors are especially designed for performing vector computations. SIMD has two basic architectural organizations:
(a) Array processor using random access memory
(b) Associative processors using content addressable memory.
All N identical processors operate under the control of a single instruction stream issued by a central control unit. Popular examples of this type of SIMD configuration are the ILLIAC IV, CM-2 and MP-1. Each PEi is essentially an arithmetic logic unit (ALU) with attached working registers and a local memory PEMi for the storage of distributed data. The CU also has its own main memory for the storage of programs. The function of the CU is to decode the instructions and determine where each decoded instruction should be executed. The PEs perform the same function (the same instruction) synchronously, in lockstep fashion, under command of the CU. To maintain synchronous operation a global clock is used. Thus at each step, i.e., when the global clock pulse changes, all processors execute the same instruction, each on different data (single instruction, multiple data).
SIMD machines are particularly useful for solving problems that involve vector calculations, where one can easily exploit data parallelism; in such calculations the same set of instructions is applied to all subsets of the data. Suppose we add two vectors, each having N elements, and there are N/2 processing elements in the SIMD machine. The same addition instruction is issued to all N/2 PEs and all of them execute the instruction simultaneously. It takes 2 steps to add the two vectors, as compared to N steps on a SISD machine. The distributed data can be loaded into the PEMs from an external source via the system bus, or via the system broadcast mode using the control bus.
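The loop below is the source-level form of the vector addition just described. On a SISD machine the N iterations run one after another; on a SIMD array processor the same add instruction is broadcast to all PEs, each working on its own slice of the data, so N/2 PEs finish in 2 lockstep steps. The code itself is ordinary sequential C, shown only to make the data parallelism visible.

#include <stdio.h>

#define N 8

int main(void)
{
    int a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    int c[N];

    /* Every iteration is independent: ideal data parallelism.        */
    /* A SIMD machine with N/2 PEs performs these N adds in 2 steps,  */
    /* each PE executing the same add on a different pair of elements.*/
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    for (int i = 0; i < N; i++)
        printf("%d ", c[i]);
    printf("\n");
    return 0;
}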
Array processors can be classified into two categories depending on how the memory units are organized:
(a) Dedicated memory organization
(b) Global memory organization
A SIMD computer C is characterized by the following set of parameters:
C = <N, F, I, M>
where
N = the number of PEs in the system. For example, the Illiac-IV has N = 64 and the BSP has N = 16.
F = a set of data-routing functions provided by the interconnection network.
I = the set of machine instructions for scalar, vector, data-routing and network manipulation operations.
M = the set of masking schemes, where each mask partitions the set of PEs into disjoint subsets of enabled PEs and disabled PEs.
1.5 Architectural Development Tracks
The architectures of most existing computers follow certain development tracks. Understanding the features of the various tracks provides insight for new architectural development. These tracks are distinguished by similarity in computational models and technological bases. We also review a few early representative systems in each track.
1.5.1 Multiple-Processor Tracks
Generally speaking, a multiple-processor system can be either a shared-memory multiprocessor or a distributed-memory multicomputer, as modeled earlier. Bell listed these
machines at the leaf nodes of the taxonomy tree. Instead of a horizontal listing, we
show a historical development along each important track of the taxonomy.
Shared-Memory Track: The following figure shows a track of multiprocessor development employing a single address space in the entire system. The track started with the C.mmp system developed at Carnegie-Mellon University. The C.mmp was a UMA multiprocessor: sixteen PDP 11/40 processors were interconnected to 16 shared-memory modules via a crossbar switch. A special interprocessor interrupt bus was provided for fast interprocess communication, in addition to the shared memory. The C.mmp project pioneered shared-memory multiprocessor development, not only in the crossbar architecture but also in the development of the multiprocessor operating system (Hydra).
Both the NYU Ultracomputer project and the Illinois Cedar project were developed with a single address space; both systems used multistage networks as the system interconnect. The major achievements of the Cedar project were in parallel compilers and performance benchmarking experiments.
The Stanford Dash was a NUMA multiprocessor with distributed memories forming a global address space; cache coherence was enforced with distributed directories. The KSR-1 was a typical COMA model. The Fujitsu VPP500 was a 222-processor system with a crossbar interconnect; its shared memories were distributed to all processor nodes.
Message-Passing Track: The Cosmic Cube pioneered the development of message-passing multicomputers. Since then, Intel has produced a series of medium-grained hypercube computers. The nCUBE 2 also assumed a hypercube configuration.
Figure 1.14: Two multiple-Processor tracks with and without shared memory
1.6 Summary
Modern computers are equipped with powerful hardware technology and, at the same time, loaded with sophisticated software packages. To assess the state of the art of computing, we first reviewed the history of computers and then studied the attributes used to analyse computer performance. Two categories of parallel computers were discussed, namely shared-memory multiprocessors and unshared distributed-memory multicomputers. Shared memory
parallel computers vary widely, but generally have in common the ability for all
processors to access all memory as global address space. Like shared memory
systems, distributed memory systems vary widely but share a common characteristic.
Distributed memory systems require a communication network to connect inter-
processor memory. A vector operand contains an ordered set of n elements, where n is
called the length of the vector. Each element in a vector is a scalar quantity, which may
be a floating point number, an integer, a logical value or a character. A vector processor
consists of a scalar processor and a vector unit, which could be thought of as an
independent functional unit capable of efficient vector operations.
1.7 Check Your Progress
Multiple Choice Questions
1. Microprocessors as switching devices belong to which generation of computers?
(a) First Generation
(b) Second Generation
(c) Third Generation
(d) Fourth Generation
2. The system unit of a personal computer typically contains all of the following except:
(a) Microprocessor
(b) Disk controller
(c) Serial interface
(d) Modem
3. The instructions that tell a computer how to carry out the processing tasks are
referred to as computer.........
(a) programs
(b) processors
(c) input devices
(d) memory modules
4. A ............ is a microprocessor -based computing device.
(a) personal computer
(b) mainframe
(c) workstation
(d) server
5. RAM can be treated as the ......... for the computer's processor
(a) factory
(b) operating room
(c) waiting room
(d) planning room
6. The technology that stores only the essential instructions on a microprocessor chip
and thus enhances its speed is referred to as
(a) CISC
(b) RISC
(c) CD-ROM
(d) Wi-Fi
7. Which part of the computer is directly involved in executing the instructions of the
computer program?
(a) The scanner
(b) The main storage
(c) The secondary storage
(d) The processor
8. Where are data and programme stored when the processor uses them?
(a) Main memory
(b) Secondary memory
(c) Disk memory
(d) Programme memory
9. Personal computers use a number of chips mounted on a main circuit board. What
is the common name for such boards?
(a) Daughter board
(b) Motherboard
(c) Father board
(d) Breadboard
10. The reason for the implementation of the cache memory is
(a) To increase the internal memory of the system
(b) The difference in speeds of operation of the processor and memory
(c) To reduce the memory access and cycle time
(d) All of the above
1.8 Questions and Exercises
1. Explain the different generations of computers with respect to progress in hardware.
2. Describe the different element of modern computers.
3. Explain the Flynn’s classification of computer architectures.
4. Explain performance factors vs system attributes.
5. A 40-MHZ processor was used to execute a benchmark program with the following
instruction mix and clock cycle counts:
Instruction type Instruction count Clock cycle count
Integer arithmetic 45000 1
Data transfer 32000 2
Floating point 15000 2
Control transfer 8000 2
Determine the effective CPI, MIPS rate, and execution time for this program. (10
marks)
6. Explain the UMA, NUMA and COMA multiprocessor models.
7. Differentiate between shared-memory multiprocessors and distributed-memory
multicomputers.
8. What are the architectural development tracks?
1.9 Key Terms
z Multiprocessor: A computer in which processors can execute separate instruction
streams, but have access to a single address space. Most multiprocessors are
shared memory machines, constructed by connecting several processors to one or
more memory banks through a bus or switch.
z Multicomputer: A computer in which processors can execute separate instruction
streams, have their own private memories and cannot directly access one another's
memories. Most multicomputers are disjoint memory machines, constructed by
joining nodes (each containing a microprocessor and some memory) via links.
z MIMD: Multiple Instruction, Multiple Data; a category of Flynn's taxonomy in which
many instruction streams are concurrently applied to multiple data sets. A MIMD
architecture is one in which heterogeneous processes may execute at different
rates.
z MIPS: One Million Instructions Per Second. A performance rating usually referring to integer or non-floating-point instructions.
Check Your Progress: Answers
1. (d) Fourth Generation
2. (d) Modem
3. (a) programs
4. (a) personal computer
5. (c) waiting room
6. (b) RISC
7. (d) The processor
8. (a) Main memory
9. (b) Motherboard
10. (b) The difference in speeds of operation of the processor and memory
1.10 Further Readings
z [Link], Advanced Computer Architecture and Computing, Technical
Publications, 2009
z V. Rajaraman, T. Radhakrishnan, Computer Organization And Architecture, PHI
Learning [Link]., 2007
z P. V. S. RAO, Computer System Architecture, PHI Learning Pvt. Ltd., 2008
z Sajjan G. Shiva , Advanced Computer Architectures, CRC Press. Copyright, 2005
Unit 2: Program and Network Properties
Structure
2.1 Introduction
2.2 Condition of Parallelism
2.2.1 Data and Resource Dependence
2.2.2 Hardware and software parallelism
2.2.3 The Role of Compilers
2.3 Program Partitioning & Scheduling
2.3.1 Grain Size and Latency
2.3.2 Grain Packing and Scheduling
2.4 Program flow mechanism
2.5 Summary
2.6 Check Your Progress
2.7 Questions and Exercises
2.8 Key Terms
2.9 Further Readings
Objectives
After studying this unit, you should be able to:
z Understand the conditions of parallelism
z Learn the concept of program partitioning and scheduling
z Discuss grain packing and scheduling
z Understand the program flow mechanism
2.1 Introduction
The advantage of multiprocessors is realized when parallelism in the program is fully exploited and implemented using multiple processors. Thus, in order to implement parallelism, we should understand the various conditions of parallelism as well as the various bottlenecks in implementing it.
For a full implementation of parallelism there are three significant areas to be understood, namely computation models for parallel computing, interprocessor communication in parallel architectures, and system integration for incorporating parallel systems. A multiprocessor system also poses a number of problems that are not encountered in sequential processing, such as designing a parallel algorithm for the application, partitioning the application into tasks, coordinating communication and synchronization, and scheduling the tasks onto the machine.
2.2 Condition of Parallelism
The ability to execute several program segments in parallel requires each segment to
be independent of the other segments. We use a dependence graph to describe the
relations. The nodes of a dependence graph correspond to the program statements (instructions), and directed edges with different labels are used to represent the ordered
relations among the statements. The analysis of dependence graphs shows where
opportunity exists for parallelization and vectorization.
2.2.1 Data and Resource Dependence
Data dependence: The ordering relationship between statements is indicated by the data dependence. Five types of data dependence are defined below:
1. Flow dependence: A statement S2 is flow-dependent on S1 if an execution path exists from S1 to S2 and if at least one output (a variable assigned) of S1 feeds in as input (an operand to be used) to S2. This is also called a RAW hazard and is denoted
S1 → S2
2. Antidependence: A statement S2 is antidependent on statement S1 if S2 follows S1 in program order and if the output of S2 overlaps the input of S1. This is also called a WAR hazard and is denoted
S1 → S2
3. Output dependence: Two statements are output-dependent if they produce (write) the same output variable. This is also called a WAW hazard and is denoted S1 → S2.
4. I/O dependence: Read and write are I/O statements. I/O dependence occurs not because the same variable is involved but because the same file is referenced by both I/O statements.
5. Unknown dependence: The dependence relation between two statements cannot be determined in the following situations:
The subscript of a variable is itself subscripted (indirect addressing).
The subscript does not contain the loop index variable.
A variable appears more than once, with subscripts having different coefficients of the loop variable.
The subscript is nonlinear in the loop index variable.
Parallel execution of program segments which do not have total data independence
can produce non-deterministic results.
Consider the following fragment of a program:
S1: Load R1, A      (R1 ← Memory(A))
S2: Add R2, R1      (R2 ← R2 + R1)
S3: Move R1, R3     (R1 ← R3)
S4: Store B, R1     (Memory(B) ← R1)
Here:
z Flow dependency: S1 to S2 and S3 to S4
z Anti-dependency: S2 to S3
z Output dependency: S1 to S3
Figure 2.1: Dependence Graph
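The same three kinds of dependence can be seen at the source level. The short C program below is only an illustrative parallel to the register example above, with comments marking the flow (RAW), anti (WAR) and output (WAW) relations between its statements.

#include <stdio.h>

int main(void)
{
    int A = 5, B, R1, R2 = 1, R3 = 7;

    R1 = A;          /* S1 */
    R2 = R2 + R1;    /* S2: flow (RAW) dependence on R1 from S1           */
    R1 = R3;         /* S3: anti (WAR) on R1 with S2; output (WAW) on R1  */
                     /*     with S1                                       */
    B  = R1;         /* S4: flow (RAW) dependence on R1 from S3           */

    printf("B = %d, R2 = %d\n", B, R2);
    return 0;
}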
Control Dependence: This refers to the situation where the order of execution of statements cannot be determined before run time, for example with conditional statements, where the flow of execution depends on a result computed at run time. The paths taken after a conditional branch may depend on the data, and this dependence among instructions must be resolved before parallel execution. Control dependence also exists between operations performed in successive iterations of a loop. Control dependence often prohibits parallelism from being exploited.
Control-independent example:
for (i = 0; i < n; i++) {
    a[i] = c[i];
    if (a[i] < 0) a[i] = 1;    /* branch depends only on data of the same iteration */
}
Control-dependent example:
for (i = 1; i < n; i++) {
    if (a[i-1] < 0) a[i] = 1;  /* branch depends on the result of the previous iteration */
}
Control dependence thus prevents parallelism from being exploited; compilers are used to eliminate control dependences and exploit the available parallelism.
Resource dependence
Data and control dependencies are based on the independence of the work to be done. Resource dependence is concerned with conflicts in using shared resources, such as registers, integer and floating point ALUs, etc. ALU conflicts are called ALU dependence, and memory (storage) conflicts are called storage dependence.
Bernstein’s Conditions - 1
Bernstein’s conditions are a set of conditions which must be satisfied if two processes are to execute in parallel.
Notation
Ii is the set of all input variables for a process Pi; Ii is also called the read set or domain of Pi. Oi is the set of all output variables for a process Pi; Oi is also called the write set of Pi.
If P1 and P2 can execute in parallel (which is written as P1 || P2), then the following three conditions must hold:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
Bernstein’s Conditions - 2
In terms of data dependences, Bernstein's conditions imply that two processes can
execute in parallel if they are flow-independent, anti-independent, and output-
independent. The parallelism relation || is commutative (Pi || Pj implies Pj || Pi), but not
transitive (Pi || Pj and Pj || Pk does not imply Pi || Pk). Therefore, || is not an
equivalence relation. Intersection of the input sets is allowed.
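As a concrete illustration, the short C sketch below tests Bernstein's three conditions for two processes whose read and write sets are encoded as bitmasks over a small, assumed universe of variables. The encoding and the function name are hypothetical, chosen only for this example.

#include <stdbool.h>
#include <stdio.h>

typedef struct { unsigned in, out; } Process;   /* read set I, write set O */

/* P1 || P2 holds only if I1 and O2, I2 and O1, and O1 and O2 are all disjoint. */
bool can_run_in_parallel(Process p1, Process p2)
{
    return (p1.in  & p2.out) == 0 &&   /* flow independence   */
           (p2.in  & p1.out) == 0 &&   /* anti-independence   */
           (p1.out & p2.out) == 0;     /* output independence */
}

int main(void)
{
    /* Hypothetical encoding: bit 0 = A, bit 1 = B, bit 2 = C, bit 3 = D. */
    Process p1 = { .in = 0x2 | 0x8, .out = 0x1 };   /* A = B + D */
    Process p2 = { .in = 0x1,       .out = 0x4 };   /* C = A * 3 */
    printf("P1 || P2 ? %s\n", can_run_in_parallel(p1, p2) ? "yes" : "no");
    return 0;
}

Here the second process reads A, which the first process writes, so the flow-independence condition fails and the sketch prints "no".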
2.2.2 Hardware and Software Parallelism
Hardware parallelism is defined by the machine architecture and hardware multiplicity,
i.e., functional parallelism times processor parallelism. It can be characterized by the
number of instructions that can be issued per machine cycle: if a processor issues k
instructions per machine cycle, it is called a k-issue processor. Conventional processors
are one-issue machines. This gives the user an indication of the peak attainable
performance.
Examples: the Intel i960CA is a three-issue processor (arithmetic, memory access,
branch); the IBM RS/6000 is a four-issue processor (arithmetic, floating-point, memory
access, branch). A machine with n k-issue processors should be able to handle a
maximum of nk threads simultaneously.
Software Parallelism
Software parallelism is defined by the control and data dependence of programs and is
revealed in the program's flow graph; i.e., it is determined by the dependences within
the code and is a function of the algorithm, programming style, and compiler
optimization.
2.2.3 The Role of Compilers
Compilers are used to exploit hardware features to improve performance. Interaction
between compiler and architecture design is a necessity in modern computer
development. More software parallelism does not necessarily improve performance on
conventional scalar processors. Ideally, the hardware and the compiler should be
designed at the same time.
2.3 Program Partitioning & Scheduling
2.3.1 Grain Size and Latency
The size of the parts or pieces of a program that can be considered for parallel
execution can vary. The sizes are roughly classified using the term “granule size,” or
simply “granularity.” The simplest measure, for example, is the number of instructions in
a program part. Grain sizes are usually described as fine, medium or coarse, depending
on the level of parallelism involved.
Latency
Latency is the time required for communication between different subsystems in a
computer. Memory latency, for example, is the time required by a processor to access
memory. Synchronization latency is the time required for two processes to synchronize
their execution. Computational granularity and communication latency are closely
related; some general observations are:
z As grain size decreases, potential parallelism increases, and overhead also
increases.
z Overhead is the cost of parallelizing a task; the principal overhead is
communication latency.
z As grain size is reduced, there are fewer operations between communications, and
hence the impact of latency increases.
z Surface-to-volume effect: inter-node communication scales with a grain's "surface"
while intra-node computation scales with its "volume", so finer grains have a
relatively higher communication demand.
Levels of Parallelism
Instruction Level Parallelism
This fine-grained, or smallest-granularity, level typically involves fewer than 20
instructions per grain. The number of candidates for parallel execution varies from two
to thousands, with about five instructions or statements being the average degree of
parallelism.
Advantages:
z There are usually many candidates for parallel execution
z Compilers can usually do a reasonable job of finding this parallelism
Loop-level Parallelism
A typical loop has fewer than 500 instructions. If the loop operations are independent
between iterations, the loop can be handled by a pipeline or by a SIMD machine. Loops
are the most commonly optimized program construct for execution on a parallel or
vector machine. Some loops (e.g., recursive ones) are difficult to handle. Loop-level
parallelism is still considered fine-grain computation.
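A small, hedged C sketch of the two cases (array names and the loop bound are arbitrary): the first loop has no dependence between iterations and is a natural candidate for a pipeline or SIMD machine, while the second carries a dependence from iteration i-1 to iteration i and cannot simply be run in parallel.

#define N 1024

/* Iteration-independent: each iteration touches only its own elements. */
void add_arrays(float a[N], const float b[N], const float c[N])
{
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];
}

/* Loop-carried ("recursive") dependence: a[i] needs a[i-1] from the
 * previous iteration, so the iterations form a sequential chain.      */
void running_update(float a[N])
{
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] * 0.5f + 1.0f;
}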
Procedure-level Parallelism
Medium-sized grain; usually fewer than 2000 instructions. Detection of parallelism is
more difficult than with smaller grains; interprocedural dependence analysis is difficult
and history-sensitive. The communication requirement is less than at the instruction
level. SPMD (single program, multiple data) execution is a special case; multitasking
also belongs to this level.
Subprogram-level Parallelism
Job-step level; the grain typically has thousands of instructions; medium- or coarse-grain
level. Job steps can overlap across different jobs. Multiprogramming is conducted at this
level. No compilers are presently available to exploit medium- or coarse-grain parallelism.
Job or Program-Level Parallelism
Corresponds to execution of essentially independent jobs or programs on a parallel
computer. This is practical for a machine with a small number of powerful processors,
but impractical for a machine with a large number of simple processors (since each
processor would take too long to process a single job).
Communication Latency
Balancing granularity and latency can yield better performance. Various latencies are
attributable to machine architecture, technology, and the communication patterns used.
Latency imposes a limiting factor on machine scalability. For example, memory latency
increases as memory capacity increases, limiting the amount of memory that can be
used with a given tolerance for communication latency.
Interprocessor Communication Latency
z Needs to be minimized by system designer
z Affected by signal delays and communication patterns. For example, n
communicating tasks may require n(n − 1)/2 communication links; this quadratic
growth effectively limits the number of processors in the system.
Communication Patterns
z Determined by algorithms used and architectural support provided
z Patterns include permutations, broadcast, multicast, and conference.
z Tradeoffs often exist between granularity of parallelism and communication
demand.
2.3.2 Grain Packing and Scheduling
Two questions:
How can I partition a program into parallel “pieces” to yield the shortest execution
time?
What is the optimal size of parallel grains?
There is an obvious tradeoff between the time spent scheduling and synchronizing
parallel grains and the speedup obtained by parallel execution.
One approach to the problem is called “grain packing.”
Program Graphs and Packing
A program graph is similar to a dependence graph. Nodes = {(n, s)}, where n is the
node name and s is the size (a larger s means a larger grain size). Edges = {(v, d)},
where v is the variable being communicated and d is the communication delay.
Packing two (or more) nodes produces a node with a larger grain size and possibly
more edges to other nodes. Packing is done to eliminate unnecessary communication
delays or reduce overall scheduling overhead.
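A minimal C sketch of this representation follows. The structure layout, field names, and the simple pack() routine are assumptions made for illustration; packing here merges node b into node a, keeping only the edges that still leave the combined grain (edges from b back to a become internal and their delay disappears).

#include <string.h>

#define MAX_EDGES 8

typedef struct {
    char dst[8];     /* name of the destination node          */
    char var[8];     /* v: variable being communicated        */
    int  delay;      /* d: communication delay                */
} Edge;

typedef struct {
    char name[8];    /* n: node name                          */
    int  size;       /* s: grain size, e.g. instruction count */
    Edge edges[MAX_EDGES];
    int  n_edges;
} Node;

/* Pack node b into node a: the new grain does the work of both. */
void pack(Node *a, const Node *b)
{
    a->size += b->size;
    for (int i = 0; i < b->n_edges; i++)
        if (strcmp(b->edges[i].dst, a->name) != 0 && a->n_edges < MAX_EDGES)
            a->edges[a->n_edges++] = b->edges[i];   /* keep external edges only */
}

A fuller implementation would also drop a's own edges into b, since those too become internal after packing; the sketch omits that step for brevity.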
Scheduling
A schedule is a mapping of nodes to processors and start times such that
communication delay requirements are observed, and no two nodes are executing on
the same processor at the same time. Some general scheduling goals are:
z Schedule all fine-grain activities in a node to the same processor to minimize
communication delays.
z Select grain sizes for packing to achieve better schedules for a particular parallel
machine.
Node Duplication
Grain packing may potentially eliminate interprocessor communication, but it may not
always produce a shorter schedule. By duplicating nodes (that is, executing some
instructions on multiple processors), we may eliminate some interprocessor
communication, and thus produce a shorter schedule.
Program Partitioning and Scheduling
Scheduling and allocation is a highly important issue, since an inappropriate scheduling
of tasks can fail to exploit the true potential of the system and can offset the gain from
parallelization. Here we focus on the scheduling aspect. The objective of scheduling is
to minimize the completion time of a parallel application by properly allocating the tasks
to the processors. In a broad sense, the scheduling problem exists in two forms: static
and dynamic. In static scheduling, which is usually done at compile time, the
characteristics of a parallel program (such as task processing times, communication,
data dependencies, and synchronization requirements) are known before program
execution.
A parallel program, therefore, can be represented by a node- and edge-weighted
directed acyclic graph (DAG), in which the node weights represent task processing
times and the edge weights represent data dependencies as well as the communication
times between tasks. In dynamic scheduling, only a few assumptions about the parallel
program can be made before execution, and thus scheduling decisions have to be
made on the fly. The goal of a dynamic scheduling algorithm therefore includes not only
minimizing the program completion time but also minimizing the scheduling overhead,
which constitutes a significant portion of the cost paid for running the scheduler. In
general, dynamic scheduling is an NP-hard problem.
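To make the static form concrete, the following C sketch takes such a DAG (assumed to be supplied in topological order) and computes each task's earliest possible start time when processors are unlimited; the field names and array bounds are invented for the example.

#define MAX_TASKS 16

typedef struct {
    int proc_time;          /* node weight: task processing time              */
    int n_preds;            /* number of predecessor tasks                    */
    int pred[MAX_TASKS];    /* indices of the predecessors                    */
    int comm[MAX_TASKS];    /* edge weight: communication time from each pred */
} Task;

/* Tasks must be indexed in topological order for this single pass to work. */
void earliest_start(const Task t[], int n, int start[])
{
    for (int i = 0; i < n; i++) {
        start[i] = 0;
        for (int j = 0; j < t[i].n_preds; j++) {
            int p     = t[i].pred[j];
            int ready = start[p] + t[p].proc_time + t[i].comm[j];
            if (ready > start[i])
                start[i] = ready;
        }
    }
}

The largest value of start[i] + t[i].proc_time over all tasks is the length of the critical path, a lower bound on the completion time that any static schedule tries to approach.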
2.4 Program Flow Mechanism
Conventional machines use a control flow mechanism in which the order of program
execution is explicitly stated in the user program. Dataflow machines execute
instructions as soon as operand availability is determined.
Reduction machines trigger an instruction’s execution based on the demand for its
results.
Control Flow vs. Data Flow
In control flow computers the next instruction is executed when the previous instruction,
as stored in the program, has been executed, whereas in data flow computers an
instruction is executed as soon as the data (operands) required for that instruction are
available.
Control flow machines use shared memory for instructions and data. Since variables
can be updated by many instructions, there may be side effects on other instructions;
these side effects frequently prevent parallel processing. Single-processor systems are
inherently sequential.
Instructions in dataflow machines are unordered and can be executed as soon as
their operands are available; data is held in the instructions themselves. Data tokens
are passed from an instruction to its dependents to trigger execution.
Data Flow Features
z No need for shared memory, a program counter, or a control sequencer.
z Special mechanisms are required to detect data availability, match data tokens
with the instructions needing them, and enable the chain reaction of asynchronous
instruction execution.
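The C sketch below caricatures the token-matching idea for a two-operand instruction: tokens arrive one at a time, the data travels with the instruction, and the instruction fires as soon as both operand slots are full. The types and function names are invented for illustration and do not describe any real dataflow machine.

#include <stdbool.h>
#include <stdio.h>

typedef struct {
    const char *op;     /* e.g. "+"                                   */
    int  operand[2];    /* operand values travel with the instruction */
    bool present[2];    /* token-matching state for each slot         */
} DfInstr;

/* Deliver one token; returns true if the instruction fired. */
bool deliver_token(DfInstr *instr, int slot, int value)
{
    instr->operand[slot] = value;
    instr->present[slot] = true;
    if (instr->present[0] && instr->present[1]) {
        printf("fire: %d %s %d\n", instr->operand[0], instr->op, instr->operand[1]);
        instr->present[0] = instr->present[1] = false;  /* ready for the next pair */
        return true;
    }
    return false;
}

int main(void)
{
    DfInstr add = { .op = "+" };
    deliver_token(&add, 0, 3);   /* nothing happens yet                   */
    deliver_token(&add, 1, 4);   /* second token arrives: instruction fires */
    return 0;
}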
A Dataflow Architecture – 1
The Arvind machine (MIT) has N PEs and an N-by-N interconnection network. Each
PE has a token-matching mechanism that dispatches only instructions with data tokens
available. Each datum is tagged with
z address of instruction to which it belongs
z context in which the instruction is being executed
Tagged tokens enter the PE through a local path (pipelined), and can also be
communicated to other PEs through the routing network. Instruction address(es)
effectively replace the program counter in a control flow machine. Context identifier
effectively replaces the frame base register in a control flow machine. Since the
dataflow machine matches the data tags from one instruction with successors,
synchronized instruction execution is implicit.
An I-structure in each PE is provided to eliminate excessive copying of data
structures. Each word of the I-structure has a two-bit tag indicating whether the value is
empty, full, or has pending read requests.
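A small C sketch of such a tagged word follows; the state names and layout are assumptions made for illustration only, since the actual encoding is machine specific.

/* One I-structure word: a two-bit tag plus the stored value. */
typedef enum {
    IS_EMPTY        = 0,   /* no value has been written yet                 */
    IS_FULL         = 1,   /* value present; reads may proceed              */
    IS_PENDING_READ = 2    /* reads arrived before the write and are queued */
} IStructTag;

typedef struct {
    IStructTag tag;
    int        value;      /* meaningful only when tag == IS_FULL */
} IStructWord;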
The I-structure is a retreat from the pure dataflow approach. Special compiler
technology is needed for dataflow machines.
Demand-Driven Mechanisms
Data-driven machines select instructions for execution based on the availability of their
operands; this is essentially a bottom-up approach.
Demand-driven machines take a top-down approach, attempting to execute the
instruction (a demander) that yields the final result. This triggers the execution of
instructions that yield its operands, and so forth. The demand-driven approach matches
naturally with functional programming languages (e.g. LISP and SCHEME).
Comparison of Flow Mechanisms
Pattern-driven computers: an instruction is executed when a particular data pattern is
obtained as output. There are two types of pattern-driven computers.
String-reduction model: each demander gets a separate copy of the expression string
to evaluate; each reduction step has an operator and embedded references to demand
the corresponding operands; each operator is suspended while its arguments are
evaluated.
Graph-reduction model: the expression graph is reduced by evaluating branches or
subgraphs, possibly in parallel, with demanders given pointers to the results of the
reductions. It is based on sharing pointers to arguments; traversal and reversal of
pointers continues until constant arguments are encountered.
2.5 Summary
Fine-grain parallelism is exploited at the instruction or loop level, assisted by the
compiler. Medium-grain parallelism (task or job step) requires programmer and compiler
support. Coarse-grain parallelism relies heavily on effective OS support. Shared-variable
communication is used at the fine- and medium-grain levels. Message passing can be
used for medium- and coarse-grain communication, but fine-grain parallelism really
needs a better technique because of its heavier communication requirements. Control
flow machines give complete control but are less efficient than the other approaches.
Data flow (eager evaluation) machines offer high potential parallelism and throughput
and freedom from side effects, but they have high control overhead, lose time waiting
for unneeded arguments, and have difficulty manipulating data structures. Reduction
(lazy evaluation) machines have high parallelism potential, easy manipulation of data
structures, and execute only the required instructions; however, they do not share
objects with changing local state, and they require time to propagate demand tokens.
2.6 Check Your Progress
Multiple Choice Questions
1. In latest generation computers, the instructions are executed
(a) Parallel only
(b) Sequentially only
(c) Both sequentially and parallel
(d) All of above
2. Which of the following are functions of an operating system?
(a) Allocates resources
(b) Monitors Activities
(c) Manages disks and files
(d) All of the above
3. Servers are computers that provide resources to other computers connected to a:
(a) networked
(b) mainframe
(c) supercomputer
(d) client
4. Which of the following contains permanent data and gets updated during the
processing of transactions?
(a) Operating System File
(b) Transaction file
(c) Software File
(d) Master file
5. Which of the following helps to protect floppy disks from data getting accidentally
erased?
(a) Access notch
(b) Write-protect notch
(c) Entry notch
(d) Input notch
6. ……………… is a technique in which some of the CPU’s address lines forming an
input to the address decoder are ignored.
(a) Microprogramming
(b) Instruction pre-fetching
(c) Pipelining
(d) Partial decoding
7. An interface that can be used to connect the microcomputer bus to …………… is
called an I/O Port.
(a) Flip Flops
(b) Memory
(c) Peripheral devices
(d) Multiplexers
8. ………….. is an electrical pathway through which the processor communicates with
the internal and external devices attached to the computer.
(a) Computer Bus
(b) Hazard
(c) Memory
(d) Disk
9. Identify the type of serial communication error condition in which a 0 is received
instead of a stop bit (which is always a 1)?
(a) Framing error
(b) Parity error
(c) Overrun error
(d) Under-run error
10. A collection of _____________ is called a micro program.
(a) large scale operations
(b) Registers
(c) DMA
(d) Microinstructions
2.7 Questions and Exercises
1. Perform a data dependence analysis on each of the following Fortran program
fragments. Show the dependence graphs among the statements with justification.
a.
S1: A = B + D
S2: C = A X 3
S3: A = A + C
S4: E = A / 2
b.
S1: X = SIN(Y)
S2: Z = X + W
S3: Y = -2.5 X W
S4: X = COS(Z)
3. Explain data, control and resource dependence.
4. Describe Bernstein’s conditions.
5. Explain hardware and software parallelism and the role of compilers.
6. Explain the following
(a) Instruction level
(b) Loop level
(c) Procedure level
(d) Subprogram level and
(e) Job/program level parallelism.
7. Compare dataflow, control-flow computers and reduction computer architectures.
8. Explain reduction machine models with respect to the demand-driven mechanism.
2.8 Key Terms
z Dependence Graph: A directed graph whose nodes represent calculations and
whose edges represent dependencies among those calculations. If the calculation
represented by node k depends on the calculations represented by nodes i and j,
then the dependence graph contains the edges i-k and j-k.
z Data Dependency: a situation existing between two statements if one statement
can store into a location that is later accessed by the other statement.
z Granularity: The size of the operations done by a process between communication
events. A fine-grained process may perform only a few arithmetic operations
between processing one message and the next, whereas a coarse-grained process
may perform millions.
z Control-flow computer: an architecture with one or more program counters that
determine the order in which instructions are executed.
z Dataflow: A model of parallel computing in which programs are represented as
dependence graphs and each operation is automatically blocked until the values on
which it depends are available. The parallel functional and parallel logic
programming models are very similar to the dataflow model.
Check Your Progress: Answers
1. c. Both sequentially and parallel
2. d. All of the above
3. b. mainframe
4. d. Master file
5. b. Write-protect notch
6. d. Partial decoding
7. c. Peripheral devices
8. a. Computer Bus
9. a. Framing error
10. d. Microinstructions
2.9 Further Readings
z S.S. Jadhav, Advanced Computer Architecture and Computing, Technical
Publications, 2009
z V. Rajaraman, T. Radhakrishnan, Computer Organization and Architecture, PHI
Learning Pvt. Ltd., 2007
z P. V. S. Rao, Computer System Architecture, PHI Learning Pvt. Ltd., 2008
z Sajjan G. Shiva, Advanced Computer Architectures, CRC Press, 2005