Module 4 - Parallel Processing
4. Parallelism
4.1 Introduction
Parallel Processing
✓ Parallel processing can be described as a class of techniques that enables a
system to carry out multiple data-processing tasks simultaneously, increasing the
computational speed of the computer system.
✓ A parallel processing system can carry out simultaneous data-processing to
achieve faster execution time.
✓ For instance, while an instruction is being processed in the ALU component of the
CPU, the next instruction can be read from memory.
✓ The primary purpose of parallel processing is to enhance the computer's processing
capability and increase its throughput.
✓ A parallel processing system can be achieved by having a multiplicity of
functional units that perform identical or different operations simultaneously.
✓ The data can be distributed among the multiple functional units.
✓ The following diagram shows one possible way of separating the execution unit
into eight functional units operating in parallel.
✓ The operation performed in each functional unit is indicated in each block of the
diagram:
✓ The adder and integer multiplier perform arithmetic operations on integer
numbers.
✓ The floating-point operations are separated into three circuits operating in parallel.
✓ The logic, shift, and increment operations can be performed concurrently on
different data.
✓ All units are independent of each other, so one number can be shifted while
another number is being incremented.
✓ Parallel computers can be roughly classified according to the level at which the
hardware supports parallelism, with multi-core and multi-processor computers
having multiple processing elements within a single machine.
✓ In some cases parallelism is transparent to the programmer, such as in bit-level or
instruction-level parallelism.
✓ But explicitly parallel algorithms, particularly those that use concurrency, are more
difficult to write than sequential ones, because concurrency introduces several new
classes of potential software bugs, of which race conditions are the most common.
✓ Communication and synchronization between the different subtasks are typically
some of the greatest obstacles to getting optimal parallel program performance.
Types of Parallelism:
1. Bit-level parallelism: It is the form of parallel computing based on
increasing the processor's word size. It reduces the number of instructions that the
system must execute in order to perform a task on large-sized data.
Example: Consider a scenario where an 8-bit processor must compute the sum of
two 16-bit integers. It must first sum the 8 lower-order bits, then add the 8
higher-order bits, thus requiring two instructions to perform the operation. A 16-bit
processor can perform the operation with just one instruction (a C sketch of this
scenario follows this list).
2. Instruction-level parallelism: Without it, a processor can issue at most one
instruction in each clock cycle. The instructions of a program can be re-ordered and
grouped so that they are later executed concurrently without affecting the result of the
program. This is called instruction-level parallelism.
3. Task parallelism: Task parallelism employs the decomposition of a task into
subtasks and then allocates each of the subtasks for execution. The processors
perform execution of the subtasks concurrently.
4. Data-level parallelism (DLP): Instructions from a single stream operate
concurrently on several data elements. It is limited by non-regular data manipulation
patterns and by memory bandwidth.
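The bit-level example in item 1 can be made concrete with the following C sketch; the function name add16_on_8bit and the operand values are our own, chosen only for illustration:

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative only: forming a 16-bit sum the way an 8-bit processor must,
       in two add instructions (low bytes first, then high bytes plus carry). */
    uint16_t add16_on_8bit(uint8_t a_lo, uint8_t a_hi, uint8_t b_lo, uint8_t b_hi)
    {
        uint8_t lo    = (uint8_t)(a_lo + b_lo);          /* instruction 1: low bytes  */
        uint8_t carry = lo < a_lo;                       /* carry out of the low add  */
        uint8_t hi    = (uint8_t)(a_hi + b_hi + carry);  /* instruction 2: high bytes */
        return (uint16_t)((uint16_t)hi << 8 | lo);
    }

    int main(void)
    {
        uint16_t a = 0x12F0, b = 0x0130;
        uint16_t direct = (uint16_t)(a + b);  /* one instruction on a 16-bit processor */
        uint16_t split  = add16_on_8bit(a & 0xFF, a >> 8, b & 0xFF, b >> 8);
        printf("%04X %04X\n", direct, split); /* both print 1420 */
        return 0;
    }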
Architectural Trends
✓ When multiple operations are executed in parallel, the number of cycles needed to
execute the program is reduced.
✓ However, resources are needed to support each of the concurrent activities.
✓ Resources are also needed to allocate local storage.
✓ The best performance is achieved by an intermediate plan that uses resources to
exploit a degree of parallelism together with a degree of locality.
✓ Generally, the history of computer architecture has been divided into four
generations based on the following basic technologies:
• Vacuum tubes
• Transistors
• Integrated circuits
• VLSI
✓ Until 1985, progress was dominated by growth in bit-level parallelism:
✓ 4-bit microprocessors were followed by 8-bit, 16-bit, and so on.
✓ To reduce the number of cycles needed to perform a full 32-bit operation, the
width of the datapath was doubled. Later on, 64-bit operations were introduced.
✓ Growth in instruction-level parallelism dominated the mid-80s to mid-90s.
✓ The RISC approach showed that it was simple to pipeline the steps of instruction
processing so that, on average, an instruction is executed in almost every cycle.
✓ Growth in compiler technology has made instruction pipelines more productive.
✓ In the mid-80s, microprocessor-based computers consisted of
• An integer processing unit
• A floating-point unit
• A cache controller
• SRAMs for the cache data
• Tag storage
✓ As chip capacity increased, all these components were merged into a single chip.
✓ Thus, a single chip consisted of separate hardware for integer arithmetic, floating-point
operations, memory operations and branch operations.
✓ Besides pipelining individual instructions, such a processor fetches multiple instructions
at a time and sends them in parallel to different functional units whenever possible.
This type of instruction-level parallelism is called superscalar execution.
FLYNN'S CLASSIFICATION
✓ Flynn's taxonomy is a classification of parallel computer architectures based on
the number of concurrent instruction streams (single or multiple) and data
streams (single or multiple) available in the architecture.
✓ The four categories in Flynn's taxonomy are the following:
1. (SISD) single instruction, single data
2. (SIMD) single instruction, multiple data
3. (MISD) multiple instruction, single data
4. (MIMD) multiple instruction, multiple data
✓ Instruction stream: the sequence of instructions as executed by the machine.
✓ Data stream: a sequence of data, including input and partial or temporary results,
referenced by the instruction stream.
✓ Instructions are decoded by the control unit, and the control unit then sends the
instructions to the processing units for execution.
✓ The data stream flows between the processors and memory bidirectionally.
SISD
An SISD computing system is a uniprocessor machine which is capable of executing a
single instruction operating on a single data stream.
SIMD
• An SIMD system is a multiprocessor machine capable of executing the same
instruction on all the CPUs but operating on different data streams.
✓ Machines based on an SIMD model are well suited to scientific computing since
they involve lots of vector and matrix operations.
✓ So that the information can be passed to all the processing elements (PEs), the
organized data elements of vectors are divided into multiple sets (N sets for an
N-PE system), and each PE can process one data set (see the sketch below).
✓ A dominant representative of SIMD systems is Cray's vector processing machine.
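A minimal C sketch of this data distribution, assuming the vector length vl divides evenly among the PEs (the function name simd_add and the value of N are our own):

    /* Illustrative division of one SIMD operation c = a + b among N PEs.
       The outer loop is conceptually parallel: on real SIMD hardware every
       PE applies the same instruction to its own data set in lockstep. */
    #define N 4  /* assumed number of processing elements */

    void simd_add(const float *a, const float *b, float *c, int vl)
    {
        for (int p = 0; p < N; p++)                          /* one iteration per PE   */
            for (int i = p * (vl / N); i < (p + 1) * (vl / N); i++)
                c[i] = a[i] + b[i];                          /* same op, different data */
    }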
MISD
✓ An MISD computing system is a multiprocessor machine capable of executing
different instructions on different PEs, all of them operating on the same
data set.
✓ The system performs different operations on the same data set. Machines built
using the MISD model are not useful in most applications; a few machines
have been built, but none of them are available commercially.
MIMD
✓ An MIMD system is a multiprocessor machine which is capable of executing
multiple instructions on multiple data sets.
✓ Each PE in the MIMD model has separate instruction and data streams; therefore
machines built using this model are capable of running any kind of application.
✓ Unlike SIMD and MISD machines, PEs in MIMD machines work
asynchronously.
✓ MIMD machines are broadly categorized into
• shared-memory MIMD and
• distributed-memory MIMD
based on the way PEs are coupled to the main memory.
In the shared-memory MIMD model (tightly coupled multiprocessor systems), all the
PEs are connected to a single global memory and they all have access to it. The
communication between PEs in this model takes place through the shared memory;
modification of the data stored in the global memory by one PE is visible to all other PEs.
Dominant representative shared-memory MIMD systems are Silicon Graphics machines
and Sun/IBM's SMP (Symmetric Multi-Processing).
VECTOR ARCHITECTURES
✓ A multithreaded CPU is not a parallel architecture, strictly speaking; multithreading
is obtained within a single CPU, but it allows a programmer to design and develop
applications as a set of programs that can virtually execute in parallel:
namely, threads.
✓ Multithreading is a solution to avoid wasting clock cycles while missing data is
fetched: the CPU manages several peer threads concurrently; if a thread gets
blocked, the CPU can execute instructions of another thread, thus keeping
functional units busy.
✓ Each thread must have a private program counter and a set of private registers,
separate from other threads.
✓ In a traditional scalar processor, the basic data type is an n-bit word.
✓ The architecture often exposes a register file of words, and the instruction set is
composed of instructions that operate on individual words.
✓ In a vector architecture, there is support for a vector data type, where a vector is a
collection of VL n-bit words (VL is the vector length).
✓ There may also be a vector register file, which was a key innovation of the Cray
architecture.
✓ Previously, vector machines operated on vectors stored in main memory.
✓ Figures 1 and 2 illustrate the difference between vector and scalar data types, and
the operations that can be performed on them.
✓ Vector load/store instructions provide the ability to do strided and scatter/gather
memory accesses, which take data elements distributed throughout memory and
pack them into sequential vectors/streams placed in vector/stream registers
(illustrated in the sketch after this list).
✓ This promotes data locality.
✓ It results in less data pollution, since only useful data is loaded from the memory
system.
✓ It provides latency tolerance because there can be many simultaneous outstanding
memory accesses.
✓ Vector instructions such as VLD and VST provide this capability.
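The two access patterns can be expressed in scalar C as follows; a real vector unit performs each whole loop as a single vector load, and the function names vld_strided and vld_gather are illustrative, not a real ISA:

    #include <stddef.h>

    /* Strided load: collect every stride-th word into a dense vector register,
       e.g. one column of a row-major matrix. */
    void vld_strided(double *vreg, const double *mem, size_t vl, size_t stride)
    {
        for (size_t i = 0; i < vl; i++)
            vreg[i] = mem[i * stride];
    }

    /* Gather load: indices come from another register, giving indirect access
       to sparse or irregularly placed data. */
    void vld_gather(double *vreg, const double *mem, const size_t *index, size_t vl)
    {
        for (size_t i = 0; i < vl; i++)
            vreg[i] = mem[index[i]];
    }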
HARDWARE MULTITHREADING
Multithreading
• A mechanism by which the instruction stream is divided into several smaller
streams (threads) that can be executed in parallel is called multithreading.
Hardware Multithreading
• Increasing the utilization of a processor by switching to another thread when one
thread is stalled is known as hardware multithreading.
Thread
• A thread includes the program counter, the register state, and the
stack. It is a lightweight process; whereas threads commonly share a single
address space, processes don't.
Thread Switch
• The act of switching processor control from one thread to another within
the same process. It is much less costly than a process switch.
Process
• A process includes one or more threads, the address space, and the
operating system state. Hence, a process switch usually invokes the operating
system, but a thread switch does not.
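These definitions can be seen in a minimal pthreads sketch (our own illustration; compile with -pthread): both threads run in the same address space and therefore see the same global variable, while each has its own stack and execution state, and the mutex guards against the race conditions mentioned earlier in this module.

    #include <pthread.h>
    #include <stdio.h>

    int shared_counter = 0;                    /* one copy, shared by all threads */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *worker(void *arg)
    {
        (void)arg;
        int local = 0;                         /* per-thread: lives on this thread's stack */
        for (int i = 0; i < 1000; i++) {
            local++;
            pthread_mutex_lock(&lock);         /* synchronize access to shared state */
            shared_counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("shared_counter = %d\n", shared_counter);  /* prints 2000 */
        return 0;
    }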
Types of Multi-threading
1. Fine-grained Multithreading
2. Coarse-grained Multithreading
3. Simultaneous Multithreading
Coarse-grained Multithreading
A version of hardware multithreading that implies switching between threads only
after significant events, such as a last-level cache miss.
• This change relieves the need to have thread switching be extremely fast and
is much less likely to slow down the execution of an individual thread, since
instructions from other threads will only be issued when a thread encounters
a costly stall.
Advantage
• Relieves the need for very fast thread switching.
• Is much less likely to slow down an individual thread.
Disadvantage
• It is hard to overcome throughput losses from shorter stalls, due to pipeline
start-up costs.
• Since the CPU issues instructions from only one thread, when a stall occurs the
pipeline must be emptied.
• A new thread must fill the pipeline before instructions can complete.
• Due to this start-up overhead, coarse-grained multithreading is much more
useful for reducing the penalty of high-cost stalls, where pipeline refill time is
negligible compared to the stall time.
Fine-grained Multithreading
• A version of hardware multithreading that implies switching between threads
after every instruction, resulting in interleaved execution of multiple threads. It
switches from one thread to another at each clock cycle.
• This interleaving is often done in a round-robin fashion, skipping any threads
that are stalled at that clock cycle.
To make fine-grained multithreading practical, the processor must be able to switch
threads on every clock cycle.
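A toy C simulation of this round-robin issue policy (entirely illustrative; the stall pattern below is made-up data):

    #include <stdio.h>
    #include <stdbool.h>

    #define NTHREADS 3
    #define NCYCLES  8

    int main(void)
    {
        /* stalled[t][c]: is thread t stalled in cycle c? (assumed pattern) */
        bool stalled[NTHREADS][NCYCLES] = {
            {0, 0, 1, 1, 0, 0, 0, 0},   /* thread 0 stalls in cycles 2-3 */
            {0, 0, 0, 0, 0, 1, 0, 0},   /* thread 1 stalls in cycle 5   */
            {0, 0, 0, 0, 0, 0, 0, 0},   /* thread 2 never stalls        */
        };
        int next = 0;
        for (int cycle = 0; cycle < NCYCLES; cycle++) {
            int issued = -1;
            /* round-robin over the threads, skipping any that are stalled */
            for (int k = 0; k < NTHREADS; k++) {
                int t = (next + k) % NTHREADS;
                if (!stalled[t][cycle]) {
                    issued = t;
                    next = (t + 1) % NTHREADS;
                    break;
                }
            }
            if (issued >= 0)
                printf("cycle %d: issue from thread %d\n", cycle, issued);
            else
                printf("cycle %d: all threads stalled\n", cycle);
        }
        return 0;
    }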
Advantage
• Vertical waste is eliminated.
• Pipeline hazards are less likely to arise.
• Zero switching overhead.
• Ability to hide latency within a thread, i.e., it can hide the throughput losses
that arise from both short and long stalls.
• Instructions from other threads can be executed when one thread stalls.
• High execution efficiency.
• Potentially less complex than alternative high-performance processors.
Disadvantage
• Clock cycles are wasted if a thread has few operations to execute.
• Needs a large number of threads to be effective.
• It is more expensive than coarse-grained multithreading.
• It slows down the execution of individual threads, since a thread that is
ready to execute without stalls will be delayed by instructions from other
threads.
Simultaneous multithreading (SMT)
• It is a variation on hardware multithreading that uses the resources of a
multiple-issue, dynamically scheduled pipelined processor to exploit thread-level
parallelism at the same time it exploits instruction-level parallelism.
• The key insight that motivates SMT is that multiple-issue processors often
have more functional unit parallelism available than most single threads can
effectively use.
Since SMT relies on the existing dynamic mechanisms, it does not switch resources
every cycle.
• Instead, SMT is always executing instructions from multiple threads, leaving it
to the hardware to associate instruction slots and renamed registers with their
proper threads.
Advantage
• Its ability to boost utilization by dynamically scheduling functional
units among multiple threads.
• It increases hardware design flexibility.
• It produces better performance and adds resources in a fine-grained manner.
Disadvantage
It cannot improve performance if any of the shared resources are the limiting
bottlenecks for performance.
Uniform Memory Access (UMA) multiprocessors
• Physical memory is uniformly shared by all processors, with equal access time to
all words.
• Processors may have local cache memories. Peripherals are also shared in some
fashion.
• UMA architecture models are of two types:
Symmetric:
• All processors have equal access to all peripheral
devices. All processors are identical.
Asymmetric:
• One processor (the master) executes the operating system; the other
processors may be of different types and may be dedicated to
special tasks.
Non Uniform Memory Access (NUMA) multiprocessors
• In shared-memory multiprocessor systems, local memories can be connected
with every processor. The collection of all local memories forms the global
memory being shared.
• In this way, global memory is distributed to all the processors. In this case, the
access to a local memory is uniform for its corresponding processor, as it is
attached to that local memory.
• But if one reference is to the local memory of some other remote processor,
then the access is not uniform.
• It depends on the location of the memory. Thus, all memory words are not
accessed uniformly. All local memories form a global address space accessible
by all processors.
• Programming NUMAs is harder, but NUMAs can scale to larger sizes and
have lower latency to local memory.
• Memory is common to all the processors. Processors easily communicate by
means of shared variables.
• These systems differ in how the memory and peripheral resources are
shared or distributed.
• The access time varies with the location of the memory word.
Superscalar Processors
The superscalar processor was first invented in 1987. It is a machine designed to improve the
performance of the scalar processor. In most applications, most of the operations are on scalar
quantities. The superscalar approach produces high-performance general-purpose processors.
The main principle of the superscalar approach is that it executes instructions independently in
different pipelines. As we already know, instruction pipelining leads to parallel processing, thereby
speeding up the processing of instructions. In a superscalar processor, multiple such pipelines are
introduced for different operations, which further improves parallel processing.
There are multiple functional units, each of which is implemented as a pipeline. Each pipeline
consists of multiple stages to handle multiple instructions at a time, which supports parallel
execution of instructions.
It increases the throughput because the CPU can execute multiple instructions per clock cycle. Thus,
superscalar processors are much faster than scalar processors.
A scalar processor works on one or two data items, while a vector processor works on multiple data
items. A superscalar processor is a combination of both: each instruction processes one data item, but
there are multiple execution units within each CPU, so multiple instructions can be processing
separate data items concurrently.
While a superscalar CPU is typically also pipelined, pipelining and superscalar execution are two
different performance enhancement techniques. It is possible to have a non-pipelined superscalar
CPU or a pipelined non-superscalar CPU. The superscalar technique is associated with the
following characteristics:
1. Instructions are issued from a sequential instruction stream.
2. CPU must dynamically check for data dependencies.
3. Should accept multiple instructions per clock cycle.
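Characteristic 2 can be sketched as a dependency check between two candidate instructions; the encoding below is our own simplification (it checks only register dependencies, not structural hazards):

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { int dest; int src1; int src2; } Instr;

    /* true if i2 can issue in the same cycle as i1 */
    bool can_dual_issue(Instr i1, Instr i2)
    {
        bool raw = (i2.src1 == i1.dest) || (i2.src2 == i1.dest); /* read-after-write  */
        bool waw = (i2.dest == i1.dest);                         /* write-after-write */
        return !raw && !waw;
    }

    int main(void)
    {
        Instr add = { 3, 1, 2 };  /* r3 = r1 + r2                              */
        Instr mul = { 5, 3, 4 };  /* r5 = r3 * r4 : reads r3, so it must wait  */
        Instr sub = { 6, 7, 8 };  /* r6 = r7 - r8 : independent of add         */
        printf("add+mul dual issue: %d\n", can_dual_issue(add, mul)); /* 0 */
        printf("add+sub dual issue: %d\n", can_dual_issue(add, sub)); /* 1 */
        return 0;
    }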
Superscalar Architecture
• Superscalar Architecture (SSA) describes a microprocessor design that executes more than one
instruction at a time during a single clock cycle.
• The design is sometimes called "Second Generation RISC". Another term used to describe superscalar
processors is multiple-instruction-issue processors.
• In an SSA design, the processor or the instruction compiler is able to determine whether an instruction
can be carried out independently of other sequential instructions, or whether it has a dependency on
another instruction and must be executed sequentially.
• In an SSA, several scalar instructions can be initiated simultaneously and executed independently.
• SSA is the result of a long series of innovations aimed at producing ever-faster microprocessors.
• It includes all the features of pipelining but, in addition, there can be several instructions executing
simultaneously in the same pipeline stage.
• SSA introduces a new level of parallelism, called instruction-level parallelism.
SimpleScalar Architecture
• SimpleScalar is an open-source computer architecture simulator written in the C programming
language.
• It is a set of tools that model a virtual computer system with CPU, cache and memory hierarchy.
• Using the tool, users can model applications that simulate programs running on a range of modern
processors and systems.
• The tool set includes sample simulators ranging from a fast functional simulator to a detailed,
out-of-order issue processor simulator.
Scalar to Superscalar
• The simplest processors are scalar processors. Each instruction executed by a scalar processor typically
manipulates one or two data items at a time.
• In a superscalar CPU the dispatcher reads instructions from memory and decides which ones can be run
in parallel.
• Therefore a superscalar processor can be viewed as having multiple parallel pipelines, each of which is
processing instructions simultaneously from a single instruction thread.
Implement Superscalar
• An SSA processor fetches multiple instructions at a time, and attempts to find nearby instructions that
are independent of each other and can therefore be executed in parallel.
• Based on the dependency analysis, the processor may issue and execute instructions in an order that
differs from that of the original machine code.
• The processor may eliminate some unnecessary dependencies by the use of additional registers (see the
renaming sketch below).
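The last point is the idea behind register renaming; the following toy C sketch (our own simplification) gives each written architectural register a fresh physical register, which dissolves write-after-write hazards:

    #include <stdio.h>

    /* Toy register renaming: each architectural register that is written gets a
       fresh physical register, so WAW/WAR dependencies between instructions
       disappear. Encoding and names are invented for this illustration. */
    #define NARCH 8

    int main(void)
    {
        int rename[NARCH];              /* architectural -> physical mapping */
        int next_phys = 0;
        for (int r = 0; r < NARCH; r++) rename[r] = next_phys++;

        /* program: both instructions write r1 (a WAW hazard before renaming)
             I1: r1 = r2 + r3
             I2: r1 = r4 + r5                                                 */
        struct { int dest, src1, src2; } prog[2] = { {1, 2, 3}, {1, 4, 5} };

        for (int i = 0; i < 2; i++) {
            int p1 = rename[prog[i].src1];        /* read sources via current map */
            int p2 = rename[prog[i].src2];
            rename[prog[i].dest] = next_phys++;   /* allocate fresh physical dest */
            printf("I%d: p%d = p%d + p%d\n", i + 1, rename[prog[i].dest], p1, p2);
        }
        /* Output shows two different destination registers (p8 and p9), so the
           two instructions can now execute in parallel. */
        return 0;
    }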
Superpipelining
• Superpipelining is based on dividing the stages into several sub-stages, thus increasing the number of
instructions which are handled by the pipeline at the same time.
• For example, by dividing each stage into two sub-stages, a pipeline can perform at twice the speed in the
ideal situation:
➢ No duplication of hardware is needed for these stages (in contrast, superscalar execution does
require duplicated hardware).
➢ Each sub-stage must handle a task that requires less than half a clock cycle.
Superscalar Issues to Consider
➢ Tasks can be divided into the following:
o Parallel decoding
o Superscalar instruction issue
o Parallel instruction execution
o Preserving the sequential consistency of execution
o Preserving the sequential consistency of exception processing
➢ Parallel decoding – a more complex task than for scalar processors.
➢ Parallel instruction execution – while instructions are executed in parallel, they are usually
completed out of order with respect to a sequential operating procedure.
➢ Superscalar instruction issue – a higher issue rate gives rise to higher processor performance, but
amplifies the restrictive effects of control and data dependencies on processor performance.
Limitations of Superscalar
➢ Instruction-fetch inefficiencies caused by both branch delays and instruction misalignment.
➢ It is not worthwhile to explore highly concurrent execution hardware; rather, it is more appropriate to
explore economical execution hardware.
➢ The degree of intrinsic parallelism in the instruction stream (instructions requiring the same
computational resources from the CPU).
➢ The complexity and time cost of the dispatcher and associated dependency-checking logic.
➢ Branch instruction processing.
VLIW PROCESSORS
➢ Very long instruction word or VLIW refers to a processor architecture designed to take advantage of
instruction level parallelism.
o An instruction of a VLIW processor consists of multiple independent operations grouped together.
o There are multiple independent functional units in a VLIW processor architecture.
o Each operation in the instruction is aligned to a functional unit.
o All functional units share the use of a common large register file.
➢ This type of processor architecture is intended to allow higher performance without the inherent complexity of
some other approaches.
Different Approaches
Other approaches to improving performance in processor architectures :
• Pipelining: Breaking up instructions into sub-steps so that instructions can be executed partially at the same
time.
• Superscalar architectures: Dispatching individual instructions to be executed completely independently in
different parts of the processor.
• Out-of-order execution: Executing instructions in an order different from the program order.
Consider, for example, three operations:
1. e = a + b
2. f = c + d
3. m = e * f
Operation 3 depends on the results of operations 1 and 2, so it cannot be calculated until both of them
are completed. However, operations 1 and 2 do not depend on any other operation, so they can be
calculated simultaneously. If we assume that each operation can be completed in one unit of time, then
these three instructions can be completed in a total of two units of time, giving an ILP of 3/2.
VLIW Compiler
• Compiler is responsible for static scheduling of instructions in VLIW processor.
• Compiler finds out which operations can be executed in parallel in the program.
• It groups these operations together in a single instruction, which is the very long instruction word.
• Compiler ensures that an operation is not issued before its operands are ready.
VLIW Instruction
• One VLIW instruction word encodes multiple operations which allows them to be initiated in a single clock
cycle.
• The operands and the operation to be performed by the various functional units are specified in the
instruction itself.
• One instruction encodes at least one operation for each execution unit of the device.
• So the length of the instruction increases with the number of execution units.
• To accommodate these operation fields, VLIW instructions are usually at least 64 bits wide, and on some
architectures much wider, up to 1024 bits.
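As an illustration, one instruction word for the four-unit machine used in the example below (2 load/store units, 1 multiply unit, 1 add unit) might be laid out as follows; the field widths and names are invented for this sketch, not taken from a real VLIW encoding:

    #include <stdint.h>
    #include <stdio.h>

    /* One very long instruction word: one operation field per functional unit.
       A unit with no useful work in a given cycle receives an explicit no-op. */
    typedef struct {
        uint16_t ldst0;  /* opcode + operands for load/store unit 0 */
        uint16_t ldst1;  /* opcode + operands for load/store unit 1 */
        uint16_t mul;    /* opcode + operands for the multiply unit */
        uint16_t add;    /* opcode + operands for the add unit      */
    } VLIWWord;

    int main(void)
    {
        printf("VLIW word width: %zu bits\n", sizeof(VLIWWord) * 8);  /* 64 */
        return 0;
    }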
ILP in VLIW
Consider the computation of y = a1x1 + a2x2 + a3x3.

On a sequential processor:
cycle 1: load a1
cycle 2: load x1
cycle 3: load a2
cycle 4: load x2
cycle 5: multiply z1 a1 x1
cycle 6: multiply z2 a2 x2
cycle 7: add y z1 z2
cycle 8: load a3
cycle 9: load x3
cycle 10: multiply z3 a3 x3
cycle 11: add y y z3
requires 11 cycles

On the VLIW processor with 2 load/store units, 1 multiply unit and 1 add unit:
cycle 1: load a1 | load x1
cycle 2: load a2 | load x2 | multiply z1 a1 x1
cycle 3: load a3 | load x3 | multiply z2 a2 x2
cycle 4: multiply z3 a3 x3 | add y z1 z2
cycle 5: add y y z3
requires 5 cycles
Block Diagram
Working
• Long instruction words are fetched from memory.
• A common multi-ported register file is used for fetching the operands and storing the results.
• Parallel random access to the register file is possible through the read/write crossbar.
• Execution in the functional units is carried out concurrently with the load/store operation of data between
RAM and the register file.
• There may be one or multiple register files for FX (fixed-point) and FP (floating-point) data.
• The processor relies on the compiler to find parallelism and schedule dependency-free program code.