MODULE- III:
Parallel processing and Multicore architecture
1303102-4: Explore multiple processor organizations (SISD, SIMD, MISD, MIMD) and memory
architectures
Contents
Multiple Processor Organization: SISD, SIMD, MISD, MIMD. Uniform memory access (UMA), Non
uniform memory access (NUMA), CC-NUMA.
Multicore: Hardware and software performance issues, need of multicore. Multicore organization,
heterogeneous multicore organization: CPU and GPU. Case study: Intel Core i7 5960X.
Parallel Processor
• A parallel processor is a computer system designed with multiple
processing units (like CPUs or cores) that can execute instructions
concurrently.
• This allows for parallel processing, where a task is broken down into
smaller subtasks that can be processed simultaneously, leading to
faster computation and improved performance.
• Main concepts in Parallel Processing:
• Multiple Processing Units:
• Parallel processors have more than one processing unit, enabling them to work on different
parts of a problem at the same time.
• Concurrent Execution:
• These processing units execute instructions simultaneously, rather than sequentially like in a
single-processor system.
• Improved Performance:
• By dividing tasks and processing them concurrently, parallel processors can significantly speed
up the execution of complex computations and large datasets.
• Example:
• A multi-core processor, where each core acts as a separate processing unit, is an
example of a parallel processor.
• Applications:
• Parallel processing is widely used in various fields, including scientific computing,
data analysis, machine learning, and graphics processing, where large amounts of
data need to be processed quickly.
What is Parallel Processing ?
• Parallel processing is used to increase the computational speed of
computer systems by performing multiple data-processing
operations simultaneously.
• For example, while one instruction is being executed in the ALU, the next
instruction can be read from memory.
• The system can have two or more ALUs and be able to execute
multiple instructions at the same time.
• In addition, a system can use two or more processors to speed up its
processing capacity. The amount of hardware increases with parallel
processing, and with it, the cost of the system increases.
Multiple Processor Organization
• Types of Parallel Processor Systems: A taxonomy first introduced by Flynn proposes the following
categories of parallel computer systems.
Flynn’s Taxonomy
Multiple Processor Organization
• Single instruction, single data (SISD) stream: A single processor executes a single instruction
stream to operate on data stored in a single memory. Uniprocessors fall into this category.
• Single instruction, multiple data (SIMD) stream: A single machine instruction controls the
simultaneous execution of a number of processing elements on a lockstep basis. Each processing
element has an associated data memory, so that instructions are executed on different sets of
data by different processors. Vector and array processors fall into this category
• Multiple instruction, single data (MISD) stream: A sequence of data is transmitted to a set of
processors, each of which executes a different instruction sequence. This structure is not
commercially implemented.
• Multiple instruction, multiple data (MIMD) stream: A set of processors simultaneously execute
different instruction sequences on different data sets. SMPs, clusters, and NUMA systems fit into
this category.
Single Instruction, Single Data (SISD)
• An SISD computing system is a uniprocessor machine which is capable
of executing a single instruction, operating on a single data stream.
• In SISD, machine instructions are processed in a sequential manner
and computers adopting this model are popularly called sequential
computers. Most conventional computers have SISD architecture. All
the instructions and data to be processed have to be stored in
primary memory.
• The speed of the processing element in the SISD model is
limited by (dependent on) the rate at which the computer can
transfer information internally. Dominant representative SISD
systems are the IBM PC and workstations.
• Breakdown:
• Single Instruction: Only one instruction is executed at a given time.
• Single Data: The instruction operates on a single data item at a time.
• Sequential Execution: Instructions are processed one after the other, in a
linear order.
Single Instruction, Single Data (SISD)
• Key characteristics of SISD:
• Uniprocessor: Typically involves a single processor.
• Sequential Processing: Instructions are executed in a step-
by-step manner.
• Basic Architecture: Represents the foundational design of
many simple computers.
• Examples of SISD systems:
• Traditional, single-core computers.
• Older mainframe and mini-computers.
• Some embedded systems.
SISD describes the core operation of a basic, non-
parallel processing system.
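To make the contrast with the parallel organizations that follow concrete, here is a minimal C sketch of SISD execution (illustrative only): one instruction stream operating on one data element at a time, strictly in order.

```c
#include <stdio.h>

int main(void) {
    int data[4] = {1, 2, 3, 4};
    int sum = 0;

    /* One instruction stream, one data item per step, executed sequentially. */
    for (int i = 0; i < 4; i++)
        sum += data[i];          /* each add touches a single data element */

    printf("sum = %d\n", sum);   /* prints: sum = 10 */
    return 0;
}
```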
Single Instruction, Multiple Data (SIMD)
• An SIMD system is a multiprocessor machine
capable of executing the same instruction on all
the CPUs but operating on different data
streams.
• Machines based on an SIMD model are well
suited to scientific computing since they involve
lots of vector and matrix operations.
• So that information can be passed to all the
processing elements (PEs), the organized data
elements of a vector can be divided into multiple
sets (N sets for an N-PE system), and each PE can
process one data set.
• A dominant representative of SIMD systems is Cray's
vector processing machine.
Single Instruction, Multiple Data (SIMD)
• Detailed explanation:
• Parallelism: SIMD leverages data-level parallelism by processing multiple data points
concurrently. Instead of processing data sequentially, it uses multiple processing elements to
operate on different data points at the same time, under the control of a single instruction.
• Instruction Stream: A single instruction stream provides instructions to all processing
elements. This means that all processors execute the same instruction, but on different data.
• Examples:
• SIMD is implemented in various architectures, including vector processors, GPUs (Graphics
Processing Units), and multimedia extensions (like MMX, SSE) in CPUs.
• Benefits:
• SIMD can significantly speed up computations by reducing the number of instructions
needed to process a large dataset. This is especially beneficial for applications that require
repetitive operations on large amounts of data.
• Implementation:
• SIMD instructions are often integrated into specialized instruction sets within CPUs and
GPUs. For example, Intel's Advanced Vector Extensions (AVX) are a set of SIMD instructions.
• Software Support:
• Compilers and programming languages (like Intel's ISPC) are designed to take advantage of
SIMD capabilities, allowing developers to write code that can be automatically vectorized for
efficient parallel execution.
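As a concrete illustration of the SSE extensions mentioned above, the C sketch below adds four pairs of floats with a single SIMD instruction. It assumes an x86 processor with SSE support; compile with a standard compiler such as gcc.

```c
#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics: 128-bit registers holding four floats */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);    /* load four floats into one SIMD register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb); /* one instruction performs four additions */
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++)
        printf("%.1f ", c[i]);      /* prints: 11.0 22.0 33.0 44.0 */
    printf("\n");
    return 0;
}
```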
Multiple Instruction, Single Data (MISD)
• It is a computer architecture where multiple processing
units operate on the same data stream, but each unit
executes a different instruction.
• It's one of the four main classifications in Flynn's
taxonomy .
• MISD is primarily a theoretical concept and is rarely used
in practical, commercially available systems.
• An MISD computing system is a multiprocessor machine
capable of executing different instructions on different
PEs but all of them operating on the same dataset
• Key Characteristics of MISD:
• Multiple Instructions: Each processing unit receives and
executes a unique instruction.
• Single Data Stream: All processing units work on the same
input data.
• Redundancy: A key application is in fault-tolerant systems
where the same data is processed by multiple units to ensure
reliability.
Multiple Instruction, Single Data (MISD)
• Example:
• Imagine a scenario where a spacecraft's navigation
system needs to be highly reliable. An MISD
architecture could be used where multiple processors
each perform different calculations on the same
sensor data (e.g., position, velocity). If one processor
fails or produces an incorrect result, the other
processors can provide a backup, ensuring the
spacecraft's continued operation.
• Theoretical Nature:
• While MISD offers redundancy and fault tolerance, it's
generally not as efficient as other architectures like
SIMD (Single Instruction, Multiple Data) or MIMD
(Multiple Instruction, Multiple Data) for most
applications. Finding practical, commercially available
examples of MISD is challenging.
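The fault-tolerance use of MISD described above can be sketched in software. The C sketch below is purely illustrative (all unit names are invented): three routines apply different instruction sequences to the same data stream, and a majority vote masks a faulty unit.

```c
#include <stdio.h>

/* Three "processing units" running different code on the same input. */
static int unit_a(int x) { return x * 2; }       /* correct: doubles the input */
static int unit_b(int x) { return x + x; }       /* correct: different code path */
static int unit_c(int x) { return x * 2 + 1; }   /* deliberately faulty unit */

/* Majority vote: any single faulty unit is outvoted by the other two. */
static int vote(int a, int b, int c) {
    if (a == b || a == c) return a;
    return b;  /* otherwise b and c agree (or all differ) */
}

int main(void) {
    int sensor = 21;  /* a single shared data stream */
    int result = vote(unit_a(sensor), unit_b(sensor), unit_c(sensor));
    printf("voted result: %d\n", result);  /* prints 42 despite the fault */
    return 0;
}
```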
Multiple Instruction, Multiple Data(MIMD)
• It is a parallel processing architecture where multiple processors execute
different instructions on different data sets concurrently.
• This approach allows for high levels of parallelism and is commonly used in
supercomputers, clusters, and multi-core processors.
• An MIMD system is a multiprocessor machine which is capable of executing
multiple instructions on multiple data sets.
• Each PE (Processing Element) in the MIMD model has separate instruction and
data streams; therefore, machines built using this model are capable of handling
any kind of application.
• Unlike SIMD and MISD machines, PEs in MIMD machines work asynchronously.
• Detailed breakdown:
• Multiple Instructions: Each processor in a MIMD system can execute a different program
or sequence of instructions.
• Multiple Data: Each processor can work on its own unique data set, enabling parallel
processing of different data streams.
• Asynchronous Operation: MIMD processors typically operate independently and
asynchronously, meaning they don't need to be synchronized at every step.
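The breakdown above can be made concrete with POSIX threads: each thread below executes a different instruction stream on its own data set, asynchronously. A minimal sketch assuming a POSIX system (compile with -pthread); the task names are invented for illustration.

```c
#include <stdio.h>
#include <pthread.h>

/* Different instructions on different data: the essence of MIMD. */
static void *sum_task(void *arg) {
    int *data = (int *)arg;
    long sum = 0;
    for (int i = 0; i < 4; i++) sum += data[i];
    printf("sum thread: %ld\n", sum);        /* 1+2+3+4 = 10 */
    return NULL;
}

static void *product_task(void *arg) {
    int *data = (int *)arg;
    long prod = 1;
    for (int i = 0; i < 4; i++) prod *= data[i];
    printf("product thread: %ld\n", prod);   /* 5*6*7*8 = 1680 */
    return NULL;
}

int main(void) {
    int set1[4] = {1, 2, 3, 4};
    int set2[4] = {5, 6, 7, 8};
    pthread_t t1, t2;

    /* Two instruction streams run asynchronously on two data sets. */
    pthread_create(&t1, NULL, sum_task, set1);
    pthread_create(&t2, NULL, product_task, set2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```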
Multiple Instruction, Multiple Data(MIMD)
• Examples:
• MIMD is the basis for many parallel computing systems,
including:
• Supercomputers: These massive machines use a large
number of processors to tackle complex computational
problems.
• Computer Clusters and Grids: These systems connect
multiple computers to work together as a single large
machine.
• Symmetric Multiprocessor (SMP) systems: These systems
have multiple processors on a single motherboard.
• Multi-core Processors: Modern CPUs often have multiple
processing cores on a single chip, each acting as a separate
processor.
• Advantages:
• MIMD offers high flexibility and scalability, allowing it to
tackle a wide range of computational tasks.
• Disadvantages:
• Programming MIMD systems can be more complex than
other architectures due to the need for explicit
synchronization and data communication between
processors.
Multiple Processor Organization
• MIMDs can be further subdivided by the means by which the processors
communicate (Figure 20.1)
• If the processors share a common memory, then each processor accesses
programs and data stored in the shared memory, and processors
communicate with each other via that memory. The most common form of
such systems is known as a symmetric multiprocessor (SMP)
• Uniform memory access (UMA): All processors have access to all parts of main
memory using loads and stores. The memory access time of a processor to all
regions of memory is the same. The access times experienced by different
processors are the same. The SMP is an example of a UMA system.
• Nonuniform memory access (NUMA): All processors have access to all parts of main
memory using loads and stores. The memory access time of a processor differs
depending on which region of main memory is accessed. The last statement is true
for all processors; however, for different processors, which memory regions are
slower and which are faster differs.
• Cache-coherent NUMA (CC-NUMA): A NUMA system in which cache coherence is
maintained among the caches of the various processors.
SMP: Symmetric Multiprocessor
1.There are two or more similar processors of comparable capability.
2. These processors share the same main memory and I/O facilities and are
interconnected by a bus or other internal connection scheme, such that
memory access time is approximately the same for each processor.
3. All processors share access to I/O devices, either through the same channels
or through different channels that provide paths to the same device.
4. All processors can perform the same functions (hence the term symmetric).
5. The system is controlled by an integrated operating system that provides
interaction between processors and their programs at the job, task, file, and
data element levels.
SMP: Symmetric Multiprocessor
Uniform memory access (UMA):
• Uniform Memory Access (UMA) is a computer architecture where all
processors in a system have equal and uniform access time to all memory
locations. This means that regardless of which processor accesses a specific
memory location, the access time will be the same for all processors.
• It uses a shared memory approach
• It is a tightly coupled architecture
• Equal access time for each processor, so it can access any memory location
with the same latency (latency: the time delay between a processor issuing a request for
data and that data becoming available for processing)
• Examples of UMA Systems:
• Early symmetric multiprocessor systems (SMP)
• Sun Starfire servers
• Compaq Alpha servers
• HP V-series servers.
UMA
• Types of UMA
1. Symmetric multiprocessing
• All Processing Elements (PEs) have equal
access to all peripherals
2. Asymmetric multiprocessing
• Only a subset of processing elements has
peripheral access (as in a master/slave arrangement)
Uniform memory access (UMA):
• Advantages of Uniform Memory Access (UMA)
• Easy to Implement: UMA architecture is relatively easy to implement, as all
processors or cores have equal access to the memory. This makes it an ideal
choice for small-scale systems, such as desktop computers or low-end servers.
• Low Latency: Since all memory locations have equal access times, UMA provides
low latency, which ensures that processors or cores can access memory quickly
and efficiently. This makes UMA ideal for high-performance computing
applications that require fast memory access.
• Low Cost: UMA architecture is relatively inexpensive to implement, as it requires
only a single shared memory bus to connect all processors or cores to the
memory pool. This makes it an ideal choice for low-cost computing systems
Uniform Memory Access (UMA)
• Disadvantages of Uniform Memory Access (UMA)
• Limited Scalability: UMA architecture is not scalable beyond a certain point,
as adding more processors or cores to the system can cause contention for
the memory bus. This can result in reduced system performance as
processors or cores have to wait for memory access.
• Limited Bandwidth: UMA architecture provides limited bandwidth, as all
processors or cores share a single memory bus. This can result in reduced
performance for memory-intensive applications.
• Limited Memory Capacity: UMA architecture provides limited memory
capacity, as all processors or cores share a single memory pool. This can limit
the amount of memory available to each processor or core, which can affect
system performance.
Non Uniform Memory Access (NUMA)
• In UMA, a bottleneck occurs on the shared memory bus/interconnect;
to overcome this, a new model called NUMA was developed
• The time required to access memory is not
uniform
• It is a computer memory design used in multiprocessing
systems where the time it takes to access memory depends on
which memory location a processor is trying to access
(memory access time is not equal)
• In a NUMA architecture, processors have faster access to
their local memory (memory physically located close to
them) than to non-local memory (memory located further
away, potentially on another processor's node)
• In NUMA, multiple memory controllers are used (one per node)
• For local accesses, Non-uniform Memory Access is faster than UMA
• Non-uniform Memory Access is applicable for real-time
applications and time-critical applications
Characteristics of NUMA:
• Non-Uniform Memory Access Times:
• Unlike Uniform Memory Access (UMA) where all memory locations are
equally accessible, NUMA systems have varying memory access times
depending on the memory's location relative to the processor.
• Local vs. Remote Memory:
• Processors in a NUMA system have dedicated local memory, and they can also
access remote memory on other nodes.
• Nodes:
• NUMA systems are typically organized into nodes, where each node contains
a processor and its associated local memory.
• Interconnect:
• Nodes are connected by an interconnect that allows processors to
communicate and access remote memory.
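On Linux, the node structure described above is exposed to software through the libnuma library. Below is a minimal hedged sketch of node-aware allocation, assuming a Linux system with libnuma installed (link with -lnuma):

```c
#include <stdio.h>
#include <numa.h>   /* Linux libnuma; link with -lnuma */

int main(void) {
    if (numa_available() < 0) {
        printf("NUMA is not supported on this system\n");
        return 1;
    }
    int nodes = numa_max_node() + 1;
    printf("system has %d NUMA node(s)\n", nodes);

    /* Place a buffer in node 0's local memory: threads running on that
       node will see lower access latency than threads on other nodes. */
    size_t size = 1 << 20;
    char *buf = numa_alloc_onnode(size, 0);
    if (buf) {
        buf[0] = 42;             /* pages are placed on node 0's memory */
        numa_free(buf, size);
    }
    return 0;
}
```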
NUMA
• Advantages of a Non-Uniform Memory Access (NUMA)
• Improved performance: By providing each processor with its own local
memory, NUMA can reduce memory access times and improve overall system
performance.
• Scalability: NUMA systems are highly scalable and can handle large workloads
by adding additional processors and memory nodes.
• Reduced memory contention: NUMA can help reduce memory contention by
allowing each processor to access its own local memory, reducing the need for
multiple processors to access the same memory location.
• Disadvantages of Non-Uniform Memory Access (NUMA)
• Complexity: NUMA systems can be complex to design and implement, as they
require specialized hardware and software to manage memory access.
• Higher cost: NUMA systems can be more expensive than UMA systems due to
the additional hardware and software required.
• Performance variability: In some cases, the performance of a NUMA system
may be lower than that of a UMA system, especially if the workload requires
frequent access to shared memory.
Difference between UMA and NUMA
1. UMA stands for Uniform Memory Access; NUMA stands for Non-uniform Memory Access.
2. In Uniform Memory Access, a single memory controller is used; in Non-uniform Memory Access, different (multiple) memory controllers are used.
3. Uniform Memory Access is slower; Non-uniform Memory Access is faster.
4. Uniform Memory Access has limited bandwidth; Non-uniform Memory Access has more bandwidth than Uniform Memory Access.
5. Uniform Memory Access is applicable for general-purpose applications and time-sharing applications; Non-uniform Memory Access is applicable for real-time applications and time-critical applications.
6. In Uniform Memory Access, memory access time is balanced (equal); in Non-uniform Memory Access, memory access time is not equal.
7. There are 3 types of buses used in Uniform Memory Access: single, multiple, and crossbar; in Non-uniform Memory Access, there are 2 types of buses used: tree and hierarchical.
8. Examples of UMA architecture: Sun Starfire servers, Compaq Alpha servers, HP V-series servers. Examples of NUMA architecture: Cray, TC-2000, BBN, and others.
CC-NUMA
• CC-NUMA (Cache-Coherent Non-Uniform Memory Access) is a
Multiprocessing architecture that combines the ease of access to a
single shared memory with the scalable performance of NUMA
architecture.
• A CC-NUMA system offers fast access to local memory on each node
(where there is a CPU and its associated memory) and slower access
to memory on remote nodes.
• Its main advantage is cache coherence, which automatically keeps
memory data synchronized across all processors
CC-NUMA
• Figure 20.12 depicts a typical CC-NUMA
organization.
• There are multiple independent nodes,
each of which is, in effect, an SMP
organization.
• Thus, each node contains multiple
processors, each with its own L1 and L2
caches, plus main memory.
• The node is the basic building block of
the overall CC-NUMA organization.
CC-NUMA
• Each node in the CC-NUMA system includes some
main memory.
• From the point of view of the processors, however,
there is only a single addressable memory, with each
location having a unique system-wide address.
• When a processor initiates a memory access, if the
requested memory location is not in that processor’s
cache, then the L2 cache initiates a fetch operation.
• If the desired line is in the local portion of the main
memory, the line is fetched across the local bus.
• If the desired line is in a remote portion of the main
memory, then an automatic request is sent out to fetch
that line across the interconnection network, deliver
it to the local bus, and then deliver it to the requesting
cache on that bus.
• All of this activity is automatic and transparent to
the processor and its cache.
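The fetch sequence just described can be summarized as a small C model. This is purely conceptual: real CC-NUMA hardware performs these steps transparently, and the address-to-node mapping and cache test below are invented for illustration.

```c
#include <stdio.h>

#define LOCAL_NODE  0
#define NODE_OF(a)  ((a) >> 20)           /* pretend the high bits select the node */
#define IN_CACHE(a) (((a) & 0xFF) == 0)   /* pretend some lines are already cached */

/* Returns where the requested line was found, mimicking the steps above. */
static const char *memory_access(unsigned long addr) {
    if (IN_CACHE(addr))
        return "cache hit (no bus traffic)";
    if (NODE_OF(addr) == LOCAL_NODE)
        return "fetched across the local bus";
    /* Remote line: an automatic request crosses the interconnection network;
       the line is delivered to the local bus, then to the requesting cache. */
    return "fetched across the interconnect from a remote node";
}

int main(void) {
    unsigned long addrs[] = { 0x00000000, 0x00000140, 0x00300140 };
    for (int i = 0; i < 3; i++)
        printf("0x%08lx -> %s\n", addrs[i], memory_access(addrs[i]));
    return 0;
}
```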
CC-NUMA
• A NUMA system without cache coherence is more or less equivalent to a cluster.
• The commercial products that have received much attention recently are CC-NUMA systems,
which are quite distinct from both SMPs and clusters.
Motivation
• The limit on the number of processors in an SMP is one of the driving motivations behind the
development of cluster systems.
• However, with a cluster, each node has its own private main memory.
• Applications do not see a large global memory. In effect, coherence is maintained in software
rather than hardware.
• This memory granularity affects performance and, to achieve maximum performance,
software must be tailored to this environment.
• One approach to achieving large-scale multiprocessing while retaining the flavor of SMP is
NUMA.
Benefits of CC-NUMA
• Scalability: Allows systems with a large number of processors to be
scalable.
• Performance: Provides good performance for workloads requiring a
high degree of concurrency and data processing.
• Ease of programming: Provides a transparent shared memory view,
where the hardware automatically manages cache coherence,
simplifying application development.
Disadvantages of CC-NUMA
1. Higher Cost:
• CC-NUMA systems require specialized hardware and software to manage the non-
uniform memory access, leading to increased development and manufacturing costs
compared to simpler UMA (Uniform Memory Access) systems.
• The cost of interconnects and memory controllers designed for NUMA architectures
also contributes to the overall higher price tag.
2. Complexity:
• Designing and implementing CC-NUMA systems is inherently more complex than
UMA systems. This complexity stems from the need to manage memory access
across multiple nodes and ensure cache coherence.
• Software development for NUMA architectures can also be challenging, requiring
careful consideration of memory placement and access patterns to optimize
performance.
3. Performance Variability:
• While CC-NUMA can offer significant performance advantages for workloads with
good data locality, it can also exhibit performance degradation if processors
frequently access memory on remote nodes.
• Remote memory access latency is typically higher than local memory access, and
contention on the interconnect can further impact performance when multiple
processors try to access remote memory simultaneously.
Disadvantages of CC-NUMA
4. Cache Coherence Overhead:
• Maintaining cache coherence across multiple nodes in a CC-NUMA system involves
significant overhead, including the need for inter-processor communication and
directory lookups.
• When multiple processors access the same memory location concurrently, this can
lead to cache invalidations and performance penalties (a sketch of a closely related
effect appears after this list).
5. Programming Challenges:
• Writing efficient code for CC-NUMA systems requires careful attention to data
placement and access patterns to minimize remote memory access.
• Operating systems and programming languages may provide tools and mechanisms
to help manage NUMA-related issues, but these often require specialized knowledge
and expertise.
6. Scalability Limitations:
• While CC-NUMA systems are scalable, the complexity of managing memory access
across a large number of nodes can become a limiting factor.
• The interconnect bandwidth and latency between nodes can become bottlenecks as
the system scales, impacting the performance of applications that require frequent
data sharing across multiple nodes.
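The cache coherence overhead of point 4 can be observed on any multicore machine with the classic false-sharing pattern sketched below: two threads update logically independent counters that happen to share one cache line, so the coherence protocol keeps bouncing the line between the two caches. A minimal sketch, assuming POSIX threads and 64-byte cache lines (compile with -pthread):

```c
#include <stdio.h>
#include <pthread.h>

#define ITERS 10000000L

/* Both counters sit in the same cache line, so each write by one thread
   invalidates the other thread's cached copy (false sharing), even though
   the threads never touch the same variable. */
struct counters {
    long a;
    long b;   /* adjacent to a: shares the cache line */
} shared;

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) shared.a++;
    return NULL;
}

static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) shared.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", shared.a, shared.b);
    return 0;
}
```

Padding each counter onto its own 64-byte line (or placing them in separate structures) removes the coherence traffic and typically speeds this loop up severalfold.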
Cluster- Distributed Memory
An important and relatively recent
development in computer system
design is clustering. Clustering is an
alternative to symmetric
multiprocessing as an approach to
providing high performance and
high availability, and is particularly
attractive for server applications.
Advantages
• Absolute scalability: It is possible to create large clusters that far surpass the power of even the
largest standalone machines. A cluster can have tens, hundreds, or even thousands of machines,
each of which is a multiprocessor.
• Incremental scalability: A cluster is configured in such a way that it is possible to add new
systems to the cluster in small increments. Thus, a user can start out with a modest system and
expand it as needs grow, without having to go through a major upgrade in which an existing small
system is replaced with a larger system.
• High availability: Because each node in a cluster is a standalone computer, the failure of one node
does not mean loss of service. In many products, fault tolerance is handled automatically in
software.
• Superior price/performance: By using commodity building blocks, it is possible to put together a
cluster with equal or greater computing power than a single large machine, at much lower cost.
Multicore
• A multicore processor, also known as a chip multiprocessor, combines
two or more processor units (called cores) on a single piece of silicon
(called a die).
• Typically, each core consists of all of the components of an
independent processor:
• Registers,
• ALU,
• Pipeline hardware,
• Control unit,
• Plus L1 instruction and data caches.
• In addition to the multiple cores, contemporary multicore chips also include L2 cache and,
increasingly, L3 cache.
• The most highly integrated multicore processors, known as systems on
chip (SoCs), also include memory and peripheral controllers.
Hardware Performance Issues
Increase in Parallelism and Complexity
• Pipelining: Individual instructions are executed through a pipeline of stages so
that while one instruction is executing in one stage of the pipeline, another
instruction is executing in another stage of the pipeline.
• Superscalar: Multiple pipelines are constructed by replicating execution
resources. This enables parallel execution of instructions in parallel pipelines,
so long as hazards are avoided.
• Simultaneous multithreading (SMT): Register banks are expanded so that
multiple threads can share the use of pipeline resources.
With each of these innovations, designers have over the years attempted to increase
the performance of the system by adding complexity.
There is a practical limit to how far this trend can be taken, because with more
stages, there is the need for more logic, more interconnections, and more control
signals.
Hardware Performance Issues
Power Consumption
To maintain the trend of higher performance,
power requirements have grown exponentially as
chip density and clock frequency have risen.
Power considerations provide another motive for
moving toward a multicore organization.
Pollack’s Rule
“Performance is roughly proportional to the square
root of the increase in complexity.”
For example, doubling the logic in a single core yields only about √2 ≈ 1.4×
the performance, while spending the same transistors on a second core can
approach 2× for parallel workloads.
Before 2005: performance improved mainly by increasing frequency and transistor count; power consumption increased with
frequency and transistor count until the early 2000s, after which power density issues forced designers to hold power levels steady.
After 2005: frequency and power hit physical limits, so designers turned to multicore architectures (increasing cores instead of
frequency); multicore takes advantage of chip density while avoiding high power density.
Software Performance Issues
• Even small amounts of serial code impact performance.
• According to Amdahl’s law:
Speedup = (time to execute program on a single processor) /
          (time to execute program on N parallel processors)
        = 1 / ((1 - f) + f/N)
• where f is the fraction of code that is infinitely parallelizable with no scheduling overhead
• and N is the number of parallel processors
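The formula is easy to check numerically; the short C sketch below (values chosen purely for illustration) shows how quickly the serial fraction caps the achievable speedup:

```c
#include <stdio.h>

/* Amdahl's law: speedup = 1 / ((1 - f) + f / N), where f is the
   parallelizable fraction of the code and N the number of processors. */
static double amdahl_speedup(double f, int n) {
    return 1.0 / ((1.0 - f) + f / n);
}

int main(void) {
    /* With f = 0.90, 8 processors give only ~4.7x, and even an enormous
       processor count cannot exceed 1 / (1 - f) = 10x. */
    int counts[] = {2, 4, 8, 16, 1024};
    for (int i = 0; i < 5; i++)
        printf("f = 0.90, N = %4d -> speedup = %.2f\n",
               counts[i], amdahl_speedup(0.90, counts[i]));
    return 0;
}
```

This is why even a small serial fraction limits what additional cores can deliver.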
Multicore Organization
• The organization of a multicore system is essentially
determined by three main parameters:
• The number of processor cores on the chip
• The number of levels of cache memory to be employed
• The amount of cache memory that is to be shared.
• Each parameter is further specified by its number, size, capacity,
capability, and placement, as well as the way in which it interacts
with the others; together these determine the type of
organization that the multicore system will have
Four general organizations of multicore systems
• Figure a illustrates an organization in which the
only on-chip cache is the L1 cache, which is
divided into instruction and data caches, with
each core having its own dedicated L1 cache.
• This type of organization is often found in some
of the earlier multicore computer chips and is
still seen in use in some embedded chips.
• An example of this organization is the ARM11
MPCore.
• The organization shown in Figure b
indicates that there is enough room
available on the chip to build a
separate dedicated unified on-chip L2
cache, in addition to the on-chip L1
split cache (I-cache and D-cache) in each
core.
• This type of multicore organization is found
in the AMD Opteron.
Four general organizations of multicore systems
• Figure c shows an arrangement
almost the same as that shown in
Figure b, with the difference
that the on-chip L2 cache in Figure c
is larger and is shared by all cores.
• The Intel Core Duo processor and the
UltraSPARC T2 have this organization.
• With the increasing potential of VLSI technology
that provides the system designer with an
abundance of hardware capabilities, the amount
of space now available in the chip as well as the
total transistor count obtainable on the chip
continues to grow.
• To improve performance further, a
separate shared unified L3 cache is used, apart from
the usual dedicated L1 and L2 caches for each core.
This organization is illustrated in
Figure 8.32d. The implementation of the Intel Core
i7 is an example of this organizational approach.
Case study
• Intel Core i3, i5, i7, i9 Processors (Multicore Computers)
• Main differences refer to:
• Performance/Heat
• Number of cores
• Maximum Main Memory/Cache Capacity
• Hyperthreading (yes/no)
• Turbo Boost (possibility of increasing frequency and heat)
• Built in Graphic Processor (yes/no)
• Price
• Cache Coherence
• Link:- https://youtu.be/r_ZE1XVT8Ao?si=Z8MU5DU84JN3LVpO