PARALLEL COMPUTING
MODULE-1
• Introduction to parallel programming
• Parallel hardware and parallel software
• Classifications of parallel computers
• SIMD systems
• MIMD systems
• Interconnection networks
• Cache coherence
• Shared-memory vs. distributed-memory
• Coordinating the processes/threads
• Shared-memory, Distributed-memory
Introduction
• From 1986 to 2003, microprocessor performance increased by over 50% per year, so users and
developers could simply wait for the next generation of hardware to get faster programs.
• Since 2003, single-processor performance improvement has slowed to less than 4% per year, a
dramatic change.
• This change is associated with a change in processor design: the major microprocessor
manufacturers have turned to parallelism.
• As a result, most serial programs run no faster on a system with multiple processors than they do
on a single processor.
• This raises several questions: why isn't single-processor performance good enough, why can't
microprocessor manufacturers keep developing faster single-processor systems, and why build
parallel systems and write parallel programs at all?
Why we need ever-increasing performance
The rise in computational power has revolutionized fields such as science, the Internet, and
entertainment, making possible advances like detailed medical imaging and fast web searches. At the
same time, ever-greater power expands the set of problems we can attack, such as climate modelling
and protein folding; with only limited computational power, our ability to study such complex
configurations is limited.
Why we’re building parallel systems
• The increasing density of transistors on integrated circuits drove a significant
increase in single-processor performance for decades. However, increasing the
speed of a circuit also increases its power consumption, and most of that power
is dissipated as heat; the difficulty of dissipating this heat makes it impossible
to keep increasing the speed of integrated circuits. To continue building ever
more powerful computers, the industry has instead adopted parallelism, putting
multiple, relatively simple, complete processors on a single chip. These are
known as multicore processors.
Why we need to write parallel
programs
• Most programs written for single-core systems cannot exploit the presence of multiple
cores, and simply running multiple instances of such a program on a multicore system rarely helps.
To make use of the additional cores, we need either to rewrite our serial programs so that they are
parallel, or to write translation programs that automatically convert serial programs into parallel
programs. Researchers have had only limited success writing programs that convert serial programs
in languages such as C, C++, and Java into parallel programs. Indeed, an efficient parallel
implementation of a serial program may require devising an entirely new algorithm.
How do we write parallel programs
• Parallel computing divides the work among the cores. There are two broad approaches: task-
parallelism, in which the different tasks of the overall problem are divided among the cores, and
data-parallelism, in which the data are divided among the cores and each core carries out roughly
the same operations on its share. In a global-sum example, data-parallelism has each core apply the
same additions to its assigned elements, while task-parallelism has one core carry out the distinct
task of receiving and adding the partial sums. Writing parallel programs resembles serial
programming, but it is more complex because of coordination issues such as communication, load
balancing, and synchronization. In practice, powerful parallel programs are written with explicit
parallel constructs and require careful attention to how the code executes on each core.
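• As a concrete illustration, here is a minimal C sketch of the global-sum example (the array size,
the number of cores, and the sample values are assumptions, and the cores are only simulated by
ordinary loops): each simulated core applies the same additions to its own block of data
(data-parallelism), and a single core then performs the separate task of adding up the partial sums
(task-parallelism).

    #include <stdio.h>

    #define N 16       /* number of values to sum (illustrative)     */
    #define CORES 4    /* number of cores sharing the work (assumed) */

    int main(void) {
        double x[N], partial[CORES], global_sum = 0.0;

        for (int i = 0; i < N; i++) x[i] = i + 1;   /* sample data */

        /* Data-parallelism: each "core" applies the same operation
           (adding) to its own block of the array.                   */
        for (int core = 0; core < CORES; core++) {
            int first = core * (N / CORES);
            int last  = first + (N / CORES);
            partial[core] = 0.0;
            for (int i = first; i < last; i++)
                partial[core] += x[i];
        }

        /* Task-parallelism: one core carries out the distinct task of
           receiving and adding the partial sums.                      */
        for (int core = 0; core < CORES; core++)
            global_sum += partial[core];

        printf("global sum = %f\n", global_sum);
        return 0;
    }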
What we’ll be doing
• This text teaches parallel programming in C using four APIs: MPI, Pthreads, OpenMP,
and CUDA. It classifies parallel systems by how the cores access memory and by how
independently the cores operate. In shared-memory systems the cores share access to the
computer's memory, while in distributed-memory systems each core has its own private
memory. The text also covers Single-Instruction Multiple-Data (SIMD) systems, which are
essential for modern GPUs. Different APIs target different system types: MPI for
distributed-memory MIMD systems, Pthreads for shared-memory MIMD systems, OpenMP
for both, and CUDA for Nvidia GPUs.
Concurrent, parallel, distributed
• In parallel computing, a program runs multiple tasks that cooperate closely and execute
simultaneously on cores that are physically close to each other and that either share the same
memory or are connected by a very high-speed network. In distributed computing, a program may
need to cooperate with other programs, often executed by computers separated by large distances,
to solve a problem; such programs are more loosely coupled. Both parallel and distributed programs
are concurrent, but so is, for example, a multitasking operating system. There is, however, no
general agreement on the precise boundaries between these terms.
Parallel hardware and parallel
software
The von Neumann architecture:
• The classical von Neumann architecture consists of main memory, a central processing unit (CPU), and an
interconnection between the memory and the CPU. The CPU is divided into a control unit and a datapath: the
control unit decides which instructions to execute, and the datapath executes the actual instructions. Data and
instructions are transferred between the CPU and memory via an interconnect, which has traditionally been a
bus. More recent systems use more complex interconnects. A von Neumann machine executes a single
instruction at a time, with each instruction operating on only a few pieces of data. The separation of memory and
CPU is known as the von Neumann bottleneck, as the interconnect determines the rate at which instructions and
data can be accessed. To address this bottleneck and improve computer performance, computer engineers and
scientists have experimented with modifications to the basic von Neumann architecture.
Processes, multitasking, and threads
• The operating system manages computer hardware and software resources,
determining program execution and memory allocation.
• When a program runs, the operating system creates a process consisting of the executable program,
blocks of memory (including the call stack and heap), descriptors of allocated resources, security
information, and information about the process state.
• Modern operating systems are multitasking, allowing simultaneous execution of
multiple programs.
• Threading divides a program into largely independent tasks that share the process's executable,
memory, and I/O devices. Threads are started (forked) and later terminated (joined) by the process,
and they are much lighter weight than processes.
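• As an illustration of forking and joining threads, here is a minimal Pthreads sketch in C (the
thread count of 4 and the Hello function are assumptions made for the example):

    #include <stdio.h>
    #include <pthread.h>

    #define THREAD_COUNT 4   /* assumed number of threads */

    /* The work each thread performs: print its rank and return. */
    void *Hello(void *rank) {
        long my_rank = (long) rank;
        printf("Hello from thread %ld\n", my_rank);
        return NULL;
    }

    int main(void) {
        pthread_t handles[THREAD_COUNT];

        /* The process forks the threads ...                         */
        for (long t = 0; t < THREAD_COUNT; t++)
            pthread_create(&handles[t], NULL, Hello, (void *) t);

        /* ... and later joins them, waiting for each one to finish. */
        for (long t = 0; t < THREAD_COUNT; t++)
            pthread_join(handles[t], NULL);

        return 0;
    }

  On most systems this would be compiled with a flag such as -pthread.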
Modifications to the von Neumann
model
• The basics of caching
• Caching addresses the von Neumann bottleneck by placing blocks of fast memory close to the CPU and
by using a wider interconnection that can transport more data or instructions in a single memory
access. Data are moved in cache blocks or cache lines, which typically store 8 to 16 times as much
information as a single memory location. When the CPU needs data or an instruction, it checks each
cache level in turn before going to main memory; finding the information in a cache is a cache hit,
and not finding it is a cache miss. Two basic approaches to keeping the cache and main memory
consistent on writes are write-through caches and write-back caches. Overall, caching lets the CPU
access data and instructions quickly while minimizing accesses to the much slower main memory.
Cache mappings
• Cache design must decide where a line from memory can be placed in the cache. The options range
from fully associative, where a line can go anywhere, to direct mapped, where each line has exactly
one possible location. Intermediate, n-way set-associative caches allow each line to be placed in any
of n locations, and when more than one line competes for the same location, a replacement policy
decides which cached line is evicted.
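• The placement decision can be made concrete with a small C sketch (the 64-byte line size, 256 sets,
and the sample address are assumptions chosen for illustration): the block number fixes the set, and
in a direct-mapped cache that set holds exactly one line, while in an n-way set-associative cache the
line may occupy any of the n slots in that set.

    #include <stdio.h>

    #define LINE_SIZE 64    /* bytes per cache line (assumed) */
    #define NUM_SETS  256   /* number of sets (assumed)       */

    int main(void) {
        unsigned long addr  = 0x12345678UL;      /* an arbitrary address         */
        unsigned long block = addr / LINE_SIZE;  /* which memory block it is in  */
        unsigned long set   = block % NUM_SETS;  /* the set that block maps to   */

        printf("address 0x%lx -> block %lu -> set %lu\n", addr, block, set);
        return 0;
    }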
Caches and programs: an example
• The CPU cache is controlled by the system hardware, so programmers cannot manage it directly, but
they can influence it indirectly by writing code with good spatial and temporal locality. Two-
dimensional arrays in C are stored in row-major order, so the first pair of nested loops in the
example generally performs far better than the second (for instance, when the code is run on a
system with MAX = 1000), because the first pair accesses the array elements in contiguous blocks.
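• A sketch of the kind of loop pair this example describes is shown below (the matrix-vector form of
the loop body and the initial values are assumptions; the point is only the order in which the
elements of A are visited):

    #include <stdio.h>

    #define MAX 1000   /* matches the MAX = 1000 mentioned above */

    double A[MAX][MAX], x[MAX], y[MAX];

    int main(void) {
        for (int i = 0; i < MAX; i++) {            /* arbitrary initialization */
            x[i] = i; y[i] = 0.0;
            for (int j = 0; j < MAX; j++) A[i][j] = i + j;
        }

        /* First pair of loops: A is traversed row by row, so successive
           accesses fall in the same cache line (good spatial locality).  */
        for (int i = 0; i < MAX; i++)
            for (int j = 0; j < MAX; j++)
                y[i] += A[i][j] * x[j];

        /* Second pair of loops: A is traversed column by column, so each
           access jumps a whole row ahead and frequently misses the cache. */
        for (int j = 0; j < MAX; j++)
            for (int i = 0; i < MAX; i++)
                y[i] += A[i][j] * x[j];

        printf("y[0] = %f\n", y[0]);
        return 0;
    }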
Virtual memory
• Caches make instructions and data in main memory quickly accessible, but main memory may not be
able to hold all of the instructions and data of large programs or programs with very large data
sets. Virtual memory was developed so that main memory can act as a cache for secondary storage:
only the active parts of running programs are kept in main memory, while idle parts are kept in a
region of disk called swap space. Programs work with virtual page numbers, and a page table maps
these to physical addresses. A translation-lookaside buffer (TLB) caches a small number of
page-table entries in very fast memory to speed up the translation.
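• The address translation described above can be sketched in a few lines of C (the 4 KiB page size
and the sample address are assumptions): the virtual address is split into a virtual page number,
which the page table or TLB maps to a physical frame, and an offset within the page.

    #include <stdio.h>

    #define PAGE_SIZE 4096UL   /* 4 KiB pages (assumed) */

    int main(void) {
        unsigned long virt   = 0x00403a2cUL;      /* an arbitrary virtual address */
        unsigned long vpn    = virt / PAGE_SIZE;  /* virtual page number          */
        unsigned long offset = virt % PAGE_SIZE;  /* byte offset within the page  */

        /* The page table (or the TLB, if the entry is cached there) would map
           vpn to a physical frame number; here we only show the split.         */
        printf("virtual 0x%lx -> page %lu, offset %lu\n", virt, vpn, offset);
        return 0;
    }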
Instruction-level parallelism
• Instruction-level parallelism (ILP) improves processor performance by having multiple processor
components, or functional units, execute instructions simultaneously. There are two main
approaches: pipelining, in which functional units are arranged in stages so that different
instructions occupy different stages at the same time, and multiple issue, in which several
instructions can be initiated simultaneously. Pipelining reduces overall execution time but can
stall when one instruction must wait for the result of another. Multiple-issue processors
replicate functional units and may use speculation, which has consequences for shared-memory
programming.
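• One way to see the effect of ILP at the source level is the following C sketch (an illustrative
transformation, not taken from the text): the first loop forms one long chain of dependent
additions, while the second keeps two independent partial sums, giving pipelined and multiple-issue
hardware more instructions that it can execute at the same time.

    #include <stdio.h>

    /* Sum with a single accumulator: every addition depends on the
       previous one, which limits how much the hardware can overlap.  */
    double sum_single(const double a[], int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Sum with two independent accumulators: the two additions in the
       loop body are independent, so separate functional units (or
       pipeline stages) can work on them simultaneously.               */
    double sum_pair(const double a[], int n) {
        double s0 = 0.0, s1 = 0.0;
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            s0 += a[i];
            s1 += a[i + 1];
        }
        if (i < n) s0 += a[i];      /* leftover element if n is odd */
        return s0 + s1;
    }

    int main(void) {
        double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("%f %f\n", sum_single(a, 8), sum_pair(a, 8));
        return 0;
    }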
Hardware multithreading
• ILP can be difficult to exploit because many programs contain long sequences of dependent
statements. Thread-level parallelism (TLP) instead provides coarser-grained parallelism by
executing different threads simultaneously. Hardware multithreading lets the system continue
doing useful work when the currently executing task stalls, but it requires that switching
between threads be very fast. Simultaneous multithreading (SMT) is a variation that exploits
superscalar processors by letting multiple threads use their multiple functional units.
Parallel hardware
• Multiple issue and pipelining could clearly be considered parallel hardware, since they allow
different functional units to execute simultaneously. However, because this parallelism is
generally not visible to the programmer, they are treated as extensions of the basic von Neumann
model, and in what follows parallel hardware is limited to hardware whose parallelism is visible
to the programmer.
Classifications of parallel computers
• This text discusses two classifications of parallel computers. Flynn's taxonomy classifies
computers by the number of instruction streams and data streams they can manage simultaneously,
and so distinguishes systems that support a single instruction stream (SIMD) from systems that
support multiple instruction streams (MIMD). The second classification distinguishes shared-memory
from distributed-memory systems. Together the two classifications describe how the cores access
memory and how they coordinate their work.
SIMD systems
• SIMD systems are parallel systems that operate on multiple data streams by applying the same
instruction to multiple data items. Conceptually they have a single control unit and multiple
datapaths; each datapath either applies the current instruction to its data item or is idle. SIMD
systems are ideal for parallelizing simple loops that operate on large arrays of data, but they
often cope poorly with other kinds of parallel problems. The parallelism obtained this way, by
dividing the data among the processors and applying the same instructions to each subset, is
data-parallelism. SIMD systems have a long history: Thinking Machines was one of the largest
manufacturers of parallel supercomputers in the early 1990s, and more recently GPUs and desktop
CPUs make use of aspects of SIMD computing.
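• The kind of simple loop that suits SIMD hardware can be sketched as follows (the function name
and sample arrays are illustrative): the same addition is applied to every element, so on a SIMD
system each datapath would handle a different element, and any leftover datapaths would simply be
idle.

    #include <stdio.h>

    /* A SIMD-friendly loop: one instruction (an addition) applied to
       many data items.  A vectorizing compiler or SIMD hardware can
       carry out several of these additions at once.                  */
    void vector_add(double x[], const double y[], int n) {
        for (int i = 0; i < n; i++)
            x[i] += y[i];        /* same operation on every element */
    }

    int main(void) {
        double x[4] = {1, 2, 3, 4}, y[4] = {10, 20, 30, 40};
        vector_add(x, y, 4);
        printf("%f %f %f %f\n", x[0], x[1], x[2], x[3]);
        return 0;
    }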
Vector processors
• Vector processors operate on arrays, or vectors, of data, unlike conventional CPUs, which operate
on individual data items. They are fast, relatively easy to program, and have high memory
bandwidth. However, they do not handle irregular data structures well, and there is a finite limit
to their scalability; current systems scale by increasing the number of vector processors rather
than the vector length.
Graphics processing units
• Real-time graphics APIs represent an object's surface using points, lines, and triangles, which
the GPU converts into pixels. GPUs optimize performance by using SIMD parallelism, hardware
multithreading, and very high rates of data movement. They can use shared or distributed memory,
and they have become popular for general high-performance computing, with several languages and
libraries developed to exploit their power.
MIMD systems
• MIMD systems support multiple simultaneous instruction streams operating on
multiple data streams. They consist of independent processing units or cores, each
with its own control unit and datapath. MIMD systems are usually asynchronous,
with no global clock and no relation between system times on different processors.
There are two main types: shared-memory systems and distributed-memory
systems. Shared-memory systems connect autonomous processors to a memory
system via an interconnection network, while distributed-memory systems pair
processors with their own private memory.
Shared-memory systems
• Most shared-memory systems use one or more multicore processors, with multiple CPUs or cores on
a single chip. In some systems all cores connect to main memory through the same interconnect
(uniform memory access, UMA), while in others each processor has a direct connection to its own
block of main memory (nonuniform memory access, NUMA). UMA systems are easier to program because
access times to all memory locations are the same for all cores, while NUMA systems offer faster
access to the directly connected memory and can support larger total amounts of memory.
Distributed-memory systems
• The most widely available distributed-memory systems are clusters, composed of commodity systems
connected by a commodity interconnection network. The nodes of a cluster are often themselves
shared-memory systems with multicore processors, so clusters are sometimes called hybrid systems.
The grid provides the infrastructure needed to turn large networks of geographically distributed
computers into a unified distributed-memory system; such a system is typically heterogeneous.
Interconnection networks
• Interconnects play a decisive role in the performance of both distributed- and shared-memory
systems: even with fast processors and memory, a slow interconnect will significantly degrade the
performance of most parallel programs. Although some interconnects share similarities, it is
useful to treat shared-memory and distributed-memory interconnects separately.
Shared-memory interconnects
• Shared-memory systems traditionally used buses to connect processors and memory, but as the
number of devices connected to a bus increases, contention for the bus grows and performance
drops. Crossbars are faster than buses and allow simultaneous communication among different
devices, but they are considerably more expensive.
Distributed-memory interconnects
• Distributed-memory interconnects are divided into direct and indirect interconnects. In a direct
interconnect, each switch is directly connected to a processor-memory pair, and the switches are
connected to each other; rings and toroidal meshes are common examples. A ring allows several
communications to proceed simultaneously, but some processors may still have to wait for others.
A toroidal mesh has more links than a ring, so it supports more simultaneous communications, but
it is also more expensive.
Cache coherence
• CPU caches are managed by the system hardware, so programmers have no direct control over them.
In shared-memory systems this leads to the cache coherence problem: whether the caches use
write-through or write-back policies, programs can behave unpredictably, because the caching
designed for single-processor systems provides no mechanism to ensure that when one processor
updates a shared variable, the copies cached by other processors are also updated. For example,
if core 0 assigns a new value to a shared variable x, core 1 may continue to use a stale copy of
x from its own cache.
Snooping cache coherence
• Snooping cache coherence and directory-based cache coherence are the two main approaches to
ensuring cache coherence. Snooping comes from bus-based systems: when the cores share a bus, any
signal transmitted on the bus can be seen by all of the cores connected to it, so when one core
updates its copy of x, the other cores can be notified that their cache lines containing x are no
longer valid. Snooping works with both write-through and write-back caches; with write-through
caches the updates already appear on the bus, so no additional traffic is needed, while write-back
caches must send an extra notification.
Directory-based cache coherence
• Snooping cache coherence does not scale to large systems, because a broadcast is required every
time a cached variable is updated and broadcasts are expensive on large networks. Directory-based
cache coherence protocols address this problem by using a data structure called a directory that
stores the status of each cache line. The directory is typically distributed, with each
core/memory pair storing the part of the directory that describes the cache lines in its local
memory. When a variable is updated, only the cores whose caches actually store that variable's
cache line need to be contacted.
Shared-memory vs. distributed-
memory
• The largest parallel systems are usually not shared-memory systems, largely because of the
hardware cost of scaling shared-memory interconnects: buses are suitable only for systems with a
few processors because of conflicts over access, and crossbars become very expensive.
Distributed-memory interconnects such as the hypercube and the toroidal mesh are relatively
inexpensive at large scale, so distributed-memory systems are often better suited to problems
requiring vast amounts of data or computation.
Parallel software
• Parallel hardware is now ubiquitous in desktop and server systems, mobile phones, and tablets.
Parallel software has been slower to mature: although much system software and many popular
application programs now make some use of multiple cores, hardware and compilers alone cannot
consistently increase the performance of an application. To continue improving performance within
reasonable power limits, developers must learn to write applications that exploit shared- and
distributed-memory architectures and MIMD and SIMD systems, which requires understanding the
terminology and techniques used in parallel systems.
Shared-memory
• Shared-memory programs use shared and private variables: shared variables can be accessed by any
thread, while private variables can normally be accessed by only one thread, so communication is
usually done implicitly through shared variables. Many shared-memory programs use dynamic threads,
in which a master thread waits for work requests and forks worker threads that terminate when they
finish; this makes efficient use of system resources. The alternative is the static thread
paradigm, in which all threads are forked after any setup by the master thread and run until all
the work is completed. This can be less efficient in its use of resources, but it avoids the
overhead of repeatedly creating and destroying threads and is closer to the most widely used
paradigm for distributed-memory programming.
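• A minimal OpenMP sketch in C of shared versus private variables (the thread count of 4 and the
variable names are assumptions): shared_count is shared by all threads, my_rank is private to each
thread, and updates of the shared variable must be coordinated explicitly.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int shared_count = 0;                     /* shared: one copy for all threads */

        #pragma omp parallel num_threads(4)
        {
            int my_rank = omp_get_thread_num();   /* private: one copy per thread */

            /* Updates to a shared variable must be coordinated explicitly. */
            #pragma omp critical
            shared_count++;

            printf("thread %d has incremented shared_count\n", my_rank);
        }

        printf("final shared_count = %d\n", shared_count);
        return 0;
    }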
Distributed-memory
• The most widely used APIs for distributed-memory parallel programming are based on
message-passing, in which each process can directly access only its own private memory; such APIs
can also be used on shared-memory hardware. A message-passing program is executed by starting
multiple processes, which identify each other by ranks. The API provides send and receive
functions, and a function such as Get_rank returns the calling process's rank so that processes
can branch and do different work depending on their ranks. Message-passing is powerful but
low-level: the programmer must manage the details of every communication, and data structures that
are convenient in serial programs may be prohibitively expensive to use across processes and have
to be redesigned or distributed.
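• A minimal MPI sketch in C of this style of program (the message text and buffer size are
illustrative, and MPI_Comm_rank plays the role of the Get_rank function mentioned above): every
process finds its rank, the nonzero ranks send a message, and rank 0 branches differently,
receiving and printing the messages.

    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int my_rank, comm_sz;
        char msg[100];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);   /* this process's rank        */
        MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);   /* total number of processes  */

        if (my_rank != 0) {
            /* Every nonzero rank sends a message to rank 0. */
            sprintf(msg, "Greetings from process %d of %d", my_rank, comm_sz);
            MPI_Send(msg, strlen(msg) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        } else {
            /* Rank 0 branches differently: it receives and prints the messages. */
            printf("Greetings from process %d of %d\n", my_rank, comm_sz);
            for (int src = 1; src < comm_sz; src++) {
                MPI_Recv(msg, 100, MPI_CHAR, src, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("%s\n", msg);
            }
        }

        MPI_Finalize();
        return 0;
    }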