PARALLEL COMPUTING
MODULE-1
• Introduction to parallel programming
• Parallel hardware and parallel software
• Classifications of parallel computers
• SIMD systems
• MIMD systems
• Interconnection networks
• Cache coherence
• Shared-memory vs. distributed-memory
• Coordinating the processes/threads
• Shared-memory, Distributed-memory
Introduction
• From 1986 to 2003, microprocessor performance increased by over 50% per year, so users and
developers could simply wait for the next generation of hardware to get faster programs.
• Since 2003, single-processor performance improvement has slowed to less than 4% per year, a
dramatic change.
• This change is associated with a change in processor design: the major microprocessor
manufacturers have turned to parallelism.
• As a result, most serial programs run no faster on a system with multiple processors than they do
on a single processor.
• This raises several questions: why isn't single-processor performance good enough, why can't
microprocessor manufacturers keep developing faster single-processor systems, and why build
parallel systems and write parallel programs at all?
Why we need ever-increasing performance
The rise in computational power has revolutionized fields such as science, the Internet, and
entertainment, making possible advances like detailed medical imaging and fast web searches. At the
same time, ever-greater power expands the set of problems we can attack, such as climate modelling
and protein folding; with only limited computational power, our ability to study such complex
configurations is limited.
Why we’re building parallel systems
• The increasing density of transistors on integrated circuits drove a significant
increase in single-processor performance for decades. However, increasing the
speed of a circuit also increases its power consumption, and most of that power
is dissipated as heat; the difficulty of dissipating this heat makes it impossible
to keep increasing the speed of integrated circuits. To continue building ever
more powerful computers, the industry has instead adopted parallelism, putting
multiple, relatively simple, complete processors on a single chip. These are
known as multicore processors.
Why we need to write parallel
programs
• Most programs written for single-core systems cannot exploit the presence of multiple
cores, and simply running multiple instances of such a program on a multicore system rarely helps.
To make use of the additional cores, we need either to rewrite our serial programs so that they are
parallel, or to write translation programs that automatically convert serial programs into parallel
programs. Researchers have had only limited success writing programs that convert serial programs
in languages such as C, C++, and Java into parallel programs. Indeed, an efficient parallel
implementation of a serial program may require devising an entirely new algorithm.
How do we write parallel programs
• Parallel computing divides the work among the cores. There are two broad approaches: task-
parallelism, in which the different tasks of the overall problem are divided among the cores, and
data-parallelism, in which the data are divided among the cores and each core carries out roughly
the same operations on its share. In a global-sum example, data-parallelism has each core apply the
same additions to its assigned elements, while task-parallelism has one core carry out the distinct
task of receiving and adding the partial sums. Writing parallel programs resembles serial
programming, but it is more complex because of coordination issues such as communication, load
balancing, and synchronization. In practice, powerful parallel programs are written with explicit
parallel constructs and require careful attention to how the code executes on each core.
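• As a concrete illustration, here is a minimal C sketch of the global-sum example (the array size,
the number of cores, and the sample values are assumptions, and the cores are only simulated by
ordinary loops): each simulated core applies the same additions to its own block of data
(data-parallelism), and a single core then performs the separate task of adding up the partial sums
(task-parallelism).

    #include <stdio.h>

    #define N 16       /* number of values to sum (illustrative)     */
    #define CORES 4    /* number of cores sharing the work (assumed) */

    int main(void) {
        double x[N], partial[CORES], global_sum = 0.0;

        for (int i = 0; i < N; i++) x[i] = i + 1;   /* sample data */

        /* Data-parallelism: each "core" applies the same operation
           (adding) to its own block of the array.                   */
        for (int core = 0; core < CORES; core++) {
            int first = core * (N / CORES);
            int last  = first + (N / CORES);
            partial[core] = 0.0;
            for (int i = first; i < last; i++)
                partial[core] += x[i];
        }

        /* Task-parallelism: one core carries out the distinct task of
           receiving and adding the partial sums.                      */
        for (int core = 0; core < CORES; core++)
            global_sum += partial[core];

        printf("global sum = %f\n", global_sum);
        return 0;
    }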
What we’ll be doing
• This text teaches parallel programming in C using four APIs: MPI, Pthreads, OpenMP,
and CUDA. It classifies parallel systems by how the cores access memory and by how
independently the cores operate. In shared-memory systems the cores share access to the
computer's memory, while in distributed-memory systems each core has its own private
memory. The text also covers Single-Instruction Multiple-Data (SIMD) systems, which are
essential for modern GPUs. Different APIs target different system types: MPI for
distributed-memory MIMD systems, Pthreads for shared-memory MIMD systems, OpenMP
for both, and CUDA for Nvidia GPUs.
Concurrent, parallel, distributed
• In parallel computing, a program runs multiple tasks that cooperate closely and execute
simultaneously on cores that are physically close to each other and that either share the same
memory or are connected by a very high-speed network. In distributed computing, a program may
need to cooperate with other programs, often executed by computers separated by large distances,
to solve a problem; such programs are more loosely coupled. Both parallel and distributed programs
are concurrent, but so is, for example, a multitasking operating system. There is, however, no
general agreement on the precise boundaries between these terms.
Parallel hardware and parallel
software
The von Neumann architecture:
• The classical von Neumann architecture consists of main memory, a central processing unit (CPU), and an
interconnection between the memory and the CPU. The CPU is divided into a control unit and a datapath: the
control unit decides which instructions to execute, and the datapath executes the actual instructions. Data and
instructions are transferred between the CPU and memory via an interconnect, which has traditionally been a
bus. More recent systems use more complex interconnects. A von Neumann machine executes a single
instruction at a time, with each instruction operating on only a few pieces of data. The separation of memory and
CPU is known as the von Neumann bottleneck, as the interconnect determines the rate at which instructions and
data can be accessed. To address this bottleneck and improve computer performance, computer engineers and
scientists have experimented with modifications to the basic von Neumann architecture.
Processes, multitasking, and threads
• The operating system manages computer hardware and software resources,
determining program execution and memory allocation.
• When a program runs, the operating system creates a process consisting of the executable program,
blocks of memory (including the call stack and heap), descriptors of allocated resources, security
information, and information about the process state.
• Modern operating systems are multitasking, allowing simultaneous execution of
multiple programs.
• Threading divides a program into largely independent tasks that share the process's executable,
memory, and I/O devices. Threads are started (forked) and later terminated (joined) by the process,
and they are much lighter weight than processes.
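• As an illustration of forking and joining threads, here is a minimal Pthreads sketch in C (the
thread count of 4 and the Hello function are assumptions made for the example):

    #include <stdio.h>
    #include <pthread.h>

    #define THREAD_COUNT 4   /* assumed number of threads */

    /* The work each thread performs: print its rank and return. */
    void *Hello(void *rank) {
        long my_rank = (long) rank;
        printf("Hello from thread %ld\n", my_rank);
        return NULL;
    }

    int main(void) {
        pthread_t handles[THREAD_COUNT];

        /* The process forks the threads ...                         */
        for (long t = 0; t < THREAD_COUNT; t++)
            pthread_create(&handles[t], NULL, Hello, (void *) t);

        /* ... and later joins them, waiting for each one to finish. */
        for (long t = 0; t < THREAD_COUNT; t++)
            pthread_join(handles[t], NULL);

        return 0;
    }

  On most systems this would be compiled with a flag such as -pthread.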
Modifications to the von Neumann
model
• The basics of caching
• Caching addresses the von Neumann bottleneck by placing blocks of fast memory close to the CPU and
by using a wider interconnection that can transport more data or instructions in a single memory
access. Data are moved in cache blocks or cache lines, which typically store 8 to 16 times as much
information as a single memory location. When the CPU needs data or an instruction, it checks each
cache level in turn before going to main memory; finding the information in a cache is a cache hit,
and not finding it is a cache miss. Two basic approaches to keeping the cache and main memory
consistent on writes are write-through caches and write-back caches. Overall, caching lets the CPU
access data and instructions quickly while minimizing accesses to the much slower main memory.
Cache mappings
• Cache design must decide where a line from memory can be placed in the cache. The options range
from fully associative, where a line can go anywhere, to direct mapped, where each line has exactly
one possible location. Intermediate, n-way set-associative caches allow each line to be placed in any
of n locations, and when more than one line competes for the same location, a replacement policy
decides which cached line is evicted.
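• The placement decision can be made concrete with a small C sketch (the 64-byte line size, 256 sets,
and the sample address are assumptions chosen for illustration): the block number fixes the set, and
in a direct-mapped cache that set holds exactly one line, while in an n-way set-associative cache the
line may occupy any of the n slots in that set.

    #include <stdio.h>

    #define LINE_SIZE 64    /* bytes per cache line (assumed) */
    #define NUM_SETS  256   /* number of sets (assumed)       */

    int main(void) {
        unsigned long addr  = 0x12345678UL;      /* an arbitrary address         */
        unsigned long block = addr / LINE_SIZE;  /* which memory block it is in  */
        unsigned long set   = block % NUM_SETS;  /* the set that block maps to   */

        printf("address 0x%lx -> block %lu -> set %lu\n", addr, block, set);
        return 0;
    }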
Caches and programs: an example
• The CPU cache is controlled by the system hardware, so programmers cannot manage it directly, but
they can influence it indirectly by writing code with good spatial and temporal locality. Two-
dimensional arrays in C are stored in row-major order, so the first pair of nested loops in the
example generally performs far better than the second (for instance, when the code is run on a
system with MAX = 1000), because the first pair accesses the array elements in contiguous blocks.
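• A sketch of the kind of loop pair this example describes is shown below (the matrix-vector form of
the loop body and the initial values are assumptions; the point is only the order in which the
elements of A are visited):

    #include <stdio.h>

    #define MAX 1000   /* matches the MAX = 1000 mentioned above */

    double A[MAX][MAX], x[MAX], y[MAX];

    int main(void) {
        for (int i = 0; i < MAX; i++) {            /* arbitrary initialization */
            x[i] = i; y[i] = 0.0;
            for (int j = 0; j < MAX; j++) A[i][j] = i + j;
        }

        /* First pair of loops: A is traversed row by row, so successive
           accesses fall in the same cache line (good spatial locality).  */
        for (int i = 0; i < MAX; i++)
            for (int j = 0; j < MAX; j++)
                y[i] += A[i][j] * x[j];

        /* Second pair of loops: A is traversed column by column, so each
           access jumps a whole row ahead and frequently misses the cache. */
        for (int j = 0; j < MAX; j++)
            for (int i = 0; i < MAX; i++)
                y[i] += A[i][j] * x[j];

        printf("y[0] = %f\n", y[0]);
        return 0;
    }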
Virtual memory
• Caches make instructions and data in main memory quickly accessible, but main memory may not be
able to hold all of the instructions and data of large programs or programs with very large data
sets. Virtual memory was developed so that main memory can act as a cache for secondary storage:
only the active parts of running programs are kept in main memory, while idle parts are kept in a
region of disk called swap space. Programs work with virtual page numbers, and a page table maps
these to physical addresses. A translation-lookaside buffer (TLB) caches a small number of
page-table entries in very fast memory to speed up the translation.
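• The address translation described above can be sketched in a few lines of C (the 4 KiB page size
and the sample address are assumptions): the virtual address is split into a virtual page number,
which the page table or TLB maps to a physical frame, and an offset within the page.

    #include <stdio.h>

    #define PAGE_SIZE 4096UL   /* 4 KiB pages (assumed) */

    int main(void) {
        unsigned long virt   = 0x00403a2cUL;      /* an arbitrary virtual address */
        unsigned long vpn    = virt / PAGE_SIZE;  /* virtual page number          */
        unsigned long offset = virt % PAGE_SIZE;  /* byte offset within the page  */

        /* The page table (or the TLB, if the entry is cached there) would map
           vpn to a physical frame number; here we only show the split.         */
        printf("virtual 0x%lx -> page %lu, offset %lu\n", virt, vpn, offset);
        return 0;
    }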
Instruction-level parallelism
• Instruction-level parallelism (ILP) improves processor performance by having multiple processor
components, or functional units, execute instructions simultaneously. There are two main
approaches: pipelining, in which functional units are arranged in stages so that different
instructions occupy different stages at the same time, and multiple issue, in which several
instructions can be initiated simultaneously. Pipelining reduces overall execution time but can
stall when one instruction must wait for the result of another. Multiple-issue processors
replicate functional units and may use speculation, which has consequences for shared-memory
programming.
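• One way to see the effect of ILP at the source level is the following C sketch (an illustrative
transformation, not taken from the text): the first loop forms one long chain of dependent
additions, while the second keeps two independent partial sums, giving pipelined and multiple-issue
hardware more instructions that it can execute at the same time.

    #include <stdio.h>

    /* Sum with a single accumulator: every addition depends on the
       previous one, which limits how much the hardware can overlap.  */
    double sum_single(const double a[], int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Sum with two independent accumulators: the two additions in the
       loop body are independent, so separate functional units (or
       pipeline stages) can work on them simultaneously.               */
    double sum_pair(const double a[], int n) {
        double s0 = 0.0, s1 = 0.0;
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            s0 += a[i];
            s1 += a[i + 1];
        }
        if (i < n) s0 += a[i];      /* leftover element if n is odd */
        return s0 + s1;
    }

    int main(void) {
        double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("%f %f\n", sum_single(a, 8), sum_pair(a, 8));
        return 0;
    }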
Hardware multithreading
• ILP can be difficult to exploit because many programs contain long sequences of dependent
statements. Thread-level parallelism (TLP) instead provides coarser-grained parallelism by
executing different threads simultaneously. Hardware multithreading lets the system continue
doing useful work when the currently executing task stalls, but it requires that switching
between threads be very fast. Simultaneous multithreading (SMT) is a variation that exploits
superscalar processors by letting multiple threads use their multiple functional units.
Parallel hardware
• Multiple issue and pipelining could clearly be considered parallel hardware, since they allow
different functional units to execute simultaneously. However, because this parallelism is
generally not visible to the programmer, they are treated as extensions of the basic von Neumann
model, and in what follows parallel hardware is limited to hardware whose parallelism is visible
to the programmer.
Classifications of parallel computers
• This text discusses two classifications of parallel computers. Flynn's taxonomy classifies
computers by the number of instruction streams and data streams they can manage simultaneously,
and so distinguishes systems that support a single instruction stream (SIMD) from systems that
support multiple instruction streams (MIMD). The second classification distinguishes shared-memory
from distributed-memory systems. Together the two classifications describe how the cores access
memory and how they coordinate their work.
SIMD systems
• SIMD systems are parallel systems that operate on multiple data streams by applying the same
instruction to multiple data items. Conceptually they have a single control unit and multiple
datapaths; each datapath either applies the current instruction to its data item or is idle. SIMD
systems are ideal for parallelizing simple loops that operate on large arrays of data, but they
often cope poorly with other kinds of parallel problems. The parallelism obtained this way, by
dividing the data among the processors and applying the same instructions to each subset, is
data-parallelism. SIMD systems have a long history: Thinking Machines was one of the largest
manufacturers of parallel supercomputers in the early 1990s, and more recently GPUs and desktop
CPUs make use of aspects of SIMD computing.
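• The kind of simple loop that suits SIMD hardware can be sketched as follows (the function name
and sample arrays are illustrative): the same addition is applied to every element, so on a SIMD
system each datapath would handle a different element, and any leftover datapaths would simply be
idle.

    #include <stdio.h>

    /* A SIMD-friendly loop: one instruction (an addition) applied to
       many data items.  A vectorizing compiler or SIMD hardware can
       carry out several of these additions at once.                  */
    void vector_add(double x[], const double y[], int n) {
        for (int i = 0; i < n; i++)
            x[i] += y[i];        /* same operation on every element */
    }

    int main(void) {
        double x[4] = {1, 2, 3, 4}, y[4] = {10, 20, 30, 40};
        vector_add(x, y, 4);
        printf("%f %f %f %f\n", x[0], x[1], x[2], x[3]);
        return 0;
    }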
Vector processors
• Vector processors operate on arrays, or vectors, of data, unlike conventional CPUs, which operate
on individual data items. They are fast, relatively easy to program, and have high memory
bandwidth. However, they do not handle irregular data structures well, and there is a finite limit
to their scalability; current systems scale by increasing the number of vector processors rather
than the vector length.
Graphics processing units
• Real-time graphics APIs represent an object's surface using points, lines, and triangles, which
the GPU converts into pixels. GPUs optimize performance by using SIMD parallelism, hardware
multithreading, and very high rates of data movement. They can use shared or distributed memory,
and they have become popular for general high-performance computing, with several languages and
libraries developed to exploit their power.
MIMD systems
• MIMD systems support multiple simultaneous instruction streams operating on
multiple data streams. They consist of independent processing units or cores, each
with its own control unit and datapath. MIMD systems are usually asynchronous,
with no global clock and no relation between system times on different processors.
There are two main types: shared-memory systems and distributed-memory
systems. Shared-memory systems connect autonomous processors to a memory
system via an interconnection network, while distributed-memory systems pair
processors with their own private memory.
Shared-memory systems
• Most shared-memory systems use one or more multicore processors, with multiple CPUs or cores on
a single chip. In some systems all cores connect to main memory through the same interconnect
(uniform memory access, UMA), while in others each processor has a direct connection to its own
block of main memory (nonuniform memory access, NUMA). UMA systems are easier to program because
access times to all memory locations are the same for all cores, while NUMA systems offer faster
access to the directly connected memory and can support larger total amounts of memory.
Distributed-memory systems
• The most widely available distributed-memory systems are clusters, composed of commodity systems
connected by a commodity interconnection network. The nodes of a cluster are often themselves
shared-memory systems with multicore processors, so clusters are sometimes called hybrid systems.
The grid provides the infrastructure needed to turn large networks of geographically distributed
computers into a unified distributed-memory system; such a system is typically heterogeneous.
Interconnection networks
• Interconnects play a decisive role in the performance of both distributed- and shared-memory
systems: even with fast processors and memory, a slow interconnect will significantly degrade the
performance of most parallel programs. Although some interconnects share similarities, it is
useful to treat shared-memory and distributed-memory interconnects separately.
Shared-memory interconnects
• Shared-memory systems traditionally used buses to connect processors and memory, but as the
number of devices connected to a bus increases, contention for the bus grows and performance
drops. Crossbars are faster than buses and allow simultaneous communication among different
devices, but they are considerably more expensive.
Distributed-memory interconnects
• Distributed-memory interconnects are divided into direct and indirect interconnects. In a direct
interconnect, each switch is directly connected to a processor-memory pair, and the switches are
connected to each other; rings and toroidal meshes are common examples. A ring allows several
communications to proceed simultaneously, but some processors may still have to wait for others.
A toroidal mesh has more links than a ring, so it supports more simultaneous communications, but
it is also more expensive.
Cache coherence
• CPU caches are managed by the system hardware, so programmers have no direct control over them.
In shared-memory systems this leads to the cache coherence problem: whether the caches use
write-through or write-back policies, programs can behave unpredictably, because the caching
designed for single-processor systems provides no mechanism to ensure that when one processor
updates a shared variable, the copies cached by other processors are also updated. For example,
if core 0 assigns a new value to a shared variable x, core 1 may continue to use a stale copy of
x from its own cache.
Snooping cache coherence
• Snooping cache coherence and directory-based cache coherence are the two main approaches to
ensuring cache coherence. Snooping comes from bus-based systems: when the cores share a bus, any
signal transmitted on the bus can be seen by all of the cores connected to it, so when one core
updates its copy of x, the other cores can be notified that their cache lines containing x are no
longer valid. Snooping works with both write-through and write-back caches; with write-through
caches the updates already appear on the bus, so no additional traffic is needed, while write-back
caches must send an extra notification.
Directory-based cache coherence
• Snooping cache coherence does not scale to large systems, because a broadcast is required every
time a cached variable is updated and broadcasts are expensive on large networks. Directory-based
cache coherence protocols address this problem by using a data structure called a directory that
stores the status of each cache line. The directory is typically distributed, with each
core/memory pair storing the part of the directory that describes the cache lines in its local
memory. When a variable is updated, only the cores whose caches actually store that variable's
cache line need to be contacted.
Shared-memory vs. distributed-
memory
• The largest parallel systems are usually not shared-memory systems, largely because of the
hardware cost of scaling shared-memory interconnects: buses are suitable only for systems with a
few processors because of conflicts over access, and crossbars become very expensive.
Distributed-memory interconnects such as the hypercube and the toroidal mesh are relatively
inexpensive at large scale, so distributed-memory systems are often better suited to problems
requiring vast amounts of data or computation.
Parallel software
• Parallel hardware is now ubiquitous in desktop and server systems, mobile phones, and tablets.
Parallel software has been slower to mature: although much system software and many popular
application programs now make some use of multiple cores, hardware and compilers alone cannot
consistently increase the performance of an application. To continue improving performance within
reasonable power limits, developers must learn to write applications that exploit shared- and
distributed-memory architectures and MIMD and SIMD systems, which requires understanding the
terminology and techniques used in parallel systems.
Shared-memory
• Shared-memory programs use shared and private variables: shared variables can be accessed by any
thread, while private variables can normally be accessed by only one thread, so communication is
usually done implicitly through shared variables. Many shared-memory programs use dynamic threads,
in which a master thread waits for work requests and forks worker threads that terminate when they
finish; this makes efficient use of system resources. The alternative is the static thread
paradigm, in which all threads are forked after any setup by the master thread and run until all
the work is completed. This can be less efficient in its use of resources, but it avoids the
overhead of repeatedly creating and destroying threads and is closer to the most widely used
paradigm for distributed-memory programming.
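• A minimal OpenMP sketch in C of shared versus private variables (the thread count of 4 and the
variable names are assumptions): shared_count is shared by all threads, my_rank is private to each
thread, and updates of the shared variable must be coordinated explicitly.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int shared_count = 0;                     /* shared: one copy for all threads */

        #pragma omp parallel num_threads(4)
        {
            int my_rank = omp_get_thread_num();   /* private: one copy per thread */

            /* Updates to a shared variable must be coordinated explicitly. */
            #pragma omp critical
            shared_count++;

            printf("thread %d has incremented shared_count\n", my_rank);
        }

        printf("final shared_count = %d\n", shared_count);
        return 0;
    }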
Distributed-memory
• The most widely used APIs for distributed-memory parallel programming are based on
message-passing, in which each process can directly access only its own private memory; such APIs
can also be used on shared-memory hardware. A message-passing program is executed by starting
multiple processes, which identify each other by ranks. The API provides send and receive
functions, and a function such as Get_rank returns the calling process's rank so that processes
can branch and do different work depending on their ranks. Message-passing is powerful but
low-level: the programmer must manage the details of every communication, and data structures that
are convenient in serial programs may be prohibitively expensive to use across processes and have
to be redesigned or distributed.
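• A minimal MPI sketch in C of this style of program (the message text and buffer size are
illustrative, and MPI_Comm_rank plays the role of the Get_rank function mentioned above): every
process finds its rank, the nonzero ranks send a message, and rank 0 branches differently,
receiving and printing the messages.

    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int my_rank, comm_sz;
        char msg[100];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);   /* this process's rank        */
        MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);   /* total number of processes  */

        if (my_rank != 0) {
            /* Every nonzero rank sends a message to rank 0. */
            sprintf(msg, "Greetings from process %d of %d", my_rank, comm_sz);
            MPI_Send(msg, strlen(msg) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        } else {
            /* Rank 0 branches differently: it receives and prints the messages. */
            printf("Greetings from process %d of %d\n", my_rank, comm_sz);
            for (int src = 1; src < comm_sz; src++) {
                MPI_Recv(msg, 100, MPI_CHAR, src, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("%s\n", msg);
            }
        }

        MPI_Finalize();
        return 0;
    }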