
Unit 3

Parallel Computers
Objectives

An introduction to the fundamental variants of parallel computers:
The shared-memory type
The distributed-memory type
Basic design rules and performance characteristics for communication networks
What is parallel computing?

Parallel computing—multiple hardware "compute elements" (processor cores) solve a problem in a cooperative way.
All modern supercomputer architectures depend heavily on parallelism—a large number of compute elements.
A "peek" into supercomputers through Top500

The Top500 list (https://www.top500.org/)

A list of the world's 500 most powerful supercomputers
Ranking by the measured performance of the LINPACK benchmark:
Solves a dense system of linear equations (the system size is freely adjustable)
Metric: number of floating-point operations executed per second
Mostly reflects the FP capability of a supercomputer
The relevance of LINPACK is debatable
The list is updated twice a year
History of Top500
Top supercomputers of today (November 2018)
Top-1 systems of today and the past
Taxonomy of parallel computing paradigms

Dominating concepts:

SIMD (Single Instruction, Multiple Data)—A single instruction stream, either on a single processor (core) or on multiple compute elements, provides parallelism by operating on multiple data streams concurrently. (Hardware examples: vector processors, SIMD-capable modern superscalar microprocessors, and GPUs.)
MIMD (Multiple Instruction, Multiple Data)—Multiple instruction streams on multiple processors (cores) operate on different data items concurrently. (Hardware examples: shared-memory and distributed-memory parallel computers.)

The focus of this chapter is on multiprocessor MIMD parallelism.


Shared-memory computers
A shared-memory parallel computer has a number of CPUs (cores) that work on a shared physical address space.
Two varieties:

Uniform Memory Access (UMA) systems have a "flat" memory model: latency and bandwidth are the same for all processors and all memory locations. (Typically, single multicore processor chips are "UMA machines".)
Cache-coherent Nonuniform Memory Access (ccNUMA) systems have a physically distributed memory that is logically shared. The aggregated memory appears as one single address space. Memory access performance depends on which CPU (core) accesses which part of memory ("local" vs. "remote" access).
Caches are not (completely) shared

A shared-memory system, whether UMA or ccNUMA, has multiple CPU cores.
Although there is a single address space (shared memory), there are private caches, or partially shared caches, for the different CPU cores.
Therefore, copies of the same cache line may reside in several local caches.
Cache coherence

Problematic situations when a cache line resides in several caches:

If the cache line in one of the caches is modified, the other caches' contents are outdated (thus invalid).
If different parts of the cache line are modified by different processors in their local caches → no one has the correct cache line anymore.

Cache coherence protocols guarantee consistency between cached data and data in the shared memory at all times.
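A concrete way to see the second situation is "false sharing": two cores repeatedly modify different variables that happen to live in the same cache line, so the coherence protocol keeps invalidating each other's copy. The following is a minimal OpenMP sketch, assuming 64-byte cache lines and two threads (the struct layout, iteration count, and padding are illustrative assumptions, not from the slides); removing the pad member reintroduces false sharing and typically slows the loop down considerably.

/* False-sharing sketch: two threads update adjacent counters. With the
 * padding below, each counter occupies its own (assumed 64-byte) cache
 * line; without it, both counters share one line and the coherence
 * protocol bounces that line between the cores' caches.
 * Compile (e.g.): gcc -O2 -fopenmp false_sharing.c
 */
#include <stdio.h>
#include <omp.h>

#define ITERS 100000000L

/* Assumed 64-byte cache line; pad keeps each counter in its own line. */
struct padded { long value; char pad[64 - sizeof(long)]; };

int main(void)
{
    struct padded counters[2] = {{0}, {0}};
    double t0 = omp_get_wtime();

    /* Two threads, each incrementing only its own counter. */
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            counters[id].value++;
    }

    printf("sum = %ld, time = %.3f s\n",
           counters[0].value + counters[1].value, omp_get_wtime() - t0);
    return 0;
}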
Example of UMA

Dual-socket Xeon Clovertown CPUs


ccNUMA for scalable memory bandwidth

A locality domain (LD) is a set of processor cores together with locally connected memory. This "local" memory can be accessed by the set of processor cores in the most efficient way, without resorting to a network of any kind.
Each LD is a UMA building block.
Multiple LDs are linked via a coherent interconnect, which can mediate direct, cache-coherent memory accesses. (This mechanism is transparent to the programmer.)
The whole ccNUMA system has a shared address space (memory) and runs a single OS instance.
Example of ccNUMA
Penalty for non-local transfers

The locality problem: Non-local memory transfers (between LDs) are more costly than local transfers (within an LD).
The contention problem: If two processors from different LDs access memory in the same LD, they fight for memory bandwidth.
Both problems can be "solved" (alleviated) by carefully observing the data access patterns of an application and restricting the data access of each processor (mostly) to its own LD, through proper programming; a sketch follows below.
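Under the common first-touch page placement policy (assumed here), each memory page is placed in the LD of the core that writes it first. A minimal OpenMP sketch of the resulting programming pattern is given below (array size and loop schedule are illustrative assumptions): initializing the array in parallel, with the same schedule as the later compute loop, keeps each thread's pages in its own LD.

/* ccNUMA locality sketch: with first-touch placement, pages of a[] are
 * allocated in the LD of the thread that writes them first. Using the
 * same static schedule for initialization and computation keeps each
 * thread's accesses (mostly) LD-local.
 * Compile (e.g.): gcc -O2 -fopenmp first_touch.c
 */
#include <stdio.h>
#include <stdlib.h>

#define N 50000000L

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double sum = 0.0;
    if (!a) return 1;

    /* Parallel first touch: places pages near the threads that use them. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = (double)i;

    /* Compute loop with the same schedule: accesses stay local. */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %g\n", sum);
    free(a);
    return 0;
}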
A “purely” distributed-memory computer

"A programmer's view": Each processor is connected to its exclusive local memory (not shared by any other CPU).
No such "purely" distributed-memory computer exists today.
Typical modern distributed-memory systems

A cluster of shared-memory "compute nodes", interconnected via a communication network.
Each node comprises at least one network interface (NI) that mediates the connection to the communication network.
A serial process runs on each CPU (core). Between the nodes, processes can communicate by means of the network.
The layout and speed of the network have a considerable impact on application performance.
Hierarchical hybrid systems
Networks

There are different network technologies and topologies for connecting the compute elements.
The following is a brief overview of the topological and performance aspects of different types of communication networks.
Basic performance characteristics of networks

Point-to-point communication (from one process to another process)
Bisection bandwidth (a measure of the "whole" network)
A simple performance model of point-to-point communication

Time spent on transferring a message of size N [bytes] from a "sender" process to a "receiver" process:

T = T_A + N/B

This is a simplified model:

T_A: latency
B: maximum network point-to-point bandwidth [bytes/sec]

T_A and B are considered constants, but in reality they can both depend on N, as well as on the locations of the two processes.
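As a rough worked example (the latency and bandwidth figures are assumed for illustration, not taken from the slides): with T_A = 5 µs and B = 1 GByte/s, a message of N = 1 kByte needs T ≈ 5 µs + 1 µs = 6 µs, so latency dominates; a message of N = 10 MByte needs T ≈ 5 µs + 10 ms ≈ 10 ms, so the bandwidth term dominates.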
Effective bandwidth

Due to the latency T_A, the actual data transfer rate will be lower than B:

B_eff = N / (T_A + N/B)

The effective bandwidth B_eff approaches B when N is large enough.
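Continuing the assumed numbers from the example above (T_A = 5 µs, B = 1 GByte/s): for N = 1 kByte, B_eff = 1000 bytes / 6 µs ≈ 0.17 GByte/s, far below B; for N = 10 MByte, B_eff ≈ 10^7 bytes / 10.005 ms ≈ 0.9995 GByte/s, showing that B_eff approaches B only for sufficiently large messages.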
"Ping-pong" benchmark
“Ping-pong” benchmark (cont’d)

Pseudo code:
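The pseudocode itself is shown as a figure on the original slide; below is a minimal MPI sketch in C of the same idea (message size, repetition count, and tag are assumptions for illustration). Rank 0 sends N bytes to rank 1 and receives them back; half of the averaged round-trip time gives the one-way transfer time T(N), from which B_eff = N / T(N).

/* "Ping-pong" benchmark sketch: rank 0 sends an N-byte message to rank 1,
 * rank 1 sends it straight back. Half of the averaged round-trip time is
 * the one-way transfer time T(N), and Beff = N / T(N).
 * Build/run with exactly 2 processes, e.g.:
 *   mpicc -O2 pingpong.c && mpirun -np 2 ./a.out
 */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int N = 1 << 20;       /* message size in bytes (assumed) */
    const int REPS = 100;        /* repetitions to average out noise */
    char *buf = malloc(N);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = (MPI_Wtime() - t0) / REPS / 2.0;   /* one-way time T(N) */

    if (rank == 0)
        printf("N = %d bytes, T = %.3e s, Beff = %.3e bytes/s\n",
               N, t, (double)N / t);

    free(buf);
    MPI_Finalize();
    return 0;
}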
Example of “ping-pong” measurements

B_eff is measured for different values of N; the values of T_A and B can be deduced by "fitting" the measurements to the theoretical model.
Bisection bandwidth

How to quantify the "total" communication capacity of a network?
When all the compute elements are sending or receiving data at the same time:

"Competition" (even collisions) may cause the aggregated bandwidth, i.e., the sum of the effective bandwidths of all point-to-point connections, to be lower than the theoretical limit.

The bisection bandwidth of a network, B_b, is the sum of the bandwidths of the minimal number of connections that are cut when splitting the system into two equal-sized parts.
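As a hedged illustration (the topologies and link bandwidth B are assumed for the example, not given on the slide): for a bus, every bisection cuts the single shared medium, so B_b equals the bus bandwidth; for a ring in which neighboring nodes are connected by links of bandwidth B, splitting the ring into two equal halves always cuts exactly two links, so B_b = 2B regardless of the number of nodes.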
Illustration of bisection bandwidth
Different types of communication networks

Buses
Switched and fat-tree networks
Mesh networks
Buses

Can be used by exactly one communicating device at a time.
Easy to implement, featuring the lowest latency at low utilization.
The most important drawback is blocking.
Buses are susceptible to failures.
Switched and fat-tree networks

All communicating devices are organized into groups.
The devices in one group are connected to a switch.
Switches are connected with each other (as a fat-tree hierarchy).
The "distance" between two communicating devices is the number of "hops".
Mesh networks

In the form of multidimensional (hyper)cubes.
Each compute element is located at a Cartesian grid intersection.
Connections are wrapped around the boundaries to form a torus topology.
