
Unit 3

Parallel Computers
Objectives

An introduction to the fundamental variants of parallel computers:
The shared-memory type
The distributed-memory type
Basic design rules and performance characteristics for communication networks
What is parallel computing?

Parallel computing—multiple hardware "compute elements" (processor cores) solve a problem in a cooperative way.
All modern supercomputer architectures depend heavily on parallelism—a large number of compute elements.
A "peek" into supercomputers through Top500

The Top500 list (https://www.top500.org/)

A list of the world's 500 most powerful supercomputers
Ranking by the measured performance of the LINPACK benchmark:
Solves a dense system of linear equations (the system size is freely adjustable)
Metric: number of floating-point operations executed per second
Mostly reflects the FP capability of a supercomputer
The relevance of LINPACK is debatable
The list is updated twice a year
History of Top500
Top supercomputers of today (November 2018)
Top-1 systems of today and the past
Taxonomy of parallel computing paradigms

Dominating concepts:

SIMD (Single Instruction, Multiple Data)—A single instruction stream, either on a single processor (core) or on multiple compute elements, provides parallelism by operating on multiple data streams concurrently. (Hardware examples: vector processors, SIMD-capable modern superscalar microprocessors, and GPUs.)
MIMD (Multiple Instruction, Multiple Data)—Multiple instruction streams on multiple processors (cores) operate on different data items concurrently. (Hardware examples: shared-memory and distributed-memory parallel computers.)

The focus of this chapter is on multiprocessor MIMD parallelism.


Shared-memory computers
A shared-memory parallel computer has a number of CPUs (cores) that work on a shared physical address space.
Two varieties:

Uniform Memory Access (UMA) systems have a "flat" memory model: latency and bandwidth are the same for all processors and all memory locations. (Typically, single multicore processor chips are "UMA machines".)
Cache-coherent Nonuniform Memory Access (ccNUMA) systems have a physically distributed memory that is logically shared. The aggregated memory appears as one single address space. Memory access performance depends on which CPU (core) accesses which part of memory ("local" vs. "remote" access).
Caches are not (completely) shared

A shared-memory system, whether UMA or ccNUMA, has multiple CPU cores.
Although there is a single address space (shared memory), there are private caches, or partially shared caches, for the different CPU cores.
Therefore, copies of the same cache line may reside in several local caches.
Cache coherence

Problematic situations when a cache line resides in several caches:

If the cache line in one of the caches is modified, the other caches' contents are outdated (thus invalid).
If different parts of the cache line are modified by different processors in their local caches → no one has the correct cache line anymore.

Cache coherence protocols guarantee consistency between cached data and data in the shared memory at all times.
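A concrete way to see the second situation is "false sharing": two cores repeatedly modify different variables that happen to live in the same cache line, so the coherence protocol keeps invalidating each other's copy. The following is a minimal OpenMP sketch, assuming 64-byte cache lines and two threads (the struct layout, iteration count, and padding are illustrative assumptions, not from the slides); removing the pad member reintroduces false sharing and typically slows the loop down considerably.

/* False-sharing sketch: two threads update adjacent counters. With the
 * padding below, each counter occupies its own (assumed 64-byte) cache
 * line; without it, both counters share one line and the coherence
 * protocol bounces that line between the cores' caches.
 * Compile (e.g.): gcc -O2 -fopenmp false_sharing.c
 */
#include <stdio.h>
#include <omp.h>

#define ITERS 100000000L

/* Assumed 64-byte cache line; pad keeps each counter in its own line. */
struct padded { long value; char pad[64 - sizeof(long)]; };

int main(void)
{
    struct padded counters[2] = {{0}, {0}};
    double t0 = omp_get_wtime();

    /* Two threads, each incrementing only its own counter. */
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            counters[id].value++;
    }

    printf("sum = %ld, time = %.3f s\n",
           counters[0].value + counters[1].value, omp_get_wtime() - t0);
    return 0;
}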
Example of UMA

Dual-socket Xeon Clovertown CPUs


ccNUMA for scalable memory bandwidth

A locality domain (LD) is a set of processor cores together with locally connected memory. This "local" memory can be accessed by the set of processor cores in the most efficient way, without resorting to a network of any kind.
Each LD is a UMA building block.
Multiple LDs are linked via a coherent interconnect, which can mediate direct, cache-coherent memory accesses. (This mechanism is transparent to the programmer.)
The whole ccNUMA system has a shared address space (memory) and runs a single OS instance.
Example of ccNUMA
Penalty for non-local transfers

The locality problem: Non-local memory transfers (between LDs) are more costly than local transfers (within an LD).
The contention problem: If two processors from different LDs access memory in the same LD, they fight for memory bandwidth.
Both problems can be "solved" (alleviated) by carefully observing the data access patterns of an application and restricting the data access of each processor (mostly) to its own LD, through proper programming; a sketch follows below.
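Under the common first-touch page placement policy (assumed here), each memory page is placed in the LD of the core that writes it first. A minimal OpenMP sketch of the resulting programming pattern is given below (array size and loop schedule are illustrative assumptions): initializing the array in parallel, with the same schedule as the later compute loop, keeps each thread's pages in its own LD.

/* ccNUMA locality sketch: with first-touch placement, pages of a[] are
 * allocated in the LD of the thread that writes them first. Using the
 * same static schedule for initialization and computation keeps each
 * thread's accesses (mostly) LD-local.
 * Compile (e.g.): gcc -O2 -fopenmp first_touch.c
 */
#include <stdio.h>
#include <stdlib.h>

#define N 50000000L

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double sum = 0.0;
    if (!a) return 1;

    /* Parallel first touch: places pages near the threads that use them. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = (double)i;

    /* Compute loop with the same schedule: accesses stay local. */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %g\n", sum);
    free(a);
    return 0;
}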
A “purely” distributed-memory computer

"A programmer's view": Each processor is connected to its exclusive local memory (not shared by any other CPU).
No such "purely" distributed-memory computer exists today.
Typical modern distributed-memory systems

A cluster of shared-memory "compute nodes", interconnected via a communication network.
Each node comprises at least one network interface (NI) that mediates the connection to the communication network.
A serial process runs on each CPU (core). Between the nodes, processes can communicate by means of the network.
The layout and speed of the network have a considerable impact on application performance.
Hierarchical hybrid systems
Networks

There are different network technologies and topologies for connecting the compute elements.
The following is a brief overview of the topological and performance aspects of different types of communication networks.
Basic performance characteristics of networks

Point-to-point communication (from one process to another process)
Bisection bandwidth (a measure of the "whole" network)
A simple performance model of point-to-point communication

Time spent on transferring a message of size N [bytes] from a "sender" process to a "receiver" process:

T = T_A + N/B

This is a simplified model:

T_A: latency
B: maximum network point-to-point bandwidth [bytes/sec]

T_A and B are considered constants, but in reality they can both depend on N, as well as on the locations of the two processes.
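As a rough worked example (the latency and bandwidth figures are assumed for illustration, not taken from the slides): with T_A = 5 µs and B = 1 GByte/s, a message of N = 1 kByte needs T ≈ 5 µs + 1 µs = 6 µs, so latency dominates; a message of N = 10 MByte needs T ≈ 5 µs + 10 ms ≈ 10 ms, so the bandwidth term dominates.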
Effective bandwidth

Due to the latency T_A, the actual data transfer rate will be lower than B:

B_eff = N / (T_A + N/B)

The effective bandwidth B_eff approaches B when N is large enough.
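Continuing the assumed numbers from the example above (T_A = 5 µs, B = 1 GByte/s): for N = 1 kByte, B_eff = 1000 bytes / 6 µs ≈ 0.17 GByte/s, far below B; for N = 10 MByte, B_eff ≈ 10^7 bytes / 10.005 ms ≈ 0.9995 GByte/s, showing that B_eff approaches B only for sufficiently large messages.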
"Ping-pong" benchmark
“Ping-pong” benchmark (cont’d)

Pseudo code:
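The pseudocode itself is shown as a figure on the original slide; below is a minimal MPI sketch in C of the same idea (message size, repetition count, and tag are assumptions for illustration). Rank 0 sends N bytes to rank 1 and receives them back; half of the averaged round-trip time gives the one-way transfer time T(N), from which B_eff = N / T(N).

/* "Ping-pong" benchmark sketch: rank 0 sends an N-byte message to rank 1,
 * rank 1 sends it straight back. Half of the averaged round-trip time is
 * the one-way transfer time T(N), and Beff = N / T(N).
 * Build/run with exactly 2 processes, e.g.:
 *   mpicc -O2 pingpong.c && mpirun -np 2 ./a.out
 */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int N = 1 << 20;       /* message size in bytes (assumed) */
    const int REPS = 100;        /* repetitions to average out noise */
    char *buf = malloc(N);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = (MPI_Wtime() - t0) / REPS / 2.0;   /* one-way time T(N) */

    if (rank == 0)
        printf("N = %d bytes, T = %.3e s, Beff = %.3e bytes/s\n",
               N, t, (double)N / t);

    free(buf);
    MPI_Finalize();
    return 0;
}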
Example of “ping-pong” measurements

B_eff is measured for different values of N; the values of T_A and B can be deduced by "fitting" the measurements to the theoretical model.
Bisection bandwidth

How to quantify the "total" communication capacity of a network?
When all the compute elements are sending or receiving data at the same time:

"Competition" (even collisions) may cause the aggregated bandwidth, i.e., the sum of the effective bandwidths of all point-to-point connections, to be lower than the theoretical limit.

The bisection bandwidth of a network, B_b, is the sum of the bandwidths of the minimal number of connections that are cut when splitting the system into two equal-sized parts.
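As a hedged illustration (the topologies and link bandwidth B are assumed for the example, not given on the slide): for a bus, every bisection cuts the single shared medium, so B_b equals the bus bandwidth; for a ring in which neighboring nodes are connected by links of bandwidth B, splitting the ring into two equal halves always cuts exactly two links, so B_b = 2B regardless of the number of nodes.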
Illustration of bisection bandwidth
Different types of communication networks

Buses
Switched and fat-tree networks
Mesh networks
Buses

Can be used by exactly one communicating device at a time.
Easy to implement, featuring the lowest latency at low utilization.
The most important drawback is blocking.
Buses are susceptible to failures.
Switched and fat-tree networks

All communicating devices are organized into groups.
The devices in one group are connected to a switch.
Switches are connected with each other (as a fat-tree hierarchy).
The "distance" between two communicating devices is the number of "hops".
Mesh networks

In the form of multidimensional (hyper)cubes.
Each compute element is located at a Cartesian grid intersection.
Connections are wrapped around the boundaries to form a torus topology.
