
2 - Parallel Computer Architecture - 1

This document discusses parallel computer architectures, specifically shared memory vs message passing architectures. It describes tightly coupled multiprocessors with shared global memory that use synchronization for shared data access, versus loosely coupled multiprocessors connected via a network that use message passing. It discusses challenges with scaling shared memory architectures and techniques like NUMA. It provides examples of historical parallel computer evolution and drivers towards multi-core processors.


PARALLEL COMPUTER ARCHITECTURE
REVIEW: SHARED MEMORY VS. MESSAGE PASSING
• Loosely coupled multiprocessors
  – No shared global memory address space
  – Multicomputer network
    • Network-based multiprocessors
  – Usually programmed via message passing
    • Explicit calls (send, receive) for communication, as in the sketch below
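As an illustration of this explicit-communication style, here is a minimal message-passing sketch in C using the standard MPI interface (MPI_Send and MPI_Recv are real MPI calls; the two-rank example itself is hypothetical):

```c
/* Message passing: rank 0 explicitly sends an integer to rank 1.
 * Compile and run with an MPI toolchain, e.g.:
 *   mpicc ping.c && mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Explicit send: data moves only because the program says so. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Explicit receive: blocks until the matching send arrives. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Note that no memory is shared here: each rank has its own address space, and the only way data crosses between them is through the send/receive pair.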

• Tightly coupled multiprocessors
  – Shared global memory address space
  – Traditional multiprocessing: symmetric multiprocessing (SMP)
  – Existing multi-core processors, multithreaded processors
  – Programming model similar to uniprocessors (i.e., multitasking uniprocessor), except
    • Operations on shared data require synchronization, as in the sketch below
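By contrast, here is a minimal shared-memory sketch (assuming POSIX threads; the shared counter is a hypothetical example): both threads touch the same address space with ordinary loads and stores, so the update to shared data must be wrapped in synchronization.

```c
/* Shared memory: two threads increment one counter through plain
 * loads/stores, guarded by a lock.  Compile with: gcc counter.c -pthread */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                       /* lives in the shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);             /* synchronization on shared data */
        counter++;                             /* ordinary load/store, no send/receive */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);        /* 2000000 with the lock; racy without */
    return 0;
}
```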
SCALABILITY, CONVERGENCE, AND SOME TERMINOLOGY
SCALING SHARED MEMORY ARCHITECTURES
INTERCONNECTION SCHEMES FOR SHARED MEMORY

• Scalability dependent on interconnect
UMA/UCA: UNIFORM MEMORY OR CACHE ACCESS

• All processors have the same un-contended latency to memory
• Latencies get worse as the system grows
• Symmetric multiprocessing (SMP) ~ UMA with bus interconnect

[Figure: UMA organization — processors connect through an interconnection network to main memory banks; long latency, with contention both in the network and in the memory banks]
UNIFORM MEMORY/CACHE ACCESS

+ Data placement unimportant/less important (easier to optimize code and make use of available memory space)
- Scaling the system increases latencies
- Contention could restrict bandwidth and increase latency

[Figure: same UMA organization as on the previous slide]
EXAMPLE SMP

• Quad-pack Intel Pentium Pro
HOW TO SCALE SHARED MEMORY MACHINES?

• Two general approaches

• Maintain UMA
  – Provide a scalable interconnect to memory
  – Downside: every memory access incurs the round-trip network latency

• Interconnect complete processors with local memory
  – NUMA (non-uniform memory access)
    • Local memory is faster than remote memory
  – Still needs a scalable interconnect for accessing remote memory
    • Not on the critical path of local memory access
NUMA/NUCA: NON-UNIFORM MEMORY/CACHE ACCESS

• Shared memory is split into local versus remote memory
+ Low latency to local memory
- Much higher latency to remote memories
+ Bandwidth to local memory may be higher
- Performance very sensitive to data placement (see the sketch after the figure)

[Figure: NUMA organization — each processor has its own local memory (short latency); remote memories are reached through an interconnection network (long latency, contention in the network)]
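To show why placement matters, here is a minimal sketch assuming the Linux libnuma API (numa_available, numa_alloc_onnode, numa_max_node, and numa_free are real libnuma calls; the traversal is a hypothetical example): memory allocated on one node is local (short latency) to CPUs on that node and remote (long latency) to CPUs on every other node.

```c
/* NUMA data placement sketch (Linux; compile with: gcc place.c -lnuma). */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {          /* no NUMA support on this system */
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    size_t n = 1 << 20;
    /* Pin the buffer to node 0's memory: CPUs on node 0 see short local
     * latency; CPUs on any other node pay the remote-access latency. */
    long *buf = numa_alloc_onnode(n * sizeof(long), 0);
    if (buf == NULL) return 1;

    long sum = 0;
    for (size_t i = 0; i < n; i++) {     /* local or remote access depends on */
        buf[i] = (long)i;                /* which node this thread runs on    */
        sum += buf[i];
    }

    printf("nodes: 0..%d, sum = %ld\n", numa_max_node(), sum);
    numa_free(buf, n * sizeof(long));
    return 0;
}
```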
CONVERGENCE OF PARALLEL ARCHITECTURES

• Scalable shared memory architecture is similar to scalable message passing architecture
  – Main difference: is remote memory accessible with loads/stores?
HISTORICAL EVOLUTION: 1960S & 70S

• Early MPs
  – Mainframes
  – Small number of processors
  – Crossbar interconnect
  – UMA

[Figure: processors connected to memory modules through a crossbar]
HISTORICAL EVOLUTION: 1980S

• Bus-Based MPs
  – Enabler: processor-on-a-board
  – Economical scaling
  – Precursor of today’s SMPs
  – UMA

[Figure: processors with private caches and memory modules connected by a shared bus]
HISTORICAL EVOLUTION: LATE 80S, MID 90S

• Large Scale MPs (Massively Parallel Processors)
  – Multi-dimensional interconnects
  – Each node a computer (proc + cache + memory)
  – Both shared memory and message passing versions
  – NUMA
  – Still used for “supercomputing”
HISTORICAL EVOLUTION: CURRENT

• Chip multiprocessors (multi-core)
• Small- to mid-scale multi-socket CMPs
  – One module type: processor + caches + memory
• Clusters/datacenters
  – Use high-performance LANs to connect SMP blades, racks

• Driven by economics and cost
  – Smaller systems => higher volumes
  – Off-the-shelf components

• Driven by applications
  – Many more throughput applications (web servers) …
  – … than parallel applications (weather prediction)
  – Cloud computing
HISTORICAL EVOLUTION: FUTURE

• Cluster/datacenter on a chip?

• Heterogeneous multi-core?

• Bounce back to small-scale multi-core?

• ???
MULTI-CORE PROCESSORS
MOORE’S LAW

• Moore, “Cramming more components onto integrated circuits,” Electronics, 1965.
MULTI-CORE

• Idea: Put multiple processors on the same die.

• Technology scaling (Moore’s Law) enables more transistors to be placed on the same die area

• What else could you do with the die area you dedicate to multiple processors?
  – Have a bigger, more powerful core
  – Have larger caches in the memory hierarchy
  – Simultaneous multithreading
  – Integrate platform components on chip (e.g., network interface, memory controllers)
WHY MULTI-CORE?

• Alternative: Bigger, more powerful single core
  – Larger superscalar issue width, larger instruction window, more execution units, large trace caches, large branch predictors, etc.

+ Improves single-thread performance transparently to the programmer and compiler
- Very difficult to design (scalable algorithms for improving single-thread performance have proven elusive)
- Power hungry: many out-of-order execution structures consume significant power/area when scaled. Why?
- Diminishing returns on performance
- Does not significantly help memory-bound application performance (scalable algorithms for this have proven elusive)
LARGE SUPERSCALAR VS. MULTI-CORE

• Olukotun et al., “The Case for a Single-Chip Multiprocessor,” ASPLOS 1996.
MULTI-CORE VS. LARGE SUPERSCALAR

• Multi-core advantages
+ Simpler cores → more power efficient, lower complexity, easier to design and replicate, higher frequency (shorter wires, smaller structures)
+ Higher system throughput on multi-programmed workloads → reduced context switches
+ Higher system throughput in parallel applications
• Multi-core disadvantages
- Requires parallel tasks/threads to improve performance (parallel programming; see the sketch below)
- Resource sharing can reduce single-thread performance
- Shared hardware resources need to be managed
- Number of pins limits data supply for increased demand
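To make the first disadvantage concrete, here is a minimal sketch assuming OpenMP (the reduction loop is a hypothetical example): the extra cores help only because the loop is explicitly marked parallel; unannotated serial code runs on one core no matter how many cores the chip provides.

```c
/* Multi-core speedup requires explicitly parallel code.
 * Compile with: gcc sum.c -fopenmp */
#include <omp.h>
#include <stdio.h>

#define N (1 << 20)
static double a[N];

int main(void) {
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    /* Without this pragma, the loop below uses a single core regardless
     * of how many cores are available. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.0f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```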
LARGE SUPERSCALAR VS. MULTI-CORE

• Olukotun et al., “The Case for a Single-Chip Multiprocessor,” ASPLOS 1996.

• Technology push
  – Instruction issue queue size limits the cycle time of the superscalar, OoO processor → diminishing performance
  – Quadratic increase in complexity with issue width
  – Large, multi-ported register files needed to support large instruction windows and issue widths → reduced frequency or longer RF access, diminishing performance

• Application pull
  – Integer applications: little parallelism?
  – FP applications: abundant loop-level parallelism
  – Others (transaction processing, multiprogramming): CMP a better fit
WHY MULTI-CORE?

• Alternative: Bigger caches

+ Improves single-thread performance transparently to the programmer and compiler
+ Simple to design

- Diminishing single-thread performance returns from cache size. Why?
- Multiple levels complicate the memory hierarchy
WHY MULTI-CORE?

• Alternative: (Simultaneous) Multithreading

+ Exploits thread-level parallelism (just like multi-core)
+ Good single-thread performance when there is a single thread
+ No need to have an entire core for another thread
+ Parallel performance aided by tight sharing of caches

- Scalability is limited: many threads require bigger register files and larger issue width (with their associated costs) → complex with many threads
- Parallel performance limited by shared fetch bandwidth
- Extensive resource sharing in the pipeline and memory system reduces both single-thread and parallel application performance
