
2 - Parallel Computer Architecture - 1

This document discusses parallel computer architectures, specifically shared memory vs message passing architectures. It describes tightly coupled multiprocessors with shared global memory that use synchronization for shared data access, versus loosely coupled multiprocessors connected via a network that use message passing. It discusses challenges with scaling shared memory architectures and techniques like NUMA. It provides examples of historical parallel computer evolution and drivers towards multi-core processors.


PARALLEL COMPUTER ARCHITECTURE
REVIEW: SHARED MEMORY VS. MESSAGE PASSING
• Loosely coupled multiprocessors
  – No shared global memory address space
  – Multicomputer network
    • Network-based multiprocessors
  – Usually programmed via message passing
    • Explicit calls (send, receive) for communication, as in the sketch below
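As an illustration of this explicit-communication style, here is a minimal message-passing sketch in C using the standard MPI interface (MPI_Send and MPI_Recv are real MPI calls; the two-rank example itself is hypothetical):

```c
/* Message passing: rank 0 explicitly sends an integer to rank 1.
 * Compile and run with an MPI toolchain, e.g.:
 *   mpicc ping.c && mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Explicit send: data moves only because the program says so. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Explicit receive: blocks until the matching send arrives. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Note that no memory is shared here: each rank has its own address space, and the only way data crosses between them is through the send/receive pair.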

• Tightly coupled multiprocessors
  – Shared global memory address space
  – Traditional multiprocessing: symmetric multiprocessing (SMP)
  – Existing multi-core processors, multithreaded processors
  – Programming model similar to uniprocessors (i.e., multitasking uniprocessor), except
    • Operations on shared data require synchronization, as in the sketch below
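By contrast, here is a minimal shared-memory sketch (assuming POSIX threads; the shared counter is a hypothetical example): both threads touch the same address space with ordinary loads and stores, so the update to shared data must be wrapped in synchronization.

```c
/* Shared memory: two threads increment one counter through plain
 * loads/stores, guarded by a lock.  Compile with: gcc counter.c -pthread */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                       /* lives in the shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);             /* synchronization on shared data */
        counter++;                             /* ordinary load/store, no send/receive */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);        /* 2000000 with the lock; racy without */
    return 0;
}
```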
SCALABILITY, CONVERGENCE, AND SOME TERMINOLOGY
SCALING SHARED MEMORY ARCHITECTURES
INTERCONNECTION SCHEMES FOR SHARED MEMORY

• Scalability dependent on interconnect
UMA/UCA: UNIFORM MEMORY OR CACHE ACCESS

• All processors have the same un-contended latency to memory
• Latencies get worse as the system grows
• Symmetric multiprocessing (SMP) ~ UMA with bus interconnect

[Figure: UMA organization — processors connect through an interconnection network to main memory banks; long latency, with contention both in the network and in the memory banks]
UNIFORM MEMORY/CACHE ACCESS

+ Data placement unimportant/less important (easier to optimize code and make use of available memory space)
- Scaling the system increases latencies
- Contention could restrict bandwidth and increase latency

[Figure: same UMA organization as on the previous slide]
EXAMPLE SMP

• Quad-pack Intel Pentium Pro
HOW TO SCALE SHARED MEMORY MACHINES?

• Two general approaches

• Maintain UMA
  – Provide a scalable interconnect to memory
  – Downside: every memory access incurs the round-trip network latency

• Interconnect complete processors with local memory
  – NUMA (non-uniform memory access)
    • Local memory is faster than remote memory
  – Still needs a scalable interconnect for accessing remote memory
    • Not on the critical path of local memory access
NUMA/NUCA: NON-UNIFORM MEMORY/CACHE ACCESS

• Shared memory is split into local versus remote memory
+ Low latency to local memory
- Much higher latency to remote memories
+ Bandwidth to local memory may be higher
- Performance very sensitive to data placement (see the sketch after the figure)

[Figure: NUMA organization — each processor has its own local memory (short latency); remote memories are reached through an interconnection network (long latency, contention in the network)]
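To show why placement matters, here is a minimal sketch assuming the Linux libnuma API (numa_available, numa_alloc_onnode, numa_max_node, and numa_free are real libnuma calls; the traversal is a hypothetical example): memory allocated on one node is local (short latency) to CPUs on that node and remote (long latency) to CPUs on every other node.

```c
/* NUMA data placement sketch (Linux; compile with: gcc place.c -lnuma). */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {          /* no NUMA support on this system */
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    size_t n = 1 << 20;
    /* Pin the buffer to node 0's memory: CPUs on node 0 see short local
     * latency; CPUs on any other node pay the remote-access latency. */
    long *buf = numa_alloc_onnode(n * sizeof(long), 0);
    if (buf == NULL) return 1;

    long sum = 0;
    for (size_t i = 0; i < n; i++) {     /* local or remote access depends on */
        buf[i] = (long)i;                /* which node this thread runs on    */
        sum += buf[i];
    }

    printf("nodes: 0..%d, sum = %ld\n", numa_max_node(), sum);
    numa_free(buf, n * sizeof(long));
    return 0;
}
```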
CONVERGENCE OF PARALLEL ARCHITECTURES

• Scalable shared memory architecture is similar to scalable message passing architecture
  – Main difference: is remote memory accessible with loads/stores?
HISTORICAL EVOLUTION: 1960S & 70S

• Early MPs
  – Mainframes
  – Small number of processors
  – Crossbar interconnect
  – UMA

[Figure: processors connected to memory modules through a crossbar]
HISTORICAL EVOLUTION: 1980S

• Bus-Based MPs
  – Enabler: processor-on-a-board
  – Economical scaling
  – Precursor of today’s SMPs
  – UMA

[Figure: processors with private caches and memory modules connected by a shared bus]
HISTORICAL EVOLUTION: LATE 80S, MID 90S

• Large Scale MPs (Massively Parallel Processors)
  – Multi-dimensional interconnects
  – Each node a computer (proc + cache + memory)
  – Both shared memory and message passing versions
  – NUMA
  – Still used for “supercomputing”
HISTORICAL EVOLUTION: CURRENT

• Chip multiprocessors (multi-core)
• Small- to mid-scale multi-socket CMPs
  – One module type: processor + caches + memory
• Clusters/datacenters
  – Use high-performance LANs to connect SMP blades, racks

• Driven by economics and cost
  – Smaller systems => higher volumes
  – Off-the-shelf components

• Driven by applications
  – Many more throughput applications (web servers) …
  – … than parallel applications (weather prediction)
  – Cloud computing
HISTORICAL EVOLUTION: FUTURE

• Cluster/datacenter on a chip?

• Heterogeneous multi-core?

• Bounce back to small-scale multi-core?

• ???
MULTI-CORE PROCESSORS
MOORE’S LAW

• Moore, “Cramming more components onto integrated circuits,” Electronics, 1965.
MULTI-CORE

• Idea: Put multiple processors on the same die.

• Technology scaling (Moore’s Law) enables more transistors to be placed on the same die area

• What else could you do with the die area you dedicate to multiple processors?
  – Have a bigger, more powerful core
  – Have larger caches in the memory hierarchy
  – Simultaneous multithreading
  – Integrate platform components on chip (e.g., network interface, memory controllers)
WHY MULTI-CORE?

• Alternative: Bigger, more powerful single core
  – Larger superscalar issue width, larger instruction window, more execution units, large trace caches, large branch predictors, etc.

+ Improves single-thread performance transparently to the programmer and compiler
- Very difficult to design (scalable algorithms for improving single-thread performance have proven elusive)
- Power hungry: many out-of-order execution structures consume significant power/area when scaled. Why?
- Diminishing returns on performance
- Does not significantly help memory-bound application performance (scalable algorithms for this have proven elusive)
LARGE SUPERSCALAR VS. MULTI-CORE

• Olukotun et al., “The Case for a Single-Chip Multiprocessor,” ASPLOS 1996.
MULTI-CORE VS. LARGE SUPERSCALAR

• Multi-core advantages
+ Simpler cores → more power efficient, lower complexity, easier to design and replicate, higher frequency (shorter wires, smaller structures)
+ Higher system throughput on multi-programmed workloads → reduced context switches
+ Higher system throughput in parallel applications
• Multi-core disadvantages
- Requires parallel tasks/threads to improve performance (parallel programming; see the sketch below)
- Resource sharing can reduce single-thread performance
- Shared hardware resources need to be managed
- Number of pins limits data supply for increased demand
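To make the first disadvantage concrete, here is a minimal sketch assuming OpenMP (the reduction loop is a hypothetical example): the extra cores help only because the loop is explicitly marked parallel; unannotated serial code runs on one core no matter how many cores the chip provides.

```c
/* Multi-core speedup requires explicitly parallel code.
 * Compile with: gcc sum.c -fopenmp */
#include <omp.h>
#include <stdio.h>

#define N (1 << 20)
static double a[N];

int main(void) {
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    /* Without this pragma, the loop below uses a single core regardless
     * of how many cores are available. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.0f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```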
LARGE SUPERSCALAR VS. MULTI-CORE

• Olukotun et al., “The Case for a Single-Chip Multiprocessor,” ASPLOS 1996.

• Technology push
  – Instruction issue queue size limits the cycle time of the superscalar, OoO processor → diminishing performance
  – Quadratic increase in complexity with issue width
  – Large, multi-ported register files needed to support large instruction windows and issue widths → reduced frequency or longer RF access, diminishing performance

• Application pull
  – Integer applications: little parallelism?
  – FP applications: abundant loop-level parallelism
  – Others (transaction processing, multiprogramming): CMP a better fit
WHY MULTI-CORE?

• Alternative: Bigger caches

+ Improves single-thread performance transparently to the programmer and compiler
+ Simple to design

- Diminishing single-thread performance returns from cache size. Why?
- Multiple levels complicate the memory hierarchy
WHY MULTI-CORE?

• Alternative: (Simultaneous) Multithreading

+ Exploits thread-level parallelism (just like multi-core)
+ Good single-thread performance when there is a single thread
+ No need to have an entire core for another thread
+ Parallel performance aided by tight sharing of caches

- Scalability is limited: many threads require bigger register files and larger issue width (with their associated costs) → complex with many threads
- Parallel performance limited by shared fetch bandwidth
- Extensive resource sharing in the pipeline and memory system reduces both single-thread and parallel application performance
