GPU Architectures: A CPU Perspective

Derek Hower, AMD Research
5/21/2013
With updates by David Wood
Goals
- Data Parallelism: What is it, and how to exploit it?
  - Workload characteristics
- Execution Models / GPU Architectures
  - MIMD (SPMD), SIMD, SIMT
- GPU Programming Models
  - Terminology translations: CPU / AMD GPU / Nvidia GPU
  - Intro to OpenCL
- Modern GPU Microarchitectures
  - i.e., programmable GPU pipelines, not their fixed-function predecessors
- Advanced Topics (time permitting)
  - The Limits of GPUs: What they can and cannot do
  - The Future of GPUs: Where do we go from here?
Data Parallel Execution on GPUs
Data Parallelism, Programming Models, SIMT
Graphics Workloads
- Identical, Independent, Streaming computation on pixels
[Diagram: a stream of pixels flowing through the GPU]
Architecture Spelling Bee
- Spell "Independent": P-A-R-A-L-L-E-L
Generalize: Data Parallel Workloads
- Identical, Independent computation on multiple data inputs
[Diagram: four inputs (0,7) (1,7) (2,7) (3,7), each mapped through the same function =f() to outputs (7,0) (6,0) (5,0) (4,0)]
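A minimal C sketch of this pattern (the per-element function is illustrative): the same computation is applied to every input, and no iteration depends on any other, so iterations could run in any order -- or all at once.

    #include <stddef.h>

    /* Identical, independent work: apply the same f() to every element.
     * No iteration reads another iteration's data, so all of them
     * could execute in parallel. */
    void transform(const float *in, float *out, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = 2.0f * in[i] + 1.0f;  /* stand-in for f() */
    }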
Naïve Approach
- Split independent work over multiple processors
[Diagram: inputs (0,7)...(3,7) assigned to CPU0...CPU3, each computing =f() to produce (7,0)...(4,0)]
Data Parallelism: A MIMD Approach
- Multiple Instruction Multiple Data
- Split independent work over multiple processors
- When work is identical (same program): Single Program Multiple Data (SPMD), a subcategory of MIMD
[Diagram: CPU0-CPU3 each run a copy of the program through a full Fetch/Decode/Execute/Memory/Writeback pipeline on inputs (0,7)...(3,7)]
Data Parallelism: An SPMD Approach
- Single Program Multiple Data
- Split identical, independent work over multiple processors
[Diagram: CPU0-CPU3 all running the same program, each through its own Fetch/Decode/Execute/Memory/Writeback pipeline]
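A hedged SPMD sketch in C with POSIX threads (thread count, array size, and the work function are illustrative): every thread runs the same program body, and only the data slice differs.

    #include <pthread.h>
    #include <stdio.h>

    #define N        1024
    #define NTHREADS 4

    static float in[N], out[N];

    /* Single Program: every thread executes this same function. */
    static void *worker(void *arg)
    {
        long id = (long)arg;               /* Multiple Data: slice by id */
        for (long i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
            out[i] = 2.0f * in[i] + 1.0f;
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long id = 0; id < NTHREADS; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (int id = 0; id < NTHREADS; id++)
            pthread_join(t[id], NULL);
        printf("out[0] = %f\n", out[0]);
        return 0;
    }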
Data Parallelism: A SIMD Approach
- Single Instruction Multiple Data
- Split identical, independent work over multiple execution units (lanes)
- More efficient: eliminates redundant fetch/decode
[Diagram: one CPU with a single Fetch/Decode stage feeding four Execute/Memory/Writeback lanes, processing (0,7)...(3,7) into (7,0)...(4,0)]
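A hedged C sketch of the SIMD idea using x86 SSE intrinsics (the comparison table later cites x86 SSE/AVX as the example architecture): one fetched and decoded instruction operates on four packed floats, eliminating three redundant fetch/decodes. Assumes n is a multiple of 4 and 16-byte-aligned pointers.

    #include <xmmintrin.h>  /* SSE intrinsics */

    void transform_simd(const float *in, float *out, int n)
    {
        __m128 two = _mm_set1_ps(2.0f);
        __m128 one = _mm_set1_ps(1.0f);
        for (int i = 0; i < n; i += 4) {
            __m128 v = _mm_load_ps(&in[i]);           /* one load, four lanes  */
            v = _mm_add_ps(_mm_mul_ps(v, two), one);  /* one mul, one add      */
            _mm_store_ps(&out[i], v);                 /* one store, four lanes */
        }
    }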
SIMD: A Closer Look
- One thread + data-parallel ops
- Single PC, single register file
[Diagram: the same four-lane pipeline, now showing one register file shared by all lanes]
Data Parallelism: A SIMT Approach
- Single Instruction Multiple Thread
- Split identical, independent work over multiple lockstep threads
- Multiple threads + scalar ops
- One PC, multiple register files
[Diagram: one Fetch/Decode stage driving wavefront WF0: four lockstep threads, each with its own Execute/Memory/Writeback path]
Terminology Headache #1
- It's common to interchange SIMD and SIMT
Data Parallel Execution Models
- MIMD/SPMD: multiple independent threads
- SIMD/Vector: one thread with a wide execution datapath
- SIMT: multiple lockstep threads
Execution Model Comparison

                     | MIMD/SPMD                        | SIMD/Vector                        | SIMT
Example architecture | Multicore CPUs                   | x86 SSE/AVX                        | GPUs
Pros                 | More general: supports TLP       | Can mix sequential & parallel code | Easier to program; Gather/Scatter operations
Cons                 | Inefficient for data parallelism | Gather/Scatter can be awkward      | Divergence kills performance
GPUs and Memory
- Recall: GPUs perform streaming computation -> streaming memory access
- GPU DRAM latency: 100s of GPU cycles
- How do we keep the GPU busy (hide memory latency)?
Hiding Memory Latency
Options from the CPU world:
- Caches: need spatial/temporal locality
- OoO/Dynamic Scheduling: need ILP
- Multicore/Multithreading/SMT: need independent threads
Multicore Multithreaded SIMT
- Many SIMT threads grouped together into a GPU Core
  - SIMT threads in a group ~ SMT threads in a CPU core
  - Unlike a CPU, groups are exposed to programmers
- Multiple GPU Cores
[Diagram: a GPU composed of multiple GPU Cores]
This is a GPU Architecture (Whew!)
GPU Component Names

AMD/OpenCL         | Derek's CPU Analogy
Processing Element | Lane
SIMD Unit          | Pipeline
Compute Unit       | GPU Core
GPU Device         | Device
GPU Programming Models
OpenCL
GPU Programming Models
- CUDA: Compute Unified Device Architecture
  - Developed by Nvidia -- proprietary
  - First serious GPGPU language/environment
- OpenCL: Open Computing Language
  - From the makers of OpenGL
  - Wide industry support: AMD, Apple, Qualcomm, Nvidia (begrudgingly), etc.
- C++ AMP: C++ Accelerated Massive Parallelism
  - Microsoft
  - Much higher abstraction than CUDA/OpenCL
- OpenACC: Open Accelerator
  - Like OpenMP for GPUs (semi-auto-parallelize serial code)
  - Much higher abstraction than CUDA/OpenCL
OpenCL
- Early CPU languages were light abstractions of physical hardware
  - E.g., C
- Early GPU languages are light abstractions of physical hardware
  - OpenCL + CUDA
[Diagram: GPU architecture beside the OpenCL model: GPU <-> NDRange, GPU Core <-> Workgroup; workgroups contain wavefronts, wavefronts contain work-items]
NDRange
- N-Dimensional (N = 1, 2, or 3) index space
- Partitioned into workgroups, wavefronts, and work-items
[Diagram: an NDRange partitioned into workgroups]
Kernel
- Run an NDRange on a kernel (i.e., a function)
- Same kernel executes for each work-item
- Smells like MIMD/SPMD... but beware, it's not!
[Diagram: each work-item in a workgroup runs the kernel on its own input: (0,7)->(7,0) ... (3,7)->(4,0)]
OpenCL Code

    __kernel void flip_and_recolor(__global const float3 *in_image,
                                   __global float3 *out_image,
                                   int img_dim_x, int img_dim_y)
    {
        int x = get_global_id(0); // get work-item id in dimension 0
        int y = get_global_id(1); // get work-item id in dimension 1
        // 2D images passed as flat buffers: OpenCL kernels cannot take
        // pointer-to-pointer arguments from the host
        out_image[(img_dim_x - 1 - x) * img_dim_y + (img_dim_y - 1 - y)] =
            recolor(in_image[x * img_dim_y + y]); // recolor() defined elsewhere
    }
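For context, a hedged host-side sketch of launching such a kernel (error checks omitted; the cl_kernel and cl_command_queue are assumed to have been created, and the kernel arguments set, beforehand): the global work size defines the NDRange, and each index in it becomes one work-item.

    #include <CL/cl.h>

    void launch(cl_command_queue queue, cl_kernel kernel,
                size_t img_dim_x, size_t img_dim_y)
    {
        size_t global[2] = { img_dim_x, img_dim_y };  /* NDRange: one work-item
                                                         per pixel */
        size_t local[2]  = { 8, 8 };                  /* workgroup: 64 work-items
                                                         (one GCN wavefront) */
        clEnqueueNDRangeKernel(queue, kernel,
                               2,              /* 2-dimensional index space */
                               NULL,           /* no global offset */
                               global, local,
                               0, NULL, NULL); /* no event dependencies */
        clFinish(queue);                       /* wait for completion */
    }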
GPU Microarchitecture
AMD Graphics Core Next
GPU Hardware Overview
[Diagram: GPU with GDDR5 DRAM and a shared L2 Cache; each GPU Core contains four SIMT Units plus an L1 Cache and Local Memory]
Compute Unit: A GPU Core
- Compute Unit (CU): runs workgroups
  - Contains 4 SIMT Units
  - Picks one SIMT Unit per cycle for scheduling
- SIMT Unit: runs wavefronts
  - Each SIMT Unit has a 10-wavefront instruction buffer
  - Takes 4 cycles to execute one wavefront
[Diagram: a CU with four SIMT Units, an L1 Cache, and Local Memory]
10 wavefronts x 4 SIMT Units = 40 active wavefronts / CU
64 work-items / wavefront x 40 active wavefronts = 2560 active work-items / CU
Compute Unit Timing Diagram
- On average: fetch & commit one wavefront / cycle
[Timing diagram: over cycles 1-12, SIMT0-SIMT3 each execute wavefronts in 4-cycle slices (SIMT0: WF1_0..WF1_3, WF5, WF9; SIMT1: WF2, WF6, WF10; SIMT2: WF3, WF7, WF11; SIMT3: WF4, WF8, WF12), staggered by one cycle so one wavefront completes per cycle]
SIMT Unit: A GPU Pipeline
- Like a wide CPU pipeline -- except one fetch for entire width
- 16-wide physical ALU
  - Executes a 64-wide wavefront over 4 cycles. Why?? (64 work-items / 16 lanes = 4 cycles)
- 64KB register state / SIMT Unit
  - Compare to x86 (Bulldozer): ~1KB of physical register file state (~1/64 size)
- Address Coalescing Unit: a key to good memory performance
[Diagram: 16 register-file lanes feeding the ALU, plus the Address Coalescing Unit]
Address Coalescing
- Wavefront: issues 64 memory requests
- Common case: work-items in same wavefront touch same cache block
- Coalescing: merge many work-items' requests into single cache-block request
- Important for performance: reduces bandwidth to DRAM
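A hedged OpenCL C sketch of the two access patterns (buffer names and the stride parameter are illustrative): consecutive work-items touching consecutive addresses coalesce into a few cache-block requests, while large-stride accesses from the same wavefront do not.

    __kernel void access_patterns(__global const float *in,
                                  __global float *out,
                                  int stride)
    {
        int i = get_global_id(0);

        /* Coalesced: work-items 0..63 of a wavefront read in[0..63],
         * which the coalescing unit merges into a few block requests. */
        float a = in[i];

        /* Uncoalesced: with a large stride, the same 64 work-items touch
         * 64 different cache blocks -- 64 separate requests toward DRAM. */
        float b = in[i * stride];

        out[i] = a + b;
    }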
GPU Memory
GPUs have caches.
Not Your CPU's Cache
By the numbers: Bulldozer FX-8170 vs. GCN Radeon HD 7970

                                          | CPU (Bulldozer) | GPU (GCN)
L1 data cache capacity                    | 16KB            | 16KB
Active threads (work-items) sharing L1 D$ | 1               | 2560
L1 dcache capacity / thread               | 16KB            | 6.4 bytes
Last level cache (LLC) capacity           | 8MB             | 768KB
Active threads (work-items) sharing LLC   | 8               | 81,920
LLC capacity / thread                     | 1MB             | 9.6 bytes
GPU Caches
- Maximize throughput, not hide latency
  - Not there for either spatial or temporal locality
- L1 Cache: coalesce requests to same cache block by different work-items
  - i.e., streaming "thread locality"?
  - Keep block around just long enough for each work-item to hit once
  - Ultimate goal: reduce bandwidth to DRAM
- L2 Cache: DRAM staging buffer + some instruction reuse
  - Ultimate goal: tolerate spikes in DRAM bandwidth
- If there is any spatial/temporal locality: use local memory (scratchpad)
Scratchpad Memory
- GPUs have scratchpads (Local Memory)
  - Allocated to a workgroup, i.e., shared by wavefronts in workgroup
- Separate address space
- Managed by software:
  - Rename address
  - Manage capacity -- manual fill/eviction
[Diagram: GPU Core with four SIMT Units, an L1 Cache, and Local Memory]
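A hedged OpenCL C sketch of the software-managed fill/use pattern (a 1D three-point average; names and sizes are illustrative, and boundary handling is simplified): the workgroup fills local memory manually, synchronizes, then reuses the staged data.

    #define WG_SIZE 64  /* one workgroup = one wavefront, for illustration */

    __kernel void smooth(__global const float *in, __global float *out)
    {
        __local float tile[WG_SIZE + 2];  /* scratchpad shared by workgroup */
        int gid = get_global_id(0);
        int lid = get_local_id(0);

        /* Manual fill: each work-item stages one element; edge work-items
         * also stage the halo (boundary clamping simplified here). */
        tile[lid + 1] = in[gid];
        if (lid == 0)           tile[0]           = in[gid];
        if (lid == WG_SIZE - 1) tile[WG_SIZE + 1] = in[gid];

        /* Wavefronts in a workgroup are not lockstep with each other:
         * a local barrier makes the fill visible to all of them. */
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Reuse: three reads from fast local memory instead of DRAM. */
        out[gid] = (tile[lid] + tile[lid + 1] + tile[lid + 2]) / 3.0f;
    }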
Example System: Radeon HD 7970
High-end part
- 32 Compute Units: 81,920 active work-items
- 32 CUs x 4 SIMT Units x 16 ALUs = 2048 max FP ops/cycle
- 264 GB/s max memory bandwidth
- 925 MHz engine clock
- 3.79 TFLOPS single precision (accounting trickery: FMA)
  - 2048 FP ops/cycle x 0.925 GHz = 1.89 TFLOPS; counting each fused multiply-add as two ops doubles it to 3.79 TFLOPS
- 210W max power (chip)
- >350W max power (card)
- 100W idle power (card)
46
Radeon
HD
7990
-
Cooking
Two
7970s
on
one
card:
375W
(AMD
Ocial)
450W
(OEM)
GPU
ARCHITECTURES:
A
CPU
PERSPECTIVE
47
A
Rose
by
Any
Other
Name
GPU
ARCHITECTURES:
A
CPU
PERSPECTIVE
48
Terminology Headaches #2-5

Nvidia/CUDA              | AMD/OpenCL         | Derek's CPU Analogy
CUDA Processor           | Processing Element | Lane
CUDA Core                | SIMD Unit          | Pipeline
Streaming Multiprocessor | Compute Unit       | GPU Core
GPU Device               | GPU Device         | Device
Terminology Headaches #6-9

CUDA/Nvidia | OpenCL/AMD | Henn&Patt
Thread      | Work-item  | Sequence of SIMD Lane Operations
Warp        | Wavefront  | Thread of SIMD Instructions
Block       | Workgroup  | Body of vectorized loop
Grid        | NDRange    | Vectorized loop
Terminology Headache #10
- GPUs have scratchpads (Local Memory)
  - Allocated to a workgroup, i.e., shared by wavefronts in workgroup
  - Separate address space, managed by software: rename address, manage capacity (manual fill/eviction)
Nvidia calls Local Memory "Shared Memory".
AMD sometimes calls it "Group Memory".
Recap
- Data Parallelism: identical, independent work over multiple data inputs
  - GPU version: add streaming access pattern
- Data Parallel Execution Models: MIMD, SIMD, SIMT
- GPU Execution Model: Multicore Multithreaded SIMT
- OpenCL Programming Model: NDRange over workgroup/wavefront
- Modern GPU Microarchitecture: AMD Graphics Core Next (GCN)
  - Compute Unit (GPU Core): 4 SIMT Units
  - SIMT Unit (GPU Pipeline): 16-wide ALU pipe (16x4 execution)
  - Memory: designed to stream
- GPUs: great for data parallelism. Bad for everything else.
Advanced Topics
GPU Limitations, Future of GPGPU
Choose Your Own Adventure!
- SIMT Control Flow & Branch Divergence
- Memory Divergence
- When GPUs talk
  - Wavefront communication
  - GPU coherence
  - GPU consistency
- Future of GPUs: What's next?
SIMT Control Flow
- Consider a SIMT conditional branch:
  - One PC
  - Multiple data (i.e., multiple conditions)

    if (x <= 0)
        y = 0;
    else
        y = x;
SIMT Control Flow
- Work-items in a wavefront run in lockstep
  - Don't all have to commit
- Branching through predication
  - Active lane: commit result
  - Inactive lane: throw away result
- This is branch divergence. The execution mask tracks which lanes are active:

    if (x <= 0)   // All lanes active at start:       1111
        y = 0;    // Branch sets execution mask:      1000
    else          // Else inverts execution mask:     0111
        y = x;    // Converge resets execution mask:  1111
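A hedged C sketch of what the hardware effectively does (a 4-lane wavefront modeled with plain arrays; names are illustrative): both paths are issued over all lanes, and the mask decides which lanes commit.

    #define LANES 4

    /* Emulate predicated execution of: if (x <= 0) y = 0; else y = x; */
    void simt_branch(const int x[LANES], int y[LANES])
    {
        int mask[LANES];

        for (int l = 0; l < LANES; l++)  /* evaluate condition per lane  */
            mask[l] = (x[l] <= 0);

        for (int l = 0; l < LANES; l++)  /* "then" path: active lanes    */
            if (mask[l]) y[l] = 0;       /* commit, inactive lanes skip  */

        for (int l = 0; l < LANES; l++)  /* "else" path: inverted mask   */
            if (!mask[l]) y[l] = x[l];

        /* Converge: mask resets to all-active. Note that both paths
         * consumed execution cycles regardless of the data. */
    }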
Branch Divergence
- When control flow diverges, all lanes take all paths
- Divergence Kills Performance
Beware!
Divergence isn't just a performance problem:

    __global int lock = 0;

    void mutex_lock()
    {
        // acquire lock -- test_and_set() is pseudocode that
        // returns true once the lock has been acquired
        while (test_and_set(&lock, 1) == false) {
            // spin
        }
        return;
    }

Deadlock: work-items can't enter the mutex together! The one lane that acquires the lock cannot proceed until its lockstep wavefront-mates converge, and they spin forever waiting for the lock.
Memory Bandwidth
[Diagram: four SIMT lanes (Lane 0-3) connected to four DRAM banks (Bank 0-3)]
- Parallel access: each lane touches a different bank -- all requests proceed in parallel
- Sequential access: lanes collide on the same banks -- requests are serialized
- The serialized case is memory divergence
Memory Divergence
- One work-item stalls -> entire wavefront must stall
  - Cause: bank conflicts, cache misses
- Data layout & partitioning is important
- Divergence Kills Performance
Communication and Synchronization
- Work-items can communicate with:
  - Work-items in same wavefront
    - No special sync needed -- they are lockstep!
  - Work-items in different wavefront, same workgroup (local)
    - Local barrier
  - Work-items in different wavefront, different workgroup (global)
    - OpenCL 1.x: Nope
    - OpenCL 2.x: Yes, but...
    - CUDA 4.x: Yes, but complicated
GPU Consistency Models
- Very weak guarantee:
  - Program order respected within single work-item
  - All other bets are off
- Safety net: Fence -- make sure all previous accesses are visible before proceeding
  - Built-in barriers are also fences
- A wrench: GPU fences are scoped -- only apply to subset of work-items in system
  - E.g., local barrier
- Take-away: area of active research
  - See Hower, et al. "Heterogeneous-race-free Memory Models", ASPLOS 2014
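A hedged OpenCL C sketch of scoped fencing (the flag-publishing protocol is illustrative; whether the __global write becomes visible to other workgroups is exactly the kind of guarantee that varies by GPU and OpenCL version): each fence orders this work-item's accesses only within its scope.

    __kernel void publish(__global int *data, __global volatile int *flag,
                          __local int *scratch)
    {
        int lid = get_local_id(0);

        scratch[lid] = lid;            /* write to local memory            */
        barrier(CLK_LOCAL_MEM_FENCE);  /* built-in barrier is also a fence,
                                          scoped to this workgroup only    */
        data[get_global_id(0)] = scratch[lid];

        mem_fence(CLK_GLOBAL_MEM_FENCE);  /* order the data write before the
                                             flag write, for observers that
                                             this fence's scope reaches     */
        if (lid == 0)
            *flag = 1;
    }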
GPU Coherence?
- Notice: the GPU consistency model does not require coherence
  - i.e., Single Writer, Multiple Reader
  - Marketing claims they are coherent...
- GPU "coherence":
  - Nvidia: disable private caches
  - AMD: flush/invalidate entire cache at fences
GPU Architecture Research
- Blending with CPU architecture:
  - Dynamic scheduling / dynamic wavefront re-org
  - Work-items have more locality than we think
- Tighter integration with CPU on SOC:
  - Fast kernel launch
  - Exploit fine-grained parallel region (remember Amdahl's law)
  - Common shared memory
- Reliability:
  - Historically: Who notices a bad pixel?
  - Future: GPU compute demands correctness
- Power:
  - Mobile, mobile, mobile!!!
Computer Economics 101
- GPU Compute is cool + gaining steam, but...
  - is a $0 billion industry (to quote Mark Hill)
- GPU design priorities:
  1. Graphics
  2. Graphics
  ...
  N-1. Graphics
  N. GPU Compute
- Moral of the story: GPU won't become a CPU (nor should it)