GRIT: Enhancing Multi-GPU Performance with
Fine-Grained Dynamic Page Placement
Yueqi Wang*¹, Bingyao Li*¹, Aamer Jaleel², Jun Yang¹, Xulong Tang¹
¹University of Pittsburgh, ²NVIDIA
Multi-GPU is Popular
• Graph Processing, DNN, ChatGPT, Datacenter Workloads
• Multi-GPU is here! (NVIDIA DGX, Intel Xe)
• Ever-growing application complexity and input dataset sizes.
[1] https://www.cerebras.net/blog/harnessing-the-power-of-sparsity-for-large-gpt-ai-models
UVM for Multi-GPU
• Growing trend of multi-GPUs leveraging Unified Virtual Memory (UVM)
• Global pointer to devices
• Simplify development & deployment
• Heterogeneity
• Compatibility
[Figure: GPU, GPU, …, GPU, …, CPU, CPU, … all attached to one Unified Memory]
Multi-GPU Scalability
[Figure: bar chart, y-axis 0.0x to 4.0x, bars for 1-GPU, 2-GPUs, 4-GPUs, …; bar labels 1.8x and 2.4x]
Why performance gap:
• NUMA data access
• Data transfer
• Address translation
[2] https://developer.nvidia.com/blog/easy-multi-gpu-deep-learning-digits-2/
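To make the scaling gap concrete, parallel efficiency (speedup divided by GPU count) can be computed from the chart labels. This is only a sketch: it assumes the 1.8x label belongs to the 2-GPU bar and the 2.4x label to the 4-GPU bar, which the extracted slide does not state explicitly, and the function name is illustrative.

```python
def parallel_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal linear scaling that is actually achieved."""
    return speedup / num_gpus


# Assumed reading of the chart: 1.8x on 2 GPUs and 2.4x on 4 GPUs.
assumed_speedups = {1: 1.0, 2: 1.8, 4: 2.4}

for gpus, speedup in assumed_speedups.items():
    eff = parallel_efficiency(speedup, gpus)
    print(f"{gpus} GPU(s): {speedup:.1f}x speedup, {eff:.0%} efficiency")
```

Under that reading, efficiency drops from 90% at 2 GPUs to 60% at 4 GPUs, which is the gap the three listed overheads are meant to explain.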
Multi-GPU Page Placement Schemes
1. On-Touch Migration: Request, Migrate.
[Figure: a GPU requests a page held by another GPU; the UVM driver migrates the page to the requesting GPU]
• Every memory access is local
• Ping-Pong effect
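The ping-pong effect can be illustrated with a toy model of a single page. This is a minimal sketch, not driver code: the class and function names are made up for illustration, and the access traces are invented.

```python
class OnTouchPage:
    """One page under on-touch migration (toy model)."""

    def __init__(self, owner):
        self.owner = owner      # GPU that currently holds the page
        self.migrations = 0     # how many times the page has moved

    def access(self, gpu):
        # A touch by a non-owner GPU faults and the page migrates to it,
        # so the access itself is then served locally.
        if gpu != self.owner:
            self.owner = gpu
            self.migrations += 1


def run_on_touch(trace, owner=0):
    """Replay a trace of GPU ids and return the number of migrations."""
    page = OnTouchPage(owner)
    for gpu in trace:
        page.access(gpu)
    return page.migrations


private_trace = [0] * 8      # only GPU0 touches the page
shared_trace = [0, 1] * 4    # GPU0 and GPU1 alternate

print(run_on_touch(private_trace))  # no migrations
print(run_on_touch(shared_trace))   # one migration per ownership change
```

A private page never moves, while the alternating trace migrates the page on seven of its eight accesses.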
Multi-GPU Page Placement Schemes
2. Access Counter-based Migration: Reaching threshold, Migrate.
[Figure: GPU0 requests a page held by GPU1; each request adds 1 to the page's access counter; once the counter reaches the threshold, the UVM driver migrates the page to GPU0]
• Reduce Ping-Pong migration
• Substantial remote access
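A toy model shows the trade-off between fewer migrations and more remote accesses. The sketch below assumes one counter per remote GPU, a counter reset after migration, and an arbitrary threshold of 4; none of these details come from the slide, and all names are illustrative.

```python
class CountedPage:
    """One page under access counter-based migration (toy model)."""

    def __init__(self, owner, threshold):
        self.owner = owner
        self.threshold = threshold   # hypothetical value, chosen by the caller
        self.counters = {}           # remote GPU id -> remote accesses so far
        self.remote_accesses = 0
        self.migrations = 0

    def access(self, gpu):
        if gpu == self.owner:
            return "local"
        # Non-owner access: served remotely and counted.
        self.remote_accesses += 1
        self.counters[gpu] = self.counters.get(gpu, 0) + 1
        if self.counters[gpu] >= self.threshold:
            # Threshold reached: migrate and start counting afresh (assumed).
            self.owner = gpu
            self.counters.clear()
            self.migrations += 1
            return "migrate"
        return "remote"


page = CountedPage(owner=0, threshold=4)
outcomes = [page.access(gpu) for gpu in [0, 1] * 4]
print(outcomes)
print(page.migrations, page.remote_accesses)
```

On the alternating trace, the page migrates once instead of seven times under on-touch, at the cost of four accesses that travel over the interconnect.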
Multi-GPU Page Placement Schemes
3. Page Duplication: Read, Duplicate.
[Figure, left: on a read, the UVM driver duplicates the page so that GPU0 and GPU1 each hold a copy. Right: on a write, a page protection fault reaches the UVM driver, which invalidates the copies held by the other GPUs]
• Reduce migration & remote access
• Significant collapsing overhead
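The read and write paths can be sketched with a set of copy holders per page. This is a simplified model under stated assumptions: every write that finds copies on other GPUs counts as one collapse, and the class and method names are hypothetical.

```python
class DuplicatedPage:
    """One page under page duplication (toy model)."""

    def __init__(self, owner):
        self.holders = {owner}   # GPUs that currently hold a copy
        self.duplications = 0
        self.collapses = 0

    def read(self, gpu):
        # A read by a GPU without a copy duplicates the page to it.
        if gpu not in self.holders:
            self.holders.add(gpu)
            self.duplications += 1

    def write(self, gpu):
        # A write while other GPUs hold copies raises a protection fault;
        # the other copies are invalidated and only the writer keeps one.
        if self.holders != {gpu}:
            self.holders = {gpu}
            self.collapses += 1


page = DuplicatedPage(owner=0)
page.read(1)
page.read(2)
page.write(0)    # collapses the copies on GPU1 and GPU2
page.write(0)    # GPU0 is now the only holder, so no fault
print(page.duplications, page.collapses, sorted(page.holders))
```

Read-only sharing costs one duplication per GPU, whereas interleaving writes with remote reads forces a collapse followed by fresh duplications.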
Performance of Page Placement Schemes
[Figure: normalized performance (y-axis 0.0 to 2.5) of On-touch, Access counter, and Duplication on BFS, BS, C2D, FIR, GEMM, MM, SC, ST]
No “one-size-fits-all” page placement scheme
Page Access Characteristics
The page-sharing / read-write patterns vary within the same application
[Figure: two plots over time (x-axis 0 to 30, y-axis 0% to 100%): percentage of page accessing by GPU1, GPU2, GPU3, GPU4 (Private / Share), and percentage of page read/write (Read / Write)]
A dynamic page placement scheme that can accommodate variations in page access characteristics
Problem Summary
Problem:
Delivered performance is constrained by NUMA overhead
No “one-size-fits-all” page placement scheme
Goal:
Effectively reduce NUMA overhead in multi-GPU by determining page placement scheme at runtime
GRIT (Fine-GRained dynamIc page placemenT)
Dynamically determine page placement scheme at runtime
• Scheme change metric
GRIT – Scheme Change Metric
[Figure: a request from GPU0 raises a local page fault at the UVM driver, indicating that the current scheme is unsuitable; a write raises a page protection fault, indicating that page duplication should be avoided; both fault types are marked "UP"]
Indicator: Number of page faults (local page fault & page protection fault)
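The fault-count indicator can be sketched as a per-page counter that fires once a threshold is reached. This is a sketch under assumptions: the threshold value, the reset after firing, and all identifiers are illustrative and not taken from the slides.

```python
from collections import Counter

LOCAL_PAGE_FAULT = "local_page_fault"
PAGE_PROTECTION_FAULT = "page_protection_fault"


class FaultCounter:
    """Counts both fault types per page and signals a scheme change."""

    def __init__(self, threshold):
        self.threshold = threshold      # hypothetical tuning parameter
        self.fault_count = Counter()    # VPN -> faults since the last signal

    def on_fault(self, vpn, kind):
        """Record one fault; return True when a scheme change should start."""
        if kind not in (LOCAL_PAGE_FAULT, PAGE_PROTECTION_FAULT):
            raise ValueError(f"unknown fault kind: {kind}")
        self.fault_count[vpn] += 1
        if self.fault_count[vpn] >= self.threshold:
            self.fault_count[vpn] = 0   # assumed: start counting afresh
            return True
        return False


counter = FaultCounter(threshold=3)
signals = [
    counter.on_fault(0xA00, LOCAL_PAGE_FAULT),
    counter.on_fault(0xA00, PAGE_PROTECTION_FAULT),
    counter.on_fault(0xA00, LOCAL_PAGE_FAULT),
]
print(signals)
```

Both fault types feed the same counter here, matching the single indicator named on the slide.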
GRIT: Dynamic Page Placement Scheme
Dynamically determine page placement scheme at runtime
• Scheme change metric: Fault-Aware Initiator
• How to track information?
GRIT – Page Attribute Table and Cache (PA-Table & Cache)
PA-Table (in CPU memory):
VPN      Fault Counter   Read/Write
0xA00    10              0
0xA01    01              1
…        …               …
PA-Cache: Way 0 … Way 3, each entry holding VPT, FC, R/W
• PA-Table alone: additional memory access
• PA-Cache: facilitate lookup (write back, write allocate)
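A small software model can show how a write-back, write-allocate cache in front of the PA-Table avoids table accesses. The four ways come from the slide; the number of sets, the LRU replacement policy, the reading of R/W as "1 once the page has been written", and all names are assumptions made for this sketch.

```python
from collections import OrderedDict


class PACache:
    """Set-associative, write-back, write-allocate cache over a PA-Table."""

    def __init__(self, pa_table, num_sets=2, ways=4):
        self.pa_table = pa_table    # backing store: VPN -> (fc, rw)
        self.num_sets = num_sets    # assumed; not given on the slide
        self.ways = ways
        # One OrderedDict per set, oldest entry first: VPN -> [fc, rw, dirty]
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.table_accesses = 0     # accesses that had to reach the PA-Table

    def _lookup(self, vpn):
        entries = self.sets[vpn % self.num_sets]
        if vpn in entries:
            entries.move_to_end(vpn)            # hit: refresh LRU position
            return entries[vpn]
        if len(entries) >= self.ways:
            victim, (fc, rw, dirty) = entries.popitem(last=False)
            if dirty:                           # write back on eviction only
                self.pa_table[victim] = (fc, rw)
                self.table_accesses += 1
        fc, rw = self.pa_table.get(vpn, (0, 0))  # allocate on miss
        self.table_accesses += 1
        entries[vpn] = [fc, rw, False]
        return entries[vpn]

    def read(self, vpn):
        fc, rw, _ = self._lookup(vpn)
        return fc, rw

    def record_fault(self, vpn, is_write):
        entry = self._lookup(vpn)
        entry[0] += 1                  # bump the fault counter
        entry[1] |= int(is_write)      # remember that a write was seen
        entry[2] = True                # mark dirty; the table is updated later


table = {0xA00: (2, 0), 0xA01: (1, 1)}
cache = PACache(table, num_sets=1, ways=4)
cache.record_fault(0xA00, is_write=True)
print(cache.read(0xA00), table[0xA00], cache.table_accesses)
```

After the update, the cached entry differs from the table entry, and the table only sees the new values when the entry is evicted.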
GRIT: Dynamic Page Placement Scheme
Dynamically determine page placement scheme at runtime
• Scheme change metric: Fault-Aware Initiator
• How to track information? PA-Table & PA-Cache
• Which scheme to change to?
GRIT – Which Scheme
Host PTE layout:
Bits     Field
63       XD
62:54    Unused Bits (UB)
53:52    Group Bits (labeled on the GRIT: Neighboring-Aware Prediction slide)
51:12    4 KB Page Frame Number (PFN)
11       UB
10:9     Scheme Bits
Scheme information is stored in host PTE.
[Flowchart: On-touch → get access information from PA-Table / PA-Cache → FC reach threshold? → True: Change Scheme → All read? → True: Duplication / False: Access-counter]
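The decision flow and the scheme bits can be sketched with plain bit operations. The bit positions 10:9 come from the slide; the 2-bit encoding of each scheme, the threshold, and the function names are hypothetical and chosen only for illustration.

```python
SCHEME_SHIFT = 9                      # scheme bits occupy PTE bits 10:9
SCHEME_MASK = 0b11 << SCHEME_SHIFT

# Hypothetical encoding; the slide does not state which value means what.
ON_TOUCH, ACCESS_COUNTER, DUPLICATION = 0b00, 0b01, 0b10


def get_scheme(pte):
    """Extract the 2-bit scheme field from a 64-bit PTE value."""
    return (pte & SCHEME_MASK) >> SCHEME_SHIFT


def set_scheme(pte, scheme):
    """Return the PTE with only the scheme field replaced."""
    return (pte & ~SCHEME_MASK) | ((scheme & 0b11) << SCHEME_SHIFT)


def decide_scheme(pte, fault_count, all_read, threshold):
    """Apply the flowchart: change scheme only once FC reaches the threshold."""
    if fault_count < threshold:
        return pte                                   # keep the current scheme
    new_scheme = DUPLICATION if all_read else ACCESS_COUNTER
    return set_scheme(pte, new_scheme)


pte = (0x12345 << 12) | 0x1           # example PFN plus a low flag bit
read_only = decide_scheme(pte, fault_count=4, all_read=True, threshold=4)
read_write = decide_scheme(pte, fault_count=4, all_read=False, threshold=4)
print(get_scheme(pte), get_scheme(read_only), get_scheme(read_write))
```

Because only bits 10:9 are rewritten, the PFN and the remaining flag bits of the entry are left untouched.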
GRIT: Dynamic Page Placement Scheme
Dynamically determine page placement scheme at runtime
• Scheme change metric: Fault-Aware Initiator
• How to track information? PA-Table & PA-Cache
• Which scheme to change to? Scheme Decision Mechanism
GRIT: Dynamic Page Placement Scheme
[Figure: normalized performance (y-axis 0.0 to 3.0) of On-touch, Access counter, Duplication, and GRIT-Dynamic on BFS, BS, C2D, FIR, GEMM, MM, SC, ST]
Performance gap due to the improper scheme in use before a scheme change is triggered
Page Attributes Characterization
Neighboring pages tend to exhibit similar access attributes
Proactively determine page placement scheme for neighboring pages
GRIT: Neighboring-Aware Prediction
Dynamically determine page placement scheme at runtime
• Scheme change metric: Fault-Aware Initiator
• How to track information? PA-Table & PA-Cache
• Which scheme to change to? Scheme Decision Mechanism
Proactively decide neighboring scheme
• Neighboring-Aware Prediction
GRIT: Neighboring-Aware Prediction
Host PTE fields: bit 63 XD; bits 62:54 Unused Bits (UB); bits 53:52 Group Bits
Page Table (VPN followed by three 2-bit columns as shown on the slide):
0xF000   00   00   00
0xF001   00   00   00
0xF002   00   00   00
0xF003   00   00   00
0xF004   01   01   00
0xF005   01   01   00
0xF006   01   01   00
0xF007   10   00   00
…
Labels on the figure: VPN, Scheme, Promote, Recursively Promote, 32KB page group, 256KB page group
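Group promotion can be sketched as a two-level vote over per-page schemes. The group sizes follow from the slide (32 KB is 8 pages of 4 KB, 256 KB is 8 such groups), but the promotion rule used here, a strict majority at each level, is a hypothetical stand-in because the slide does not state the actual condition. All names are illustrative.

```python
from collections import Counter

PAGES_PER_GROUP = 8      # 32 KB group / 4 KB page
GROUPS_PER_REGION = 8    # 256 KB region / 32 KB group


def majority(schemes):
    """Return the scheme held by more than half of the entries, else None."""
    scheme, count = Counter(schemes).most_common(1)[0]
    return scheme if 2 * count > len(schemes) else None


def promote(page_schemes):
    """Promote schemes within one 256 KB region (64 per-page schemes)."""
    if len(page_schemes) != PAGES_PER_GROUP * GROUPS_PER_REGION:
        raise ValueError("expected 64 per-page schemes")
    out = list(page_schemes)
    group_schemes = []
    # Level 1: a 32 KB group adopts its majority scheme, if it has one.
    for start in range(0, len(out), PAGES_PER_GROUP):
        winner = majority(out[start:start + PAGES_PER_GROUP])
        if winner is not None:
            out[start:start + PAGES_PER_GROUP] = [winner] * PAGES_PER_GROUP
        group_schemes.append(winner)
    # Level 2 (recursive step): the 256 KB region adopts the majority
    # scheme of its groups, provided every group was promoted.
    if None not in group_schemes:
        region_winner = majority(group_schemes)
        if region_winner is not None:
            out = [region_winner] * len(out)
    return out


region = [0, 0, 0, 0, 0, 1, 1, 2] + [1] * 16 + [0] * 40
print(promote(region) == [0] * 64)
```

In this example, the first group settles on scheme 0, and because six of the eight groups then agree, the whole region is promoted to scheme 0.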
GRIT – Put All Together
[Flowchart: a fault arrives at the UVM Driver → Fault reach threshold? → Yes: change scheme, then resolve fault / No: resolve fault]
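The combined flow can be sketched as a single fault handler that returns the actions it takes. This is a minimal sketch: the threshold of 4, the counter reset after a change, and the names `PageState` and `handle_fault` are assumptions, and the scheme choice reuses the all-read test from the decision flowchart.

```python
from dataclasses import dataclass


@dataclass
class PageState:
    scheme: str = "on-touch"
    fault_count: int = 0
    all_read: bool = True     # becomes False once any write is seen


def handle_fault(page, is_write, threshold=4):
    """Return the ordered list of actions taken for one fault."""
    actions = []
    page.fault_count += 1
    page.all_read = page.all_read and not is_write
    if page.fault_count >= threshold:
        # Yes branch: change the scheme before resolving the fault.
        page.scheme = "duplication" if page.all_read else "access-counter"
        page.fault_count = 0  # assumed: counting restarts after a change
        actions.append("change scheme to " + page.scheme)
    # Both branches end by resolving the fault.
    actions.append("resolve fault")
    return actions


page = PageState()
for is_write in [False, False, False, False]:
    print(handle_fault(page, is_write))
```

The first three faults are only resolved, and the fourth triggers the scheme change before being resolved.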
Methodology
• Simulator: MGPUSim [ISCA ’19]
• Workloads: 8 applications from the Hetero-Mark, AMD APP SDK, SHOC, and DNNMark benchmark suites, including random, adjacent, and scatter-gather access patterns.
• Detailed page placement scheme modeling is described in the paper.
Evaluation – Overall Performance
[Figure: normalized performance (y-axis 0.0 to 3.0) of On-touch, Access counter, Duplication, and GRIT on BFS, BS, C2D, FIR, GEMM, MM, SC, ST, and their average (Ave.)]
GRIT achieves 60%, 49%, and 29% performance improvements compared to uniformly employing the on-touch, access counter-based, and page duplication schemes, respectively.
Evaluation – Scheme Breakdown
[Figure: scheme percentage (y-axis 0% to 100%) of On-touch, Access counter, and Duplication for BFS, BS, C2D, FIR, GEMM, MM, SC, ST]
GRIT is able to distinguish page attributes and consistently select the most suitable scheme accordingly.
Summary
Problem: NUMA overheads in multi-GPU systems
• No “one-size-fits-all” page placement scheme
GRIT:
A. Dynamic page placement scheme determines schemes in a fine-grained manner
B. Neighboring-aware prediction proactively determines adjacent page scheme
Improves performance by 60% on average over uniformly employing on-touch.
Thanks! Q&A