
GRIT: Enhancing Multi-GPU Performance with Fine-Grained Dynamic Page Placement

Yueqi Wang*¹, Bingyao Li*¹, Aamer Jaleel², Jun Yang¹, Xulong Tang¹
¹University of Pittsburgh, ²NVIDIA
Multi-GPU is Popular

[Figure: application domains: Graph Processing, DNN, ChatGPT, Datacenter Workloads [1]]

• Ever-growing application complexity and input dataset sizes.
• Multi-GPU is here! (NVIDIA DGX, Intel Xe)

[1] https://www.cerebras.net/blog/harnessing-the-power-of-sparsity-for-large-gpt-ai-models
3
UVM for Multi-GPU
• Growing trend of multi-GPUs leveraging Unified Virtual Memory (UVM)
• Global pointer to devices
• Simplify development & deployment
• Heterogeneity
• Compatibility

[Figure: GPU, GPU, …, GPU, …, CPU, CPU, … all sharing one Unified Memory]

4
Multi-GPU Scalability

[Figure: speedup (y-axis 0.0x to 4.0x) for 1-GPU, 2-GPUs, 4-GPUs, …; data labels 1.8x and 2.4x [2]]

Why performance gap:
• NUMA data access
• Data transfer
• Address translation

[2] https://developer.nvidia.com/blog/easy-multi-gpu-deep-learning-digits-2/
5
Multi-GPU Page Placement Schemes
1. On-Touch Migration: Request, Migrate.

[Figure: a GPU requests a page that resides on another GPU (GPU0, GPU1); the UVM driver migrates the page to the requesting GPU]

• Every memory access is local
• Ping-pong effect
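The ping-pong effect can be seen in a minimal sketch that replays accesses to a single page under on-touch migration. The function name `on_touch` and the trace format (a list of requesting GPU ids) are illustrative assumptions, not part of the slides.

```python
def on_touch(trace, owner=0):
    """Replay accesses to one page under on-touch migration.

    trace: list of GPU ids, one per access (hypothetical format).
    Returns (migrations, remote_accesses).
    """
    migrations = 0
    for gpu in trace:
        if gpu != owner:   # page is not resident on the requester: fault
            owner = gpu    # the driver migrates the page to the requester
            migrations += 1
    # After each migration the access is served locally, so no remote accesses.
    return migrations, 0


# Two GPUs alternating on one page: every switch of requester migrates the page.
print(on_touch([0, 1, 0, 1, 0, 1]))  # (5, 0)
# A page used only by GPU1 migrates once and then stays local.
print(on_touch([1, 1, 1, 1, 1, 1]))  # (1, 0)
```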
6
Multi-GPU Page Placement Schemes
2. Access Counter-based Migration: Reaching threshold, Migrate.

[Figure: GPU1 requests a page that resides on GPU0; each request adds +1 to the page's access counter; once the counter exceeds the threshold, the UVM driver migrates the page]

• Reduce ping-pong migration
• Substantial remote access
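A rough sketch of the trade-off, replaying accesses to one page. The per-GPU counters, the `>=` comparison, the counter reset after migration, and the `threshold` value are all assumptions made for illustration; the slide only states that a page migrates once its access counter reaches the threshold.

```python
from collections import defaultdict


def access_counter(trace, threshold, owner=0):
    """Replay accesses to one page under access counter-based migration.

    trace: list of GPU ids, one per access (hypothetical format).
    Returns (migrations, remote_accesses).
    """
    counters = defaultdict(int)  # remote accesses per requesting GPU (assumed)
    migrations = remote = 0
    for gpu in trace:
        if gpu == owner:
            continue              # local access, nothing to count
        remote += 1               # served remotely over the interconnect
        counters[gpu] += 1
        if counters[gpu] >= threshold:
            owner = gpu           # threshold reached: migrate to this GPU
            counters.clear()      # assumed: counters restart after a migration
            migrations += 1
    return migrations, remote


# Alternating sharers never reach the threshold: no ping-pong, but remote accesses.
print(access_counter([0, 1, 0, 1, 0, 1], threshold=4))  # (0, 3)
# A page used mostly by GPU1 migrates only after four remote accesses.
print(access_counter([1, 1, 1, 1, 1, 1], threshold=4))  # (1, 4)
```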
7
Multi-GPU Page Placement Schemes
3. Page Duplication: Read, Duplicate.

[Figure, read: a GPU reads a page held by another GPU (GPU0, GPU1); the UVM driver duplicates the page]
[Figure, write: GPU0, GPU1, and GPU2 each hold a copy of the page; a write raises a protection fault and the UVM driver performs page invalidation]

• Reduce migration & remote access
• Significant collapsing overhead
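The collapsing overhead can be illustrated with a small sketch that tracks which GPUs hold a copy of one page. The trace format and the rule that the writer keeps the only valid copy after a collapse are assumptions for illustration.

```python
def duplication(trace, owner=0):
    """Replay accesses to one page under page duplication.

    trace: list of (gpu, is_write) tuples (hypothetical format).
    Returns (duplications, collapses).
    """
    holders = {owner}             # GPUs that currently hold a copy
    duplications = collapses = 0
    for gpu, is_write in trace:
        if not is_write:
            if gpu not in holders:
                holders.add(gpu)  # read miss: give the reader its own copy
                duplications += 1
        else:
            if len(holders) > 1:  # write to a duplicated page: protection fault
                collapses += 1    # the other copies must be invalidated
            holders = {gpu}       # assumed: the writer keeps the only copy
    return duplications, collapses


# Read-only sharing: two duplications, then every access is local.
print(duplication([(0, False), (1, False), (2, False), (1, False)]))  # (2, 0)
# Read-write sharing: every write collapses the copies made by earlier reads.
print(duplication([(1, False), (0, True), (1, False), (0, True)]))    # (2, 2)
```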
8
Performance of Page Placement Schemes

[Figure: normalized performance (y-axis 0.0 to 2.5) of On-touch, Access counter, and Duplication for BFS, BS, C2D, FIR, GEMM, MM, SC, ST]

No “one-size-fits-all” page placement scheme

9
Page Access Characteristics
• The page-sharing / read-write patterns vary within the same application

[Figure: two plots over time (x-axis 0 to 30, y-axes 0% to 100%). One shows the percentage of page accesses from GPU1, GPU2, GPU3, and GPU4 (private / shared); the other shows the percentage of page reads vs. writes.]

→ Need a dynamic page placement scheme that can accommodate variations in page access characteristics
10
Problem Summary

Problem:

Delivered performance is constrained by NUMA overhead

No “one-size-fits-all” page placement scheme

Goal:
Effectively reduce NUMA overhead in multi-GPU systems by determining the page placement scheme at runtime

11
GRIT (Fine-GRained dynamIc page placemenT)

Dynamically determine page placement scheme at runtime
• Scheme change metric

12
GRIT – Scheme Change Metric

[Figure: GPU0 and the UVM driver, two cases]
• Request → local page fault: the current scheme is unsuitable
• Write → page protection fault: page duplication should be avoided

Indicator: Number of page faults (local page fault & page protection fault)
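One way to picture this indicator is a per-page fault counter that signals a scheme change once it reaches a threshold. This is only a sketch: the class name, the threshold value of 3, and the reset after a trigger are assumptions, not details given on the slide.

```python
from collections import Counter

LOCAL_PAGE_FAULT = "local_page_fault"
PAGE_PROTECTION_FAULT = "page_protection_fault"


class FaultAwareInitiator:
    """Counts both fault types per page and flags pages that fault too often."""

    def __init__(self, threshold=3):       # threshold value is hypothetical
        self.threshold = threshold
        self.fault_count = Counter()       # virtual page number -> faults seen

    def on_fault(self, vpn, kind):
        """Record one fault; return True if a scheme change should start."""
        if kind not in (LOCAL_PAGE_FAULT, PAGE_PROTECTION_FAULT):
            raise ValueError(f"unknown fault kind: {kind}")
        self.fault_count[vpn] += 1
        if self.fault_count[vpn] >= self.threshold:
            self.fault_count[vpn] = 0      # assumed: restart counting afterwards
            return True
        return False


initiator = FaultAwareInitiator(threshold=3)
events = [LOCAL_PAGE_FAULT, PAGE_PROTECTION_FAULT, LOCAL_PAGE_FAULT]
print([initiator.on_fault(0xA00, kind) for kind in events])  # [False, False, True]
```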
13
GRIT: Dynamic Page Placement Scheme

Dynamically determine page placement scheme at runtime
• Scheme change metric → Fault-Aware Initiator
• How to track information?

14
GRIT – Page Attribute Table and Cache (PA-Table & Cache)

PA-Table (in CPU memory; costs an additional memory access):

VPN     Fault Counter   Read/Write
0xA00   10              0
0xA01   01              1
…       …               …

PA-Cache (Way 0 … Way 3; facilitates lookup): each entry holds VPT, FC, R/W
• Write back
• Write allocate
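A minimal sketch of how a small write-back, write-allocate cache could sit in front of the PA-Table. The LRU replacement, the number of sets, the dictionary-backed table, and the reading of the R/W bit as "page has been written" are assumptions for illustration; the slide only shows a 4-way cache with write-back and write-allocate.

```python
from collections import OrderedDict


class PACache:
    """Set-associative cache of page attributes backed by a PA-Table dict."""

    def __init__(self, table, num_sets=4, ways=4):
        self.table = table                 # PA-Table: vpn -> [fault_count, rw]
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.table_accesses = 0            # accesses that go to CPU memory

    def _lookup(self, vpn):
        lines = self.sets[vpn % len(self.sets)]
        if vpn in lines:                   # hit: refresh LRU position
            lines.move_to_end(vpn)
            return lines[vpn]
        if len(lines) >= self.ways:        # set full: evict the LRU line
            victim, (victim_attrs, dirty) = lines.popitem(last=False)
            if dirty:                      # write back only modified lines
                self.table[victim] = victim_attrs
                self.table_accesses += 1
        self.table_accesses += 1           # write allocate: fetch from the table
        lines[vpn] = [list(self.table.get(vpn, [0, 0])), False]
        return lines[vpn]

    def record_fault(self, vpn, is_write):
        """Update the cached attributes of vpn; return its new fault count."""
        line = self._lookup(vpn)
        line[0][0] += 1                    # fault counter
        line[0][1] |= int(is_write)        # assumed: 1 once the page is written
        line[1] = True                     # mark dirty
        return line[0][0]


table = {}
cache = PACache(table, num_sets=1, ways=2)
cache.record_fault(0xA00, is_write=False)
cache.record_fault(0xA00, is_write=True)   # hit: no PA-Table access
cache.record_fault(0xA01, is_write=False)
cache.record_fault(0xA02, is_write=False)  # evicts 0xA00 and writes it back
print(table, cache.table_accesses)         # {2560: [2, 1]} 4
```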
15
GRIT: Dynamic Page Placement Scheme

Dynamically determine page placement scheme at runtime
• Scheme change metric → Fault-Aware Initiator
• How to track information? → PA-Table & PA-Cache
• Which scheme to change to?

16
GRIT – Which Scheme

Host PTE bits:
63      XD
62:54   Unused Bits (UB)
53:52   (unlabeled)
51:12   4 KB Page Frame Number (PFN)
11      UB
10:9    Scheme Bits

[Flowchart: On-touch → get access information from PA-Table / PA-Cache → FC reaches threshold? → (True) Change Scheme → All read? → (True) Duplication / (False) Access-counter]

Scheme information is stored in host PTE.
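The decision flow and the scheme bits can be sketched together as follows. Only the bit position (PTE bits 10:9) and the branch structure come from the slide; the 2-bit encoding of each scheme, the function names, and the threshold value are hypothetical.

```python
SCHEME_SHIFT = 9
SCHEME_MASK = 0b11 << SCHEME_SHIFT   # PTE bits 10:9 hold the scheme

# Hypothetical encoding; the slide does not say which value means which scheme.
ON_TOUCH, ACCESS_COUNTER, DUPLICATION = 0b00, 0b01, 0b10


def get_scheme(pte):
    """Extract the 2-bit scheme field from a 64-bit PTE value."""
    return (pte & SCHEME_MASK) >> SCHEME_SHIFT


def set_scheme(pte, scheme):
    """Return the PTE with only its scheme field replaced."""
    return (pte & ~SCHEME_MASK) | ((scheme & 0b11) << SCHEME_SHIFT)


def decide(pte, fault_count, all_reads, threshold=3):
    """Apply the flowchart: change scheme only when FC reaches the threshold."""
    if fault_count < threshold:
        return pte                              # keep the current scheme
    new_scheme = DUPLICATION if all_reads else ACCESS_COUNTER
    return set_scheme(pte, new_scheme)


pte = (0xABCDE << 12) | 0b111                   # PFN plus some low flag bits
read_only = decide(pte, fault_count=3, all_reads=True)
read_write = decide(pte, fault_count=3, all_reads=False)
print(get_scheme(pte), get_scheme(read_only), get_scheme(read_write))  # 0 2 1
```

Keeping the scheme in otherwise unused PTE bits means the rest of the entry, including the PFN, is untouched by a scheme change.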

17
GRIT: Dynamic Page Placement Scheme

Dynamically determine page placement scheme at runtime
• Scheme change metric → Fault-Aware Initiator
• How to track information? → PA-Table & PA-Cache
• Which scheme to change to? → Scheme Decision Mechanism

18
GRIT: Dynamic Page Placement Scheme

[Figure: normalized performance (y-axis 0.0 to 3.0) of On-touch, Access counter, Duplication, and GRIT-Dynamic for BFS, BS, C2D, FIR, GEMM, MM, SC, ST]

Performance gap due to the improper scheme in use before a scheme change is triggered

19
Page Attributes Characterization
• The neighboring pages tend to exhibit similar access attributes

Proactively determine page placement scheme for neighboring pages

20
GRIT: Neighboring-Aware Prediction

Dynamically determine page placement scheme at runtime
• Scheme change metric → Fault-Aware Initiator
• How to track information? → PA-Table & PA-Cache
• Which scheme to change to? → Scheme Decision Mechanism
• Proactively decide neighboring scheme → Neighboring-Aware Prediction

21
GRIT: Neighboring-Aware Prediction

PTE bits:
63      XD
62:54   Unused Bits (UB)
53:52   Group Bits

Page table (VPN followed by 2-bit fields, Scheme first):
0xF000   00   00   00
0xF001   00   00   00
0xF002   00   00   00
0xF003   00   00   00
0xF004   01   01   00
0xF005   01   01   00
0xF006   01   01   00
0xF007   10   00   00

Promote → 32KB page group
Recursively promote → 256KB page group
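The grouping can be sketched as a two-level vote: eight 4 KB pages form a 32KB group, and eight such groups form a 256KB group. The promotion rule used here (a group adopts a scheme once at least `min_agree` of its eight children share it, overriding the rest) is an assumption chosen for illustration; the slide does not state the exact condition.

```python
from collections import Counter

CHILDREN_PER_GROUP = 8   # 32KB / 4KB pages, and 256KB / 32KB groups


def promote_group(children, min_agree):
    """Return the scheme shared by at least min_agree children, else None."""
    votes = Counter(c for c in children if c is not None)
    if not votes:
        return None
    scheme, count = votes.most_common(1)[0]
    return scheme if count >= min_agree else None


def predict(page_schemes, min_agree=5):
    """Promote schemes within one 256KB region of 64 pages.

    page_schemes: list of 64 entries, a scheme id or None if undecided.
    min_agree must exceed half of 8 so the winning scheme is unambiguous.
    """
    out = list(page_schemes)
    group_schemes = []
    for start in range(0, len(out), CHILDREN_PER_GROUP):
        scheme = promote_group(out[start:start + CHILDREN_PER_GROUP], min_agree)
        group_schemes.append(scheme)
        if scheme is not None:                      # promote to the 32KB group
            out[start:start + CHILDREN_PER_GROUP] = [scheme] * CHILDREN_PER_GROUP
    top = promote_group(group_schemes, min_agree)   # recursively promote
    if top is not None:
        out = [top] * len(out)                      # whole 256KB group
    return out


# Five pages of the first 32KB group agree: the group adopts their scheme.
result = predict([1] * 5 + [None] * 59)
print(result[:8], result[8:10])   # [1, 1, 1, 1, 1, 1, 1, 1] [None, None]
```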
22
GRIT – Put All Together

[Flowchart: on a fault, the UVM driver asks "Fault reach threshold?". Yes: change scheme, then resolve fault. No: resolve fault.]

23
Methodology

• Simulator: MGPUSim [ISCA ’19]

• Workloads: 8 applications from the Hetero-Mark, AMD APP SDK, SHOC, and DNNMark benchmark suites, including random, adjacent, and scatter-gather access patterns.

Detailed page placement scheme modeling is described in the paper

24
Evaluation – Overall Performance

[Figure: normalized performance (y-axis 0.0 to 3.0) of On-touch, Access counter, Duplication, and GRIT for BFS, BS, C2D, FIR, GEMM, MM, SC, ST, and the average]

GRIT achieves 60%, 49%, and 29% performance improvements over uniformly employing the on-touch, access counter-based, and page duplication schemes, respectively.
25
Evaluation – Scheme Breakdown

[Figure: scheme percentage (y-axis 0% to 100%) of On-touch, Access counter, and Duplication for BFS, BS, C2D, FIR, GEMM, MM, SC, ST]

GRIT is able to distinguish page attributes and consistently select the most suitable scheme accordingly.
26
Summary
Problem: NUMA overheads in multi-GPU systems
• No “one-size-fits-all” page placement scheme
GRIT:
A. Dynamic page placement scheme determines schemes in a fine-grained manner
B. Neighboring-aware prediction proactively determines adjacent page scheme

Improves performance by 60% on average over uniformly employing on-touch migration.

27
Thanks! Q&A
GRIT: Enhancing Multi-GPU Performance with Fine-Grained Dynamic Page Placement

Yueqi Wang*¹, Bingyao Li*¹, Aamer Jaleel², Jun Yang¹, Xulong Tang¹
¹University of Pittsburgh, ²NVIDIA

28
