S62256 - Demystify CUDA Debugging and Performance with Powerful Developer Tools

Demystify CUDA Debugging and Performance with Powerful Developer Tools

Jackson Marusarz

Agenda

• High-level tools ecosystem overview

• For each tool:

• Brief description and feature overview
• New features in the latest releases and the problems they help solve

• Current and Future Areas of Focus

• Additional Resources / Q&A

https://developer.nvidia.com/tools-overview
Developer Tools Ecosystem
Debuggers: cuda-gdb, Nsight Visual Studio Edition, Nsight Visual Studio Code Edition
Profilers: Nsight Systems, Nsight Compute, CUPTI, NVIDIA Tools eXtension (NVTX)
Correctness Checker: Compute Sanitizer
IDE integrations: Nsight Visual Studio Code Edition, Nsight Visual Studio Edition, Nsight Eclipse Edition
Compute Debuggers and IDEs
Compute Debuggers
Debug GPU Kernels Running on Device

• CUDA GDB
• CPU + GPU CUDA kernel debugger
• Supports stepping, breakpoints, in-line functions, variable inspection etc…
• Built on GDB and uses many of the same CLI commands
• Local/Remote connection support
• Nsight Visual Studio Edition
• IDE integration for Visual Studio
• Build and Debug CPU+GPU code from Visual Studio
• Nsight Visual Studio Code Edition
• New IDE integration for VS Code
• Build and Debug CPU+GPU code from Visual Studio Code
• Remotely target Linux targets from Windows or Linux
• Nsight Eclipse Edition
• IDE integration for Eclipse
• Build and Debug CPU+GPU code from Eclipse
Compute Sanitizer
Automatically Scan for Bugs and Memory Issues

• Compute Sanitizer checks correctness issues via sub-tools:
• Memcheck – Memory access error and leak detection tool.
• Racecheck – Shared memory data access hazard detection tool.
• Initcheck – Uninitialized device global memory access detection tool.
• Synccheck – Thread synchronization hazard detection tool.

https://github.com/NVIDIA/compute-sanitizer-samples
Compute Sanitizer
Reading a Memcheck Example Report

Slide callouts: address space, type of access, access size, access location, faulty thread, faulty address, nearest allocation, device and host backtraces.

========= Invalid __global__ write of size 4 bytes
=========     at 0xb0 in /home/cuda/github/compute-sanitizer-samples/Memcheck/memcheck_demo.cu:39:out_of_bounds_function()
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x87654320 is out of bounds
=========     and is 140,276,689,190,112 bytes before the nearest allocation at 0x7f953da00000 of size 1,024 bytes
=========     Device Frame:/home/cuda/github/compute-sanitizer-samples/Memcheck/memcheck_demo.cu:44:out_of_bounds_kernel() [0x30]
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2774ec]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:__cudart803 [0xfccb]
=========                in /home/cuda/github/compute-sanitizer-samples/Memcheck/memcheck_demo
=========     Host Frame:cudaLaunchKernel [0x6a578]
=========                in /home/cuda/github/compute-sanitizer-samples/Memcheck/memcheck_demo
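
To make the report fields concrete, here is a minimal Python sketch that pulls the address space, access type, access size, faulty address, and nearest-allocation details out of the report header and checks the reported distance. The regular expressions and the `parse_memcheck` helper are illustrative assumptions written against this one sample report, not an official Compute Sanitizer parser or output specification.

```python
import re

# Header lines copied from the Memcheck report on this slide.
REPORT = """\
========= Invalid __global__ write of size 4 bytes
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x87654320 is out of bounds
=========     and is 140,276,689,190,112 bytes before the nearest allocation at 0x7f953da00000 of size 1,024 bytes
"""


def parse_memcheck(report):
    """Extract access and allocation fields from a Memcheck error header.

    Hypothetical helper: the patterns only cover the wording seen in this sample.
    """
    access = re.search(r"Invalid (\S+) (read|write) of size (\d+) bytes", report)
    fault = re.search(r"Address (0x[0-9a-fA-F]+) is out of bounds", report)
    nearest = re.search(
        r"is ([\d,]+) bytes (before|after) the nearest allocation "
        r"at (0x[0-9a-fA-F]+) of size ([\d,]+) bytes",
        report,
    )
    return {
        "address_space": access.group(1),
        "access_type": access.group(2),
        "access_size": int(access.group(3)),
        "fault_address": int(fault.group(1), 16),
        "distance": int(nearest.group(1).replace(",", "")),
        "direction": nearest.group(2),
        "alloc_base": int(nearest.group(3), 16),
        "alloc_size": int(nearest.group(4).replace(",", "")),
    }


info = parse_memcheck(REPORT)
# "before" means the faulty address lies below the allocation's base address,
# so the reported distance should equal base minus faulty address.
gap = info["alloc_base"] - info["fault_address"]
print(info["address_space"], info["access_type"], info["access_size"])
print("distance matches report:", gap == info["distance"])
```

For this report the subtraction 0x7f953da00000 - 0x87654320 gives 140,276,689,190,112, the same figure Memcheck prints. A gap that large suggests the pointer was never derived from the allocation at all, rather than being an off-by-one index.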
Compute Debuggers and IDEs
New Features
Debuggers/IDE

IDEs
• VS Code Autostart tasks for remote debugging.
• VS Code remote debug (QNX/L4T)
• VS Code Docker support

Compute Sanitizer
• Racecheck support for device-launched graphs
• Memcheck support for Address Translation Service (ATS)
• Memcheck support for Heterogeneous Memory Management (HMM)
NVTX Tools Extension API
NVIDIA Tools eXtension (NVTX)
• Decorate application source code with annotations (markers, ranges, nested ranges, …) to help visualize execution with debugging, tracing and profiling tools

• Header-only library: https://github.com/NVIDIA/NVTX/tree/release-v3/c

#include <nvtx3/nvToolsExt.h>

• Marker:
nvtxMark("This is a marker");

• Push-Pop range
nvtxRangePush("This is a push/pop range");
// Do something interesting in the range
nvtxRangePop(); // Pop must be on same thread as corresponding Push

• Start-End range
nvtxRangeId_t handle = nvtxRangeStart("This is a start/end range");
// Somewhere else in the code, not necessarily same thread as Start call:
nvtxRangeEnd(handle);

API references: https://nvidia.github.io/NVTX/doxygen/index.html and https://nvidia.github.io/NVTX/doxygen-cpp/index.html

NVIDIA SDKs and NVTX
A Complete Ecosystem

[Diagram: NVIDIA SDKs and libraries in the NVTX ecosystem]
• DeepStream SDK: Accel. GStreamer Plugins, GXF
• Holoscan SDK: GXF
• Math Libraries: cuSPARSE, cuSOLVER, cuBLAS
• Comm. Libraries: NVSHMEM, NCCL
• Deep Learning Libraries: TensorRT, cuDNN, cuDLA
• …: cuDF, cuFile, cuML
Python and NVTX

• Annotate Python code with NVTX
• pip install nvtx - https://pypi.org/project/nvtx/

• Profile and Visualize with Nsight Systems
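
As a rough illustration of annotating Python code, the sketch below uses `nvtx.annotate` from the `nvtx` package both as a decorator and as a context manager. The function names `load_data` and `process` are made up for the example, and the fallback class is an assumption added only so the script still runs (recording nothing) on a machine where `nvtx` is not installed.

```python
import contextlib

try:
    import nvtx  # pip install nvtx
    annotate = nvtx.annotate
except ImportError:
    # Assumption for portability: a do-nothing stand-in with the same call shape,
    # so the sketch runs without the nvtx package. It emits no ranges.
    class annotate(contextlib.ContextDecorator):
        def __init__(self, message=None, color=None):
            self.message = message
            self.color = color

        def __enter__(self):
            return self

        def __exit__(self, *exc):
            return False


@annotate("load_data", color="blue")
def load_data(n):
    """Decorator form: the whole function call becomes one NVTX range."""
    return list(range(n))


def process(values):
    """Context-manager form: only the enclosed block becomes a range."""
    with annotate("square_and_sum", color="green"):
        return sum(v * v for v in values)


if __name__ == "__main__":
    print(process(load_data(1000)))
```

Running the script under Nsight Systems, for example with `nsys profile --trace=nvtx,cuda python script.py`, should show the two named ranges on the timeline; the exact trace options you need may differ by version and workload.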

Python and NVTX
Trace Python Functions of Interest

• No Python source changes required

• Annotations are configured in a JSON file (e.g. <target-platform-folder>/PythonNvtx/annotations.json)
Nsight Systems
Nsight Systems
System Profiler

Key Features:
• System-wide application algorithm tuning
• Multi-process tree support
• Locate optimization opportunities
• Visualize millions of events on a very fast GUI timeline
• Identify gaps of unused CPU and GPU time
• Balance your workload across multiple CPUs and GPUs
• CPU algorithms, utilization and thread state
• GPU streams, kernels, memory transfers, etc.
• Command Line, Standalone, IDE Integration
• OS: Linux (x86, ARM Server, Tegra), Windows, macOS X (host)
• GPUs: Pascal+
• Docs/product: https://developer.nvidia.com/nsight-systems
[Screenshot callouts: processes and threads; thread state; cuDNN and cuBLAS trace; kernel and memory transfer activities; multi-GPU]
Zoom/Filter to Exact Areas of Interest
Nsight Systems
New Features
Grace Host Profiling
Hardware Counters and Metrics

• CPU Core and Uncore Events

• Sampled for each CPU
• Visualize parallelism effects
• Cache hit/miss, instructions retired, etc…
• L3 Coherency Fabric
• Socket to socket traffic
• Variable sampling frequencies supported
• Timeline correlated with all other data
• GPU vs. CPU idle times and metrics
• Data movement
• Zoom and filter
Grace Host Profiling
Cache Access Pattern Example

Single threaded CPU matrix multiplication with poor memory access patterns

Improving access pattern and implementing cache blocking
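
The two variants named on this slide can be sketched as follows. This is an illustrative pure-Python version, not the code that was profiled in the talk: `matmul_naive`, `matmul_blocked`, and the `tile` parameter are assumed names, and matrices are stored as flat row-major lists. In an interpreter the cache effect is largely masked by overhead, so treat this as a description of the loop structure you would port to C or C++ before looking at cache hit/miss counters.

```python
def matmul_naive(a, b, n):
    """i-j-k order: the inner loop reads b with stride n (down a column),
    which is the poor access pattern for row-major storage."""
    c = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += a[i * n + k] * b[k * n + j]
            c[i * n + j] = acc
    return c


def matmul_blocked(a, b, n, tile=4):
    """Cache-blocked i-k-j order: work on tile x tile sub-blocks so the data
    touched by the inner loops stays small, and read b and c along rows."""
    c = [0.0] * (n * n)
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = a[i * n + k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i * n + j] += a_ik * b[k * n + j]
    return c


if __name__ == "__main__":
    n = 6
    a = [float((i * 3) % 7) for i in range(n * n)]
    b = [float((i * 5) % 11) for i in range(n * n)]
    print(matmul_naive(a, b, n) == matmul_blocked(a, b, n, tile=4))
```

Both functions compute the same product; only the order of memory accesses changes. The `min(...)` bounds let the tile size be any positive value, including one that does not divide `n`.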

JupyterLab Integration Updates

• Extension to JupyterLab
• Profile individual Jupyter cells
• Text-based results can be viewed directly in Jupyter
• Launch new remote GUI streaming container directly in JupyterLab
• Servers without X, Windowing Manager, …
• Container with X, WM, & WebRTC server
• Dockerfile inside Nsight Systems Installer

• See it in action:
• DLIT61667: Profilers, Python, and Performance: Nsight Tools for Optimizing Modern CUDA Workloads
Python Profiling Updates

• Python Call Stacks Samples and CUDA API Backtrace

• Identify where you are and how you got there
• Global Interpreter Lock (GIL) trace
• Common performance limiter in Python
• See annotated code ranges built into popular frameworks and libraries such as:
• RAPIDS, Spark, CV-CUDA, and more…
Cluster and Recipe Framework Improvements

• Nsight Systems enhanced support for Kubernetes

• Nsight Systems analysis framework:
• User programmable and predefined recipes to:
• Process and analyze complex and large reports or collections of reports
• Understand how compute cold-spots relate to communications
• Generate multi-node heatmaps to show:
• InfiniBand congestion
• InfiniBand, Ethernet, and NVLink throughputs
• Overlapped compute and networking

• NVIDIA Switch per-port support

• Remotely stream GUI inside container
• No need to copy/export out to local PCs
Recipe Framework Example

• Multi-process workload with NCCL

• Utilization heatmaps for NCCL/Compute/All
• Visualize usage over time to identify:
• Phases and behavior patterns
• Load imbalance
• Idle GPU compute cycles
• Inefficient scheduling
• Overlapping communication and compute
• Ensure resources are used efficiently
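
The idea behind a utilization heatmap can be sketched in a few lines: divide the timeline into bins and compute the fraction of each bin during which a GPU was busy. The code below is not the Nsight Systems recipe implementation; the `utilization_bins` helper and the busy intervals are hypothetical, chosen only to show how load imbalance and idle periods become visible per bin.

```python
def utilization_bins(intervals, t_end, num_bins):
    """Fraction of each time bin covered by busy intervals.

    Assumes intervals are (start, end) pairs that do not overlap each other.
    """
    width = t_end / num_bins
    bins = [0.0] * num_bins
    for start, end in intervals:
        for b in range(num_bins):
            lo, hi = b * width, (b + 1) * width
            overlap = min(end, hi) - max(start, lo)
            if overlap > 0:
                bins[b] += overlap / width
    return bins


# Hypothetical busy intervals (seconds) for two GPUs over an 8-second run.
busy = {
    "GPU 0": [(0.0, 4.0), (5.0, 8.0)],
    "GPU 1": [(0.0, 2.0), (6.0, 8.0)],
}

heatmap = {
    gpu: utilization_bins(iv, t_end=8.0, num_bins=4) for gpu, iv in busy.items()
}

for gpu, row in heatmap.items():
    # One row per GPU, one column per time bin.
    print(gpu, " ".join(f"{u:4.2f}" for u in row))
```

In this made-up data GPU 1 sits idle through the middle of the run while GPU 0 stays busy, which is the kind of imbalance the heatmap is meant to surface.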
Nsight Compute
Nsight Compute
Kernel Profiler

Key Features:
• Interactive CUDA API debugging and kernel profiling
• Built-in rules expertise
• Fully customizable data collection and display
• Command Line, Standalone, IDE Integration, Remote Targets

• OS: Linux (x86, Power, Tegra, Arm SBSA), Windows, macOS X (host only)
• GPUs: Volta+

• Docs/product: https://developer.nvidia.com/nsight-compute
Nsight Compute GUI Interface

• Targeted metric sections
• Customizable data collection and presentation
• Built-in expertise for Guided Analysis and optimization
• Visual memory analysis chart
• Metrics for peak performance ratios
• Source/PTX/SASS analysis and correlation
• Source metrics per instruction
• Metric heatmap to quickly identify hotspots
Nsight Compute
New Features
Nsight Compute
Periodic Metric Sampling

• Reveals behaviors hidden by aggregates

• Inter-kernel phases
• Workload imbalance (tail effects, etc…)
• Warp Stall Reasons
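
A small numeric illustration of why sampling over time helps: two kernels can report the same aggregate value while behaving very differently. The sample values below are hypothetical utilization percentages, not output from Nsight Compute.

```python
from statistics import mean

# Hypothetical utilization samples (percent), one per sampling interval.
steady = [50, 50, 50, 50, 50, 50, 50, 50]
tail = [100, 100, 100, 100, 0, 0, 0, 0]  # fully busy, then an idle tail


def summarize(samples):
    """Aggregate view plus the range that only the per-sample view exposes."""
    return {"mean": mean(samples), "min": min(samples), "max": max(samples)}


def idle_intervals(samples, threshold=0):
    """Indices of sampling intervals at or below the idle threshold."""
    return [i for i, s in enumerate(samples) if s <= threshold]


for name, samples in (("steady", steady), ("tail", tail)):
    print(name, summarize(samples), "idle at:", idle_intervals(samples))
```

Both series average 50, so a single aggregate cannot tell them apart, while the per-interval samples show that the second kernel is idle for its entire second half.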
Nsight Compute
Source Code Comparison

• Source Code Comparison

• Determine how modifications impact performance
• No need for multiple open reports/GUIs
• Automatic diff’ing to locate and navigate to changes
• Per-source heatmaps provide additional visual information
Nsight Compute
Workload Distribution Section and Load Imbalance Rules

• New GPU and Memory Workload Distribution section

• Helps users understand the balance of work across SMs and memory.
• New rules identify load imbalances where uneven work distribution could be impacting performance.
• Use this new section and the built-in rules to detect uneven workload distributions that may keep you from achieving peak performance.
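
One simple way to quantify uneven distribution is a max-over-mean ratio across units such as SMs. This is an assumed, generic metric used here for illustration; the slides do not state what the Nsight Compute rules actually compute, and the per-SM numbers are hypothetical.

```python
def imbalance_ratio(per_unit_work):
    """Max-over-mean ratio: 1.0 means perfectly even work; larger values mean
    the busiest unit carries more than its share and bounds total runtime."""
    avg = sum(per_unit_work) / len(per_unit_work)
    return max(per_unit_work) / avg


# Hypothetical active-cycle counts for four SMs.
balanced = [100, 100, 100, 100]
skewed = [80, 80, 80, 160]

print("balanced:", imbalance_ratio(balanced))
print("skewed:", imbalance_ratio(skewed))
```

In the skewed case the busiest unit does 1.6 times the average work, so the other units finish early and wait, even though the total amount of work is only slightly higher.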
Coming Soon…

• Source Page Statistics including multi-select

• Python Callstacks and Syntax Highlighting

• Range Replay Kernel Timestamps

CUPTI
CUDA Profiling Tools Interface
CUPTI Updates

• New APIs for instruction level SASS metrics

• Gives CUPTI users/tool developers the ability to collect SASS metrics through code instrumentation
• Previously only available through Nsight Compute
• Graph-level tracing for device-launched graphs
• Start and stop trace for the graph execution
• Lower overhead than per-node tracing
• Push Buffer full events
• CUDA API queue pressure can cause performance degradation
• Overhead reporting for lazy loading of CUDA modules and functions
• Performance improvements
• Tracing overhead reductions to ensure accurate performance data
Reviewing Areas of Focus
Focus Area: DevTools ♡ Python

• Python CPU Call Stacks

• Python GIL trace in Nsight Systems
• JupyterLab support
• Nsight Systems can profile individual Jupyter cells
• Text-based results can be viewed directly in Jupyter
• Timeline reports can launch the remote GUI streaming container with a single click directly in JupyterLab
• Increasing Python collateral/samples/labs
• DLIT61667: Profilers, Python, and Performance: Nsight Tools for Optimizing Modern CUDA Workloads
Focus Area: Cloud & Cluster

• Nsight Systems enhanced support for Kubernetes

• Nsight Systems analysis framework recipes to:
• Understand how compute cold-spots relate to communications
• Generate multi-node heatmaps to show:
• InfiniBand congestion
• InfiniBand, Ethernet, and NVLink throughputs
• Overlapped compute and networking
• NVIDIA Switch per-port support

• Remotely stream GUI inside container

• No need to copy/export out to local PCs
• Jupyter Lab integration including multi-node recipes
• More Details and Examples:
• S62388: Achieving Higher Performance From Your Datacenter and Cloud Application
Additional Resources
New Developer Tools Video Series
YouTube Playlist
DEVELOPER TOOLS ACROSS GTC
Sessions
S62256: Demystify CUDA debugging and performance with powerful developer tools
S62388: Achieving Higher Performance From Your Data Center and Cloud Application
SE62128: Exploring AI-Assisted Developer Tools for Accelerated Computing
S62398: Advances in Ray Tracing Developer Tools
Labs
DLIT61667: Profilers, Python, and Performance: Nsight Tools for Optimizing Modern CUDA Workloads
Connect with the Experts
CWE61532: What's in Your CUDA Toolbox? CUDA Profiling, Optimization, and Debugging Tools
CWE61581: Using Nsight Graphics Tools to Transform Your Graphics Application to a Next-Gen Powerhouse
CWE61231: Connect With the Experts: GPU Compute Performance Analysis and Optimizations
SE63279: Ask the Experts: Connect with Jetson, Metropolis, and Isaac Platform Experts and Engineers
Live demos
Come and visit the Developer Tools pod during show floor hours!

Developer Tools are free, get started here:

https://developer.nvidia.com/tools-overview
Training and Tutorials:
https://developer.nvidia.com/tools-tutorials

Interested in working on Developer Tools? We are hiring! Scan the QR code
