S62256 - Demystify CUDA Debugging and Performance with Powerful Developer Tools

Demystify CUDA Debugging and Performance

with Powerful Developer Tools


Jackson Marusarz
Agenda

• High-level tools ecosystem overview

• For each tool:


• Brief description and feature overview
• New features in the latest releases and the problems they help solve

• Current and Future Areas of Focus

• Additional Resources / Q&A

https://developer.nvidia.com/tools-overview
Developer Tools Ecosystem
Debuggers: cuda-gdb, Nsight Visual Studio Edition, Nsight Visual Studio Code Edition
Profilers: Nsight Systems, Nsight Compute, CUPTI, NVIDIA Tools eXtension (NVTX)
Correctness Checker: Compute Sanitizer
IDE integrations: Nsight Visual Studio Code Edition, Nsight Visual Studio Edition, Nsight Eclipse Edition
Compute Debuggers and IDEs
Compute Debuggers
Debug GPU Kernels Running on Device

• CUDA GDB
• CPU + GPU CUDA kernel debugger
• Supports stepping, breakpoints, in-line functions, variable inspection etc…
• Built on GDB and uses many of the same CLI commands
• Local/Remote connection support
• Nsight Visual Studio Edition
• IDE integration for Visual Studio
• Build and Debug CPU+GPU code from Visual Studio
• Nsight Visual Studio Code Edition
• New IDE integration for VS Code
• Build and Debug CPU+GPU code from Visual Studio Code
• Remotely target Linux targets from Windows or Linux
• Nsight Eclipse Edition
• IDE integration for Eclipse
• Build and Debug CPU+GPU code from Eclipse
Compute Sanitizer
Automatically Scan for Bugs and Memory Issues

• Compute Sanitizer checks correctness issues via sub-tools:
• Memcheck – Memory access error and leak detection tool.
• Racecheck – Shared memory data access hazard detection tool.
• Initcheck – Uninitialized device global memory access detection tool.
• Synccheck – Thread synchronization hazard detection tool.

https://github.com/NVIDIA/compute-sanitizer-samples
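Each sub-tool is selected with the `--tool` option of the `compute-sanitizer` command line. The sketch below only builds the command lines; it assumes `compute-sanitizer` is on `PATH`, and both the helper name `sanitizer_command` and the application path `./my_app` are hypothetical.

```python
import shlex

# The four sub-tools listed above, as accepted by `compute-sanitizer --tool`.
SANITIZER_TOOLS = ("memcheck", "racecheck", "initcheck", "synccheck")


def sanitizer_command(tool, app, *app_args):
    """Build the argv list that runs `app` under one Compute Sanitizer sub-tool."""
    if tool not in SANITIZER_TOOLS:
        raise ValueError(f"unknown sub-tool: {tool!r}")
    return ["compute-sanitizer", "--tool", tool, app, *app_args]


if __name__ == "__main__":
    # Print one command line per sub-tool. To execute a command instead,
    # pass the list to subprocess.run on a machine with the CUDA toolkit.
    for tool in SANITIZER_TOOLS:
        print(shlex.join(sanitizer_command(tool, "./my_app")))
```

Running every sub-tool in turn is a reasonable default, since each one detects a different class of error.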
Compute Sanitizer
Reading a Memcheck Example Report

Callouts on the report: address space (__global__), type of access (write), access size (4 bytes), access location, faulty thread, faulty address, nearest allocation, device and host backtraces.

========= Invalid __global__ write of size 4 bytes
=========     at 0xb0 in /home/cuda/github/compute-sanitizer-samples/Memcheck/memcheck_demo.cu:39:out_of_bounds_function()
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x87654320 is out of bounds
=========     and is 140,276,689,190,112 bytes before the nearest allocation at 0x7f953da00000 of size 1,024 bytes
=========     Device Frame:/home/cuda/github/compute-sanitizer-samples/Memcheck/memcheck_demo.cu:44:out_of_bounds_kernel() [0x30]
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2774ec]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:__cudart803 [0xfccb]
=========                in /home/cuda/github/compute-sanitizer-samples/Memcheck/memcheck_demo
=========     Host Frame:cudaLaunchKernel [0x6a578]
=========                in /home/cuda/github/compute-sanitizer-samples/Memcheck/memcheck_demo
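To make the fields of the report concrete, here is a minimal sketch that pulls them out of one error record with regular expressions. The patterns are written against the record shown on this slide only; other error types or tool versions may format their output differently, and the function name `parse_memcheck_error` is hypothetical.

```python
import re

# Header line, e.g. "Invalid __global__ write of size 4 bytes".
ACCESS_RE = re.compile(
    r"Invalid (?P<space>__\w+__) (?P<kind>read|write) of size (?P<size>\d+) bytes"
)
# Access location, e.g. "at 0xb0 in /path/file.cu:39:function()".
LOCATION_RE = re.compile(
    r"at (?P<pc>0x[0-9a-f]+) in (?P<file>.+):(?P<line>\d+):(?P<func>\S+)"
)
# Faulty thread, e.g. "by thread (0,0,0) in block (0,0,0)".
THREAD_RE = re.compile(r"by thread \((?P<thread>[\d,]+)\) in block \((?P<block>[\d,]+)\)")
# Faulty address, e.g. "Address 0x87654320 is out of bounds".
ADDRESS_RE = re.compile(r"Address (?P<address>0x[0-9a-f]+) is out of bounds")

SAMPLE = """\
========= Invalid __global__ write of size 4 bytes
=========     at 0xb0 in /home/cuda/github/compute-sanitizer-samples/Memcheck/memcheck_demo.cu:39:out_of_bounds_function()
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x87654320 is out of bounds
"""


def parse_memcheck_error(report):
    """Return a dict of the fields found in one memcheck error record."""
    fields = {}
    for line in report.splitlines():
        text = line.lstrip("= \t")  # drop the "=========" prefix and indentation
        for pattern in (ACCESS_RE, LOCATION_RE, THREAD_RE, ADDRESS_RE):
            match = pattern.search(text)
            if match:
                fields.update(match.groupdict())
    return fields


if __name__ == "__main__":
    for name, value in parse_memcheck_error(SAMPLE).items():
        print(f"{name}: {value}")
```

The source line number and function name are only present when the application is built with line information, so a parser like this should treat them as optional.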
Compute Debuggers and IDEs
New Features
Debuggers/IDE

IDEs
• VS Code Autostart tasks for remote debugging.
• VS Code remote debug (QNX/L4T)
• VS Code Docker support

Compute Sanitizer
• Racecheck support for device-launched graphs
• Memcheck support for Address Translation Service (ATS)
• Memcheck support for Heterogeneous Memory Management (HMM)
NVTX Tools Extension API
NVIDIA Tools eXtension (NVTX)
• Decorate application source code with annotations (markers, ranges, nested ranges, …) to help visualize execution with debugging, tracing and profiling tools

• Header-only library: https://github.com/NVIDIA/NVTX/tree/release-v3/c


#include <nvtx3/nvToolsExt.h>

• Marker:
nvtxMark("This is a marker");

• Push-Pop range
nvtxRangePush("This is a push/pop range");
// Do something interesting in the range
nvtxRangePop(); // Pop must be on same thread as corresponding Push

• Start-End range
nvtxRangeHandle_t handle = nvtxRangeStart("This is a start/end range");
// Somewhere else in the code, not necessarily same thread as Start call:
nvtxRangeEnd(handle);

API references: https://nvidia.github.io/NVTX/doxygen/index.html and https://nvidia.github.io/NVTX/doxygen-cpp/index.html


NVIDIA SDKs and NVTX
A Complete Ecosystem

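[Diagram: NVIDIA SDKs and libraries in the NVTX ecosystem]
• SDKs: DeepStream SDK, Holoscan SDK
• SDK layers: Accel. GStreamer Plugins, GXF
• Math Libraries: cuSPARSE, cuSOLVER, cuBLAS
• Comm. Libraries: NVSHMEM, NCCL
• Deep Learning Libraries: TensorRT, cuDNN, cuDLA
• Others (…): cuDF, cuFile, cuML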
Python and NVTX

• Annotate Python code with NVTX
• pip install nvtx - https://pypi.org/project/nvtx/

• Profile and Visualize with Nsight Systems
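A minimal sketch of annotating Python code with the `nvtx` package from PyPI, using its `annotate` decorator/context manager and `mark`. The function names `preprocess` and `run` are hypothetical, and the no-op fallback class exists only so this sketch still runs where the package is not installed; it is not part of the library.

```python
import contextlib

try:
    import nvtx  # pip install nvtx
except ImportError:
    # Assumption for this sketch only: no-op stand-ins with the same call
    # shape, so the script runs on machines without the package.
    class _NoOpRange(contextlib.ContextDecorator):
        def __init__(self, message=None, color=None):
            self.message = message

        def __enter__(self):
            return self

        def __exit__(self, *exc_info):
            return False

    class nvtx:  # hypothetical shim, not the real library
        annotate = _NoOpRange

        @staticmethod
        def mark(message=None, color=None):
            pass


@nvtx.annotate("preprocess", color="green")  # decorator form: one range per call
def preprocess(values):
    return [v * 2 for v in values]


def run(values):
    nvtx.mark("run started")                  # instantaneous marker
    with nvtx.annotate("run", color="blue"):  # context-manager form
        return sum(preprocess(values))        # "preprocess" nests inside "run"


if __name__ == "__main__":
    print(run([1, 2, 3]))
```

When the script is profiled with NVTX tracing enabled, for example `nsys profile --trace=cuda,nvtx python script.py` (with `script.py` standing in for your own file), the ranges should appear as nested rows on the Nsight Systems timeline.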


Python and NVTX
Trace Python Functions of Interest

• No Python source changes required


• Annotations are configured in a JSON file (e.g. <target-platform-folder>/PythonNvtx/annotations.json)
Nsight Systems
Nsight Systems
System Profiler

Key Features:
• System-wide application algorithm tuning
• Multi-process tree support
• Locate optimization opportunities
• Visualize millions of events on a very fast GUI timeline
• Identify gaps of unused CPU and GPU time
• Balance your workload across multiple CPUs and GPUs
• CPU algorithms, utilization and thread state
• GPU streams, kernels, memory transfers, etc.
• Command Line, Standalone, IDE Integration
• OS: Linux (x86, ARM Server, Tegra), Windows, macOS X (host)
• GPUs: Pascal+
• Docs/product: https://developer.nvidia.com/nsight-systems
Timeline view callouts:
• Processes and threads
• Thread state
• cuDNN and cuBLAS trace
• Kernel and memory transfer activities
• Multi-GPU
Zoom/Filter to Exact Areas of Interest
Nsight Systems
New Features
Grace Host Profiling
Hardware Counters and Metrics

• CPU Core and Uncore Events


• Sampled for each CPU
• Visualize parallelism effects
• Cache hit/miss, instructions retired, etc…
• L3 Coherency Fabric
• Socket to socket traffic
• Variable sampling frequencies supported
• Timeline correlated with all other data
• GPU vs. CPU idle times and metrics
• Data movement
• Zoom and filter
Grace Host Profiling
Cache Access Pattern Example

Single threaded CPU matrix multiplication with poor memory access patterns

Improving access pattern and implementing cache blocking
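The slide does not show the code behind these two timelines, so the following is only a sketch of the two loop structures being compared. Python is used to keep the loop nests readable; the cache effect itself would be measured on a compiled implementation. The function names and the `block` parameter are hypothetical.

```python
def matmul_naive(a, b, n):
    """i-j-k order on row-major n x n matrices stored as flat lists.

    The inner loop reads b[k * n + j] with a stride of n elements,
    which is the poor access pattern.
    """
    c = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += a[i * n + k] * b[k * n + j]
            c[i * n + j] = acc
    return c


def matmul_blocked(a, b, n, block=32):
    """Tiled i-k-j order.

    Each block x block tile is reused while it is still cache resident,
    and the inner loop walks b and c with unit stride.
    """
    c = [0.0] * (n * n)
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        a_ik = a[i * n + k]
                        for j in range(jj, min(jj + block, n)):
                            c[i * n + j] += a_ik * b[k * n + j]
    return c
```

Both functions accumulate the products for each output element in the same order of k, so they return identical results; only the memory access order differs.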


JupyterLab Integration Updates

• Extension to JupyterLab
• Profile individual Jupyter cells
• Text-based results can be viewed directly in Jupyter
• Launch new remote GUI streaming container
directly in JupyterLab
• Servers without X, Windowing Manager, …
• Container with X, WM, & WebRTC server
• Dockerfile inside Nsight Systems Installer

• See it in action:
• DLIT61667: Profilers, Python, and Performance:
Nsight Tools for Optimizing Modern CUDA Workloads
Python Profiling Updates

• Python Call Stacks Samples and CUDA API Backtrace


• Identify where you are and how you got there
• Global Interpreter Lock (GIL) trace
• Common performance limiter in Python
• See annotated code ranges built into popular frameworks and libraries such as:
• RAPIDS, Spark, CV-CUDA, and more…
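As an illustration of why the GIL is a common limiter, the sketch below runs CPU-bound pure-Python work on several threads. On a standard CPython build with the GIL, only one of these threads executes Python bytecode at a time, which is the kind of serialization a GIL trace makes visible. The function names are hypothetical.

```python
import threading


def count_primes(limit):
    """CPU-bound pure-Python work; the thread needs the GIL to make progress."""
    found = 0
    for candidate in range(2, limit):
        if all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            found += 1
    return found


def run_threaded(limits):
    """Run one count_primes call per thread and return the results in order."""
    results = [None] * len(limits)

    def worker(index, limit):
        results[index] = count_primes(limit)

    threads = [
        threading.Thread(target=worker, args=(i, limit))
        for i, limit in enumerate(limits)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(run_threaded([20000, 20000, 20000, 20000]))
```

The results are correct either way; what changes is wall-clock time, so this pattern is a useful small test case for checking what the GIL trace looks like in a profile.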
Cluster and Recipe Framework Improvements

• Nsight Systems enhanced support for Kubernetes


• Nsight Systems analysis framework:
• User programmable and predefined recipes to:
• Process and analyze complex and large reports or collections of reports
• Understand how compute cold-spots relate to communications
• Generate multi-node heatmaps to show:
• InfiniBand congestion
• InfiniBand, Ethernet, and NVLink throughputs
• Overlapped compute and networking

• NVIDIA Switch per-port support


• Remotely stream GUI inside container
• No need to copy/export out to local PCs
Recipe Framework Example

• Multi-process workload with NCCL


• Utilization heatmaps for NCCL/Compute/All
• Visualize usage over time to identify:
• Phases and behavior patterns
• Load imbalance
• Idle GPU compute cycles
• Inefficient scheduling
• Overlapping communication and compute
• Ensure resources are used efficiently
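The idea behind a utilization heatmap can be sketched in a few lines: split the timeline into bins and record, per GPU, the fraction of each bin covered by busy intervals. This is not the recipe's implementation; the function name, the input layout, and the numbers are hypothetical, and the busy intervals of one GPU are assumed not to overlap.

```python
def utilization_heatmap(intervals_by_gpu, t_end, bins):
    """Return {gpu: [busy fraction per time bin]}.

    intervals_by_gpu maps a GPU name to a list of (start, stop) busy
    intervals in the same time unit as t_end.
    """
    width = t_end / bins
    heatmap = {}
    for gpu, intervals in intervals_by_gpu.items():
        row = [0.0] * bins
        for start, stop in intervals:
            for b in range(bins):
                lo, hi = b * width, (b + 1) * width
                overlap = min(stop, hi) - max(start, lo)
                if overlap > 0:
                    row[b] += overlap / width
        heatmap[gpu] = row
    return heatmap


if __name__ == "__main__":
    # Made-up intervals: gpu0 is busy throughout, gpu1 idles most of the time.
    demo = {"gpu0": [(0.0, 4.0)], "gpu1": [(0.0, 1.0), (2.5, 3.0)]}
    for gpu, row in utilization_heatmap(demo, t_end=4.0, bins=4).items():
        print(gpu, row)
```

Rows that stay near 1.0 next to rows with long runs of 0.0 are the load-imbalance and idle-cycle patterns listed above.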
Nsight Compute
Nsight Compute
Kernel Profiler

Key Features:
• Interactive CUDA API debugging and kernel profiling
• Built-in rules expertise
• Fully customizable data collection and display
• Command Line, Standalone, IDE Integration, Remote Targets

• OS: Linux (x86, Power, Tegra, Arm SBSA), Windows, macOS X (host only)
• GPUs: Volta+

• Docs/product: https://developer.nvidia.com/nsight-compute
Nsight Compute GUI Interface

• Targeted metric sections
• Customizable data collection and presentation
• Built-in expertise for Guided Analysis and optimization
• Visual memory analysis chart
• Metrics for peak performance ratios
• Source/PTX/SASS analysis and correlation
• Source metrics per instruction
• Metric heatmap to quickly identify hotspots
Nsight Compute
New Features
Nsight Compute
Periodic Metric Sampling

• Reveals behaviors hidden by aggregates


• Inter-kernel phases
• Workload imbalance (tail effects, etc…)
• Warp Stall Reasons
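A small numeric sketch of what "hidden by aggregates" means. The sample values below are made up for illustration and do not come from any real profile; the metric name and both helper functions are hypothetical.

```python
def mean(values):
    """Single aggregate value, as a whole-kernel metric would report."""
    return sum(values) / len(values)


def tail_fraction(samples, threshold):
    """Fraction of samples, counted back from the end, that stay below threshold."""
    count = 0
    for value in reversed(samples):
        if value >= threshold:
            break
        count += 1
    return count / len(samples)


# Hypothetical "SM active %" sampled at ten evenly spaced points in one kernel.
SM_ACTIVE_PCT = [95, 96, 94, 95, 97, 96, 30, 12, 8, 5]

if __name__ == "__main__":
    print("aggregate:", mean(SM_ACTIVE_PCT))
    print("tail fraction below 50%:", tail_fraction(SM_ACTIVE_PCT, 50))
```

For these numbers the aggregate is 62.8, which suggests a moderately busy kernel, while the samples show a kernel that is nearly saturated for 60% of its run and then spends the last 40% in a tail.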
Nsight Compute
Source Code Comparison

• Source Code Comparison


• Determine how modifications impact performance
• No need for multiple open reports/GUIs
• Automatic diff’ing to locate and navigate to changes
• Per-source heatmaps provide additional visual information
Nsight Compute
Workload Distribution Section and Load Imbalance Rules

• New GPU and Memory Workload Distribution section


• Helps users understand the balance of work across SMs and memory.
• New rules identify load imbalances where uneven work distribution could be impacting performance.
• Use this new section and the built-in rules to detect uneven workload distributions that may keep you from achieving peak performance.
Coming Soon…

• Source Page Statistics including multi-select

• Python Callstacks and Syntax Highlighting

• Range Replay Kernel Timestamps


CUPTI
CUDA Profiling Tools Interface
CUPTI Updates

• New APIs for instruction level SASS metrics


• Gives CUPTI users/tool developers the ability to collect SASS metrics through code instrumentation
• Previously only available through Nsight Compute
• Graph-level tracing for device-launched graphs
• Start and stop trace for the graph execution
• Lower overhead than per-node tracing
• Push Buffer full events
• CUDA API queue pressure can cause performance degradation
• Overhead reporting for lazy loading of CUDA modules and functions
• Performance improvements
• Tracing overhead reductions to ensure accurate performance data
Reviewing Areas of Focus
Focus Area: DevTools ♡ Python

• Python CPU Call Stacks


• Python GIL trace in Nsight Systems
• JupyterLab support
• Nsight Systems can profile individual Jupyter cells
• Text-based results can be viewed directly in Jupyter
• Timeline reports can launch the remote GUI streaming
container with a single click directly in JupyterLab
• Increasing Python collateral/samples/labs
DLIT61667: Profilers, Python, and Performance: Nsight Tools
for Optimizing Modern CUDA Workloads
Focus Area: Cloud & Cluster

• Nsight Systems enhanced support for Kubernetes


• Nsight Systems analysis framework recipes to:
• Understand how compute cold-spots relate to communications
• Generate multi-node heatmaps to show:
• InfiniBand congestion
• InfiniBand, Ethernet, and NVLink throughputs
• Overlapped compute and networking
• NVIDIA Switch per-port support

• Remotely stream GUI inside container


• No need to copy/export out to local PCs
• JupyterLab integration including multi-node recipes
• More Details and Examples:
• S62388: Achieving Higher Performance From Your
Datacenter and Cloud Application
Additional Resources
New Developer Tools Video Series
YouTube Playlist
DEVELOPER TOOLS ACROSS GTC
Sessions
S62256: Demystify CUDA debugging and performance with powerful developer tools
S62388: Achieving Higher Performance From Your Data Center and Cloud Application
SE62128: Exploring AI-Assisted Developer Tools for Accelerated Computing
S62398: Advances in Ray Tracing Developer Tools
Labs
DLIT61667: Profilers, Python, and Performance: Nsight Tools for Optimizing Modern CUDA Workloads
Connect with the Experts
CWE61532: What's in Your CUDA Toolbox? CUDA Profiling, Optimization, and Debugging Tools
CWE61581: Using Nsight Graphics Tools to Transform Your Graphics Application to a Next-Gen Powerhouse
CWE61231: Connect With the Experts: GPU Compute Performance Analysis and Optimizations
SE63279: Ask the Experts: Connect with Jetson, Metropolis, and Isaac Platform Experts and Engineers
Live demos
Come and visit the Developer Tools pod during show floor hours!

Developer Tools are free, get started here:


https://developer.nvidia.com/tools-overview
Training and Tutorials:
https://developer.nvidia.com/tools-tutorials

Interested in working on Developer Tools? We are hiring! Scan the QR code
