GPU matrix multiplication may be the most expensive algorithm that exists. It is the main operation that OpenAI, Anthropic, and Meta spend billions of dollars of compute on. There are only 8 kernel optimizations you need to understand to reach 93.7% of the performance of NVIDIA’s state-of-the-art cuBLAS library.

In this thread, we’ll go over kernels that get progressively more performant, taken from an Anthropic engineer's blog post and following the attached diagram.

Kernel 1: Simply multiplies two matrices. We’ll use CUDA’s grid, block and thread hierarchy to assign each thread a unique entry in the result matrix C. This works, but only gets us 309 GFLOPs/s (1.3% of an A6000 GPU's potential). We can do much better.

Kernel 2: Enables global memory coalescing by using “warps” (groups of threads). Threads that are part of the same warp can group their memory accesses into one. This dramatically improves memory throughput (110GB/s vs 15GB/s). Result: 1986 GFLOPs/s (8.5% of cuBLAS).

Kernel 3: Utilizes on-chip shared memory (SMEM). SMEM bandwidth is much higher than global memory bandwidth (12,080GiB/s vs 750GiB/s). We load chunks from A and B into SMEM and then perform as much work as possible on them. Result: 2980 GFLOPs/s (12.8% of cuBLAS).

Kernel 4: Uses 1D blocktiling to calculate multiple results per thread. It works like the last one but adds an inner loop over multiple C entries per thread (doing more work in SMEM), with a 4KB SMEM cache per block. Result: 8474 GFLOPs/s, ~3x faster than the last (36.5% of cuBLAS).

Kernel 5: Increases arithmetic intensity via 2D blocktiling. We compute a grid of 8*8 results per thread, leveraging shared memory and local registers to reduce global memory accesses. It offers another ~2x performance boost. Result: 15971 GFLOPs/s (68.7% of cuBLAS).

Kernel 6: Vectorizes memory accesses. The key is to transpose loads from A, enabling the use of 128-bit load instructions (LDS.128) instead of 32-bit loads. This enables more efficient data movement. Result: 18237 GFLOPs/s (78.4% of cuBLAS).

Kernel 7: Tunes the parameters for how much data we cache in SMEM and registers, which improves performance. We use a bash script to search all valid combinations to find the optimal settings. Result: 19721 GFLOPs/s (84.8% of cuBLAS).

Kernel 8: Adds "warptiling". This is yet another form of tiling (on top of blocktiling and threadtiling). Warptiling allows different warps to execute in parallel on different warp schedulers, leveraging the hardware for even more parallelism. Result: 21779 GFLOPs/s (93.7% of cuBLAS).

From reading the original post, I learned that optimizing GPU kernels requires a deep understanding of the hardware and memory access patterns. The basics are simple and get you most of the way there (the author got ~80% of the perf in 2 weekends). It took another 4 weekends to get the last 14% (classic power law).

For much more in-depth explanations with helpful diagrams and code snippets, check out the original post here; it's really interesting: https://lnkd.in/gi-y4NFB
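To make Kernel 2's coalescing point concrete, here is a toy model in plain Python (not CUDA). It counts how many 128-byte memory segments one warp touches in a single inner-loop step under two thread-to-entry mappings. The warp size of 32, the 128-byte segment size, the 32-wide block and both mapping functions are assumptions chosen for illustration; they are not taken from the post, and real hardware behaviour is more nuanced.

```python
WARP_SIZE = 32       # threads per warp
FLOAT_BYTES = 4      # FP32 element size
SEGMENT_BYTES = 128  # assumed size of one coalesced memory transaction
BLOCKSIZE = 32       # assumed block edge (illustrative)


def naive_xy(tid):
    """Hypothetical naive mapping: consecutive threads get consecutive ROWS of C."""
    return tid % BLOCKSIZE, tid // BLOCKSIZE  # (row, col)


def coalesced_xy(tid):
    """Hypothetical coalesced mapping: consecutive threads get consecutive COLUMNS of C."""
    return tid // BLOCKSIZE, tid % BLOCKSIZE  # (row, col)


def segments_touched(mapping, n, i):
    """Count distinct 128-byte segments one warp touches at inner-loop step i,
    when each thread reads A[row][i] and B[i][col] from row-major n x n matrices."""
    a_segments, b_segments = set(), set()
    for tid in range(WARP_SIZE):
        row, col = mapping(tid)
        a_segments.add((row * n + i) * FLOAT_BYTES // SEGMENT_BYTES)
        b_segments.add((i * n + col) * FLOAT_BYTES // SEGMENT_BYTES)
    return len(a_segments) + len(b_segments)


naive = segments_touched(naive_xy, 4096, 0)
coalesced = segments_touched(coalesced_xy, 4096, 0)
print(naive, coalesced)  # 33 2
```

In this model the naive mapping makes every thread of the warp read a different row of A, so the reads land in 32 separate segments, while the coalesced mapping lets the warp share one segment of A and one of B.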
GPU Matrix Multiplication Methods
Summary
GPU matrix multiplication methods refer to specialized techniques for multiplying large arrays of numbers (matrices) using graphics processing units, which can perform these operations much faster than traditional CPUs. These methods are at the heart of modern AI and scientific computing, and often involve unique ways of organizing memory and computation to get the most out of the hardware.
- Understand memory layout: Arrange your matrix data so that the GPU can access it quickly, which dramatically boosts performance and prevents unnecessary delays.
- Explore kernel optimizations: Use techniques like blocktiling, warptiling, and specialized instructions to make sure your matrix multiplication tasks run as fast as possible on different GPU architectures.
- Balance precision and speed: Try low-precision or emulated algorithms when you need faster results without sacrificing too much accuracy, especially for research or high-performance computing projects.
-
I'm happy to share a new coding tutorial for doing fast matrix multiplication on #NVIDIA Hopper GPUs! This covers the warpgroup matrix-multiply-accumulate (WGMMA) instruction that specifically targets the Tensor Cores on Hopper GPUs. Using tools from the CUTLASS library, we go into detail on all aspects of correctly invoking WGMMA as a primitive for matmul when writing a CUDA kernel -- how tensor data should be laid out in memory for WGMMA, how to use CUTLASS to define these layouts of your data, and how to synchronize WGMMA as an async instruction to guard against race conditions and ensure correct behavior of your kernel. If you've read the blog post on FlashAttention-3, you'll know how heavily FA-3 exploits WGMMA -- both in terms of its higher throughput and asynchronous capabilities -- to achieve its impressive performance gains. Our hope is that this tutorial can help similarly unlock the potential of the Hopper architecture when coding up your own projects and research ideas! We're also planning at least two followups to this tutorial - one covering the overall structure of an efficient GEMM kernel with a focus on copy-compute overlapping techniques such as warp specialization, and another on persistent kernels and the Stream-K algorithm for GEMM. Work done in collaboration with my colleagues at Colfax and Hieu Pham. https://lnkd.in/g-tsnnha
-
nvidia's B200 does 1,760 TFLOPS on a square GEMM and 4 TFLOPS on a skinny one. the shape of your matmul is the single biggest variable in GPU performance.

but why? same hardware, same operation, same precision. the only difference is three numbers: M, N, and K (the shape of the matrices you're computing in your matrix multiplications).

the first problem is arithmetic intensity. for a small-M GEMM, arithmetic intensity is approximately M/2 FLOPs per byte for FP16. at M=1, that's 0.5 FLOPs/byte. B200's compute-to-bandwidth ratio is ~281. so at M=1, you're using 0.18% of peak TFLOPS. the GPU loads the entire weight matrix from HBM and does one row of math with it, with literally zero data reuse.

this maps directly to how LLM inference works. during prefill, M equals your sequence length: it's thousands of tokens, so compute-bound, and you repeatedly get 70%+ GPU utilization. but, during decode, M equals your batch size: often 1 to 64, completely memory-bound, so you get around 2-10% utilization. it's the same model, same weights, and same GPU, but a completely different performance regime. this is why disaggregated serving exists: prefill and decode want fundamentally different hardware.

the second problem is wave quantization. the GPU tiles the output matrix into blocks (typically 256x128) and assigns each tile to an SM. for example an A100 has 108 SMs. if your shape produces exactly 108 tiles, that's one perfect wave, so you get 100% utilization. if it produces 109 tiles, you need two waves. the second wave uses 1 out of 108 SMs. execution time roughly doubles for a single extra tile!

this creates literal performance cliffs. benchmarks show that adding a single column to a 1792x1792 matrix drops performance 30%. the tile count jumps from 98 to 120, crossing the wave boundary. the tail wave is ~88% idle.

cuBLAS handles this with a trained ML recommender, instead of a lookup table. it ships hundreds of kernel implementations and uses a model trained on benchmark data to predict the fastest one for each (M, N, K) in microseconds. AMD's Tensile takes the opposite approach: benchmark 23 million kernel launches offline, build a predicate tree, exact-match lookup at runtime. they basically do brute force, and it works pretty well.

the solutions to this problem are varied and elegant. split-K parallelizes along the reduction dimension to create more tiles. stream-K goes further: it divides total work evenly across all SMs, eliminating wave quantization entirely. up to 14x speedup at wave boundaries. persistent kernels on Blackwell take it to the limit: two SMs cooperate on a single tile with dedicated tensor memory, hitting 1,760 TFLOPS.

the nature of mixture-of-experts models compounds all of these problems. deepseek V3 has 256 experts with top 8 routing. a batch-128 decode step means each expert sees ~4 tokens on average. that's M=4 per expert GEMM, firmly in the <3% utilization regime (sad), regardless of how many GPUs you throw at it.
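The wave quantization arithmetic can be checked with a few lines of Python. This is a minimal sketch, assuming the 256x128 tile and the 108 SMs quoted in the post, and assuming that the 98-to-120 jump corresponds to both output dimensions growing from 1792 to 1793 (that reading is an assumption; it is the one that reproduces the post's tile counts). The function name `wave_stats` is hypothetical.

```python
import math


def wave_stats(m, n, tile_m=256, tile_n=128, num_sms=108):
    """Return (tiles, waves, tiles_in_last_wave, overall_utilization)
    for an m x n output matrix, assuming one tile per SM per wave."""
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    waves = math.ceil(tiles / num_sms)
    tail_tiles = tiles - (waves - 1) * num_sms
    utilization = tiles / (waves * num_sms)
    return tiles, waves, tail_tiles, utilization


for size in (1792, 1793):
    tiles, waves, tail, util = wave_stats(size, size)
    idle_in_tail = 1 - tail / 108
    print(size, tiles, waves, tail, round(util, 3), round(idle_in_tail, 3))
```

At 1792 the 98 tiles fit in a single wave. At 1793 the 120 tiles need a second wave that holds only 12 tiles, so about 89% of the SMs sit idle during that tail wave.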
-
🚀 Outperforming Nvidia cuBLAS on the H100: A Journey From 4% to 107% Performance 📈

TL;DR
🤖 Iteratively builds a CUDA kernel for matrix multiplication on NVIDIA's H100 GPU.
⚡ Achieves 764 TFLOPs (107% of cuBLAS) for N=4096.
🚀 Involves 10 novel kernel optimizations to maximize Hopper architecture features, including Warp-group Tensor Cores (WGMMA) and the Tensor Memory Accelerator (TMA).
📈 Builds upon Simon Boehm's previous work on the Nvidia RTX A6000 GPU back in 2022.

Problems & Solutions
💥 Problem: A6000-style kernels only achieved 4% of cuBLAS on H100
🛠️ Solution: Adopt Hopper-specific WGMMA instructions (m64n256k16) + TMA swizzled loads
💥 Problem: Register spilling with large tiles (128x256)
🛠️ Solution: Split computation across 2 warp-groups (3 total: 1 producer + 2 consumers)
🔥 Problem: Hitting the power wall of the 700W GPU limit
🛠️ Solution: Optimized memory ops & spatial scheduling
🔥 Problem: Cache thrashing from row-major tile order
🛠️ Solution: Hilbert curve tile loading (83% L2 hit rate)
🔥 Problem: Async bottlenecks from sequential ops
🛠️ Solution: Producer-consumer queues with PTX barriers

Novel Insights and Learnings
🔄 Asynchronous Pipelines: Producer-consumer pipelines fully hide 1,415-cycle load latencies.
🤖 Warp-Group Tensor Cores: Hopper's WGMMA enables 4 concurrent tensor core operations per SM.
🧭 Hilbert Scheduling: Spatial locality boosts the L2 cache hit rate by 13 percentage points over cuBLAS.
🔋 Cluster Multicast: Reduces global memory traffic by 38% via shared B matrix loads.

Improvements Over Prior Work on A6000 by Simon Boehm
🚀 10x Speedup: From 32 TFLOPs (baseline FP32) to 317 TFLOPs using tensor cores.
⏩ 2.3x Faster: Initial H100 tensor core kernel improved through warp specialization.
🔒 Cache Efficiency: Achieved 83% L2 cache hit rate (vs cuBLAS 70%).

Key Implementations
🛠️ WGMMA Instructions: Used m64n256k16 for larger tiles, balancing register and SMEM usage.
🔧 PTX Barriers: Manual phase tracking reduced synchronization overhead by 23%.
📡 Cluster Multicast: Leveraged Hopper's TMA multicast for shared B matrix loads.
🗺️ Hilbert Scheduling: Optimized tile order for spatial locality and cache reuse.

Future Work
🔮 Auto-tuning: Optimize tile sizes (BM, BN, BK) for varying matrix dimensions.
🔋 Power Efficiency: Explore power-aware scheduling to balance tensor core and L2 cache energy.
🌐 Grouped GEMM: Extend to grouped matrix multiplications for MoE models.

Key Visualizations
🗺️ Figure 1: Step-by-Step Kernel Improvements
🌌 Figure 2: Multiple streaming multiprocessors (SMs) within the H100
🖼️ Figure 3: Kernel structure diagram showing TMA loads → tensor core chunks → register accumulation
🔮 Figure 4: Barrier state machine for producer-consumer pipeline with circular buffer
🗺️ Figure 5 & 6: Hilbert Curve Scheduling: Visualization of spatially optimized tile order improving L2 cache hit rate to 83% (vs cuBLAS' 70%).
📈 Figure 7: Graph comparing kernel performance across matrix sizes (N=512 to N=8192).

Step-by-Step Kernel Improvements and Links 👇
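The idea behind Hilbert-curve tile ordering can be sketched in plain Python. The code below uses the standard index-to-coordinate conversion for a Hilbert curve on a power-of-two grid and compares it with row-major order. It is an illustration of the ordering only, not the scheduler from the post: the 8x8 tile grid and the helper names `hilbert_d2xy`, `tile_order` and `max_step` are assumptions made for this example.

```python
def hilbert_d2xy(n, d):
    """Map position d along a Hilbert curve to (x, y) on an n x n grid.
    n must be a power of two."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate/flip the current quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def tile_order(n):
    """Visit order of all tiles on an n x n tile grid along the Hilbert curve."""
    return [hilbert_d2xy(n, d) for d in range(n * n)]


def max_step(order):
    """Largest Manhattan distance between consecutively scheduled tiles."""
    return max(abs(x1 - x0) + abs(y1 - y0)
               for (x0, y0), (x1, y1) in zip(order, order[1:]))


hilbert = tile_order(8)
row_major = [(x, y) for y in range(8) for x in range(8)]
print(max_step(hilbert), max_step(row_major))  # 1 8
```

Consecutive tiles on the Hilbert curve are always neighbours, so tiles scheduled close together in time tend to share rows of A or columns of B, which is the spatial locality the post credits for the higher L2 hit rate. Row-major order jumps across the whole grid at the end of every row.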
-
Everyone in AI knows quantization. FP16 to INT8 to FP4. Sacrifice precision, gain speed. Standard tradeoff.

Emulation is the opposite. The idea: use low-precision tensor cores to produce a high-precision result. Not approximate. Mathematically exact.

Two algorithms in cuBLAS right now serve as examples.

𝗕𝗙𝟭𝟲𝘅𝟵 for FP32. Each FP32 value is split into three BF16 components. The matrix multiply expands into 9 BF16 GEMMs on tensor cores. Recombine. Full FP32 precision recovered. Why it works: BF16 and FP32 share the same 8-bit exponent. You redistribute the 23 mantissa bits across three 7-bit BF16 slots; counting each component's implicit leading bit, three BF16 values carry 24 significant bits, the same as one FP32 value. Nothing is lost.

𝗢𝘇𝗮𝗸𝗶 𝗦𝗰𝗵𝗲𝗺𝗲 for FP64. Each FP64 value is scaled by a shared power-of-two factor, then sliced into INT8 chunks. Multiple INT8 GEMMs run on tensor cores. Recombination uses error-free transformations to guarantee zero accumulated rounding error. The number of slices depends on the input data range. cuBLAS picks automatically via ADP (Automatic Dynamic Precision).

The results on Blackwell: up to 13x faster than native FP64 on RTX PRO 6000. Same accuracy or better.

Quantization: you trade precision for speed. Emulation: you keep precision AND gain speed. The cost is more GEMMs. But when tensor core throughput is 10x+ higher than native arithmetic, the math works out.

This matters for HPC. Weather simulation, quantum chemistry, materials science. People who need FP64 but want GPU speed. cuBLAS now gives them both.

CUBLAS_COMPUTE_32F_EMULATED_16BFX9
CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT

Two enum values. That's all it takes to switch.
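The three-way BF16 split can be tried out on a single pair of scalars in plain Python. This is a minimal sketch of the splitting idea only: it truncates to BF16 by masking the low 16 bits of the FP32 encoding, and it accumulates the nine partial products in Python's double-precision floats. How cuBLAS actually rounds, orders and accumulates the nine GEMMs is not described in the post and is not reproduced here. The helper names `to_f32`, `trunc_bf16`, `split3` and `emulated_product` are hypothetical.

```python
import struct


def to_f32(x):
    """Round a Python float (double) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def trunc_bf16(x):
    """Truncate an FP32 value to BF16 by keeping the top 16 bits of its encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def split3(x):
    """Split an FP32 value into three BF16-representable parts: hi + mid + lo == x."""
    hi = trunc_bf16(x)
    mid = trunc_bf16(x - hi)
    lo = trunc_bf16(x - hi - mid)
    return hi, mid, lo


def emulated_product(a, b):
    """Rebuild a * b from the 3 x 3 = 9 products of BF16 components."""
    return sum(ai * bj for ai in split3(a) for bj in split3(b))


a = to_f32(3.14159265)
b = to_f32(-2.718281828)
print(split3(a))
print(sum(split3(a)) == a)              # the split loses nothing
print(emulated_product(a, b) == a * b)  # nine small products rebuild the product
```

For a matrix multiply the same expansion applies elementwise, which is where the nine BF16 GEMMs come from: each pairing of an A component with a B component is one low-precision GEMM.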
-
🧠 How a math identity eliminates millions of memory operations.

𝗧𝗵𝗲 𝗽𝗿𝗼𝗯𝗹𝗲𝗺: cuBLAS expects matrices in column-major order (Fortran style). Your C++/Python code uses row-major order. The naive solution? Transpose your matrices before calling cuBLAS, then transpose the result back.

But matrix transpose is EXPENSIVE. It's memory-bound, requiring a full read and write of every element. On an A100, even an optimized transpose of a 4K×4K matrix takes ~0.12ms. In a transformer with thousands of matrix multiplications per forward pass, this adds up to 10-20% overhead.

𝗧𝗵𝗲 𝗶𝗻𝘀𝗶𝗴𝗵𝘁: A matrix stored in row-major order has the exact same memory layout as its transpose stored in column-major order. Same bytes. Zero data movement. Just a different interpretation.

𝗧𝗵𝗲 𝘁𝗿𝗶𝗰𝗸: For C = A × B, we use the identity: Cᵀ = Bᵀ × Aᵀ
• Your row-major A? cuBLAS sees it as Aᵀ in column-major.
• Your row-major B? cuBLAS sees it as Bᵀ in column-major.
• Swap the multiplication order: compute Bᵀ × Aᵀ.
• cuBLAS outputs Cᵀ in column-major = C in row-major ✓

𝗧𝗵𝗲 𝗿𝗲𝘀𝘂𝗹𝘁:
• Zero memory copies
• Zero extra kernel launches
• 100% of GPU resources go to actual computation

This is why PyTorch and TensorFlow "just work" with cuBLAS despite storing tensors in row-major format. Pure mathematics turning an apparent incompatibility into a zero-cost abstraction.

Sometimes the best optimization isn't faster code; it's realizing you don't need to run the code at all.

#GPU #CUDA #Performance #MachineLearning #SoftwareEngineering
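The operand swap can be verified on flat buffers in plain Python. In this sketch `gemm_colmajor` is a hypothetical stand-in for a column-major GEMM routine; it only mimics the leading-dimension convention and is not the cuBLAS API. The 2x3 and 3x2 example matrices are made up for illustration.

```python
def gemm_colmajor(m, n, k, a, lda, b, ldb, c, ldc):
    """C (m x n) = A (m x k) @ B (k x n), with every matrix stored column-major
    in a flat list: element (i, j) of A lives at a[i + j * lda]."""
    for j in range(n):
        for i in range(m):
            c[i + j * ldc] = sum(a[i + p * lda] * b[p + j * ldb] for p in range(k))


# Row-major inputs as flat buffers: A is 2 x 3, B is 3 x 2.
m, k, n = 2, 3, 2
a_row_major = [1, 2, 3,
               4, 5, 6]
b_row_major = [7, 8,
               9, 10,
               11, 12]
c_row_major = [0] * (m * n)

# Read as column-major, the same buffers are B^T (n x k) and A^T (k x m).
# Passing B first and A second computes C^T = B^T @ A^T in column-major,
# which is exactly C in row-major. No transposes, no copies.
gemm_colmajor(n, m, k, b_row_major, n, a_row_major, k, c_row_major, n)

print(c_row_major)  # [58, 64, 139, 154]
```

The only changes relative to a direct call are the swapped operands, the swapped `m` and `n`, and leading dimensions equal to the row lengths of the row-major buffers.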
-
🔥 Tiling Strategy on NVIDIA GPU

In a typical CUDA program, we transfer data from the CPU to the GPU and write a CUDA kernel that fetches data from the GPU's global memory to perform arithmetic operations.

Problem? Global memory accesses are slow. When many threads access global memory at once, they compete for its limited bandwidth, and memory becomes the performance bottleneck.

Solution? Shared memory within a thread block. It reduces slow global memory accesses by reusing data locally, dramatically boosting performance for compute-intensive tasks like matrix multiplication by making kernels compute-bound instead of memory-bound.

Since shared memory is much smaller than global memory, we use the "Tiling Strategy". Tiling breaks large data (like matrices) into smaller "tiles" that fit into fast on-chip memory (shared memory/registers). Threads load a tile into shared memory, process it, and then move to the next tile. This maximizes data locality and pairs well with hardware like Tensor Cores for massive parallelism, and newer tools like CUDA Tile simplify it for developers.

Check out this efficient matrix transpose implementation using the tiling strategy: https://lnkd.in/gTkQ84Y4
-
I spent the last 6 hours deeply understanding how to tile matrix multiplication in CUDA. I never understood tiling until I manually worked out the indexing for a matrix.

What's the problem with naive matrix multiplication? It's super inefficient on GPUs.
1. Each thread reads from global memory every time it needs an element, which is slow compared to shared memory.
2. There’s no reuse of data, so threads repeatedly fetch the same values.

Tiling solves this by loading chunks of the matrices into shared memory, letting threads reuse data locally and improving performance. Once you visualize the tiles and map them to threads, the indexing finally makes sense, and it’s incredibly satisfying.

I've spent the last few weeks working through exercises from the PMPP book and have finished 4 chapters so far!

Repo - https://lnkd.in/gRjxeN6u
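The tile indexing can be worked through on the CPU before writing any CUDA. The sketch below is plain Python that mimics the structure of a tiled kernel: the `by`/`bx` loops stand in for thread blocks, `ty`/`tx` for threads inside a block, and the `a_tile`/`b_tile` lists for shared memory. It also counts simulated global-memory reads. The tile edge of 2, the 4x4 matrices and all names are assumptions chosen to keep the example small.

```python
TILE = 2  # tile edge; stands in for the thread-block edge (illustrative)


def naive_matmul(a, b, n):
    """Every output element reads a full row of A and column of B from 'global memory'."""
    loads = 0
    c = [[0] * n for _ in range(n)]
    for row in range(n):
        for col in range(n):
            for k in range(n):
                c[row][col] += a[row][k] * b[k][col]
                loads += 2
    return c, loads


def tiled_matmul(a, b, n):
    """Each block loads one tile of A and one of B per phase, then reuses them."""
    assert n % TILE == 0
    loads = 0
    c = [[0] * n for _ in range(n)]
    for by in range(n // TILE):          # block row
        for bx in range(n // TILE):      # block column
            for t in range(n // TILE):   # phase: slide the tiles along K
                # Thread (ty, tx) loads A[by*TILE+ty][t*TILE+tx] and B[t*TILE+ty][bx*TILE+tx].
                a_tile = [[a[by * TILE + ty][t * TILE + tx] for tx in range(TILE)]
                          for ty in range(TILE)]
                b_tile = [[b[t * TILE + ty][bx * TILE + tx] for tx in range(TILE)]
                          for ty in range(TILE)]
                loads += 2 * TILE * TILE
                for ty in range(TILE):
                    for tx in range(TILE):
                        row, col = by * TILE + ty, bx * TILE + tx
                        for k in range(TILE):
                            c[row][col] += a_tile[ty][k] * b_tile[k][tx]
    return c, loads


n = 4
a = [[i * n + j for j in range(n)] for i in range(n)]
b = [[(i + 2 * j) % 5 for j in range(n)] for i in range(n)]
c_naive, naive_loads = naive_matmul(a, b, n)
c_tiled, tiled_loads = tiled_matmul(a, b, n)
print(c_naive == c_tiled, naive_loads, tiled_loads)  # True 128 64
```

Both versions produce the same result, but the tiled version issues 1/TILE as many global reads, because every element brought into a tile is used by TILE different threads.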