#CUDA 12.9 is out, and so is BF16 Tensor Core-accelerated single-precision (FP32) matrix multiplication (GEMM), which delivers a 3X speed-up on Blackwell GPUs. To learn more about this and other features, like the new block-scaled FP4 and FP8 formats, check out our latest blog: https://lnkd.in/gsHCWtjv
Ditching tf32
Love this