🔥 AMD #DevDay 2025
🎉 Very glad (and lucky!) to win the #Radeon RX 9070 XT #GPU prize — huge thanks to the #AMD team!
🥥 Summary: From system-level optimization (vLLM, #SGLang) to model orchestration (#gpt-oss, #Gemma Nano, #Ollama) — an inspiring cross-section of how open-source #AI + AMD hardware co-evolve. Modern #LLM #infrastructure converges on a few core engineering patterns across serving, memory, #agentic #reasoning and #kernel stacks.
👉 Long article: https://lnkd.in/gHBHCJXy
🤝 Many thanks to the speakers and hands-on workshop hosts: Ion Stoica (University of California, Berkeley), Lin Qiao (Fireworks AI), Daniel Han (Unsloth AI), Simon Mo (vLLM), Robert Shaw (Red Hat), Linda Yang (Supermicro), Michael Chiang (Ollama), Dominik Kundel (OpenAI), Erwan Mendard (Crusoe), Tris Warkentin (Google DeepMind), Yineng Zhang (sgl-project), Driss Guessous (Meta), Jeffrey Daily (AMD), Zhenyu Gu (AMD), Sina Rafati (AMD), Simran Arora (Together AI, Stanford University), ComfyUI, Hugging Face.
🧩 Summary
📚 Layer/token–aware memory (#vLLM + #Jenga): two-level allocator with LCM-aligned blocks + fine-grained sub-allocator. Handles heterogeneous KV sizes, minimizes fragmentation, boosts GPU utilization (~4.9× throughput gain).
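💡 A minimal sketch of the two-level idea (hypothetical class/field names, not vLLM's actual Jenga code): level 1 hands out LCM-aligned pages, level 2 sub-allocates per-layer slots inside a page.

```python
from math import lcm

class TwoLevelKVAllocator:
    """Toy Jenga-style allocator. Page size = LCM of all per-layer KV-entry
    sizes, so every layer tiles a page exactly and sub-allocation leaves no
    internal fragmentation."""

    def __init__(self, layer_kv_bytes: dict[str, int], total_bytes: int):
        self.page_size = lcm(*layer_kv_bytes.values())     # level 1: LCM-aligned pages
        self.layer_kv_bytes = layer_kv_bytes
        self.free_pages = list(range(total_bytes // self.page_size))
        self.slots: dict[tuple[int, str], list[int]] = {}  # (page, layer) -> free offsets

    def alloc(self, layer: str) -> tuple[int, int]:
        """Return (page_id, byte_offset) for one KV entry of `layer`."""
        # Level 2: prefer a partially used page already dedicated to this layer.
        for (page, lyr), free in self.slots.items():
            if lyr == layer and free:
                return page, free.pop()
        if not self.free_pages:
            raise MemoryError("out of KV pages")
        page = self.free_pages.pop()
        # page_size % entry_size == 0 by construction, so the page tiles exactly.
        free = list(range(0, self.page_size, self.layer_kv_bytes[layer]))
        self.slots[(page, layer)] = free
        return page, free.pop()

# Heterogeneous KV sizes (e.g. full-attention vs. sliding-window layers)
# share one pool without fragmenting it:
alloc = TwoLevelKVAllocator({"full_attn": 512, "sliding": 128}, total_bytes=1 << 20)
print(alloc.alloc("sliding"))   # -> (page_id, byte_offset)
```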
⭐️ Distributed inference (#llm-d): inference gateway + prefix-aware routing + prefill/decode disaggregation. RDMA-based KV transfer, prefix indices (Bloom filters), expert-parallel load balancing (EPLB), and 5-stage decode pipeline enable predictable SLOs and lower TTFT.
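💡 The prefix-routing trick in ~30 lines (a sketch with made-up names; llm-d's inference gateway is far richer): replicas advertise Bloom filters of their cached prefix blocks, and the router sends each request to the replica with the longest contiguous match.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter; production gateways use tuned implementations."""
    def __init__(self, m: int = 1 << 16, k: int = 4):
        self.m, self.k, self.bits = m, k, bytearray(m // 8)

    def _hashes(self, item: bytes):
        for i in range(self.k):
            digest = hashlib.blake2b(item, salt=bytes([i]), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item: bytes):
        for h in self._hashes(item):
            self.bits[h // 8] |= 1 << (h % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[h // 8] & (1 << (h % 8)) for h in self._hashes(item))

def route(prompt_blocks: list[bytes], replicas: dict[str, BloomFilter]) -> str:
    """Pick the replica advertising the longest contiguous cached prefix."""
    def prefix_hits(f: BloomFilter) -> int:
        n = 0
        for block in prompt_blocks:   # prefix reuse must be contiguous from block 0
            if block not in f:
                break
            n += 1
        return n
    return max(replicas, key=lambda name: prefix_hits(replicas[name]))

# Replica "a" cached the first two prompt blocks; "b" cached nothing.
fa, fb = BloomFilter(), BloomFilter()
blocks = [b"block0", b"block1", b"block2"]
for blk in blocks[:2]:
    fa.add(blk)
print(route(blocks, {"a": fa, "b": fb}))  # -> "a"
```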
⏰ #Quantization & model structure (#gpt-oss): Harmony prompt schema and MXFP4 4-bit float quantization of the MoE weights cut the memory footprint (the 120B model fits in 80 GB). Requires ROCm-level kernel tuning (AITER + fused AllReduce).
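💡 The 80 GB claim follows from the format's arithmetic: MXFP4 stores 4-bit (E2M1) elements with one shared 8-bit scale per 32-element block, i.e. ~4.25 bits/weight. A rough back-of-envelope check (attention and embedding layers typically stay at higher precision, so real accounting differs):

```python
params = 120e9                   # ~120B parameters, MoE weights dominating
block = 32                       # MXFP4: one shared 8-bit (E8M0) scale per 32 elements
bits_per_param = 4 + 8 / block   # 4-bit E2M1 element + amortized scale = 4.25 bits
weights_gb = params * bits_per_param / 8 / 1e9
print(f"{weights_gb:.1f} GB")    # ~63.8 GB of weights -> headroom for KV cache on 80 GB
```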
🌉 #Hierarchical KV caching (#SGLang / DeepSeek / #HiCache / #Mooncake): GPU–host–remote tiers with GPU-assisted radix indexing and RDMA prefetch; achieves 6× throughput, 84% TTFT reduction. Deterministic kernels via CUDA graph replay ensure reproducible #RL.
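💡 Conceptually, a tiered lookup with promotion (a toy sketch with illustrative names, not HiCache's API; the real stacks index prefixes with radix trees and move tensors over RDMA rather than dict lookups):

```python
from enum import Enum, auto

class Tier(Enum):
    GPU = auto()
    HOST = auto()
    REMOTE = auto()

class TieredKVCache:
    """Toy GPU -> host -> remote KV cache."""
    def __init__(self):
        self.tiers = {t: {} for t in Tier}

    def get(self, prefix_hash: str):
        # Probe hot to cold; promote cold hits toward the GPU so decode
        # kernels read from device memory (the "prefetch" step).
        for tier in (Tier.GPU, Tier.HOST, Tier.REMOTE):
            kv = self.tiers[tier].get(prefix_hash)
            if kv is not None:
                if tier is not Tier.GPU:
                    self.tiers[Tier.GPU][prefix_hash] = kv
                return kv
        return None  # miss on all tiers: full prefill required

    def put(self, prefix_hash: str, kv, tier: Tier = Tier.GPU):
        self.tiers[tier][prefix_hash] = kv
```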
🏙️ Hardware/software co-design (AITER, Primus-Turbo, CK, Triton): fused GEMM/attention kernels, FP8/FP4 quantization, token-routing fusion, and Uneven PP (pipeline parallelism) for fine-grained #parallelism. Kernel-level tuning is now as critical as model design.
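💡 Fusion in one picture: a minimal Triton kernel (Triton targets ROCm GPUs through the same Python API) doing bias-add + GELU in a single memory pass. A sketch of the technique, not an AITER/CK kernel:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_bias_gelu(x_ptr, b_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # One fused HBM pass: load, bias-add, GELU, store. Unfused, the
    # intermediate (x + b) would round-trip through device memory.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask) + tl.load(b_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * tl.sigmoid(1.702 * x), mask=mask)  # fast-GELU approx

# ROCm builds of PyTorch expose AMD GPUs under the "cuda" device name.
x, b = torch.randn(4096, device="cuda"), torch.randn(4096, device="cuda")
out = torch.empty_like(x)
fused_bias_gelu[(triton.cdiv(x.numel(), 1024),)](x, b, out, x.numel(), BLOCK=1024)
```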
🍃 AGIKIT & auto-tuning: emerging pipelines that benchmark, modify, and redeploy kernels automatically — agentic optimization loop for workload-specific GPU performance.
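💡 The shape of that loop, stripped to its core (hypothetical helper names; the real pipelines add agentic codegen and correctness checks):

```python
import time

def autotune(kernel_factory, configs, run_workload, iters: int = 20):
    """Benchmark each kernel variant on the target workload, keep the fastest.

    kernel_factory(**cfg) -> callable kernel   (the "modify" step)
    run_workload(kernel)  -> runs one batch    (the "benchmark" step)
    """
    best_cfg, best_t = None, float("inf")
    for cfg in configs:
        kernel = kernel_factory(**cfg)
        run_workload(kernel)                      # warm-up / JIT compile
        t0 = time.perf_counter()
        for _ in range(iters):
            run_workload(kernel)
        t = (time.perf_counter() - t0) / iters
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t                       # the "redeploy" candidate

# e.g. autotune(make_gemm, [{"BLOCK": 64}, {"BLOCK": 128}], run_batch)
```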
⚙️ 2025 Topics
Jenga-style allocators + per-layer KV metrics.
Wire vLLM KV connector + IGW prefix routing into serving.
Split prefill/decode with RDMA KV path.
Validate MXFP4/FP8 quant on #CoT/function-calling tests.
Benchmark ROCm + AITER + Primus early; kernel correctness = performance.
Use deterministic kernels for reproducible RL (minimal PyTorch setup below).
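💡 For that last item, the baseline PyTorch knobs are standard (serving stacks layer deterministic attention/sampling kernels on top of these):

```python
import os
import torch

# CUDA builds need this for deterministic cuBLAS GEMMs; harmless elsewhere.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
torch.manual_seed(0)
torch.use_deterministic_algorithms(True)  # raise on any nondeterministic op
torch.backends.cudnn.benchmark = False    # no autotuner run-to-run variance
```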
📈 Takeaway
Modern LLM infrastructure treats kernels, caches, and quantization formats as first-class product surfaces. Next up: observability for KV-cache layers, prefill/decode (P/D) RDMA prototypes, quantization-sensitivity evals, and integrated kernel auto-tuning loops.
Read more: https://lnkd.in/geSMr9G6