HUGE Performance Boost for AMD GPUs with vLLM 0.9.0!

Excited to announce vLLM 0.9.0, packed with significant AMD ROCm optimizations! We're seeing incredible speed-ups that will change how you run LLMs on AMD hardware.

🔥 What's New in vLLM 0.9.0 for AMD?

MI-Series GPUs:
- FP8 KV cache support (ROCm V1)
- Qwen3 MoE: +16.83% request throughput (MI300X)
- AITER Block-Scaled GEMM: +19.4% performance boost!
- DeepSeek V3/R1 optimizations: +13.8% output throughput!

🎮 Consumer GPUs (RX 7000/9000 Series):
- NEW: Custom Paged Attention for Radeon
- RX 9000 Series: +16.4% throughput
- RX 7000 Series: +19.0% performance gains

Yes, run LLMs efficiently on your gaming rig!

📊 DeepSeek-V3 Benchmark

We benchmarked deepseek-ai/DeepSeek-V3 (6K input, 1K output, TP=8), now running on the vLLM V1 engine with AITER. We measured Request Goodput as the rate of requests meeting strict latency SLOs: Time To First Token (TTFT) < 3000ms AND Time Per Output Token (TPOT) < 50ms (a small calculation sketch follows at the end of the post).

Without AITER (Baseline):
- Total Token Throughput: ~1300 tok/s
- Mean TTFT: ~4073 ms
- Request Goodput (TTFT < 3s, TPOT < 50ms): a mere 0.01 req/s

With vLLM 0.9.0 (V1 Engine, AITER Enabled):
- Total Token Throughput: ~2304 tok/s
- Mean TTFT: ~3502 ms
- Request Goodput (TTFT < 3s, TPOT < 50ms): a whopping 0.18 req/s!

Key Wins with vLLM 0.9.0 & AITER:
- 📈 ~1.77x Total Token Throughput! (from ~1300 to ~2304 tok/s)
- 🎯 18x Request Goodput! (from 0.01 to 0.18 req/s) - far more requests meet our strict 3s TTFT & 50ms TPOT targets!
- ⏱ 1.16x Faster Mean TTFT! (from ~4073ms down to ~3502ms) - get that first token quicker!
- 🔓 ENABLEMENT: DeepSeek-V3 now runs on the V1 engine for ROCm

🛠 Get Started with vLLM 0.9.0 on ROCm:

Launch Docker:
docker run -it --network=host --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device /dev/kfd --device /dev/dri https://lnkd.in/gKrZpjTN bash

Run vLLM Server (inside Docker):
VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 vllm serve deepseek-ai/DeepSeek-V3 -tp 8 --max-model-len 32768 --block-size 1

Ready to experience the speed? Try vLLM 0.9.0 today!
🔗 https://lnkd.in/g4uMsJAh
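
For readers who want to reproduce the Request Goodput metric: it is not a raw throughput number, it only counts requests that meet BOTH latency targets. The snippet below is a minimal sketch of that calculation, not the benchmark harness we used; the `requests` list, its field names, and the example numbers are illustrative assumptions.

# Sketch: deriving a goodput figure (e.g. 0.18 req/s) from per-request latencies.
TTFT_SLO_MS = 3000   # Time To First Token target from the post
TPOT_SLO_MS = 50     # Time Per Output Token target from the post

def request_goodput(requests, benchmark_duration_s):
    """Count only requests meeting BOTH SLOs, divided by wall-clock benchmark time."""
    good = sum(
        1
        for r in requests
        if r["ttft_ms"] < TTFT_SLO_MS and r["tpot_ms"] < TPOT_SLO_MS
    )
    return good / benchmark_duration_s

# Illustrative example: if 9 of 50 requests satisfy both SLOs over a 50-second run,
# goodput = 9 / 50 = 0.18 req/s, even though raw throughput counts all 50 requests.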
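
If you prefer vLLM's offline Python API over `vllm serve`, the same settings map roughly as follows. This is a sketch under assumptions (same ROCm Docker image, 8 GPUs, model weights available), not an official recipe; the prompt and sampling values are placeholders.

# Sketch: offline-inference equivalent of the serve command above.
# The env vars must be set before vLLM is imported.
import os
os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_ROCM_USE_AITER"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    tensor_parallel_size=8,   # matches -tp 8
    max_model_len=32768,
    block_size=1,
)

outputs = llm.generate(
    ["Explain what AITER does for vLLM on ROCm."],  # placeholder prompt
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)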