HUGE Performance Boost for AMD GPUs with vLLM 0.9.0!

Excited to announce vLLM 0.9.0, packed with significant AMD ROCm optimizations! We're seeing incredible speed-ups that will change how you run LLMs on AMD hardware.

🔥 What's New in vLLM 0.9.0 for AMD?

MI-Series GPUs:
- FP8 KV cache support (ROCm V1)
- Qwen3 MoE: +16.83% request throughput (MI300X)
- AITER Block-Scaled GEMM: +19.4% performance boost!
- DeepSeek V3/R1 optimizations: +13.8% output throughput!

🎮 Consumer GPUs (RX 7000/9000 Series):
- NEW: Custom Paged Attention for Radeon
- RX 9000 Series: +16.4% throughput
- RX 7000 Series: +19.0% performance gains

Yes, run LLMs efficiently on your gaming rig!

📊 DeepSeek-V3 Benchmark

We benchmarked deepseek-ai/DeepSeek-V3 (6K input, 1K output, TP=8), now running on the vLLM V1 engine with AITER. We measured Request Goodput as the rate of requests meeting strict latency SLOs: Time To First Token (TTFT) < 3000ms AND Time Per Output Token (TPOT) < 50ms (a small calculation sketch follows at the end of the post).

Without AITER (Baseline):
- Total Token Throughput: ~1300 tok/s
- Mean TTFT: ~4073 ms
- Request Goodput (TTFT < 3s, TPOT < 50ms): a mere 0.01 req/s

With vLLM 0.9.0 (V1 Engine, AITER Enabled):
- Total Token Throughput: ~2304 tok/s
- Mean TTFT: ~3502 ms
- Request Goodput (TTFT < 3s, TPOT < 50ms): a whopping 0.18 req/s!

Key Wins with vLLM 0.9.0 & AITER:
- 📈 ~1.77x Total Token Throughput! (from ~1300 to ~2304 tok/s)
- 🎯 18x Request Goodput! (from 0.01 to 0.18 req/s) - far more requests meet our strict 3s TTFT & 50ms TPOT targets!
- ⏱ 1.16x Faster Mean TTFT! (from ~4073ms down to ~3502ms) - get that first token quicker!
- 🔓 ENABLEMENT: DeepSeek-V3 now runs on the V1 engine for ROCm

🛠 Get Started with vLLM 0.9.0 on ROCm:

Launch Docker:
docker run -it --network=host --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device /dev/kfd --device /dev/dri https://lnkd.in/gKrZpjTN bash

Run vLLM Server (inside Docker):
VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 vllm serve deepseek-ai/DeepSeek-V3 -tp 8 --max-model-len 32768 --block-size 1

Ready to experience the speed? Try vLLM 0.9.0 today!
🔗 https://lnkd.in/g4uMsJAh
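
For readers who want to reproduce the Request Goodput metric: it is not a raw throughput number, it only counts requests that meet BOTH latency targets. The snippet below is a minimal sketch of that calculation, not the benchmark harness we used; the `requests` list, its field names, and the example numbers are illustrative assumptions.

# Sketch: deriving a goodput figure (e.g. 0.18 req/s) from per-request latencies.
TTFT_SLO_MS = 3000   # Time To First Token target from the post
TPOT_SLO_MS = 50     # Time Per Output Token target from the post

def request_goodput(requests, benchmark_duration_s):
    """Count only requests meeting BOTH SLOs, divided by wall-clock benchmark time."""
    good = sum(
        1
        for r in requests
        if r["ttft_ms"] < TTFT_SLO_MS and r["tpot_ms"] < TPOT_SLO_MS
    )
    return good / benchmark_duration_s

# Illustrative example: if 9 of 50 requests satisfy both SLOs over a 50-second run,
# goodput = 9 / 50 = 0.18 req/s, even though raw throughput counts all 50 requests.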
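
If you prefer vLLM's offline Python API over `vllm serve`, the same settings map roughly as follows. This is a sketch under assumptions (same ROCm Docker image, 8 GPUs, model weights available), not an official recipe; the prompt and sampling values are placeholders.

# Sketch: offline-inference equivalent of the serve command above.
# The env vars must be set before vLLM is imported.
import os
os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_ROCM_USE_AITER"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    tensor_parallel_size=8,   # matches -tp 8
    max_model_len=32768,
    block_size=1,
)

outputs = llm.generate(
    ["Explain what AITER does for vLLM on ROCm."],  # placeholder prompt
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)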