𝗞𝗦𝗲𝗿𝘃𝗲 𝘃0.15: 𝗞𝘂𝗯𝗲𝗿𝗻𝗲𝘁𝗲𝘀-𝗡𝗮𝘁𝗶𝘃𝗲 𝗟𝗟𝗠 𝗦𝗲𝗿𝘃𝗶𝗻𝗴 𝗧𝗵𝗮𝘁 𝗔𝗰𝘁𝘂𝗮𝗹𝗹𝘆 𝗦𝗰𝗮𝗹𝗲𝘀

The gap between getting an LLM working on your laptop and running it reliably in production is still enormous. KServe is the first serving platform I've seen that actually closes that gap at an architectural level.

Multi-node inference lets you run models like Llama 3.1 405B by distributing them across GPU clusters with simple YAML configuration. No custom orchestration code, no manual sharding logic (a rough sketch is below).

The KEDA integration brings autoscaling that understands LLM workloads. Instead of scaling on CPU percentage or request count, you scale on in-flight inference requests (also sketched below). That matters when one request generates 10 tokens and another generates 500.

Distributed KV cache through LMCache Lab means your inference pods can share cached computations. When a user asks a follow-up question, any pod in your cluster can reuse the context from previous turns, even if a different instance handled the original request.

The Envoy AI Gateway adds token-based rate limiting, intelligent routing, and multi-tenant isolation: features that generic API gateways don't provide because they weren't designed for generative AI traffic patterns.

The platform is open source, CNCF-backed, and used by Bloomberg, IBM, and NVIDIA.

I wrote a detailed technical breakdown covering the architecture, implementation examples, and YAML configurations: https://lnkd.in/eaY9Dtkx
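To make the multi-node piece concrete, here is a minimal sketch of the kind of InferenceService YAML involved. It assumes the Hugging Face (vLLM-backed) serving runtime and the workerSpec-style multi-node fields from recent KServe releases; the model name, storage URI, GPU counts, and parallelism sizes are placeholders, so verify the field names against your installed KServe CRDs rather than reading this as the exact v0.15 spec.

```yaml
# Illustrative sketch only: field names assumed from recent KServe
# multi-node support; verify against the CRD version you have installed.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-405b
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface                    # vLLM-backed Hugging Face runtime
      storageUri: hf://meta-llama/Llama-3.1-405B-Instruct  # placeholder model source
      resources:
        limits:
          nvidia.com/gpu: "8"                # GPUs on the head node
    # Worker pods join the head so one model spans several nodes:
    # tensor parallelism within a node, pipeline parallelism across nodes.
    workerSpec:
      tensorParallelSize: 8                  # GPUs per node (placeholder)
      pipelineParallelSize: 2                # number of participating nodes (placeholder)
```

The point of the sketch is the shape: the sharding and orchestration live in the spec, not in custom code you maintain yourself.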
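And a hedged sketch of what scaling on in-flight requests can look like with KEDA: a ScaledObject driving the predictor deployment off a Prometheus query over vLLM's running-request gauge. The metric name, Prometheus address, threshold, and target deployment name are illustrative assumptions; KServe's built-in KEDA integration may wire the trigger differently.

```yaml
# Illustrative sketch only: scale on concurrent inference requests, not CPU.
# Metric name, server address, and target name are placeholders.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-405b-autoscale
spec:
  scaleTargetRef:
    name: llama-405b-predictor                # deployment backing the InferenceService (assumed name)
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(vllm:num_requests_running)  # in-flight requests across replicas
        threshold: "4"                         # target in-flight requests per replica
```

Scaling on concurrency captures the load from long generations that a plain requests-per-second signal would miss, which is exactly the 10-token vs 500-token point above.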