LMCache Lab
Software Development
Chicago, IL · 1,828 followers
Open-Source for 10X better LLM inference w. vLLM Production Stack + LMCache
About us
Open-source large-scale LLM serving solutions to democratize LLM Inference.
- Website: https://github.com/LMCache/LMCache
- Industry: Software Development
- Company size: 201-500 employees
- Headquarters: Chicago, IL
- Type: Nonprofit
Locations
- Primary: Chicago, IL, US
Updates
At LMCache Lab, we’re obsessed with LLM performance. As prefill-decode disaggregation becomes the norm, we spotted a major, untapped scheduling opportunity for prefill nodes. 🧵👇
1/ That’s why we developed SPF (Shortest Prefill First), introduced in our SOSP 2025 paper (https://lnkd.in/gpD8uHnf): a scheduling strategy that always serves the request with the shortest prefill time first. This reduces queueing delays across the board and gets your users their results faster.
2/ We’ve implemented a proof-of-concept for SPF in vLLM (PR: https://lnkd.in/gZtet8Sc).
3/ Benchmarking SPF 📊 Here’s a side-by-side comparison of mean and median time-to-first-token (TTFT) across leading LLM serving platforms (200 requests with random input lengths averaging 10,000 tokens, output length of 1, all requests arriving at once):
4/ Takeaway: Shortest Prefill First delivers an 18% reduction in mean TTFT compared to native vLLM, with even larger gains over Fireworks and DeepInfra. The median TTFT improvements are just as strong: SPF consistently delivers faster results from your hardware!
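To make the scheduling idea concrete, here is a minimal, illustrative sketch of a Shortest-Prefill-First waiting queue, assuming prefill time grows with prompt length. The `Request` and `ShortestPrefillFirstQueue` names are ours for illustration only; this is not the code in the vLLM PR.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(order=True)
class Request:
    # The estimated prefill cost (here simply the prompt length in tokens) is
    # the sort key, so requests with shorter prefills are popped first.
    estimated_prefill_tokens: int
    request_id: str = field(compare=False)

class ShortestPrefillFirstQueue:
    """Toy waiting queue that hands the scheduler the request with the
    shortest estimated prefill instead of serving arrivals in FIFO order."""

    def __init__(self) -> None:
        self._heap: List[Request] = []

    def add(self, request_id: str, prompt_token_count: int) -> None:
        heapq.heappush(self._heap, Request(prompt_token_count, request_id))

    def pop_next(self) -> Optional[Request]:
        return heapq.heappop(self._heap) if self._heap else None

# Usage: three requests arrive at once; the 1k-token prompt is scheduled
# before the 10k- and 50k-token prompts, cutting average queueing delay.
queue = ShortestPrefillFirstQueue()
queue.add("req-a", 50_000)
queue.add("req-b", 1_000)
queue.add("req-c", 10_000)
print(queue.pop_next().request_id)  # req-b
```

Serving short prefills first is the classic shortest-job-first intuition applied to prefill nodes: long prompts no longer hold up everyone queued behind them.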
You might know LMCache Lab for our KV cache optimizations that make LLM prefilling a breeze. But that’s not all! We’re now focused on speeding up decoding too, so your LLM agents can generate new content even faster. In other words: you can save on your LLM serving bills by renting fewer machines for the same amount of work. Deploying Llama 3.1 8B Instruct on a single H100 at 3 queries per second (QPS = 3), time per output token dropped by 2.5x with speculative decoding!
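For intuition on where that speedup comes from, here is a framework-agnostic sketch of the draft-and-verify loop at the heart of speculative decoding. The `draft_next_token` and `target_verify` callables are placeholders for illustration, not vLLM's or LMCache's actual interfaces.

```python
from typing import Callable, List

def speculative_decode_step(
    context: List[int],
    draft_next_token: Callable[[List[int]], int],
    target_verify: Callable[[List[int], List[int]], int],
    num_draft_tokens: int = 4,
) -> List[int]:
    """One step of draft-and-verify speculative decoding.

    A cheap draft model proposes a few tokens autoregressively; the large
    target model then checks the whole proposal in a single forward pass and
    keeps the longest prefix it agrees with. Accepting several tokens per
    target-model pass is what lowers time per output token.
    """
    # 1) Draft phase: the cheap model guesses num_draft_tokens tokens ahead.
    proposal: List[int] = []
    draft_context = list(context)
    for _ in range(num_draft_tokens):
        token = draft_next_token(draft_context)
        proposal.append(token)
        draft_context.append(token)

    # 2) Verify phase: the target model scores the proposal in one pass and
    #    reports how many drafted tokens it accepts.
    num_accepted = target_verify(context, proposal)
    return context + proposal[:num_accepted]

# Toy usage: a "draft" that always proposes token 1 and a "target" that
# accepts the first two drafted tokens.
print(speculative_decode_step([0], lambda ctx: 1, lambda ctx, prop: 2))  # [0, 1, 1]
```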
LMCache Lab reposted this
This is the fastest serving engine for LLMs! (100% open-source, makes vLLM go brr...)
LMCache cuts time-to-first-token by 7x and slashes GPU costs dramatically. Think of it as a smart caching layer that remembers everything your LLM has processed before. Instead of recomputing the same text over and over, LMCache stores those calculations (KV caches) across GPU, CPU, and disk storage.
Key benefits:
⚡ 7x faster time-to-first-token
💸 Huge reduction in compute costs
🔌 Drops right into existing vLLM setups
🌍 Shares cached data across multiple servers
The best part is that setup requires just one command: "pip install lmcache"
Find the link to the GitHub repo in the first comment!
Share this with your network if you found this insightful ♻️ Follow me (Akshay Pachaar) for more insights and tutorials on AI and Machine Learning!
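As a rough mental model of what such a caching layer does (not LMCache's actual API; the tier names and byte blobs are stand-ins for real KV tensors), consider a prefix-keyed cache looked up across storage tiers:

```python
import hashlib
from typing import Dict, Optional

class TieredKVCache:
    """Illustrative prefix-keyed KV cache spread across ordered storage tiers
    (fastest first, e.g. GPU -> CPU -> disk). The byte blobs stand in for the
    real attention KV tensors."""

    def __init__(self) -> None:
        # A real system would also track capacity and evict between tiers.
        self.tiers: Dict[str, Dict[str, bytes]] = {"gpu": {}, "cpu": {}, "disk": {}}

    @staticmethod
    def _key(prefix_text: str) -> str:
        return hashlib.sha256(prefix_text.encode()).hexdigest()

    def get(self, prefix_text: str) -> Optional[bytes]:
        key = self._key(prefix_text)
        for tier in self.tiers.values():
            if key in tier:
                return tier[key]  # cache hit: prefill for this prefix is skipped
        return None               # cache miss: prefill must be recomputed

    def put(self, prefix_text: str, kv_blob: bytes, tier: str = "cpu") -> None:
        self.tiers[tier][self._key(prefix_text)] = kv_blob

# Usage: a second request sharing the same long prefix hits the cache,
# which is where the time-to-first-token savings come from.
cache = TieredKVCache()
prefix = "You are a helpful assistant. <long shared document>"
if cache.get(prefix) is None:
    cache.put(prefix, b"<computed KV tensors>")
assert cache.get(prefix) is not None
```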
Check out the fresh Redis x LMCache integration post at the link below!
Redis and LMCache are teaming up to make GenAI platforms faster and more efficient. LMCache cuts response time by caching reusable chunks of data rather than complete prompts, reducing GPU workload, while Redis provides the robust, scalable infrastructure needed for rapid, production-ready deployment. For details, read the blog post by Rini Vasan and Yihua Cheng on how LMCache and Redis accelerate LLM inference and deliver cost-effective responses: [Redis Blog Post](https://lnkd.in/egK-8zts) #Redis #LMCache #GenAI #Inference #Efficiency
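For a sense of how chunk-level caching with a Redis backend can work, here is a hedged sketch using the redis-py client. The chunk size, key scheme, and localhost connection are assumptions for illustration, not the integration's actual configuration.

```python
import hashlib
from typing import List, Optional

import redis  # pip install redis

# Assumed local Redis instance; a real deployment would point at its own
# Redis endpoint.
r = redis.Redis(host="localhost", port=6379)

CHUNK_TOKENS = 256  # illustrative chunk size, not the integration's default

def chunk_key(token_chunk: List[int]) -> str:
    # Key each chunk of prompt tokens by a hash of its content, so any prompt
    # containing the same chunk maps to the same cache entry.
    digest = hashlib.sha256(str(token_chunk).encode()).hexdigest()
    return f"kvchunk:{digest}"

def load_chunk_kv(token_chunk: List[int]) -> Optional[bytes]:
    # Returns the serialized KV data for this chunk, or None on a cache miss
    # (in which case the GPU has to prefill this chunk).
    return r.get(chunk_key(token_chunk))

def store_chunk_kv(token_chunk: List[int], kv_blob: bytes) -> None:
    # Store freshly computed KV data so later requests can skip the work.
    r.set(chunk_key(token_chunk), kv_blob)

def split_into_chunks(tokens: List[int]) -> List[List[int]]:
    return [tokens[i : i + CHUNK_TOKENS] for i in range(0, len(tokens), CHUNK_TOKENS)]
```

Because chunks are keyed by content, any prompt that shares a chunk (a common system prompt or document, for example) can reuse the stored KV data instead of recomputing it.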
Why LMCache can be a lifesaver for AI agent companies: as Utkarsh Kanwat argues in his blog post “Why I’m Betting Against AI Agents in 2025”, AI agents need to reprocess context at extreme rates, which becomes a major economic problem given the large number of agents running. With LMCache and our state-of-the-art KV cache management, we can cut those massive costs by saving the context so that it doesn’t need to be reprocessed. Article: https://lnkd.in/ge-zYbma
Want to create 𝘆𝗼𝘂𝗿 𝗼𝘄𝗻 𝗟𝗟𝗠 𝗜𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗘𝗻𝗱𝗽𝗼𝗶𝗻𝘁 on 𝗔𝗻𝘆 𝗖𝗹𝗼𝘂𝗱 in seconds? We're announcing the 𝗮𝗹𝗽𝗵𝗮 𝗿𝗲𝗹𝗲𝗮𝘀𝗲 of 𝗟𝗠𝗜𝗴𝗻𝗶𝘁𝗲, the one-click high-performance inference stack built for speed and scale.
🤖 Join the alpha and supercharge your AI apps: https://lnkd.in/gMKqwYuZ
📑 Read the full blog here: https://lnkd.in/gn3a8kwu
Effortlessly enjoy
1️⃣ Unmatched Performance & Cost-Efficiency: Achieve up to 10x speedup and cost savings for demanding conversational and long-document AI workloads.
2️⃣ One-Click Deployment: Eliminate infrastructure complexity and launch a production-ready, scalable inference stack in minutes on any cloud or on-prem server.
3️⃣ Research-Driven Innovation: Powered by LMCache and the vLLM Production Stack, it leverages award-winning KV cache optimizations to minimize latency and maximize throughput.
Powered by LMCache, vLLM, and vLLM Production Stack.
#AI #LLMOps #Inference #RAG #LMCache #vLLM