
LMCache Lab

Software Development

Chicago, IL 1,828 followers

Open-Source for 10X better LLM inference w. vLLM Production Stack + LMCache

About us

Open-source large-scale LLM serving solutions to democratize LLM Inference.

Industry
Software Development
Company size
201-500 employees
Headquarters
Chicago, IL
Type
Nonprofit


Updates

  • At LMCache Lab, we’re obsessed with LLM performance. As prefill-decode disaggregation becomes the norm, we spotted a major, untapped scheduling opportunity for prefill nodes. 🧵👇
    1/ That’s why we developed SPF (Shortest Prefill First), introduced in our SOSP 2025 paper (https://lnkd.in/gpD8uHnf): a scheduling strategy that always serves the request with the shortest prefill time first. This reduces queueing delays across the board and gets your users their results faster.
    2/ We’ve implemented a proof-of-concept for SPF in vLLM (PR: https://lnkd.in/gZtet8Sc).
    3/ Benchmarking SPF 📊 Here’s a side-by-side comparison of mean and median time-to-first-token (TTFT) across leading LLM serving platforms (200 requests with random input lengths averaging 10,000 tokens; output length = 1; all requests arriving at once):
    4/ Takeaway: Shortest Prefill First delivers an 18% reduction in mean TTFT compared to native vLLM, with even larger gains over Fireworks and DeepInfra. The median TTFT improvements are just as strong: SPF consistently delivers faster results from your hardware!

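SPF is essentially shortest-job-first applied to the prefill queue, with a request's prompt length serving as a cheap proxy for its prefill time. Below is a minimal, hypothetical sketch of that queueing policy in Python; it illustrates the idea only and is not the vLLM proof-of-concept linked in the post above.

```python
import heapq
import itertools
from dataclasses import dataclass, field

@dataclass(order=True)
class QueuedRequest:
    prefill_cost: int             # estimated prefill time; here just the prompt length
    seq: int                      # tie-breaker so equal-cost requests stay FIFO
    prompt: str = field(compare=False)

class ShortestPrefillFirstQueue:
    """Toy prefill queue: always hands out the request with the cheapest prefill,
    which lowers mean queueing delay (the classic shortest-job-first argument)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def add(self, prompt: str, num_prompt_tokens: int) -> None:
        heapq.heappush(
            self._heap,
            QueuedRequest(num_prompt_tokens, next(self._counter), prompt),
        )

    def pop_next(self) -> QueuedRequest:
        return heapq.heappop(self._heap)

# Usage: the short prompt jumps ahead of the long one even though it arrived later.
q = ShortestPrefillFirstQueue()
q.add("very long document ...", num_prompt_tokens=12_000)
q.add("short question", num_prompt_tokens=40)
print(q.pop_next().prompt)  # -> "short question"
```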
  • You might know LMCache Lab for our KV cache optimizations that make LLM prefilling a breeze. But that’s not all! We’re now focused on speeding up decoding too, so your LLM agents can generate new content even faster. In other words, you can save on your LLM serving bill by renting fewer machines for the same amount of work. Deploying Llama 3.1 8B Instruct on a single H100 and running at 3 queries per second (QPS = 3), time per output token dropped by 2.5x with speculative decoding!

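At a high level, speculative decoding lets a small draft model propose several tokens that the large target model then verifies in a single forward pass, so most decode steps cost far less than a full forward pass of the big model. Below is a minimal, framework-free sketch of the greedy accept/verify loop; the `draft_model` and `target_model` interfaces are hypothetical placeholders, not LMCache or vLLM APIs.

```python
def speculative_decode_step(draft_model, target_model, context, k=4):
    """One round of greedy speculative decoding (a sketch, not a real API).

    Hypothetical model interfaces assumed here:
      draft_model.generate(tokens, n)      -> n greedily drafted continuation tokens
      target_model.argmax_next_all(tokens) -> for each prefix tokens[:i+1], the
                                              target's most likely next token
                                              (all positions from ONE forward pass)
    """
    # 1. The cheap draft model guesses k tokens ahead.
    draft = draft_model.generate(context, k)

    # 2. The large target model verifies the whole draft in a single forward pass.
    target_next = target_model.argmax_next_all(context + draft)
    base = len(context) - 1  # target_next[base + i] follows context + draft[:i]

    # 3. Accept drafted tokens while they agree with the target; on the first
    #    mismatch, substitute the target's own token and stop.
    accepted = []
    for i, tok in enumerate(draft):
        if tok == target_next[base + i]:
            accepted.append(tok)
        else:
            accepted.append(target_next[base + i])
            break
    else:
        # Every drafted token was accepted: the verification pass also gives us
        # the target's next token after the draft, for free.
        accepted.append(target_next[-1])

    return context + accepted
```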
  • LMCache Lab reposted this

    View profile for Akshay Pachaar

    Co-Founder DailyDoseOfDS | BITS Pilani | 3 Patents | X (187K+)

    This is the fastest serving engine for LLMs! (100% open-source, makes vLLM go brr...)
    LMCache cuts time-to-first-token by 7x and slashes GPU costs dramatically. Think of it as a smart caching layer that remembers everything your LLM has processed before. Instead of recomputing the same text over and over, LMCache stores those calculations (KV caches) across GPU, CPU, and disk storage.
    Key benefits:
    ⚡ 7x faster time-to-first-token
    💸 Huge reduction in compute costs
    🔌 Drops right into existing vLLM setups
    🌍 Shares cached data across multiple servers
    The best part is that setup requires just one command: "pip install lmcache"
    Find link to the GitHub repo in the first comment!
    ____
    Share this with your network if you found this insightful ♻️ Follow me (Akshay Pachaar) for more insights and tutorials on AI and Machine Learning!

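For readers who want to try the "drops right into existing vLLM setups" claim, the usual wiring is to point vLLM's KV-transfer configuration at the LMCache connector. The sketch below follows the pattern shown in LMCache's examples at the time of writing, but the exact class and connector names are an assumption here and may differ between versions; check the LMCache and vLLM docs before relying on it.

```python
# Illustrative sketch only: the connector name and config fields below follow the
# pattern in LMCache's docs/examples, but may vary by vLLM/LMCache version.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Route vLLM's KV cache through LMCache so prefixes computed once can be reused
# from CPU/disk/remote storage instead of being recomputed on the GPU.
kv_config = KVTransferConfig(
    kv_connector="LMCacheConnectorV1",  # assumed connector name from LMCache examples
    kv_role="kv_both",                  # this instance both stores and loads KV caches
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=kv_config,
)

# Repeated prompts (or shared long prefixes) should hit the cache on later calls.
outputs = llm.generate(
    ["Summarize the following document: ..."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```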
  • Check out this fresh Redis x LMCache integration post at the link below!

    View profile for Nick Barcet

    Market opportunity creator

    Redis is revolutionizing the GenAI platform with LMCache, enhancing performance and efficiency. LMCache optimizes response time by caching reusable chunks of data instead of complete prompts, reducing GPU workload. The robust infrastructure, powered by Redis, ensures scalability and rapid deployment for production readiness. For further insights, read the blog post by Rini Vasan and Yihua Cheng detailing how LMCache and Redis accelerate LLM inference and deliver cost-effective responses: [Redis Blog Post](https://lnkd.in/egK-8zts) #Redis #LMCache #GenAI #Inference #Efficiency
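The key detail in the post is that caching happens per chunk of the token stream rather than per whole prompt, so two requests that share only a long prefix can still reuse each other's KV data. Here is a simplified, hypothetical illustration of such a keying scheme: a plain dict stands in for the Redis/remote store, and the chunk size is an assumed value, not LMCache's actual storage format.

```python
import hashlib

CHUNK_TOKENS = 256  # assumed chunk size; real systems tune this

def chunk_keys(token_ids, chunk_tokens=CHUNK_TOKENS):
    """Yield (key, chunk) pairs where each key hashes the ENTIRE prefix up to the
    end of that chunk, so a cached chunk is only reused when everything before it
    matches too (KV caches depend on the full prefix)."""
    hasher = hashlib.sha256()
    for start in range(0, len(token_ids), chunk_tokens):
        chunk = token_ids[start:start + chunk_tokens]
        hasher.update(str(chunk).encode("utf-8"))
        yield hasher.hexdigest(), chunk

def reusable_prefix_chunks(token_ids, cache):
    """Count how many leading chunks of this prompt already have KV data cached
    (`cache` is a dict standing in for a Redis/remote KV-cache store)."""
    hits = 0
    for key, _chunk in chunk_keys(token_ids):
        if key not in cache:
            break
        hits += 1
    return hits

# Example: two prompts that share their first chunks reuse those cached entries.
cache = {}
doc = list(range(1000))                 # pretend token IDs of a long shared document
for key, _ in chunk_keys(doc):
    cache[key] = "…kv tensors…"         # store KV data for each chunk of prompt #1

prompt2 = doc[:512] + [42, 43, 44]      # same first 512 tokens, different ending
print(reusable_prefix_chunks(prompt2, cache))  # -> 2 chunks (512 tokens) reusable
```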

  • View organization page for LMCache Lab


    Why LMCache can be a lifesaver for AI agent companies: as Utkarsh Kanwat argues in his blog post “Why I’m Betting Against AI Agents in 2025”, AI agents have to reprocess their context at extreme rates, which becomes a major economic problem given how many agent calls are running. With LMCache and our state-of-the-art KV cache management, those costs drop sharply because the context is saved and doesn’t need to be reprocessed. Article: https://lnkd.in/ge-zYbma

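To make the cost argument concrete, here is a back-of-the-envelope sketch (the token counts are illustrative assumptions, not measurements): without prefix caching, every agent turn re-prefills the entire growing conversation, so prefill work grows roughly quadratically with the number of turns, while a KV cache that keeps the shared prefix only pays for the newly appended tokens each turn.

```python
def prefill_tokens(turns, tokens_per_turn=2_000, system_prompt=4_000):
    """Illustrative token accounting for an agent loop (numbers are made up).

    Returns (without_cache, with_cache): total prompt tokens the model must
    prefill across all turns, assuming each turn appends `tokens_per_turn`
    of new tool output / messages on top of a fixed system prompt.
    """
    without_cache = 0
    with_cache = 0
    for turn in range(1, turns + 1):
        context_len = system_prompt + turn * tokens_per_turn
        without_cache += context_len        # re-prefill the whole context every turn
        with_cache += tokens_per_turn       # only the newly appended tokens
    with_cache += system_prompt             # the shared prefix is prefilled once
    return without_cache, with_cache

no_cache, cached = prefill_tokens(turns=20)
print(f"prefill tokens without caching: {no_cache:,}")    # 500,000
print(f"prefill tokens with prefix caching: {cached:,}")  # 44,000
```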
  • Want to create 𝘆𝗼𝘂𝗿 𝗼𝘄𝗻 𝗟𝗟𝗠 𝗜𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗘𝗻𝗱𝗽𝗼𝗶𝗻𝘁 on 𝗔𝗻𝘆 𝗖𝗹𝗼𝘂𝗱 in seconds? We're announcing the 𝗮𝗹𝗽𝗵𝗮 𝗿𝗲𝗹𝗲𝗮𝘀𝗲 of 𝗟𝗠𝗜𝗴𝗻𝗶𝘁𝗲, the one-click high-performance inference stack built for speed and scale.
    🤖 Join the alpha and supercharge your AI apps: https://lnkd.in/gMKqwYuZ
    📑 Read the full blog here: https://lnkd.in/gn3a8kwu
    Effortlessly enjoy:
    1️⃣ Unmatched Performance & Cost-Efficiency: Achieve up to 10x speedup and cost savings for demanding conversational and long-document AI workloads.
    2️⃣ One-Click Deployment: Eliminate infrastructure complexity and launch a production-ready, scalable inference stack in minutes on any cloud or on-prem server.
    3️⃣ Research-Driven Innovation: Powered by LMCache and the vLLM Production Stack, it leverages award-winning KV cache optimizations to minimize latency and maximize throughput.
    Powered by LMCache, vLLM, and vLLM Production Stack. #AI #LLMOps #Inference #RAG #LMCache #vLLM

