Unlocking Chat History Cache: Save Cost Without Sacrificing Meaning
When conversations are fast, models remember efficiently—not because they grow wiser, but because we stop asking them to re-read the whole book.
Why a “cache discount” exists
In multi-turn AI conversations, a cache discount is a pricing mechanism that rewards reuse of prior computation on unchanged chat history.
If you send your next message within a short window (often about five minutes) and you have not altered earlier messages, the system can reuse internal representations of that shared prefix instead of recomputing everything.
Some flagship models with very long context windows (for example up to hundreds of thousands of tokens) advertise substantial discounts on cached history—sometimes as high as 90%—because they avoid most of the expensive forward pass for those tokens.
Exact thresholds, windows, and rates vary by provider and model, but the underlying idea is the same: reuse the same prefix, save compute, and pass the savings on to you.
What is actually cached under the hood
Modern large language models are Transformer-based and process sequences token by token using attention.
During generation, they maintain a key–value (KV) cache for each layer: the “keys” and “values” computed for past tokens.
When a new token arrives, the model only computes a fresh query for that token and attends to the stored keys and values, instead of recomputing keys/values for all earlier tokens.
Within a single long generation, that KV cache avoids quadratic rework; across turns, a server can persist the same per-layer caches for your chat’s unchanged prefix and reuse them if you continue soon.
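As a concrete illustration, here is a minimal single-head NumPy sketch of decoding with a KV cache; the dimensions and projection matrices are invented for demonstration and are not any provider's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 16

# Toy projection matrices for one attention head (randomly initialized for illustration).
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

def decode_step(x_new, k_cache, v_cache):
    """Process one new token embedding, reusing cached keys/values for all past tokens."""
    q = x_new @ W_q                              # query is computed for the new token only
    k_cache = np.vstack([k_cache, x_new @ W_k])  # append this token's key
    v_cache = np.vstack([v_cache, x_new @ W_v])  # append this token's value
    scores = q @ k_cache.T / np.sqrt(d_head)     # attend over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out = weights @ v_cache                      # attention output for the new token
    return out, k_cache, v_cache

# Start with an empty cache and feed tokens one at a time.
k_cache = np.empty((0, d_head))
v_cache = np.empty((0, d_head))
for _ in range(5):
    x = rng.normal(size=(d_model,))
    out, k_cache, v_cache = decode_step(x, k_cache, v_cache)

print(k_cache.shape)  # (5, d_head): one cached key per token seen so far
```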
Persisting KV caches is memory-intensive—each token consumes tens to hundreds of kilobytes across layers—so providers impose an expiry window (for example five minutes) to reclaim memory when you pause.
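A rough back-of-the-envelope estimate makes the memory pressure concrete; the layer count, KV-head count, and precision below are hypothetical, not any specific model's configuration.

```python
# Rough KV-cache memory per token: 2 tensors (K and V) per layer,
# each of size num_kv_heads * head_dim, stored at some precision.
num_layers = 80          # hypothetical
num_kv_heads = 8         # hypothetical (grouped-query attention reduces this)
head_dim = 128           # hypothetical
bytes_per_value = 2      # fp16/bf16

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(f"{bytes_per_token / 1024:.0f} KiB per token")               # ~320 KiB in this toy setup
print(f"{bytes_per_token * 100_000 / 2**30:.1f} GiB for 100k tokens")
```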
To decide whether reuse is possible, systems compare the tokenized prefix of your conversation; if it’s identical to what was cached, reuse is safe and cheap.
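In spirit, that check is just a longest-common-prefix comparison over token IDs. A minimal sketch, with invented token IDs for illustration:

```python
def common_prefix_length(cached_tokens, new_tokens):
    """Count how many leading token IDs match between the cached and the new request."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

cached = [101, 2023, 2003, 1037, 3231, 102]        # tokens served (and cached) last turn
new    = [101, 2023, 2003, 1037, 3231, 102, 999]   # same prefix plus one new token
reused = common_prefix_length(cached, new)
print(f"{reused} tokens can be served from cache; {len(new) - reused} must be computed fresh")
```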
Does caching make answers “more accurate”?
Short answer: caching does not inherently increase accuracy; it preserves equivalence.
Using a cache means the model reuses the exact internal key/value tensors it would have recomputed for the same prefix.
If the input tokens are the same and settings are the same, the logits for the next token will match up to tiny numerical noise.
There is no extra “thinking” or incremental self-improvement stored beyond the tokens you provided; no weights are updated, and no hidden plan is retained across turns other than the cached attention states.
So, reuse improves cost and latency, not the model’s reasoning depth.
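The equivalence claim can be demonstrated directly in a toy setup like the one above: recomputing keys and values for the whole prefix, or reusing cached ones, produces the same attention output up to floating-point noise. A self-contained sketch with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    scores = q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

prefix = rng.normal(size=(100, d))   # embeddings of an unchanged 100-token prefix
new_tok = rng.normal(size=(d,))      # the next token

# Path 1: recompute keys/values for the whole prefix from scratch.
K_full = np.vstack([prefix @ W_k, new_tok @ W_k])
V_full = np.vstack([prefix @ W_v, new_tok @ W_v])
out_recompute = attend(new_tok @ W_q, K_full, V_full)

# Path 2: reuse cached keys/values computed earlier for the same prefix.
K_cached, V_cached = prefix @ W_k, prefix @ W_v   # pretend these were stored last turn
K_reuse = np.vstack([K_cached, new_tok @ W_k])
V_reuse = np.vstack([V_cached, new_tok @ W_v])
out_cached = attend(new_tok @ W_q, K_reuse, V_reuse)

print(np.allclose(out_recompute, out_cached))  # True: same prefix, same result
```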
That said, cache reuse can indirectly help quality by letting you keep longer history without truncation or aggressive summarization, so the model conditions on richer context.
If the alternative is to repeatedly resend or re-summarize a very long transcript (risking information loss), then preserving the exact prefix can be beneficial.
Why the “five-minute rule” matters
A short time window balances two competing pressures: memory load on the server and friction for the user.
Keeping per-session KV caches alive ties up GPU or high-bandwidth memory; expiring caches too late would starve other users.
But expiring too quickly would penalize natural pauses in conversation.
Empirically, a window on the order of minutes lets back-to-back messages enjoy large savings while keeping resource utilization healthy.
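The expiry logic itself is simple in principle. Here is a minimal sketch of a time-to-live cache; the 300-second window and the data structure are illustrative, not how any particular provider implements it.

```python
import time

TTL_SECONDS = 300  # illustrative five-minute window

class ExpiringPrefixCache:
    """Toy cache mapping a conversation ID to (timestamp, cached prefix state)."""

    def __init__(self):
        self._entries = {}

    def put(self, convo_id, prefix_state):
        self._entries[convo_id] = (time.monotonic(), prefix_state)

    def get(self, convo_id):
        entry = self._entries.get(convo_id)
        if entry is None:
            return None
        stored_at, prefix_state = entry
        if time.monotonic() - stored_at > TTL_SECONDS:
            del self._entries[convo_id]   # expired: reclaim the memory
            return None
        return prefix_state               # fresh enough: reuse and get the discount

cache = ExpiringPrefixCache()
cache.put("thread-42", {"tokens": 100_000})
print(cache.get("thread-42") is not None)  # True if you come back within the window
```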
Conditions that typically enable the cache discount
- Send your next message quickly after the previous one, often within about five minutes.
- Do not edit or delete any earlier messages in the thread (see the sketch after this list).
- Keep model settings stable (e.g., same model variant, system prompt, temperature, tools configuration).
- Ensure the prior chat history is large enough that caching is worthwhile.
- Avoid nondeterministic retrieval pipelines that reshuffle context between turns unless you pin them (fixed top-k, stable ordering).
- Continue within the same conversation thread rather than starting a new one.
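One way to honor these conditions in client code is to treat the history as append-only and keep a single frozen settings object for the whole thread. A minimal sketch; the request shape and model name below are generic placeholders, not a specific provider's API.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ThreadSettings:
    """Fixed for the life of the thread so the cached prefix stays valid."""
    model: str = "example-model"     # hypothetical model name
    system: str = "You are a careful assistant."
    temperature: float = 0.2

@dataclass
class Conversation:
    settings: ThreadSettings
    messages: list = field(default_factory=list)   # append-only; never edit in place

    def next_request(self, user_text: str) -> dict:
        self.messages.append({"role": "user", "content": user_text})
        return {
            "model": self.settings.model,
            "system": self.settings.system,
            "temperature": self.settings.temperature,
            "messages": list(self.messages),   # identical prefix plus one new message
        }

convo = Conversation(ThreadSettings())
first = convo.next_request("Summarize the design doc.")
second = convo.next_request("Now list open risks.")   # prefix of `second` matches `first`
```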
What breaks caching (and the discount)
- Editing any earlier message or system instruction.
- Switching models or toggling major runtime options.
- Re-running tools/retrieval that produce different tokenized outputs for the same turn.
- Large delays that exceed the expiry window.
- Any change that alters the exact token sequence of the shared prefix (a quick self-check is sketched below).
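To debug a lost discount, it helps to diff the previous request against the new one along exactly these axes. A hedged sketch; the field names match the generic builder above, not a provider's schema, and the 300-second window is illustrative.

```python
def explain_cache_miss(prev_request, new_request, seconds_since_prev, ttl=300):
    """Return the reasons (if any) why the cached prefix from `prev_request` cannot be reused."""
    reasons = []
    if seconds_since_prev > ttl:
        reasons.append(f"waited {seconds_since_prev:.0f}s, past the ~{ttl}s window")
    for key in ("model", "system", "temperature"):
        if prev_request.get(key) != new_request.get(key):
            reasons.append(f"setting '{key}' changed")
    prev_msgs, new_msgs = prev_request["messages"], new_request["messages"]
    if new_msgs[: len(prev_msgs)] != prev_msgs:
        reasons.append("earlier messages were edited, deleted, or reordered")
    return reasons or ["prefix intact: cached history should be discounted"]

# Example: same settings, history appended, but the reply came 10 minutes later.
prev = {"model": "example-model", "system": "You are a careful assistant.", "temperature": 0.2,
        "messages": [{"role": "user", "content": "Summarize the design doc."}]}
new = {**prev, "messages": prev["messages"] + [{"role": "user", "content": "List open risks."}]}
print(explain_cache_miss(prev, new, seconds_since_prev=600))
```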
Cost mechanics in plain language
Without cache reuse, every turn forces the model to “re-read” all prior tokens to rebuild attention states before it can continue.
With reuse, the model skips most of that rebuild for the unchanged prefix, paying only a small overhead to check consistency and to serve the cached tensors.
That is why providers can offer steep discounts on cached portions of the input while charging normal rates for new input and generated output.
For example, if your conversation has 100k tokens of history and you add 1k tokens of new input, a 90% discount on the 100k cached tokens reduces the effective cost of that huge prefix by an order of magnitude.
The bigger your stable prefix, the more you save—provided you talk again soon.
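Putting numbers on that example: the per-token price below is a placeholder chosen only for illustration; the 90% cached-history discount is the one from the scenario above.

```python
PRICE_PER_INPUT_TOKEN = 0.000002   # placeholder rate in dollars, for illustration only
CACHE_DISCOUNT = 0.90              # 90% off reused history

cached_history = 100_000           # unchanged prefix tokens
new_input = 1_000                  # freshly added input tokens

without_reuse = (cached_history + new_input) * PRICE_PER_INPUT_TOKEN
with_reuse = (cached_history * (1 - CACHE_DISCOUNT) + new_input) * PRICE_PER_INPUT_TOKEN

print(f"input cost without reuse: ${without_reuse:.4f}")   # $0.2020
print(f"input cost with reuse:    ${with_reuse:.4f}")      # $0.0220, roughly 9x cheaper
```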
Accuracy myths: separating signal from noise
- Myth: “Caching makes the model smarter over the session.” Fact: No learning or weight updates occur; caches only avoid recomputation.
- Myth: “Continuing quickly gives more reasoning than restarting.” Fact: Given the same tokens and settings, results are functionally equivalent.
- Myth: “If cache expires, the model loses a special internal plan.” Fact: The only loss is the saved compute; the model can rebuild states from the same text.
How to deliberately maximize cache reuse
- Keep a stable instruction header at the very top of your thread; never edit it midstream.
- Send follow-up prompts promptly; draft offline if needed, then paste within minutes.
- Structure long tasks as several rapid turns instead of one huge, slow turn that risks timeout.
- Make retrieval deterministic: fix top-k, use stable ranking, and avoid time-dependent randomness between turns (see the sketch after this list).
- Reuse tool outputs by reference rather than regenerating them identically every turn.
- Avoid toggling temperature, penalties, or model versions mid-conversation.
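Deterministic retrieval mostly means removing hidden sources of reordering. A minimal sketch with an invented in-memory corpus and a trivial scoring function; real retrieval stacks differ, but the fixed top-k and fully specified, stable sort key are the point.

```python
def retrieve(query_terms, corpus, top_k=3):
    """Score documents by term overlap, then sort with a stable, fully specified key."""
    scored = []
    for doc_id, text in sorted(corpus.items()):          # fixed iteration order
        score = sum(text.lower().count(t.lower()) for t in query_terms)
        scored.append((score, doc_id))
    # Ties broken by doc_id, so the same query always yields the same ordering.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]

corpus = {
    "design.md": "cache discount and prefix reuse",
    "notes.txt": "retrieval ordering and top-k stability",
    "todo.txt": "misc follow-ups",
}
print(retrieve(["cache", "prefix"], corpus))   # identical output every turn for the same query
```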
Long context, vision, and why it matters
Flagship models with very long context windows (for example around 400k tokens) enable entire projects to remain in-context without truncation.
That makes cache reuse even more valuable: the longer the unchanged prefix, the bigger the avoided recompute.
If your workflow includes images or multimodal content, the same principles apply: as long as the tokenized prefix is identical and kept alive, caches can be reused.
A quick mental model of the math
- New input tokens: always full price.
- Generated output tokens: always full price (text generation is new work).
- Unchanged history tokens: discounted heavily when reused within the window.
If a provider offers, say, a 90% discount on cached history, the effective cost contribution from the reused prefix is only 10% of the usual input rate.
That is why conversations that move briskly through multiple turns can be dramatically cheaper than the same conversation replayed from scratch each time.
Engineering perspective: why providers can do this
- KV caches dominate the memory and bandwidth cost of incremental decoding because they hold every past token’s per-layer key and value projections.
- Serving from cache replaces compute with memory bandwidth; modern systems optimize this via paged KV caches and memory offloading (a toy version of paging is sketched below).
- The five-minute window balances GPU memory footprint against concurrency and fairness.
- Discounts align incentives: users are nudged to keep prefixes stable and sessions active, which also reduces provider costs.
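Paged KV caching stores the cache in fixed-size blocks so memory can be granted, shared, and reclaimed page by page rather than per whole conversation. The sketch below is a toy block allocator meant only to convey the bookkeeping idea, not the design of any production serving system.

```python
BLOCK_SIZE = 16   # tokens per page (illustrative)

class PagedKVCache:
    """Toy page-table bookkeeping: KV data lives in fixed-size blocks so memory
    can be reclaimed block by block rather than per whole sequence."""

    def __init__(self, total_blocks):
        self.free_blocks = list(range(total_blocks))
        self.seq_blocks = {}   # sequence id -> list of block indices
        self.seq_tokens = {}   # sequence id -> number of tokens cached

    def append_tokens(self, seq_id, n):
        tokens = self.seq_tokens.get(seq_id, 0) + n
        blocks_needed = -(-tokens // BLOCK_SIZE)          # ceiling division
        table = self.seq_blocks.setdefault(seq_id, [])
        while len(table) < blocks_needed:
            table.append(self.free_blocks.pop())          # grab a free page
        self.seq_tokens[seq_id] = tokens

    def evict(self, seq_id):
        """Called when the expiry window passes: pages go back to the free pool."""
        self.free_blocks.extend(self.seq_blocks.pop(seq_id, []))
        self.seq_tokens.pop(seq_id, None)

cache = PagedKVCache(total_blocks=1024)
cache.append_tokens("thread-42", 1000)                    # 63 pages cover 1000 tokens
print(len(cache.seq_blocks["thread-42"]), len(cache.free_blocks))
cache.evict("thread-42")                                   # pause too long: memory reclaimed
print(len(cache.free_blocks))
```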
Practical workflows that benefit most
- Code review and refactoring across large repositories where the codebase remains in context.
- Long-running research assistants that carry literature, notes, and prior drafts as stable context.
- Multimodal analysis where prior frames or pages do not change between turns.
- Iterative data analysis notebooks where results are referenced instead of regenerated.
When not to optimize for cache
- If your task needs fresh retrieval every turn that truly changes the prefix, forcing stability may harm relevance.
- If the conversation is short and cheap, the engineering effort to preserve caches may not pay off.
- If you must change the system instructions mid-course, prioritize clarity and correctness over discounts.
Checklist: how to reliably trigger the discount
- Reply within the time window (aim for under five minutes).
- Do not edit previous messages; append new ones.
- Keep the same model and settings throughout the thread.
- Stabilize retrieval and tool outputs across turns.
- Keep your history long and consistent; avoid unnecessary reformatting that changes tokens.
Looking ahead: caches today, state tomorrow
Cache discounts are a bridge to more explicitly stateful AI systems that remember across turns without recomputing, yet remain privacy-preserving and controllable.
We may see models with structured external memory or learned summaries that compress long histories into compact states, trading compactness against transparency and editability.
Until then, KV caching remains the practical workhorse: it is faithful, reversible, and cost-effective.
Bottom line
- The cache discount comes from reusing internal attention states for an identical, unchanged prefix within a short time window.
- Reuse lowers cost and latency; it does not inherently raise accuracy.
- Designing your workflow to keep the prefix stable and to respond promptly is a wise strategy when you care about both speed and budget.
Your turn
What scenarios in your work could benefit most from a cache-aware conversation flow, and where would you consciously trade cache savings for freshness or flexibility? Share your experience and trade-offs in the comments.