KV Cache Explained: How LLMs Generate Text Without Recomputing Everything
The KV cache is the reason your LLM can generate text fast without recomputing the entire conversation at every step. This article explains how key-value caching works in transformer inference, why it is both essential and expensive, and how modern serving techniques such as vLLM's PagedAttention, grouped-query attention (GQA), and cache quantization manage it at scale.
When you send a long message to an LLM and it responds quickly, the KV cache is doing a lot of that work. Without it, generating each new token would require the model to recompute the key and value projections for the entire conversation history from scratch. Every single time. For a 4,000-token context, that means redoing attention work for all 4,000 positions for every token generated.
That would be unusably slow.
KV cache is not a recent trick. It has been standard in transformer inference since the beginning. But understanding it properly matters more now than it did when context windows were 512 tokens. At 128K context lengths, managing the cache well is the difference between a system that serves users and one that runs out of GPU memory trying.
This article covers what KV cache is, how it works, where it becomes a problem, and how modern serving systems deal with it.
What KV cache is
In self-attention, each token gets projected into three representations: query (Q), key (K), and value (V). During autoregressive generation, the current token's query attends over the keys and values of all previous tokens. Since the keys and values of earlier tokens do not change during inference, recomputing them at every step is wasteful.
KV cache stores these previously computed key and value tensors so the model only needs to compute the new token's attention projections once and append them to the existing cache.
In practical terms, KV cache turns repeated full-history recomputation into incremental decoding. Instead of rebuilding attention state from scratch for the entire prefix, the model reuses stored representations for prior tokens and computes attention against the accumulated cache.
```python
import torch
import torch.nn.functional as F

class AttentionWithKVCache(torch.nn.Module):
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.q_proj = torch.nn.Linear(embed_dim, embed_dim, bias=False)
        self.k_proj = torch.nn.Linear(embed_dim, embed_dim, bias=False)
        self.v_proj = torch.nn.Linear(embed_dim, embed_dim, bias=False)
        self.out_proj = torch.nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(
        self,
        x: torch.Tensor,
        past_keys: torch.Tensor = None,
        past_values: torch.Tensor = None,
    ):
        B, T, C = x.shape
        H, D = self.num_heads, self.head_dim
        q = self.q_proj(x).view(B, T, H, D).transpose(1, 2)
        k = self.k_proj(x).view(B, T, H, D).transpose(1, 2)
        v = self.v_proj(x).view(B, T, H, D).transpose(1, 2)
        # append new K and V to the cache
        if past_keys is not None:
            k = torch.cat([past_keys, k], dim=2)  # (B, H, T_total, D)
            v = torch.cat([past_values, v], dim=2)
        # store updated cache for next decoding step
        new_cache = (k, v)
        # attention over accumulated K/V
        scale = D ** -0.5
        scores = (q @ k.transpose(-2, -1)) * scale
        weights = F.softmax(scores, dim=-1)
        out = weights @ v
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.out_proj(out), new_cache

attn = AttentionWithKVCache(embed_dim=512, num_heads=8)
cache = (None, None)

# decoding loop: one token at a time
for step in range(10):
    x_step = torch.randn(1, 1, 512)  # one new token
    out, cache = attn(x_step, cache[0], cache[1])
    cached_len = 0 if cache[0] is None else cache[0].shape[2]
    print(f"step {step+1:2d} | cached tokens: {cached_len}")
```

```
step  1 | cached tokens: 1
step  2 | cached tokens: 2
step  3 | cached tokens: 3
...
step 10 | cached tokens: 10
```

The cache grows by exactly one token per decoding step. The attention computation for step 10 reads from 10 cached positions but only computes new Q, K, V for the one new token. Without the cache, it would recompute all 10 from scratch every time.
The two phases of inference: prefill and decode
KV cache operates differently across two distinct inference stages. Understanding the split matters for latency optimization because the two phases have completely different bottlenecks.
Prefill stage
During prefill, the model processes the entire input prompt and computes keys and values for all input tokens across all transformer layers. These tensors are stored in memory. Prefill runs in parallel across all input tokens, so it is compute-bound on long prompts. The output of the prefill stage is the KV cache populated for all prompt tokens, plus the first generated token.
Decode stage
During decode, the model generates one token at a time. For each new token it computes only the new Q, K, V, appends K and V to the existing cache, and attends over the full accumulated cache. Decode is memory-bandwidth-bound rather than compute-bound because the bottleneck is reading the cached tensors from GPU memory, not doing arithmetic on them.
```python
import time
import torch

def profile_two_phases(model, tokenizer, prompt: str, max_new_tokens: int = 50):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # prefill: process entire prompt, get first token
    t0 = time.perf_counter()
    with torch.no_grad():
        first_out = model.generate(input_ids, max_new_tokens=1, use_cache=True)
    prefill_ms = (time.perf_counter() - t0) * 1000

    # decode: generate remaining tokens one at a time
    # (note: this second generate() call re-runs prefill on the prompt, so
    # decode_ms is a slight overestimate; good enough for a rough profile)
    t0 = time.perf_counter()
    with torch.no_grad():
        full_out = model.generate(first_out, max_new_tokens=max_new_tokens - 1, use_cache=True)
    decode_ms = (time.perf_counter() - t0) * 1000

    tps = (max_new_tokens - 1) / (decode_ms / 1000)
    print(f"prompt tokens: {input_ids.shape[1]}")
    print(f"prefill time: {prefill_ms:.1f}ms (time to first token)")
    print(f"decode time: {decode_ms:.1f}ms for {max_new_tokens - 1} tokens")
    print(f"tokens per second: {tps:.1f}")
```

Time to first token (TTFT) is dominated by prefill. Sustained tokens per second after that is dominated by how quickly the GPU can read the KV cache. These are different problems with different solutions, which is why inference optimization papers often treat them separately.
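A useful corollary: because decode is bandwidth-bound, you can estimate an upper bound on tokens per second from memory bandwidth alone, since each generated token must stream the weights and the cache through the memory system once. A back-of-envelope sketch with illustrative round numbers (not measurements of any specific GPU or model):

```python
# back-of-envelope decode ceiling: each generated token must read the
# weights plus the KV cache from GPU memory once
bandwidth_gb_s = 2000   # illustrative: ~2 TB/s of HBM bandwidth
weights_gb = 14         # illustrative: a 7B-parameter model in bfloat16
kv_cache_gb = 2         # illustrative: cache for the active batch

upper_bound_tps = bandwidth_gb_s / (weights_gb + kv_cache_gb)
print(f"bandwidth-bound ceiling: {upper_bound_tps:.0f} tokens/s")
```

Real systems land below this ceiling, but the estimate makes the point: as the cache grows, the denominator grows, and per-token decode speed falls.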
Why KV cache matters
The primary benefit is lower decoding latency. Since the model avoids recomputation of historical attention states, each new token can be produced faster than in a no-cache implementation. This is especially relevant for interactive systems such as chatbots, copilots, and real-time assistants where users feel latency directly.
The second benefit is improved serving efficiency. Efficient KV caching reduces the computation required per generated token, which helps inference systems make better use of available hardware. At production scale, this directly affects throughput and cost per token.
The Hugging Face Transformers documentation on caching explains the standard cache interface and how it maps to the internal attention implementation. The cache strategies guide covers the different cache modes available and when to use each.
The memory problem
KV cache improves speed but increases memory pressure. Each request must store cached key and value tensors for every active token across the relevant transformer layers. As sequence length grows, the cache grows roughly linearly, reducing the number of requests that can fit into GPU memory at the same time.
```python
def kv_cache_size_gb(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    dtype_bytes: int = 2,  # bfloat16 = 2 bytes
) -> float:
    """Estimate KV cache memory in GB."""
    # 2 for K and V
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return (elements * dtype_bytes) / 1e9

# LLaMA-2 70B configuration: 80 layers, 8 KV heads (GQA), 128 head dim
kv_gb = kv_cache_size_gb(
    num_layers=80,
    num_kv_heads=8,
    head_dim=128,
    seq_len=8192,
    batch_size=32,
)
print(f"KV cache at 8K context, batch 32: {kv_gb:.1f} GB")

# without GQA (full MHA with 64 heads)
kv_mha = kv_cache_size_gb(
    num_layers=80,
    num_kv_heads=64,
    head_dim=128,
    seq_len=8192,
    batch_size=32,
)
print(f"MHA KV cache at 8K context, batch 32: {kv_mha:.1f} GB")
```

```
KV cache at 8K context, batch 32: 85.9 GB
MHA KV cache at 8K context, batch 32: 687.2 GB
```

That 687 GB number is why full Multi-Head Attention became impractical at scale before GQA was adopted. The weights of a 70B model in bfloat16 take about 140 GB; a full-MHA KV cache at just 8K context and batch 32 would add nearly five times that again, far beyond any single-node GPU configuration. GQA cuts the KV footprint by 8x in this configuration, and even then the cache is a meaningful fraction of the weights.
Two other limitations are worth knowing. First, KV caching is designed for inference rather than training. In common transformer tooling, caching is documented as an inference optimization and may cause errors or unexpected behavior if enabled during training workflows. Second, naive cache allocation can fragment GPU memory badly in multi-user serving scenarios where many requests grow and terminate at different times, wasting reserved space that cannot be used by other requests.
Advanced KV cache optimizations
Because naive caching can consume very large amounts of memory, several optimization strategies have emerged.
PagedAttention
PagedAttention organizes KV cache into fixed-size blocks or pages rather than large contiguous buffers. This idea, introduced in the vLLM work (arXiv:2309.06180), reduces memory waste from fragmentation and enables more flexible sharing. The authors report significant throughput gains compared with earlier inference systems.
The intuition is borrowed from operating system virtual memory management. Instead of allocating one large contiguous block per request that may be partially unused, the system allocates fixed pages and assigns them as needed. When a request finishes, its pages are freed and immediately available to other requests.
```python
import torch

# conceptual illustration of paged vs contiguous allocation
class ContiguousKVCache:
    """Traditional approach: one big buffer per request."""
    def __init__(self, max_seq_len: int, layer_size: int):
        # allocates max_seq_len worth of memory upfront,
        # even if the request only uses 100 of 2048 tokens
        self.buffer = torch.zeros(max_seq_len, layer_size)
        self.used = 0

    def append(self, new_kv: torch.Tensor):
        length = new_kv.shape[0]
        self.buffer[self.used : self.used + length] = new_kv
        self.used += length

class PagedKVCache:
    """PagedAttention approach: allocate pages on demand."""
    PAGE_SIZE = 16  # tokens per page

    def __init__(self, page_pool: list):
        self.pages = []
        self.page_pool = page_pool  # shared pool across all requests
        self.token_count = 0

    def append(self, new_kv: torch.Tensor):
        # allocate a new page only when the current one fills up
        if self.token_count % self.PAGE_SIZE == 0:
            page = self.page_pool.pop()  # grab a free page from the shared pool
            self.pages.append(page)
        # write into the current page
        slot = self.token_count % self.PAGE_SIZE
        self.pages[-1][slot] = new_kv
        self.token_count += 1
```

With contiguous allocation, a request that uses 100 tokens out of a 2048-token buffer wastes 1,948 slots that no other request can use. With paging, those slots stay in the pool and get allocated to other requests.
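To put numbers on that waste, here is a toy accounting comparison; the request lengths and page size below are made up for illustration:

```python
import math

# slots wasted by contiguous vs paged allocation for a batch of requests
MAX_SEQ_LEN = 2048
PAGE_SIZE = 16
request_lengths = [100, 700, 2048, 350]  # four hypothetical active requests

# contiguous: every request reserves MAX_SEQ_LEN slots upfront
contiguous_wasted = MAX_SEQ_LEN * len(request_lengths) - sum(request_lengths)

# paged: waste is only the unused tail of each request's last page
paged_reserved = sum(math.ceil(n / PAGE_SIZE) * PAGE_SIZE for n in request_lengths)
paged_wasted = paged_reserved - sum(request_lengths)

print(f"contiguous wasted slots: {contiguous_wasted}")
print(f"paged wasted slots:      {paged_wasted}")
```

For this batch, contiguous allocation strands thousands of slots while paging wastes at most PAGE_SIZE - 1 slots per request, which is the core of the memory efficiency claim in the vLLM paper.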
KV cache reuse
TensorRT-LLM documents cache reuse techniques for requests that begin with the same prompt prefix. Reusing prompt-prefix cache pages can reduce first-token latency and save prefill computation in multi-turn applications or systems with a common system prompt. The NVIDIA developer blog post on KV cache reuse in TensorRT-LLM covers the implementation details and the latency gains they observed.
In practice this matters a lot for products that have a long, fixed system prompt. Every user request starts with that same prompt. Without cache reuse, every request pays the full prefill cost. With reuse, the system prompt cache is computed once and shared across thousands of requests.
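The accounting can be sketched with a toy prefix cache. Everything here is illustrative: the hash-keyed store, the `prefill_cost` stand-in, and the placeholder page object are invented for the sketch, while real systems (TensorRT-LLM, vLLM prefix caching) match page-aligned token-id prefixes against cached blocks:

```python
import hashlib

def prefill_cost(tokens):
    # stand-in for real prefill work: proportional to token count
    return len(tokens)

class PrefixCache:
    def __init__(self):
        self.store = {}          # prefix hash -> cached KV pages (opaque here)
        self.prefill_tokens = 0  # tokens actually prefilled

    def get_or_prefill(self, system_prompt, user_tokens):
        key = hashlib.sha256(" ".join(system_prompt).encode()).hexdigest()
        if key not in self.store:
            # only the first request with this prefix pays its prefill cost
            self.prefill_tokens += prefill_cost(system_prompt)
            self.store[key] = f"kv-pages-{key[:8]}"  # placeholder for real pages
        # the user-specific suffix always needs prefill
        self.prefill_tokens += prefill_cost(user_tokens)
        return self.store[key]

cache = PrefixCache()
system = ["you", "are", "a", "helpful", "assistant"] * 20  # 100-token system prompt
for i in range(1000):
    cache.get_or_prefill(system, ["user", "question", str(i)])

print(f"tokens prefilled with reuse:    {cache.prefill_tokens:,}")
print(f"tokens prefilled without reuse: {1000 * (len(system) + 3):,}")
```

With a 100-token shared prefix and short user turns, reuse cuts prefill work by more than an order of magnitude in this toy setup, which is why the technique matters most when the fixed system prompt is long relative to the per-request input.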
Cache quantization and architectural reduction
Recent work also explores reducing KV cache size through quantization and architectural choices.
Hugging Face has published practical guidance on KV cache quantization, including techniques that store cache tensors in INT8 or INT4 rather than bfloat16. This roughly halves or quarters the cache memory footprint at the cost of small accuracy degradation on some tasks.
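To see the mechanics, here is a minimal per-token absmax INT8 quantizer for a cache tensor. This is a simplified sketch, not the scheme Transformers actually ships, and the tensor shape is made up:

```python
import torch

def quantize_kv_int8(kv: torch.Tensor):
    # per-token absmax quantization: each row's largest value maps to 127
    scale = kv.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = (kv / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

torch.manual_seed(0)
k = torch.randn(1, 8, 128, 64)  # (batch, kv_heads, seq, head_dim), made-up shape
q_int8, scale = quantize_kv_int8(k)

orig_bytes = k.numel() * 2                        # bfloat16 storage
quant_bytes = q_int8.numel() + scale.numel() * 2  # int8 payload + 2-byte scales
err = (dequantize_kv(q_int8, scale) - k).abs().max().item()
print(f"memory: {orig_bytes} -> {quant_bytes} bytes")
print(f"max abs reconstruction error: {err:.4f}")
```

The scales add a small overhead on top of the int8 payload, so the saving is slightly under 2x here; INT4 schemes push the same idea further at the cost of larger reconstruction error.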
Architectural approaches go further. Multi-Query Attention (MQA) shares one set of key and value heads across all query heads. Grouped-Query Attention (GQA) shares one KV pair across a group of query heads. Both reduce the total key and value state that must be stored. A model with 32 query heads and 8 KV heads under GQA stores one quarter of the KV data per token compared with a full MHA model with 32 KV heads. LLaMA 2 70B, Mistral 7B, and most serious open models released after 2023 use GQA as the default.
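The mechanics are simple: the cache holds only the reduced number of KV heads, and each cached head is broadcast to its group of query heads at attention time. A minimal sketch (shapes are illustrative, and production kernels index into the shared heads rather than materializing the repeated tensors):

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v, num_kv_groups):
    # q: (B, Hq, T, D); k, v: (B, Hkv, T, D) with Hq = Hkv * num_kv_groups
    # each KV head serves num_kv_groups consecutive query heads
    k = k.repeat_interleave(num_kv_groups, dim=1)
    v = v.repeat_interleave(num_kv_groups, dim=1)
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

B, Hq, Hkv, T, D = 1, 32, 8, 16, 64
q = torch.randn(B, Hq, T, D)
k = torch.randn(B, Hkv, T, D)  # only 8 KV heads cached instead of 32
v = torch.randn(B, Hkv, T, D)
out = gqa_attention(q, k, v, num_kv_groups=Hq // Hkv)
print(out.shape)  # torch.Size([1, 32, 16, 64])
```

The output still has all 32 query heads; only what gets cached shrinks, which is why GQA costs little quality for a 4x cache reduction in this configuration.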
DeepSeek went further with Multi-Head Latent Attention (MLA), compressing the KV representation into a shared low-rank latent vector. The KV cache reduction in DeepSeek-V2 compared to a full MHA model was 93.3%, which is not a typo.
```python
# per-token KV cache size for different attention configurations
configs = [
    # (name, layers, kv_heads, head_dim)
    ("Full MHA (GPT-2 scale)", 12, 12, 64),
    ("GQA 4:1 ratio", 12, 3, 64),
    ("GQA 8:1 ratio (LLaMA)", 32, 4, 128),
    ("MQA (1 KV head)", 12, 1, 64),
]

print(f"{'Config':<30} | {'KV heads':>8} | {'Cache/token (KB)':>16}")
print("-" * 60)
for name, layers, kv_heads, head_dim in configs:
    bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # 2 for K+V, 2 bytes per bf16 value
    kb_per_token = bytes_per_token / 1024
    print(f"{name:<30} | {kv_heads:>8} | {kb_per_token:>16.2f}")
```

```
Config                         | KV heads | Cache/token (KB)
------------------------------------------------------------
Full MHA (GPT-2 scale)         |       12 |            36.00
GQA 4:1 ratio                  |        3 |             9.00
GQA 8:1 ratio (LLaMA)          |        4 |            64.00
MQA (1 KV head)                |        1 |             3.00
```

KV cache in modern LLM deployment
In modern LLM deployment, KV cache is central to nearly every high-performance serving stack. Tooling and serving stacks such as Hugging Face Transformers, vLLM, and TensorRT-LLM all document KV cache behavior explicitly because it has a direct effect on latency, throughput, and the maximum number of concurrent requests a system can handle.
The engineering challenge is no longer only whether to cache, but how to manage cached memory efficiently across requests, prompts, GPUs, and long-context workloads. This makes KV cache both a model-level optimization and a systems-level design problem.
Some of the active areas in 2024 and 2025:
Prefix caching for shared system prompts, where the same cached pages are reused across every request in a deployment. This can cut the cost of the prefill phase by 50 to 80 percent in products with long fixed system prompts.
Offloading cold cache pages to CPU memory or NVMe storage and loading them back on demand. Slower than keeping everything on GPU but allows serving contexts that would not otherwise fit.
Speculative decoding, where a smaller draft model generates candidate tokens that the main model verifies in parallel. The KV cache of the main model only gets extended when the verification accepts the draft token. This does not change the cache structure but changes how frequently it grows.
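Of these, offloading is the simplest to sketch. A minimal version, assuming the cache is a (keys, values) tensor pair; the shapes are illustrative, `"cpu"` stands in for both tiers so the snippet runs anywhere, and a real system would use pinned host memory and asynchronous copies overlapped with compute:

```python
import torch

def offload_kv(cache):
    # park a (keys, values) pair in host memory
    return tuple(t.to("cpu") for t in cache)

def restore_kv(cache, device):
    # bring the pair back when the conversation resumes
    return tuple(t.to(device) for t in cache)

k = torch.randn(1, 8, 4096, 128)  # (batch, kv_heads, seq, head_dim)
v = torch.randn(1, 8, 4096, 128)
cold = offload_kv((k, v))         # in a real system this frees GPU memory
host_mb = sum(t.numel() * t.element_size() for t in cold) / 1e6
print(f"{host_mb:.1f} MB parked on host")
```

The tradeoff is the transfer latency on restore: a multi-gigabyte cache takes long enough to copy back over PCIe that schedulers only offload requests unlikely to resume soon.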
What this means in practice
KV cache is not merely an implementation detail. It is a central determinant of inference performance, scalability, and cost. Its main benefit is faster decoding and lower redundant computation, which makes interactive LLM applications practical at scale. At the same time, KV cache introduces significant memory costs that grow with context length and concurrency.
Understanding this tradeoff is what separates developers who can reason about LLM system design from those who treat inference as a black box. When a system runs out of memory mid-conversation, when first-token latency spikes under load, or when throughput drops as context length grows, the KV cache is usually somewhere in the explanation.
The vLLM PagedAttention paper is worth reading if you are building serving infrastructure. The Hugging Face cache strategies guide covers the different cache modes in the Transformers library and when each one applies. For production deployment on NVIDIA hardware, the TensorRT-LLM documentation covers the implementation specifics including cache reuse.
External references
- Hugging Face Transformers: Caching
- Hugging Face Transformers: Cache strategies
- Hugging Face Blog: KV caching explained
- Hugging Face Blog: Unlocking longer generation with KV cache quantization
- vLLM / PagedAttention paper (arXiv:2309.06180)
- TensorRT-LLM documentation
- TensorRT-LLM: KV cache reuse
- NVIDIA developer blog: Introducing new KV cache reuse optimizations in TensorRT-LLM
Krunal Kanojiya
Technical Content Writer
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.