KV Cache Explained: How LLMs Generate Text Without Recomputing Everything
The KV cache is the reason your LLM can generate text fast without recomputing the entire conversation at every step. This article explains how key-value caching works in transformer inference, why it is both essential and expensive, and how modern serving techniques such as vLLM's PagedAttention, grouped-query attention (GQA), and cache quantization manage it at scale.
When you send a long message to an LLM and it responds quickly, the KV cache is doing a lot of that work. Without it, generating each new token would require the model to recompute the key and value projections for the entire conversation history from scratch. Every single time. For a 4,000-token context, that means redoing attention work for all 4,000 positions for every token generated.
That would be unusably slow.
KV cache is not a recent trick. It has been standard in transformer inference since the beginning. But understanding it properly matters more now than it did when context windows were 512 tokens. At 128K context lengths, managing the cache well is the difference between a system that serves users and one that runs out of GPU memory trying.
This article covers what KV cache is, how it works, where it becomes a problem, and how modern serving systems deal with it.
What KV cache is
In self-attention, each token gets projected into three representations: query (Q), key (K), and value (V). During autoregressive generation, the current token's query attends over the keys and values of all previous tokens. Since the keys and values of earlier tokens do not change during inference, recomputing them at every step is wasteful.
KV cache stores these previously computed key and value tensors so the model only needs to compute the new token's attention projections once and append them to the existing cache.
In practical terms, KV cache turns repeated full-history recomputation into incremental decoding. Instead of rebuilding attention state from scratch for the entire prefix, the model reuses stored representations for prior tokens and computes attention against the accumulated cache.
```python
import torch
import torch.nn.functional as F

class AttentionWithKVCache(torch.nn.Module):
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.q_proj = torch.nn.Linear(embed_dim, embed_dim, bias=False)
        self.k_proj = torch.nn.Linear(embed_dim, embed_dim, bias=False)
        self.v_proj = torch.nn.Linear(embed_dim, embed_dim, bias=False)
        self.out_proj = torch.nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(
        self,
        x: torch.Tensor,
        past_keys: torch.Tensor = None,
        past_values: torch.Tensor = None,
    ):
        B, T, C = x.shape
        H, D = self.num_heads, self.head_dim
        q = self.q_proj(x).view(B, T, H, D).transpose(1, 2)
        k = self.k_proj(x).view(B, T, H, D).transpose(1, 2)
        v = self.v_proj(x).view(B, T, H, D).transpose(1, 2)
        # append new K and V to the cache
        if past_keys is not None:
            k = torch.cat([past_keys, k], dim=2)  # (B, H, T_total, D)
            v = torch.cat([past_values, v], dim=2)
        # store updated cache for next decoding step
        new_cache = (k, v)
        # attention over accumulated K/V
        scale = D ** -0.5
        scores = (q @ k.transpose(-2, -1)) * scale
        weights = F.softmax(scores, dim=-1)
        out = weights @ v
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.out_proj(out), new_cache

attn = AttentionWithKVCache(embed_dim=512, num_heads=8)
cache = (None, None)

# decoding loop: one token at a time
for step in range(10):
    x_step = torch.randn(1, 1, 512)  # one new token
    out, cache = attn(x_step, cache[0], cache[1])
    cached_len = 0 if cache[0] is None else cache[0].shape[2]
    print(f"step {step+1:2d} | cached tokens: {cached_len}")
```

```
step  1 | cached tokens: 1
step  2 | cached tokens: 2
step  3 | cached tokens: 3
...
step 10 | cached tokens: 10
```

The cache grows by exactly one token per decoding step. The attention computation for step 10 reads from 10 cached positions but only computes new Q, K, V for the one new token. Without the cache, it would recompute all 10 from scratch every time.
The two phases of inference: prefill and decode
KV cache operates differently across two distinct inference stages. Understanding the split matters for latency optimization because the two phases have completely different bottlenecks.
Prefill stage
During prefill, the model processes the entire input prompt and computes keys and values for all input tokens across all transformer layers. These tensors are stored in memory. Prefill runs in parallel across all input tokens, so it is compute-bound on long prompts. The output of the prefill stage is the KV cache populated for all prompt tokens, plus the first generated token.
Decode stage
During decode, the model generates one token at a time. For each new token it computes only the new Q, K, V, appends K and V to the existing cache, and attends over the full accumulated cache. Decode is memory-bandwidth-bound rather than compute-bound because the bottleneck is reading the cached tensors from GPU memory, not doing arithmetic on them.
```python
import time
import torch

def profile_two_phases(model, tokenizer, prompt: str, max_new_tokens: int = 50):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # prefill: process entire prompt, get first token
    t0 = time.perf_counter()
    with torch.no_grad():
        first_out = model.generate(input_ids, max_new_tokens=1, use_cache=True)
    prefill_ms = (time.perf_counter() - t0) * 1000

    # decode: generate remaining tokens one at a time
    # (note: this second generate() call re-runs prefill on the prompt, so
    # decode_ms is a slight overestimate; good enough for a rough profile)
    t0 = time.perf_counter()
    with torch.no_grad():
        full_out = model.generate(first_out, max_new_tokens=max_new_tokens - 1, use_cache=True)
    decode_ms = (time.perf_counter() - t0) * 1000

    tps = (max_new_tokens - 1) / (decode_ms / 1000)
    print(f"prompt tokens: {input_ids.shape[1]}")
    print(f"prefill time: {prefill_ms:.1f}ms (time to first token)")
    print(f"decode time: {decode_ms:.1f}ms for {max_new_tokens - 1} tokens")
    print(f"tokens per second: {tps:.1f}")
```

Time to first token (TTFT) is dominated by prefill. Sustained tokens per second after that is dominated by how quickly the GPU can read the KV cache. These are different problems with different solutions, which is why inference optimization papers often treat them separately.
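A useful corollary: because decode is bandwidth-bound, you can estimate an upper bound on tokens per second from memory bandwidth alone, since each generated token must stream the weights and the cache through the memory system once. A back-of-envelope sketch with illustrative round numbers (not measurements of any specific GPU or model):

```python
# back-of-envelope decode ceiling: each generated token must read the
# weights plus the KV cache from GPU memory once
bandwidth_gb_s = 2000   # illustrative: ~2 TB/s of HBM bandwidth
weights_gb = 14         # illustrative: a 7B-parameter model in bfloat16
kv_cache_gb = 2         # illustrative: cache for the active batch

upper_bound_tps = bandwidth_gb_s / (weights_gb + kv_cache_gb)
print(f"bandwidth-bound ceiling: {upper_bound_tps:.0f} tokens/s")
```

Real systems land below this ceiling, but the estimate makes the point: as the cache grows, the denominator grows, and per-token decode speed falls.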
Why KV cache matters
The primary benefit is lower decoding latency. Since the model avoids recomputation of historical attention states, each new token can be produced faster than in a no-cache implementation. This is especially relevant for interactive systems such as chatbots, copilots, and real-time assistants where users feel latency directly.
The second benefit is improved serving efficiency. Efficient KV caching reduces the computation required per generated token, which helps inference systems make better use of available hardware. At production scale, this directly affects throughput and cost per token.
The Hugging Face Transformers documentation on caching explains the standard cache interface and how it maps to the internal attention implementation. The cache strategies guide covers the different cache modes available and when to use each.
The memory problem
KV cache improves speed but increases memory pressure. Each request must store cached key and value tensors for every active token across the relevant transformer layers. As sequence length grows, the cache grows roughly linearly, reducing the number of requests that can fit into GPU memory at the same time.
```python
def kv_cache_size_gb(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    dtype_bytes: int = 2,  # bfloat16 = 2 bytes
) -> float:
    """Estimate KV cache memory in GB."""
    # 2 for K and V
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return (elements * dtype_bytes) / 1e9

# LLaMA-2 70B configuration: 80 layers, 8 KV heads (GQA), 128 head dim
kv_gb = kv_cache_size_gb(
    num_layers=80,
    num_kv_heads=8,
    head_dim=128,
    seq_len=8192,
    batch_size=32,
)
print(f"KV cache at 8K context, batch 32: {kv_gb:.1f} GB")

# without GQA (full MHA with 64 heads)
kv_mha = kv_cache_size_gb(
    num_layers=80,
    num_kv_heads=64,
    head_dim=128,
    seq_len=8192,
    batch_size=32,
)
print(f"MHA KV cache at 8K context, batch 32: {kv_mha:.1f} GB")
```

```
KV cache at 8K context, batch 32: 85.9 GB
MHA KV cache at 8K context, batch 32: 687.2 GB
```

That 687 GB number is why full Multi-Head Attention became impractical at scale before GQA was adopted. The weights of a 70B model in bfloat16 take about 140 GB; a full-MHA KV cache at just 8K context and batch 32 would add nearly five times that again, far beyond any single-node GPU configuration. GQA cuts the KV footprint by 8x in this configuration, and even then the cache is a meaningful fraction of the weights.
Two other limitations are worth knowing. First, KV caching is designed for inference rather than training. In common transformer tooling, caching is documented as an inference optimization and may cause errors or unexpected behavior if enabled during training workflows. Second, naive cache allocation can fragment GPU memory badly in multi-user serving scenarios where many requests grow and terminate at different times, wasting reserved space that cannot be used by other requests.
Advanced KV cache optimizations
Because naive caching can consume very large amounts of memory, several optimization strategies have emerged.
PagedAttention
PagedAttention organizes KV cache into fixed-size blocks or pages rather than large contiguous buffers. This idea, introduced in the vLLM work (arXiv:2309.06180), reduces memory waste from fragmentation and enables more flexible sharing. The authors report significant throughput gains compared with earlier inference systems.
The intuition is borrowed from operating system virtual memory management. Instead of allocating one large contiguous block per request that may be partially unused, the system allocates fixed pages and assigns them as needed. When a request finishes, its pages are freed and immediately available to other requests.
```python
import torch

# conceptual illustration of paged vs contiguous allocation
class ContiguousKVCache:
    """Traditional approach: one big buffer per request."""
    def __init__(self, max_seq_len: int, layer_size: int):
        # allocates max_seq_len worth of memory upfront,
        # even if the request only uses 100 of 2048 tokens
        self.buffer = torch.zeros(max_seq_len, layer_size)
        self.used = 0

    def append(self, new_kv: torch.Tensor):
        length = new_kv.shape[0]
        self.buffer[self.used : self.used + length] = new_kv
        self.used += length

class PagedKVCache:
    """PagedAttention approach: allocate pages on demand."""
    PAGE_SIZE = 16  # tokens per page

    def __init__(self, page_pool: list):
        self.pages = []
        self.page_pool = page_pool  # shared pool across all requests
        self.token_count = 0

    def append(self, new_kv: torch.Tensor):
        # allocate a new page only when the current one fills up
        if self.token_count % self.PAGE_SIZE == 0:
            page = self.page_pool.pop()  # grab a free page from the shared pool
            self.pages.append(page)
        # write into the current page
        slot = self.token_count % self.PAGE_SIZE
        self.pages[-1][slot] = new_kv
        self.token_count += 1
```

With contiguous allocation, a request that uses 100 tokens out of a 2048-token buffer wastes 1,948 slots that no other request can use. With paging, those slots stay in the pool and get allocated to other requests.
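To put numbers on that waste, here is a toy accounting comparison; the request lengths and page size below are made up for illustration:

```python
import math

# slots wasted by contiguous vs paged allocation for a batch of requests
MAX_SEQ_LEN = 2048
PAGE_SIZE = 16
request_lengths = [100, 700, 2048, 350]  # four hypothetical active requests

# contiguous: every request reserves MAX_SEQ_LEN slots upfront
contiguous_wasted = MAX_SEQ_LEN * len(request_lengths) - sum(request_lengths)

# paged: waste is only the unused tail of each request's last page
paged_reserved = sum(math.ceil(n / PAGE_SIZE) * PAGE_SIZE for n in request_lengths)
paged_wasted = paged_reserved - sum(request_lengths)

print(f"contiguous wasted slots: {contiguous_wasted}")
print(f"paged wasted slots:      {paged_wasted}")
```

For this batch, contiguous allocation strands thousands of slots while paging wastes at most PAGE_SIZE - 1 slots per request, which is the core of the memory efficiency claim in the vLLM paper.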
KV cache reuse
TensorRT-LLM documents cache reuse techniques for requests that begin with the same prompt prefix. Reusing prompt-prefix cache pages can reduce first-token latency and save prefill computation in multi-turn applications or systems with a common system prompt. The NVIDIA developer blog post on KV cache reuse in TensorRT-LLM covers the implementation details and the latency gains they observed.
In practice this matters a lot for products that have a long, fixed system prompt. Every user request starts with that same prompt. Without cache reuse, every request pays the full prefill cost. With reuse, the system prompt cache is computed once and shared across thousands of requests.
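The accounting can be sketched with a toy prefix cache. Everything here is illustrative: the hash-keyed store, the `prefill_cost` stand-in, and the placeholder page object are invented for the sketch, while real systems (TensorRT-LLM, vLLM prefix caching) match page-aligned token-id prefixes against cached blocks:

```python
import hashlib

def prefill_cost(tokens):
    # stand-in for real prefill work: proportional to token count
    return len(tokens)

class PrefixCache:
    def __init__(self):
        self.store = {}          # prefix hash -> cached KV pages (opaque here)
        self.prefill_tokens = 0  # tokens actually prefilled

    def get_or_prefill(self, system_prompt, user_tokens):
        key = hashlib.sha256(" ".join(system_prompt).encode()).hexdigest()
        if key not in self.store:
            # only the first request with this prefix pays its prefill cost
            self.prefill_tokens += prefill_cost(system_prompt)
            self.store[key] = f"kv-pages-{key[:8]}"  # placeholder for real pages
        # the user-specific suffix always needs prefill
        self.prefill_tokens += prefill_cost(user_tokens)
        return self.store[key]

cache = PrefixCache()
system = ["you", "are", "a", "helpful", "assistant"] * 20  # 100-token system prompt
for i in range(1000):
    cache.get_or_prefill(system, ["user", "question", str(i)])

print(f"tokens prefilled with reuse:    {cache.prefill_tokens:,}")
print(f"tokens prefilled without reuse: {1000 * (len(system) + 3):,}")
```

With a 100-token shared prefix and short user turns, reuse cuts prefill work by more than an order of magnitude in this toy setup, which is why the technique matters most when the fixed system prompt is long relative to the per-request input.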
Cache quantization and architectural reduction
Recent work also explores reducing KV cache size through quantization and architectural choices.
Hugging Face has published practical guidance on KV cache quantization, including techniques that store cache tensors in INT8 or INT4 rather than bfloat16. This roughly halves or quarters the cache memory footprint at the cost of small accuracy degradation on some tasks.
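To see the mechanics, here is a minimal per-token absmax INT8 quantizer for a cache tensor. This is a simplified sketch, not the scheme Transformers actually ships, and the tensor shape is made up:

```python
import torch

def quantize_kv_int8(kv: torch.Tensor):
    # per-token absmax quantization: each row's largest value maps to 127
    scale = kv.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = (kv / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

torch.manual_seed(0)
k = torch.randn(1, 8, 128, 64)  # (batch, kv_heads, seq, head_dim), made-up shape
q_int8, scale = quantize_kv_int8(k)

orig_bytes = k.numel() * 2                        # bfloat16 storage
quant_bytes = q_int8.numel() + scale.numel() * 2  # int8 payload + 2-byte scales
err = (dequantize_kv(q_int8, scale) - k).abs().max().item()
print(f"memory: {orig_bytes} -> {quant_bytes} bytes")
print(f"max abs reconstruction error: {err:.4f}")
```

The scales add a small overhead on top of the int8 payload, so the saving is slightly under 2x here; INT4 schemes push the same idea further at the cost of larger reconstruction error.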
Architectural approaches go further. Multi-Query Attention (MQA) shares one set of key and value heads across all query heads. Grouped-Query Attention (GQA) shares one KV pair across a group of query heads. Both reduce the total key and value state that must be stored. A model with 32 query heads and 8 KV heads under GQA stores one quarter of the KV data per token compared with a full MHA model with 32 KV heads. LLaMA 2 70B, Mistral 7B, and most serious open models released after 2023 use GQA as the default.
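The mechanics are simple: the cache holds only the reduced number of KV heads, and each cached head is broadcast to its group of query heads at attention time. A minimal sketch (shapes are illustrative, and production kernels index into the shared heads rather than materializing the repeated tensors):

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v, num_kv_groups):
    # q: (B, Hq, T, D); k, v: (B, Hkv, T, D) with Hq = Hkv * num_kv_groups
    # each KV head serves num_kv_groups consecutive query heads
    k = k.repeat_interleave(num_kv_groups, dim=1)
    v = v.repeat_interleave(num_kv_groups, dim=1)
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

B, Hq, Hkv, T, D = 1, 32, 8, 16, 64
q = torch.randn(B, Hq, T, D)
k = torch.randn(B, Hkv, T, D)  # only 8 KV heads cached instead of 32
v = torch.randn(B, Hkv, T, D)
out = gqa_attention(q, k, v, num_kv_groups=Hq // Hkv)
print(out.shape)  # torch.Size([1, 32, 16, 64])
```

The output still has all 32 query heads; only what gets cached shrinks, which is why GQA costs little quality for a 4x cache reduction in this configuration.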
DeepSeek went further with Multi-Head Latent Attention (MLA), compressing the KV representation into a shared low-rank latent vector. The KV cache reduction in DeepSeek-V2 compared to a full MHA model was 93.3%, which is not a typo.
```python
# per-token KV cache size for different attention configurations
configs = [
    # (name, layers, kv_heads, head_dim)
    ("Full MHA (GPT-2 scale)", 12, 12, 64),
    ("GQA 4:1 ratio", 12, 3, 64),
    ("GQA 8:1 ratio (LLaMA)", 32, 4, 128),
    ("MQA (1 KV head)", 12, 1, 64),
]

print(f"{'Config':<30} | {'KV heads':>8} | {'Cache/token (KB)':>16}")
print("-" * 60)
for name, layers, kv_heads, head_dim in configs:
    bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # 2 for K+V, 2 bytes per bf16 value
    kb_per_token = bytes_per_token / 1024
    print(f"{name:<30} | {kv_heads:>8} | {kb_per_token:>16.2f}")
```

```
Config                         | KV heads | Cache/token (KB)
------------------------------------------------------------
Full MHA (GPT-2 scale)         |       12 |            36.00
GQA 4:1 ratio                  |        3 |             9.00
GQA 8:1 ratio (LLaMA)          |        4 |            64.00
MQA (1 KV head)                |        1 |             3.00
```

KV cache in modern LLM deployment
In modern LLM deployment, KV cache is central to nearly every high-performance serving stack. Tooling and serving stacks such as Hugging Face Transformers, vLLM, and TensorRT-LLM all document KV cache behavior explicitly because it has a direct effect on latency, throughput, and the maximum number of concurrent requests a system can handle.
The engineering challenge is no longer only whether to cache, but how to manage cached memory efficiently across requests, prompts, GPUs, and long-context workloads. This makes KV cache both a model-level optimization and a systems-level design problem.
Some of the active areas in 2024 and 2025:
Prefix caching for shared system prompts, where the same cached pages are reused across every request in a deployment. This can cut the cost of the prefill phase by 50 to 80 percent in products with long fixed system prompts.
Offloading cold cache pages to CPU memory or NVMe storage and loading them back on demand. Slower than keeping everything on GPU but allows serving contexts that would not otherwise fit.
Speculative decoding, where a smaller draft model generates candidate tokens that the main model verifies in parallel. The KV cache of the main model only gets extended when the verification accepts the draft token. This does not change the cache structure but changes how frequently it grows.
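Of these, offloading is the simplest to sketch. A minimal version, assuming the cache is a (keys, values) tensor pair; the shapes are illustrative, `"cpu"` stands in for both tiers so the snippet runs anywhere, and a real system would use pinned host memory and asynchronous copies overlapped with compute:

```python
import torch

def offload_kv(cache):
    # park a (keys, values) pair in host memory
    return tuple(t.to("cpu") for t in cache)

def restore_kv(cache, device):
    # bring the pair back when the conversation resumes
    return tuple(t.to(device) for t in cache)

k = torch.randn(1, 8, 4096, 128)  # (batch, kv_heads, seq, head_dim)
v = torch.randn(1, 8, 4096, 128)
cold = offload_kv((k, v))         # in a real system this frees GPU memory
host_mb = sum(t.numel() * t.element_size() for t in cold) / 1e6
print(f"{host_mb:.1f} MB parked on host")
```

The tradeoff is the transfer latency on restore: a multi-gigabyte cache takes long enough to copy back over PCIe that schedulers only offload requests unlikely to resume soon.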
What this means in practice
KV cache is not merely an implementation detail. It is a central determinant of inference performance, scalability, and cost. Its main benefit is faster decoding and lower redundant computation, which makes interactive LLM applications practical at scale. At the same time, KV cache introduces significant memory costs that grow with context length and concurrency.
Understanding this tradeoff is what separates developers who can reason about LLM system design from those who treat inference as a black box. When a system runs out of memory mid-conversation, when first-token latency spikes under load, or when throughput drops as context length grows, the KV cache is usually somewhere in the explanation.
The vLLM PagedAttention paper is worth reading if you are building serving infrastructure. The Hugging Face cache strategies guide covers the different cache modes in the Transformers library and when each one applies. For production deployment on NVIDIA hardware, the TensorRT-LLM documentation covers the implementation specifics including cache reuse.
External references
- Hugging Face Transformers: Caching
- Hugging Face Transformers: Cache strategies
- Hugging Face Blog: KV caching explained
- Hugging Face Blog: Unlocking longer generation with KV cache quantization
- vLLM / PagedAttention paper (arXiv:2309.06180)
- TensorRT-LLM documentation
- TensorRT-LLM: KV cache reuse
- NVIDIA developer blog: Introducing new KV cache reuse optimizations in TensorRT-LLM
Krunal Kanojiya
Technical Content Writer
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.