Evaluation, Inference, and Deployment: Shipping an LLM Product That Actually Works
Building a model is one thing. Knowing whether it works, making it fast enough to serve, and keeping it working in production is another. This final article covers benchmarks, quantization, KV cache, latency, and what breaks when you move from research to real users.
Nine articles ago we started with vectors and dot products. This is where it all lands in production.
You have built models from scratch, understood how pre-training works, seen how alignment shapes behavior, and learned how to prompt them effectively. What happens when you have to actually ship something? The benchmarks you pick determine whether you can trust your model. Inference optimization determines whether users wait 200 milliseconds or 8 seconds. Deployment decisions determine whether the thing keeps working when real traffic arrives.
This is the final article in the series on AI and ML fundamentals. It connects directly to Articles 8 and 9. The fine-tuned and aligned model from Article 8 needs to be evaluated. The prompting and RAG pipelines from Article 9 need to run fast enough to be useful in production.
The benchmark problem
Benchmarks are proxies. The goal is not to score well on benchmarks. The goal is for the model to be useful to your users for your specific task. Those two things can diverge significantly.
MMLU was the dominant evaluation benchmark from 2021 to 2024. It tests knowledge across 57 subjects from elementary math to professional law and medicine. By mid-2024, top frontier models scored 90% or above, making it nearly useless for differentiating between them. Researchers responded with MMLU-Pro, which uses 10 answer choices instead of 4 and harder questions.
HumanEval tests code generation across 164 Python problems. As of 2025, top models score near or above 90%. It was also increasingly suspected that some models had seen these exact problems in training data, inflating scores. SWE-bench Verified, which tests real-world GitHub issue resolution in actual codebases, is now more meaningful for code evaluation because it is much harder to contaminate.
```python
# a practical benchmark selection framework
def select_benchmarks(use_case: str) -> list:
    """
    Map your production use case to meaningful benchmarks.
    Do not default to MMLU just because it is famous.
    """
    benchmark_map = {
        "code_generation": [
            "SWE-bench Verified",  # real GitHub issues, hard to contaminate
            "HumanEval+",          # extended test suite, catches edge cases
            "LiveCodeBench",       # fresh problems, lower contamination risk
        ],
        "reasoning_assistant": [
            "GPQA-Diamond",  # PhD-level reasoning, not yet saturated
            "MMLU-Pro",      # harder version of MMLU
            "MATH-500",      # math reasoning benchmark
        ],
        "instruction_following": [
            "IFEval",    # tests whether model follows specific instructions
            "MT-Bench",  # multi-turn dialogue quality
        ],
        "knowledge_qa": [
            "MMLU-Pro",
            "SimpleQA",    # factual accuracy, measures hallucination
            "TruthfulQA",  # tests whether model avoids confident wrong answers
        ],
        "general_purpose": [
            "GPQA-Diamond",
            "SWE-bench Verified",
            "IFEval",
            "LiveCodeBench",
        ],
    }
    return benchmark_map.get(use_case, benchmark_map["general_purpose"])

# choosing the wrong benchmark gives you a false sense of model quality
# MMLU scores above 90% tell you almost nothing about production performance
```

The current benchmark hierarchy in 2026 for frontier models: GPQA-Diamond still differentiates (Claude Opus 4.6 at 91.3%, GPT-5.3 Codex at 81%), SWE-bench Verified still differentiates (Claude Opus 4.6 at 80.8%), and Humanity's Last Exam (HLE) is the hardest current benchmark, designed to be unsolvable by 2024-era models. For non-frontier models, MMLU and HumanEval still provide useful baseline checks.
The most important benchmark you will run is your own. Build an evaluation set from real production examples. Sample 200 to 500 real user queries from your logs, have someone label the ideal responses, and measure your model against those. Public benchmarks tell you about general capability. Your eval set tells you about your actual problem.
Evaluating in code
Evaluation is code, not a spreadsheet. Automated evaluation at scale requires clear metrics, reproducible prompts, and a way to track regressions.
```python
from typing import Callable

class LLMEvaluator:
    """Simple framework for automated LLM evaluation."""

    def __init__(self, model_fn: Callable, judge_fn: Callable | None = None):
        self.model = model_fn
        self.judge = judge_fn  # optional: LLM-as-judge for open-ended tasks

    def evaluate_exact_match(self, dataset: list[dict]) -> dict:
        """For tasks with ground-truth labels (classification, extraction)."""
        correct = 0
        total = len(dataset)
        errors = []
        for example in dataset:
            response = self.model(example["prompt"])
            predicted = self._extract_answer(response)
            expected = example["expected"]
            if predicted == expected:
                correct += 1
            else:
                errors.append({
                    "prompt": example["prompt"],
                    "expected": expected,
                    "predicted": predicted,
                    "response": response,
                })
        return {
            "accuracy": correct / total,
            "correct": correct,
            "total": total,
            "errors": errors[:10],  # first 10 failures for debugging
        }

    def evaluate_with_judge(self, dataset: list[dict], criteria: str) -> dict:
        """For open-ended tasks: use an LLM to judge quality."""
        scores = []
        for example in dataset:
            response = self.model(example["prompt"])
            judge_prompt = f"""Rate the following response on a scale of 1-5.
Criteria: {criteria}
Prompt: {example['prompt']}
Response: {response}
Score (1-5) and brief reason:"""
            judgment = self.judge(judge_prompt)
            score = self._extract_score(judgment)
            scores.append(score)
        return {
            "mean_score": sum(scores) / len(scores),
            "score_dist": {i: scores.count(i) for i in range(1, 6)},
        }

    def _extract_answer(self, response: str) -> str:
        return response.strip().split("\n")[0]

    def _extract_score(self, judgment: str) -> int:
        for char in judgment:
            if char.isdigit() and int(char) in range(1, 6):
                return int(char)
        return 3  # default to middle if extraction fails
```

LLM-as-judge evaluation has become standard for open-ended tasks where ground-truth labels do not exist. You send the model's response to a stronger model (often GPT-4o or Claude) and ask it to rate quality. This scales better than human evaluation and correlates reasonably well with human judgment, though it inherits the biases of the judge model.
Inference: what actually runs when you generate a token
Understanding inference matters because it determines cost, latency, and where your optimization budget goes.
LLM inference has two distinct phases. The prefill phase processes all input tokens in parallel through the transformer's attention layers. It is fast per token, but total prefill time grows with context length (and the attention computation itself grows quadratically). The decode phase generates one token at a time autoregressively, reading from the KV cache at each step. Each step avoids recomputing earlier tokens, though the attention read still scales with the number of cached tokens, so very long contexts slow decoding too.
```python
import time
import torch

def profile_inference(model, tokenizer, prompt: str, max_new_tokens: int = 100):
    """Profile prefill vs decode phases."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # measure prefill (first token)
    start = time.perf_counter()
    with torch.no_grad():
        outputs = model.generate(input_ids, max_new_tokens=1, do_sample=False)
    prefill_time = time.perf_counter() - start
    ttft = prefill_time  # time to first token

    # measure decode (remaining tokens)
    # note: this second generate() call re-runs prefill over the prompt because
    # the KV cache is not carried across calls, so it slightly overstates decode time
    start = time.perf_counter()
    with torch.no_grad():
        outputs = model.generate(
            outputs,
            max_new_tokens=max_new_tokens - 1,
            do_sample=False,
        )
    decode_time = time.perf_counter() - start
    tokens_per_second = (max_new_tokens - 1) / decode_time

    return {
        "prompt_tokens": input_ids.shape[1],
        "ttft_seconds": ttft,
        "tps": tokens_per_second,
        "total_seconds": prefill_time + decode_time,
    }
```

The KV cache stores the key and value tensors computed for each input token so they do not need to be recomputed when generating the next token. For a 70B model with 64 attention heads and 128-dimensional head projections serving a batch of requests at 8K context, the KV cache alone can consume 8 GB or more of GPU memory. This is why DeepSeek's Multi-Head Latent Attention (MLA, covered in Article 6) reduces the KV cache by 93%, and why Grouped Query Attention (GQA) is standard in most modern large models.
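Those memory numbers fall out of a short calculation. A back-of-the-envelope sketch (the 80-layer count for a 70B-class model is an assumption, as is the 8-head GQA configuration; substitute your model's actual config):

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                context_len: int, batch_size: int = 1,
                bytes_per_value: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) per layer, per KV head, per token."""
    elements = 2 * num_layers * num_kv_heads * head_dim * context_len * batch_size
    return elements * bytes_per_value / 1e9

# 70B-class model with full multi-head attention:
# 80 layers (assumed), 64 KV heads, head_dim 128, 8K context, bf16
mha = kv_cache_gb(80, 64, 128, 8192)  # roughly 21.5 GB for one sequence
# same model with GQA using 8 KV heads instead of 64
gqa = kv_cache_gb(80, 8, 128, 8192)   # roughly 2.7 GB
```

Under these assumptions a single 8K-token sequence with full MHA already exceeds the 8 GB figure above, which is exactly why GQA and MLA shrink the KV head count or dimensionality rather than the model weights.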
Quantization: trading precision for speed and memory
A 7B model in 16-bit (bfloat16) precision requires roughly 14 GB of GPU memory. The same model in 4-bit quantization requires roughly 4 GB, which fits on a consumer GPU. The tradeoff is accuracy degradation.
```python
# comparing quantization formats in terms of memory
def model_memory_gb(params_billion: float, bits: int) -> float:
    """Estimate model weight memory in GB."""
    bytes_per_param = bits / 8
    return params_billion * 1e9 * bytes_per_param / 1e9

model_sizes = [7, 13, 34, 70]
formats = [("BF16", 16), ("INT8", 8), ("INT4", 4), ("INT3", 3)]

print(f"{'Size':<6}", end="")
for name, _ in formats:
    print(f" {name:>8}", end="")
print()
for size in model_sizes:
    print(f"{str(size) + 'B':<6}", end="")
    for name, bits in formats:
        mem = model_memory_gb(size, bits)
        print(f" {mem:>6.1f}GB", end="")
    print()
```

```
Size       BF16     INT8     INT4     INT3
7B       14.0GB    7.0GB    3.5GB    2.6GB
13B      26.0GB   13.0GB    6.5GB    4.9GB
34B      68.0GB   34.0GB   17.0GB   12.8GB
70B     140.0GB   70.0GB   35.0GB   26.3GB
```

The main quantization methods in production as of 2025:
GPTQ quantizes weights to 4-bit or 8-bit offline, before deployment. The quantized model loads and runs like a normal model. Compatible with most inference frameworks.
AWQ (Activation-Aware Weight Quantization) accounts for the distribution of activations when quantizing, reducing quality loss compared to naive weight quantization. It consistently outperforms GPTQ at the same bit width on reasoning benchmarks.
GGUF is the format used by llama.cpp, the leading framework for local inference on CPUs and consumer GPUs. It supports a range of quantization levels from Q2 to Q8. Q4_K_M and Q5_K_M are common choices that balance quality and speed for local deployment.
Recent research on Qwen2.5 and DeepSeek models shows that accuracy differences between BF16 and 4-bit quantization are consistently under 1% on MMLU, MATH, and GPQA for models above 7 billion parameters. Smaller models degrade more noticeably under aggressive quantization.
```python
# practical quantization decision guide
def choose_quantization(
    gpu_vram_gb: float,
    model_size_b: float,
    quality_priority: str = "high",
) -> str:
    """
    Recommend quantization format based on hardware and requirements.
    quality_priority: "high", "balanced", or "speed"
    """
    bf16_memory = model_size_b * 2    # GB, 2 bytes per parameter
    int8_memory = model_size_b * 1    # GB
    int4_memory = model_size_b * 0.5  # GB
    available = gpu_vram_gb * 0.85    # leave 15% headroom for KV cache and activations

    if bf16_memory <= available and quality_priority == "high":
        return "BF16 (full precision, best quality)"
    if int8_memory <= available and quality_priority in ("high", "balanced"):
        return "INT8 / GPTQ-8bit (small quality loss, recommended)"
    if int4_memory <= available:
        if quality_priority == "speed":
            return "INT4 / GPTQ-4bit (faster, measurable quality loss)"
        return "AWQ-4bit (better quality than GPTQ at same bit width)"
    return "Model too large for this GPU: consider model sharding or a smaller model"

# examples (a 24 GB card has ~20.4 GB usable after headroom)
print(choose_quantization(gpu_vram_gb=24, model_size_b=7))   # fits in BF16
print(choose_quantization(gpu_vram_gb=24, model_size_b=34))  # needs 4-bit: INT8 would take 34 GB
print(choose_quantization(gpu_vram_gb=24, model_size_b=70))  # too large even at 4-bit (35 GB)
print(choose_quantization(gpu_vram_gb=8, model_size_b=13))   # needs 4-bit
```

What breaks in production
Benchmarks run your model once, sequentially, on carefully formatted inputs. Production is not that.
Latency under load: A model that generates 80 tokens per second serving one request may produce 12 tokens per second when serving 20 concurrent requests, because GPU memory bandwidth is shared. Benchmark your latency at your expected concurrency level, not at idle.
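A minimal load-test harness makes this measurable. The sketch below uses a sleep-based stub in place of a real endpoint (swap `generate_stub` for your actual API or inference call); it demonstrates the harness shape only, not real GPU contention:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def generate_stub(prompt: str) -> str:
    """Stand-in for a real model call; replace with your inference endpoint."""
    time.sleep(0.05)  # simulate fixed generation latency
    return "response"

def measure_latency(concurrency: int, total_requests: int = 8) -> dict:
    """Fire total_requests at a given concurrency and report latency stats."""
    def timed_call(prompt: str) -> float:
        start = time.perf_counter()
        generate_stub(prompt)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, [f"q{i}" for i in range(total_requests)]))
    wall = time.perf_counter() - wall_start
    return {
        "concurrency": concurrency,
        "p50_s": latencies[len(latencies) // 2],
        "p95_s": latencies[int(len(latencies) * 0.95)],
        "requests_per_s": total_requests / wall,
    }

low = measure_latency(concurrency=1)
high = measure_latency(concurrency=8)
# against a real model, expect per-request latency to rise and per-request
# tokens/sec to fall as concurrency grows; the stub cannot show that contention
```

Run this against staging at your expected peak concurrency and record p50/p95, not just the single-request number.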
Context length edge cases: Your RAG pipeline retrieves 3,000 tokens of context per request. Your average user sends 200-token messages. But some users send 2,000-token messages with 3,000 tokens of retrieved context, and the combined input hits the context limit. The model either truncates silently or fails. Define your context budget before deployment.
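A context budget can be enforced in a few lines before the model is ever called. This sketch assumes an 8K window and trims retrieved context first; both are policy choices for illustration, not requirements:

```python
def plan_context(
    system_tokens: int,
    user_tokens: int,
    retrieved_tokens: int,
    max_output_tokens: int,
    context_limit: int = 8192,  # assumption: set this to your model's real window
) -> dict:
    """Enforce an explicit context budget instead of letting truncation happen silently."""
    fixed = system_tokens + user_tokens + max_output_tokens
    if fixed > context_limit:
        return {"ok": False, "reason": "prompt plus reserved output exceeds the window"}
    keep = min(retrieved_tokens, context_limit - fixed)
    return {"ok": True, "retrieval_kept": keep, "retrieval_dropped": retrieved_tokens - keep}

typical = plan_context(500, 200, 3000, 1000)       # average user: everything fits
heavy = plan_context(500, 2000, 3000, 1000, 4096)  # long message: retrieval gets trimmed
```

The key design choice is that the failure mode is explicit: the function tells you what it dropped, rather than the model truncating from whichever end the framework happens to pick.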
Distribution shift: Benchmark questions were written by researchers. Your users ask different things. A customer support model that scores 95% on a clean evaluation set might fail on 30% of real support tickets that include screenshots described in words, order numbers embedded in natural language, or multilingual queries the benchmark never covered.
Cost at scale: A model that costs $0.01 per 1,000 tokens looks cheap until you are serving 10 million requests per day with an average of 2,000 tokens per request. That is $200,000 per day. Token efficiency matters enormously at scale. Prompt compression, shorter system prompts, and output length control are worth investing in.
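The arithmetic in that paragraph, as a reusable helper:

```python
def daily_token_cost(requests_per_day: int, tokens_per_request: int,
                     usd_per_1k_tokens: float) -> float:
    """Daily spend implied by per-token API pricing."""
    return requests_per_day * tokens_per_request / 1000 * usd_per_1k_tokens

# 10M requests/day at 2,000 tokens each, $0.01 per 1K tokens
cost = daily_token_cost(10_000_000, 2_000, 0.01)  # $200,000 per day
```

Re-run it with a 30% shorter system prompt or a capped output length and the savings at this volume are five figures per day.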
Model updates breaking your application: When an API provider updates their model, benchmark scores may improve but your specific use case may regress. Applications that worked reliably with a specific model version sometimes break with the next. Pin your model version for production. Test the new version in staging before promoting.
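A minimal sketch of that pinning pattern; the model IDs below are hypothetical placeholders, not real provider identifiers:

```python
# hypothetical model IDs for illustration; use your provider's actual versioned IDs
PINNED_PROD_MODEL = "example-model-2025-06-01"  # explicit dated snapshot, never a floating alias
CANDIDATE_MODEL = "example-model-2025-11-01"    # new release under evaluation in staging

def model_for(environment: str) -> str:
    """Staging gets the candidate; production stays pinned until the candidate is promoted."""
    return CANDIDATE_MODEL if environment == "staging" else PINNED_PROD_MODEL
```

Promotion then means re-running your own eval set against the candidate in staging and changing one constant, rather than discovering a regression from user reports.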
API versus self-hosted: the real decision
Most teams start with an API model and some eventually move to self-hosting. The decision is not primarily about cost — it is about requirements.
Use an API model when you need frontier capability and are not training your own, when you cannot manage GPU infrastructure, or when your volume is low enough that the per-token cost is acceptable. OpenAI, Anthropic, and Google all offer reliable APIs with SLAs.
Self-host when you have strict data privacy requirements and cannot send user data to a third party, when you need a fine-tuned model that no API offers, when your volume is high enough that per-token API costs exceed self-hosting infrastructure costs, or when you need guaranteed uptime without dependency on an external provider.
For most developer teams building their first LLM product, start with an API. When you have real production traffic and real cost data, you will know exactly where the pain points are. That is when the self-hosting decision becomes concrete rather than speculative.
Connecting the entire series
This is where it comes together. Article 1 gave you vectors and matrices. Article 2 gave you probability and cross-entropy loss. Article 3 showed how neural networks learn using gradients and backpropagation. Article 4 turned tokens into meaningful vectors. Article 5 explained why RNNs broke and why something needed to replace them. Article 6 built the transformer and the attention mechanism that replaced them. Article 7 showed how pre-training on trillions of tokens gives the model its capabilities. Article 8 covered alignment and why the model follows your instructions. Article 9 showed how to use all of that in real applications.
This article showed how to know whether it works, how to make it fast enough, and what to watch for when it goes to production.
The series was designed so that each article made the next one necessary. Embeddings without attention would be a dead end. Attention without pre-training would be an empty architecture. Pre-training without alignment would be a text autocomplete engine. Alignment without prompting knowledge would produce poor-quality integrations. Prompting without evaluation would leave you guessing whether anything is actually working.
None of these ten concepts stands alone. They form one continuous stack, from the math in Article 1 to the production monitoring in this one. If something breaks in your LLM application, the answer usually lives somewhere in that stack.
The best next step is to build something. Pick the smallest version of a real application: a document QA system using RAG, a code review tool using an aligned model, or a fine-tuned classifier. Work through the full stack from embedding documents to evaluating outputs to measuring latency. The concepts in this series will mean something different once you have hit the actual failure modes yourself.
This series is complete. Ten concepts, each depending on the one before, covering the full stack from mathematics to deployed product.
Krunal Kanojiya
Technical Content Writer
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.