Evaluation, Inference, and Deployment: Shipping an LLM Product That Actually Works
Building a model is one thing. Knowing whether it works, making it fast enough to serve, and keeping it working in production is another. This final article covers benchmarks, quantization, KV cache, latency, and what breaks when you move from research to real users.
Nine articles ago we started with vectors and dot products. This is where it all lands in production.
You have built models from scratch, understood how pre-training works, seen how alignment shapes behavior, and learned how to prompt them effectively. What happens when you have to actually ship something? The benchmarks you pick determine whether you can trust your model. Inference optimization determines whether users wait 200 milliseconds or 8 seconds. Deployment decisions determine whether the thing keeps working when real traffic arrives.
This is the final article in the series on AI and ML fundamentals. It connects directly to Articles 8 and 9. The fine-tuned and aligned model from Article 8 needs to be evaluated. The prompting and RAG pipelines from Article 9 need to run fast enough to be useful in production.
The benchmark problem
Benchmarks are proxies. The goal is not to score well on benchmarks. The goal is for the model to be useful to your users for your specific task. Those two things can diverge significantly.
MMLU was the dominant evaluation benchmark from 2021 to 2024. It tests knowledge across 57 subjects from elementary math to professional law and medicine. By mid-2024, top frontier models scored 90% or above, making it nearly useless for differentiating between them. Researchers responded with MMLU-Pro, which uses 10 answer choices instead of 4 and harder questions.
HumanEval tests code generation across 164 Python problems. As of 2025, top models score near or above 90%. It was also increasingly suspected that some models had seen these exact problems in training data, inflating scores. SWE-bench Verified, which tests real-world GitHub issue resolution in actual codebases, is now more meaningful for code evaluation because it is much harder to contaminate.
```python
# a practical benchmark selection framework
def select_benchmarks(use_case: str) -> list:
    """
    Map your production use case to meaningful benchmarks.
    Do not default to MMLU just because it is famous.
    """
    benchmark_map = {
        "code_generation": [
            "SWE-bench Verified",  # real GitHub issues, hard to contaminate
            "HumanEval+",          # extended test suite, catches edge cases
            "LiveCodeBench",       # fresh problems, lower contamination risk
        ],
        "reasoning_assistant": [
            "GPQA-Diamond",  # PhD-level reasoning, not yet saturated
            "MMLU-Pro",      # harder version of MMLU
            "MATH-500",      # math reasoning benchmark
        ],
        "instruction_following": [
            "IFEval",    # tests whether model follows specific instructions
            "MT-Bench",  # multi-turn dialogue quality
        ],
        "knowledge_qa": [
            "MMLU-Pro",
            "SimpleQA",    # factual accuracy, measures hallucination
            "TruthfulQA",  # tests whether model avoids confident wrong answers
        ],
        "general_purpose": [
            "GPQA-Diamond",
            "SWE-bench Verified",
            "IFEval",
            "LiveCodeBench",
        ],
    }
    return benchmark_map.get(use_case, benchmark_map["general_purpose"])

# choosing the wrong benchmark gives you a false sense of model quality
# MMLU scores above 90% tell you almost nothing about production performance
```

The current benchmark hierarchy in 2026 for frontier models: GPQA-Diamond still differentiates (Claude Opus 4.6 at 91.3%, GPT-5.3 Codex at 81%), SWE-bench Verified still differentiates (Claude Opus 4.6 at 80.8%), and Humanity's Last Exam (HLE) is the hardest current benchmark, designed to be unsolvable by 2024-era models. For non-frontier models, MMLU and HumanEval still provide useful baseline checks.
The most important benchmark you will run is your own. Build an evaluation set from real production examples. Sample 200 to 500 real user queries from your logs, have someone label the ideal responses, and measure your model against those. Public benchmarks tell you about general capability. Your eval set tells you about your actual problem.
Evaluating in code
Evaluation is code, not a spreadsheet. Automated evaluation at scale requires clear metrics, reproducible prompts, and a way to track regressions.
```python
from typing import Callable

class LLMEvaluator:
    """Simple framework for automated LLM evaluation."""

    def __init__(self, model_fn: Callable, judge_fn: Callable | None = None):
        self.model = model_fn
        self.judge = judge_fn  # optional: LLM-as-judge for open-ended tasks

    def evaluate_exact_match(self, dataset: list[dict]) -> dict:
        """For tasks with ground-truth labels (classification, extraction)."""
        correct = 0
        total = len(dataset)
        errors = []
        for example in dataset:
            response = self.model(example["prompt"])
            predicted = self._extract_answer(response)
            expected = example["expected"]
            if predicted == expected:
                correct += 1
            else:
                errors.append({
                    "prompt": example["prompt"],
                    "expected": expected,
                    "predicted": predicted,
                    "response": response,
                })
        return {
            "accuracy": correct / total,
            "correct": correct,
            "total": total,
            "errors": errors[:10],  # first 10 failures for debugging
        }

    def evaluate_with_judge(self, dataset: list[dict], criteria: str) -> dict:
        """For open-ended tasks: use an LLM to judge quality."""
        scores = []
        for example in dataset:
            response = self.model(example["prompt"])
            judge_prompt = f"""Rate the following response on a scale of 1-5.
Criteria: {criteria}
Prompt: {example['prompt']}
Response: {response}
Score (1-5) and brief reason:"""
            judgment = self.judge(judge_prompt)
            score = self._extract_score(judgment)
            scores.append(score)
        return {
            "mean_score": sum(scores) / len(scores),
            "score_dist": {i: scores.count(i) for i in range(1, 6)},
        }

    def _extract_answer(self, response: str) -> str:
        return response.strip().split("\n")[0]

    def _extract_score(self, judgment: str) -> int:
        for char in judgment:
            if char.isdigit() and int(char) in range(1, 6):
                return int(char)
        return 3  # default to middle if extraction fails
```

LLM-as-judge evaluation has become standard for open-ended tasks where ground-truth labels do not exist. You send the model's response to a stronger model (often GPT-4o or Claude) and ask it to rate quality. This scales better than human evaluation and correlates reasonably well with human judgment, though it inherits the biases of the judge model.
Inference: what actually runs when you generate a token
Understanding inference matters because it determines cost, latency, and where your optimization budget goes.
LLM inference has two distinct phases. The prefill phase processes all input tokens in parallel through the transformer's attention layers. It is fast per token, but total prefill time grows with context length (and the attention computation itself grows quadratically). The decode phase generates one token at a time autoregressively, reading from the KV cache at each step. Each step avoids recomputing earlier tokens, though the attention read still scales with the number of cached tokens, so very long contexts slow decoding too.
```python
import time
import torch

def profile_inference(model, tokenizer, prompt: str, max_new_tokens: int = 100):
    """Profile prefill vs decode phases."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # measure prefill (first token)
    start = time.perf_counter()
    with torch.no_grad():
        outputs = model.generate(input_ids, max_new_tokens=1, do_sample=False)
    prefill_time = time.perf_counter() - start
    ttft = prefill_time  # time to first token

    # measure decode (remaining tokens)
    # note: this second generate() call re-runs prefill over the prompt because
    # the KV cache is not carried across calls, so it slightly overstates decode time
    start = time.perf_counter()
    with torch.no_grad():
        outputs = model.generate(
            outputs,
            max_new_tokens=max_new_tokens - 1,
            do_sample=False,
        )
    decode_time = time.perf_counter() - start
    tokens_per_second = (max_new_tokens - 1) / decode_time

    return {
        "prompt_tokens": input_ids.shape[1],
        "ttft_seconds": ttft,
        "tps": tokens_per_second,
        "total_seconds": prefill_time + decode_time,
    }
```

The KV cache stores the key and value tensors computed for each input token so they do not need to be recomputed when generating the next token. For a 70B model with 64 attention heads and 128-dimensional head projections serving a batch of requests at 8K context, the KV cache alone can consume 8 GB or more of GPU memory. This is why DeepSeek's Multi-Head Latent Attention (MLA, covered in Article 6) reduces the KV cache by 93%, and why Grouped Query Attention (GQA) is standard in most modern large models.
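Those memory numbers fall out of a short calculation. A back-of-the-envelope sketch (the 80-layer count for a 70B-class model is an assumption, as is the 8-head GQA configuration; substitute your model's actual config):

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                context_len: int, batch_size: int = 1,
                bytes_per_value: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) per layer, per KV head, per token."""
    elements = 2 * num_layers * num_kv_heads * head_dim * context_len * batch_size
    return elements * bytes_per_value / 1e9

# 70B-class model with full multi-head attention:
# 80 layers (assumed), 64 KV heads, head_dim 128, 8K context, bf16
mha = kv_cache_gb(80, 64, 128, 8192)  # roughly 21.5 GB for one sequence
# same model with GQA using 8 KV heads instead of 64
gqa = kv_cache_gb(80, 8, 128, 8192)   # roughly 2.7 GB
```

Under these assumptions a single 8K-token sequence with full MHA already exceeds the 8 GB figure above, which is exactly why GQA and MLA shrink the KV head count or dimensionality rather than the model weights.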
Quantization: trading precision for speed and memory
A 7B model in 16-bit (bfloat16) precision requires roughly 14 GB of GPU memory. The same model in 4-bit quantization requires roughly 4 GB, which fits on a consumer GPU. The tradeoff is accuracy degradation.
```python
# comparing quantization formats in terms of memory
def model_memory_gb(params_billion: float, bits: int) -> float:
    """Estimate model weight memory in GB."""
    bytes_per_param = bits / 8
    return params_billion * 1e9 * bytes_per_param / 1e9

model_sizes = [7, 13, 34, 70]
formats = [("BF16", 16), ("INT8", 8), ("INT4", 4), ("INT3", 3)]

print(f"{'Size':<6}", end="")
for name, _ in formats:
    print(f" {name:>8}", end="")
print()
for size in model_sizes:
    print(f"{str(size) + 'B':<6}", end="")
    for name, bits in formats:
        mem = model_memory_gb(size, bits)
        print(f" {mem:>6.1f}GB", end="")
    print()
```

```
Size       BF16     INT8     INT4     INT3
7B       14.0GB    7.0GB    3.5GB    2.6GB
13B      26.0GB   13.0GB    6.5GB    4.9GB
34B      68.0GB   34.0GB   17.0GB   12.8GB
70B     140.0GB   70.0GB   35.0GB   26.3GB
```

The main quantization methods in production as of 2025:
GPTQ quantizes weights to 4-bit or 8-bit offline, before deployment. The quantized model loads and runs like a normal model. Compatible with most inference frameworks.
AWQ (Activation-Aware Weight Quantization) accounts for the distribution of activations when quantizing, reducing quality loss compared to naive weight quantization. It consistently outperforms GPTQ at the same bit width on reasoning benchmarks.
GGUF is the format used by llama.cpp, the leading framework for local inference on CPUs and consumer GPUs. It supports a range of quantization levels from Q2 to Q8. Q4_K_M and Q5_K_M are common choices that balance quality and speed for local deployment.
Recent research on Qwen2.5 and DeepSeek models shows that accuracy differences between BF16 and 4-bit quantization are consistently under 1% on MMLU, MATH, and GPQA for models above 7 billion parameters. Smaller models degrade more noticeably under aggressive quantization.
```python
# practical quantization decision guide
def choose_quantization(
    gpu_vram_gb: float,
    model_size_b: float,
    quality_priority: str = "high",
) -> str:
    """
    Recommend quantization format based on hardware and requirements.
    quality_priority: "high", "balanced", or "speed"
    """
    bf16_memory = model_size_b * 2    # GB, 2 bytes per parameter
    int8_memory = model_size_b * 1    # GB
    int4_memory = model_size_b * 0.5  # GB
    available = gpu_vram_gb * 0.85    # leave 15% headroom for KV cache and activations

    if bf16_memory <= available and quality_priority == "high":
        return "BF16 (full precision, best quality)"
    if int8_memory <= available and quality_priority in ("high", "balanced"):
        return "INT8 / GPTQ-8bit (small quality loss, recommended)"
    if int4_memory <= available:
        if quality_priority == "speed":
            return "INT4 / GPTQ-4bit (faster, measurable quality loss)"
        return "AWQ-4bit (better quality than GPTQ at same bit width)"
    return "Model too large for this GPU: consider model sharding or a smaller model"

# examples (a 24 GB card has ~20.4 GB usable after headroom)
print(choose_quantization(gpu_vram_gb=24, model_size_b=7))   # fits in BF16
print(choose_quantization(gpu_vram_gb=24, model_size_b=34))  # needs 4-bit: INT8 would take 34 GB
print(choose_quantization(gpu_vram_gb=24, model_size_b=70))  # too large even at 4-bit (35 GB)
print(choose_quantization(gpu_vram_gb=8, model_size_b=13))   # needs 4-bit
```

What breaks in production
Benchmarks run your model once, sequentially, on carefully formatted inputs. Production is not that.
Latency under load: A model that generates 80 tokens per second serving one request may produce 12 tokens per second when serving 20 concurrent requests, because GPU memory bandwidth is shared. Benchmark your latency at your expected concurrency level, not at idle.
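A minimal load-test harness makes this measurable. The sketch below uses a sleep-based stub in place of a real endpoint (swap `generate_stub` for your actual API or inference call); it demonstrates the harness shape only, not real GPU contention:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def generate_stub(prompt: str) -> str:
    """Stand-in for a real model call; replace with your inference endpoint."""
    time.sleep(0.05)  # simulate fixed generation latency
    return "response"

def measure_latency(concurrency: int, total_requests: int = 8) -> dict:
    """Fire total_requests at a given concurrency and report latency stats."""
    def timed_call(prompt: str) -> float:
        start = time.perf_counter()
        generate_stub(prompt)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, [f"q{i}" for i in range(total_requests)]))
    wall = time.perf_counter() - wall_start
    return {
        "concurrency": concurrency,
        "p50_s": latencies[len(latencies) // 2],
        "p95_s": latencies[int(len(latencies) * 0.95)],
        "requests_per_s": total_requests / wall,
    }

low = measure_latency(concurrency=1)
high = measure_latency(concurrency=8)
# against a real model, expect per-request latency to rise and per-request
# tokens/sec to fall as concurrency grows; the stub cannot show that contention
```

Run this against staging at your expected peak concurrency and record p50/p95, not just the single-request number.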
Context length edge cases: Your RAG pipeline retrieves 3,000 tokens of context per request. Your average user sends 200-token messages. But some users send 2,000-token messages with 3,000 tokens of retrieved context, and the combined input hits the context limit. The model either truncates silently or fails. Define your context budget before deployment.
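A context budget can be enforced in a few lines before the model is ever called. This sketch assumes an 8K window and trims retrieved context first; both are policy choices for illustration, not requirements:

```python
def plan_context(
    system_tokens: int,
    user_tokens: int,
    retrieved_tokens: int,
    max_output_tokens: int,
    context_limit: int = 8192,  # assumption: set this to your model's real window
) -> dict:
    """Enforce an explicit context budget instead of letting truncation happen silently."""
    fixed = system_tokens + user_tokens + max_output_tokens
    if fixed > context_limit:
        return {"ok": False, "reason": "prompt plus reserved output exceeds the window"}
    keep = min(retrieved_tokens, context_limit - fixed)
    return {"ok": True, "retrieval_kept": keep, "retrieval_dropped": retrieved_tokens - keep}

typical = plan_context(500, 200, 3000, 1000)       # average user: everything fits
heavy = plan_context(500, 2000, 3000, 1000, 4096)  # long message: retrieval gets trimmed
```

The key design choice is that the failure mode is explicit: the function tells you what it dropped, rather than the model truncating from whichever end the framework happens to pick.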
Distribution shift: Benchmark questions were written by researchers. Your users ask different things. A customer support model that scores 95% on a clean evaluation set might fail on 30% of real support tickets that include screenshots described in words, order numbers embedded in natural language, or multilingual queries the benchmark never covered.
Cost at scale: A model that costs $0.01 per 1,000 tokens looks cheap until you are serving 10 million requests per day with an average of 2,000 tokens per request. That is $200,000 per day. Token efficiency matters enormously at scale. Prompt compression, shorter system prompts, and output length control are worth investing in.
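The arithmetic in that paragraph, as a reusable helper:

```python
def daily_token_cost(requests_per_day: int, tokens_per_request: int,
                     usd_per_1k_tokens: float) -> float:
    """Daily spend implied by per-token API pricing."""
    return requests_per_day * tokens_per_request / 1000 * usd_per_1k_tokens

# 10M requests/day at 2,000 tokens each, $0.01 per 1K tokens
cost = daily_token_cost(10_000_000, 2_000, 0.01)  # $200,000 per day
```

Re-run it with a 30% shorter system prompt or a capped output length and the savings at this volume are five figures per day.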
Model updates breaking your application: When an API provider updates their model, benchmark scores may improve but your specific use case may regress. Applications that worked reliably with a specific model version sometimes break with the next. Pin your model version for production. Test the new version in staging before promoting.
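A minimal sketch of that pinning pattern; the model IDs below are hypothetical placeholders, not real provider identifiers:

```python
# hypothetical model IDs for illustration; use your provider's actual versioned IDs
PINNED_PROD_MODEL = "example-model-2025-06-01"  # explicit dated snapshot, never a floating alias
CANDIDATE_MODEL = "example-model-2025-11-01"    # new release under evaluation in staging

def model_for(environment: str) -> str:
    """Staging gets the candidate; production stays pinned until the candidate is promoted."""
    return CANDIDATE_MODEL if environment == "staging" else PINNED_PROD_MODEL
```

Promotion then means re-running your own eval set against the candidate in staging and changing one constant, rather than discovering a regression from user reports.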
API versus self-hosted: the real decision
Most teams start with an API model and some eventually move to self-hosting. The decision is not primarily about cost — it is about requirements.
Use an API model when you need frontier capability and are not training your own, when you cannot manage GPU infrastructure, or when your volume is low enough that the per-token cost is acceptable. OpenAI, Anthropic, and Google all offer reliable APIs with SLAs.
Self-host when you have strict data privacy requirements and cannot send user data to a third party, when you need a fine-tuned model that no API offers, when your volume is high enough that per-token API costs exceed self-hosting infrastructure costs, or when you need guaranteed uptime without dependency on an external provider.
For most developer teams building their first LLM product, start with an API. When you have real production traffic and real cost data, you will know exactly where the pain points are. That is when the self-hosting decision becomes concrete rather than speculative.
Connecting the entire series
This is where it comes together. Article 1 gave you vectors and matrices. Article 2 gave you probability and cross-entropy loss. Article 3 showed how neural networks learn using gradients and backpropagation. Article 4 turned tokens into meaningful vectors. Article 5 explained why RNNs broke and why something needed to replace them. Article 6 built the transformer and the attention mechanism that replaced them. Article 7 showed how pre-training on trillions of tokens gives the model its capabilities. Article 8 covered alignment and why the model follows your instructions. Article 9 showed how to use all of that in real applications.
This article showed how to know whether it works, how to make it fast enough, and what to watch for when it goes to production.
The series was designed so that each article made the next one necessary. Embeddings without attention would be a dead end. Attention without pre-training would be an empty architecture. Pre-training without alignment would be a text autocomplete engine. Alignment without prompting knowledge would produce poor-quality integrations. Prompting without evaluation would leave you guessing whether anything is actually working.
None of these ten concepts stands alone. They form one continuous stack, from the math in Article 1 to the production monitoring in this one. If something breaks in your LLM application, the answer usually lives somewhere in that stack.
The best next step is to build something. Pick the smallest version of a real application: a document QA system using RAG, a code review tool using an aligned model, or a fine-tuned classifier. Work through the full stack from embedding documents to evaluating outputs to measuring latency. The concepts in this series will mean something different once you have hit the actual failure modes yourself.
This series is complete. Ten concepts, each depending on the one before, covering the full stack from mathematics to deployed product.
Krunal Kanojiya
Technical Content Writer
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.