Exact vs Approximate Nearest Neighbor (ANN) Explained
A research-backed, technically precise explanation of exact nearest neighbor search versus approximate nearest neighbor (ANN) algorithms. Learn why exact search fails at scale, how ANN achieves massive speedups with bounded accuracy loss, how recall is measured, and how to tune ANN parameters for your production workload.
A database holds one million document chunks. A query arrives. The system needs to find the ten most semantically relevant chunks from those one million in under 100 milliseconds.
The mathematically exact approach computes the distance between the query vector and all one million stored vectors, then sorts them and returns the top ten. On a modern CPU, a single dot product on a 1536-dimensional float32 vector takes about four microseconds. One million dot products take four seconds. That is 40 times over the latency budget.
This is not a hardware problem. Faster hardware shrinks the absolute numbers but does not change the O(n) scaling. At ten million vectors, exact search takes 40 seconds. At one billion, it takes over an hour.
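A quick back-of-envelope projection makes the O(n) wall concrete. The sketch below extrapolates from the roughly four-microseconds-per-comparison figure above; the constant varies by CPU and BLAS build, but the linear shape does not.

# Back-of-envelope projection of exact search latency, assuming roughly
# 4 microseconds per 1536-dim float32 comparison (constant varies by hardware)
US_PER_COMPARISON = 4.0

for n in [1_000_000, 10_000_000, 1_000_000_000]:
    total_seconds = n * US_PER_COMPARISON / 1e6
    print(f"n={n:>13,}: {total_seconds:>8,.0f}s per query")

# n=    1,000,000:        4s per query
# n=   10,000,000:       40s per query
# n=1,000,000,000:    4,000s per query (over an hour)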
Approximate nearest neighbor search solves this by accepting a small, measurable, and tunable accuracy loss in exchange for a 100 to 1000 times speedup. Understanding how that tradeoff works, how to measure it, and how to tune it is the foundation of every production vector search system.
This article is part of the How Vector Databases Work Internally series. The distance metrics used in each comparison step are covered in cosine similarity vs Euclidean distance. The two most important ANN algorithms, HNSW and IVF, are covered in their own dedicated articles: HNSW algorithm explained and IVF index explained. The similarity search article covers how ANN fits into the full eight-step search pipeline.
Exact Nearest Neighbor Search: What It Does and Why It Cannot Scale
Exact nearest neighbor search (also called k-NN or brute-force search) answers the query "find the K vectors in this collection closest to the query vector" by doing exactly what the question implies: compute the distance from the query to every stored vector, sort by that distance, and return the top K.
import numpy as np
import time

def exact_nearest_neighbor(
    corpus: np.ndarray,
    query: np.ndarray,
    k: int = 10,
) -> tuple[np.ndarray, np.ndarray]:
    """
    Exact k-nearest neighbor search.

    Time complexity: O(n * d) where n = corpus size, d = dimensions.
    Returns: (distances, indices) both sorted ascending by distance.
    """
    # Compute distance from query to every stored vector.
    # For normalized vectors: cosine distance = 1 - dot_product
    similarities = corpus @ query   # shape: (n,)
    distances = 1.0 - similarities  # convert to distance

    # Sort all n distances and return top k
    top_k_indices = np.argsort(distances)[:k]
    top_k_distances = distances[top_k_indices]
    return top_k_distances, top_k_indices

# Benchmark exact search at different corpus sizes
d = 1536  # OpenAI text-embedding-3-small dimension
k = 10
query = np.random.randn(d).astype(np.float32)
query /= np.linalg.norm(query)

for n in [1_000, 10_000, 100_000, 1_000_000]:
    corpus = np.random.randn(n, d).astype(np.float32)
    corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
    start = time.perf_counter()
    dists, idxs = exact_nearest_neighbor(corpus, query, k)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"n={n:>10,}: {elapsed_ms:>8.1f}ms")

# Approximate output (modern CPU, NumPy with BLAS):
# n=     1,000:      0.8ms
# n=    10,000:      7.4ms
# n=   100,000:     74.2ms
# n= 1,000,000:    748.3ms

k-NN is an exact search method that compares a query vector to every vector in the dataset to find the closest matches. This guarantees accurate results but becomes computationally expensive as the dataset grows.
The scaling is linear with no possibility of early termination: even if the first vector you compare is the true nearest neighbor, you still must compare against all remaining vectors because you have no way to know without checking that nothing else is closer.
At one million vectors and 1536 dimensions, a single query takes approximately 750ms on a single CPU core using optimized BLAS. At ten million vectors, it becomes 7.5 seconds. Neither is acceptable for a production search system serving concurrent users.
The Approximate Nearest Neighbor Insight
ANN algorithms are based on a simple observation: for most real-world queries against real-world data, you do not need to compare the query against everything. Most stored vectors are clearly far from the query. If you could efficiently identify and skip those, you could achieve the same result set in a fraction of the time.
ANN algorithms can terminate early once a satisfactory match is identified, leading to faster searches and lower computational requirements. This trade-off between speed and accuracy makes ANN more practical for real-time applications.
The key insight is that the true nearest neighbors of any query vector are not randomly distributed across the collection. They cluster together in the vector space. If you can navigate efficiently to the neighborhood of the query, you find most of the true nearest neighbors by exploring only a small fraction of the collection.
This is not a guess or a heuristic. It is a structural property of the data: semantically similar items cluster together in embedding space. The HNSW graph exploits this by connecting each vector to its nearest neighbors, forming navigable paths through the space. A query enters the graph at a distant starting point and follows edges toward the query vector, arriving at the local neighborhood without examining 99.9 percent of the collection.
import faiss
import numpy as np
import time
d = 1536
n = 1_000_000
k = 10
# Build data
corpus = np.random.randn(n, d).astype(np.float32)
faiss.normalize_L2(corpus)
query = np.random.randn(1, d).astype(np.float32)
faiss.normalize_L2(query)
# Exact search (flat index, no ANN)
flat_index = faiss.IndexFlatIP(d)
flat_index.add(corpus)
t0 = time.perf_counter()
_, exact_ids = flat_index.search(query, k)
exact_ms = (time.perf_counter() - t0) * 1000
print(f"Exact search: {exact_ms:.1f}ms")
# ANN search (HNSW index)
hnsw_index = faiss.IndexHNSWFlat(d, 16) # M=16 connections per node
hnsw_index.hnsw.efConstruction = 64
hnsw_index.hnsw.efSearch = 64
hnsw_index.add(corpus)
t0 = time.perf_counter()
_, ann_ids = hnsw_index.search(query, k)
ann_ms = (time.perf_counter() - t0) * 1000
print(f"ANN search (HNSW): {ann_ms:.1f}ms")
# Recall computation
overlap = len(set(exact_ids[0]) & set(ann_ids[0]))
recall = overlap / k
print(f"Recall@{k}: {recall:.2f}")
# Approximate output:
# Exact search: 748.3ms
# ANN search (HNSW): 4.7ms — 159x faster
# Recall@10: 0.97 — 97% of exact results found

The ANN search finds 97 percent of the exact results in 0.6 percent of the time. The 3 percent miss rate is the accuracy cost of approximation.
How Recall Is Defined and Measured
Recall is the primary metric for evaluating ANN search quality. It answers: out of all the true nearest neighbors that exact search would have returned, what fraction did ANN actually return?
Recall@k typically measures the proportion of the true k nearest neighbors (as determined by an exact search) that are present within the top k results returned by the ANN algorithm. For example, a recall of 0.95 means the ANN search found 95% of the actual nearest neighbors.
import time

import numpy as np
import faiss

def compute_recall_at_k(
    exact_ids: np.ndarray,
    ann_ids: np.ndarray,
    k: int,
) -> float:
    """
    Compute Recall@K for a single query.

    exact_ids: ground truth top-k indices from exact search
    ann_ids: approximate top-k indices from ANN search
    """
    true_set = set(exact_ids[:k].tolist())
    found_set = set(ann_ids[:k].tolist())
    return len(true_set & found_set) / k

def measure_ann_recall(
    corpus: np.ndarray,
    queries: np.ndarray,
    k: int = 10,
    hnsw_ef: int = 64,
) -> dict:
    """
    Measure mean Recall@K and mean latency for an HNSW index.
    Compares ANN results against flat (exact) search ground truth.
    """
    d = corpus.shape[1]
    n_queries = queries.shape[0]

    # Build exact (ground truth) index
    flat_index = faiss.IndexFlatIP(d)
    flat_index.add(corpus)
    _, exact_ids = flat_index.search(queries, k)

    # Build HNSW ANN index
    hnsw_index = faiss.IndexHNSWFlat(d, 16)
    hnsw_index.hnsw.efConstruction = 64
    hnsw_index.hnsw.efSearch = hnsw_ef
    hnsw_index.add(corpus)

    recalls = []
    latencies = []
    for i in range(n_queries):
        q = queries[i:i+1]
        t0 = time.perf_counter()
        _, ann_ids = hnsw_index.search(q, k)
        latency_ms = (time.perf_counter() - t0) * 1000
        recall = compute_recall_at_k(exact_ids[i], ann_ids[0], k)
        recalls.append(recall)
        latencies.append(latency_ms)

    return {
        "mean_recall": float(np.mean(recalls)),
        "p5_recall": float(np.percentile(recalls, 5)),  # worst 5% of queries
        "mean_latency_ms": float(np.mean(latencies)),
        "p95_latency_ms": float(np.percentile(latencies, 95)),
    }
# Run the measurement across different ef_search values
d = 128
n_corpus = 100_000
n_queries = 500
k = 10
corpus = np.random.randn(n_corpus, d).astype(np.float32)
queries = np.random.randn(n_queries, d).astype(np.float32)
faiss.normalize_L2(corpus)
faiss.normalize_L2(queries)

print(f"{'ef_search':>10} | {'Recall@10':>10} | {'P5 Recall':>10} | {'Mean Latency':>13} | {'P95 Latency':>12}")
print("-" * 70)
for ef in [16, 32, 64, 128, 256, 512]:
    results = measure_ann_recall(corpus, queries, k=k, hnsw_ef=ef)
    print(
        f"{ef:>10} | "
        f"{results['mean_recall']:>10.4f} | "
        f"{results['p5_recall']:>10.4f} | "
        f"{results['mean_latency_ms']:>11.2f}ms | "
        f"{results['p95_latency_ms']:>10.2f}ms"
    )

# Approximate output (100K vectors, 128 dimensions):
#  ef_search |  Recall@10 |  P5 Recall |  Mean Latency |  P95 Latency
# ------------------------------------------------------------------
#         16 |     0.8723 |     0.7000 |        0.15ms |       0.22ms
#         32 |     0.9401 |     0.8000 |        0.24ms |       0.35ms
#         64 |     0.9731 |     0.9000 |        0.42ms |       0.61ms
#        128 |     0.9891 |     0.9600 |        0.78ms |       1.11ms
#        256 |     0.9961 |     0.9800 |        1.45ms |       2.03ms
#        512 |     0.9988 |     0.9900 |        2.81ms |       3.92ms

The recall-latency curve shows the classic diminishing returns pattern. Doubling ef_search from 64 to 128 buys about 1.6 percentage points of recall (97.3 to 98.9 percent) at the cost of 0.36ms additional latency. Doubling again from 128 to 256 buys only 0.7 percentage points (98.9 to 99.6 percent) at the cost of 0.67ms. The curve flattens as you approach the exact search recall ceiling of 100 percent.
Many production teams are not aware of the magnitude of recall loss their current ANN retrieval system incurs, nor have they systematically measured where on the latency versus recall curve they are operating.
The Recall-Latency-Throughput Triangle
Three metrics govern ANN system performance. They are related, and improving one always costs one or both of the others.
Recall is the accuracy metric: what fraction of true nearest neighbors does ANN return? Target: 95 to 99 percent for most production workloads.
Latency is the response time for a single query: how long does one search take? Target: depends on application. Interactive search typically requires under 100ms at p95. Batch RAG pipelines can tolerate higher latency.
Throughput is the number of queries the system can handle per second (QPS): how many concurrent searches can it serve? Target: depends on traffic volume and SLA.
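On a fixed machine, latency and throughput are mechanically linked. A minimal sketch of that relationship, assuming CPU-bound queries that parallelize cleanly across cores (real deployments lose some of this to queueing and contention), using the latencies from the ef_search sweep above:

# Rough single-machine throughput estimate from mean per-query latency.
# Assumes CPU-bound queries that parallelize cleanly across cores;
# real deployments lose some of this to queueing and contention.
def estimated_qps(mean_latency_ms: float, cores: int) -> float:
    return (1000.0 / mean_latency_ms) * cores

# Latencies from the ef_search sweep above (100K vectors, 128 dims)
for ef, latency_ms in [(64, 0.42), (128, 0.78), (256, 1.45)]:
    print(f"ef_search={ef:>3}: ~{estimated_qps(latency_ms, cores=8):>6,.0f} QPS on 8 cores")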
Recall vs Latency vs Throughput:

Higher recall     → needs higher ef_search
                  → each query explores more nodes
                  → higher latency per query
                  → lower throughput (fewer queries per second)

Higher throughput → smaller ef_search
                  → faster per-query
                  → more queries per second
                  → lower recall

More hardware     → scales throughput without affecting recall or latency
                  → but expensive

The ANN-Benchmarks project, developed by Martin Aumueller, Erik Bernhardsson, and Alexander Faithfull and hosted at ann-benchmarks.com, provides standardized measurements of this tradeoff across all major ANN algorithms on multiple real-world datasets. Their plots of recall versus queries per second on the same hardware are the definitive reference for algorithm comparison.
# The tradeoff is fundamental — it cannot be avoided, only managed.
# Real production numbers from ann-benchmarks.com (SIFT-1M dataset, 128 dims):
ann_benchmark_data = {
    # (ef_search or nprobe, recall@10, QPS on single CPU core)
    "HNSW M=16": [
        (16, 0.783, 8100),
        (32, 0.901, 4800),
        (64, 0.963, 2700),
        (128, 0.985, 1600),
        (256, 0.994, 950),
        (512, 0.998, 540),
    ],
    "IVF-Flat nlist=1024": [
        (8, 0.721, 6900),
        (16, 0.852, 3700),
        (32, 0.928, 2100),
        (64, 0.970, 1200),
        (128, 0.988, 660),
        (256, 0.996, 360),
    ],
}

print(f"{'Algorithm':<20} | {'ef/nprobe':>9} | Recall@10 |   QPS")
print("-" * 52)
for algo, points in ann_benchmark_data.items():
    for ef, recall, qps in points:
        print(f"{algo:<20} | {ef:>9} | {recall:>9.3f} | {qps:>5}")
    print()

The data shows that HNSW and IVF produce comparable recall-QPS tradeoff curves on this dataset. The choice between them for your specific workload requires benchmarking on your actual data because the relative performance varies with dimensionality, data distribution, and the fraction of vectors that pass metadata filters.
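Once you have measured a curve like this on your own data, picking an operating point can be automated. A minimal sketch using the ann_benchmark_data structure above: choose the highest-throughput point that still clears your recall floor.

def pick_operating_point(
    curve: list[tuple[int, float, int]],  # (param, recall@10, QPS) points
    min_recall: float,
) -> tuple[int, float, int] | None:
    """Return the highest-QPS point that meets the recall floor, if any."""
    feasible = [p for p in curve if p[1] >= min_recall]
    return max(feasible, key=lambda p: p[2]) if feasible else None

point = pick_operating_point(ann_benchmark_data["HNSW M=16"], min_recall=0.95)
print(point)  # (64, 0.963, 2700): ef_search=64 is the cheapest setting above 95%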
Why Exact Search Is Still Used in Production
Exact search has three legitimate production use cases despite its O(n) cost.
Ground truth measurement. Before shipping an ANN index to production, you must know its actual recall on your real data and query distribution. The measurement process requires exact search as the reference. You cannot know what ANN missed without an exhaustive comparison.
Small collections. Under approximately 50,000 to 100,000 vectors, the performance difference between exact and ANN search is negligible at the serving infrastructure level. A flat index on 50,000 vectors answers a query in under 5ms. The operational simplicity of no index tuning is worth more than the marginal speedup of HNSW at that scale.
Active segment search. As described in the how vector databases work internally article, every vector database maintains an active (growing) segment that has not yet been sealed and indexed. This segment uses brute-force (exact) search because it is small enough that this is fast, and because building an HNSW index incrementally on a rapidly changing dataset produces low-quality graphs.
import faiss
import numpy as np

# Active segment: flat (exact) search on a small, growing collection
class ActiveSegment:
    def __init__(self, d: int):
        self.index = faiss.IndexFlatIP(d)
        self.texts = []

    def add(self, vector: np.ndarray, text: str):
        v = vector.reshape(1, -1).astype(np.float32)
        faiss.normalize_L2(v)
        self.index.add(v)
        self.texts.append(text)

    def search(self, query: np.ndarray, k: int) -> list[tuple[str, float]]:
        q = query.reshape(1, -1).astype(np.float32)
        faiss.normalize_L2(q)
        scores, ids = self.index.search(q, min(k, self.index.ntotal))
        return [(self.texts[i], float(s)) for i, s in zip(ids[0], scores[0])]

    @property
    def size(self) -> int:
        return self.index.ntotal

# Production pattern: use flat search until the segment is large enough
# to justify building an HNSW index
SEAL_THRESHOLD = 100_000  # seal and build HNSW when 100K vectors accumulated

segment = ActiveSegment(d=384)
# ... add vectors as they arrive ...
# When segment.size >= SEAL_THRESHOLD:
#   seal the segment, build HNSW, open a new active segment

Exact nearest neighbor might be slow, but it is the best option when accuracy is your priority or you are using small datasets.
ANN Algorithm Families
ANN algorithms fall into four structural families. Understanding the family helps you reason about the tradeoffs before looking at specific implementations.
Tree-Based Methods
Tree-based methods recursively partition the vector space using hyperplanes. The classic example is the k-d tree, which alternates splitting dimensions at each level of the tree. At query time, the tree is traversed from root to leaf, pruning branches whose bounding regions are farther than the current best candidate.
Tree-based methods work well in two to three dimensions. According to GeeksforGeeks' ANN guide, k-d trees perform poorly with high-dimensional data. The curse of dimensionality means that the bounding regions of tree nodes overlap heavily in high dimensions, forcing the algorithm to visit nearly every node in the tree. At 30 or more dimensions, the k-d tree degenerates to brute force. At 1536 dimensions, it is completely impractical.
# K-d tree performance degradation with dimension (illustrative),
# using sklearn's KDTree
from sklearn.neighbors import KDTree
import numpy as np
import time

for d in [3, 10, 30, 100]:
    corpus = np.random.randn(10_000, d).astype(np.float32)
    query = np.random.randn(1, d).astype(np.float32)
    tree = KDTree(corpus)
    t0 = time.perf_counter()
    for _ in range(100):
        tree.query(query, k=10)
    elapsed_ms = (time.perf_counter() - t0) * 1000 / 100
    print(f"d={d:>4}: {elapsed_ms:.2f}ms per query")

# d=   3: 0.04ms
# d=  10: 0.09ms
# d=  30: 0.81ms — already degrading
# d= 100: 8.34ms — approaching brute force speed

Hash-Based Methods (LSH)
Locality Sensitive Hashing (LSH) uses randomly chosen hash functions designed so that nearby vectors hash to the same bucket with high probability. A query hashes to a bucket and only compares against vectors in that bucket.
LSH is theoretically elegant but practically difficult to tune. The recall is controlled by the number of hash tables and the hash function parameters, and the relationship between those and the resulting recall on specific data distributions is not intuitive. It has largely been displaced by graph-based methods in production systems.
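To make the mechanism concrete, here is a minimal random-hyperplane LSH sketch (the sign-of-projection family used for cosine similarity). It is illustrative only; production LSH layers many such hash tables to recover recall, which is exactly the tuning burden described above.

import numpy as np

rng = np.random.default_rng(42)
d, n_bits = 128, 8                # 8 random hyperplanes -> 256 buckets
planes = rng.standard_normal((n_bits, d)).astype(np.float32)

def lsh_code(vectors: np.ndarray) -> np.ndarray:
    """Bucket ID = the sign pattern of the vector against the hyperplanes."""
    bits = (vectors @ planes.T) > 0          # shape: (n, n_bits)
    return bits @ (1 << np.arange(n_bits))   # pack bits into one integer

corpus = rng.standard_normal((10_000, d)).astype(np.float32)
codes = lsh_code(corpus)

query = rng.standard_normal((1, d)).astype(np.float32)
candidates = np.where(codes == lsh_code(query)[0])[0]
print(f"Exact comparison needed against only {len(candidates)} of 10,000 vectors")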
Graph-Based Methods (HNSW)
Graph-based methods, specifically HNSW, are the dominant approach in production vector databases. Each vector is a node. Nodes are connected to their nearest neighbors. At query time, the graph is traversed from a starting point to the local neighborhood of the query using greedy best-first search.
HNSW is covered in complete detail, with diagrams, in the HNSW algorithm article. The key properties: incremental insertion (no batch rebuild required), high recall at competitive latency, and O(log n) query time with no training phase.
Cluster-Based Methods (IVF)
Cluster-based methods, specifically IVF (Inverted File Index), partition vectors into clusters using k-means. At query time, only the clusters nearest to the query are searched. IVF requires a training phase before vectors can be inserted and cannot be built incrementally, but has lower memory overhead than HNSW because it does not store graph edges.
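A minimal FAISS IVF sketch, mirroring the HNSW snippet earlier in this article (the nlist and nprobe values here are illustrative, not recommendations):

import faiss
import numpy as np

d, n = 128, 100_000
corpus = np.random.randn(n, d).astype(np.float32)
faiss.normalize_L2(corpus)

# IVF requires training: k-means learns nlist cluster centroids first
nlist = 1024
quantizer = faiss.IndexFlatIP(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
ivf.train(corpus)   # must happen before add()
ivf.add(corpus)

# At query time, only the nprobe clusters nearest the query are scanned
ivf.nprobe = 32
query = np.random.randn(1, d).astype(np.float32)
faiss.normalize_L2(query)
scores, ids = ivf.search(query, 10)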
IVF is covered in complete detail in the IVF index article. The cross-link between the HNSW and IVF articles (C2.4 to C2.5 in this series) reflects their complementary position as the two dominant production algorithms.
Recall Varies Across Query Types
One property of ANN recall that practitioners frequently miss: recall is not uniform across all queries. The mean recall across all queries hides per-query variance that can be significant.
import numpy as np
import faiss

def per_query_recall_distribution(
    corpus: np.ndarray,
    queries: np.ndarray,
    k: int = 10,
    ef_search: int = 64,
) -> list[float]:
    """
    Return per-query recall to reveal the distribution, not just the mean.
    """
    d = corpus.shape[1]

    flat = faiss.IndexFlatIP(d)
    flat.add(corpus)
    _, exact_ids = flat.search(queries, k)

    hnsw = faiss.IndexHNSWFlat(d, 16)
    hnsw.hnsw.efSearch = ef_search
    hnsw.add(corpus)
    _, ann_ids = hnsw.search(queries, k)

    per_query = []
    for i in range(len(queries)):
        recall = len(set(exact_ids[i]) & set(ann_ids[i])) / k
        per_query.append(recall)
    return per_query

d = 128
corpus = np.random.randn(100_000, d).astype(np.float32)
queries = np.random.randn(1_000, d).astype(np.float32)
faiss.normalize_L2(corpus)
faiss.normalize_L2(queries)

recalls = per_query_recall_distribution(corpus, queries, k=10, ef_search=64)
recalls_arr = np.array(recalls)
print(f"Mean recall:   {recalls_arr.mean():.4f}")
print(f"Median recall: {np.median(recalls_arr):.4f}")
print(f"P5 recall:     {np.percentile(recalls_arr, 5):.4f}")   # worst 5% of queries
print(f"P1 recall:     {np.percentile(recalls_arr, 1):.4f}")   # worst 1% of queries
print(f"Min recall:    {recalls_arr.min():.4f}")

# Example output:
# Mean recall:   0.9731
# Median recall: 1.0000
# P5 recall:     0.8000
# P1 recall:     0.7000
# Min recall:    0.5000

The P5 recall (worst 5 percent of queries) is 0.80 even when the mean is 0.97. That means 50 out of 1000 queries receive only 8 or fewer of their 10 true nearest neighbors. For a RAG system, those are the queries most likely to miss the correct answer document.
The recall of approximate HNSW search is linked to the vector space's intrinsic dimensionality and significantly influenced by the data insertion sequence: Dobson et al. (arxiv.org/abs/2405.17813, listed in the sources below) found that insertion order alone can shift recall by up to 12 percentage points.
This variance is data-dependent. Vectors in sparse regions of the embedding space (unusual queries, rare topics) tend to have lower recall because their nearest neighbors are farther away and the HNSW graph has fewer short-range connections in that region. Identifying which query categories in your workload hit low recall and increasing ef_search specifically for those is a high-value production optimization.
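One way to act on that observation, sketched under the assumption that you can classify incoming queries into categories you have calibrated offline (the category names and ef values here are hypothetical):

# Hypothetical per-category ef_search overrides, calibrated offline for
# query categories that measured low recall. Note that mutating efSearch
# on a shared index is not safe under concurrent queries.
EF_OVERRIDES = {"rare_topic": 256, "long_tail_entity": 128}
DEFAULT_EF = 64

def search_with_category_ef(hnsw_index, query, k: int, category: str):
    hnsw_index.hnsw.efSearch = EF_OVERRIDES.get(category, DEFAULT_EF)
    return hnsw_index.search(query, k)

Recent FAISS versions also accept per-call search parameters, which avoids mutating shared index state between queries.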
Tuning ANN Parameters for a Target Recall
The standard approach to ANN tuning in production is to establish a target recall (typically 95 percent mean recall with 90 percent P5 recall) and then find the lowest ef_search that achieves it on your representative query set.
import numpy as np
import faiss

def find_ef_for_target_recall(
    corpus: np.ndarray,
    representative_queries: np.ndarray,
    ground_truth_ids: np.ndarray,
    target_mean_recall: float = 0.95,
    target_p5_recall: float = 0.90,
    k: int = 10,
) -> int:
    """
    Binary search for the minimum ef_search that achieves the target recall.

    corpus: your full vector collection
    representative_queries: 500 to 1000 queries covering your real workload
    ground_truth_ids: exact search results for those queries
    target_mean_recall: minimum acceptable mean recall across queries
    target_p5_recall: minimum acceptable P5 recall (worst 5% of queries)
    """
    d = corpus.shape[1]
    hnsw = faiss.IndexHNSWFlat(d, 16)
    hnsw.hnsw.efConstruction = 64
    hnsw.add(corpus)

    lo, hi = 16, 512
    while lo < hi:
        mid = (lo + hi) // 2
        hnsw.hnsw.efSearch = mid
        _, ann_ids = hnsw.search(representative_queries, k)
        recalls = [
            len(set(ground_truth_ids[i]) & set(ann_ids[i])) / k
            for i in range(len(representative_queries))
        ]
        recalls_arr = np.array(recalls)
        mean_r = float(recalls_arr.mean())
        p5_r = float(np.percentile(recalls_arr, 5))
        if mean_r >= target_mean_recall and p5_r >= target_p5_recall:
            hi = mid      # target met: can we achieve it with lower ef?
        else:
            lo = mid + 1  # target missed: need higher ef
    return lo  # note: if even ef=512 misses the target, 512 is returned anyway

# Usage in production setup:
# 1. Collect 500 representative queries from your query logs
# 2. Run exact search on those queries to get ground truth
# 3. Find the minimum ef_search that hits your recall target
# 4. Set that ef_search in your vector database configuration
# 5. Re-measure after adding significant new data (index quality can drift)
# (corpus, queries, and exact_ids below are the arrays prepared in the
#  recall measurement snippet earlier in this article)
optimal_ef = find_ef_for_target_recall(
    corpus=corpus,
    representative_queries=queries,
    ground_truth_ids=exact_ids,
    target_mean_recall=0.97,
    target_p5_recall=0.92,
    k=10,
)
print(f"Minimum ef_search for target recall: {optimal_ef}")

This binary search approach finds the minimum ef_search that meets your recall requirements. Running it with 500 to 1000 representative queries gives a reliable estimate; running it with only 50 queries risks overfitting to a specific sample.
The tuning should be repeated whenever the collection changes significantly (more than 10 to 20 percent growth, or a change in the distribution of content being indexed). Index quality is not static: inserting vectors into an HNSW graph incrementally can degrade recall if the insertion order differs significantly from a batch build.
When ANN Recall Is Not the Right Metric to Optimize
Recall@K against exact search is the right metric when exact search would give the correct answer. But "correct" in a semantic search context depends on whether the embedding model has placed the relevant documents near the query in the first place.
If a document answers the user's question but its embedding is far from the query embedding (because the model uses different vocabulary, the chunk is too large and its embedding is diffuse, or the model is poorly suited to this domain), then even perfect recall@K against exact search will miss it. The document is correctly identified as far from the query by both exact and ANN search because the embedding model placed it there.
In this case, the correct fix is not to increase ef_search. The correct fix is to improve the embedding model, improve chunking strategy, or add hybrid search with BM25 to cover vocabulary mismatches. The how vector databases work internally article covers the full ingestion pipeline where these upstream decisions are made.
The appeal of ANN is that, in many cases, an approximate nearest neighbor is almost as good as the exact one. In particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter.
Exact vs ANN: Side-by-Side Reference
Property               | Exact (Flat) Search        | ANN Search (HNSW / IVF)
-----------------------+----------------------------+----------------------------
Result guarantee       | Mathematically exact       | Approximate (bounded error)
Recall@K               | 1.000 (100%)               | 0.90 to 0.99 (tunable)
Time complexity        | O(n * d)                   | O(log n) or O(sqrt(n))
Latency at 1M vectors  | ~750ms (1536-dim, CPU)     | 5 to 20ms (HNSW, ef=64)
Latency at 10M         | ~7,500ms                   | 5 to 25ms (scales well)
Memory overhead        | None beyond raw vectors    | 1.5x to 2x (HNSW graph)
Training required      | No                         | Yes (IVF), No (HNSW)
Incremental inserts    | Yes (trivial)              | Yes (HNSW), Rebuild (IVF)
Use case               | Small collections,         | Production at any scale
                       | benchmarking,              | above ~100K vectors
                       | active segment             |

Connecting the Series
This article establishes why ANN is necessary (exact search cannot scale) and how accuracy is measured (Recall@K against ground truth). The next two articles go deep on the two dominant ANN algorithms.
The HNSW algorithm article covers the graph construction and query traversal algorithm in full detail with diagrams. It connects to the IVF index article for the direct algorithmic comparison. Those two articles (C2.4 and C2.5 in this series) share a cross-link that compares HNSW and IVF on the same dimensions: build time, memory, recall at matched latency, and ideal scale.
The vector indexing article covers the broader discipline of how index types are chosen and evaluated. The vector query lifecycle article shows exactly where ANN search sits within the full request processing pipeline.
Summary
Exact nearest neighbor search is O(n) and scales linearly with collection size. At one million 1536-dimensional vectors, a single query takes approximately 750ms on a modern CPU, which is far outside production latency budgets. ANN search uses index structures to reduce that to 5 to 20ms with 95 to 99 percent recall.
Recall@K, measured against exact search ground truth on representative queries, is the primary accuracy metric for ANN systems. It is not uniform across queries: the mean hides per-query variance where some queries receive significantly lower recall than average. Production tuning should target both mean recall and P5 or P10 recall to ensure acceptable performance on worst-case queries.
The ef_search parameter in HNSW and the nprobe parameter in IVF are the primary levers for controlling the recall-latency tradeoff. Binary search over those parameters against a representative query sample is the correct method for finding the minimum value that meets your accuracy requirements.
Exact search remains useful for small collections, for active segment search within production databases, and as the ground truth reference when calibrating ANN recall. For every other production vector search workload, ANN is the correct architectural choice.
Sources and Further Reading
- Milvus AI Reference. What Is the Difference Between k-NN and ANN in Vector Search? milvus.io/ai-quick-reference/knn-vs-ann
- Milvus AI Reference. What Is Recall in Vector Search? milvus.io/ai-quick-reference/recall-in-vector-search
- Elastic. Understanding the Approximate Nearest Neighbor Algorithm. elastic.co/blog/understanding-ann
- GeeksforGeeks. Approximate Nearest Neighbor (ANN) Search. geeksforgeeks.org/machine-learning/approximate-nearest-neighbor-ann-search
- CelerData. Approximate Nearest Neighbor (ANN). celerdata.com/glossary/approximate-nearest-neighbor
- Shaped AI. A Comprehensive Guide to Approximate Nearest Neighbor Algorithms. shaped.ai/blog/approximate-nearest-neighbors-algorithms
- Shadecoder. Approximate Nearest Neighbor: A Comprehensive Guide for 2025. shadecoder.com/topics/approximate-nearest-neighbor
- APXML. Core Concepts of ANN Search. apxml.com/courses/vector-databases-semantic-search/chapter-3-approximate-nearest-neighbor-search
- OpenSource Connections. Vector Search: Navigating Recall and Performance. opensourceconnections.com/blog/2025/02/27/vector-search-navigating-recall-and-performance
- Weaviate. Evaluation Metrics for Search and Recommendation Systems. weaviate.io/blog/retrieval-evaluation-metrics
- ANN-Benchmarks. Benchmarking Results for ANN Algorithms. ann-benchmarks.com
- Aumueller, Bernhardsson, Faithfull. ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms. arxiv.org/abs/1807.05614
- Dobson et al. The Impacts of Data, Ordering, and Intrinsic Dimensionality on Recall in HNSW. arxiv.org/abs/2405.17813
- FAISS. Getting Started and Index Types. faiss.ai
- Pinecone. What Is Approximate Nearest Neighbor Search? pinecone.io/learn/approximate-nearest-neighbor