IVF Index Explained for Vector Search
A research-backed technical explanation of the Inverted File Index (IVF) for vector similarity search. Learn how k-means training partitions the vector space, how the two-stage coarse-to-fine search works, how nlist and nprobe control the recall-latency tradeoff, how IVF-PQ extends IVF for billion-scale deployments, and when to choose IVF over HNSW.
A reference librarian does not read every book in the library to find you the one you need. They know the library is organized by subject, and they narrow the search to the shelves most likely to contain what you are looking for before reading any specific entry.
IVF (Inverted File Index) applies this exact intuition to vector search. Before any fine-grained similarity comparison happens, a coarse stage identifies which region of the vector space is most relevant to the query. Only vectors in that region are compared. The rest of the collection is skipped entirely.
The result is a dramatic reduction in the number of distance computations per query: from O(n) in brute-force search to approximately O(nlist + n/nlist * nprobe) in IVF. At one million vectors with nlist=1024 and nprobe=10, that is roughly 11,000 comparisons instead of one million.
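That arithmetic can be checked with a toy cost model (illustrative only; it ignores constant factors and assumes perfectly balanced clusters):

```python
def ivf_comparisons(n: int, nlist: int, nprobe: int) -> int:
    # Stage 1: compare the query against every centroid (nlist comparisons).
    # Stage 2: scan nprobe inverted lists of roughly n / nlist vectors each.
    return nlist + (n // nlist) * nprobe

print(f"{ivf_comparisons(1_000_000, 1024, 10):,} comparisons vs 1,000,000 brute-force")
# 10,784 comparisons vs 1,000,000 brute-force
```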
This article covers the full IVF algorithm from k-means training through query traversal and parameter tuning, including the IVF-PQ variant that makes billion-scale deployment feasible. It is part of the How Vector Databases Work Internally series and connects directly to the HNSW algorithm article, which covers the graph-based alternative that dominates at moderate scale. The exact vs ANN article establishes the context of why approximate search is necessary, and the Product Quantization article covers the compression technique that IVF-PQ adds on top of the base IVF structure.
The Name and the Analogy
The term "inverted file" comes from information retrieval. A forward index maps each document to its list of words. An inverted index maps each word to the list of documents containing it. The inverted structure allows you to go from "I have a word" to "these are the documents that contain it" without scanning every document.
IVF in vector search uses the same inversion logic. Instead of words and documents, it uses cluster centroids and vectors. A forward assignment maps each vector to its centroid. The inverted list maps each centroid to all vectors assigned to it. At query time, "I have a query vector, which centroids is it near?" immediately narrows the search to the inverted lists of those centroids.
Forward (per-vector view):
vector_1 → assigned to centroid_7
vector_2 → assigned to centroid_2
vector_3 → assigned to centroid_7
vector_4 → assigned to centroid_15
...
Inverted (per-centroid view):
centroid_2: [vector_2, vector_8, vector_14, ...]
centroid_7: [vector_1, vector_3, vector_9, ...]
centroid_15: [vector_4, vector_11, ...]
...
Query: "Which centroids are nearest to q?"
→ centroid_7 and centroid_2 are the two nearest
→ search only inverted lists for centroid_7 and centroid_2
→ compare q against [vector_1, vector_3, vector_9, ...] and [vector_2, vector_8, ...]
→ skip all vectors in all other clusters
According to Zilliz's IVF index guide, clustering centroids act as proxies for narrowing the search space. When a query vector is received, the system first compares it to all centroids to identify the nearest clusters. Only the vectors in these selected clusters are then scanned for similarity against the query.
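The inversion itself can be sketched in a few lines of NumPy (a toy illustration, assuming normalized vectors and inner-product similarity):

```python
import numpy as np

def build_inverted_lists(vectors: np.ndarray, centroids: np.ndarray) -> dict:
    # Forward step: assign each vector to its most similar centroid.
    assignments = (vectors @ centroids.T).argmax(axis=1)
    # Inverted step: group vector ids by the centroid they were assigned to.
    return {c: np.flatnonzero(assignments == c) for c in range(len(centroids))}

centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
vectors = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
inv = build_inverted_lists(vectors, centroids)
print(inv)  # {0: array([0, 2]), 1: array([1])}
```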
Phase 1: Training the Index
Before any vector can be indexed, IVF must learn the cluster structure of the data. This training phase runs k-means clustering on a representative sample of the vectors you plan to index.
K-Means Clustering
K-means starts with nlist randomly chosen centroids. It then iterates:
- Assign each training vector to its nearest centroid.
- Recompute each centroid as the mean of all vectors assigned to it.
- Repeat until centroid positions stop changing (convergence).
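That loop can be sketched directly in NumPy (a minimal Lloyd's iteration for illustration; FAISS's k-means implementation is far more optimized):

```python
import numpy as np

def kmeans(x: np.ndarray, nlist: int, n_iter: int = 20, seed: int = 0):
    rng = np.random.default_rng(seed)
    # Start from nlist randomly chosen training vectors.
    centroids = x[rng.choice(len(x), nlist, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared L2 distance.
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        for c in range(nlist):
            members = x[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return centroids, assign
```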
import numpy as np
import faiss
def train_ivf_index(
training_vectors: np.ndarray,
d: int,
nlist: int,
n_iterations: int = 20,
) -> faiss.IndexIVFFlat:
"""
Train an IVF-Flat index by running k-means on training_vectors.
The training phase learns nlist cluster centroids.
After training, new vectors can be added to the index.
"""
# The quantizer is a flat index used to find the nearest centroid
# for each vector during both training and query
quantizer = faiss.IndexFlatIP(d) # Inner product (cosine after L2 norm)
index = faiss.IndexIVFFlat(
quantizer,
d,
nlist,
faiss.METRIC_INNER_PRODUCT,
)
# Training: run k-means on the training data
# training_vectors should be at least 39 * nlist for stable clustering
print(f"Training on {len(training_vectors):,} vectors with nlist={nlist}...")
index.train(training_vectors)
print(f"Training complete. Index is_trained: {index.is_trained}")
return index
# How many training vectors are needed?
# Rule of thumb: at least 39 * nlist, ideally more
# FAISS will warn if you provide fewer than this minimum
nlist = 1024
min_training = 39 * nlist # ~40,000 vectors for nlist=1024
ideal_training = 100 * nlist # ~100,000 vectors for better cluster quality
print(f"nlist={nlist}: minimum {min_training:,}, ideal {ideal_training:,} training vectors")
# nlist=1024: minimum 40,000, ideal 100,000 training vectors
The training phase only runs once per index. Once centroids are learned, they are fixed. All subsequent vector additions use these pre-learned centroids to determine cluster assignment.
The training data must be representative of the full collection you plan to index. If your training data does not cover all semantic regions of the embedding space, some regions will have poorly placed centroids and recall will suffer for queries in those regions. A common practice is to train on a random sample of 5 to 10 percent of the full corpus, which is large enough to represent the distribution without making training impractically slow.
Visualizing the Partitioned Space
After training, the vector space is divided into nlist Voronoi cells. Each cell contains all points closer to its centroid than to any other centroid. The boundaries between cells are equidistant from neighboring centroids.
2D illustration of IVF partitioning (nlist=6):
┌─────────────────────────────────────────┐
│ C1 C2 C3 │
│ × . . . × . . × . . . │
│ . . . . . . . . . . . . . │
│ . . . . . . . . . . . . . │
│ Cell 1 Cell 2 Cell 3 │
├──────────────┬─────────────┬────────────┤
│ Cell 4 │ Cell 5 │ Cell 6 │
│ . . . . │ . . . . . │ . . . │
│ . . . │ . . . . │ . . │
│ C4 × │ C5 × │ C6 × │
└──────────────┴─────────────┴────────────┘
Query q: falls near the boundary of Cell 2 and Cell 5.
nprobe=1 → searches Cell 2 only → may miss neighbors in Cell 5.
nprobe=2 → searches Cell 2 and Cell 5 → recovers boundary neighbors.
This boundary effect is the fundamental limitation of IVF. According to VeloDB's IVF explanation, a very small nprobe (like 1) makes queries very fast but can miss neighbors that lie outside the single closest cluster. This is the cluster boundary problem: a query near a cluster boundary might actually be closer to points in an adjacent cluster.
Phase 2: Adding Vectors to the Index
After training, vectors are added to the index. Each vector is assigned to the centroid it is closest to and appended to that centroid's inverted list.
import numpy as np
import faiss
import time
d = 1536 # OpenAI text-embedding-3-small dimensions
n = 1_000_000
nlist = 1024
# Generate synthetic vectors for demonstration
corpus = np.random.randn(n, d).astype(np.float32)
faiss.normalize_L2(corpus)
# Train and then add vectors
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
# Train on a representative sample
train_sample = corpus[:100_000]
t_train = time.perf_counter()
index.train(train_sample)
train_s = time.perf_counter() - t_train
print(f"Training time: {train_s:.1f}s on {len(train_sample):,} vectors")
# Add all vectors (assigns each to nearest centroid)
t_add = time.perf_counter()
index.add(corpus)
add_s = time.perf_counter() - t_add
print(f"Indexing time: {add_s:.1f}s for {n:,} vectors")
print(f"Index contains: {index.ntotal:,} vectors in {nlist} clusters")
print(f"Avg vectors per cluster: {n // nlist:,}")
# Training time: 12.3s on 100,000 vectors
# Indexing time: 4.8s for 1,000,000 vectors
# Index contains: 1,000,000 vectors in 1024 clusters
# Avg vectors per cluster: 976
The indexing phase (adding vectors after training) is fast because each vector needs only one pass of centroid comparisons: compare it against all nlist centroids and assign it to the nearest. This is O(n * nlist * d) in total, which for typical values is much faster than building an HNSW graph.
Phase 3: Query Execution
The query pipeline has two stages: coarse search and fine search.
Stage 1: Coarse Search (Centroid Comparison)
The query vector is compared against all nlist centroids. The nprobe closest centroids are selected. This stage is fast because there are only nlist centroids: typically 256 to 4096 comparisons, not millions.
Stage 1: Coarse centroid search
Query vector q:
Compare q to centroid_0: similarity = 0.31
Compare q to centroid_1: similarity = 0.72 ← 2nd nearest
Compare q to centroid_2: similarity = 0.89 ← nearest
Compare q to centroid_3: similarity = 0.45
...
Compare q to centroid_1023: similarity = 0.18
nprobe = 2 → search inverted lists for centroid_2 and centroid_1
Total centroid comparisons: nlist = 1024 (fast, one pass)
Stage 2: Fine Search (Within Selected Clusters)
The inverted lists of the nprobe selected centroids are searched exhaustively. Every vector in those lists is compared against the query vector. The top-K across all searched lists are returned.
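Both stages together can be sketched in plain NumPy (a toy model of the pipeline, not the FAISS implementation; it assumes inner-product similarity and an `inverted_lists` dict mapping each centroid id to an array of vector ids):

```python
import numpy as np

def ivf_search(q, centroids, inverted_lists, vectors, k=10, nprobe=2):
    # Stage 1 (coarse): rank all centroids by inner-product similarity to q.
    probe = np.argsort(centroids @ q)[::-1][:nprobe]
    # Stage 2 (fine): exhaustively scan only the selected inverted lists.
    candidates = np.concatenate([inverted_lists[int(c)] for c in probe])
    sims = vectors[candidates] @ q
    order = np.argsort(sims)[::-1][:k]
    return candidates[order], sims[order]

centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
vectors = np.array([[0.99, 0.1], [0.1, 0.99], [0.9, 0.3], [0.2, 0.95]])
inverted = {0: np.array([0, 2]), 1: np.array([1, 3])}
ids, sims = ivf_search(np.array([1.0, 0.2]), centroids, inverted, vectors, k=2, nprobe=1)
print(ids)  # [0 2] — only cluster 0's list was scanned
```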
import numpy as np
import faiss
def ivf_search_with_measurement(
index: faiss.IndexIVFFlat,
queries: np.ndarray,
k: int = 10,
nprobe: int = 10,
) -> tuple[np.ndarray, np.ndarray, float]:
"""
Run IVF search and return distances, indices, and latency.
nprobe can be changed at query time — no index rebuild needed.
"""
index.nprobe = nprobe
import time
t0 = time.perf_counter()
distances, indices = index.search(queries, k)
elapsed_ms = (time.perf_counter() - t0) * 1000 / len(queries)
return distances, indices, elapsed_ms
def compute_recall(ground_truth: np.ndarray, ann_results: np.ndarray) -> float:
k = ground_truth.shape[1]
recalls = [
len(set(ground_truth[i]) & set(ann_results[i])) / k
for i in range(len(ground_truth))
]
return float(np.mean(recalls))
# Build ground truth with flat (exact) search
d = 128
n = 500_000
n_queries = 500
k = 10
corpus = np.random.randn(n, d).astype(np.float32)
queries = np.random.randn(n_queries, d).astype(np.float32)
faiss.normalize_L2(corpus)
faiss.normalize_L2(queries)
flat_index = faiss.IndexFlatIP(d)
flat_index.add(corpus)
gt_distances, gt_indices = flat_index.search(queries, k)
# Build IVF index
nlist = 1024
quantizer = faiss.IndexFlatIP(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
ivf.train(corpus[:50_000])
ivf.add(corpus)
# Measure recall vs latency at different nprobe values
print(f"{'nprobe':>8} | {'Recall@10':>10} | {'Latency (ms)':>13} | {'Vectors searched':>17}")
print("-" * 60)
for nprobe in [1, 5, 10, 20, 40, 64, 100, 200, nlist]:
_, ann_ids, latency = ivf_search_with_measurement(ivf, queries, k, nprobe)
recall = compute_recall(gt_indices, ann_ids)
# Approximate number of vectors searched per query
avg_list_size = n // nlist
approx_searched = min(nprobe * avg_list_size, n)
print(
f"{nprobe:>8} | {recall:>10.4f} | {latency:>11.3f}ms | "
f"{approx_searched:>15,} ({100*approx_searched/n:.1f}%)"
)
# Approximate output (500K vectors, 128 dims, nlist=1024):
# nprobe | Recall@10 | Latency (ms) | Vectors searched
# ------------------------------------------------------------
# 1 | 0.4832 | 0.082ms | 488 (0.1%)
# 5 | 0.7813 | 0.218ms | 2,441 (0.5%)
# 10 | 0.8924 | 0.401ms | 4,883 (1.0%)
# 20 | 0.9448 | 0.782ms | 9,766 (2.0%)
# 40 | 0.9731 | 1.541ms | 19,531 (3.9%)
# 64 | 0.9856 | 2.438ms | 31,250 (6.3%)
# 100 | 0.9934 | 3.796ms | 48,828 (9.8%)
# 200 | 0.9978 | 7.512ms | 97,656 (19.5%)
# 1024 | 1.0000 | 38.241ms | 500,000 (100%) ← exact search
The output shows the tradeoff in concrete numbers. At nprobe=10, you search only 1 percent of the collection and achieve 89 percent recall. At nprobe=40, you search 4 percent and achieve 97 percent recall. At nprobe=1024 (equal to nlist), you search 100 percent and achieve 100 percent recall — equivalent to exact search, but slower than a flat index because of the centroid lookup overhead.
The Two Key Parameters: nlist and nprobe
nlist: How Many Clusters
nlist is set at index creation time and cannot be changed without rebuilding the index. It controls how many Voronoi cells partition the vector space.
Effect of nlist on cluster structure:
Small nlist (e.g., 256 for 1M vectors):
Each cluster contains ~3,906 vectors on average.
Centroid comparison: 256 operations (very fast).
Fine search per cluster: ~3,906 vectors examined.
Downside: Large clusters mean each nprobe searches many vectors.
Must use higher nprobe to achieve same recall.
Large nlist (e.g., 4096 for 1M vectors):
Each cluster contains ~244 vectors on average.
Centroid comparison: 4,096 operations (slower first stage).
Fine search per cluster: ~244 vectors examined.
Downside: Many small clusters → true neighbors spread across more clusters.
Must use higher nprobe to cover the same proportion of space.
Optimal nlist (heuristic: sqrt(n)):
For 1M vectors: sqrt(1,000,000) = 1,000 → use nlist=1,024 (nearest power of 2).
Balances centroid comparison cost against per-cluster search cost.
According to multiple sources including RAPIDS cuVS documentation and NVIDIA's technical blog, empirically the square root of the number of indexed vectors is a good starting point for nlist. The pgvector documentation provides a complementary heuristic: for datasets under one million rows, use nlist = rows / 1000; for larger datasets, use nlist = sqrt(rows).
def recommended_nlist(n_vectors: int) -> int:
"""Starting nlist recommendation based on dataset size."""
import math
if n_vectors < 1_000_000:
return max(1, n_vectors // 1000)
else:
raw = int(math.sqrt(n_vectors))
# Round to nearest power of 2 for memory alignment
return 2 ** round(math.log2(raw))
for n in [10_000, 100_000, 1_000_000, 10_000_000, 100_000_000]:
nlist = recommended_nlist(n)
avg_per_cluster = n // nlist
print(f"n={n:>12,} → nlist={nlist:>6,} (avg {avg_per_cluster:>6,} vectors/cluster)")
# n= 10,000 → nlist= 10 (avg 1000 vectors/cluster)
# n= 100,000 → nlist= 100 (avg 1000 vectors/cluster)
# n= 1,000,000 → nlist= 1024 (avg 976 vectors/cluster)
# n= 10,000,000 → nlist= 4096 (avg 2441 vectors/cluster)
# n= 100,000,000 → nlist= 8,192 (avg 12,207 vectors/cluster)
Because nlist is fixed at creation time, changing it requires a full index rebuild. Test several values (ideally in powers of two such as 512, 1024, 2048) on your actual data before committing to a production configuration.
nprobe: How Many Clusters to Search
nprobe is a query-time parameter. It can be changed per query without touching the index. This makes it the primary lever for controlling the recall-latency tradeoff in production.
# nprobe can be set globally or per-query in FAISS
index.nprobe = 10 # applies to all subsequent searches
# Qdrant uses HNSW rather than IVF, so it does not expose nprobe;
# the analogous recall/latency knobs are set via search params
from qdrant_client import QdrantClient, models
client = QdrantClient(host="localhost", port=6333)
# Standard search uses default parameters
results = client.search(
collection_name="my_collection",
query_vector=query_vec,
limit=10,
)
# Higher-recall search via quantization rescoring and oversampling
results_high_recall = client.search(
collection_name="my_collection",
query_vector=query_vec,
limit=10,
search_params=models.SearchParams(
quantization=models.QuantizationSearchParams(
rescore=True,
oversampling=2.0,
),
exact=False,
),
)
A practical rule for nprobe starting points based on nlist:
nlist setting | Low recall target (90%) | High recall target (99%)
----------------+-------------------------+--------------------------
256 | nprobe = 5 to 10 | nprobe = 20 to 40
1024 | nprobe = 10 to 20 | nprobe = 40 to 100
4096 | nprobe = 20 to 50 | nprobe = 100 to 200
Always measure recall on your actual data rather than relying on these tables. The boundary between clusters depends on data distribution, which varies significantly across different embedding models and document types.
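One way to operationalize that measurement: sweep nprobe upward until a recall target is met on a representative query sample (a sketch; `measure_recall` is a placeholder callable you would implement against your own index and ground truth):

```python
def calibrate_nprobe(measure_recall, nlist: int, target: float = 0.95) -> int:
    """Return the smallest nprobe (doubling upward) whose measured recall
    meets the target. measure_recall(nprobe) -> recall@k is supplied by the
    caller and should run a representative query sample against ground truth."""
    nprobe = 1
    while nprobe < nlist:
        if measure_recall(nprobe) >= target:
            return nprobe
        nprobe *= 2
    return nlist  # searching every cluster is equivalent to exact search

# Example with a fake, monotonically increasing recall curve:
fake_recall = lambda p: min(1.0, 0.5 + 0.05 * p)
print(calibrate_nprobe(fake_recall, 1024, target=0.95))  # 16
```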
The Cluster Boundary Problem in Depth
The cluster boundary problem is the most important failure mode to understand when deploying an IVF index.
Cluster boundary scenario:
Cluster A centroid: [0.8, 0.2, ...]
Cluster B centroid: [0.7, 0.3, ...]
Query q: [0.75, 0.25, ...] ← exactly between both centroids
True nearest neighbor T: [0.72, 0.28, ...] ← in Cluster B's inverted list
Distance from q to A centroid: 0.071
Distance from q to B centroid: 0.054 ← B is slightly closer
With nprobe=1: search Cluster B only → finds T → correct
With nprobe=1: search Cluster A only → misses T → wrong
But what if the clustering placed q slightly inside Cluster A?
q at [0.76, 0.24, ...]:
Distance to A centroid: 0.057 ← A is now slightly closer
nprobe=1 → searches A → misses T in B
This is a miss probability that cannot be eliminated with nprobe=1.
It only goes away when nprobe is large enough to cover both A and B.
def demonstrate_boundary_problem(n_boundary_queries: int = 1000):
"""
Measure recall specifically for queries near cluster boundaries.
These queries require higher nprobe to achieve the same recall as
queries near cluster centroids.
"""
import numpy as np
import faiss
d = 64
n = 100_000
nlist = 256
corpus = np.random.randn(n, d).astype(np.float32)
faiss.normalize_L2(corpus)
quantizer = faiss.IndexFlatIP(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
ivf.train(corpus)
ivf.add(corpus)
flat = faiss.IndexFlatIP(d)
flat.add(corpus)
# Generate queries near cluster boundaries by averaging two centroid locations
centroids = ivf.quantizer.reconstruct_n(0, nlist)  # (nlist, d) centroid matrix
boundary_queries = []
for _ in range(n_boundary_queries):
i, j = np.random.choice(nlist, 2, replace=False)
midpoint = (centroids[i] + centroids[j]) / 2
midpoint /= np.linalg.norm(midpoint)
boundary_queries.append(midpoint)
boundary_queries = np.array(boundary_queries, dtype=np.float32)
_, gt = flat.search(boundary_queries, 10)
print("Recall for queries near cluster boundaries:")
print(f"{'nprobe':>8} | {'Recall@10':>10}")
print("-" * 24)
for nprobe in [1, 2, 5, 10, 20, 50]:
ivf.nprobe = nprobe
_, ann = ivf.search(boundary_queries, 10)
recall = np.mean([
len(set(gt[i]) & set(ann[i])) / 10
for i in range(len(boundary_queries))
])
print(f"{nprobe:>8} | {recall:>10.4f}")
# Boundary queries show lower recall than average at same nprobe,
# confirming that nprobe must be tuned against worst-case queries,
# not just mean recall across all queries.
The diagnostic implication is important: if your production search has systematically lower recall for a specific category of queries, those queries may be landing near cluster boundaries. The fix is to increase nprobe (accepting higher latency) or to re-run training with more nlist clusters (so boundaries are more precisely defined). According to the Milvus IVF vs HNSW guide, a smaller nprobe leads to faster queries but may reduce recall, while a larger nprobe covers more clusters and leads to higher recall but higher latency.
IVF Variants: From IVFFlat to IVF-PQ
The base IVF algorithm stores full float32 vectors in each inverted list. This is called IVFFlat. Several variants reduce memory usage or improve performance at scale.
IVFFlat
Stores full float32 vectors. Simple, fast, and exact within each searched cluster. Memory usage is approximately the same as raw vector storage plus nlist centroid vectors.
# IVFFlat: full float32 vectors, exact search within clusters
quantizer = faiss.IndexFlatIP(d)
index_flat = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index_flat.train(training_data)
index_flat.add(corpus)
# Memory: n * d * 4 bytes (raw vectors) + nlist * d * 4 bytes (centroids)
# For 1M vectors at 1536 dims: ~6 GB + ~6 MB centroid overhead
IVF-PQ: Product Quantization for Compressed Storage
IVF-PQ compresses the vectors in each inverted list using Product Quantization (PQ). PQ splits each vector into M subvectors, clusters each subspace into 256 centroids, and stores only the centroid index (one byte) per subvector. A 1536-dimensional float32 vector (6,144 bytes) compressed with M=96 becomes 96 bytes, a 64x reduction.
import faiss
import numpy as np
d = 1536
n = 10_000_000 # 10 million vectors — IVF-PQ territory
nlist = 4096
M_pq = 96 # number of PQ subvectors (d must be divisible by M_pq)
nbits = 8 # 8 bits per subvector = 256 centroids per subspace
quantizer = faiss.IndexFlatIP(d)
index_ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, M_pq, nbits)
# Train requires representative sample
training_data = np.random.randn(min(n, 200_000), d).astype(np.float32)
faiss.normalize_L2(training_data)
index_ivfpq.train(training_data)
# Add all vectors
corpus = np.random.randn(n, d).astype(np.float32)
faiss.normalize_L2(corpus)
index_ivfpq.add(corpus)
# Memory comparison
raw_bytes = n * d * 4
ivfflat_bytes = raw_bytes # IVFFlat stores full vectors
ivfpq_bytes = n * M_pq # IVF-PQ stores M_pq bytes per vector
centroid_bytes = nlist * d * 4 # centroid vectors are stored uncompressed
print(f"IVFFlat memory: {ivfflat_bytes / 1e9:.1f} GB")
print(f"IVF-PQ memory: {(ivfpq_bytes + centroid_bytes) / 1e9:.2f} GB")
print(f"Compression: {ivfflat_bytes / (ivfpq_bytes + centroid_bytes):.0f}x")
# For 10M vectors at 1536 dims:
# IVFFlat memory: 61.4 GB
# IVF-PQ memory: 0.99 GB
# Compression: 62x
The compression is dramatic. Sixty-one gigabytes of IVFFlat storage becomes under one gigabyte with IVF-PQ. For deployments where RAM is constrained or the corpus size exceeds available memory, IVF-PQ is often the only feasible option.
The cost is accuracy loss during the fine search stage. IVF-PQ computes approximate distances using compressed codes rather than exact dot products. This introduces an additional source of error on top of the cluster selection approximation. In practice, well-tuned IVF-PQ achieves 80 to 95 percent recall at the same nprobe as IVFFlat, depending on the compression ratio and the data distribution.
According to VeloDB's IVF guide, if the lists store compressed codes (IVFPQ or similar), the distance is approximated via those codes, possibly using precomputed lookup tables for efficiency.
The Product Quantization article covers the PQ compression algorithm in full detail, including subspace decomposition, codebook training, and asymmetric distance computation.
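As a toy illustration of what "one byte per subvector" means, here is PQ encoding and lossy decoding in NumPy (codebook training is out of scope here; the codebooks are assumed already trained, one 256-entry table per subspace):

```python
import numpy as np

def pq_encode(x: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    # codebooks has shape (M, 256, d // M): one centroid table per subspace.
    M, _, dsub = codebooks.shape
    subs = x.reshape(M, dsub)
    # One uint8 code per subvector: index of the nearest subspace centroid.
    return np.array(
        [((codebooks[m] - subs[m]) ** 2).sum(axis=1).argmin() for m in range(M)],
        dtype=np.uint8,
    )

def pq_decode(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    # Lossy reconstruction: concatenate the selected subspace centroids.
    return np.concatenate([codebooks[m][c] for m, c in enumerate(codes)])

rng = np.random.default_rng(0)
cb = rng.normal(size=(4, 256, 2))          # M=4 subspaces of 2 dims each
x = np.concatenate([cb[m, 7] for m in range(4)])  # a vector built from entry 7
codes = pq_encode(x, cb)
print(codes)  # [7 7 7 7] — 4 bytes instead of 8 float32 values
```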
Comparing IVF and HNSW
The HNSW article covers the graph-based algorithm that is the primary alternative to IVF. The two algorithms make different tradeoffs across every relevant dimension.
Dimension | IVF | HNSW
-----------------------+-----------------------------+----------------------------
Algorithm type | Cluster-based partitioning | Graph traversal
Training required | Yes (k-means phase) | No
Incremental insert | Yes (assign to nearest | Yes (add node + connect)
| centroid, no rebuild) |
Insert quality | Stable (fixed centroids) | Degrades if insertion order
| | is poorly distributed
Build time | Fast after training | Moderate
Memory overhead | Centroid storage only | Graph edges: M * 8 bytes
| (~1% of raw vectors) | per node (1.5 to 2x total)
Recall at same latency | Lower (cluster boundaries) | Higher (denser graph)
Filtered search | Stable (search by cluster) | Degrades with very selective
| | filters
GPU acceleration | Strong (IVF-PQ with GPU) | Limited
Billion-scale | Yes (IVF-PQ) | Difficult (memory)
Scale ceiling | Effectively unlimited (PQ) | ~200M to 500M in RAM
Parameter tuning | nlist (build), nprobe (query)| M (build), ef (build+query)
Default in | FAISS, Milvus (PQ variant) | Qdrant, Weaviate, pgvector
The guidance from Milvus's IVF vs HNSW comparison: IVF can first perform a coarse-grained filter at the centroid level to quickly narrow down the candidate set, then conduct fine-grained distance calculations within the selected clusters. This maintains stable and predictable performance even when a large portion of data is filtered out.
HNSW's filtered search degrades when a metadata filter is highly selective because the graph traversal must skip many nodes, breaking the navigable structure. IVF is less affected because each cluster is an independent inverted list: a filter can be applied independently within each searched cluster without affecting the centroid comparison stage.
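The clean composition can be sketched schematically (not any engine's actual implementation; `allowed_ids` stands in for the result of a metadata predicate, and `candidate_ids` for the contents of the probed inverted lists):

```python
import numpy as np

def filtered_fine_scan(q, candidate_ids, vectors, allowed_ids, k=10):
    # The filter prunes candidates inside the probed inverted lists before
    # any distance is computed; the coarse centroid stage is untouched.
    ids = np.array([i for i in candidate_ids if i in allowed_ids], dtype=int)
    if ids.size == 0:
        return ids
    sims = vectors[ids] @ q
    return ids[np.argsort(sims)[::-1][:k]]

vectors = np.array([[0.5, 0.5], [0.9, 0.1], [0.3, 0.7], [0.2, 0.9]])
top = filtered_fine_scan(np.array([1.0, 0.0]), [0, 1, 2, 3], vectors, {1, 3}, k=2)
print(top)  # [1 3] — vectors 0 and 2 were filtered out before scoring
```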
Choose IVF when:
- Your dataset is very large (hundreds of millions or billions of vectors) and you need the compression of IVF-PQ to fit in memory.
- You have a static or slowly changing corpus where the training phase cost is amortized over long index lifetimes.
- Your queries involve highly selective metadata filters.
- You have GPU infrastructure and can use GPU-accelerated IVF-PQ for sub-millisecond search on billion-scale datasets.
Choose HNSW when:
- Your corpus is actively growing and you need efficient incremental insertion.
- Your dataset size is in the range of millions to low hundreds of millions of vectors that fit in RAM.
- You need maximum recall at the lowest possible latency.
- Your embedding model is well-suited to graph traversal (dense, uniformly distributed embeddings).
Using IVFFlat with pgvector
For teams already running PostgreSQL, pgvector supports IVFFlat through a standard SQL index syntax. This brings approximate vector search into the same database that stores your structured application data.
-- Create a table with a vector column
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT NOT NULL,
category VARCHAR(50),
embedding vector(1536) -- pgvector type
);
-- Create IVFFlat index
-- lists = nlist, analogous to FAISS nlist parameter
CREATE INDEX documents_embedding_ivfflat
ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1024);
-- Set nprobe at query time via session variable
SET ivfflat.probes = 20;
-- Search with metadata filter + IVF approximate search
SELECT
id,
content,
embedding <=> '[0.41, -0.22, ...]'::vector AS distance
FROM documents
WHERE category = 'support'
ORDER BY embedding <=> '[0.41, -0.22, ...]'::vector
LIMIT 10;
The <=> operator is pgvector's cosine distance operator. With the IVFFlat index active and ivfflat.probes set, the query uses the IVF two-stage search rather than exhaustive comparison. The WHERE category = 'support' filter is applied during the fine search stage within each cluster.
According to Maurício Maia's pgvector IVFFlat guide, keep the number of lists as low as the application's recall requirements allow, or alternatively raise probes as high as its latency budget allows. For datasets up to one million rows, lists = rows / 1000. For datasets above one million rows, lists = sqrt(rows).
Training Data Requirements and Common Mistakes
The quality of an IVF index depends entirely on the quality of k-means clustering, which depends on the quality and representativeness of training data.
Training data must be representative. If your collection has ten semantic categories (customer support, engineering, legal, finance, etc.) but your training data only covers three, the centroids will be poorly placed for the other seven. Queries in underrepresented regions will have low recall because nearby vectors are assigned to distant centroids.
import numpy as np
import faiss
def check_cluster_balance(index: faiss.IndexIVFFlat, n_vectors: int) -> dict:
"""
Analyze cluster size distribution to detect imbalanced training.
Well-trained IVF indexes should have roughly similar cluster sizes.
Large variance indicates poor training data or non-uniform data distribution.
"""
# Get the size of each inverted list
list_sizes = [index.invlists.list_size(i) for i in range(index.nlist)]
sizes = np.array(list_sizes)
expected = n_vectors / index.nlist
return {
"expected_per_cluster": expected,
"mean_size": float(sizes.mean()),
"min_size": int(sizes.min()),
"max_size": int(sizes.max()),
"std_dev": float(sizes.std()),
"cv": float(sizes.std() / sizes.mean()), # coefficient of variation
"empty_clusters": int((sizes == 0).sum()),
"oversized": int((sizes > 3 * expected).sum()),
}
stats = check_cluster_balance(ivf, n_vectors=len(corpus))
print(f"Cluster balance stats:")
print(f" Expected per cluster: {stats['expected_per_cluster']:.0f}")
print(f" Mean: {stats['mean_size']:.0f}")
print(f" Min / Max: {stats['min_size']} / {stats['max_size']}")
print(f" Coefficient of variation: {stats['cv']:.3f}")
print(f" Empty clusters: {stats['empty_clusters']}")
print(f" Oversized clusters: {stats['oversized']}")
# Healthy index: CV < 0.5, zero empty clusters, no extreme outliers.
# Unhealthy index: CV > 1.0, empty clusters, or clusters 10x the expected size.
# Fix: retrain with more representative and larger training data.
FAISS minimum training requirement. FAISS warns when given fewer than 39 * nlist training vectors. In practice, 100 * nlist or more produces better cluster quality. For nlist=1024, that is roughly 40,000 minimum and 100,000 recommended.
Retraining after major data changes. If more than 20 to 30 percent of your corpus changes, the centroid positions learned on the old distribution become suboptimal for the new distribution. The practical fix is to schedule periodic retraining during low-traffic windows, rebuild the index on the new corpus, and swap atomically.
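The "20 to 30 percent" rule can be turned into simple bookkeeping (a hypothetical sketch; the threshold and what counts as a change are judgment calls for your deployment):

```python
class RetrainTracker:
    """Flag the index for retraining once cumulative corpus churn
    (adds plus deletes) crosses a fraction of the indexed corpus."""

    def __init__(self, corpus_size: int, threshold: float = 0.25):
        self.corpus_size = corpus_size
        self.changed = 0
        self.threshold = threshold

    def record(self, n_added: int = 0, n_deleted: int = 0) -> None:
        self.changed += n_added + n_deleted

    def needs_retrain(self) -> bool:
        # True once churn reaches the threshold fraction of the corpus.
        return self.changed >= self.threshold * max(self.corpus_size, 1)

tracker = RetrainTracker(1_000_000)
tracker.record(n_added=100_000)
print(tracker.needs_retrain())  # False — only 10% churn so far
```

When `needs_retrain()` fires, the safe pattern is the one described above: retrain on the new corpus in the background during a low-traffic window, then swap the index atomically.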
Summary
IVF partitions the vector space into nlist clusters using k-means. Each cluster has a centroid. Each vector is assigned to its nearest centroid and stored in that centroid's inverted list. At query time, the query is compared to all nlist centroids (coarse search) and the nprobe nearest clusters are searched exhaustively (fine search).
The two key parameters are nlist (set at build time, controls cluster granularity, recommended starting point is sqrt(n)) and nprobe (set at query time, controls how many clusters are searched, directly controls the recall-latency tradeoff).
The cluster boundary problem is the fundamental recall limitation: queries near cluster boundaries need higher nprobe to cover neighboring clusters. Production nprobe should be calibrated against recall measurements on a representative query sample, paying attention to the distribution of per-query recall rather than just the mean.
IVFFlat stores full float32 vectors. IVF-PQ adds Product Quantization compression for 32 to 64 times memory reduction, enabling billion-scale deployments. IVF as a family handles selective metadata filters more gracefully than HNSW and is better suited to GPU-accelerated workloads. HNSW achieves higher recall at the same latency for moderate-scale workloads that fit in RAM.
The complete algorithm comparison with HNSW is in the HNSW article. The PQ compression mechanism is in the Product Quantization article. The full database architecture that surrounds both indexes is the How Vector Databases Work Internally pillar.
Sources and Further Reading
- Milvus Blog. How to Choose Between IVF and HNSW for ANN Vector Search. milvus.io/blog/understanding-ivf-vector-index
- VeloDB. Inverted File Index (IVF) in Vector Similarity Search. velodb.io/glossary/inverted-file-index-in-vector-similarity-search
- Zilliz AI FAQ. How Do IVF Indexes Work in Vector Databases? zilliz.com/ai-faq/how-do-inverted-file-ivf-indexes-work
- Pinecone. Nearest Neighbor Indexes for Similarity Search. pinecone.io/learn/series/faiss/vector-indexes
- Pinecone. Product Quantization: Compressing High-Dimensional Vectors. pinecone.io/learn/series/faiss/product-quantization
- Apache Doris. IVF Index Documentation. doris.apache.org/docs/dev/ai/vector-search/ivf
- APXML. Inverted File Index (IVF) Variations. apxml.com/courses/advanced-vector-search-llms/chapter-1-ann-algorithms/ivf-variations
- NVIDIA Technical Blog. Accelerated Vector Search: Approximating with NVIDIA cuVS IVF-Flat. developer.nvidia.com/blog/accelerated-vector-search-approximating-with-nvidia-cuvs-ivf-flat
- RAPIDS cuVS. IVF-Flat Index Documentation. docs.rapids.ai/api/cuvs/stable/indexes/ivfflat
- Maurício Maia. Optimizing PostgreSQL pgvector with IVFFlat Indexing. medium.com/@mauricio/optimizing-ivfflat-indexing-with-pgvector-in-postgresql
- TigerData. What Are IVFFlat Indexes in pgvector and How Do They Work? tigerdata.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector
- Ahmad Jawabreh. Inverted File Indexing (IVF) in FAISS: A Comprehensive Guide. medium.com/@Jawabreh0/inverted-file-indexing-ivf-in-faiss
- Siddharth Jain. Deep Dive into FAISS IndexIVFPQ for Vector Search. blog.siddjain.com/2023/12/30/deep-dive-into-faiss-indexivfpq-for-vector-search
- arXiv. Incremental IVF Index Maintenance for Streaming Vector Search. arxiv.org/abs/2411.00970
Krunal Kanojiya
Technical Content Writer
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.