
How Vector Databases Work Internally: Indexing, Search and Storage Explained

A technical deep dive into how vector databases work under the hood. Learn how the ingestion pipeline, storage layer, ANN index structures, similarity metrics, query lifecycle, metadata filtering, and distributed architecture fit together inside production vector database systems.

Krunal Kanojiya

#vector-database #HNSW #IVF #ANN #similarity-search #indexing #embeddings #RAG #Qdrant #Milvus #Pinecone #Weaviate

Most engineers who use a vector database think of it as a black box: send in a vector, get back the nearest neighbors. That model works until something goes wrong. A search that should return ten results returns three. Ingestion slows down at two million vectors. A metadata filter changes the answer set unexpectedly. Latency spikes under concurrent load.

Every one of those failure modes has an architectural explanation. Understanding what happens inside a vector database, from the moment a write arrives to the moment a search result is returned, is what gives you the tools to reason about those failures and fix them at the right layer.

This pillar covers the full internal architecture of a production vector database: the ingestion pipeline, the storage layer, the ANN index structures, the similarity metrics that determine what "nearest" means, the query lifecycle from API call to ranked result, metadata filtering strategies, and how the system scales horizontally. Each section connects to a dedicated cluster article that goes deeper on one piece.

If you are new to the field, the Vector Database Fundamentals series covers what a vector database is, what embeddings are, and why the underlying indexes work the way they do. This series assumes that foundation and goes a level deeper into the internal machinery.

The Four Layers of a Vector Database

Every production vector database, regardless of vendor, is built from the same four architectural layers stacked on top of each other.

plaintext
┌─────────────────────────────────────────────┐
│              API / Query Layer               │
│   REST, gRPC, Python SDK, filter parsing     │
├─────────────────────────────────────────────┤
│              Index Layer                     │
│   HNSW, IVF, flat (brute force), PQ          │
├─────────────────────────────────────────────┤
│              Storage Layer                   │
│   Vector store, metadata store, WAL          │
├─────────────────────────────────────────────┤
│           Infrastructure Layer               │
│   Replication, sharding, compaction, GC      │
└─────────────────────────────────────────────┘

The API layer receives queries and writes, parses filters, and routes requests. The index layer implements the ANN algorithms that make similarity search fast. The storage layer persists vectors, metadata, and the index itself to disk. The infrastructure layer handles durability, distribution, and maintenance operations that keep the system healthy over time.

Understanding each layer independently is what allows you to reason about performance and correctness separately. A slow query is almost always an index layer problem. A missing result after a recent upsert is almost always a storage or compaction timing problem. A metadata filter that returns unexpected results is almost always an API layer parsing problem or a filtering strategy mismatch.

The Ingestion Pipeline: What Happens When You Write a Vector

A write operation in a vector database is more complex than it appears from the client side. A single upsert call triggers a sequence of operations across multiple layers.

python
from qdrant_client import QdrantClient, models

client = QdrantClient(host="localhost", port=6333)

client.upsert(
    collection_name="knowledge-base",
    points=[
        models.PointStruct(
            id=1,
            vector=[0.41, -0.22, 0.88, ...],   # 1536 floats
            payload={"text": "refund policy", "source": "docs/refund.md"}
        )
    ]
)

From the perspective of the database internals, this single call triggers the following sequence:

plaintext
Client upsert request

1. Validation
   Dimension check: vector must have exactly 1536 floats
   ID format check: must be valid integer or UUID
   Payload schema: no constraints in most vector DBs

2. Write-Ahead Log (WAL) append
   Mutation serialized and written to WAL on disk
   WAL fsync: operation is now crash-safe
   Acknowledgment can now be returned to client

3. In-memory segment write
   Vector bytes appended to mmap'd vector store
   Payload written to metadata key-value store
   Soft-delete bitmap updated (if this is an update)

4. Background: index update
   New vector inserted into HNSW graph (incremental)
   Or: vector appended to unsealed IVF list

5. Background: segment sealing and compaction
   When segment reaches size threshold → seal + build full HNSW
   Periodic compaction merges small segments
   Deleted vectors purged from storage

According to Zilliz's vector database internals guide, a vector database provides data storage, CRUD operations, metadata filtering, and horizontal scaling. The durability guarantee comes from step 2: the WAL write and fsync must complete before the client receives an acknowledgment. Everything after that is best-effort background work.

The WAL is what separates a vector database from a standalone vector index like FAISS. FAISS keeps its index in memory and has no write-ahead logging: unless you explicitly serialize the index to disk, a process crash loses everything since the last save. A vector database writes every mutation to an append-only log before applying it, so a crashed process can replay the WAL on restart and recover to a consistent state.
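
A minimal sketch of that discipline, using an illustrative JSON-lines record format rather than any particular engine's on-disk layout:

python
import json
import os

class WriteAheadLog:
    """Simplified append-only log: serialize, append, fsync, then acknowledge."""

    def __init__(self, path: str):
        self.path = path
        self._file = open(path, "ab")

    def append(self, mutation: dict) -> None:
        record = json.dumps(mutation).encode() + b"\n"
        self._file.write(record)
        self._file.flush()
        os.fsync(self._file.fileno())   # durable on disk before the client gets an ack

    def replay(self):
        """On restart, re-apply every mutation recorded before the crash."""
        with open(self.path, "rb") as f:
            for line in f:
                yield json.loads(line)

wal = WriteAheadLog("collection.wal")
wal.append({"op": "upsert", "id": 1, "vector": [0.41, -0.22, 0.88],
            "payload": {"text": "refund policy"}})
# Only after append() returns is the upsert acknowledged; index and segment
# updates happen afterwards as background work.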

The Storage Layer: How Vectors Are Persisted

Storing millions of high-dimensional float vectors efficiently requires a different approach than storing rows in a relational database.

Vector Storage Format

Vectors are stored as flat binary arrays of 32-bit floats (float32) or 16-bit floats (float16 for compression). A 1536-dimensional float32 vector occupies exactly 1536 × 4 = 6,144 bytes. Vectors for a collection are stored contiguously, which allows the ANN index to read sequential ranges of vectors efficiently using memory-mapped files (mmap).

python
import numpy as np
import struct

def serialize_vector(vec: list[float]) -> bytes:
    """Convert a Python float list to packed binary float32."""
    return struct.pack(f"{len(vec)}f", *vec)

def deserialize_vector(data: bytes) -> list[float]:
    """Convert packed binary float32 back to Python floats."""
    n = len(data) // 4
    return list(struct.unpack(f"{n}f", data))

# A 1536-dim vector
vec = [0.41, -0.22, 0.88, 0.03, -0.71, ...]   # 1536 elements

serialized = serialize_vector(vec)
print(f"Serialized size: {len(serialized)} bytes")
# Serialized size: 6144 bytes

recovered = deserialize_vector(serialized)
assert np.allclose(recovered, vec)   # float32 round-trip loses a little precision

The flat binary format is cache-friendly for the ANN index. When HNSW traverses the graph and computes the distance between two nodes, it reads both vectors from memory sequentially. Contiguous float storage maximizes CPU cache hit rates for those reads.
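
A small illustration of that access pattern: memory-map a flat float32 file and read one vector as a contiguous slice. The file name and layout here are illustrative, not a real engine's format.

python
import numpy as np

dim = 1536

# Write a small flat binary file of float32 vectors (illustrative only)
np.random.randn(1_000, dim).astype(np.float32).tofile("vector_data.bin")

# Memory-map the file: the OS pages vectors in on demand, nothing is copied up front
store = np.memmap("vector_data.bin", dtype=np.float32, mode="r").reshape(-1, dim)

# Reading vector i is one contiguous read at byte offset i * dim * 4
vec = store[42]
print(vec.shape)   # (1536,)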

Metadata Storage

The payload (metadata) for each vector is stored separately from the vector bytes, typically in a column store or a key-value store indexed by vector ID. This separation allows the ANN index to operate entirely on float arrays without loading metadata, keeping the hot path for search as lean as possible.

python
# Logical separation between vector store and metadata store

# Vector store: ID → raw float bytes
vector_store = {
    1: serialize_vector([0.41, -0.22, 0.88, ...]),
    2: serialize_vector([-0.11, 0.73, -0.44, ...]),
}

# Metadata store: ID → structured payload
metadata_store = {
    1: {"text": "refund policy", "source": "docs/refund.md",   "category": "support"},
    2: {"text": "timeout errors", "source": "docs/errors.md",  "category": "engineering"},
}

# Index: float arrays only, stores IDs not payloads
# At query time: index returns [id=1, id=2], then metadata_store[1] fetched

According to WEKA's vector database architecture guide, a vector database persistently stores the metadata associated with each data point alongside the vector in the storage layer, in a structure optimized for fast retrieval and low memory usage.

The Segment Model

Rather than maintaining one monolithic index over the entire collection, modern vector databases use a segment-based storage model. A segment is an immutable unit of storage containing a batch of vectors, their metadata, and an ANN index built specifically over that batch.

plaintext
Collection: "knowledge-base" (5M vectors total)

Segment 1 (sealed, 1M vectors):
  vector_data.bin    → 6 GB of float32 arrays
  metadata.db        → 1M payload records
  hnsw_index.bin     → HNSW graph for these 1M vectors
  deleted.bitmap     → tracks which IDs are soft-deleted

Segment 2 (sealed, 1M vectors):
  [same structure]

Segment 3 (sealed, 1M vectors):
  [same structure]

Segment 4 (sealed, 1M vectors):
  [same structure]

Segment 5 (active, 1M vectors, growing):
  vector_data.bin    → vectors appended incrementally
  metadata.db        → payloads appended incrementally
  flat_index.bin     → brute-force search (no HNSW yet)
  wal.log            → uncommitted mutations

New writes always go into the active segment. The active segment uses brute-force (flat) search rather than HNSW because it is small and constantly changing, so keeping a graph index continuously up to date costs more than a flat scan saves. When the active segment reaches a size threshold (typically 100,000 to 1,000,000 vectors), it is sealed, a full HNSW index is built over its contents, and a new active segment is opened.

At query time, all segments are searched in parallel and results are merged by score before the top-K are returned. The brute-force search on the small active segment is fast because it contains relatively few vectors. The HNSW search on sealed segments is fast because of the ANN index.
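
A toy sketch of that per-segment search and merge, with brute-force cosine scoring standing in for both the flat scan on the active segment and the HNSW results from sealed segments (all structures here are illustrative):

python
import heapq
import numpy as np

def search_segment(vectors: np.ndarray, ids: list[int],
                   query: np.ndarray, k: int) -> list[tuple[float, int]]:
    """Brute-force cosine scores over one segment; a sealed segment would use its HNSW index."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    top = np.argsort(-scores)[:k]
    return [(float(scores[i]), ids[i]) for i in top]

def search_collection(segments, query: np.ndarray, k: int = 10):
    """Search every segment, then merge the per-segment candidates into a global top-k."""
    candidates = []
    for vectors, ids in segments:
        candidates.extend(search_segment(vectors, ids, query, k))
    return heapq.nlargest(k, candidates)   # highest similarity first

rng = np.random.default_rng(0)
sealed = (rng.standard_normal((1_000, 8)).astype(np.float32), list(range(0, 1_000)))
active = (rng.standard_normal((50, 8)).astype(np.float32), list(range(1_000, 1_050)))
print(search_collection([sealed, active], rng.standard_normal(8).astype(np.float32), k=5))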

According to Chakra Dev's vector database introduction, Qdrant is designed for production use with consistency guarantees and handles frequent updates well through its segment-based update model.

The Index Layer: ANN Algorithms in Production

The index is where most of the interesting engineering in a vector database happens. It is the component that converts an O(n) brute-force search into an O(log n) or O(√n) approximate search.

The two dominant algorithms in production deployments are HNSW for high-recall applications and IVF for memory-constrained or billion-scale applications. Both are covered in detail in the cluster articles, but the internal mechanics are worth introducing here because they determine the performance characteristics of everything built on top.

HNSW: The Default for Most Deployments

HNSW (Hierarchical Navigable Small World) builds a layered proximity graph during index construction. Each vector becomes a node. Nodes are connected to their nearest neighbors in the layer. The graph has multiple layers: higher layers have fewer nodes and longer-range connections for fast navigation; the bottom layer contains all nodes with short-range connections for precise search.

plaintext
HNSW graph structure (4 layers, 8 vectors):

Layer 3 (2 nodes):    [A] ——————————— [F]

Layer 2 (4 nodes):    [A] —— [B]      [E] —— [F]

Layer 1 (6 nodes):    [A] — [B] — [C]   [D] — [E] — [F]

Layer 0 (8 nodes):    [A]-[B]-[C]-[X]-[D]-[E]-[F]-[G]
                      (all nodes, dense connections)

At query time, the algorithm enters at the top layer through the entry point node, greedily moves toward the query vector by following edges to closer neighbors, then descends one layer at a time until it reaches layer 0 where it performs a precise local search. The result is a set of candidate IDs sorted by distance to the query vector.

The HNSW index in Qdrant is stored as a serialized graph structure on disk and memory-mapped at runtime. The graph edges are adjacency lists: each node stores a list of up to M neighbor IDs, where M is the m parameter (default 16). For one million vectors at M=16, the graph edges alone occupy approximately 1,000,000 × 16 × 8 bytes = 128 MB.
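
A drastically simplified sketch of that descent on a hand-built two-layer graph. Real HNSW maintains an ef-sized candidate heap and many more layers; this only shows the descend-then-refine shape of the walk, and all graph data below is made up.

python
import numpy as np

rng = np.random.default_rng(7)
vectors = {name: rng.standard_normal(4).astype(np.float32) for name in "ABCDEFG"}

layer0 = {"A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B", "E"],
          "D": ["B", "E", "F"], "E": ["C", "D", "G"], "F": ["D", "G"], "G": ["E", "F"]}
layer1 = {"A": ["E"], "E": ["A", "G"], "G": ["E"]}   # sparse upper layer, long-range hops

def greedy_search(query: np.ndarray, entry: str = "A") -> str:
    def dist(node: str) -> float:
        return float(np.linalg.norm(vectors[node] - query))

    current = entry
    for layer in (layer1, layer0):          # descend from the top layer to layer 0
        improved = True
        while improved:                     # greedy walk: hop to any closer neighbor
            improved = False
            for neighbor in layer.get(current, []):
                if dist(neighbor) < dist(current):
                    current, improved = neighbor, True
    return current                          # closest node reached by the greedy walk

print(greedy_search(rng.standard_normal(4).astype(np.float32)))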

The full HNSW algorithm article covers the construction process, parameter tuning, and the mathematical guarantees on recall.

IVF: Cluster-Based Partitioning for Memory Efficiency

IVF (Inverted File Index) takes a different approach. During index construction, k-means clustering partitions all vectors into nlist clusters. Each cluster has a centroid. Every vector is assigned to its nearest centroid and stored in that centroid's inverted list.

plaintext
IVF structure (nlist=4 clusters, 12 vectors):

Centroid 0: [0.3, 0.7, ...]   → inverted list: [v1, v3, v7, v11]
Centroid 1: [0.8, 0.2, ...]   → inverted list: [v2, v5, v9]
Centroid 2: [-0.4, 0.6, ...]  → inverted list: [v4, v6, v8]
Centroid 3: [-0.1, -0.8, ...] → inverted list: [v0, v10, v12]

At query time, the algorithm computes the distance from the query to each centroid (only nlist comparisons, typically a few hundred to a few thousand), selects the top nprobe closest centroids, and exhaustively searches only the vectors in those clusters. This reduces the comparison count from O(n) to approximately O(nlist + n/nlist × nprobe).

The IVF index requires a training phase before vectors can be inserted. This training step runs k-means to learn the cluster centroids. Vectors inserted after training are assigned to the nearest existing centroid. The tradeoff: IVF has lower memory overhead than HNSW (no graph edges to store) but lower recall when the true nearest neighbor happens to be in a cluster whose centroid was not selected.
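
A numpy sketch of the probe step under those assumptions, with random vectors standing in for trained k-means centroids and deliberately small sizes:

python
import numpy as np

dim, n, nlist, nprobe = 64, 2_000, 50, 5
rng = np.random.default_rng(0)
vectors = rng.standard_normal((n, dim)).astype(np.float32)

# "Training": random vectors stand in for centroids learned by k-means
centroids = rng.standard_normal((nlist, dim)).astype(np.float32)

# Assignment: every vector joins the inverted list of its nearest centroid
assignments = np.argmin(
    np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2), axis=1)
inverted_lists = {c: np.where(assignments == c)[0] for c in range(nlist)}

def ivf_search(query: np.ndarray, k: int = 10):
    # Step 1: nlist comparisons to rank centroids, keep the nprobe closest
    probed = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    # Step 2: exhaustive scan over only the probed inverted lists
    candidate_ids = np.concatenate([inverted_lists[c] for c in probed])
    dists = np.linalg.norm(vectors[candidate_ids] - query, axis=1)
    order = np.argsort(dists)[:k]
    return candidate_ids[order], dists[order]

ids, dists = ivf_search(rng.standard_normal(dim).astype(np.float32))
print(ids)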

The IVF index deep dive article covers parameter selection, the training phase, recall vs speed tradeoffs, and when IVF beats HNSW.

Product Quantization: Compressing the Storage Footprint

At billion-scale, even efficient float32 storage becomes prohibitive. A billion 1536-dimensional float32 vectors occupy 1B × 6,144 bytes, roughly 6.1 terabytes, far too much to keep in memory. Product Quantization (PQ) compresses vectors by 32 to 64 times at the cost of moderate recall degradation.

PQ splits each vector into M subvectors, clusters each subspace independently into 256 centroids, and stores only the centroid index (one byte) per subvector rather than the raw float values. A 1536-dimensional vector compressed with M=96 subvectors becomes 96 bytes instead of 6,144 bytes, a 64x compression.

python
import numpy as np
import faiss

d      = 1536   # vector dimension
M      = 96     # number of subvectors (d must be divisible by M)
nbits  = 8      # bits per subvector centroid (256 centroids)

# Create IVF-PQ index: cluster-based search + PQ compression
nlist    = 1000
quantizer = faiss.IndexFlatIP(d)
index     = faiss.IndexIVFPQ(quantizer, d, nlist, M, nbits)

# Must train on a representative sample before adding vectors
vectors = np.random.randn(100_000, d).astype(np.float32)
faiss.normalize_L2(vectors)
index.train(vectors)
index.add(vectors)

# Memory used per vector after PQ:
bytes_per_vector_raw = d * 4                  # 6144 bytes (float32)
bytes_per_vector_pq  = M * (nbits // 8)      # 96 bytes
compression_ratio    = bytes_per_vector_raw / bytes_per_vector_pq
print(f"Compression ratio: {compression_ratio:.0f}x")
# Compression ratio: 64x

The Product Quantization article covers the subspace decomposition in detail, how the codebooks are learned, and how asymmetric distance computation preserves recall during compressed search.

Similarity Metrics: What "Nearest" Means

Before any search can run, the database must know what distance function to use for comparing vectors. The choice of metric is fixed at collection creation time and cannot be changed without rebuilding the index.

python
from qdrant_client import QdrantClient, models

client = QdrantClient(host="localhost", port=6333)

# Distance metric is set at collection creation time
client.create_collection(
    collection_name="text-search",
    vectors_config=models.VectorParams(
        size=1536,
        distance=models.Distance.COSINE,   # fixed for this collection
    ),
)

Three metrics appear across every major vector database:

Cosine similarity measures the angle between two vectors, ignoring their magnitude. It ranges from -1 (opposite directions) to 1 (identical direction). For normalized vectors (L2 norm = 1), cosine similarity is mathematically equivalent to the dot product, which is faster to compute because it skips the norm division. Text embeddings are almost always compared using cosine similarity because the encoding of meaning is directional: two documents about the same topic point in similar directions regardless of their lengths.

Euclidean distance (L2 distance) measures the straight-line distance between two points in vector space. It is sensitive to vector magnitude. Two vectors pointing in the same direction but with very different lengths will have a large Euclidean distance even though their cosine similarity is high. Euclidean distance is the right choice when the magnitude of the embedding carries meaningful information, as in some image and audio embedding models.

Dot product computes the sum of element-wise products. Without normalization, dot product is sensitive to both direction and magnitude. With normalization to unit length, it becomes identical to cosine similarity. Maximum Inner Product Search (MIPS) using the raw dot product is common in recommendation systems where the magnitude of a user's preference vector carries signal.

python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b))

a = np.array([1.0, 0.0, 0.0])
b = np.array([0.9, 0.1, 0.0])
c = np.array([3.6, 0.4, 0.0])   # same direction as b, 4x longer

print(f"cosine(b, c) = {cosine_similarity(b, c):.4f}")   # 1.0000 — same direction
print(f"euclid(b, c) = {euclidean_distance(b, c):.4f}")  # 2.7019 — different magnitude
print(f"dot(b, c)    = {dot_product(b, c):.4f}")         # 3.28   — scale-sensitive

The cosine similarity vs Euclidean distance article covers the mathematical relationship between them, when each is appropriate, and what happens when you use the wrong metric for your embedding model.

The Query Lifecycle: From API Call to Ranked Results

A single search request triggers a precisely ordered sequence of operations across the API, index, and storage layers. Understanding this sequence is what allows you to instrument the right places when something is slow.

plaintext
Step 1: Query arrives at API layer
        POST /collections/knowledge-base/points/search
        {
          "vector": [0.41, -0.22, 0.88, ...],
          "filter": {"must": [{"key": "category", "match": {"value": "support"}}]},
          "limit": 10,
          "with_payload": true
        }

Step 2: Filter parsing and strategy selection
        Parse filter → {"category": "support"}
        Estimate selectivity: how many vectors match this filter?
        If >5% of collection: use post-filtering
        If <5% of collection: use pre-filtering

Step 3: ANN search (if post-filtering)
        Retrieve top-100 candidates from HNSW across all segments
        Segments searched in parallel across CPU cores
        Results merged and sorted by similarity score

Step 4: Metadata filter applied (post-filtering)
        For each candidate ID: fetch payload from metadata store
        Discard candidates where category ≠ "support"
        If fewer than 10 remain: optionally expand search

Step 5: Score normalization and result assembly
        Apply score threshold (if set)
        Select top-10 by score
        Fetch full payloads for those 10 IDs

Step 6: Response serialization
        Serialize result list to JSON (REST) or protobuf (gRPC)
        Return to client

The full lifecycle is covered step by step in the vector query lifecycle article.

Metadata Filtering: The Engineering Challenge

Metadata filtering is the capability that separates a vector database from a standalone vector index. It allows search results to be restricted to records matching structured criteria — "return the 10 most similar documents, but only from documents created in the last 30 days, in the engineering category, and not deleted."

The engineering challenge is that metadata filtering and ANN search do not compose naturally. An HNSW index is built over all vectors. If you need to restrict search to 5 percent of them, the index does not know which 5 percent are eligible. There are two approaches and each has failure modes.

Post-Filtering

Run ANN search to get top-K candidates, then apply the metadata filter to discard non-matching results.

python
# Pseudocode: post-filtering
candidates = hnsw_search(query_vec, k=100)          # get 100 candidates from index
filtered   = [c for c in candidates if c.category == "support"]
return filtered[:10]                                 # return top 10 matching ones

Post-filtering is fast because the ANN index runs at full efficiency. The failure mode: if the filter is highly selective (only 2 percent of vectors match), retrieving 100 candidates from the ANN index may return only 2 that pass the filter. The result set has 2 items instead of 10.

Pre-Filtering

Apply the metadata filter first to get the set of eligible IDs, then run ANN search restricted to those IDs.

python
# Pseudocode: pre-filtering
eligible_ids = metadata_store.query(category="support")   # e.g. 50,000 IDs
result = hnsw_search(query_vec, k=10, allowed_ids=eligible_ids)

Pre-filtering always returns exactly K results (assuming K eligible vectors exist). The failure mode: restricting the ANN search to a specific ID subset breaks the HNSW graph traversal. The graph was built over the full collection. When traversal is forced to only visit eligible nodes, the graph structure becomes suboptimal and recall degrades.

Adaptive Filtering

Production systems like Qdrant and Weaviate implement adaptive strategies that choose between pre and post-filtering based on the estimated selectivity of the filter.

plaintext
Qdrant adaptive filtering logic (simplified):

estimated_match_fraction = metadata_store.estimate(filter)

if estimated_match_fraction > 0.05:   # more than 5% match → post-filter
    candidates = hnsw.search(query, k=10 * oversampling_factor)
    return filter_and_truncate(candidates, filter, k=10)

else:                                  # fewer than 5% match → pre-filter
    eligible = metadata_store.query(filter)
    return hnsw.search_with_allowlist(query, k=10, allowlist=eligible)

The transition between strategies happens automatically, based on a cardinality estimate computed from index statistics on the metadata store. According to Pinecone's vector database guide, metadata storage and filtering let users query the database with additional metadata filters for finer-grained results.

Exact vs Approximate Search

Vector databases also offer a flat (exact) search mode alongside ANN search. Understanding when to use each requires understanding what you are trading.

python
from qdrant_client import QdrantClient, models

client = QdrantClient(host="localhost", port=6333)

# Exact (brute-force) search — compares against every vector
exact_results = client.search(
    collection_name="knowledge-base",
    query_vector=query_vec,
    limit=10,
    search_params=models.SearchParams(exact=True),   # bypass ANN index
)

# Approximate search (default) — uses HNSW
approx_results = client.search(
    collection_name="knowledge-base",
    query_vector=query_vec,
    limit=10,
    search_params=models.SearchParams(
        hnsw_ef=128,   # expand search frontier (higher = better recall, slower)
    ),
)

Exact search is the ground truth. It returns the true nearest neighbors regardless of what the ANN index would return. It is O(n) and impractical at millions of vectors under production latency requirements. Its value is as a reference for measuring ANN recall: run both, compare the result sets, and compute the overlap fraction.

Approximate search trades a small bounded accuracy loss for a massive speed improvement. A well-tuned HNSW index with ef=128 achieves 95 to 99 percent recall compared to exact search while being 100 to 1000 times faster. The ef parameter controls the size of the candidate set maintained during graph traversal: larger ef means more candidates explored, higher recall, higher latency.
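
Continuing the snippet above, recall is the overlap fraction between the two result sets. This assumes the exact_results and approx_results lists from the previous example; Qdrant returns scored points that carry an id field.

python
exact_ids  = {point.id for point in exact_results}
approx_ids = {point.id for point in approx_results}

recall_at_10 = len(exact_ids & approx_ids) / len(exact_ids)
print(f"recall@10 = {recall_at_10:.2%}")
# A well-tuned HNSW index typically lands between 95% and 99% on this measure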

The exact vs approximate nearest neighbor article covers the mathematical recall guarantees, how to measure recall on your own data, and how to set ef for a target recall threshold.

Distributed Architecture: Scaling Beyond One Node

A single vector database node on modern hardware handles tens of millions of vectors comfortably. Beyond that threshold, or when write throughput or query concurrency exceeds single-node capacity, horizontal distribution is required.

Sharding

The collection is split into shards. Each shard holds a subset of the vectors and its own ANN index. Sharding is typically done by hash of the vector ID so that the assignment is deterministic and balanced across nodes.

plaintext
Collection "knowledge-base" (100M vectors, 4 shards):

Shard 0 (Node A):  vectors where ID % 4 == 0  → 25M vectors + HNSW index
Shard 1 (Node B):  vectors where ID % 4 == 1  → 25M vectors + HNSW index
Shard 2 (Node C):  vectors where ID % 4 == 2  → 25M vectors + HNSW index
Shard 3 (Node D):  vectors where ID % 4 == 3  → 25M vectors + HNSW index

Query coordinator:
  1. Broadcast query to all 4 shards
  2. Collect top-K results from each shard
  3. Merge 4 × K results → global top-K
  4. Return final top-K to client

The merge step at the coordinator is where the global top-K is assembled. Each shard returns its local top-K sorted by score. The coordinator merges these K sorted lists into one global ranking and selects the top-K from the combined list.
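
A sketch of that coordinator merge, assuming each shard has already returned its local results sorted by descending score (the scores and IDs below are made up):

python
import heapq
from itertools import islice

# Hypothetical per-shard results: (score, id), each list sorted by descending score
shard_results = [
    [(0.93, 17), (0.88, 4), (0.71, 52)],      # shard 0
    [(0.95, 101), (0.64, 130)],               # shard 1
    [(0.90, 203), (0.89, 222), (0.80, 250)],  # shard 2
    [(0.85, 307)],                            # shard 3
]

k = 5
# k-way merge of the already-sorted shard lists, then take the global top-k
merged = heapq.merge(*shard_results, key=lambda r: r[0], reverse=True)
global_top_k = list(islice(merged, k))
print(global_top_k)
# [(0.95, 101), (0.93, 17), (0.9, 203), (0.89, 222), (0.88, 4)]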

Replication

Each shard is replicated across multiple nodes for fault tolerance. Reads can be served from any replica. Writes must be propagated to all replicas before the write is acknowledged (synchronous replication) or propagated asynchronously after acknowledgment (asynchronous replication). The choice determines whether the system is CP (consistent and partition-tolerant) or AP (available and partition-tolerant) in CAP theorem terms.

According to WEKA's architecture guide, vector databases scale with increasing data volume and user demand through parallel and distributed processing, and serverless vector database architectures further optimize cost at scale.

Milvus: Disaggregated Architecture

Milvus takes disaggregation further by separating storage, indexing, and query responsibilities into independent services:

plaintext
Milvus disaggregated architecture:

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Query Nodes │    │  Data Nodes │    │ Index Nodes  │
│  (stateless) │    │ (ingestion) │    │(index build) │
└──────┬──────┘    └──────┬──────┘    └──────┬──────┘
       │                  │                   │
       └──────────────────┼───────────────────┘

              ┌───────────▼───────────┐
              │   Object Storage      │
              │  (S3 / MinIO / GCS)   │
              │  segments, indexes    │
              └───────────────────────┘

Query nodes are stateless. They load segment data from object storage on demand, serve queries, and can be scaled horizontally without affecting data durability. Index nodes build HNSW indexes over sealed segments asynchronously. Data nodes receive write traffic and manage the WAL. Object storage provides the durable, infinitely scalable backing store.

This disaggregated model allows each component to scale independently based on its bottleneck. If queries are slow, add query nodes. If ingestion is falling behind, add data nodes. If index build is lagging, add index nodes.

CRUD Operations: Updates and Deletes

Vector indexes are fundamentally append-friendly structures. Inserting a new vector into an HNSW graph requires adding a new node with edges to its nearest neighbors, which is relatively efficient. Deleting or updating a vector is not.

An HNSW graph node cannot be cleanly removed without potentially disconnecting the graph and degrading recall for other nodes. Most vector databases solve this with soft deletes: a vector is marked as deleted in a bitmap, and the ANN search skips deleted IDs when assembling results.

python
# Qdrant delete operation — soft delete under the hood
client.delete(
    collection_name="knowledge-base",
    points_selector=models.PointIdsList(points=[1, 2, 3]),
)

# What happens internally:
# 1. IDs [1, 2, 3] added to the deleted_ids bitmap for their segment
# 2. ANN search results containing these IDs are filtered out
# 3. The actual vector bytes remain in storage
# 4. During compaction: deleted vectors are physically removed
#    and the segment's HNSW index is rebuilt without them

Updates are implemented as a delete followed by an insert. The old vector is soft-deleted and a new vector (with potentially different float values and metadata) is inserted into the active segment. This means an "updated" record temporarily appears in two places: the soft-deleted old version in its original segment, and the new version in the active segment. Compaction eventually consolidates them.
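
A toy model of that delete-then-insert sequence. The dictionaries below stand in for per-segment storage and the soft-delete bitmap; a real engine resolves the duplicate at compaction.

python
# Illustrative in-memory stand-ins for two segments and their soft-delete sets
sealed_segment = {
    "vectors": {1: [0.41, -0.22, 0.88], 2: [-0.11, 0.73, -0.44]},
    "deleted": set(),   # plays the role of the per-segment deleted.bitmap
}
active_segment = {"vectors": {}, "deleted": set()}

def update_point(point_id: int, new_vector: list[float]) -> None:
    # Step 1: soft-delete the old version wherever it currently lives
    if point_id in sealed_segment["vectors"]:
        sealed_segment["deleted"].add(point_id)
    # Step 2: insert the new version into the active segment
    active_segment["vectors"][point_id] = new_vector

update_point(1, [0.39, -0.25, 0.91])

# Until compaction runs, ID 1 exists twice: soft-deleted in the sealed segment
# and live in the active segment. Search skips IDs in a segment's deleted set.
print(sealed_segment["deleted"])          # {1}
print(active_segment["vectors"].keys())   # dict_keys([1])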

Vector Indexing as a Discipline

Vector indexing is the engineering discipline that sits between raw float storage and fast ANN search. It covers the data structures, algorithms, and parameter choices that determine how quickly a vector database can find the neighbors of a query vector.

The three choices every system operator makes:

Index type. HNSW for high recall at moderate scale. IVF for memory efficiency or very large scale. Flat (exact) for correctness testing or small collections. IVF-PQ for billion-scale with memory constraints.

Index parameters. For HNSW: m (graph connectivity, default 16) and ef_construction (build quality, default 64). Higher values produce better recall at the cost of longer build times and more memory. For IVF: nlist (cluster count, typically sqrt(n)) and nprobe (clusters searched per query).

Distance metric. Cosine similarity for text embeddings. Euclidean for image and audio models where magnitude carries signal. Dot product for recommendation systems using matrix factorization embeddings.
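
These choices surface directly in collection configuration. A Qdrant example with illustrative values (IVF-based engines expose nlist and nprobe analogously):

python
from qdrant_client import QdrantClient, models

client = QdrantClient(host="localhost", port=6333)

client.create_collection(
    collection_name="knowledge-base",
    vectors_config=models.VectorParams(
        size=1536,
        distance=models.Distance.COSINE,   # metric, fixed at creation time
    ),
    hnsw_config=models.HnswConfigDiff(
        m=16,               # graph connectivity: neighbors per node
        ef_construct=128,   # candidate list size during index construction
    ),
)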

The interaction between index type, parameters, and metric determines the achievable recall-latency curve for your specific collection. Measuring that curve on your actual data and workload is the only reliable way to configure a vector database for production.

Connecting the Cluster Articles

This pillar provides the architectural frame. Each cluster article goes deep on one component:

The similarity search mechanics article covers how the distance computation inside an ANN search actually runs, from the first graph traversal step to the final ranked list.

The cosine similarity vs Euclidean distance article works through the mathematics of both metrics with geometric intuition and code, and maps each to the embedding models where it is appropriate.

The exact vs approximate nearest neighbor article covers the precision-speed tradeoff formally, how to measure recall, and how ANN algorithms bound their approximation error.

The HNSW algorithm article covers the full graph construction and traversal algorithm with diagrams, parameter tuning, and the small-world network theory that explains why it works.

The IVF index article covers the k-means training phase, cluster assignment, the nprobe tradeoff, and when IVF beats HNSW. It cross-links with the HNSW article for the direct comparison.

The Product Quantization article covers the subspace decomposition algorithm, codebook training, asymmetric distance computation, and how PQ combines with IVF for billion-scale deployment.

The vector indexing article covers the indexing discipline as a whole: why indexes exist, what properties a good vector index must have, and how index quality is measured.

The vector query lifecycle article traces a single search request through every internal component from API parsing to response serialization, providing the observability frame for debugging production issues.

Summary

A vector database is not a vector index with a REST API bolted on. It is a complete data management system with at least four layers of engineering: the API and filter parsing layer, the ANN index layer (HNSW, IVF, flat), the storage layer (binary vector store, metadata key-value store, WAL, segments), and the infrastructure layer (replication, sharding, compaction).

Every write goes through the WAL before acknowledgment, ensuring crash recovery. Vectors are stored as contiguous float32 binary arrays. The segment model allows incremental ingestion without continuous index rebuilding. Deletes are soft to avoid corrupting the HNSW graph and are cleaned up by compaction.

Every search traverses the ANN index across all segments in parallel, applies metadata filtering according to an adaptive strategy based on filter selectivity, assembles the global top-K across shards if the collection is distributed, and fetches full payloads from the metadata store before returning to the client.

The cluster articles in this series go deep on each component. The Vector Database Fundamentals series covers the foundational concepts that this series builds on.


Sources and Further Reading

  1. Pinecone. What Is a Vector Database and How Does It Work? pinecone.io/learn/vector-database
  2. Zilliz. What Is a Vector Database and How Does It Work? zilliz.com/learn/what-is-vector-database
  3. Milvus. What Exactly Is a Vector Database? milvus.io/blog/what-is-a-vector-database.md
  4. Qdrant. What Is a Vector Database? qdrant.tech/articles/what-is-a-vector-database
  5. IBM. What Is a Vector Database? ibm.com/think/topics/vector-database
  6. Databricks. What Is a Vector Database? databricks.com/blog/what-is-vector-database
  7. WEKA. Vector Database: Everything You Need to Know. weka.io/learn/guide/ai-ml/vector-database
  8. Yugabyte. What Is a Vector Database? Examples, Use Cases. yugabyte.com/blog/what-is-a-vector-database
  9. Pure Storage. Vector Databases and Storage for Generative AI. blog.purestorage.com/purely-technical/vector-database-and-storage
  10. arXiv. A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge. arxiv.org/abs/2310.11703
  11. arXiv. Vector Search for the Future: From Memory-Resident to Cloud-Native Architectures. arxiv.org/abs/2601.01937
  12. Weaviate. Vector Database Architecture. weaviate.io/developers/weaviate/concepts/vector-index
  13. Murf AI. 7 Best Vector Databases in 2026. murf.ai/blog/best-vector-databases
