What Is Semantic Search? How It Works Step by Step
A research-backed, step-by-step guide to semantic search. Learn how it differs from keyword search, how the full pipeline works from chunking to reranking, what makes it fail, and how to build a working semantic search system with Python code examples.
A user types "why does my app crash when the network is slow" into a support portal. The document that answers their question is titled "Handling connection timeouts in mobile applications." It contains the words "timeout," "mobile," and "connection" but not "crash," "app," or "slow."
A keyword search returns nothing useful. A semantic search returns the exact document as the top result.
That gap is what semantic search solves. It is not a minor improvement over keyword search. It is a different model of what retrieval should do: find content that satisfies the intent of the query, not content that happens to share words with it.
This article covers what semantic search is, how the entire pipeline works from document ingestion to ranked results, where it fails and why, and how to build a working semantic search system in Python. It sits at the center of the Vector Database Fundamentals series, connecting the foundational concepts in vectors, embeddings, and dense vs sparse representations to the practical architecture of vector databases and retrieval augmented generation.
What Semantic Search Actually Is
Semantic search is a retrieval method that finds content based on meaning and intent rather than lexical overlap. It achieves this by representing both the query and the documents as vectors in a shared embedding space, then returning documents whose vectors are closest to the query vector.
According to Google Cloud's semantic search documentation, semantic search is a data searching technique that focuses on understanding the contextual meaning and intent behind a user's search query, rather than only matching keywords.
The key word is contextual. Two sentences can share zero words and still describe the same thing. Semantic search captures that relationship because both sentences, when passed through the same embedding model, land at nearby coordinates in the vector space.
Query: "why does my app crash when the network is slow"
Document: "Handling connection timeouts in mobile applications"
Keyword overlap: 0 shared terms
Cosine similarity: 0.78 (highly similar)

That similarity score comes from the geometry of the learned embedding space, not from any explicit synonym mapping. The model learned from training data that "crash," "app," and "slow network" appear in the same context as "timeout," "mobile," and "connection."
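The similarity number is just the cosine of the angle between the two vectors. The following toy sketch uses hand-picked 3-dimensional vectors to show the computation; real embeddings have hundreds of dimensions, and the 0.78 above comes from an actual model, not these made-up values.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional stand-ins for real embeddings.
query_vec = [0.9, 0.3, 0.1]    # "app crash on slow network"
doc_vec = [0.8, 0.4, 0.2]      # "connection timeouts in mobile"
unrelated = [0.1, -0.2, 0.95]  # an unrelated document

print(round(cosine_similarity(query_vec, doc_vec), 3))   # 0.984
print(round(cosine_similarity(query_vec, unrelated), 3))  # 0.134
```

Vectors pointing in similar directions score near 1.0; unrelated directions score near 0.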
The Full Pipeline: From Raw Documents to Search Results
Semantic search is not a single operation. It is a pipeline with distinct offline and online phases. Understanding each step is what separates teams who build reliable systems from teams who struggle with poor retrieval quality.
Phase 1: Offline Indexing
This phase runs once (and reruns when new documents arrive). Its output is a populated vector database ready to receive queries.
Step 1.1: Document Ingestion and Cleaning
Raw documents arrive in various formats: PDFs, HTML pages, Markdown files, database rows, API responses. Before embedding, each document needs to be extracted into clean text.
```python
import PyPDF2
import re

def extract_text_from_pdf(pdf_path: str) -> str:
    with open(pdf_path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        text = " ".join(page.extract_text() or "" for page in reader.pages)
    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

raw_text = extract_text_from_pdf("product-manual.pdf")
```

According to Parallel.ai's semantic search implementation guide, your semantic search system will only be as accurate as the data you index. Removing duplicates, fixing formatting issues, and filtering out low-quality content is not optional housekeeping. It directly determines retrieval quality.
Step 1.2: Chunking
A 50-page document cannot be embedded as a single vector. The embedding model has a token limit (typically 512 to 8192 tokens depending on the model), and even if the document fits, a single vector for 50 pages would average together so many topics that the resulting embedding would be too diffuse to retrieve precisely.
Chunking splits documents into smaller segments, each of which gets its own embedding. The goal is segments that are small enough to embed precisely but large enough to contain a complete idea.
According to Elastic's chunking strategies documentation, LLMs have a limited number of tokens they can take as context, and it is much more efficient to send only relevant chunks rather than a whole document.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_document(text: str, chunk_size: int = 400, overlap: int = 80) -> list[str]:
    """
    Split text into overlapping chunks.
    chunk_size: target number of characters per chunk
    overlap: characters shared between adjacent chunks
             (prevents answers from being cut at a boundary)
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    return splitter.split_text(text)

chunks = chunk_document(raw_text)
print(f"Document split into {len(chunks)} chunks")
# Document split into 47 chunks
```

The overlap is important. If a key sentence straddles a chunk boundary, having it appear in both the preceding and following chunk ensures at least one embedding captures it intact.
Chunking strategies in practice:
Fixed-size chunking splits on character or token count with overlap. It is fast and predictable but may cut sentences mid-thought.
Sentence chunking splits at sentence boundaries. It preserves semantic units but produces variable-length chunks.
Semantic chunking uses an embedding model to detect where the topic shifts and splits at those boundaries. It produces the most coherent chunks but requires embedding computation during indexing.
For most teams starting out, recursive character splitting with 300 to 500 token chunks and 15 percent overlap is the right default. Tuning chunk size is one of the highest-leverage optimization steps once you have baseline retrieval working.
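Sentence chunking from the list above can be sketched in a few lines of plain Python. The regex boundary detection here is deliberately naive and stands in for a proper sentence tokenizer such as nltk or spaCy; the greedy packing logic is the part that matters.

```python
import re

def sentence_chunk(text: str, max_chars: int = 400) -> list[str]:
    """Greedy sentence chunking: split at sentence boundaries, then pack
    consecutive sentences into chunks of at most max_chars characters,
    so no sentence is ever cut mid-thought."""
    # Naive boundary detection: split after ., !, or ? followed by space.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)   # current chunk is full, start a new one
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

text = "Timeouts occur on slow networks. Set a retry policy. " * 20
chunks = sentence_chunk(text)
print(len(chunks), max(len(c) for c in chunks))
```

Unlike fixed-size splitting, every chunk here ends on a sentence boundary, at the cost of variable chunk lengths.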
Step 1.3: Embedding Each Chunk
Each chunk is passed through an embedding model to produce a dense vector. The choice of embedding model determines what "similarity" means for your system.
```python
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_chunks(chunks: list[str], batch_size: int = 64) -> list[list[float]]:
    embeddings = []
    for i in tqdm(range(0, len(chunks), batch_size)):
        batch = chunks[i : i + batch_size]
        batch_embeddings = model.encode(batch, normalize_embeddings=True)
        embeddings.extend(batch_embeddings.tolist())
    return embeddings

chunk_embeddings = embed_chunks(chunks)
print(f"Produced {len(chunk_embeddings)} embeddings of dimension {len(chunk_embeddings[0])}")
# Produced 47 embeddings of dimension 384
```

Batching is essential for performance. Embedding one chunk at a time is roughly 10 to 30 times slower than batching due to GPU underutilization.
The normalize_embeddings=True flag normalizes each vector to unit length (L2 norm = 1). This makes cosine similarity equivalent to a dot product, which is faster to compute and is the expected input format for most ANN indexes.
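A quick way to convince yourself of that equivalence, as a standalone numpy check rather than part of the pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384)
b = rng.normal(size=384)

# Full cosine similarity on the raw vectors
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize to unit length first; then a plain dot product gives the same value
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot = a_unit @ b_unit

assert abs(cosine - dot) < 1e-9
print(f"cosine={cosine:.6f} dot={dot:.6f}")
```

Because normalization happens once at indexing time, every subsequent query saves the two norm computations per comparison.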
Step 1.4: Storing in a Vector Database
Each embedding is stored in a vector database alongside its original text and metadata. The database builds an ANN index over the stored vectors to enable fast similarity search at query time.
```python
from qdrant_client import QdrantClient, models
import uuid

client = QdrantClient(":memory:")  # Use host/port for persistent deployments

COLLECTION = "product-docs"
DIMENSIONS = 384  # matches all-MiniLM-L6-v2 output

client.create_collection(
    collection_name=COLLECTION,
    vectors_config=models.VectorParams(
        size=DIMENSIONS,
        distance=models.Distance.COSINE,
    ),
)

points = [
    models.PointStruct(
        id=str(uuid.uuid4()),
        vector=embedding,
        payload={
            "text": chunk,
            "source": "product-manual.pdf",
            "chunk_index": i,
        },
    )
    for i, (chunk, embedding) in enumerate(zip(chunks, chunk_embeddings))
]

client.upsert(collection_name=COLLECTION, points=points)
print(f"Indexed {len(points)} chunks into Qdrant")
```

The metadata payload (source file, chunk index, timestamps, document category) is what enables filtered search later. Without metadata, every query searches the entire index. With it, you can restrict searches to a specific document, date range, or category.
Phase 2: Online Query
This phase runs on every user query. Speed matters here. Users expect sub-second responses.
Step 2.1: Query Embedding
The user's query is passed through the same embedding model used during indexing. This is non-negotiable. Embeddings from different models live in incompatible spaces and cannot be compared.
```python
def embed_query(query: str) -> list[float]:
    vector = model.encode([query], normalize_embeddings=True)
    return vector[0].tolist()

query = "why does my app crash when the network is slow"
query_vector = embed_query(query)
```

The query embedding is typically computed in under 10ms for a sentence-length input on a modern CPU.
Step 2.2: ANN Search
The query vector is sent to the vector database, which uses its ANN index to find the top-K most similar stored vectors without comparing against every document.
```python
def semantic_search(query: str, top_k: int = 5) -> list[dict]:
    query_vector = embed_query(query)
    results = client.search(
        collection_name=COLLECTION,
        query_vector=query_vector,
        limit=top_k,
        with_payload=True,
        score_threshold=0.4,  # filter out low-confidence results
    )
    return [
        {
            "score": hit.score,
            "text": hit.payload["text"],
            "source": hit.payload["source"],
            "chunk_index": hit.payload["chunk_index"],
        }
        for hit in results
    ]

results = semantic_search("why does my app crash when the network is slow")
for r in results:
    print(f"Score {r['score']:.4f}: {r['text'][:80]}...")
```

The ANN index (HNSW in Qdrant's case) is what makes this fast at scale. Without it, finding the top-K vectors in a collection of one million documents would require one million cosine similarity computations. HNSW reduces that to a few hundred comparisons by traversing a layered graph structure. The mechanics are covered in the why traditional indexes fail for vector search article.
Step 2.3: Reranking (Optional but High-Impact)
ANN search optimizes for speed. The initial results are highly relevant on average but not perfectly ordered. A reranker takes the top 20 to 100 candidates from the first stage and rescores them using a more powerful model that reads the query and each candidate together.
Bi-encoders (used in ANN search) process query and document independently and compare their outputs. Cross-encoders read the query and document together and output a single relevance score that captures their interaction directly.
According to ZeroEntropy's reranking guide, Databricks research shows reranking can improve retrieval quality by up to 48 percent. The three-stage pipeline of BM25 plus dense retrieval plus reranking maximizes both recall and precision.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_results(query: str, candidates: list[dict], top_n: int = 3) -> list[dict]:
    """
    Rerank retrieval candidates using a cross-encoder.
    Cross-encoders read query and document together for higher accuracy.
    """
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    for candidate, score in zip(candidates, scores):
        candidate["rerank_score"] = float(score)
    reranked = sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)
    return reranked[:top_n]

# Get 20 candidates from ANN search, rerank to top 3
candidates = semantic_search(query, top_k=20)
final_results = rerank_results(query, candidates, top_n=3)
for r in final_results:
    print(f"Rerank score {r['rerank_score']:.4f}: {r['text'][:80]}...")
```

Reranking adds 50 to 200ms of latency depending on the model size and number of candidates. For most RAG applications where generation latency already exceeds one second, this cost is negligible relative to the precision gain.
Semantic Search vs Keyword Search: The Honest Comparison
The two approaches fail in opposite directions. Understanding those failure modes is more useful than declaring one superior.
Property              | Keyword Search             | Semantic Search
----------------------+----------------------------+---------------------------
Match basis           | Exact word overlap         | Vector similarity
Handles synonyms      | No (without synonym lists) | Yes
Handles paraphrases   | No                         | Yes
Handles exact strings | Yes (precisely)            | Poorly
Handles rare terms    | Yes (exact match)          | Poorly (OOV terms)
Index type            | Inverted index             | ANN vector index
Interpretability      | High (shows matches)       | Low (opaque scores)
Setup complexity      | Low                        | Higher
Latency               | Sub-millisecond            | 10 to 100ms per query

According to Redis's semantic vs keyword search comparison, semantic search excels when users express intent in natural language, while keyword search delivers precision for exact identifiers and compliance scenarios.
According to Unstructured's retrieval analysis, keyword search fails when the document uses different terminology than the query, when spelling varies, or when key information lives in images or tables that were not extracted into indexable text.
For most production knowledge bases and RAG systems, neither method alone is adequate. The dense vs sparse vectors article covers how hybrid search combines both into a single retrieval pipeline that consistently outperforms either method.
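A common way to combine the two result lists is Reciprocal Rank Fusion (RRF). Here is a minimal sketch with hypothetical document ids; the k=60 constant is the conventional default from the original RRF formulation.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids. Each document scores
    sum(1 / (k + rank)) over the lists it appears in, so documents ranked
    well by multiple retrievers rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results: BM25 and dense retrieval disagree on order,
# but both rank "doc-b" highly, so fusion promotes it.
bm25_ranking = ["doc-a", "doc-b", "doc-c"]
dense_ranking = ["doc-b", "doc-d", "doc-a"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# ['doc-b', 'doc-a', 'doc-d', 'doc-c']
```

RRF needs only ranks, not raw scores, which sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.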
A Complete Semantic Search System
This is a full working example that ties every phase together: ingestion, chunking, embedding, indexing, search, and reranking.
```python
import uuid
from sentence_transformers import SentenceTransformer, CrossEncoder
from qdrant_client import QdrantClient, models

###############################################################################
# Configuration
###############################################################################
EMBED_MODEL = "all-MiniLM-L6-v2"
RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"
COLLECTION = "knowledge-base"
DIMENSIONS = 384
CHUNK_SIZE = 400
CHUNK_OVERLAP = 80

###############################################################################
# Models
###############################################################################
embedder = SentenceTransformer(EMBED_MODEL)
reranker = CrossEncoder(RERANK_MODEL)

db = QdrantClient(":memory:")
db.create_collection(
    collection_name=COLLECTION,
    vectors_config=models.VectorParams(size=DIMENSIONS, distance=models.Distance.COSINE),
)

###############################################################################
# Utility functions
###############################################################################
def chunk_text(text: str) -> list[str]:
    """Simple fixed-size chunking with overlap."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + CHUNK_SIZE, len(text))
        chunks.append(text[start:end].strip())
        start += CHUNK_SIZE - CHUNK_OVERLAP
    return [c for c in chunks if len(c) > 50]  # drop tiny trailing chunks

def embed(texts: list[str]) -> list[list[float]]:
    return embedder.encode(texts, normalize_embeddings=True).tolist()

###############################################################################
# Indexing
###############################################################################
def index_document(text: str, doc_id: str, metadata: dict | None = None) -> int:
    chunks = chunk_text(text)
    vectors = embed(chunks)
    points = [
        models.PointStruct(
            id=str(uuid.uuid4()),
            vector=vec,
            payload={"text": chunk, "doc_id": doc_id, **(metadata or {})},
        )
        for chunk, vec in zip(chunks, vectors)
    ]
    db.upsert(collection_name=COLLECTION, points=points)
    return len(points)

###############################################################################
# Search
###############################################################################
def search(query: str, top_k: int = 10, rerank_n: int = 3) -> list[dict]:
    query_vec = embed([query])[0]
    hits = db.search(
        collection_name=COLLECTION,
        query_vector=query_vec,
        limit=top_k,
        with_payload=True,
        score_threshold=0.30,
    )
    candidates = [
        {"text": h.payload["text"], "doc_id": h.payload["doc_id"], "score": h.score}
        for h in hits
    ]
    if not candidates:
        return []
    # Rerank candidates with the cross-encoder
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    for c, s in zip(candidates, scores):
        c["rerank_score"] = float(s)
    candidates.sort(key=lambda x: x["rerank_score"], reverse=True)
    return candidates[:rerank_n]

###############################################################################
# Demo
###############################################################################
documents = {
    "refund-policy": """
        Our refund policy allows customers to return any product within 30 days
        of purchase for a full refund. Digital downloads are non-refundable.
        Refund requests must include the original order number.
        Processing takes 5 to 7 business days after approval.
    """,
    "shipping-guide": """
        Standard shipping takes 3 to 5 business days.
        Express shipping delivers within 1 to 2 business days at additional cost.
        Free shipping is available on orders above 50 dollars.
        International orders may be subject to customs delays.
    """,
    "timeout-errors": """
        Connection timeout errors occur when the server does not respond
        within the expected time window. This commonly happens on slow or
        unreliable network connections. Set a reasonable timeout value in your
        HTTP client and implement retry logic with exponential backoff.
    """,
}

for doc_id, text in documents.items():
    n = index_document(text, doc_id)
    print(f"Indexed '{doc_id}': {n} chunks")
print()

queries = [
    "how do I get my money back from a purchase",
    "why does my app crash when the network is slow",
    "how long does delivery take",
]

for query in queries:
    results = search(query, top_k=10, rerank_n=1)
    print(f"Query: {query}")
    if results:
        print(f"  Best match ({results[0]['doc_id']}): {results[0]['text'][:80].strip()}...")
    print()

# Output:
# Query: how do I get my money back from a purchase
# Best match (refund-policy): Our refund policy allows customers to return any product...
#
# Query: why does my app crash when the network is slow
# Best match (timeout-errors): Connection timeout errors occur when the server does not...
#
# Query: how long does delivery take
# Best match (shipping-guide): Standard shipping takes 3 to 5 business days...
```

Every query succeeds without a single shared keyword between query and the relevant document. "Money back" matches "refund policy." "App crash" and "network slow" match "timeout errors." "How long does delivery take" matches "shipping." The embedding geometry is doing all the work.
What Makes Semantic Search Fail
Semantic search is not universally better than keyword search. It has specific, predictable failure modes.
Chunk size too large. If a chunk contains three unrelated topics, its embedding averages the semantics of all three. The resulting vector is pulled in multiple directions and may not be close to any specific query. Smaller, focused chunks produce sharper embeddings.
Exact identifiers. The string "PROD-SKU-7842X" is not in any embedding model's training data. The model assigns it an embedding near similar-looking strings, which may be completely wrong. BM25 handles this correctly because it matches exact tokens. This is the core argument for hybrid search.
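For contrast, a minimal inverted index shows why exact-token retrieval handles such identifiers trivially. This is a toy lookup, not a real BM25 implementation; there is no term weighting, only token membership.

```python
from collections import defaultdict

# Minimal inverted index: token -> set of document ids. An exact string
# like "PROD-SKU-7842X" is just another token, so lookup is trivial.
docs = {
    "doc-1": "Replacement parts for PROD-SKU-7842X ship within two days",
    "doc-2": "General troubleshooting steps for all product lines",
}

index: dict[str, set[str]] = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)

print(index["prod-sku-7842x"])  # {'doc-1'}: an exact hit, no embedding needed
```

The identifier either matches or it does not; there is no embedding geometry to get wrong.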
Very short queries. A single word gives the embedding model almost no context. "Python" could refer to the programming language, the snake, or Monty Python. The embedding lands somewhere in the middle of all three, which produces poor results. Query expansion or hybrid search helps here.
Out-of-vocabulary technical terms. New product names, internal acronyms, and recently coined terminology may appear rarely or never in training data. The embedding is unreliable. A combination of keyword search for exact matching and semantic search for meaning-based queries handles this correctly.
Model mismatch. Using a general-purpose embedding model for a highly specialized domain (medical literature, legal documents, financial instruments) produces lower-quality embeddings than a domain-fine-tuned model. According to Sparkco's sentence transformer guide, for domain-specific data, fine-tuning on your own corpus can significantly enhance embedding quality.
According to Unstructured's vector embeddings analysis, chunking determines the granularity of retrieval. Overly large chunks dilute meaning, and overly small chunks drop the context the model needs to disambiguate.
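The dilution effect from oversized chunks is easy to see geometrically. In this toy numpy sketch, three orthogonal unit vectors stand in for embeddings of three unrelated topics sharing one chunk:

```python
import numpy as np

# Orthogonal "topic" directions standing in for topic embeddings.
refunds = np.array([1.0, 0.0, 0.0])
shipping = np.array([0.0, 1.0, 0.0])
timeouts = np.array([0.0, 0.0, 1.0])

# An oversized chunk covering all three topics: its embedding is
# (roughly) the average of the topic directions, renormalized.
big_chunk = (refunds + shipping + timeouts) / 3
big_chunk /= np.linalg.norm(big_chunk)

query = refunds  # a query squarely about refunds
print(round(float(query @ refunds), 3))    # focused chunk: 1.0
print(round(float(query @ big_chunk), 3))  # diluted chunk: 0.577
```

The diluted chunk scores 0.577 against a query it fully answers, while a focused chunk scores 1.0. Real embeddings are not this clean, but the pull-in-multiple-directions effect is the same.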
Semantic Search in Production: Real Systems
Semantic search is no longer experimental. It powers retrieval in systems that operate at massive scale.
LinkedIn's job and people search runs a full semantic search stack combining GPU-accelerated exhaustive retrieval, LLM-supervised embedding models, and a small language model reranker under strict latency and throughput constraints. The system processes billions of queries against billion-scale indexes.
Azure AI Search benchmarked hybrid search plus semantic reranking against pure vector search across four customer datasets. Hybrid with reranking consistently outperformed pure vector search across all document types and industry verticals.
Elasticsearch's ELSER adds a learned sparse retriever on top of traditional BM25 infrastructure, giving teams the option to use semantic retrieval without replacing their existing Elasticsearch deployment.
Semantic Search as the Memory Layer for LLMs
The most important application of semantic search in 2025 and 2026 is as the retrieval component of RAG systems. Large language models have a knowledge cutoff and cannot access private data. Semantic search bridges the gap by finding relevant documents from a private knowledge base before the LLM generates a response.
According to RapidSearch's semantic search analysis, semantic search is the engine behind Retrieval-Augmented Generation (RAG), which solves one of the biggest problems with LLMs: hallucinations. In a RAG workflow, the semantic search system first retrieves the most relevant factual data from a private knowledge base. This retrieved context is fed into the LLM alongside the original question, grounding the response in accurate information.
User: "What is our policy on remote work in different time zones?"
Semantic Search:
Embed query → [0.41, -0.22, ..., 0.88]
ANN search in HR policy database
Retrieve: "Employees in different time zones are expected to maintain
a minimum 4-hour overlap with their team's core hours..."
Rerank top 5 candidates → top 2 selected
LLM receives:
Context: [retrieved policy text]
Question: "What is our policy on remote work in different time zones?"
→ Grounded, accurate, specific answer

The vector database article covers how this full RAG architecture is assembled, including indexing pipelines, metadata filtering, and the complete retrieval-generation loop.
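The final hand-off to the LLM is just string assembly. A sketch of the prompt-building step follows; the template wording is illustrative (the exact instructions are a design choice, not a standard).

```python
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble the grounded prompt sent to the LLM: retrieved context
    first, then the user's question, with instructions to stay grounded."""
    context = "\n\n".join(
        f"[Source {i + 1}]\n{chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

chunks = [
    "Employees in different time zones are expected to maintain a "
    "minimum 4-hour overlap with their team's core hours.",
]
prompt = build_rag_prompt(
    "What is our policy on remote work in different time zones?", chunks
)
print(prompt[:60])
```

The "say so" instruction is what lets the model decline instead of hallucinating when retrieval comes back empty or off-topic.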
The Query Pipeline in One Diagram
OFFLINE (runs once during indexing)
────────────────────────────────────────────────────────────────
Raw documents (PDF, HTML, Markdown, database)
↓
Text extraction and cleaning
↓
Chunking (300 to 500 tokens with 15% overlap)
↓
Embedding model (bi-encoder)
↓
Dense vector per chunk + Sparse vector per chunk (optional BM25)
↓
Vector database (HNSW index for dense, inverted index for sparse)
ONLINE (runs on every query, target < 200ms)
────────────────────────────────────────────────────────────────
User query
↓
Query embedding (same model as indexing, < 10ms)
↓
ANN search in vector database (HNSW traversal, < 50ms)
↓
Optional: BM25 keyword search in parallel
↓
Optional: RRF fusion of dense + sparse results
↓
Optional: Cross-encoder reranker on top 20 to 100 candidates (< 200ms)
↓
Top K most relevant chunks
↓
LLM context OR Search results UI

Each stage in this pipeline can be tuned independently. Swap the embedding model, change the chunk size, adjust the reranker model, or add metadata filters without redesigning the entire system.
Choosing an Embedding Model for Semantic Search
The embedding model is the single most important configuration decision in a semantic search system. It defines what similarity means. Two sentences close in the embedding space are retrieved together regardless of what they actually say.
For English text at moderate scale, all-MiniLM-L6-v2 from Sentence Transformers is a strong starting point: 384 dimensions, fast on CPU, good quality. For higher quality at the cost of more compute, all-mpnet-base-v2 (768 dimensions) is preferred. For production RAG at scale with an API budget, OpenAI's text-embedding-3-small (1536 dimensions) consistently ranks among the top performers on semantic benchmarks.
The embeddings article covers the full model comparison table including multilingual options, multimodal models, and domain-specific fine-tuned models.
Summary
Semantic search converts both queries and documents into embedding vectors and returns documents whose vectors are closest to the query vector. It handles synonyms, paraphrases, and natural language intent that keyword search cannot.
The pipeline has two phases. Offline indexing extracts text, splits it into chunks, embeds each chunk, and stores the vectors in a vector database. Online querying embeds the user's query, runs ANN search to find the nearest vectors, and optionally reranks the top candidates using a cross-encoder.
Semantic search fails on exact identifiers, very short queries, out-of-vocabulary terms, and oversized chunks. Hybrid search combining semantic retrieval with BM25 addresses the most critical of those failure modes, which is covered in detail in dense vs sparse vectors.
The downstream destination for the retrieved chunks is the LLM in a RAG pipeline. The infrastructure that makes this work at scale is the vector database, which stores millions of embeddings and returns the nearest ones in milliseconds using ANN indexing.
Sources and Further Reading
- Google Cloud. What Is Semantic Search and How Does It Work? cloud.google.com/discover/what-is-semantic-search
- Redis. Semantic Search vs. Keyword Search: When to Use Each. redis.io/blog/semantic-search-vs-keyword-search
- Elastic. Chunking Strategies for Semantic Search in Elasticsearch. elastic.co/search-labs/blog/chunking-strategies-elasticsearch
- Unstructured. Semantic Search vs. Keyword Search: Key Differences. unstructured.io/insights/semantic-vs-keyword-search-key-differences-for-ai-data
- Unstructured. How Vector Embeddings Improve Search Relevance. unstructured.io/insights/vector-embeddings-the-key-to-better-search-relevance
- Parallel.ai. What Is Semantic Search and How Does It Work? parallel.ai/articles/what-is-semantic-search
- Meilisearch. What Is Semantic Search and How Does It Work? meilisearch.com/blog/semantic-search
- RapidSearch. Semantic Search: How It Works, Why It Matters. rapidsearch.app/blog/semantic-search
- TechTarget. What Is Semantic Search and How Does It Work? techtarget.com/searchenterpriseai/definition/semantic-search
- Elastic. Semantic Reranking Documentation. elastic.co/docs/solutions/search/ranking/semantic-reranking
- ZeroEntropy. Ultimate Guide to Choosing the Best Reranking Model in 2026. zeroentropy.dev/articles/ultimate-guide-to-choosing-the-best-reranking-model-in-2025
- Microsoft Azure. Azure AI Search: Outperforming Vector Search with Hybrid Retrieval and Reranking. techcommunity.microsoft.com/blog/azure-ai-foundry-blog/azure-ai-search-outperforming-vector-search
- LinkedIn. Semantic Search at LinkedIn. arxiv.org/pdf/2602.07309
- Couchbase. Semantic Search vs. Keyword Search: Key Differences. couchbase.com/blog/semantic-search-vs-keyword-search
- Hugging Face. Sentence Transformers Library. huggingface.co/sentence-transformers
Krunal Kanojiya
Technical Content Writer
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.