What Are Embeddings? How AI Converts Text Into Numbers
A research-backed explanation of what embeddings are in machine learning. Learn how AI models convert text, images, and audio into numerical vectors, how transformer-based embedding models work, and how embeddings power semantic search, RAG pipelines, and vector databases.
When you type a question into a search engine and it returns a relevant result even though your exact words never appear on that page, something has to bridge the gap between your words and the document's words. That something is an embedding.
Embeddings are not a recent invention. The core idea of representing words as vectors goes back to distributional semantics research in the 1950s. What changed is the quality of the representations and the scale at which they can be produced. Modern transformer models produce embeddings where the full meaning of a 500-word paragraph fits into a list of 1536 numbers, and the geometry of those numbers encodes relationships that feel almost intuitive when you examine them.
This article explains what embeddings are, how they are generated, how the mathematics works, and where they show up in the AI applications you build or use every day. It connects directly to the foundational concepts in vectors in machine learning and is the bridge to understanding vector databases, semantic search, and the difference between dense and sparse representations.
What Is an Embedding?
According to Wikipedia's machine learning embedding entry, an embedding is a representation learning technique that maps complex, high-dimensional data into a lower-dimensional space of numerical vectors; the term also refers to the resulting representation itself, in which meaningful patterns and relationships are preserved.
In practice: an embedding is a list of floating-point numbers produced by a neural network model that captures the meaning or context of an input. Two inputs that mean similar things produce numerically similar embeddings. Two inputs that are unrelated produce numerically distant embeddings.
Input: "How do I reset my password?"
Output: [0.0231, -0.1420, 0.8832, 0.0045, -0.3310, ..., 0.1192]
↑ 1536 floating-point numbers representing the meaning of that question

The individual numbers in that list do not have a human-readable interpretation. Dimension 47 does not mean "this is a question." The meaning lives in the geometry — in the distances and angles between this vector and others in the same space.
The Problem That Embeddings Solve
Before embeddings became standard, the main way to represent text in machine learning models was one-hot encoding. The concept is simple. You define a vocabulary of every unique word you expect to encounter. Each word gets a unique index. Its representation is a vector where that index is 1 and every other position is 0.
# Vocabulary: ["cat", "dog", "run", "sleep"]
# Indices: 0 1 2 3
one_hot_cat = [1, 0, 0, 0]
one_hot_dog = [0, 1, 0, 0]
one_hot_run = [0, 0, 1, 0]
one_hot_sleep = [0, 0, 0, 1]

According to Google's machine learning course on embeddings, this approach has two fundamental problems. First, it creates enormous sparse vectors. A vocabulary of 50,000 words produces 50,000-dimensional vectors that are almost entirely zeros. Second, there is no meaningful relationship between any two vectors. The distance between "cat" and "dog" is mathematically identical to the distance between "cat" and "spaceship." The encoding contains zero information about meaning.
Embeddings solve both problems simultaneously. They produce dense, lower-dimensional vectors where the geometry encodes the semantic relationships that one-hot encoding loses entirely.
# Dense embeddings for the same words (simplified to 4D for illustration)
embed_cat = [ 0.82, 0.51, -0.14, 0.33]
embed_dog = [ 0.79, 0.48, -0.11, 0.31] # close to cat
embed_run = [-0.22, 0.91, 0.44, -0.67] # far from cat
embed_sleep = [-0.18, 0.88, 0.41, -0.72] # close to run, far from cat

"Cat" and "dog" are now numerically close. "Run" and "sleep" are now close to each other and far from "cat" and "dog." That is semantic information encoded into geometry.
How a Neural Network Learns Embeddings
Embeddings are not hand-coded. They are learned from data. The network does not start with human-defined relationships between words. It starts with random numbers and adjusts them until the geometry reflects actual semantic relationships, discovered purely from co-occurrence patterns in text.
The classic example is Word2Vec, introduced by Tomas Mikolov and colleagues at Google in 2013. The core idea is the distributional hypothesis: words that appear in similar contexts carry similar meanings.
Word2Vec trains a shallow neural network on one of two tasks. In the Skip-gram architecture, given a word, predict the words surrounding it. In the CBOW (Continuous Bag of Words) architecture, given the surrounding words, predict the center word. The prediction task itself is not the goal. The goal is the learned internal representation that the network develops to make those predictions accurately.
Training sentence: "The cat sat on the mat"
Skip-gram task:
Input: "sat"
Predict: ["The", "cat", "on", "the"]
After training on millions of sentences, "cat" and "dog" both appear
next to words like "pet", "fur", "vet", "feed" — so their vectors
get pushed close together in the learned vector space.

According to Serokell's Word2Vec explainer, the basic idea behind Word2Vec is to represent each word as a multi-dimensional vector where the position of the vector in that high-dimensional space captures the meaning of the word. Word2Vec takes a large corpus of text as input and generates a vector space with hundreds of dimensions.
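To make the Skip-gram setup concrete, here is a minimal sketch in plain Python (not a real Word2Vec implementation) of how (center, context) training pairs are generated from a tokenized sentence with a context window:

```python
# Sketch: generating Skip-gram (center, context) training pairs from a
# tokenized sentence. A real Word2Vec run feeds millions of such pairs
# into a shallow network; this only shows the data-preparation step.

def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
for center, context in skipgram_pairs(sentence):
    print(center, "->", context)
# "sat" pairs with "the", "cat", "on", "the" — exactly the
# prediction targets shown in the example above
```

Each pair becomes one training example: predict the context word from the center word.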
A well-trained Word2Vec model produces a famous result: vector arithmetic that captures semantic analogy.
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
vector("Paris") - vector("France") + vector("Italy") ≈ vector("Rome")

This is not a trick or a cherry-picked result. It demonstrates that the learned geometry encodes relational meaning. The direction from "man" to "woman" in the vector space is the same direction as from "king" to "queen." Relationships become directions.
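The mechanics of analogy lookup can be shown with toy vectors. The four-word vocabulary below is hand-made for this example, not taken from any trained model; the point is only the arithmetic-plus-nearest-neighbor procedure that analogy evaluation performs:

```python
import numpy as np

# Toy, hand-constructed vectors (NOT from a real model), chosen so that
# dimension 0 roughly tracks "royalty" and dimension 1 roughly tracks
# "gender" — just enough structure to demonstrate the lookup.
vocab = {
    "king":  np.array([0.9,  0.8, 0.1]),
    "queen": np.array([0.9, -0.8, 0.1]),
    "man":   np.array([0.1,  0.8, 0.3]),
    "woman": np.array([0.1, -0.8, 0.3]),
    "apple": np.array([-0.5, 0.0, 0.9]),   # unrelated distractor
}

def analogy(a, b, c):
    """Find the vocab word closest (by cosine) to vec(a) - vec(b) + vec(c)."""
    target = vocab[a] - vocab[b] + vocab[c]
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # exclude the three input words, as Word2Vec evaluation conventionally does
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(candidates[w], target))

print(analogy("king", "man", "woman"))  # queen
```

With real trained embeddings the same procedure works across hundreds of dimensions, where no single dimension has a nameable meaning.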
From Word Embeddings to Sentence Embeddings
Word2Vec produces one vector per word. That creates a problem for sentence-level tasks. The sentence "I went to the bank to deposit money" and "I sat on the bank of the river" use the same word "bank" but in completely different meanings. Word2Vec assigns the same vector to "bank" regardless of context.
Transformer models, starting with the BERT architecture published by Google AI in 2018, solved this by producing context-aware embeddings. Every word's representation is influenced by every other word in the sentence through the self-attention mechanism.
The next step was sentence-level embeddings. According to Pinecone's sentence transformers guide, transformers work using word or token-level embeddings, not sentence-level embeddings. Before sentence transformers, the approach to calculating accurate sentence similarity with BERT was to use a cross-encoder structure, which required passing every pair of sentences through the model together — computationally impractical at scale.
Sentence Transformers, introduced in the paper "Sentence-BERT" (2019), solved this by fine-tuning BERT to produce a single fixed-length vector for an entire sentence. The fine-tuning uses contrastive learning: pairs of semantically similar sentences are pushed together, and pairs of unrelated sentences are pushed apart.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384) — 3 sentences, each represented as 384 floats
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])

The first two sentences about weather have similarity 0.6660. The third sentence about driving is nearly unrelated to both, scoring around 0.10. The model has encoded meaning into numbers, and the numbers reflect human semantic judgment. The code above is taken directly from the Sentence Transformers documentation at Hugging Face.
How Transformer Models Produce Embeddings: Step by Step
Modern embedding models follow a pipeline from raw text to a fixed-length vector. Understanding each step helps you reason about failure modes and model selection.
Step 1: Tokenization
The input text is broken into tokens. A token is not always a full word. Models use subword tokenization (typically BPE or WordPiece) so that unknown words can be represented as combinations of known subwords.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("embeddings are fascinating")
print(tokens)
# ['em', '##bed', '##dings', 'are', 'fascinating']

"Embeddings" is split into three subword tokens. This allows the model to handle words it has never seen during training by recognizing familiar subword patterns.
Step 2: Token Embeddings
Each token gets an initial embedding vector from a lookup table. This is the learned embedding layer — a matrix where each row corresponds to a token and contains its dense vector representation. A vocabulary of 30,000 tokens with 768-dimensional embeddings requires a matrix of shape (30000, 768).
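The lookup can be sketched directly with NumPy. The vocabulary size, dimension, and random initialization below are illustrative toy values, not taken from any real model:

```python
import numpy as np

# Step 2 as a lookup table: a (vocab_size, dim) matrix with one learned
# row per token id. Tiny illustrative sizes; a real model would use
# something like (30000, 768), with rows learned during training.
rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
embedding_matrix = rng.normal(size=(vocab_size, dim))

token_ids = [3, 7, 3]                            # ids from the tokenizer
token_embeddings = embedding_matrix[token_ids]   # plain row indexing
print(token_embeddings.shape)                    # (3, 4)
# the same token id always selects the same row at this stage —
# context-dependence only appears later, in the attention layers
```

This is why the embedding layer is just a matrix: lookup is indexing, and training adjusts the rows.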
Step 3: Self-Attention
The transformer's self-attention mechanism allows each token to incorporate information from every other token in the sequence. The word "bank" produces a different vector depending on whether it appears next to "money" or next to "river," because the surrounding context updates its representation at every attention layer.
According to Airbyte's OpenAI embeddings guide, OpenAI embeddings use transformer-based attention mechanisms to capture context-dependent meaning, so the same word is embedded differently based on surrounding context.
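A minimal single-head sketch of the idea, assuming the token vectors themselves serve as queries, keys, and values (real transformers first multiply the input by learned Q, K, and V projection matrices, and stack many heads and layers):

```python
import numpy as np

# Simplified single-head self-attention: each output vector is a
# softmax-weighted mixture of ALL token vectors, which is how context
# flows into every position.
def self_attention(X):
    """X: (seq_len, dim) token vectors -> (seq_len, dim) contextual vectors."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                  # pairwise attention scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row
    return weights @ X                             # mix every token into each output

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # three toy token vectors
out = self_attention(X)
print(out.shape)  # (3, 2) — one contextual vector per token
```

Because every row of the output mixes all input rows, the vector for "bank" ends up different depending on whether "money" or "river" sits nearby.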
Step 4: Pooling
After all attention layers, the model produces one vector per token. To get a single vector for the whole input, those token vectors are pooled. Common approaches are mean pooling (average all token vectors) and CLS token pooling (use the special [CLS] token's output, which is trained to summarize the input).
Input: "How do I reset my password?"
Tokens: [CLS] how do i reset my password ? [SEP]
After layers: v_cls v1 v2 v3 v4 v5 v6 v7 v8
Mean pooling: average(v1, v2, v3, v4, v5, v6, v7) → single 768-dim vector
CLS pooling: v_cls → single 768-dim vector

The resulting single vector is the embedding for the entire sentence.
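Mean pooling is simple enough to sketch directly. The three token vectors below are made up for illustration; a real model would exclude or mask special tokens like [CLS] and [SEP] before averaging:

```python
import numpy as np

# Mean pooling: collapse the per-token vectors into one fixed-length
# sentence vector by averaging across the sequence dimension.
token_vectors = np.array([
    [0.2, 0.8, -0.1],   # vector for token 1
    [0.5, 0.1,  0.4],   # vector for token 2
    [0.3, 0.3,  0.0],   # vector for token 3
])
sentence_vector = token_vectors.mean(axis=0)
print(sentence_vector.shape)  # (3,) — one vector, regardless of sentence length
```

The output dimension equals the model's hidden size, not the number of tokens, which is what makes embeddings of different-length inputs directly comparable.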
Calling an Embedding Model via API
For most teams building production applications, calling an embedding API is more practical than hosting a model. OpenAI's Embeddings API produces high-quality embeddings with no infrastructure to manage.
import openai
import numpy as np
client = openai.OpenAI(api_key="your-key-here")
def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    text = text.replace("\n", " ")
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding
# Embed two sentences
sentence_a = "How do I get a refund?"
sentence_b = "What is the process for returning a product?"
sentence_c = "What is the capital of France?"
emb_a = np.array(get_embedding(sentence_a))
emb_b = np.array(get_embedding(sentence_b))
emb_c = np.array(get_embedding(sentence_c))
def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(f"Refund vs Return: {cosine_similarity(emb_a, emb_b):.4f}")  # high, ~0.92
print(f"Refund vs France: {cosine_similarity(emb_a, emb_c):.4f}")  # low, ~0.21

The embeddings for "refund" and "return a product" are semantically close even though they share no keywords. A traditional keyword search would fail to connect them. This is the foundation of semantic search.
Choosing an Embedding Model
The model you choose determines the quality of your similarity search. Key dimensions to consider: dimensionality, context window, whether the model is open-source or API-based, and performance on the type of content you are indexing.
Model | Dimensions | Type | Best for
--------------------------------+------------+---------------+---------------------------
text-embedding-3-small (OpenAI) | 1536 | API | General RAG, English text
text-embedding-3-large (OpenAI) | 3072 | API | High precision tasks
all-MiniLM-L6-v2 (SBERT) | 384 | Open source | Low-latency, CPU-friendly
all-mpnet-base-v2 (SBERT) | 768 | Open source | Highest quality local model
paraphrase-multilingual-mpnet | 768 | Open source | 50+ language support
text-embedding-004 (Google) | 768 | API | Vertex AI integration
embed-english-v3.0 (Cohere) | 1024 | API | Enterprise search

According to Sparkco's sentence transformer guide, choosing an appropriate model is critical for accuracy and relevance. Models like all-MiniLM-L6-v2 are versatile but may not suffice for niche applications. For domain-specific data, fine-tuning on your own corpus can significantly enhance embedding quality.
One critical rule: embeddings from different models cannot be mixed. An embedding from OpenAI's model and an embedding from Cohere live in completely different vector spaces. Comparing them produces meaningless numbers. Every document in your vector database must be embedded with the same model as your query.
Types of Embeddings
Text is not the only data type that gets embedded. The same principle, encoding meaning into a dense vector, applies to images, audio, graphs, and combinations of data types.
Word Embeddings
Word embeddings assign one vector per word. Word2Vec and GloVe are the classic models. They are fast and lightweight but context-unaware. The word "bank" has one vector regardless of whether it means a financial institution or a riverbank.
Sentence and Document Embeddings
Sentence embeddings assign one vector per sentence or paragraph, capturing the meaning of the full sequence rather than individual words. Sentence Transformers and OpenAI's embedding API both produce sentence-level embeddings. These are what most RAG pipelines use.
Image Embeddings
Vision models such as ResNet (a convolutional network) and ViT (a Vision Transformer) convert images into dense vectors. According to Labelbox's AI foundations guide, models like AlexNet, VGG, and ResNet revolutionized image processing by creating image embeddings that preserve spatial hierarchies and semantic information.
Multimodal Embeddings
Models like CLIP by OpenAI produce embeddings where text and images share the same vector space. A photo of a dog and the sentence "a golden retriever playing outside" land at similar coordinates. This enables cross-modal search: query with text, retrieve images, or query with an image, retrieve related text.
Graph Embeddings
Graph embeddings represent nodes in a knowledge graph as vectors that encode both the node's attributes and its relationships to neighboring nodes. These are common in recommendation systems and fraud detection, where the network structure itself carries meaning.
Embedding Dimensionality: How Many Numbers Do You Need?
More dimensions allow the model to capture more nuance, but come at the cost of storage, computation, and the curse of dimensionality in nearest-neighbor search. The right dimensionality depends on the task.
According to Wikipedia's embedding article, for high-dimensional vector spaces, vectors tend to converge in distance, so Euclidean distance becomes less reliable for large embedding vectors. This is why cosine similarity, which measures angle rather than absolute distance, is preferred for high-dimensional text embeddings.
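The concentration effect can be observed directly with random vectors. This sketch measures the relative spread of Euclidean distances as the dimension grows (random Gaussian points, seeded for reproducibility):

```python
import numpy as np

# Distance concentration check: as dimensionality grows, pairwise
# Euclidean distances between random points bunch up around the same
# value, so "near" and "far" become harder to tell apart.
rng = np.random.default_rng(42)
ratios = {}
for dim in (2, 100, 10_000):
    points = rng.normal(size=(200, dim))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)  # distances to point 0
    ratios[dim] = dists.std() / dists.mean()                # relative spread
    print(f"dim={dim:>6}: spread/mean = {ratios[dim]:.3f}")
# the relative spread shrinks as dim grows — one reason angle-based
# cosine similarity is preferred over raw Euclidean distance for
# high-dimensional text embeddings
```

The absolute distances still differ, but their spread relative to their mean collapses, which degrades the discriminative power of Euclidean distance in high dimensions.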
OpenAI's text-embedding-3-small and text-embedding-3-large models also support dimension reduction through the Matryoshka Representation Learning technique. You can request 256-dimensional or 512-dimensional versions of a 1536-dimensional embedding with minimal quality loss, which is useful when storage and latency matter more than maximum recall.
# Request a smaller dimension from OpenAI's API
response = client.embeddings.create(
    input="What is a vector database?",
    model="text-embedding-3-small",
    dimensions=512  # reduced from 1536 to 512
)
embedding = response.data[0].embedding
print(len(embedding))  # 512

Contextual vs Static Embeddings
The distinction between static and contextual embeddings is important for understanding why modern models outperform older ones.
A static embedding model assigns the same vector to a word regardless of context. Word2Vec, GloVe, and FastText are static. "Bank" always maps to the same vector.
A contextual embedding model produces a different vector for the same word depending on what surrounds it. BERT, GPT, and the OpenAI embedding API are contextual. "Bank" near "money" and "bank" near "river" produce different vectors.
Contextual embeddings are strictly more powerful for tasks involving polysemous words (words with multiple meanings) and nuanced sentence-level comparison. They are also more expensive to compute because the entire input sequence must be processed through multiple attention layers.
How Embeddings Connect to Vector Databases
Once you produce embeddings for a large collection of documents, you need somewhere to store and search them efficiently. A vector database is purpose-built for this: it stores the embedding vectors alongside metadata and uses approximate nearest neighbor algorithms to find the closest vectors to any query in milliseconds.
The full pipeline works as follows:
Offline Indexing Phase
───────────────────────────────────────────────────────────
Document corpus
↓
Chunk into segments (500 tokens with overlap)
↓
Embedding model (text-embedding-3-small)
↓
1536-dimensional float vector per chunk
↓
Vector database (Pinecone / Weaviate / Milvus)
↓
Stored with metadata (source, chunk index, original text)
Online Query Phase
───────────────────────────────────────────────────────────
User query
↓
Same embedding model
↓
Query vector
↓
ANN search in vector database
↓
Top K most similar chunks
↓
Passed to LLM as context
↓
Grounded, accurate response

This is the RAG (Retrieval Augmented Generation) architecture. The vector database is what makes the retrieval step fast at scale. Without it, you would need to compute the similarity between the query and every stored document on every request, which is not viable for millions of documents.
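For intuition, here is what the ANN index inside a vector database replaces: a brute-force top-k scan over every stored vector. The random vectors below are stand-ins for pre-computed embeddings; a real pipeline would embed both documents and queries with the same model:

```python
import numpy as np

# Brute-force retrieval: score the query against every stored vector.
# We plant the query right next to document 42 so the expected result
# is known in advance.
rng = np.random.default_rng(7)
doc_vectors = rng.normal(size=(1000, 64))                      # stand-in embeddings
query_vector = doc_vectors[42] + 0.01 * rng.normal(size=64)    # query near doc 42

# normalize so a single dot product equals cosine similarity
docs_n = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
query_n = query_vector / np.linalg.norm(query_vector)

scores = docs_n @ query_n               # one similarity score per document
top_k = np.argsort(scores)[::-1][:5]    # indices of the 5 closest documents
print(top_k[0])                         # 42 — the planted nearest neighbor
```

This scan is O(number of documents) per query; ANN indexes such as HNSW trade a small amount of recall for sublinear search time, which is the entire value proposition of a vector database.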
The dense vectors used in this pipeline are discussed in detail in the dense vs sparse vectors article. The retrieval step relies on the similarity concepts covered in the semantic search article.
Why Embeddings From Different Models Cannot Be Mixed
This deserves its own section because it is a common source of bugs in production systems.
Every embedding model trains its own vector space from scratch. OpenAI's text-embedding-3-small learns a 1536-dimensional space. Cohere's embed-english-v3.0 learns a 1024-dimensional space. The orientation, scale, and geometry of those spaces are completely independent. There is no transformation that reliably maps one into the other.
If you store documents embedded with one model and then query with a different model, the similarity scores are meaningless. The vectors point in incompatible directions in incompatible spaces.
The practical rule: pick one model and use it for both indexing and querying. If you change the model, re-embed your entire document corpus.
Practical Example: Semantic Deduplication
One underused application of embeddings is finding near-duplicate content in large datasets. Two support tickets that say the same thing in different words will have high cosine similarity even though they share no keywords.
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2")
tickets = [
    "My account is locked and I can't log in.",
    "I am unable to access my account. It seems locked.",
    "How do I enable two-factor authentication?",
    "How can I turn on 2FA for my account?",
    "I want to delete my account permanently.",
]
embeddings = model.encode(tickets)
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("Near-duplicate detection:")
for i in range(len(tickets)):
    for j in range(i + 1, len(tickets)):
        sim = cosine_sim(embeddings[i], embeddings[j])
        if sim > 0.70:
            print(f"Similarity {sim:.3f}:")
            print(f"  [{i}] {tickets[i]}")
            print(f"  [{j}] {tickets[j]}")
            print()
# Output:
# Similarity 0.891:
# [0] My account is locked and I can't log in.
# [1] I am unable to access my account. It seems locked.
#
# Similarity 0.823:
# [2] How do I enable two-factor authentication?
# [3] How can I turn on 2FA for my account?

Tickets 0 and 1 are duplicates, as are 2 and 3. The embedding model found both pairs without any keyword overlap between "locked"/"unable to access" and without knowing that "2FA" is an abbreviation of "two-factor authentication."
Embeddings and the Latent Space
When a model produces an embedding, it places the input somewhere in what researchers call the latent space. This is a high-dimensional mathematical space where the coordinates are not pixel values or word counts, but learned abstract features.
According to AWS's embedding explainer, embeddings convert real-world objects into complex mathematical representations that capture inherent properties and relationships between real-world data. The entire process is automated, with AI systems self-creating embeddings during training.
The geometry of this latent space is what makes embeddings powerful. Analogies become vector arithmetic. Categories become clusters. The direction from "positive sentiment" to "negative sentiment" is a direction you can apply to any review embedding to predict its tone. This is covered in depth in the latent space article.
Summary
An embedding is a dense numerical vector produced by a neural network that encodes the meaning of its input. Words, sentences, images, and audio can all be embedded. Similar inputs produce numerically similar embeddings, so finding related content becomes a geometry problem rather than a keyword matching problem.
The generation process goes through tokenization, learned token embeddings, self-attention across the full input, and pooling to a single fixed-length vector. Modern transformer-based embedding models like OpenAI's text-embedding-3-small and the Sentence Transformers library produce contextual embeddings that handle polysemy and sentence-level meaning far better than older static approaches like Word2Vec.
Embeddings are the input to vector databases, the engine behind semantic search, and the bridge between raw unstructured data and AI applications that understand what the data means. The vector database article covers what happens after embeddings are generated and how ANN search retrieves the right ones at scale.
Sources and Further Reading
- AWS. What Is Embedding in Machine Learning? aws.amazon.com/what-is/embeddings-in-machine-learning
- Cloudflare. What Are Embeddings? cloudflare.com/learning/ai/what-are-embeddings
- Google for Developers. Embeddings — Machine Learning Crash Course. developers.google.com/machine-learning/crash-course/embeddings
- IBM. What Is Embedding? ibm.com/think/topics/embedding
- Wikipedia. Embedding (Machine Learning). en.wikipedia.org/wiki/Embedding_(machine_learning)
- Wikipedia. Word2Vec. en.wikipedia.org/wiki/Word2vec
- Pinecone. Sentence Transformers: Meanings in Disguise. pinecone.io/learn/series/nlp/sentence-embeddings
- Hugging Face. Sentence Transformers Documentation. huggingface.co/sentence-transformers
- OpenAI. Embeddings API Guide. platform.openai.com/docs/guides/embeddings
- Airbyte. OpenAI Embeddings 101. airbyte.com/data-engineering-resources/openai-embeddings
- Serokell. Word2Vec: Explanation and Examples. serokell.io/blog/word2vec
- Labelbox. AI Foundations: Understanding Embeddings. labelbox.com/guides/ai-foundations-understanding-embeddings
- Lightly.ai. Embeddings in Machine Learning: An Overview. lightly.ai/blog/embeddings
- GeeksforGeeks. Text Embeddings Using OpenAI. geeksforgeeks.org/nlp/text-embeddings-using-openai
- Mikolov et al. Distributed Representations of Words and Phrases. arxiv.org/abs/1310.4546
Krunal Kanojiya
Technical Content Writer
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.