Embeddings and Representation Learning: How Models Turn Words Into Math
Embeddings are how neural networks turn raw tokens into something they can actually reason about. This article covers token embeddings, positional embeddings, the evolution from Word2Vec to RoPE, and why the geometry of the vector space matters for everything downstream.
When I first looked at how token embeddings work in a transformer, I expected something complicated. It is not.
nn.Embedding in PyTorch is a weight matrix. Shape: (vocab_size, embed_dim). A forward pass is a row lookup. Token ID 42 gives you row 42 of the matrix. That row is a vector of 128 or 768 or 4096 floating-point numbers, and it gets updated by backpropagation like any other weight in the network.
The reason embeddings feel magical is not the mechanism. The mechanism is trivial. The reason is what backpropagation does to that matrix over millions of training steps. Tokens that appear in similar contexts get pulled toward similar regions of the vector space. By the end of training, the geometry of that space encodes real semantic relationships — not because anyone programmed it to, but because the loss function pushed it there.
This is Article 4 in the series on AI and ML fundamentals. Article 3 covered how neural networks and backpropagation work. Everything there applies here directly — an embedding layer is trained the same way. Article 5 will cover sequence modeling and RNNs, where these embeddings become the input to a model that processes them one step at a time. Understanding what an embedding is and what makes a good one is what you need before that.
The lookup table that learns
Start with the mechanics before the intuition.
import torch
import torch.nn as nn
vocab_size = 50_257 # GPT-2 vocabulary
embed_dim = 128 # NanoGPT scale
embedding = nn.Embedding(vocab_size, embed_dim)
# token IDs — what the tokenizer produces
token_ids = torch.tensor([15496, 995, 11]) # "Hello world !"
# lookup: each ID becomes its corresponding row vector
vectors = embedding(token_ids)
print(vectors.shape) # torch.Size([3, 128])
print(vectors[0]) # the 128-dimensional vector for token 15496

Three token IDs go in. Three 128-dimensional vectors come out. That is the entire forward pass of an embedding layer.
Before training, those vectors are random noise — initialized from a normal distribution, typically N(0, 0.02). After training on enough text, they are not random anymore. The model has shaped them into a space where nearby vectors mean related things.
# weight matrix lives in embedding.weight
print(embedding.weight.shape) # torch.Size([50257, 128])
# the embedding is just a row lookup
manual_lookup = embedding.weight[15496]
assert torch.allclose(vectors[0], manual_lookup)
# True — nn.Embedding is literally indexing a matrix

Because it is just matrix rows, backpropagation only updates the rows that were actually used in the current batch. A token that appears once in the training data gets one gradient update. A common word like "the" gets millions. This is part of why rare token embeddings are worse than common ones — they have seen fewer gradient signals.
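A quick way to see the sparse-update behavior (a toy table with made-up token IDs, not a real vocabulary):

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10, 4)

# token 2 appears twice in the batch, token 7 once, the rest not at all
ids = torch.tensor([2, 7, 2])
loss = embedding(ids).sum()
loss.backward()

grad = embedding.weight.grad
print(grad[0])  # unused row: all zeros
print(grad[7])  # used once: all ones
print(grad[2])  # used twice: gradients accumulate to 2.0
```

Unused rows get exactly zero gradient, and a token that appears twice in the batch accumulates twice the signal. Scale this up to billions of tokens and the frequency gap between "the" and a rare word becomes a quality gap in their embeddings.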
What the geometry looks like after training
The classic demonstration uses Word2Vec analogies. Train on enough text and the vector arithmetic works out:
# using a pre-trained embedding model to show the geometry
# (requires: pip install gensim)
import gensim.downloader as api
model = api.load("word2vec-google-news-300")
# vector("king") - vector("man") + vector("woman") ≈ vector("queen")
result = model.most_similar(
positive=["king", "woman"],
negative=["man"],
topn=3
)
print(result)
# [('queen', 0.7118), ('monarch', 0.6189), ('princess', 0.5902)]

"King minus man plus woman equals queen." That relationship was not programmed. The model learned it from patterns in billions of words of text: kings and queens appear in similar contexts, men and women appear in similar contexts, and the offset between them is consistent.
This is what representation learning means. The model does not store facts. It learns a geometric space where relationships between concepts are encoded as directions and distances.
For LLMs, the same thing happens at a much larger scale. Research comparing classical Word2Vec embeddings with LLM-induced embeddings found that LLMs cluster semantically related words more tightly and perform better on analogy tasks, though the underlying principle is the same — meaning emerges from the geometry of the space.
The three eras of text embeddings
Understanding where we are requires knowing where embeddings came from.
Static embeddings (2013 to 2018): Word2Vec, GloVe, FastText. One vector per word, fixed regardless of context. "Bank" always maps to the same vector whether it means a riverbank or a financial institution. These were groundbreaking when they appeared. They are inadequate for modern LLMs.
Contextual embeddings (2018 onward): BERT, GPT, and everything after. The embedding for "bank" depends on the surrounding sentence because attention layers mix token representations together. The same token in different contexts produces different vectors. This is what transformers do that earlier models could not.
LLM-infused embeddings (2023 onward): the third era. Rather than being trained from scratch, the strongest embedding models today are initialized from LLM weights and trained on data synthesized by LLMs. Models like E5-Mistral, BGE, and LLM2Vec fine-tune LLaMA or Mistral backbones specifically to produce high-quality sentence-level embeddings for retrieval tasks.
For building products, the third era is where you are operating. When you call an embedding API to power semantic search or RAG, you are using a model from this era.
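To make the static-versus-contextual distinction from the second era concrete, here is a toy sketch (random untrained weights, made-up token IDs, not a real model): the static embedding row for a token is identical in both sequences, but a single attention layer mixes in the surrounding context and the representations diverge.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(100, 16)
attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)

BANK = 42                              # hypothetical token ID for "bank"
ctx1 = torch.tensor([[10, BANK, 20]])  # stand-in for "river bank mud"
ctx2 = torch.tensor([[55, BANK, 70]])  # stand-in for "open bank account"

x1, x2 = emb(ctx1), emb(ctx2)
out1, _ = attn(x1, x1, x1)
out2, _ = attn(x2, x2, x2)

# static lookup: the exact same vector for "bank" in both contexts
print(torch.equal(x1[0, 1], x2[0, 1]))         # True
# after attention: two different, context-dependent representations
print(torch.allclose(out1[0, 1], out2[0, 1]))  # False
```

That divergence is the entire point of contextual embeddings: the representation of a token is a function of its neighbors, not just its ID.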
Cosine similarity: measuring meaning in vector space
Once you have embedding vectors, you need a way to compare them. Cosine similarity is the standard.
import torch
import torch.nn as nn
import torch.nn.functional as F
def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> float:
return F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()
# using a small embedding table to demonstrate
embedding = nn.Embedding(10_000, 64)
nn.init.normal_(embedding.weight, mean=0, std=0.02)
# before training: random vectors, similarity is near zero
vec_cat = embedding(torch.tensor(4149)) # token for "cat"
vec_dog = embedding(torch.tensor(3360)) # token for "dog"
vec_car = embedding(torch.tensor(3116)) # token for "car"
print(f"cat vs dog (before training): {cosine_sim(vec_cat, vec_dog):.4f}")
print(f"cat vs car (before training): {cosine_sim(vec_cat, vec_car):.4f}")
# both close to 0 — random vectors are nearly orthogonal

After training on language data, the cosine similarity between "cat" and "dog" would be much higher than between "cat" and "car" — they appear in more similar contexts. That signal is what RAG systems use when matching a user query to a relevant document chunk.
# what RAG is doing under the hood
def find_most_relevant(query_vec, doc_vecs, doc_texts):
similarities = [
(cosine_sim(query_vec, dv), text)
for dv, text in zip(doc_vecs, doc_texts)
]
return sorted(similarities, reverse=True)[0]
# real systems use approximate nearest neighbor search
# (FAISS, Hnswlib) for million-scale document collections
# but the core operation is this cosine similarity

Article 5 will cover how sequences of these vectors are processed over time. The quality of the embeddings feeding into that sequence model directly determines what patterns the model can learn.
Positional embeddings: fixing a fundamental problem
Here is something that tripped me up early. Attention has no built-in notion of order.
If you give a transformer the tokens ["The", "cat", "sat"] in that order, or shuffle them to ["sat", "The", "cat"], each token ends up with exactly the same output vector, just in a different slot. The mechanism computes dot products between all pairs of tokens and is permutation-equivariant: it does not know which one came first.
import torch
import torch.nn.functional as F
# demonstrate permutation equivariance: same tokens, different order
torch.manual_seed(42)
seq1 = torch.randn(3, 8) # ["The", "cat", "sat"]
perm = torch.tensor([2, 0, 1])
seq2 = seq1[perm] # ["sat", "The", "cat"] — shuffled
Wq, Wk, Wv = torch.randn(8, 8), torch.randn(8, 8), torch.randn(8, 8)
def attention(x):
    att = F.softmax((x @ Wq) @ (x @ Wk).T / 8**0.5, dim=-1)
    return att @ (x @ Wv)
out1 = attention(seq1)
out2 = attention(seq2)
# shuffling the input just shuffles the output rows the same way
# no token's representation depends on where it sits in the sequence
assert torch.allclose(out2, out1[perm], atol=1e-5)
print("attention sees content, not order")

Positional embeddings fix this by adding position information to each token's vector before the attention layers see it. The original transformer paper used sinusoidal functions — fixed mathematical waves of different frequencies.
import torch
import math
def sinusoidal_encoding(seq_len: int, embed_dim: int) -> torch.Tensor:
pe = torch.zeros(seq_len, embed_dim)
position = torch.arange(seq_len).unsqueeze(1).float()
div_term = torch.exp(
torch.arange(0, embed_dim, 2).float()
* (-math.log(10000.0) / embed_dim)
)
pe[:, 0::2] = torch.sin(position * div_term) # even dims: sine
pe[:, 1::2] = torch.cos(position * div_term) # odd dims: cosine
return pe
pe = sinusoidal_encoding(seq_len=10, embed_dim=16)
print(pe.shape) # torch.Size([10, 16])
# each row is a unique positional fingerprint
# position 0 and position 9 have very different patterns
print("pos 0:", pe[0, :4].tolist())
print("pos 9:", pe[9, :4].tolist())

This worked well enough for the original transformer. But it does not extrapolate: a sinusoidal encoding can be computed for any position, yet positions beyond the training length are out of distribution and quality degrades. Learned positional embeddings, which GPT-2 used, are stricter still. A model trained with a 1024-entry position table has no vector at all for position 1025.
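The learned-table failure mode is easy to demonstrate (a minimal sketch with untrained weights): indexing past the table is not gradual quality loss, it is an error.

```python
import torch
import torch.nn as nn

max_trained_len = 1024
pos_emb = nn.Embedding(max_trained_len, 16)  # learned position table

positions = torch.arange(1030)               # six positions too long
try:
    pos_emb(positions)
except IndexError:
    print("position 1024+ has no row in the table: hard failure")
```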
RoPE: what modern LLMs actually use
LLaMA, Mistral, Qwen, DeepSeek — almost every major open LLM uses RoPE (Rotary Position Embedding). Understanding why it replaced learned and sinusoidal encodings matters.
The core insight: instead of adding position information to the token vectors, RoPE rotates the query and key vectors by an angle that depends on their position. When you then compute the dot product q · k, the result depends on the relative distance between the two tokens — not their absolute positions.
import torch
import math
def apply_rope(x: torch.Tensor, position: int) -> torch.Tensor:
"""
Apply RoPE to a single vector at a given position.
x: (embed_dim,) — must have even dim
"""
d = x.shape[-1]
device = x.device
# rotation angles — different frequency for each pair of dimensions
theta = torch.tensor([
1.0 / (10000 ** (2 * i / d))
for i in range(d // 2)
], device=device)
angles = position * theta # scale by position
# rotate each (x[2i], x[2i+1]) pair by angle theta_i
cos = torch.cos(angles)
sin = torch.sin(angles)
x_pairs = x.view(-1, 2) # group into pairs
x1, x2 = x_pairs[:, 0], x_pairs[:, 1]
rotated = torch.stack([
x1 * cos - x2 * sin,
x1 * sin + x2 * cos
], dim=1).view(-1)
return rotated
# token at position 0 vs position 5 get different rotations
embed_dim = 8
vec = torch.randn(embed_dim)
at_pos_0 = apply_rope(vec, position=0)
at_pos_5 = apply_rope(vec, position=5)
# the angle between them encodes relative distance
similarity = torch.nn.functional.cosine_similarity(
at_pos_0.unsqueeze(0), at_pos_5.unsqueeze(0)
)
print(f"cosine similarity between pos 0 and pos 5: {similarity.item():.4f}")
# will be less than 1 — they have been rotated away from each other

Why does this matter? RoPE encodes absolute position as a rotation, and because rotations compose, the dot product between a rotated query and a rotated key depends only on the relative distance between the two tokens. That relative formulation generalizes better to longer sequences than learned positional embeddings, which memorize specific position indices and fail completely beyond their training length.
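You can check the relative-position property numerically. This sketch reimplements the pairwise rotation so it is self-contained, then shifts both positions by the same offset:

```python
import torch

def rope(x: torch.Tensor, pos: int) -> torch.Tensor:
    # rotate consecutive (even, odd) pairs by position-dependent angles
    d = x.shape[-1]
    theta = 10000.0 ** (-2 * torch.arange(d // 2).float() / d)
    ang = pos * theta
    x1, x2 = x.view(-1, 2).unbind(-1)
    return torch.stack([x1 * torch.cos(ang) - x2 * torch.sin(ang),
                        x1 * torch.sin(ang) + x2 * torch.cos(ang)],
                       dim=-1).view(-1)

torch.manual_seed(0)
q, k = torch.randn(8), torch.randn(8)

# query at position 3, key at position 7: relative distance 4
s1 = rope(q, 3) @ rope(k, 7)
# shift both by 100 (positions 103 and 107): same relative distance
s2 = rope(q, 103) @ rope(k, 107)
print(torch.allclose(s1, s2, atol=1e-4))  # True: only the gap matters
```

Shifting every position by the same amount leaves every attention score unchanged, which is exactly why RoPE-based models tolerate position offsets that learned tables cannot.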
Both RoPE and ALiBi share a design philosophy: positional information and semantic information are different things and should not be mixed into the same vectors. Both inject position inside the attention computation (RoPE by rotating queries and keys, ALiBi by biasing the scores) rather than adding it to the token embeddings.
Here is how RoPE looks in practice inside a transformer layer using PyTorch's built-in support:
import torch
import torch.nn as nn
class RoPEAttention(nn.Module):
"""Simplified single-head attention with RoPE positional encoding."""
def __init__(self, embed_dim: int, max_seq_len: int = 2048):
super().__init__()
self.embed_dim = embed_dim
self.q_proj = nn.Linear(embed_dim, embed_dim, bias=False)
self.k_proj = nn.Linear(embed_dim, embed_dim, bias=False)
self.v_proj = nn.Linear(embed_dim, embed_dim, bias=False)
self.out_proj = nn.Linear(embed_dim, embed_dim, bias=False)
# precompute RoPE frequencies
theta = 1.0 / (10000 ** (
torch.arange(0, embed_dim, 2).float() / embed_dim
))
positions = torch.arange(max_seq_len).float()
freqs = torch.outer(positions, theta) # (max_seq_len, dim/2)
self.register_buffer("cos_cache", freqs.cos())
self.register_buffer("sin_cache", freqs.sin())
def rotate_half(self, x: torch.Tensor) -> torch.Tensor:
x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
return torch.cat([-x2, x1], dim=-1)
def apply_rope(self, x: torch.Tensor, seq_len: int) -> torch.Tensor:
cos = self.cos_cache[:seq_len].unsqueeze(0) # (1, T, dim/2)
sin = self.sin_cache[:seq_len].unsqueeze(0)
cos = torch.cat([cos, cos], dim=-1) # (1, T, dim)
sin = torch.cat([sin, sin], dim=-1)
return x * cos + self.rotate_half(x) * sin
def forward(self, x: torch.Tensor) -> torch.Tensor:
B, T, C = x.shape
q = self.q_proj(x)
k = self.k_proj(x)
v = self.v_proj(x)
# apply RoPE to queries and keys only — not values
q = self.apply_rope(q, T)
k = self.apply_rope(k, T)
att = (q @ k.transpose(-2, -1)) * (C ** -0.5)
att = torch.nn.functional.softmax(att, dim=-1)
return self.out_proj(att @ v)
# test
attn = RoPEAttention(embed_dim=64)
x = torch.randn(2, 16, 64) # batch=2, seq_len=16, dim=64
out = attn(x)
print(out.shape) # torch.Size([2, 16, 64])

The values do not get RoPE applied — only queries and keys. This is intentional. RoPE is about how tokens relate to each other (captured in attention scores), not about what information each token carries (captured in values).
Learned vs RoPE vs ALiBi: which to use
The three main approaches you will encounter when reading transformer code:
| Method | How it works | Used in | Context extrapolation |
|---|---|---|---|
| Learned (absolute) | Trainable vector per position, added to token embeddings | Original BERT, GPT-2 | Poor — fails beyond training length |
| Sinusoidal | Fixed sin/cos waves, added to token embeddings | Original transformer (2017) | Limited |
| RoPE | Rotates Q and K vectors by position-dependent angle | LLaMA, Mistral, Qwen, DeepSeek | Good, especially with YaRN scaling |
| ALiBi | Subtracts linear penalty from attention scores based on token distance | MPT, older Bloom-style models | Excellent extrapolation, fewer parameters |
Several modern LLMs, including Qwen and DeepSeek models, are fine-tuned with YaRN to expand their context length. YaRN is an extension of RoPE's frequency scaling that lets a model trained on 4K-token sequences handle 128K or more without retraining from scratch.
If you are reading the LLaMA or Mistral source code and see apply_rotary_emb, that is RoPE. If you see attention scores being modified with a distance-based penalty matrix, that is ALiBi.
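ALiBi's penalty is simple enough to write out in a few lines. A sketch for a single head with an illustrative slope (real models use a geometric series of slopes across heads):

```python
import torch

T, slope = 5, 0.5  # toy sequence length and an illustrative head slope
pos = torch.arange(T)
# distance from each query position to each earlier key position
dist = (pos.unsqueeze(1) - pos.unsqueeze(0)).clamp(min=0).float()
bias = -slope * dist  # farther keys get larger penalties

print(bias)
# attention becomes: softmax(q @ k.T / sqrt(d) + bias + causal_mask)
# no positional vectors anywhere; distance lives in the scores
```

Because the penalty is a fixed linear function of distance, it applies just as well at sequence lengths never seen in training, which is where ALiBi's extrapolation advantage comes from.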
The full embedding pipeline in a transformer
Putting it together: what actually happens between "token ID" and "the model's first attention layer."
import torch
import torch.nn as nn
class TransformerEmbeddings(nn.Module):
"""
The embedding block that sits at the front of every transformer.
Handles both token embeddings and positional information.
"""
def __init__(
self,
vocab_size: int,
embed_dim: int,
max_seq_len: int,
dropout: float = 0.1
):
super().__init__()
# token embedding: the big lookup table
self.token_emb = nn.Embedding(vocab_size, embed_dim)
# positional embedding: learned, one vector per position
# (in real LLMs this would be RoPE applied in attention instead)
self.pos_emb = nn.Embedding(max_seq_len, embed_dim)
self.dropout = nn.Dropout(dropout)
self.ln = nn.LayerNorm(embed_dim)
def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
B, T = token_ids.shape
# token embeddings: row lookup
tok = self.token_emb(token_ids) # (B, T, embed_dim)
# positional embeddings: position 0, 1, 2, ... T-1
pos = torch.arange(T, device=token_ids.device)
pos = self.pos_emb(pos) # (T, embed_dim)
# add them together — this is where "the token is at position N" gets encoded
x = tok + pos # broadcasting handles batch dim
return self.dropout(self.ln(x)) # normalize + dropout
# numbers from NanoGPT config
emb = TransformerEmbeddings(
vocab_size=50_257,
embed_dim=128,
max_seq_len=256
)
token_ids = torch.randint(0, 50_257, (4, 32)) # batch=4, seq_len=32
out = emb(token_ids)
print(out.shape) # torch.Size([4, 32, 128])
# parameter count
total = sum(p.numel() for p in emb.parameters())
print(f"embedding parameters: {total:,}")
# ≈ 50,257 * 128 + 256 * 128 + 2 * 128 (LayerNorm) = 6,465,920

About 6.5 million parameters just for the embeddings in a NanoGPT-scale model. In GPT-2 Small, with embed_dim=768 and vocab_size=50,257, the token embedding alone has 38.6 million parameters — about a third of the entire 124M-parameter model.
Weight tying — sharing weights between the token embedding and the final output projection — halves this cost and consistently improves performance. It is in every serious LLM codebase.
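In code, weight tying is one line. A sketch at NanoGPT scale, following the pattern used in nanoGPT-style codebases:

```python
import torch.nn as nn

vocab_size, embed_dim = 50_257, 128
token_emb = nn.Embedding(vocab_size, embed_dim)
lm_head = nn.Linear(embed_dim, vocab_size, bias=False)

# tie: the output projection reuses the embedding matrix
# (both are shape (vocab_size, embed_dim), so the same tensor fits both)
lm_head.weight = token_emb.weight
assert lm_head.weight is token_emb.weight

# one shared matrix instead of two
print(f"shared parameters: {token_emb.weight.numel():,}")  # 6,432,896
```

Gradients from both the input lookup and the output projection now flow into the same tensor, so the shared matrix is trained from both directions at once.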
Why embedding quality directly affects Article 5
Article 5 covers sequence modeling and RNNs. The specific problem it addresses is: how do you process a sequence of these embedding vectors over time, maintaining memory of what came earlier?
The quality of the embeddings feeding into that sequence model determines what the model can learn. If "cat" and "dog" are close in embedding space and "cat" and "skyscraper" are far, the sequence model has useful geometric structure to work with. If all embeddings are random noise, the sequence model has to learn meaning from scratch — it cannot.
This is why modern LLMs with good pre-trained embeddings transfer so well. The embedding space encodes general knowledge about how language works. Fine-tuning on a specific task then shapes the rest of the network around that already-structured space.
Autoregressive LLMs are not inherently optimized for producing high-quality representations because they are trained with a next-token prediction objective, not a sequence-level semantic objective. This is why specialized embedding models exist alongside generation models, and why you generally should not use a GPT-style model's hidden states directly for retrieval without some fine-tuning.
For text generation and reasoning tasks, use the LLM's own token embeddings — they are trained end-to-end with the rest of the model. For retrieval, semantic search, and RAG, use a dedicated embedding model (BGE, E5, Nomic, or an API like OpenAI's text-embedding-3-small). The objectives are different and the models are optimized differently.
Next in the series
Article 5 covers sequence modeling and RNNs. You will see how a sequence of embedding vectors gets processed one step at a time, why maintaining memory across long sequences is hard, and why the vanishing gradient problem that Article 3 introduced becomes a serious engineering problem at scale. The reason transformers with attention replaced RNNs — which Article 6 covers — only makes sense after you understand what RNNs were trying to do and where they broke down.
Krunal Kanojiya
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.