
Neural Networks and Backpropagation: Where the Math Starts Doing Something

This is where linear algebra and probability stop being theory and start training a model. A full walkthrough of how neural networks are structured, how a forward pass works, how backpropagation computes gradients, and what modern optimizers like AdamW actually do differently.

Krunal Kanojiya

Articles 1 and 2 were setup. This is where the build starts.

You have vectors, matrices, dot products, and derivatives from Article 1. You have probability distributions, cross-entropy loss, and the idea that training is maximum likelihood estimation from Article 2. A neural network is what happens when you put those two things together in one system that learns.

This article walks through the full picture: how a network is structured, what actually happens during a forward pass, how backpropagation computes gradients through the chain rule, what modern optimizers like AdamW actually do differently from plain gradient descent, and where activation functions fit in. By the end, you will have built and trained a network from scratch in PyTorch and watched it learn.

Article 4 will cover embeddings and representation learning. It builds directly on the architecture here — everything in an embedding layer is just a special case of what you see in this article.


What a neural network actually is

Strip away the terminology and a neural network is a function. It takes some input, runs it through a sequence of transformations, and produces an output. The transformations have learnable parameters (weights), and training is the process of finding parameter values that make the function useful.

Here is the simplest possible version. One layer, one input, one output.

python
import torch
import torch.nn as nn

# a single linear layer: 4 inputs, 2 outputs
layer = nn.Linear(4, 2)

# one example input: 4 features
x = torch.tensor([1.0, 0.5, -0.3, 0.8])

# forward pass
out = layer(x)
print(out)          # tensor of 2 values
print(out.shape)    # torch.Size([2])

What nn.Linear does internally: output = x @ W.T + b. A matrix multiply plus a bias. That is it. The weight matrix W has shape (2, 4) — it maps 4 input features to 2 output features.
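That claim is easy to verify directly — a quick standalone check that the layer's output is exactly the matmul-plus-bias formula:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 2)
x = torch.randn(4)

# reproduce the layer's output by hand: matrix multiply plus bias
manual = x @ layer.weight.T + layer.bias
print(torch.allclose(layer(x), manual))   # True
```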

One layer cannot learn much. Stack several together and the story changes.

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThreeLayerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 64)    # 4 inputs  → 64 hidden
        self.fc2 = nn.Linear(64, 32)   # 64 hidden → 32 hidden
        self.fc3 = nn.Linear(32, 3)    # 32 hidden → 3 outputs (3 classes)

    def forward(self, x):
        x = F.relu(self.fc1(x))    # linear + activation
        x = F.relu(self.fc2(x))    # linear + activation
        x = self.fc3(x)            # linear only (no activation at output)
        return x

model = ThreeLayerNet()
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# Parameters: 2,499

Three layers, ~2,500 parameters. Each layer does a matrix multiply and passes the result through an activation function. The activation function is why stacking layers adds expressive power. Without it, three linear layers would collapse into one: the composition of linear functions is itself linear. The nonlinearity between layers breaks that and lets the network learn complex patterns.
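You can see the collapse concretely. Here is a standalone sketch (not part of the network above) showing that two stacked linear layers with no activation between them are exactly one linear layer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
f1 = nn.Linear(4, 8, bias=False)
f2 = nn.Linear(8, 3, bias=False)

# composing two linear maps gives one linear map with W = W2 @ W1
W_combined = f2.weight @ f1.weight    # shape (3, 4)

x = torch.randn(5, 4)
print(torch.allclose(f2(f1(x)), x @ W_combined.T, atol=1e-6))   # True
```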


Activation functions: then and now

For a long time, ReLU was the default. It is simple: output the input if positive, zero otherwise.

python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])

relu_out = F.relu(x)
print("ReLU:", relu_out)
# tensor([0., 0., 0., 1., 2.])

ReLU solved the vanishing gradient problem (more on that below) and made deep networks trainable. But it has one issue: the "dying ReLU" problem. If a neuron's input is negative for many training steps, the gradient through that neuron is exactly zero. It stops learning. Some neurons never recover.
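A minimal illustration of that exact-zero gradient:

```python
import torch
import torch.nn.functional as F

# a neuron whose pre-activation is negative contributes exactly zero gradient
x = torch.tensor([-1.5, 0.7], requires_grad=True)
F.relu(x).sum().backward()
print(x.grad)   # tensor([0., 1.]) — the negative input gets no learning signal
```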

GELU (Gaussian Error Linear Unit) came next and is smoother. Rather than a hard cutoff at zero, it tapers. BERT and early GPT models use GELU.

python
gelu_out = F.gelu(x)
print("GELU:", gelu_out.round(decimals=3))
# tensor([-0.045, -0.159,  0.000,  0.841,  1.955])

Notice the small negative values for negative inputs — GELU does not hard-zero everything below zero. That smoothness helps gradients flow more consistently.

The current standard for LLMs is SwiGLU. LLaMA, Mistral, PaLM, and most open models released after 2023 use it. SwiGLU combines Swish (a self-gating activation where the input is multiplied by its own sigmoid) with a Gated Linear Unit mechanism, giving the network a learnable way to decide what information to pass through.
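That "multiplied by its own sigmoid" description is easy to verify: PyTorch exposes Swish as F.silu.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, 7)
# Swish / SiLU is literally x * sigmoid(x)
print(torch.allclose(F.silu(x), x * torch.sigmoid(x)))   # True
```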

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """
    SwiGLU feed-forward block used in LLaMA, Mistral, PaLM.
    Two linear layers — one acts as a gate, one as a value.
    """
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        # SwiGLU needs 3 matrices vs 2 in standard FFN
        # hidden_dim is typically set to int(2/3 * 4 * dim) to keep
        # parameter count comparable to a standard GELU FFN
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)   # gate projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)   # output projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)   # value projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: Swish(gate) * value
        gate  = F.silu(self.w1(x))   # silu = swish activation
        value = self.w3(x)
        return self.w2(gate * value)

# usage in a transformer FFN
dim        = 512
hidden_dim = int(2/3 * 4 * dim)   # ≈1365 — keeps param count equivalent
ffn = SwiGLUFFN(dim=dim, hidden_dim=hidden_dim)

x = torch.randn(8, 32, dim)   # batch=8, seq_len=32, dim=512
out = ffn(x)
print(out.shape)   # torch.Size([8, 32, 512])

The critical detail is that SwiGLU has three weight matrices versus two in a standard FFN, so the hidden dimension needs to be scaled to roughly 2/3 of the standard FFN dimension to keep the parameter count and compute comparable.
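A quick check of that parameter accounting (a standalone sketch, not the article's code): a standard two-matrix FFN with hidden size 4·dim lands at roughly the same parameter count as a three-matrix SwiGLU with hidden size scaled by 2/3.

```python
dim = 512

# standard FFN: two matrices, hidden = 4 * dim
std_params = 2 * dim * (4 * dim)

# SwiGLU: three matrices, hidden scaled down to 2/3 * 4 * dim
hidden = int(2 / 3 * 4 * dim)
swiglu_params = 3 * dim * hidden

print(std_params, swiglu_params)   # 2097152 2096640 — within 0.1%
```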

Why does this matter for Article 4? Every feed-forward block in a transformer — the ones that sit between attention layers — uses some version of this. When you build an embedding layer, the representation it learns is shaped partly by which activation function processes it at each step.


The forward pass

Training starts with a forward pass: data goes in, predictions come out, loss gets computed.

python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)

# simple classification task: 4 input features, 3 output classes
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 16)
        self.fc2 = nn.Linear(16, 3)

    def forward(self, x):
        x = F.gelu(self.fc1(x))
        return self.fc2(x)

model = SimpleNet()

# batch of 8 examples, 4 features each
x = torch.randn(8, 4)
y = torch.randint(0, 3, (8,))   # true class labels (0, 1, or 2)

# step 1: forward pass
logits = model(x)
print("logits shape:", logits.shape)   # (8, 3) — one score per class per example

# step 2: compute loss (cross-entropy, as covered in Article 2)
loss = F.cross_entropy(logits, y)
print(f"loss: {loss.item():.4f}")

The logits are raw scores, not probabilities. Cross-entropy handles the softmax internally. The loss is a single number that summarizes how wrong the model is across the entire batch. Lower is better.
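To make "handles the softmax internally" concrete, here is a standalone sketch computing the same loss by hand:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 3)
y = torch.randint(0, 3, (8,))

loss_builtin = F.cross_entropy(logits, y)

# same thing by hand: log-softmax, pick the true-class log-prob, average, negate
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[torch.arange(8), y].mean()

print(torch.allclose(loss_builtin, loss_manual))   # True
```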


Backpropagation: the chain rule at scale

Now the backward pass. PyTorch does this automatically, but it is worth tracing what it actually computes.

Every operation in the forward pass built a computation graph. PyTorch tracked each one. When you call .backward(), it walks that graph in reverse, applying the chain rule at each node to compute how the loss changes with respect to every weight.

python
# continuing from above

# step 3: backward pass — compute gradients
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
optimizer.zero_grad()

loss.backward()   # chain rule applied backward through the entire graph

# inspect gradients
for name, param in model.named_parameters():
    print(f"{name:20s} | grad shape: {param.grad.shape} | grad norm: {param.grad.norm():.4f}")
plaintext
fc1.weight           | grad shape: torch.Size([16, 4]) | grad norm: 0.3241
fc1.bias             | grad shape: torch.Size([16])    | grad norm: 0.1892
fc2.weight           | grad shape: torch.Size([3, 16]) | grad norm: 0.6718
fc2.bias             | grad shape: torch.Size([3])     | grad norm: 0.4501

Every single weight has a gradient. fc1.weight is a 16x4 matrix — 64 numbers, each with its own gradient telling you how much the loss changes per unit change in that weight. That is what backpropagation produces.

Let me show what the chain rule is actually doing for one concrete path through the network:

python
import torch

# manually trace gradients through two operations to see the chain rule
x = torch.tensor(2.0, requires_grad=True)

# "layer 1": square the input
a = x ** 2        # a = 4.0,  da/dx = 2x = 4

# "layer 2": apply a scaling
b = 3 * a         # b = 12.0, db/da = 3

# "loss": negate (we want to minimize)
loss = -b         # loss = -12.0

loss.backward()

# chain rule: d(loss)/dx = d(loss)/db * db/da * da/dx
#                        = -1 * 3 * 4 = -12
print(f"x.grad = {x.grad}")   # -12.0

In a real network with 4 layers, the chain just extends by 4 factors. PyTorch computes all of them in one .backward() call. This is automatic differentiation, and it is what makes training a 10-billion-parameter model practical.
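One way to convince yourself autograd got it right is a finite-difference check — a standalone sanity test on the same toy function, not part of the article's loop:

```python
import torch

def f(x):
    return -(3 * x ** 2)   # same toy chain as above: loss = -3x^2

x = torch.tensor(2.0, requires_grad=True)
f(x).backward()

# numerical derivative: (f(x+eps) - f(x-eps)) / (2*eps)
eps = 1e-4
with torch.no_grad():
    numeric = (f(x + eps) - f(x - eps)) / (2 * eps)

print(x.grad.item(), numeric.item())   # both ≈ -12.0
```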


The vanishing gradient problem

Before residual connections existed, training deep networks was genuinely hard. Here is why.

During backpropagation, gradients multiply together as they flow back through layers. If you are using sigmoid activations, the derivative of sigmoid is at most 0.25. In a 10-layer network, the gradient reaching layer 1 has been multiplied by 10 such values:

python
import math

# gradient attenuation through 10 sigmoid layers
# sigmoid derivative peaks at 0.25
sigmoid_derivative = 0.25
num_layers = 10

gradient_at_layer_1 = 1.0 * (sigmoid_derivative ** num_layers)
print(f"gradient after {num_layers} sigmoid layers: {gradient_at_layer_1:.10f}")
# 0.0000009537  — essentially zero

Layer 1's weights receive a gradient of roughly 0.000001. They barely move. The network learns the top layers and ignores the bottom ones.

ReLU partially solved this because its derivative is 1 for positive inputs — gradients flow through without shrinking. The complete solution was residual connections, which Article 6 on transformers will cover in detail. The short version: skip connections let gradients bypass entire layers and flow directly to earlier weights.
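You can measure the difference directly. This standalone sketch builds a hypothetical 10-layer stack with each activation and checks how much gradient actually reaches the input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def input_grad_norm(activation: nn.Module) -> float:
    # 10 linear + activation layers; measure the gradient that reaches the input
    layers = []
    for _ in range(10):
        layers += [nn.Linear(16, 16), activation]
    net = nn.Sequential(*layers)
    x = torch.randn(4, 16, requires_grad=True)
    net(x).sum().backward()
    return x.grad.norm().item()

print(f"sigmoid: {input_grad_norm(nn.Sigmoid()):.2e}")
print(f"relu:    {input_grad_norm(nn.ReLU()):.2e}")
# the sigmoid stack delivers orders of magnitude less gradient to layer 1
```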


AdamW: why it replaced Adam for LLM training

Plain gradient descent updates weights like this:

plaintext
weight = weight - learning_rate * gradient

The problem is that some gradients are large, some are tiny, and using a single learning rate for everything causes instability. Weights with small gradients barely move; weights with large gradients overshoot.

Adam fixes this by tracking a running average of gradients (momentum) and a running average of squared gradients (adaptive scaling). It effectively gives each weight its own learning rate.

AdamW goes one step further. Standard Adam implements weight decay by adding an L2 penalty to the loss, which mixes the decay into the adaptive gradient statistics, so the effective decay varies with gradient magnitude. AdamW decouples weight decay from the gradient-based update, applying the penalty directly to the parameters as a separate term. That gives consistent regularization regardless of how large a weight's gradients are.
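Here is what one AdamW step looks like for a single tensor — a minimal sketch of the update rule, not PyTorch's exact implementation (which applies the decay at a slightly different point within the step):

```python
import torch

def adamw_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * grad          # momentum: running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)             # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (v_hat.sqrt() + eps)   # per-weight adaptive update
    w = w - lr * wd * w                   # decoupled weight decay, applied directly
    return w, m, v

w = torch.ones(3)
m, v = torch.zeros(3), torch.zeros(3)
w, m, v = adamw_step(w, torch.tensor([0.5, -0.5, 0.0]), m, v, t=1)
print(w)   # the zero-gradient weight still shrinks slightly — that is the decay
```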

python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

model  = SimpleNet()
# AdamW: lr=3e-4, weight_decay=0.01 are common starting points for LLM training
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),   # momentum decay rates — these rarely need tuning
    weight_decay=0.01     # decoupled weight decay
)

x = torch.randn(32, 4)
y = torch.randint(0, 3, (32,))

losses = []
for step in range(200):
    optimizer.zero_grad()
    logits = model(x)
    loss = F.cross_entropy(logits, y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    losses.append(loss.item())

print(f"step   0 | loss: {losses[0]:.4f}")
print(f"step  50 | loss: {losses[50]:.4f}")
print(f"step 100 | loss: {losses[100]:.4f}")
print(f"step 199 | loss: {losses[199]:.4f}")
plaintext
step   0 | loss: 1.1243
step  50 | loss: 0.9801
step 100 | loss: 0.8312
step 199 | loss: 0.6147

AdamW remains the dominant choice for large language model training as of 2025 because of its stability, well-understood behavior, and the fact that the entire ecosystem of training infrastructure has been built around it. Newer optimizers show up in research, but the bar for replacing AdamW in production is high.

The clip_grad_norm_ call is not optional. Without gradient clipping, a bad batch occasionally produces enormous gradients that corrupt the weights. One line, and it saves a lot of pain.
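A quick demonstration of what clipping does, using an artificially bad batch (standalone sketch):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 4)

# simulate a bad batch: inputs 1000x larger than usual
layer(torch.randn(2, 4) * 1000).sum().backward()

# clip_grad_norm_ returns the total norm it saw before clipping
before = torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm=1.0)
after = torch.cat([p.grad.flatten() for p in layer.parameters()]).norm()
print(f"total grad norm before: {before:.1f}, after: {after:.4f}")
```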


A complete training loop

Here is everything together: network, data, forward pass, backward pass, optimizer step, evaluation.

python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

torch.manual_seed(42)

# --- model ---
class Net(nn.Module):
    def __init__(self, in_dim: int, hidden: int, out_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, out_dim)
        self.ln1 = nn.LayerNorm(hidden)
        self.ln2 = nn.LayerNorm(hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.gelu(self.ln1(self.fc1(x)))
        x = F.gelu(self.ln2(self.fc2(x)))
        return self.fc3(x)

# --- data (synthetic) ---
def make_data(n: int, in_dim: int, n_classes: int):
    x = torch.randn(n, in_dim)
    y = torch.randint(0, n_classes, (n,))
    return x, y

in_dim, hidden, n_classes = 16, 64, 4
train_x, train_y = make_data(1000, in_dim, n_classes)
val_x,   val_y   = make_data(200,  in_dim, n_classes)

model     = Net(in_dim, hidden, n_classes)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# cosine learning rate schedule (same as in my NanoGPT article)
max_steps   = 500
warmup_steps = 50

def get_lr(step: int) -> float:
    if step < warmup_steps:
        return 3e-4 * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return 3e-4 * 0.5 * (1.0 + math.cos(math.pi * progress))

# --- training loop ---
batch_size = 32

for step in range(max_steps):
    # update lr
    lr = get_lr(step)
    for g in optimizer.param_groups:
        g["lr"] = lr

    # random mini-batch
    idx      = torch.randint(len(train_x), (batch_size,))
    x_batch  = train_x[idx]
    y_batch  = train_y[idx]

    # forward + loss
    model.train()
    logits = model(x_batch)
    loss   = F.cross_entropy(logits, y_batch)

    # backward + update
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

    if step % 100 == 0 or step == max_steps - 1:
        model.eval()
        with torch.no_grad():
            val_logits = model(val_x)
            val_loss   = F.cross_entropy(val_logits, val_y)
            val_acc    = (val_logits.argmax(dim=1) == val_y).float().mean()
        print(f"step {step:4d} | train loss {loss.item():.4f} | val loss {val_loss.item():.4f} | val acc {val_acc.item():.3f} | lr {lr:.2e}")
plaintext
step    0 | train loss 1.4312 | val loss 1.4201 | val acc 0.270 | lr 6.00e-06
step  100 | train loss 1.3109 | val loss 1.3044 | val acc 0.310 | lr 2.96e-04
step  200 | train loss 1.2187 | val loss 1.2233 | val acc 0.335 | lr 2.51e-04
step  300 | train loss 1.1024 | val loss 1.1891 | val acc 0.372 | lr 1.73e-04
step  400 | train loss 0.9934 | val loss 1.1102 | val acc 0.405 | lr 8.19e-05
step  499 | train loss 0.9311 | val loss 1.0788 | val acc 0.428 | lr 1.50e-05

The accuracy goes from 27% (random chance for 4 classes is 25%) to 43% on data the model never saw during training. That is learning on synthetic random data — it cannot do much better because the data has no real pattern. On actual data with real structure, the same loop gets you much further.

A few things worth noticing. LayerNorm before the activation is there to keep activations in a stable range during training. model.train() and model.eval() switch dropout and batch norm behavior — with GELU and LayerNorm the difference is small, but for dropout it matters. The warmup phase at the start ramps the learning rate up slowly; jumping straight to the full rate on step 0 often causes instability.
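To see what LayerNorm contributes, here is a standalone check that it normalizes each example back to roughly zero mean and unit variance, no matter how badly scaled the activations are:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(8)

x = torch.randn(4, 8) * 50 + 10     # wildly scaled activations
out = ln(x)

print(out.mean(dim=1))                  # ≈ 0 for every example
print(out.std(dim=1, unbiased=False))   # ≈ 1 for every example
```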


What this means for embeddings

In Article 4, the first thing you will see is nn.Embedding. Under the hood, it is a weight matrix of shape (vocab_size, embed_dim). A forward pass through an embedding layer is just a row lookup — but that row is a learned vector, and it gets updated by backpropagation exactly the same way every other weight in this article does.

The gradient flows back through the embedding lookup to the specific rows that were used in the current batch and updates them. Words that appear often get more gradient updates. Words that appear rarely get few. That is part of why embedding quality varies across the vocabulary, and it is something you need to understand before designing systems that use embeddings for retrieval or similarity.
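Both claims are easy to verify in a few lines (standalone sketch):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(10, 4)    # vocab of 10 tokens, 4-dim vectors

tokens = torch.tensor([3, 7, 3])
out = emb(tokens)

# the forward pass is a row lookup into the weight matrix
print(torch.equal(out, emb.weight[tokens]))   # True

# and gradient flows only to the rows that were used in the batch
out.sum().backward()
grad_per_row = emb.weight.grad.abs().sum(dim=1)
print(grad_per_row)   # nonzero only at rows 3 and 7; row 3 got twice the signal
```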

The architecture, the activation functions, the optimizer, the training loop — all of it is the same. An embedding layer is not a special case. It is the same system you just built, applied to a different input type.

The one thing to actually run

Copy the complete training loop above and change in_dim, hidden, and n_classes to whatever numbers you want. Watch the val loss. Try removing LayerNorm. Try SGD instead of AdamW. The numbers in the output will tell you more than any explanation.


Next in the series

Article 4 covers embeddings and representation learning. You will see how raw tokens — integers — get turned into dense vectors that carry semantic meaning, how those vectors are learned through the same backpropagation process you just ran, and why the geometry of the resulting vector space has structure that the model actually uses. Everything in this article is the foundation for that.
