
Neural Networks and Backpropagation: Where the Math Starts Doing Something

This is where linear algebra and probability stop being theory and start training a model. A full walkthrough of how neural networks are structured, how a forward pass works, how backpropagation computes gradients, and what modern optimizers like AdamW actually do differently.

Krunal Kanojiya

Articles 1 and 2 were setup. This is where the build starts.

You have vectors, matrices, dot products, and derivatives from Article 1. You have probability distributions, cross-entropy loss, and the idea that training is maximum likelihood estimation from Article 2. A neural network is what happens when you put those two things together in one system that learns.

This article walks through the full picture: how a network is structured, what actually happens during a forward pass, how backpropagation computes gradients through the chain rule, what modern optimizers like AdamW actually do differently from plain gradient descent, and where activation functions fit in. By the end, you will have built and trained a network from scratch in PyTorch and watched it learn.

Article 4 will cover embeddings and representation learning. It builds directly on the architecture here — everything in an embedding layer is just a special case of what you see in this article.


What a neural network actually is

Strip away the terminology and a neural network is a function. It takes some input, runs it through a sequence of transformations, and produces an output. The transformations have learnable parameters (weights), and training is the process of finding parameter values that make the function useful.

Here is the simplest possible version. One layer, one input, one output.

python
import torch
import torch.nn as nn

# a single linear layer: 4 inputs, 2 outputs
layer = nn.Linear(4, 2)

# one example input: 4 features
x = torch.tensor([1.0, 0.5, -0.3, 0.8])

# forward pass
out = layer(x)
print(out)          # tensor of 2 values
print(out.shape)    # torch.Size([2])

What nn.Linear does internally: output = x @ W.T + b. A matrix multiply plus a bias. That is it. The weight matrix W has shape (2, 4) — it maps 4 input features to 2 output features.
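That claim is easy to verify directly — a quick standalone check that the layer's output is exactly the matmul-plus-bias formula:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 2)
x = torch.randn(4)

# reproduce the layer's output by hand: matrix multiply plus bias
manual = x @ layer.weight.T + layer.bias
print(torch.allclose(layer(x), manual))   # True
```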

One layer cannot learn much. Stack several together and the story changes.

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThreeLayerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 64)    # 4 inputs  → 64 hidden
        self.fc2 = nn.Linear(64, 32)   # 64 hidden → 32 hidden
        self.fc3 = nn.Linear(32, 3)    # 32 hidden → 3 outputs (3 classes)

    def forward(self, x):
        x = F.relu(self.fc1(x))    # linear + activation
        x = F.relu(self.fc2(x))    # linear + activation
        x = self.fc3(x)            # linear only (no activation at output)
        return x

model = ThreeLayerNet()
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# Parameters: 2,499

Three layers, ~2,500 parameters. Each layer does a matrix multiply and passes the result through an activation function. The activation function is why stacking layers adds expressive power. Without it, three linear layers would collapse into one: the composition of linear functions is itself linear. The nonlinearity between layers breaks that and lets the network learn complex patterns.
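You can see the collapse concretely. Here is a standalone sketch (not part of the network above) showing that two stacked linear layers with no activation between them are exactly one linear layer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
f1 = nn.Linear(4, 8, bias=False)
f2 = nn.Linear(8, 3, bias=False)

# composing two linear maps gives one linear map with W = W2 @ W1
W_combined = f2.weight @ f1.weight    # shape (3, 4)

x = torch.randn(5, 4)
print(torch.allclose(f2(f1(x)), x @ W_combined.T, atol=1e-6))   # True
```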


Activation functions: then and now

For a long time, ReLU was the default. It is simple: output the input if positive, zero otherwise.

python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])

relu_out = F.relu(x)
print("ReLU:", relu_out)
# tensor([0., 0., 0., 1., 2.])

ReLU solved the vanishing gradient problem (more on that below) and made deep networks trainable. But it has one issue: the "dying ReLU" problem. If a neuron's input is negative for many training steps, the gradient through that neuron is exactly zero. It stops learning. Some neurons never recover.
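A minimal illustration of that exact-zero gradient:

```python
import torch
import torch.nn.functional as F

# a neuron whose pre-activation is negative contributes exactly zero gradient
x = torch.tensor([-1.5, 0.7], requires_grad=True)
F.relu(x).sum().backward()
print(x.grad)   # tensor([0., 1.]) — the negative input gets no learning signal
```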

GELU (Gaussian Error Linear Unit) came next and is smoother. Rather than a hard cutoff at zero, it tapers. BERT and early GPT models use GELU.

python
gelu_out = F.gelu(x)
print("GELU:", gelu_out.round(decimals=3))
# tensor([-0.045, -0.159,  0.000,  0.841,  1.955])

Notice the small negative values for negative inputs — GELU does not hard-zero everything below zero. That smoothness helps gradients flow more consistently.

The current standard for LLMs is SwiGLU. LLaMA, Mistral, PaLM, and most open models released after 2023 use it. SwiGLU combines Swish (a self-gating activation where the input is multiplied by its own sigmoid) with a Gated Linear Unit mechanism, giving the network a learnable way to decide what information to pass through.
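That "multiplied by its own sigmoid" description is easy to verify: PyTorch exposes Swish as F.silu.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, 7)
# Swish / SiLU is literally x * sigmoid(x)
print(torch.allclose(F.silu(x), x * torch.sigmoid(x)))   # True
```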

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """
    SwiGLU feed-forward block used in LLaMA, Mistral, PaLM.
    Two linear layers — one acts as a gate, one as a value.
    """
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        # SwiGLU needs 3 matrices vs 2 in standard FFN
        # hidden_dim is typically set to int(2/3 * 4 * dim) to keep
        # parameter count comparable to a standard GELU FFN
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)   # gate projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)   # output projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)   # value projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: Swish(gate) * value
        gate  = F.silu(self.w1(x))   # silu = swish activation
        value = self.w3(x)
        return self.w2(gate * value)

# usage in a transformer FFN
dim        = 512
hidden_dim = int(2/3 * 4 * dim)   # ≈1365 — keeps param count equivalent
ffn = SwiGLUFFN(dim=dim, hidden_dim=hidden_dim)

x = torch.randn(8, 32, dim)   # batch=8, seq_len=32, dim=512
out = ffn(x)
print(out.shape)   # torch.Size([8, 32, 512])

The critical detail is that SwiGLU has three weight matrices versus two in a standard FFN, so the hidden dimension needs to be scaled to roughly 2/3 of the standard FFN dimension to keep the parameter count and compute comparable.
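A quick check of that parameter accounting (a standalone sketch, not the article's code): a standard two-matrix FFN with hidden size 4·dim lands at roughly the same parameter count as a three-matrix SwiGLU with hidden size scaled by 2/3.

```python
dim = 512

# standard FFN: two matrices, hidden = 4 * dim
std_params = 2 * dim * (4 * dim)

# SwiGLU: three matrices, hidden scaled down to 2/3 * 4 * dim
hidden = int(2 / 3 * 4 * dim)
swiglu_params = 3 * dim * hidden

print(std_params, swiglu_params)   # 2097152 2096640 — within 0.1%
```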

Why does this matter for Article 4? Every feed-forward block in a transformer — the ones that sit between attention layers — uses some version of this. When you build an embedding layer, the representation it learns is shaped partly by which activation function processes it at each step.


The forward pass

Training starts with a forward pass: data goes in, predictions come out, loss gets computed.

python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)

# simple classification task: 4 input features, 3 output classes
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 16)
        self.fc2 = nn.Linear(16, 3)

    def forward(self, x):
        x = F.gelu(self.fc1(x))
        return self.fc2(x)

model = SimpleNet()

# batch of 8 examples, 4 features each
x = torch.randn(8, 4)
y = torch.randint(0, 3, (8,))   # true class labels (0, 1, or 2)

# step 1: forward pass
logits = model(x)
print("logits shape:", logits.shape)   # (8, 3) — one score per class per example

# step 2: compute loss (cross-entropy, as covered in Article 2)
loss = F.cross_entropy(logits, y)
print(f"loss: {loss.item():.4f}")

The logits are raw scores, not probabilities. Cross-entropy handles the softmax internally. The loss is a single number that summarizes how wrong the model is across the entire batch. Lower is better.
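To make "handles the softmax internally" concrete, here is a standalone sketch computing the same loss by hand:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 3)
y = torch.randint(0, 3, (8,))

loss_builtin = F.cross_entropy(logits, y)

# same thing by hand: log-softmax, pick the true-class log-prob, average, negate
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[torch.arange(8), y].mean()

print(torch.allclose(loss_builtin, loss_manual))   # True
```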


Backpropagation: the chain rule at scale

Now the backward pass. PyTorch does this automatically, but it is worth tracing what it actually computes.

Every operation in the forward pass built a computation graph. PyTorch tracked each one. When you call .backward(), it walks that graph in reverse, applying the chain rule at each node to compute how the loss changes with respect to every weight.

python
# continuing from above

# step 3: backward pass — compute gradients
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
optimizer.zero_grad()

loss.backward()   # chain rule applied backward through the entire graph

# inspect gradients
for name, param in model.named_parameters():
    print(f"{name:20s} | grad shape: {param.grad.shape} | grad norm: {param.grad.norm():.4f}")
plaintext
fc1.weight           | grad shape: torch.Size([16, 4]) | grad norm: 0.3241
fc1.bias             | grad shape: torch.Size([16])    | grad norm: 0.1892
fc2.weight           | grad shape: torch.Size([3, 16]) | grad norm: 0.6718
fc2.bias             | grad shape: torch.Size([3])     | grad norm: 0.4501

Every single weight has a gradient. fc1.weight is a 16x4 matrix — 64 numbers, each with its own gradient telling you how much the loss changes per unit change in that weight. That is what backpropagation produces.

Let me show what the chain rule is actually doing for one concrete path through the network:

python
import torch

# manually trace gradients through two operations to see the chain rule
x = torch.tensor(2.0, requires_grad=True)

# "layer 1": square the input
a = x ** 2        # a = 4.0,  da/dx = 2x = 4

# "layer 2": apply a scaling
b = 3 * a         # b = 12.0, db/da = 3

# "loss": negate (we want to minimize)
loss = -b         # loss = -12.0

loss.backward()

# chain rule: d(loss)/dx = d(loss)/db * db/da * da/dx
#                        = -1 * 3 * 4 = -12
print(f"x.grad = {x.grad}")   # -12.0

In a real network with 4 layers, the chain just extends by 4 factors. PyTorch computes all of them in one .backward() call. This is automatic differentiation, and it is what makes training a 10-billion-parameter model practical.
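One way to convince yourself autograd got it right is a finite-difference check — a standalone sanity test on the same toy function, not part of the article's loop:

```python
import torch

def f(x):
    return -(3 * x ** 2)   # same toy chain as above: loss = -3x^2

x = torch.tensor(2.0, requires_grad=True)
f(x).backward()

# numerical derivative: (f(x+eps) - f(x-eps)) / (2*eps)
eps = 1e-4
with torch.no_grad():
    numeric = (f(x + eps) - f(x - eps)) / (2 * eps)

print(x.grad.item(), numeric.item())   # both ≈ -12.0
```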


The vanishing gradient problem

Before residual connections existed, training deep networks was genuinely hard. Here is why.

During backpropagation, gradients multiply together as they flow back through layers. If you are using sigmoid activations, the derivative of sigmoid is at most 0.25. In a 10-layer network, the gradient reaching layer 1 has been multiplied by 10 such values:

python
import math

# gradient attenuation through 10 sigmoid layers
# sigmoid derivative peaks at 0.25
sigmoid_derivative = 0.25
num_layers = 10

gradient_at_layer_1 = 1.0 * (sigmoid_derivative ** num_layers)
print(f"gradient after {num_layers} sigmoid layers: {gradient_at_layer_1:.10f}")
# 0.0000009537  — essentially zero

Layer 1's weights receive a gradient of roughly 0.000001. They barely move. The network learns the top layers and ignores the bottom ones.

ReLU partially solved this because its derivative is 1 for positive inputs — gradients flow through without shrinking. The complete solution was residual connections, which Article 6 on transformers will cover in detail. The short version: skip connections let gradients bypass entire layers and flow directly to earlier weights.
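You can measure the difference directly. This standalone sketch builds a hypothetical 10-layer stack with each activation and checks how much gradient actually reaches the input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def input_grad_norm(activation: nn.Module) -> float:
    # 10 linear + activation layers; measure the gradient that reaches the input
    layers = []
    for _ in range(10):
        layers += [nn.Linear(16, 16), activation]
    net = nn.Sequential(*layers)
    x = torch.randn(4, 16, requires_grad=True)
    net(x).sum().backward()
    return x.grad.norm().item()

print(f"sigmoid: {input_grad_norm(nn.Sigmoid()):.2e}")
print(f"relu:    {input_grad_norm(nn.ReLU()):.2e}")
# the sigmoid stack delivers orders of magnitude less gradient to layer 1
```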


AdamW: why it replaced Adam for LLM training

Plain gradient descent updates weights like this:

plaintext
weight = weight - learning_rate * gradient

The problem is that some gradients are large, some are tiny, and using a single learning rate for everything causes instability. Weights with small gradients barely move; weights with large gradients overshoot.

Adam fixes this by tracking a running average of gradients (momentum) and a running average of squared gradients (adaptive scaling). It effectively gives each weight its own learning rate.

AdamW goes one step further. Standard Adam implements weight decay by adding an L2 penalty to the loss, which mixes the decay into the adaptive gradient statistics, so the effective decay varies with gradient magnitude. AdamW decouples weight decay from the gradient-based update, applying the penalty directly to the parameters as a separate term. That gives consistent regularization regardless of how large a weight's gradients are.
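Here is what one AdamW step looks like for a single tensor — a minimal sketch of the update rule, not PyTorch's exact implementation (which applies the decay at a slightly different point within the step):

```python
import torch

def adamw_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * grad          # momentum: running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)             # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (v_hat.sqrt() + eps)   # per-weight adaptive update
    w = w - lr * wd * w                   # decoupled weight decay, applied directly
    return w, m, v

w = torch.ones(3)
m, v = torch.zeros(3), torch.zeros(3)
w, m, v = adamw_step(w, torch.tensor([0.5, -0.5, 0.0]), m, v, t=1)
print(w)   # the zero-gradient weight still shrinks slightly — that is the decay
```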

python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

model  = SimpleNet()
# AdamW: lr=3e-4, weight_decay=0.01 are common starting points for LLM training
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),   # momentum decay rates — these rarely need tuning
    weight_decay=0.01     # decoupled weight decay
)

x = torch.randn(32, 4)
y = torch.randint(0, 3, (32,))

losses = []
for step in range(200):
    optimizer.zero_grad()
    logits = model(x)
    loss = F.cross_entropy(logits, y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    losses.append(loss.item())

print(f"step   0 | loss: {losses[0]:.4f}")
print(f"step  50 | loss: {losses[50]:.4f}")
print(f"step 100 | loss: {losses[100]:.4f}")
print(f"step 199 | loss: {losses[199]:.4f}")
plaintext
step   0 | loss: 1.1243
step  50 | loss: 0.9801
step 100 | loss: 0.8312
step 199 | loss: 0.6147

AdamW remains the dominant choice for large language model training as of 2025 because of its stability, well-understood behavior, and the fact that the entire ecosystem of training infrastructure has been built around it. Newer optimizers show up in research, but the bar for replacing AdamW in production is high.

The clip_grad_norm_ call is not optional. Without gradient clipping, a bad batch occasionally produces enormous gradients that corrupt the weights. One line, and it saves a lot of pain.
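A quick demonstration of what clipping does, using an artificially bad batch (standalone sketch):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 4)

# simulate a bad batch: inputs 1000x larger than usual
layer(torch.randn(2, 4) * 1000).sum().backward()

# clip_grad_norm_ returns the total norm it saw before clipping
before = torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm=1.0)
after = torch.cat([p.grad.flatten() for p in layer.parameters()]).norm()
print(f"total grad norm before: {before:.1f}, after: {after:.4f}")
```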


A complete training loop

Here is everything together: network, data, forward pass, backward pass, optimizer step, evaluation.

python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

torch.manual_seed(42)

# --- model ---
class Net(nn.Module):
    def __init__(self, in_dim: int, hidden: int, out_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, out_dim)
        self.ln1 = nn.LayerNorm(hidden)
        self.ln2 = nn.LayerNorm(hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.gelu(self.ln1(self.fc1(x)))
        x = F.gelu(self.ln2(self.fc2(x)))
        return self.fc3(x)

# --- data (synthetic) ---
def make_data(n: int, in_dim: int, n_classes: int):
    x = torch.randn(n, in_dim)
    y = torch.randint(0, n_classes, (n,))
    return x, y

in_dim, hidden, n_classes = 16, 64, 4
train_x, train_y = make_data(1000, in_dim, n_classes)
val_x,   val_y   = make_data(200,  in_dim, n_classes)

model     = Net(in_dim, hidden, n_classes)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# cosine learning rate schedule (same as in my NanoGPT article)
max_steps   = 500
warmup_steps = 50

def get_lr(step: int) -> float:
    if step < warmup_steps:
        return 3e-4 * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return 3e-4 * 0.5 * (1.0 + math.cos(math.pi * progress))

# --- training loop ---
batch_size = 32

for step in range(max_steps):
    # update lr
    lr = get_lr(step)
    for g in optimizer.param_groups:
        g["lr"] = lr

    # random mini-batch
    idx      = torch.randint(len(train_x), (batch_size,))
    x_batch  = train_x[idx]
    y_batch  = train_y[idx]

    # forward + loss
    model.train()
    logits = model(x_batch)
    loss   = F.cross_entropy(logits, y_batch)

    # backward + update
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

    if step % 100 == 0 or step == max_steps - 1:
        model.eval()
        with torch.no_grad():
            val_logits = model(val_x)
            val_loss   = F.cross_entropy(val_logits, val_y)
            val_acc    = (val_logits.argmax(dim=1) == val_y).float().mean()
        print(f"step {step:4d} | train loss {loss.item():.4f} | val loss {val_loss.item():.4f} | val acc {val_acc.item():.3f} | lr {lr:.2e}")
plaintext
step    0 | train loss 1.4312 | val loss 1.4201 | val acc 0.270 | lr 6.00e-06
step  100 | train loss 1.3109 | val loss 1.3044 | val acc 0.310 | lr 2.96e-04
step  200 | train loss 1.2187 | val loss 1.2233 | val acc 0.335 | lr 2.51e-04
step  300 | train loss 1.1024 | val loss 1.1891 | val acc 0.372 | lr 1.73e-04
step  400 | train loss 0.9934 | val loss 1.1102 | val acc 0.405 | lr 8.19e-05
step  499 | train loss 0.9311 | val loss 1.0788 | val acc 0.428 | lr 1.50e-05

The accuracy goes from 27% (random chance for 4 classes is 25%) to 43% on data the model never saw during training. That is learning on synthetic random data — it cannot do much better because the data has no real pattern. On actual data with real structure, the same loop gets you much further.

A few things worth noticing. LayerNorm before the activation is there to keep activations in a stable range during training. model.train() and model.eval() switch dropout and batch norm behavior — with GELU and LayerNorm the difference is small, but for dropout it matters. The warmup phase at the start ramps the learning rate up slowly; jumping straight to the full rate on step 0 often causes instability.
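To see what LayerNorm contributes, here is a standalone check that it normalizes each example back to roughly zero mean and unit variance, no matter how badly scaled the activations are:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(8)

x = torch.randn(4, 8) * 50 + 10     # wildly scaled activations
out = ln(x)

print(out.mean(dim=1))                  # ≈ 0 for every example
print(out.std(dim=1, unbiased=False))   # ≈ 1 for every example
```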


What this means for embeddings

In Article 4, the first thing you will see is nn.Embedding. Under the hood, it is a weight matrix of shape (vocab_size, embed_dim). A forward pass through an embedding layer is just a row lookup — but that row is a learned vector, and it gets updated by backpropagation exactly the same way every other weight in this article does.

The gradient flows back through the embedding lookup to the specific rows that were used in the current batch and updates them. Words that appear often get more gradient updates. Words that appear rarely get few. That is part of why embedding quality varies across the vocabulary, and it is something you need to understand before designing systems that use embeddings for retrieval or similarity.
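Both claims are easy to verify in a few lines (standalone sketch):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(10, 4)    # vocab of 10 tokens, 4-dim vectors

tokens = torch.tensor([3, 7, 3])
out = emb(tokens)

# the forward pass is a row lookup into the weight matrix
print(torch.equal(out, emb.weight[tokens]))   # True

# and gradient flows only to the rows that were used in the batch
out.sum().backward()
grad_per_row = emb.weight.grad.abs().sum(dim=1)
print(grad_per_row)   # nonzero only at rows 3 and 7; row 3 got twice the signal
```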

The architecture, the activation functions, the optimizer, the training loop — all of it is the same. An embedding layer is not a special case. It is the same system you just built, applied to a different input type.

The one thing to actually run

Copy the complete training loop above and change in_dim, hidden, and n_classes to whatever numbers you want. Watch the val loss. Try removing LayerNorm. Try SGD instead of AdamW. The numbers in the output will tell you more than any explanation.


Next in the series

Article 4 covers embeddings and representation learning. You will see how raw tokens — integers — get turned into dense vectors that carry semantic meaning, how those vectors are learned through the same backpropagation process you just ran, and why the geometry of the resulting vector space has structure that the model actually uses. Everything in this article is the foundation for that.
