
Linear Algebra and Calculus: The Math Your AI Model Runs On

Before you build LLMs or AI products, you need to understand the math underneath. This article breaks down vectors, matrices, dot products, derivatives, and gradients — the exact operations that run inside every neural network.

Krunal Kanojiya

I kept bouncing off AI papers.

Not because the ideas were too hard. The ideas were fine. I would follow the concept, understand the intuition, nod along. Then I would hit an equation with a matrix transpose or a partial derivative and my brain would just skip it. I told myself it was notation. It was not notation.

The math is load-bearing. Every operation that runs inside a neural network, every weight update during training, every attention score in a transformer: it is all linear algebra and calculus. If you have been treating those sections as decorations you can scroll past, this article is for you.

This is Article 1 in a series on AI and ML fundamentals you need before building real LLM products. We start here because everything else depends on it.


Why this math specifically

Machine learning leans on two branches of math.

Linear algebra handles the structure: how data is represented, stored, and transformed. Neural networks are, at a mechanical level, sequences of matrix operations applied to vectors. That is what the "computation" actually is.

Calculus handles the learning: how models figure out which direction to adjust their weights to get better. Specifically you need derivatives and the chain rule.

You do not need a full university course in either. You need the subset that shows up in neural networks. This article covers that subset.


Scalars, vectors, and matrices

These are the three basic objects.

A scalar is a single number. A learning rate of 0.001 is a scalar. A temperature of 0.8 during text generation is a scalar.

A vector is an ordered list of numbers. In ML this usually represents something concrete:

python
word_embedding = [0.12, -0.45, 0.87, 0.03, ...]  # 128 numbers for one word
pixel_row      = [255, 128, 64, 200, ...]          # one row of an image
hidden_state   = [0.91, -0.23, 0.44, ...]         # a neuron's activations

The length of the vector is its dimension. A 128-dimensional word embedding has 128 numbers in it. That number matters because every weight matrix in the model is sized to match it.
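As a quick sketch of that shape-matching rule (the sizes here are illustrative, not taken from any particular model):

```python
import numpy as np

embedding = np.random.randn(128)   # a 128-dimensional vector
W = np.random.randn(128, 256)      # a weight matrix sized to match it

print(embedding.shape)        # (128,)
print((embedding @ W).shape)  # (256,) — the layer maps 128 dims to 256

# a vector of the wrong dimension fails immediately
try:
    np.random.randn(64) @ W
except ValueError as e:
    print("shape mismatch:", e)
```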

A matrix is a 2D grid of numbers, organized in rows and columns.

python
import numpy as np

W = np.array([
    [0.2, -0.5,  0.8],
    [0.1,  0.9, -0.3],
    [-0.4, 0.3,  0.7],
    [0.6, -0.1,  0.2]
])

print(W.shape)  # (4, 3) — 4 rows, 3 columns

The weights of a neural network layer live in matrices like this. The shape tells you the input and output dimensions of that layer.


Dot products: the operation that runs everything

The dot product is the most important operation in all of neural network math. You need to genuinely understand it, not just know the formula.

Take two vectors of the same length. Multiply each pair of corresponding elements, then add all those products together.

python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

dot = np.dot(a, b)
# = (1*4) + (2*5) + (3*6)
# = 4 + 10 + 18
# = 32

print(dot)  # 32

What this measures is similarity. If two vectors point in the same direction in space, their dot product is large and positive. If they are perpendicular, it is zero. If they point in opposite directions, it is negative.

This is exactly why attention in transformers works. When the model computes attention scores, it takes the dot product of a query vector against a set of key vectors. High dot product means high relevance. The model literally measures "how aligned is what this token is looking for with what each other token has to offer."
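To see the similarity interpretation concretely, here is a minimal sketch with three hand-picked vectors (the numbers are made up for illustration):

```python
import numpy as np

query = np.array([1.0, 2.0, 0.0])

aligned       = np.array([2.0, 4.0, 0.0])    # same direction as query
perpendicular = np.array([-2.0, 1.0, 0.0])   # at 90 degrees to query
opposite      = np.array([-1.0, -2.0, 0.0])  # opposite direction

print(np.dot(query, aligned))        # 10.0 — large and positive
print(np.dot(query, perpendicular))  # 0.0  — no alignment
print(np.dot(query, opposite))       # -5.0 — negative
```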


Matrix multiplication

Matrix multiplication extends the dot product to entire collections of vectors at once. Each output element is the dot product of a row from the left matrix and a column from the right matrix.

python
import numpy as np

# input: batch of 3 tokens, each 4-dimensional
X = np.random.randn(3, 4)

# weight matrix: transforms from 4 dimensions to 5 dimensions
W = np.random.randn(4, 5)

# output: batch of 3 tokens, each now 5-dimensional
output = X @ W  # @ is the matrix multiply operator in Python
print(output.shape)  # (3, 5)

The shape rule: (m, n) @ (n, p) gives (m, p). The inner dimensions must match, and they cancel out.

Every layer in a neural network does this. The input goes in, the weight matrix transforms it, the output comes out. Stack twelve of these transformations with some nonlinearities between them and you have something like GPT.

python
import torch
import torch.nn as nn

# a single linear layer: 128 in, 256 out
layer = nn.Linear(128, 256)

# input: batch of 32 tokens, each 128-dimensional
x = torch.randn(32, 128)

# forward pass: matrix multiply + bias
output = layer(x)
print(output.shape)  # torch.Size([32, 256])

Transpose

The transpose of a matrix flips rows and columns.

python
import numpy as np

A = np.array([
    [1, 2, 3],
    [4, 5, 6]
])

print(A.shape)    # (2, 3)
print(A.T.shape)  # (3, 2) — rows became columns

You will see transposes everywhere in transformer code. The attention formula computes Q @ K.T: the queries multiplied against the transposed keys, where the transpose is needed to get the shapes to line up for the matrix multiply. That one .T in the code represents the entire mechanism of comparing tokens to each other.
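A quick shape check makes the role of the transpose obvious (toy sizes, far smaller than a real model):

```python
import numpy as np

# 3 tokens, each with a 4-dimensional query and key
Q = np.random.randn(3, 4)
K = np.random.randn(3, 4)

# (3, 4) @ (3, 4) does not line up; transposing K gives (4, 3)
scores = Q @ K.T
print(scores.shape)  # (3, 3) — one dot product per pair of tokens

# scores[i, j] is the dot product of query i with key j
print(np.allclose(scores[0, 1], np.dot(Q[0], K[1])))  # True
```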


What a derivative actually is

Calculus enters when the model needs to learn. Derivatives are the tool.

A derivative measures how much a function's output changes when you slightly change its input. If f(x) = x², then at x = 3:

python
# the derivative of x^2 is 2x
# at x=3: derivative = 2*3 = 6

# verify numerically
h = 0.0001
x = 3.0
numerical_deriv = ((x + h)**2 - (x - h)**2) / (2 * h)
print(numerical_deriv)  # approximately 6.0

The derivative tells you the slope at that point. Slope of 6 means: if you increase x by a tiny amount, f(x) increases by roughly 6 times that amount.

In neural networks, the function is the loss (how wrong the model is) and the inputs are the weights. The derivative of the loss with respect to each weight tells you how adjusting that weight affects the error. That is the gradient.


Gradients: derivatives for things with many inputs

A neural network has millions of weights. You need the derivative of the loss with respect to every single one. That collection of derivatives is called the gradient.

python
import torch

# a simple function with two inputs
def f(x, y):
    return x**2 + 3*x*y + y**2

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(1.0, requires_grad=True)

z = f(x, y)
z.backward()  # compute gradients

print(x.grad)  # df/dx = 2x + 3y = 2*2 + 3*1 = 7.0
print(y.grad)  # df/dy = 3x + 2y = 3*2 + 2*1 = 8.0

requires_grad=True tells PyTorch to track operations on this tensor so it can compute gradients later. The .backward() call triggers that computation. This is what automatic differentiation does. You write the forward pass, PyTorch figures out the backward pass.


The chain rule: how backpropagation works

A neural network is a composition of functions. Layer 1 produces something, Layer 2 takes that as input and produces something else, and so on.

If you want the derivative of the final loss with respect to a weight in Layer 1, you need to account for every intermediate transformation between that weight and the loss. The chain rule handles this.

For two chained functions z = f(g(x)), the derivative is:

plaintext
dz/dx = (dz/dg) * (dg/dx)

Multiply the outer derivative by the inner derivative. Stack another function on and the chain extends by one more factor.

python
import torch

x = torch.tensor(3.0, requires_grad=True)

# chained operations: two "layers"
a = x ** 2        # first transformation
b = torch.sin(a)  # second transformation
c = b * 2         # third transformation

c.backward()

# PyTorch computed this using the chain rule automatically
print(x.grad)  # dc/dx = 2 * cos(x^2) * 2x
               # = 2 * cos(9) * 6 = 12 * cos(9)
               # ≈ -10.93

PyTorch builds a computation graph as you do the forward pass, then walks it backwards applying the chain rule at each step. That walk is backpropagation. When people say "backprop" that is what they mean. It is just the chain rule applied to a graph.


Gradient descent: learning in practice

Once you have the gradients, you update the weights. The basic rule is:

plaintext
weight = weight - learning_rate * gradient

If the gradient is positive, the loss goes up when you increase the weight, so you decrease it. If the gradient is negative, you increase the weight. The learning rate controls how big each step is.
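Before handing things to an optimizer, it helps to run that rule by hand. Here is a sketch on a single weight, minimizing a toy loss (w - 5)^2 whose minimum sits at w = 5 (the function and values are made up for illustration; this is what optimizer.step() does for every parameter at once):

```python
import torch

w = torch.tensor(0.0, requires_grad=True)
learning_rate = 0.1

for _ in range(50):
    loss = (w - 5) ** 2
    loss.backward()                       # compute d(loss)/dw
    with torch.no_grad():
        w -= learning_rate * w.grad       # weight = weight - learning_rate * gradient
    w.grad.zero_()                        # clear the gradient for the next step

print(round(w.item(), 3))  # 5.0 — settled at the minimum
```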

python
import torch
import torch.nn as nn

# simple model: one linear layer
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(10, 4)
y = torch.randn(10, 1)

for step in range(100):
    # forward pass
    pred = model(x)
    loss = nn.MSELoss()(pred, y)

    # backward pass: compute gradients
    optimizer.zero_grad()
    loss.backward()

    # update step: move weights opposite to gradient
    optimizer.step()

    if step % 20 == 0:
        print(f"step {step} | loss {loss.item():.4f}")

plaintext
step  0 | loss 1.2341
step 20 | loss 0.9812
step 40 | loss 0.7023
step 60 | loss 0.4914
step 80 | loss 0.3201

That falling loss number is the math working. Every step the model looks at the gradient, figures out which direction makes it worse, and moves the opposite way. Repeat enough times and the loss converges.

The learning rate matters more than it looks. Too high and the updates overshoot and the loss bounces or explodes. Too low and training takes forever or gets stuck. Most practical training uses adaptive optimizers like AdamW that adjust the effective learning rate per parameter, but they are still doing gradient descent underneath.
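The overshoot behavior is easy to demonstrate on a toy function. A minimal sketch, minimizing f(x) = x^2 (gradient 2x) with two different learning rates; the threshold values here are illustrative:

```python
# gradient descent on f(x) = x^2, whose gradient is 2x
def descend(lr, steps=20, x=1.0):
    for _ in range(steps):
        x = x - lr * 2 * x  # weight = weight - learning_rate * gradient
    return x

print(descend(lr=0.1))  # shrinks toward 0 — converges
print(descend(lr=1.1))  # each step overshoots past 0 and grows — diverges
```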


Putting it together: one forward and backward pass

Here is the full picture in one block of code.

python
import torch
import torch.nn as nn
import torch.nn.functional as F

# toy model: two layers
class TwoLayerNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))  # layer 1: linear + activation
        x = self.fc2(x)          # layer 2: linear
        return x

model = TwoLayerNet(input_dim=8, hidden_dim=16, output_dim=3)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(32, 8)   # batch of 32 examples
y = torch.randint(0, 3, (32,))  # 3-class labels

# step 1: forward pass — matrix multiplies through both layers
logits = model(x)

# step 2: compute loss — how wrong are the predictions
loss = F.cross_entropy(logits, y)

# step 3: backward pass — chain rule from loss back through both layers
optimizer.zero_grad()
loss.backward()

# step 4: update — move weights opposite to their gradients
optimizer.step()

print(f"loss: {loss.item():.4f}")

Every weight in every layer has a gradient after .backward(). Every gradient was computed by chaining derivatives through the graph from the loss back to that weight. Every weight update moves in the direction that reduces the loss. This cycle repeating thousands or millions of times is how a language model goes from random noise to something that can write.


The specific things you will see in transformer code

You have seen the math. Here is where it shows up when you start reading actual LLM code.

Operation          | Where it appears
Matrix multiply @  | Every linear projection in the model
Dot product        | Attention score computation (Q @ K.T)
Transpose .T       | Reshaping tensors for attention
Softmax            | Converting attention scores to probabilities
Cross-entropy loss | Training objective for next-token prediction
.backward()        | Computing gradients for all weights
Gradient clipping  | Preventing exploding gradients during training

When you read transformer code and hit something like att = (q @ k.transpose(-2, -1)) * scale, you now know what that is: a scaled dot product between queries and keys, measuring similarity between every pair of tokens. The math we covered is that line.
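That line can be sketched end to end in NumPy. This is a minimal single-head version with toy sizes (4 tokens, 8 dimensions; real models are far larger, and the softmax helper is hand-rolled here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

# toy sizes: 4 tokens, 8-dimensional queries, keys, and values
q = np.random.randn(4, 8)
k = np.random.randn(4, 8)
v = np.random.randn(4, 8)

scale = 1 / np.sqrt(q.shape[-1])   # the "scaled" in scaled dot product
att = softmax((q @ k.T) * scale)   # (4, 4): each row is a probability distribution
out = att @ v                      # weighted mix of value vectors per token

print(att.shape, out.shape)                # (4, 4) (4, 8)
print(np.allclose(att.sum(axis=-1), 1.0))  # True — softmax rows sum to 1
```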


What you can skip for now

Eigenvalues and eigenvectors come up in some optimization research but not in standard neural network training code. Singular value decomposition is relevant for some compression techniques but not for building from scratch. Fourier transforms appear in some positional encoding schemes but the standard learned positional embeddings do not need them.

Learn those when you need them. Right now you need vectors, matrices, dot products, derivatives, and the chain rule. That is the core.

If you want to go deeper

3Blue1Brown's "Essence of Linear Algebra" series on YouTube is the best visual intuition for this material. For calculus, his "Essence of Calculus" series covers derivatives and the chain rule in a way that actually makes sense. Both are free and short.


Next in the series

Article 2 covers probability and statistics. You will need it for understanding why neural networks use cross-entropy loss, what "distributions" mean in the context of language models, and how uncertainty is handled during training. The math from this article feeds directly into it.
