
Linear Algebra and Calculus: The Math Your AI Model Runs On

Before you build LLMs or AI products, you need to understand the math underneath. This article breaks down vectors, matrices, dot products, derivatives, and gradients — the exact operations that run inside every neural network.

Krunal Kanojiya

I kept bouncing off AI papers.

Not because the ideas were too hard. The ideas were fine. I would follow the concept, understand the intuition, nod along. Then I would hit an equation with a matrix transpose or a partial derivative and my brain would just skip it. I told myself it was notation. It was not notation.

The math is load-bearing. Every operation that runs inside a neural network, every weight update during training, every attention score in a transformer: it is all linear algebra and calculus. If you have been treating those sections as decorations you can scroll past, this article is for you.

This is Article 1 in a series on AI and ML fundamentals you need before building real LLM products. We start here because everything else depends on it.


Why this math specifically

Machine learning leans on two branches of math.

Linear algebra handles the structure: how data is represented, stored, and transformed. Neural networks are, at a mechanical level, sequences of matrix operations applied to vectors. That is what the "computation" actually is.

Calculus handles the learning: how models figure out which direction to adjust their weights to get better. Specifically you need derivatives and the chain rule.

You do not need a full university course in either. You need the subset that shows up in neural networks. This article covers that subset.


Scalars, vectors, and matrices

These are the three basic objects.

A scalar is a single number. A learning rate of 0.001 is a scalar. A temperature of 0.8 during text generation is a scalar.

A vector is an ordered list of numbers. In ML this usually represents something concrete:

python
word_embedding = [0.12, -0.45, 0.87, 0.03, ...]  # 128 numbers for one word
pixel_row      = [255, 128, 64, 200, ...]          # one row of an image
hidden_state   = [0.91, -0.23, 0.44, ...]         # a neuron's activations

The length of the vector is its dimension. A 128-dimensional word embedding has 128 numbers in it. That number matters because every weight matrix in the model is sized to match it.
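As a quick sketch of that shape-matching rule (the sizes here are illustrative, not taken from any particular model):

```python
import numpy as np

embedding = np.random.randn(128)   # a 128-dimensional vector
W = np.random.randn(128, 256)      # a weight matrix sized to match it

print(embedding.shape)        # (128,)
print((embedding @ W).shape)  # (256,) — the layer maps 128 dims to 256

# a vector of the wrong dimension fails immediately
try:
    np.random.randn(64) @ W
except ValueError as e:
    print("shape mismatch:", e)
```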

A matrix is a 2D grid of numbers, organized in rows and columns.

python
import numpy as np

W = np.array([
    [0.2, -0.5,  0.8],
    [0.1,  0.9, -0.3],
    [-0.4, 0.3,  0.7],
    [0.6, -0.1,  0.2]
])

print(W.shape)  # (4, 3) — 4 rows, 3 columns

The weights of a neural network layer live in matrices like this. The shape tells you the input and output dimensions of that layer.


Dot products: the operation that runs everything

The dot product is the most important operation in all of neural network math. You need to genuinely understand it, not just know the formula.

Take two vectors of the same length. Multiply each pair of corresponding elements, then add all those products together.

python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

dot = np.dot(a, b)
# = (1*4) + (2*5) + (3*6)
# = 4 + 10 + 18
# = 32

print(dot)  # 32

What this measures is similarity. If two vectors point in the same direction in space, their dot product is large and positive. If they are perpendicular, it is zero. If they point in opposite directions, it is negative.

This is exactly why attention in transformers works. When the model computes attention scores, it takes the dot product of a query vector against a set of key vectors. High dot product means high relevance. The model literally measures "how aligned is what this token is looking for with what each other token has to offer."
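To see the similarity interpretation concretely, here is a minimal sketch with three hand-picked vectors (the numbers are made up for illustration):

```python
import numpy as np

query = np.array([1.0, 2.0, 0.0])

aligned       = np.array([2.0, 4.0, 0.0])    # same direction as query
perpendicular = np.array([-2.0, 1.0, 0.0])   # at 90 degrees to query
opposite      = np.array([-1.0, -2.0, 0.0])  # opposite direction

print(np.dot(query, aligned))        # 10.0 — large and positive
print(np.dot(query, perpendicular))  # 0.0  — no alignment
print(np.dot(query, opposite))       # -5.0 — negative
```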


Matrix multiplication

Matrix multiplication extends the dot product to entire collections of vectors at once. Each output element is the dot product of a row from the left matrix and a column from the right matrix.

python
import numpy as np

# input: batch of 3 tokens, each 4-dimensional
X = np.random.randn(3, 4)

# weight matrix: transforms from 4 dimensions to 5 dimensions
W = np.random.randn(4, 5)

# output: batch of 3 tokens, each now 5-dimensional
output = X @ W  # @ is the matrix multiply operator in Python
print(output.shape)  # (3, 5)

The shape rule: (m, n) @ (n, p) gives (m, p). The inner dimensions must match, and they cancel out.

Every layer in a neural network does this. The input goes in, the weight matrix transforms it, the output comes out. Stack twelve of these transformations with some nonlinearities between them and you have something like GPT.

python
import torch
import torch.nn as nn

# a single linear layer: 128 in, 256 out
layer = nn.Linear(128, 256)

# input: batch of 32 tokens, each 128-dimensional
x = torch.randn(32, 128)

# forward pass: matrix multiply + bias
output = layer(x)
print(output.shape)  # torch.Size([32, 256])

Transpose

The transpose of a matrix flips rows and columns.

python
import numpy as np

A = np.array([
    [1, 2, 3],
    [4, 5, 6]
])

print(A.shape)    # (2, 3)
print(A.T.shape)  # (3, 2) — rows became columns

You will see transposes everywhere in transformer code. The attention formula computes Q @ K.T: the queries multiplied against the transposed keys, where the transpose is needed to get the shapes to line up for the matrix multiply. That one .T in the code represents the entire mechanism of comparing tokens to each other.
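A quick shape check makes the role of the transpose obvious (toy sizes, far smaller than a real model):

```python
import numpy as np

# 3 tokens, each with a 4-dimensional query and key
Q = np.random.randn(3, 4)
K = np.random.randn(3, 4)

# (3, 4) @ (3, 4) does not line up; transposing K gives (4, 3)
scores = Q @ K.T
print(scores.shape)  # (3, 3) — one dot product per pair of tokens

# scores[i, j] is the dot product of query i with key j
print(np.allclose(scores[0, 1], np.dot(Q[0], K[1])))  # True
```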


What a derivative actually is

Calculus enters when the model needs to learn. Derivatives are the tool.

A derivative measures how much a function's output changes when you slightly change its input. If f(x) = x², then at x = 3:

python
# the derivative of x^2 is 2x
# at x=3: derivative = 2*3 = 6

# verify numerically
h = 0.0001
x = 3.0
numerical_deriv = ((x + h)**2 - (x - h)**2) / (2 * h)
print(numerical_deriv)  # approximately 6.0

The derivative tells you the slope at that point. Slope of 6 means: if you increase x by a tiny amount, f(x) increases by roughly 6 times that amount.

In neural networks, the function is the loss (how wrong the model is) and the inputs are the weights. The derivative of the loss with respect to each weight tells you how adjusting that weight affects the error. That is the gradient.


Gradients: derivatives for things with many inputs

A neural network has millions of weights. You need the derivative of the loss with respect to every single one. That collection of derivatives is called the gradient.

python
import torch

# a simple function with two inputs
def f(x, y):
    return x**2 + 3*x*y + y**2

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(1.0, requires_grad=True)

z = f(x, y)
z.backward()  # compute gradients

print(x.grad)  # df/dx = 2x + 3y = 2*2 + 3*1 = 7.0
print(y.grad)  # df/dy = 3x + 2y = 3*2 + 2*1 = 8.0

requires_grad=True tells PyTorch to track operations on this tensor so it can compute gradients later. The .backward() call triggers that computation. This is what automatic differentiation does. You write the forward pass, PyTorch figures out the backward pass.


The chain rule: how backpropagation works

A neural network is a composition of functions. Layer 1 produces something, Layer 2 takes that as input and produces something else, and so on.

If you want the derivative of the final loss with respect to a weight in Layer 1, you need to account for every intermediate transformation between that weight and the loss. The chain rule handles this.

For two chained functions z = f(g(x)), the derivative is:

plaintext
dz/dx = (dz/dg) * (dg/dx)

Multiply the outer derivative by the inner derivative. Stack another function on and the chain extends by one more factor.

python
import torch

x = torch.tensor(3.0, requires_grad=True)

# chained operations: two "layers"
a = x ** 2        # first transformation
b = torch.sin(a)  # second transformation
c = b * 2         # third transformation

c.backward()

# PyTorch computed this using the chain rule automatically
print(x.grad)  # dc/dx = 2 * cos(x^2) * 2x
               # = 2 * cos(9) * 6 = 12 * cos(9)
               # ≈ -10.93

PyTorch builds a computation graph as you do the forward pass, then walks it backwards applying the chain rule at each step. That walk is backpropagation. When people say "backprop" that is what they mean. It is just the chain rule applied to a graph.


Gradient descent: learning in practice

Once you have the gradients, you update the weights. The basic rule is:

plaintext
weight = weight - learning_rate * gradient

If the gradient is positive, the loss goes up when you increase the weight, so you decrease it. If the gradient is negative, you increase the weight. The learning rate controls how big each step is.
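Before handing things to an optimizer, it helps to run that rule by hand. Here is a sketch on a single weight, minimizing a toy loss (w - 5)^2 whose minimum sits at w = 5 (the function and values are made up for illustration; this is what optimizer.step() does for every parameter at once):

```python
import torch

w = torch.tensor(0.0, requires_grad=True)
learning_rate = 0.1

for _ in range(50):
    loss = (w - 5) ** 2
    loss.backward()                       # compute d(loss)/dw
    with torch.no_grad():
        w -= learning_rate * w.grad       # weight = weight - learning_rate * gradient
    w.grad.zero_()                        # clear the gradient for the next step

print(round(w.item(), 3))  # 5.0 — settled at the minimum
```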

python
import torch
import torch.nn as nn

# simple model: one linear layer
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(10, 4)
y = torch.randn(10, 1)

for step in range(100):
    # forward pass
    pred = model(x)
    loss = nn.MSELoss()(pred, y)

    # backward pass: compute gradients
    optimizer.zero_grad()
    loss.backward()

    # update step: move weights opposite to gradient
    optimizer.step()

    if step % 20 == 0:
        print(f"step {step} | loss {loss.item():.4f}")

plaintext
step  0 | loss 1.2341
step 20 | loss 0.9812
step 40 | loss 0.7023
step 60 | loss 0.4914
step 80 | loss 0.3201

That falling loss number is the math working. Every step the model looks at the gradient, figures out which direction makes it worse, and moves the opposite way. Repeat enough times and the loss converges.

The learning rate matters more than it looks. Too high and the updates overshoot and the loss bounces or explodes. Too low and training takes forever or gets stuck. Most practical training uses adaptive optimizers like AdamW that adjust the effective learning rate per parameter, but they are still doing gradient descent underneath.
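The overshoot behavior is easy to demonstrate on a toy function. A minimal sketch, minimizing f(x) = x^2 (gradient 2x) with two different learning rates; the threshold values here are illustrative:

```python
# gradient descent on f(x) = x^2, whose gradient is 2x
def descend(lr, steps=20, x=1.0):
    for _ in range(steps):
        x = x - lr * 2 * x  # weight = weight - learning_rate * gradient
    return x

print(descend(lr=0.1))  # shrinks toward 0 — converges
print(descend(lr=1.1))  # each step overshoots past 0 and grows — diverges
```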


Putting it together: one forward and backward pass

Here is the full picture in one block of code.

python
import torch
import torch.nn as nn
import torch.nn.functional as F

# toy model: two layers
class TwoLayerNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))  # layer 1: linear + activation
        x = self.fc2(x)          # layer 2: linear
        return x

model = TwoLayerNet(input_dim=8, hidden_dim=16, output_dim=3)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(32, 8)   # batch of 32 examples
y = torch.randint(0, 3, (32,))  # 3-class labels

# step 1: forward pass — matrix multiplies through both layers
logits = model(x)

# step 2: compute loss — how wrong are the predictions
loss = F.cross_entropy(logits, y)

# step 3: backward pass — chain rule from loss back through both layers
optimizer.zero_grad()
loss.backward()

# step 4: update — move weights opposite to their gradients
optimizer.step()

print(f"loss: {loss.item():.4f}")

Every weight in every layer has a gradient after .backward(). Every gradient was computed by chaining derivatives through the graph from the loss back to that weight. Every weight update moves in the direction that reduces the loss. This cycle repeating thousands or millions of times is how a language model goes from random noise to something that can write.


The specific things you will see in transformer code

You have seen the math. Here is where it shows up when you start reading actual LLM code.

Operation          | Where it appears
Matrix multiply @  | Every linear projection in the model
Dot product        | Attention score computation (Q @ K.T)
Transpose .T       | Reshaping tensors for attention
Softmax            | Converting attention scores to probabilities
Cross-entropy loss | Training objective for next-token prediction
.backward()        | Computing gradients for all weights
Gradient clipping  | Preventing exploding gradients during training

When you read transformer code and hit something like att = (q @ k.transpose(-2, -1)) * scale, you now know what that is: a scaled dot product between queries and keys, measuring similarity between every pair of tokens. The math we covered is that line.
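That line can be sketched end to end in NumPy. This is a minimal single-head version with toy sizes (4 tokens, 8 dimensions; real models are far larger, and the softmax helper is hand-rolled here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

# toy sizes: 4 tokens, 8-dimensional queries, keys, and values
q = np.random.randn(4, 8)
k = np.random.randn(4, 8)
v = np.random.randn(4, 8)

scale = 1 / np.sqrt(q.shape[-1])   # the "scaled" in scaled dot product
att = softmax((q @ k.T) * scale)   # (4, 4): each row is a probability distribution
out = att @ v                      # weighted mix of value vectors per token

print(att.shape, out.shape)                # (4, 4) (4, 8)
print(np.allclose(att.sum(axis=-1), 1.0))  # True — softmax rows sum to 1
```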


What you can skip for now

Eigenvalues and eigenvectors come up in some optimization research but not in standard neural network training code. Singular value decomposition is relevant for some compression techniques but not for building from scratch. Fourier transforms appear in some positional encoding schemes but the standard learned positional embeddings do not need them.

Learn those when you need them. Right now you need vectors, matrices, dot products, derivatives, and the chain rule. That is the core.

If you want to go deeper

3Blue1Brown's "Essence of Linear Algebra" series on YouTube is the best visual intuition for this material. For calculus, his "Essence of Calculus" series covers derivatives and the chain rule in a way that actually makes sense. Both are free and short.


Next in the series

Article 2 covers probability and statistics. You will need it for understanding why neural networks use cross-entropy loss, what "distributions" mean in the context of language models, and how uncertainty is handled during training. The math from this article feeds directly into it.
