
Probability and Statistics: What AI Actually Means by Confidence

Before you can understand how LLMs train or why they hallucinate, you need probability and statistics. This article covers distributions, entropy, Bayes' theorem, and cross-entropy loss with real code examples — the exact concepts that show up in neural network training.

Krunal Kanojiya


I used to skip the statistics sections.

Not just in ML papers. In university too. Probability felt like a detour from the real programming. I figured I could come back to it later, after I understood how things actually worked.

That was backwards. The statistics is not a detour. It is how the whole thing works.

Every prediction a neural network makes is a probability distribution. The loss function is measuring the distance between two distributions. The training process is doing maximum likelihood estimation whether it calls it that or not. When a language model hallucinates, it is producing a token with high predicted probability that happens to be wrong. Understanding any of that requires this article.

This is Article 2 in a series on AI and ML fundamentals. Article 1 covered linear algebra and calculus — the math of structure and gradients. This one covers probability and statistics — the math of uncertainty. Article 3 will bring both together to build and train a neural network.


What probability actually measures

Probability is a number between 0 and 1 that describes how likely something is to happen. Zero means impossible. One means certain. Everything else is in between.

The thing most people get wrong about probability in ML is thinking the model is producing answers. It is not. It is producing probabilities over possible answers, and then you (or the sampling code) pick one.

python
import torch
import torch.nn.functional as F

# raw model output (logits) for a 5-token vocabulary
logits = torch.tensor([2.1, 0.5, -1.3, 0.8, 1.7])

# convert to probabilities with softmax
probs = F.softmax(logits, dim=0)

print(probs)
# tensor([0.4591, 0.0927, 0.0153, 0.1251, 0.3078])

print(probs.sum())
# tensor(1.)  — all probabilities sum to 1

That output [0.46, 0.09, 0.02, 0.13, 0.31] is the model saying: "Token 0 is most likely, token 4 is the runner-up, token 2 is almost impossible." The model never says "the answer is token 0." It says "here is my confidence in each option."

This is not a minor implementation detail. It changes how you think about what models are doing. When GPT generates text, it is sampling from this distribution at every single step. The creativity, the randomness, the occasional weirdness — all of it comes from this probability layer.
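The sampling step itself is one call. Here is a minimal sketch of that idea using torch.multinomial — an illustration, not any particular model's decoding loop:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.tensor([2.1, 0.5, -1.3, 0.8, 1.7])
probs = F.softmax(logits, dim=0)

# draw 10,000 tokens from the distribution the model produced
samples = torch.multinomial(probs, num_samples=10_000, replacement=True)

# empirical frequencies converge to the softmax probabilities
freqs = torch.bincount(samples, minlength=5).float() / 10_000
for i, (p, f) in enumerate(zip(probs, freqs)):
    print(f"token {i}: predicted {p:.3f}, sampled {f:.3f}")
```

Run it and the sampled frequencies land within a percent or two of the predicted probabilities. That gap is exactly what shrinks as you draw more samples.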


Random variables and distributions

A random variable is a variable whose value depends on a random process. When you roll a die, the outcome is a random variable. When a language model predicts the next token, that token is a random variable.

A distribution describes the probability of each possible value.

Discrete distributions are for things that take specific countable values — like which token comes next out of a vocabulary of 50,257 options.

Continuous distributions are for things that can take any value in a range — like the weights in a neural network before training.

The one you will see most often in ML is the normal distribution (also called Gaussian). It is bell-shaped, symmetric, and described by two numbers: mean (center) and standard deviation (spread).

python
import numpy as np
import torch

# sample 1000 values from a normal distribution
# mean=0, std=1 (standard normal)
samples = np.random.normal(loc=0.0, scale=1.0, size=1000)

print(f"mean: {samples.mean():.3f}")    # close to 0
print(f"std:  {samples.std():.3f}")     # close to 1
print(f"min:  {samples.min():.3f}")
print(f"max:  {samples.max():.3f}")
plaintext
mean: 0.012
std:  0.998
min: -3.241
max:  3.018

Why does this matter for ML? Because PyTorch initializes most weight matrices from a normal distribution. The model starts with weights drawn from roughly N(0, 0.02). If you initialize weights too large or too small, training breaks before it starts. The distribution you initialize from has a real effect on whether the model learns at all.
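A quick way to see why, sketched with a hypothetical layer width of 768 and no nonlinearity — just repeated matrix multiplies through freshly initialized weights:

```python
import torch

torch.manual_seed(0)
width = 768

x_small = torch.randn(1, width)
x_large = torch.randn(1, width)

# pass activations through 5 stacked linear layers
for _ in range(5):
    w_small = torch.empty(width, width).normal_(mean=0.0, std=0.02)
    w_large = torch.empty(width, width).normal_(mean=0.0, std=1.0)
    x_small = x_small @ w_small
    x_large = x_large @ w_large

print(f"activation std, init std=0.02: {x_small.std().item():.4f}")  # stays small
print(f"activation std, init std=1.00: {x_large.std().item():.2e}")  # explodes
```

With std=1.0, five layers are enough to blow activations up by a factor of millions; gradients follow, and training diverges before it starts.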


Expected value: what an average really means

The expected value of a random variable is its probability-weighted average. If you repeat something many times, the expected value is what you would converge to.

python
# rolling a fair die: outcomes 1-6, each with probability 1/6
outcomes = [1, 2, 3, 4, 5, 6]
probs    = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]

expected = sum(o * p for o, p in zip(outcomes, probs))
print(f"expected value: {expected:.2f}")  # 3.50

In ML, loss functions are expected values. When you compute the loss over a batch of 32 examples, you are computing the average loss across those examples. Training minimizes the expected loss over the data distribution, which is another way of saying: you want the model to do well on average across all possible inputs, not just the specific batch you are looking at right now.
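You can see that expectation directly in PyTorch: asking F.cross_entropy for per-example losses with reduction="none" shows that the default batch loss is exactly their mean. A small sketch with made-up shapes:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits  = torch.randn(32, 10)            # batch of 32 examples, 10 classes
targets = torch.randint(0, 10, (32,))

per_example = F.cross_entropy(logits, targets, reduction="none")  # shape (32,)
batch_loss  = F.cross_entropy(logits, targets)                    # default: mean

print(per_example.shape)                               # torch.Size([32])
print(torch.allclose(per_example.mean(), batch_loss))  # True
```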


Entropy: measuring uncertainty

Entropy measures how uncertain or spread out a distribution is. High entropy means the distribution is flat and unpredictable. Low entropy means it is peaked and confident.

python
import torch
import torch.nn.functional as F

def entropy(probs):
    # H(p) = -sum(p * log(p))
    # add small epsilon to avoid log(0)
    return -(probs * torch.log(probs + 1e-9)).sum()

# uniform distribution over 4 tokens — maximum uncertainty
uniform = torch.tensor([0.25, 0.25, 0.25, 0.25])
print(f"uniform entropy: {entropy(uniform):.4f}")   # 1.3863

# peaked distribution — model is very confident about token 0
peaked = torch.tensor([0.97, 0.01, 0.01, 0.01])
print(f"peaked entropy:  {entropy(peaked):.4f}")    # 0.1677

# somewhere in between
mixed = torch.tensor([0.60, 0.20, 0.15, 0.05])
print(f"mixed entropy:   {entropy(mixed):.4f}")     # 1.0627

This is not abstract. When a language model is early in training, its output distribution looks roughly uniform — high entropy, no confidence. As training progresses and the model learns real patterns, the distribution peaks on likely tokens and entropy drops.

When you see the loss going down during training, entropy going down is a big part of what is happening. The model is getting more decisive.

Real example: the temperature parameter in text generation directly controls entropy. Lower temperature sharpens the distribution (lower entropy, more predictable output). Higher temperature flattens it (higher entropy, more variety).

python
logits = torch.tensor([3.0, 1.5, 0.5, -1.0])

temps = [0.5, 1.0, 2.0]
for t in temps:
    probs = F.softmax(logits / t, dim=0)
    print(f"temp={t}: probs={probs.numpy().round(3)} | entropy={entropy(probs):.3f}")
plaintext
temp=0.5: probs=[0.946 0.047 0.006 0.   ] | entropy=0.231
temp=1.0: probs=[0.756 0.169 0.062 0.014] | entropy=0.744
temp=2.0: probs=[0.528 0.249 0.151 0.071] | entropy=1.158

Temperature 0.5 makes the model commit hard to token 0. Temperature 2.0 gives the other tokens a real chance. The temperature parameter in your generate() function is just dividing logits before softmax. That is the whole implementation.
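To make that concrete, here is a toy decode loop. The fake_model stub and its fixed logits are invented for illustration — a real model would return different logits at each step:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def fake_model(tokens):
    # stand-in for a real network: always returns the same 4-token logits
    return torch.tensor([3.0, 1.5, 0.5, -1.0])

def sample_next(tokens, temperature=1.0):
    logits = fake_model(tokens)
    probs = F.softmax(logits / temperature, dim=0)  # the entire temperature trick
    return torch.multinomial(probs, num_samples=1).item()

tokens = [0]
for _ in range(10):
    tokens.append(sample_next(tokens, temperature=0.8))
print(tokens)  # mostly token 0, with occasional variety
```

Raise the temperature argument and the generated sequence gets noisier; lower it toward zero and the loop commits to token 0 almost every time.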


Conditional probability

Conditional probability asks: given that something has already happened, what is the probability of something else?

Written as P(A | B) — probability of A given B.

This is the mathematical foundation of language modeling. A language model is computing P(next_token | all_previous_tokens). Every single forward pass. Given the sentence so far, what is the probability distribution over what comes next?

python
# a toy example with word frequencies
# counts of word pairs in a tiny corpus
bigram_counts = {
    ("the", "cat"): 15,
    ("the", "dog"): 10,
    ("the", "bird"): 5,
    ("the", "house"): 20,
}

total_after_the = sum(bigram_counts.values())  # 50

# P(word | "the") — what comes after "the"?
for (prev, next_word), count in bigram_counts.items():
    prob = count / total_after_the
    print(f"P({next_word!r} | 'the') = {prob:.2f}")
plaintext
P('cat'   | 'the') = 0.30
P('dog'   | 'the') = 0.20
P('bird'  | 'the') = 0.10
P('house' | 'the') = 0.40

A bigram model like this is a language model. A terrible one, but the same idea. GPT-4 is computing the same conditional probability — P(next_token | all_previous_tokens) — just with a rumored 1.8 trillion parameters instead of a lookup table.
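And you can generate from it. A sketch using Python's random.choices to sample from P(word | "the") — the same sample-from-a-conditional-distribution step GPT performs, shrunk to four entries:

```python
import random

random.seed(0)

bigram_counts = {
    ("the", "cat"): 15,
    ("the", "dog"): 10,
    ("the", "bird"): 5,
    ("the", "house"): 20,
}

# P(word | "the") expressed as candidate words plus weights
next_words = [nxt for (_, nxt) in bigram_counts]
weights    = [bigram_counts[("the", w)] for w in next_words]

# more frequent continuations come up more often
for _ in range(5):
    print("the", random.choices(next_words, weights=weights, k=1)[0])
```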


Bayes' theorem: updating beliefs with evidence

Bayes' theorem is the formula for updating your probability estimate when you see new evidence.

plaintext
P(hypothesis | evidence) = P(evidence | hypothesis) * P(hypothesis) / P(evidence)

The components:

  • P(hypothesis) is the prior — what you believed before seeing evidence
  • P(evidence | hypothesis) is the likelihood — how probable is the evidence if the hypothesis is true
  • P(hypothesis | evidence) is the posterior — updated belief after seeing evidence

Real example. You have a spam filter.

python
# prior: 30% of emails are spam based on history
p_spam = 0.30
p_not_spam = 0.70

# likelihood: "FREE" appears in 80% of spam, 5% of normal email
p_free_given_spam     = 0.80
p_free_given_not_spam = 0.05

# total probability of seeing "FREE" in any email
p_free = (p_free_given_spam * p_spam) + (p_free_given_not_spam * p_not_spam)

# posterior: given we see "FREE", what's the probability it's spam?
p_spam_given_free = (p_free_given_spam * p_spam) / p_free

print(f"P(spam | 'FREE') = {p_spam_given_free:.3f}")  # 0.873
plaintext
P(spam | 'FREE') = 0.873

Seeing "FREE" pushes the spam probability from 30% to 87%. That is Bayesian updating.

Neural networks do not literally apply Bayes' theorem on every forward pass. But training is doing something Bayesian at a high level: you start with a random prior (initial weights), observe evidence (training data), and the loss function pushes the model toward weights that maximize the likelihood of that data. The posterior is the trained model.


Likelihood and maximum likelihood estimation

Likelihood is the probability of your observed data given your model's current parameters. Training a neural network is finding the parameters that maximize this likelihood.

Maximum likelihood estimation (MLE) sounds academic but you are already using it. Cross-entropy loss is the negative log likelihood. Minimizing cross-entropy is the same as maximizing likelihood. They are the same thing.

python
import torch
import torch.nn.functional as F

# true label: token index 2 is the correct next token
target = torch.tensor([2])

# model's predicted probabilities over 5-token vocabulary
logits = torch.tensor([[1.2, 0.5, 2.8, 0.3, -0.4]])
probs = F.softmax(logits, dim=1)

print("probabilities:", probs.round(decimals=3))
# tensor([[0.142, 0.070, 0.702, 0.058, 0.029]])  — model gives token 2 a 70% chance

# cross-entropy loss = -log(probability of the correct token)
loss = F.cross_entropy(logits, target)
print(f"cross-entropy loss: {loss.item():.4f}")   # 0.3542

# manually: -log(0.702)
import math
manual_loss = -math.log(0.702)
print(f"manual -log(p):     {manual_loss:.4f}")   # 0.3538 — same thing, up to rounding

The loss is literally just the negative log of the probability the model assigned to the correct answer. If the model was 70% confident and correct, the loss is 0.35. If the model was only 5% confident, the loss would be −log(0.05) ≈ 3.0. High confidence on the right answer = low loss. That is the entire training objective.

Why log? Two reasons. First, log turns multiplications into additions, which is computationally easier. Second, it penalizes overconfident wrong answers severely — if the model assigns 99% probability to the wrong token, the loss is enormous.
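The first reason is easy to demonstrate. The likelihood of a long sequence is a product of many small probabilities, and that product underflows floating point, while the equivalent sum of logs stays exact:

```python
import math

p = 0.1          # probability assigned to each of 1000 tokens in a sequence

product = 1.0
for _ in range(1000):
    product *= p
print(product)   # 0.0 — 1e-1000 underflows float64 entirely

log_likelihood = sum(math.log(p) for _ in range(1000))
print(log_likelihood)   # -2302.585... — stable, no underflow
```

This is why every framework works in log-probabilities: the information survives where the raw product does not.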


Cross-entropy loss in practice

Cross-entropy is how every classification task and every language model trains. It is worth seeing how it scales.

python
import torch
import torch.nn.functional as F

torch.manual_seed(42)
vocab_size = 50_257   # GPT-2 vocabulary size
seq_len    = 128
batch_size = 8

# simulated model output: raw logits
logits = torch.randn(batch_size, seq_len, vocab_size)

# simulated targets: correct next token at each position
targets = torch.randint(0, vocab_size, (batch_size, seq_len))

# reshape for cross-entropy: (B*T, vocab) vs (B*T,)
loss = F.cross_entropy(
    logits.view(-1, vocab_size),
    targets.view(-1)
)

print(f"loss: {loss.item():.4f}")
# random model: loss ≈ ln(50257) ≈ 10.82
print(f"ln(vocab_size): {torch.log(torch.tensor(50257.0)):.4f}")
plaintext
loss:           10.8193
ln(vocab_size): 10.8249

A randomly initialized model starts with a loss of roughly ln(50257) ≈ 10.82. That is what maximum uncertainty looks like in numbers. Every token gets roughly equal probability, so the correct token gets roughly 1/50257 probability, and −log(1/50257) ≈ 10.82.

When I built NanoGPT and watched the loss start at 10.8, I finally understood what that number meant. It was not just a metric. It was the model starting from complete ignorance about which of 50,257 tokens might come next.

By the end of training, the loss dropped to 1.59. That means the model had narrowed its effective vocabulary at each step from 50,000 tokens to roughly e^1.59 ≈ 4.9 plausible options. Not one right answer, but a small cluster of likely ones. That is real learning.


The connection to Article 3: neural networks

Everything in this article feeds directly into how neural networks train.

Where each concept appears in neural network training:

  • Probability distribution — the output of every forward pass (after softmax)
  • Conditional probability — the entire language modeling objective
  • Entropy — measures model confidence; tracked alongside loss
  • Cross-entropy — the loss function for classification and language modeling
  • Maximum likelihood — what gradient descent on cross-entropy is actually doing
  • Normal distribution — weight initialization
  • Expected value — loss averaged across a batch

Article 3 will take a neural network and walk through a full training loop. When I write loss = F.cross_entropy(logits, targets) there, you will know what that line is computing and why. When the model runs F.softmax(logits, dim=-1) during generation, you will know what comes out and what the numbers mean.

The math in Article 1 told you how the operations run. The math in this article tells you what the outputs represent. Together they are the foundation for everything that comes after.

A quick sanity check

If someone shows you a model with a validation loss of 2.3 and asks if that is good, you can now answer. e^2.3 ≈ 10. That means the model has narrowed the next token down to roughly 10 plausible options from a vocabulary of 50,000. Whether that is good depends on the task, but you are no longer just looking at an abstract number.
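This e^loss quantity has a name: perplexity. Computing it is one line — the helper below is my own sketch, not a library API:

```python
import math

def perplexity(cross_entropy: float) -> float:
    """e^loss: the effective number of equally likely next-token options."""
    return math.exp(cross_entropy)

print(f"{perplexity(10.82):,.0f}")  # ≈ 50,000 — random guessing over GPT-2's vocab
print(f"{perplexity(2.30):.1f}")    # ≈ 10 — the validation loss above
print(f"{perplexity(1.59):.1f}")    # ≈ 4.9 — the NanoGPT run from earlier
```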


Next in the series

Article 3 covers neural networks and backpropagation. You will see how a network is structured, how a forward pass moves data through it using the matrix math from Article 1, and how the loss computed using the probability math from this article gets turned into gradients that update the weights. It is the first article where training actually happens.
