
Probability and Statistics: What AI Actually Means by Confidence

Before you can understand how LLMs train or why they hallucinate, you need probability and statistics. This article covers distributions, entropy, Bayes' theorem, and cross-entropy loss with real code examples — the exact concepts that show up in neural network training.

Krunal Kanojiya


I used to skip the statistics sections.

Not just in ML papers. In university too. Probability felt like a detour from the real programming. I figured I could come back to it later, after I understood how things actually worked.

That was backwards. The statistics is not a detour. It is how the whole thing works.

Every prediction a neural network makes is a probability distribution. The loss function is measuring the distance between two distributions. The training process is doing maximum likelihood estimation whether it calls it that or not. When a language model hallucinates, it is producing a token with high predicted probability that happens to be wrong. Understanding any of that requires this article.

This is Article 2 in a series on AI and ML fundamentals. Article 1 covered linear algebra and calculus — the math of structure and gradients. This one covers probability and statistics — the math of uncertainty. Article 3 will bring both together to build and train a neural network.


What probability actually measures

Probability is a number between 0 and 1 that describes how likely something is to happen. Zero means impossible. One means certain. Everything else is in between.

The thing most people get wrong about probability in ML is thinking the model is producing answers. It is not. It is producing probabilities over possible answers, and then you (or the sampling code) pick one.

python
import torch
import torch.nn.functional as F

# raw model output (logits) for a 5-token vocabulary
logits = torch.tensor([2.1, 0.5, -1.3, 0.8, 1.7])

# convert to probabilities with softmax
probs = F.softmax(logits, dim=0)

print(probs)
# tensor([0.4591, 0.0927, 0.0153, 0.1251, 0.3078])

print(probs.sum())
# tensor(1.)  — all probabilities sum to 1

That output [0.46, 0.09, 0.02, 0.13, 0.31] is the model saying: "Token 0 is most likely, token 4 is the runner-up, token 2 is almost impossible." The model never says "the answer is token 0." It says "here is my confidence in each option."

This is not a minor implementation detail. It changes how you think about what models are doing. When GPT generates text, it is sampling from this distribution at every single step. The creativity, the randomness, the occasional weirdness — all of it comes from this probability layer.
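The sampling step itself is one call. Here is a minimal sketch of that idea using torch.multinomial — an illustration, not any particular model's decoding loop:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.tensor([2.1, 0.5, -1.3, 0.8, 1.7])
probs = F.softmax(logits, dim=0)

# draw 10,000 tokens from the distribution the model produced
samples = torch.multinomial(probs, num_samples=10_000, replacement=True)

# empirical frequencies converge to the softmax probabilities
freqs = torch.bincount(samples, minlength=5).float() / 10_000
for i, (p, f) in enumerate(zip(probs, freqs)):
    print(f"token {i}: predicted {p:.3f}, sampled {f:.3f}")
```

Run it and the sampled frequencies land within a percent or two of the predicted probabilities. That gap is exactly what shrinks as you draw more samples.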


Random variables and distributions

A random variable is a variable whose value depends on a random process. When you roll a die, the outcome is a random variable. When a language model predicts the next token, that token is a random variable.

A distribution describes the probability of each possible value.

Discrete distributions are for things that take specific countable values — like which token comes next out of a vocabulary of 50,257 options.

Continuous distributions are for things that can take any value in a range — like the weights in a neural network before training.

The one you will see most often in ML is the normal distribution (also called Gaussian). It is bell-shaped, symmetric, and described by two numbers: mean (center) and standard deviation (spread).

python
import numpy as np
import torch

# sample 1000 values from a normal distribution
# mean=0, std=1 (standard normal)
samples = np.random.normal(loc=0.0, scale=1.0, size=1000)

print(f"mean: {samples.mean():.3f}")    # close to 0
print(f"std:  {samples.std():.3f}")     # close to 1
print(f"min:  {samples.min():.3f}")
print(f"max:  {samples.max():.3f}")
plaintext
mean: 0.012
std:  0.998
min: -3.241
max:  3.018

Why does this matter for ML? Because PyTorch initializes most weight matrices from a normal distribution. The model starts with weights drawn from roughly N(0, 0.02). If you initialize weights too large or too small, training breaks before it starts. The distribution you initialize from has a real effect on whether the model learns at all.
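A quick way to see why, sketched with a hypothetical layer width of 768 and no nonlinearity — just repeated matrix multiplies through freshly initialized weights:

```python
import torch

torch.manual_seed(0)
width = 768

x_small = torch.randn(1, width)
x_large = torch.randn(1, width)

# pass activations through 5 stacked linear layers
for _ in range(5):
    w_small = torch.empty(width, width).normal_(mean=0.0, std=0.02)
    w_large = torch.empty(width, width).normal_(mean=0.0, std=1.0)
    x_small = x_small @ w_small
    x_large = x_large @ w_large

print(f"activation std, init std=0.02: {x_small.std().item():.4f}")  # stays small
print(f"activation std, init std=1.00: {x_large.std().item():.2e}")  # explodes
```

With std=1.0, five layers are enough to blow activations up by a factor of millions; gradients follow, and training diverges before it starts.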


Expected value: what an average really means

The expected value of a random variable is its probability-weighted average. If you repeat something many times, the expected value is what you would converge to.

python
# rolling a fair die: outcomes 1-6, each with probability 1/6
outcomes = [1, 2, 3, 4, 5, 6]
probs    = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]

expected = sum(o * p for o, p in zip(outcomes, probs))
print(f"expected value: {expected:.2f}")  # 3.50

In ML, loss functions are expected values. When you compute the loss over a batch of 32 examples, you are computing the average loss across those examples. Training minimizes the expected loss over the data distribution, which is another way of saying: you want the model to do well on average across all possible inputs, not just the specific batch you are looking at right now.
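You can see that expectation directly in PyTorch: asking F.cross_entropy for per-example losses with reduction="none" shows that the default batch loss is exactly their mean. A small sketch with made-up shapes:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits  = torch.randn(32, 10)            # batch of 32 examples, 10 classes
targets = torch.randint(0, 10, (32,))

per_example = F.cross_entropy(logits, targets, reduction="none")  # shape (32,)
batch_loss  = F.cross_entropy(logits, targets)                    # default: mean

print(per_example.shape)                               # torch.Size([32])
print(torch.allclose(per_example.mean(), batch_loss))  # True
```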


Entropy: measuring uncertainty

Entropy measures how uncertain or spread out a distribution is. High entropy means the distribution is flat and unpredictable. Low entropy means it is peaked and confident.

python
import torch
import torch.nn.functional as F

def entropy(probs):
    # H(p) = -sum(p * log(p))
    # add small epsilon to avoid log(0)
    return -(probs * torch.log(probs + 1e-9)).sum()

# uniform distribution over 4 tokens — maximum uncertainty
uniform = torch.tensor([0.25, 0.25, 0.25, 0.25])
print(f"uniform entropy: {entropy(uniform):.4f}")   # 1.3863

# peaked distribution — model is very confident about token 0
peaked = torch.tensor([0.97, 0.01, 0.01, 0.01])
print(f"peaked entropy:  {entropy(peaked):.4f}")    # 0.1677

# somewhere in between
mixed = torch.tensor([0.60, 0.20, 0.15, 0.05])
print(f"mixed entropy:   {entropy(mixed):.4f}")     # 1.0627

This is not abstract. When a language model is early in training, its output distribution looks roughly uniform — high entropy, no confidence. As training progresses and the model learns real patterns, the distribution peaks on likely tokens and entropy drops.

When you see the loss going down during training, entropy going down is a big part of what is happening. The model is getting more decisive.

Real example: the temperature parameter in text generation directly controls entropy. Lower temperature sharpens the distribution (lower entropy, more predictable output). Higher temperature flattens it (higher entropy, more variety).

python
logits = torch.tensor([3.0, 1.5, 0.5, -1.0])

temps = [0.5, 1.0, 2.0]
for t in temps:
    probs = F.softmax(logits / t, dim=0)
    print(f"temp={t}: probs={probs.numpy().round(3)} | entropy={entropy(probs):.3f}")
plaintext
temp=0.5: probs=[0.946 0.047 0.006 0.   ] | entropy=0.231
temp=1.0: probs=[0.756 0.169 0.062 0.014] | entropy=0.744
temp=2.0: probs=[0.528 0.249 0.151 0.071] | entropy=1.158

Temperature 0.5 makes the model commit hard to token 0. Temperature 2.0 gives the other tokens a real chance. The temperature parameter in your generate() function is just dividing logits before softmax. That is the whole implementation.
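To make that concrete, here is a toy decode loop. The fake_model stub and its fixed logits are invented for illustration — a real model would return different logits at each step:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def fake_model(tokens):
    # stand-in for a real network: always returns the same 4-token logits
    return torch.tensor([3.0, 1.5, 0.5, -1.0])

def sample_next(tokens, temperature=1.0):
    logits = fake_model(tokens)
    probs = F.softmax(logits / temperature, dim=0)  # the entire temperature trick
    return torch.multinomial(probs, num_samples=1).item()

tokens = [0]
for _ in range(10):
    tokens.append(sample_next(tokens, temperature=0.8))
print(tokens)  # mostly token 0, with occasional variety
```

Raise the temperature argument and the generated sequence gets noisier; lower it toward zero and the loop commits to token 0 almost every time.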


Conditional probability

Conditional probability asks: given that something has already happened, what is the probability of something else?

Written as P(A | B) — probability of A given B.

This is the mathematical foundation of language modeling. A language model is computing P(next_token | all_previous_tokens). Every single forward pass. Given the sentence so far, what is the probability distribution over what comes next?

python
# a toy example with word frequencies
# counts of word pairs in a tiny corpus
bigram_counts = {
    ("the", "cat"): 15,
    ("the", "dog"): 10,
    ("the", "bird"): 5,
    ("the", "house"): 20,
}

total_after_the = sum(bigram_counts.values())  # 50

# P(word | "the") — what comes after "the"?
for (prev, next_word), count in bigram_counts.items():
    prob = count / total_after_the
    print(f"P({next_word!r} | 'the') = {prob:.2f}")
plaintext
P('cat'   | 'the') = 0.30
P('dog'   | 'the') = 0.20
P('bird'  | 'the') = 0.10
P('house' | 'the') = 0.40

A bigram model like this is a language model. A terrible one, but the same idea. GPT-4 is computing the same conditional probability — P(next_token | all_previous_tokens) — just with a rumored 1.8 trillion parameters instead of a lookup table.
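And you can generate from it. A sketch using Python's random.choices to sample from P(word | "the") — the same sample-from-a-conditional-distribution step GPT performs, shrunk to four entries:

```python
import random

random.seed(0)

bigram_counts = {
    ("the", "cat"): 15,
    ("the", "dog"): 10,
    ("the", "bird"): 5,
    ("the", "house"): 20,
}

# P(word | "the") expressed as candidate words plus weights
next_words = [nxt for (_, nxt) in bigram_counts]
weights    = [bigram_counts[("the", w)] for w in next_words]

# more frequent continuations come up more often
for _ in range(5):
    print("the", random.choices(next_words, weights=weights, k=1)[0])
```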


Bayes' theorem: updating beliefs with evidence

Bayes' theorem is the formula for updating your probability estimate when you see new evidence.

plaintext
P(hypothesis | evidence) = P(evidence | hypothesis) * P(hypothesis) / P(evidence)

The components:

  • P(hypothesis) is the prior — what you believed before seeing evidence
  • P(evidence | hypothesis) is the likelihood — how probable is the evidence if the hypothesis is true
  • P(hypothesis | evidence) is the posterior — updated belief after seeing evidence

Real example. You have a spam filter.

python
# prior: 30% of emails are spam based on history
p_spam = 0.30
p_not_spam = 0.70

# likelihood: "FREE" appears in 80% of spam, 5% of normal email
p_free_given_spam     = 0.80
p_free_given_not_spam = 0.05

# total probability of seeing "FREE" in any email
p_free = (p_free_given_spam * p_spam) + (p_free_given_not_spam * p_not_spam)

# posterior: given we see "FREE", what's the probability it's spam?
p_spam_given_free = (p_free_given_spam * p_spam) / p_free

print(f"P(spam | 'FREE') = {p_spam_given_free:.3f}")  # 0.873
plaintext
P(spam | 'FREE') = 0.873

Seeing "FREE" pushes the spam probability from 30% to 87%. That is Bayesian updating.

Neural networks do not literally apply Bayes' theorem on every forward pass. But training is doing something Bayesian at a high level: you start with a random prior (initial weights), observe evidence (training data), and the loss function pushes the model toward weights that maximize the likelihood of that data. The posterior is the trained model.


Likelihood and maximum likelihood estimation

Likelihood is the probability of your observed data given your model's current parameters. Training a neural network is finding the parameters that maximize this likelihood.

Maximum likelihood estimation (MLE) sounds academic but you are already using it. Cross-entropy loss is the negative log likelihood. Minimizing cross-entropy is the same as maximizing likelihood. They are the same thing.

python
import torch
import torch.nn.functional as F

# true label: token index 2 is the correct next token
target = torch.tensor([2])

# model's predicted probabilities over 5-token vocabulary
logits = torch.tensor([[1.2, 0.5, 2.8, 0.3, -0.4]])
probs = F.softmax(logits, dim=1)

print("probabilities:", probs.round(decimals=3))
# tensor([[0.142, 0.070, 0.702, 0.058, 0.029]])  — model gives token 2 a 70% chance

# cross-entropy loss = -log(probability of the correct token)
loss = F.cross_entropy(logits, target)
print(f"cross-entropy loss: {loss.item():.4f}")   # 0.3542

# manually: -log(0.702)
import math
manual_loss = -math.log(0.702)
print(f"manual -log(p):     {manual_loss:.4f}")   # 0.3538 — same thing, up to rounding

The loss is literally just the negative log of the probability the model assigned to the correct answer. If the model was 70% confident and correct, the loss is 0.35. If the model was only 5% confident, the loss would be −log(0.05) ≈ 3.0. High confidence on the right answer = low loss. That is the entire training objective.

Why log? Two reasons. First, log turns multiplications into additions, which is computationally easier. Second, it penalizes overconfident wrong answers severely — if the model assigns 99% probability to the wrong token, the loss is enormous.
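The first reason is easy to demonstrate. The likelihood of a long sequence is a product of many small probabilities, and that product underflows floating point, while the equivalent sum of logs stays exact:

```python
import math

p = 0.1          # probability assigned to each of 1000 tokens in a sequence

product = 1.0
for _ in range(1000):
    product *= p
print(product)   # 0.0 — 1e-1000 underflows float64 entirely

log_likelihood = sum(math.log(p) for _ in range(1000))
print(log_likelihood)   # -2302.585... — stable, no underflow
```

This is why every framework works in log-probabilities: the information survives where the raw product does not.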


Cross-entropy loss in practice

Cross-entropy is how every classification task and every language model trains. It is worth seeing how it scales.

python
import torch
import torch.nn.functional as F

torch.manual_seed(42)
vocab_size = 50_257   # GPT-2 vocabulary size
seq_len    = 128
batch_size = 8

# simulated model output: raw logits
logits = torch.randn(batch_size, seq_len, vocab_size)

# simulated targets: correct next token at each position
targets = torch.randint(0, vocab_size, (batch_size, seq_len))

# reshape for cross-entropy: (B*T, vocab) vs (B*T,)
loss = F.cross_entropy(
    logits.view(-1, vocab_size),
    targets.view(-1)
)

print(f"loss: {loss.item():.4f}")
# random model: loss ≈ ln(50257) ≈ 10.82
print(f"ln(vocab_size): {torch.log(torch.tensor(50257.0)):.4f}")
plaintext
loss:           10.8193
ln(vocab_size): 10.8249

A randomly initialized model starts with a loss of roughly ln(50257) ≈ 10.82. That is what maximum uncertainty looks like in numbers. Every token gets roughly equal probability, so the correct token gets roughly 1/50257 probability, and −log(1/50257) ≈ 10.82.

When I built NanoGPT and watched the loss start at 10.8, I finally understood what that number meant. It was not just a metric. It was the model starting from complete ignorance about which of 50,257 tokens might come next.

By the end of training, the loss dropped to 1.59. That means the model had narrowed its effective vocabulary at each step from 50,000 tokens to roughly e^1.59 ≈ 4.9 plausible options. Not one right answer, but a small cluster of likely ones. That is real learning.


The connection to Article 3: neural networks

Everything in this article feeds directly into how neural networks train.

Where each concept appears in neural network training:

  • Probability distribution — the output of every forward pass (after softmax)
  • Conditional probability — the entire language modeling objective
  • Entropy — measures model confidence; tracked alongside loss
  • Cross-entropy — the loss function for classification and language modeling
  • Maximum likelihood — what gradient descent on cross-entropy is actually doing
  • Normal distribution — weight initialization
  • Expected value — loss averaged across a batch

Article 3 will take a neural network and walk through a full training loop. When I write loss = F.cross_entropy(logits, targets) there, you will know what that line is computing and why. When the model runs F.softmax(logits, dim=-1) during generation, you will know what comes out and what the numbers mean.

The math in Article 1 told you how the operations run. The math in this article tells you what the outputs represent. Together they are the foundation for everything that comes after.

A quick sanity check

If someone shows you a model with a validation loss of 2.3 and asks if that is good, you can now answer. e^2.3 ≈ 10. That means the model has narrowed the next token down to roughly 10 plausible options from a vocabulary of 50,000. Whether that is good depends on the task, but you are no longer just looking at an abstract number.
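This e^loss quantity has a name: perplexity. Computing it is one line — the helper below is my own sketch, not a library API:

```python
import math

def perplexity(cross_entropy: float) -> float:
    """e^loss: the effective number of equally likely next-token options."""
    return math.exp(cross_entropy)

print(f"{perplexity(10.82):,.0f}")  # ≈ 50,000 — random guessing over GPT-2's vocab
print(f"{perplexity(2.30):.1f}")    # ≈ 10 — the validation loss above
print(f"{perplexity(1.59):.1f}")    # ≈ 4.9 — the NanoGPT run from earlier
```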


Next in the series

Article 3 covers neural networks and backpropagation. You will see how a network is structured, how a forward pass moves data through it using the matrix math from Article 1, and how the loss computed using the probability math from this article gets turned into gradients that update the weights. It is the first article where training actually happens.
