Probability & Statistics

A complete interactive guide — from coin flips to ML loss functions to quantum measurement.
Estimated reading time: 2-3 hours.


1. What Is Probability?

Probability measures how likely something is to happen. It's a number between 0 and 1:

$$P(\text{event}) = \frac{\text{number of favorable outcomes}}{\text{total number of possible outcomes}}$$

$P = 0$: impossible. $P = 1$: certain. $P = 0.5$: 50-50 coin flip.

That's the classical definition — works when all outcomes are equally likely (dice, coins, cards).

Three interpretations of probability:
Classical: Count equally likely outcomes (dice, cards).
Frequentist: Long-run frequency of observation (93% of deploys succeed).
Bayesian: Degree of belief, updated with evidence (I'm 80% confident this model is correct).
Worked Example: Drawing Cards

A standard deck has 52 cards. What's $P(\text{heart})$?

There are 13 hearts in a 52-card deck.
$P(\text{heart}) = \frac{13}{52} = \frac{1}{4} = 0.25 = 25\%$

What's $P(\text{face card})$? There are 12 face cards (J, Q, K in 4 suits).

$P(\text{face}) = \frac{12}{52} = \frac{3}{13} \approx 23.1\%$
ML Connection: Training/Test Accuracy as Probability

When you report "model accuracy = 94%", you're saying: if you randomly pick a sample from the test set, $P(\text{correct prediction}) = 0.94$. This is the frequentist interpretation — accuracy is estimated from observed data.

Interactive: Probability from Data

2. Basic Rules of Probability

Rule 1: Complement

$$P(\text{not } A) = P(\bar{A}) = 1 - P(A)$$

Often easier to calculate what you DON'T want, then subtract.

Worked Example: At Least One Head in 5 Flips
$P(\text{at least one head}) = 1 - P(\text{no heads}) = 1 - P(\text{all tails})$
$= 1 - (0.5)^5 = 1 - 0.03125 = 0.96875 = 96.9\%$
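The complement trick is easy to verify by simulation. A minimal sketch using only the standard library (the function name is mine):

```python
import random

def at_least_one_head(n_flips=5, trials=100_000, seed=0):
    """Estimate P(at least one head in n_flips fair flips) by simulation."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < 0.5 for _ in range(n_flips))  # any head this trial?
        for _ in range(trials)
    )
    return hits / trials

exact = 1 - 0.5 ** 5            # complement rule: 0.96875
estimate = at_least_one_head()  # lands near 0.96875
```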

Rule 2: Addition (OR)

$$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$$

Subtract the overlap to avoid double-counting. If mutually exclusive: $P(A \text{ or } B) = P(A) + P(B)$.

Rule 3: Multiplication (AND)

$$P(A \text{ and } B) = P(A) \times P(B|A)$$

If independent: $P(A \text{ and } B) = P(A) \times P(B)$.

Worked Example: Two Draws Without Replacement

Draw 2 cards from a deck without replacement. $P(\text{both aces})$?

$P(\text{1st ace}) = \frac{4}{52}$
$P(\text{2nd ace} | \text{1st ace}) = \frac{3}{51}$ (one ace gone, one card gone)
$P(\text{both aces}) = \frac{4}{52} \times \frac{3}{51} = \frac{12}{2652} = \frac{1}{221} \approx 0.45\%$
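The same multiplication-rule arithmetic can be checked exactly with `fractions` and approximately by simulation (a sketch; the helper name is mine):

```python
import random
from fractions import Fraction

# Exact value via the multiplication rule: (4/52) * (3/51) = 1/221
exact = Fraction(4, 52) * Fraction(3, 51)

def both_aces(trials=100_000, seed=0):
    """Simulate drawing 2 cards without replacement from a 52-card deck."""
    rng = random.Random(seed)
    deck = ["A"] * 4 + ["x"] * 48
    hits = 0
    for _ in range(trials):
        draw = rng.sample(deck, 2)  # sample() draws without replacement
        hits += draw[0] == "A" and draw[1] == "A"
    return hits / trials
```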
ML Connection: Probability Rules in Decision Trees

Decision trees split data by asking sequential questions. At each leaf, the class probability is $P(Y=c | X_1, X_2, \ldots)$. The multiplication rule is how we chain conditional probabilities down the tree: $P(A \text{ and } B) = P(A) \cdot P(B|A)$.

3. Conditional Probability

Conditional probability is the probability of an event GIVEN that another event has occurred:

$$P(A|B) = \frac{P(A \text{ and } B)}{P(B)}$$

"If we know B happened, what fraction of the B-world also has A?"

Key insight: Conditional probability shrinks the sample space. You're no longer looking at the whole universe — only the part where $B$ is true. This is the foundation of all Bayesian reasoning and every ML classifier.
Worked Example: Spam Filter Setup

Your inbox: 40% spam. 80% of spam contains "FREE". 10% of non-spam contains "FREE".

$P(\text{spam}) = 0.4$, $P(\text{"FREE"}|\text{spam}) = 0.8$, $P(\text{"FREE"}|\text{not spam}) = 0.1$
Total probability of "FREE":
$P(\text{"FREE"}) = 0.8 \times 0.4 + 0.1 \times 0.6 = 0.32 + 0.06 = 0.38$

This uses the Law of Total Probability: $P(B) = P(B|A)P(A) + P(B|\bar{A})P(\bar{A})$.
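The arithmetic above, spelled out (variable names are mine):

```python
p_spam = 0.4
p_free_given_spam = 0.8
p_free_given_ham = 0.1

# Law of total probability over the spam / not-spam partition:
# P(FREE) = P(FREE|spam)P(spam) + P(FREE|not spam)P(not spam)
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)  # 0.38
```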

Law of Total Probability

$$P(B) = \sum_{i} P(B|A_i) \cdot P(A_i)$$

If $A_1, A_2, \ldots$ partition the sample space (they cover everything and don't overlap), then the total probability of $B$ is the weighted sum across all partitions.

ML Connection: Marginalisation

In ML, this appears as marginalisation. If a generative model has latent variable $Z$:

$P(X) = \sum_z P(X|Z=z) \cdot P(Z=z)$

This is how Gaussian Mixture Models, Variational Autoencoders, and Hidden Markov Models compute the probability of observed data by summing over all possible hidden states.

4. Bayes' Theorem

Bayes' Theorem lets you flip conditional probabilities. If you know $P(B|A)$, it tells you $P(A|B)$:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

The terms have names: $P(A)$ is the prior, $P(B|A)$ is the likelihood, $P(B)$ is the evidence, and $P(A|B)$ is the posterior.

Key insight: Bayes' Theorem is the mathematical engine of learning from data. Start with a belief (prior), observe evidence (data), update the belief (posterior). This is literally what training a Bayesian ML model does — and it's how humans update beliefs too.
Worked Example: Medical Testing (The Classic Gotcha)

Disease affects 1% of people. Test is 99% accurate (99% true positive, 99% true negative). You test positive. What's the chance you actually have the disease?

Prior: $P(\text{disease}) = 0.01$
Likelihood: $P(\text{+}|\text{disease}) = 0.99$
False positive: $P(\text{+}|\text{no disease}) = 0.01$
Evidence: $P(\text{+}) = 0.99 \times 0.01 + 0.01 \times 0.99 = 0.0198$
Posterior: $P(\text{disease}|\text{+}) = \frac{0.99 \times 0.01}{0.0198} = 0.5 = 50\%$

Only 50%! Because the disease is rare, half of positive results are false positives. This is why anomaly detectors have high false alarm rates when anomalies are rare — a direct consequence of Bayes.
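The whole calculation fits in a few lines, which makes it easy to see how the posterior collapses as the prior shrinks (a sketch; the function name is mine):

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | +) = P(+|disease)P(disease) / P(+), via total probability."""
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

posterior(0.01, 0.99, 0.01)   # 0.5: half of positives are false alarms
posterior(0.001, 0.99, 0.01)  # ~0.09: rarer disease, even worse
```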

Worked Example: Spam Classification (Bayes in Action)

From Section 3: email contains "FREE". Is it spam?

$P(\text{spam}|\text{"FREE"}) = \frac{P(\text{"FREE"}|\text{spam}) \cdot P(\text{spam})}{P(\text{"FREE"})} = \frac{0.8 \times 0.4}{0.38} = \frac{0.32}{0.38} = 84.2\%$

This is exactly how Naive Bayes spam filters work (Section 20).

Interactive: Bayes' Calculator

5. Independence

Two events are independent if one happening doesn't affect the other:

$$A \perp B \iff P(A \text{ and } B) = P(A) \times P(B)$$ $$\text{Equivalently: } P(A|B) = P(A)$$
Independence ≠ mutually exclusive. Mutually exclusive events (can't both happen) are actually maximally dependent — knowing one happened tells you the other didn't. Independent events CAN happen together; they just don't affect each other's probability.
Worked Example: Testing Independence

Roll two dice. Are "die 1 shows 6" and "sum is 7" independent?

$P(\text{die 1 = 6}) = 1/6$
$P(\text{sum = 7}) = 6/36 = 1/6$ (there are 6 ways: 1+6, 2+5, ...)
$P(\text{die 1 = 6 AND sum = 7}) = P(\text{die 1 = 6, die 2 = 1}) = 1/36$
$P(A) \times P(B) = 1/6 \times 1/6 = 1/36 = P(A \text{ and } B)$ ✓ Independent!
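With only 36 equally likely outcomes, the independence check can be done by exhaustive enumeration rather than algebra (a sketch using the standard library):

```python
from itertools import product

# All 36 equally likely outcomes of rolling two dice
outcomes = list(product(range(1, 7), repeat=2))
A = {o for o in outcomes if o[0] == 6}       # die 1 shows 6
B = {o for o in outcomes if sum(o) == 7}     # sum is 7

p_a = len(A) / 36
p_b = len(B) / 36
p_ab = len(A & B) / 36                       # only (6, 1) is in both

independent = abs(p_ab - p_a * p_b) < 1e-12  # True: 1/36 == 1/6 * 1/6
```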
ML Connection: The Naive Bayes "Naive" Assumption

Naive Bayes classifiers assume all features are conditionally independent given the class label: $P(x_1, x_2, \ldots | y) = \prod_i P(x_i | y)$. This is almost never true in practice (word frequencies ARE correlated), but the classifier works surprisingly well anyway. This "naive" independence assumption is what makes computation tractable — instead of estimating a joint distribution over all features, you only need individual feature distributions.

6. Random Variables & Distributions

A random variable $X$ is a number whose value depends on a random outcome. It maps outcomes to numbers.

Discrete vs Continuous

A discrete random variable takes countable values and is described by a PMF, $P(X = x)$. A continuous one takes values in an interval and is described by a PDF, $f(x)$, where only integrals of $f$ give probabilities.

CDF (Cumulative Distribution Function) — works for both:

$$F(x) = P(X \leq x)$$

For continuous: $F(x) = \int_{-\infty}^{x} f(t) \, dt$. The CDF always goes from 0 to 1, never decreases.

Key insight: For continuous variables, $P(X = \text{exactly } 3.14159...) = 0$. Only ranges have non-zero probability: $P(2 < X < 4) = \int_2^4 f(x) dx$. The PDF value $f(x)$ is a density, not a probability — it can be greater than 1!
Worked Example: Discrete PMF

A loaded die has: $P(1) = P(2) = P(3) = P(4) = P(5) = 0.1$, $P(6) = 0.5$.

Check: $5(0.1) + 0.5 = 0.5 + 0.5 = 1.0$ ✓ (PMF must sum to 1)
$P(X \geq 5) = P(5) + P(6) = 0.1 + 0.5 = 0.6$
$P(X \leq 3) = 3 \times 0.1 = 0.3$

7. Expected Value

The expected value (mean, first moment) is the long-run average:

$$E[X] = \sum_{x} x \cdot P(X = x) \quad \text{(discrete)} \qquad E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx \quad \text{(continuous)}$$

Linearity of expectation — always holds, even for dependent variables:

$$E[aX + bY + c] = aE[X] + bE[Y] + c$$
Worked Example: Expected Value of a Die
$E[X] = 1(\frac{1}{6}) + 2(\frac{1}{6}) + 3(\frac{1}{6}) + 4(\frac{1}{6}) + 5(\frac{1}{6}) + 6(\frac{1}{6}) = \frac{21}{6} = 3.5$

You can never roll 3.5 — the expected value doesn't have to be a possible outcome.

Worked Example: Linearity of Expectation

Roll 3 dice. What's the expected sum?

$E[X_1 + X_2 + X_3] = E[X_1] + E[X_2] + E[X_3] = 3.5 + 3.5 + 3.5 = 10.5$

No need to enumerate all $6^3 = 216$ outcomes! Linearity makes this trivial.
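Still, as a sanity check, brute-force enumeration of all 216 outcomes gives exactly the answer linearity hands us for free (a sketch):

```python
from itertools import product
from statistics import mean

# Brute force: average the sum over all 6**3 = 216 equally likely outcomes
sums = [sum(roll) for roll in product(range(1, 7), repeat=3)]
brute_force = mean(sums)   # 10.5

by_linearity = 3 * 3.5     # 10.5, no enumeration needed
```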

ML Connection: Expected Value = Loss Function

Training a model minimises the expected loss: $E_{\mathbf{x},y}[\mathcal{L}(f(\mathbf{x}), y)]$. Mean Squared Error, Cross-Entropy Loss, etc. are all expected values of a loss function over the data distribution. When we average the loss over a mini-batch, we're estimating this expected value.

Interactive: Expected Value Calculator

8. Variance & Standard Deviation

Expected value is the center. Variance is the spread.

$$\text{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2$$ $$\sigma = \sqrt{\text{Var}(X)} \qquad \text{(standard deviation — same units as data)}$$

Key properties:

$$\text{Var}(aX + b) = a^2 \text{Var}(X)$$ $$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$$

If $X$ and $Y$ are independent: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$.

Worked Example: Variance of a Fair Die
$E[X] = 3.5$ (from above)
$E[X^2] = 1^2(\frac{1}{6}) + 2^2(\frac{1}{6}) + \cdots + 6^2(\frac{1}{6}) = \frac{1+4+9+16+25+36}{6} = \frac{91}{6} \approx 15.17$
$\text{Var}(X) = 15.17 - 3.5^2 = 15.17 - 12.25 = 2.917$
$\sigma = \sqrt{2.917} \approx 1.71$
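The same computation in exact arithmetic, using `fractions` so no rounding sneaks in (a sketch):

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)

e_x = sum(x * p for x in faces)        # E[X]   = 7/2
e_x2 = sum(x * x * p for x in faces)   # E[X^2] = 91/6
var = e_x2 - e_x ** 2                  # 35/12 ≈ 2.917
sigma = float(var) ** 0.5              # ≈ 1.708
```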

9. Common Distributions

Bernoulli — One Yes/No Trial

$$X \in \{0, 1\}, \quad P(X=1) = p, \quad E[X] = p, \quad \text{Var}(X) = p(1-p)$$

Binomial — $n$ Independent Yes/No Trials

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad E[X] = np, \quad \text{Var}(X) = np(1-p)$$

Interactive: Binomial Distribution


Poisson — Rare Events at a Rate

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad E[X] = \lambda, \quad \text{Var}(X) = \lambda$$

$\lambda$ = average rate. Events per second, errors per hour, mutations per genome.

Exponential — Time Between Events

$$f(x) = \lambda e^{-\lambda x} \text{ for } x \geq 0, \quad E[X] = \frac{1}{\lambda}, \quad \text{Var}(X) = \frac{1}{\lambda^2}$$

If events arrive at Poisson rate $\lambda$, the waiting time between them is exponential. "Memoryless" — past waiting doesn't affect future waiting.

Uniform — All Values Equally Likely

$$f(x) = \frac{1}{b - a} \text{ for } a \leq x \leq b, \quad E[X] = \frac{a+b}{2}, \quad \text{Var}(X) = \frac{(b-a)^2}{12}$$
Worked Example: Which Distribution?
| Scenario | Distribution | Why |
| --- | --- | --- |
| 10 emails, each spam or not | Binomial(10, $p$) | Fixed $n$ trials, yes/no |
| Typos per page of code | Poisson($\lambda$) | Count of rare events in a window |
| Time until next server crash | Exponential($\lambda$) | Waiting time between Poisson events |
| Random number generator | Uniform(0, 1) | All values equally likely |
| Measurement errors | Normal($\mu$, $\sigma$) | Sum of many small random effects |

10. The Normal Distribution

The most important distribution in all of statistics. The bell curve.

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

The 68-95-99.7 rule: roughly 68% of values fall within $1\sigma$ of the mean, 95% within $2\sigma$, and 99.7% within $3\sigma$.

The Z-Score

$$z = \frac{x - \mu}{\sigma}$$

How many standard deviations is $x$ from the mean? $z = 2$ means "2 sigma above average", which puts you in roughly the top 2.3%.

Interactive: Normal Distribution

ML Connection: Why the Normal Distribution is Everywhere in ML

1. Weight initialisation: Neural network weights are typically initialised from $\mathcal{N}(0, \sigma^2)$.

2. Gaussian noise: VAEs, diffusion models, and regularisation all add $\mathcal{N}(0, \sigma^2)$ noise.

3. MSE loss: Minimising MSE is equivalent to maximum likelihood under a Gaussian noise model.

4. Batch normalisation: Forces activations to be approximately $\mathcal{N}(0, 1)$.

5. Gaussian Processes: A powerful Bayesian method where everything is multivariate normal.

11. Joint & Marginal Distributions

When you have two or more random variables, you need to describe their joint behavior — not just each one alone.

Joint PMF: $P(X = x, Y = y)$ — probability that $X = x$ AND $Y = y$ simultaneously.

Marginal PMF: $P(X = x) = \sum_y P(X = x, Y = y)$ — "sum out" the variable you don't care about.

Conditional: $P(Y = y | X = x) = \frac{P(X=x, Y=y)}{P(X=x)}$
Worked Example: Joint Distribution Table

Weather ($X$) vs Ice Cream Sales ($Y$):

|  | $Y$ = Low | $Y$ = High | Marginal $P(X)$ |
| --- | --- | --- | --- |
| $X$ = Cold | 0.40 | 0.10 | 0.50 |
| $X$ = Hot | 0.05 | 0.45 | 0.50 |
| Marginal $P(Y)$ | 0.45 | 0.55 | 1.00 |
$P(Y=\text{High}|X=\text{Hot}) = \frac{0.45}{0.50} = 0.90 = 90\%$
$P(Y=\text{High}|X=\text{Cold}) = \frac{0.10}{0.50} = 0.20 = 20\%$

Hot weather strongly predicts high ice cream sales. The joint distribution captures this relationship that the marginals alone can't.
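The marginal and conditional formulas can be read straight off the table programmatically (a sketch; the dictionary layout and names are mine):

```python
# Joint distribution P(X, Y) from the table above
joint = {
    ("Cold", "Low"): 0.40, ("Cold", "High"): 0.10,
    ("Hot", "Low"): 0.05,  ("Hot", "High"): 0.45,
}

def marginal_x(x):
    """Sum out Y: P(X = x) = sum_y P(X = x, Y = y)."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def conditional_y_given_x(y, x):
    """P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)."""
    return joint[(x, y)] / marginal_x(x)

p_high_given_hot = conditional_y_given_x("High", "Hot")  # 0.45 / 0.50 = 0.9
```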

Covariance — Measuring Joint Variation

$$\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y]$$

Positive: $X$ and $Y$ move together. Negative: they move opposite. Zero: no linear relationship.

Key insight: If $X$ and $Y$ are independent, then $\text{Cov}(X, Y) = 0$. But the converse is NOT true — zero covariance doesn't guarantee independence (they could have a nonlinear relationship).

12. Descriptive Statistics

Measures of Center

Mean: $\bar{x} = \frac{1}{n}\sum x_i$ — sensitive to outliers.

Median: Middle value when sorted — robust to outliers.

Mode: Most frequent value.

Measures of Spread

Variance: $s^2 = \frac{1}{n-1}\sum(x_i - \bar{x})^2$ (Bessel's correction: $n-1$ instead of $n$)

Standard Deviation: $s = \sqrt{s^2}$ — same units as data

IQR: $Q3 - Q1$ — range of middle 50%, robust to outliers

Interactive: Descriptive Stats Calculator

13. Correlation

$$r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \cdot \sum(y_i - \bar{y})^2}} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$$

$r \in [-1, +1]$. It's the normalised covariance — covariance scaled to be unitless.

Correlation ≠ causation. Ice cream sales and drowning deaths correlate ($r \approx 0.8$) because both are caused by hot weather (confounding variable). In ML, high feature correlation → multicollinearity → drop redundant features.

Interactive: Correlation Explorer

Click canvas to add points. Watch correlation update live.

14. Linear Regression

$$\hat{y} = mx + b \qquad \text{Minimize } \sum(y_i - \hat{y}_i)^2$$ $$m = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - (\sum x_i)^2} \qquad b = \bar{y} - m\bar{x}$$
$$R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$

$R^2$ = fraction of variance explained by the model. For simple linear regression, $R^2 = r^2$.

Key insight for ML: Linear regression is the simplest supervised learning model. The "least squares" objective is equivalent to Maximum Likelihood Estimation assuming Gaussian noise (Section 19). Every gradient descent optimization in deep learning is a generalization of this same idea.
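The closed-form slope and intercept above translate directly to code (a sketch; the function name is mine):

```python
def fit_line(xs, ys):
    """Closed-form least-squares slope m and intercept b."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    m = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b = sy / n - m * sx / n   # b = mean(y) - m * mean(x)
    return m, b

# Collinear data y = 2x + 1 is recovered exactly
m, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])   # m = 2.0, b = 1.0
```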

Interactive: Linear Regression

Click to add points. Best-fit line updates automatically.

15. Hypothesis Testing

  1. $H_0$ (null): "No effect. Any difference is chance."
  2. $H_1$ (alternative): "There IS a real effect."
  3. Compute a test statistic from data.
  4. p-value: probability of data this extreme if $H_0$ true.
  5. If p-value < $\alpha$ (usually 0.05): reject $H_0$.
|  | $H_0$ true | $H_0$ false |
| --- | --- | --- |
| Reject $H_0$ | Type I Error (false positive, $\alpha$) | Correct! |
| Don't reject | Correct! | Type II Error (false negative, $\beta$) |
ML Connection — Precision & Recall: In classification, Type I = false positive (low precision), Type II = false negative (low recall). The precision-recall tradeoff mirrors the $\alpha$-$\beta$ tradeoff in hypothesis testing. Lowering $\alpha$ reduces false positives but increases false negatives.

Interactive: A/B Test Calculator


16. Confidence Intervals

$$\bar{x} \pm z \cdot \frac{s}{\sqrt{n}}$$

90%: $z = 1.645$. 95%: $z = 1.960$. 99%: $z = 2.576$.

Key: to halve uncertainty, you need 4× the data (because $\sqrt{4n} = 2\sqrt{n}$).

Interactive: Confidence Interval Calculator

17. Law of Large Numbers & Central Limit Theorem

Law of Large Numbers

$$\bar{X}_n \xrightarrow{n \to \infty} E[X]$$

More data = sample average converges to true mean. No shortcut.

Central Limit Theorem (CLT)

No matter what the original distribution, the average of many independent samples → Normal.

$$\bar{X}_n \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)$$

Usually $n \geq 30$ is enough. This is why the Normal distribution appears everywhere.

ML Connection: The CLT is why mini-batch gradient estimates work. Each mini-batch average is approximately normal around the true gradient (by CLT), even if individual sample gradients are weirdly distributed. Larger batches → lower variance of the gradient estimate → smoother training.
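The CLT is easy to see empirically: average 30 uniform(0, 1) draws many times, and the means cluster normally around 0.5 with standard deviation $\sigma/\sqrt{n} = \sqrt{1/(12 \cdot 30)} \approx 0.053$ (a sketch; the function name is mine):

```python
import random
from statistics import mean, stdev

def sample_means(n=30, reps=2000, seed=0):
    """Means of n uniform(0,1) draws; CLT predicts ~ N(0.5, 1/(12n))."""
    rng = random.Random(seed)
    return [mean(rng.random() for _ in range(n)) for _ in range(reps)]

means = sample_means()
# mean(means) is close to 0.5; stdev(means) is close to sqrt(1/360) ≈ 0.053
```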

Interactive: CLT Simulator


18. Entropy & Information Theory

Entropy measures uncertainty or information content of a probability distribution. High entropy = high uncertainty = lots of surprise. Low entropy = predictable.

Shannon Entropy: $$H(X) = -\sum_{x} P(x) \log_2 P(x)$$

Measured in bits (with $\log_2$) or nats (with $\ln$). By convention, $0 \log 0 = 0$.

Intuition: Entropy answers "how many yes/no questions do I need, on average, to determine the outcome?"

Worked Example: Computing Entropy

Weather: $P(\text{sunny}) = 0.7$, $P(\text{rainy}) = 0.2$, $P(\text{snowy}) = 0.1$.

$H = -(0.7 \log_2 0.7 + 0.2 \log_2 0.2 + 0.1 \log_2 0.1)$
$= -(0.7 \times (-0.515) + 0.2 \times (-2.322) + 0.1 \times (-3.322))$
$= -(- 0.360 - 0.464 - 0.332) = 1.157$ bits

Compare: if all three were equally likely, $H = \log_2 3 = 1.585$ bits. The uneven distribution has less entropy (less uncertainty — we can usually guess "sunny").
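The entropy formula is a short function; skipping zero-probability terms implements the $0 \log 0 = 0$ convention (a sketch):

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

weather = entropy([0.7, 0.2, 0.1])   # ≈ 1.157 bits
uniform = entropy([1/3, 1/3, 1/3])   # log2(3) ≈ 1.585 bits
```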

Cross-Entropy — The ML Loss Function

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

$p$ = true distribution, $q$ = model's predicted distribution. Cross-entropy measures how well $q$ approximates $p$.

Key insight: Cross-entropy loss is the most common loss function for classification in ML. When you train a neural network with `CrossEntropyLoss`, you're literally minimizing $H(p, q)$ where $p$ is the one-hot true label and $q$ is the model's softmax output. Minimizing cross-entropy = making the model's predictions match the true distribution.
Worked Example: Cross-Entropy for Classification

True label: cat (class 0). One-hot: $p = [1, 0, 0]$ (cat, dog, bird).

Model predicts: $q = [0.7, 0.2, 0.1]$.

$H(p, q) = -(1 \cdot \log 0.7 + 0 \cdot \log 0.2 + 0 \cdot \log 0.1) = -\log 0.7 \approx 0.357$

If model predicted $q = [0.95, 0.03, 0.02]$: $H(p, q) = -\log 0.95 \approx 0.051$ (much lower — better prediction!).

If model predicted $q = [0.01, 0.98, 0.01]$: $H(p, q) = -\log 0.01 \approx 4.605$ (terrible — very high loss).
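For a one-hot $p$, cross-entropy reduces to $-\log q(\text{true class})$, as a few lines make plain (a sketch, using the natural log as in the example):

```python
from math import log

def cross_entropy(p, q):
    """H(p, q) in nats; assumes q > 0 wherever p > 0."""
    return -sum(pi * log(qi) for pi, qi in zip(p, q) if pi > 0)

true = [1, 0, 0]  # one-hot label: cat
okay = cross_entropy(true, [0.70, 0.20, 0.10])   # ≈ 0.357
good = cross_entropy(true, [0.95, 0.03, 0.02])   # ≈ 0.051
bad  = cross_entropy(true, [0.01, 0.98, 0.01])   # ≈ 4.605
```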

KL Divergence — Distance Between Distributions

$$D_{KL}(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = H(p, q) - H(p)$$

$D_{KL} \geq 0$, with equality iff $p = q$. NOT symmetric: $D_{KL}(p\|q) \neq D_{KL}(q\|p)$.

KL divergence = "extra bits needed because you used code $q$ instead of the optimal code $p$." It measures the "wasted information" from using the wrong distribution.

ML Connection: Cross-entropy = Entropy + KL divergence: $H(p,q) = H(p) + D_{KL}(p\|q)$. Since $H(p)$ is fixed (determined by the data), minimizing cross-entropy loss = minimizing KL divergence = making the model's distribution as close to the true distribution as possible. This is why cross-entropy is the "right" loss for classification.

Interactive: Entropy Calculator

ML Connection: Decision Tree Splitting with Entropy

Decision trees pick the feature split that reduces entropy the most. Information gain = entropy before split − weighted entropy after split. A split that produces pure nodes (all same class) reduces entropy to 0 — maximum information gain. This is exactly the criterion used in ID3 and C4.5 algorithms.

19. Maximum Likelihood Estimation (MLE)

Given observed data, what parameters make the data most probable? That's MLE — the most fundamental estimation method in statistics and ML.

$$\hat{\theta}_{MLE} = \arg\max_{\theta} \; P(\text{data} | \theta) = \arg\max_{\theta} \; \prod_{i=1}^{n} P(x_i | \theta)$$

Each data point is assumed independent, so the joint probability is the product. In practice, we maximize the log-likelihood (sums are easier than products):

$$\hat{\theta}_{MLE} = \arg\max_{\theta} \; \sum_{i=1}^{n} \log P(x_i | \theta)$$
Key insight: Almost every "training" procedure in ML is MLE in disguise. Minimizing cross-entropy loss = maximizing log-likelihood. Minimizing MSE = maximizing log-likelihood under Gaussian noise. When you call `model.fit()` or run gradient descent on a loss function, you're doing MLE.
Worked Example: MLE for a Coin

Flip a coin 100 times. Get 62 heads. What's the MLE for $p$ (probability of heads)?

Likelihood: $L(p) = \binom{100}{62} p^{62}(1-p)^{38}$
Log-likelihood: $\ell(p) = 62\log p + 38\log(1-p) + \text{const}$
Differentiate, set to 0: $\frac{d\ell}{dp} = \frac{62}{p} - \frac{38}{1-p} = 0$
$62(1-p) = 38p \Rightarrow 62 = 100p \Rightarrow \hat{p} = 0.62$

The MLE is just the sample proportion — intuitive! 62 heads out of 100 → $\hat{p} = 0.62$.
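A grid search over the log-likelihood confirms the calculus: the maximum sits at the sample proportion (a sketch; names are mine):

```python
from math import log

heads, flips = 62, 100

def log_likelihood(p):
    """Binomial log-likelihood up to an additive constant."""
    return heads * log(p) + (flips - heads) * log(1 - p)

# Evaluate on a fine grid of p values in (0, 1); concavity means the
# grid maximum is the grid point nearest the true optimum.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_likelihood)   # 0.62, the sample proportion
```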

Worked Example: MLE for Normal Distribution = Least Squares

Data $y_i = f(x_i) + \epsilon_i$ where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.

$P(y_i | x_i, \theta) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y_i - f(x_i;\theta))^2}{2\sigma^2}\right)$
$\log L = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum(y_i - f(x_i;\theta))^2$
Maximising = minimising $\sum(y_i - f(x_i;\theta))^2$ = least squares!

So linear regression with MSE loss is secretly MLE under a Gaussian noise assumption.

MAP: Bayesian Alternative to MLE

$$\hat{\theta}_{MAP} = \arg\max_{\theta} \; P(\theta | \text{data}) = \arg\max_{\theta} \; P(\text{data}|\theta) \cdot P(\theta)$$

MAP = MLE + prior. The prior $P(\theta)$ encodes your beliefs before seeing data. With lots of data, MAP → MLE.

When the prior says "parameters should be small" (e.g., Gaussian prior $P(\theta) \propto e^{-\lambda\|\theta\|^2}$), MAP becomes L2 regularisation (weight decay in neural networks).

20. The Naive Bayes Classifier

Naive Bayes is Bayes' Theorem turned into a classifier. Despite its simplicity, it works remarkably well for spam filtering, text classification, and more.

$$P(y | x_1, x_2, \ldots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i | y)}{P(x_1, x_2, \ldots, x_n)}$$

The "naive" assumption: features are conditionally independent given the class. This makes the product factorize.

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i | y)$$
Worked Example: Spam Classifier with Two Words

Training data: $P(\text{spam}) = 0.4$, $P(\text{not spam}) = 0.6$.

Word probabilities:

| Word | $P(\text{word} \mid \text{spam})$ | $P(\text{word} \mid \text{not spam})$ |
| --- | --- | --- |
| "FREE" | 0.80 | 0.10 |
| "meeting" | 0.05 | 0.50 |

New email contains both "FREE" and "meeting". Is it spam?

Spam score: $P(\text{spam}) \cdot P(\text{"FREE"}|\text{spam}) \cdot P(\text{"meeting"}|\text{spam})$ $= 0.4 \times 0.80 \times 0.05 = 0.016$
Ham score: $P(\text{not spam}) \cdot P(\text{"FREE"}|\text{not spam}) \cdot P(\text{"meeting"}|\text{not spam})$ $= 0.6 \times 0.10 \times 0.50 = 0.030$
Normalise: $P(\text{spam}|\text{words}) = \frac{0.016}{0.016 + 0.030} = \frac{0.016}{0.046} = 34.8\%$
Prediction: Not spam (65.2% > 34.8%). The word "meeting" pulled it toward ham despite "FREE" pulling toward spam.
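The whole classifier is a prior times a product of per-word likelihoods, then a normalisation (a sketch; names and data layout are mine):

```python
P_SPAM = 0.4
LIKELIHOODS = {              # word -> (P(word|spam), P(word|not spam))
    "FREE": (0.80, 0.10),
    "meeting": (0.05, 0.50),
}

def p_spam_given(words):
    """Naive Bayes posterior: prior times per-word likelihoods, normalised."""
    spam_score, ham_score = P_SPAM, 1 - P_SPAM
    for w in words:
        pw_spam, pw_ham = LIKELIHOODS[w]
        spam_score *= pw_spam
        ham_score *= pw_ham
    return spam_score / (spam_score + ham_score)

p_spam_given(["FREE", "meeting"])   # ≈ 0.348 -> predict not spam
p_spam_given(["FREE"])              # ≈ 0.842 -> predict spam
```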
Why "Naive" Works: The Bias-Variance Tradeoff

The independence assumption is wrong (words ARE correlated). But it has low variance — the model has very few parameters to estimate ($n \times k$ instead of $n^k$), so it doesn't overfit. For small datasets, this low variance often more than compensates for the bias from the wrong assumption. This is a classic bias-variance tradeoff win.

21. Logistic Regression & the Sigmoid

Linear regression predicts a continuous value. But for classification, we need a probability between 0 and 1. Enter the sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Maps any real number to $(0, 1)$. $\sigma(0) = 0.5$, $\sigma(\infty) = 1$, $\sigma(-\infty) = 0$.

Logistic regression: Apply a linear model, then squeeze through the sigmoid:

$$P(y = 1 | \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + b)}}$$
Key insight: The sigmoid output IS a probability. Training uses binary cross-entropy loss: $$\mathcal{L} = -[y\log(\hat{y}) + (1-y)\log(1-\hat{y})]$$ This is the negative log-likelihood of a Bernoulli distribution — so logistic regression is MLE for binary classification.
Worked Example: Logistic Regression Prediction

Trained model: $w_1 = 2.0$, $w_2 = -1.5$, $b = 0.5$. New input: $x = (1, 2)$.

Linear part: $z = 2.0(1) + (-1.5)(2) + 0.5 = 2 - 3 + 0.5 = -0.5$
Sigmoid: $P(y=1) = \frac{1}{1+e^{0.5}} = \frac{1}{1+1.649} = \frac{1}{2.649} = 0.378$
Prediction: $P(y=1) = 37.8\% < 50\%$, so predict class 0.

Softmax — Multi-Class Generalization

$$P(y = k | \mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} \qquad \text{where } z_k = \mathbf{w}_k^T\mathbf{x} + b_k$$

For $K$ classes, softmax converts $K$ raw scores ("logits") into $K$ probabilities that sum to 1. Used in the final layer of almost every classification neural network.

Worked Example: Softmax

Three classes, logits: $z = [2.0, 1.0, 0.5]$.

$e^{z_1} = e^{2.0} = 7.389, \quad e^{z_2} = e^{1.0} = 2.718, \quad e^{z_3} = e^{0.5} = 1.649$
$\text{sum} = 7.389 + 2.718 + 1.649 = 11.756$
$P = [\frac{7.389}{11.756}, \frac{2.718}{11.756}, \frac{1.649}{11.756}] = [0.629, 0.231, 0.140]$

Probabilities sum to 1. Class 1 is most likely at 62.9%.
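A numerically stable softmax subtracts the max logit before exponentiating, which changes nothing mathematically but avoids overflow for large logits (a sketch):

```python
from math import exp

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 (stable version)."""
    m = max(logits)                      # shift so the largest logit is 0
    exps = [exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.5])   # ≈ [0.629, 0.231, 0.140]
```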

Interactive: Sigmoid & Softmax


22. Multivariate Normal & Covariance

In ML, data points have multiple features. The multivariate normal distribution is the multi-dimensional generalization of the bell curve.

$$\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$ $$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$

$\boldsymbol{\mu}$ = mean vector ($d$-dimensional), $\boldsymbol{\Sigma}$ = covariance matrix ($d \times d$), $|\boldsymbol{\Sigma}|$ = its determinant.

The Covariance Matrix

$\boldsymbol{\Sigma}$ encodes both the spread of each variable (on the diagonal) and how they co-vary (off-diagonal):

$$\boldsymbol{\Sigma} = \begin{bmatrix} \text{Var}(X_1) & \text{Cov}(X_1, X_2) \\ \text{Cov}(X_2, X_1) & \text{Var}(X_2) \end{bmatrix} = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}$$

$\rho$ = correlation coefficient. Diagonal entries = variances. Off-diagonal = covariances. Always symmetric: $\Sigma_{ij} = \Sigma_{ji}$.

Worked Example: Reading a Covariance Matrix

$\boldsymbol{\Sigma} = \begin{bmatrix} 4 & 1.5 \\ 1.5 & 1 \end{bmatrix}$

$\text{Var}(X_1) = 4 \Rightarrow \sigma_1 = 2$. $\text{Var}(X_2) = 1 \Rightarrow \sigma_2 = 1$.
$\text{Cov}(X_1, X_2) = 1.5$. Positive → variables increase together.
$\rho = \frac{1.5}{2 \times 1} = 0.75$. Strong positive correlation.

The eigenvectors of $\boldsymbol{\Sigma}$ give the principal axes of the elliptical contours — this is exactly what PCA computes.

ML Connection: PCA, GMMs, and Gaussian Processes

PCA: Finds the eigenvectors of the data's covariance matrix → principal components = directions of maximum variance.

GMMs: Model data as a mixture of $K$ multivariate normals, each with its own $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$.

Gaussian Processes: Define a distribution over functions. The covariance function (kernel) determines the shape of possible functions.

VAEs: The latent space is assumed to be $\mathcal{N}(\mathbf{0}, \mathbf{I})$, and the encoder outputs $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ for each input.

23. Sampling & Monte Carlo Methods

When you can't compute an expectation or integral analytically, sample from the distribution and average:

$$E[f(X)] = \int f(x) p(x) dx \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i) \qquad x_i \sim p(x)$$

Draw $N$ samples from $p(x)$, evaluate $f$ at each, average. As $N \to \infty$, the estimate converges to the true expectation (by LLN).

Key insight: Monte Carlo is the workhorse of modern ML. Every time you compute a mini-batch loss (instead of the full-dataset loss), you're doing Monte Carlo estimation. Every time a VAE samples from the latent space, or a diffusion model adds/removes noise, Monte Carlo is at work.
Worked Example: Estimating $\pi$ with Monte Carlo

Throw $N$ random darts at a unit square $[0,1] \times [0,1]$. Count how many land inside the quarter-circle $x^2 + y^2 \leq 1$.

$\frac{\text{hits}}{\text{total}} \approx \frac{\text{area of quarter-circle}}{\text{area of square}} = \frac{\pi/4}{1} = \frac{\pi}{4}$
$\hat{\pi} = 4 \times \frac{\text{hits}}{N}$

With $N = 10{,}000$: typically get $\hat{\pi} \approx 3.14 \pm 0.02$. More samples → better estimate.
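The dart-throwing estimator takes only a few lines (a sketch; the function name is mine):

```python
import random

def estimate_pi(n=100_000, seed=0):
    """Monte Carlo estimate of pi: fraction of darts inside the quarter-circle."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(n)
        if rng.random() ** 2 + rng.random() ** 2 <= 1  # inside x^2 + y^2 <= 1?
    )
    return 4 * hits / n
```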

Interactive: Monte Carlo $\pi$ Estimator

MCMC: Markov Chain Monte Carlo

What if you can't directly sample from $p(x)$? MCMC constructs a Markov chain whose stationary distribution IS $p(x)$. After enough steps, samples from the chain are samples from $p(x)$.

Metropolis-Hastings (sketch):
  1. Start at some point $x_0$.
  2. Propose a move to $x'$ (e.g., add Gaussian noise).
  3. Accept with probability $\min\left(1, \frac{p(x')}{p(x_t)}\right)$.
  4. Repeat. After burn-in, the chain samples from $p(x)$.

MCMC is used in Bayesian deep learning, Bayesian neural networks, and probabilistic programming (Stan, PyMC).

24. Markov Chains

A Markov chain is a sequence of states where the next state depends only on the current state, not the history:

$$P(X_{t+1} = j | X_t = i, X_{t-1}, \ldots) = P(X_{t+1} = j | X_t = i) = T_{ij}$$

This is the Markov property (memorylessness). $T$ is the transition matrix where $T_{ij}$ = probability of going from state $i$ to state $j$.

Worked Example: Weather Markov Chain

Two states: Sunny (S) and Rainy (R).

| From \ To | Sunny | Rainy |
| --- | --- | --- |
| Sunny | 0.8 | 0.2 |
| Rainy | 0.4 | 0.6 |

$T = \begin{bmatrix} 0.8 & 0.2 \\ 0.4 & 0.6 \end{bmatrix}$

If today is sunny, $P(\text{sunny tomorrow}) = 0.8$, $P(\text{rainy tomorrow}) = 0.2$.
$P(\text{sunny in 2 days} | \text{sunny today}) = T^2$ entry $(1,1)$:
$T^2 = \begin{bmatrix} 0.72 & 0.28 \\ 0.56 & 0.44 \end{bmatrix}$. So 72% chance of sunny in 2 days.

Stationary Distribution

$$\boldsymbol{\pi} T = \boldsymbol{\pi} \qquad \text{and} \qquad \sum_i \pi_i = 1$$

The stationary distribution $\boldsymbol{\pi}$ is the long-run fraction of time spent in each state. It's the eigenvector of $T^T$ with eigenvalue 1.

Worked Example: Finding Stationary Distribution

For the weather chain: $\pi_S (0.8) + \pi_R (0.4) = \pi_S$ and $\pi_S + \pi_R = 1$.

$0.8\pi_S + 0.4\pi_R = \pi_S \Rightarrow 0.4\pi_R = 0.2\pi_S \Rightarrow \pi_R = 0.5\pi_S$
$\pi_S + 0.5\pi_S = 1 \Rightarrow \pi_S = 2/3 \approx 66.7\%$, $\pi_R = 1/3 \approx 33.3\%$

In the long run, it's sunny 2/3 of the time, rainy 1/3.
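Iterating the chain from any starting distribution converges to the same stationary distribution, which is often easier than solving the eigenvector equation by hand (a sketch; names are mine):

```python
def step(dist, T):
    """One step of the chain: new_dist[j] = sum_i dist[i] * T[i][j]."""
    return [sum(dist[i] * T[i][j] for i in range(len(T))) for j in range(len(T))]

T = [[0.8, 0.2],   # Sunny -> Sunny, Sunny -> Rainy
     [0.4, 0.6]]   # Rainy -> Sunny, Rainy -> Rainy

dist = [1.0, 0.0]  # start: sunny with certainty
for _ in range(50):
    dist = step(dist, T)
# dist converges to the stationary distribution [2/3, 1/3]
```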

ML Connections:
PageRank: The web is a Markov chain. The stationary distribution = page importance scores.
HMMs: Hidden Markov Models use Markov chains with hidden states for speech recognition, gene finding, etc.
Reinforcement Learning: MDPs (Markov Decision Processes) extend Markov chains with actions and rewards — the foundation of RL.
LLMs: Autoregressive language models generate text one token at a time, where each token depends on the context — a (high-order) Markov-like process.

25. Probability in Quantum Mechanics

Classical probability assigns a number $P(x) \in [0,1]$ to each outcome. Quantum mechanics is fundamentally different — and understanding how requires some new vocabulary.

Key Vocabulary

Probability Amplitudes vs Classical Probability

Classical: $P(\text{outcome}) \in [0, 1]$. Probabilities add: $P(A \text{ or } B) = P(A) + P(B)$.

Quantum: Each outcome has a complex amplitude $\alpha$. Amplitudes add (and can cancel!). Then: $$P(\text{outcome}) = |\alpha|^2$$

This is called the Born rule. The probability is the squared magnitude of the amplitude.

Key insight — Why quantum is weird: In classical probability, combining two possibilities always makes things more likely (probabilities add). In quantum mechanics, amplitudes can be negative or complex, so they can cancel each other out. This is called interference, and it's what gives quantum computers their power — they amplify correct answers and cancel wrong ones.
Worked Example: A Qubit State

A qubit in state $|\psi\rangle = \frac{1}{\sqrt{2}}|0\rangle + \frac{1}{\sqrt{2}}|1\rangle$ (called the "$|+\rangle$ state" or "Hadamard state").

Amplitude of $|0\rangle$: $\alpha_0 = \frac{1}{\sqrt{2}}$
Amplitude of $|1\rangle$: $\alpha_1 = \frac{1}{\sqrt{2}}$
Probability of measuring 0: $|\alpha_0|^2 = \frac{1}{2} = 50\%$ (Born rule)
Probability of measuring 1: $|\alpha_1|^2 = \frac{1}{2} = 50\%$
Check: $|\alpha_0|^2 + |\alpha_1|^2 = \frac{1}{2} + \frac{1}{2} = 1$ ✓ (must sum to 1)

This looks like a fair coin flip — but the qubit's state contains more information (the phase of the amplitudes) that can be exploited by quantum algorithms.
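The Born rule computation above is two lines of NumPy, a sketch using the $|+\rangle$ amplitudes from the example:

```python
import numpy as np

# |+> state: equal amplitudes on |0> and |1>.
psi = np.array([1/np.sqrt(2), 1/np.sqrt(2)])

probs = np.abs(psi) ** 2  # Born rule: P(outcome) = |amplitude|^2
print(probs)              # [0.5 0.5] -> 50/50, like a fair coin
print(probs.sum())        # 1.0 -> probabilities must sum to 1
```

Because `np.abs` handles complex numbers, the same two lines also cover complex amplitudes like $\frac{i}{\sqrt{2}}$.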

Worked Example: Complex Amplitudes

$|\psi\rangle = \frac{1}{\sqrt{2}}|0\rangle + \frac{i}{\sqrt{2}}|1\rangle$ where $i = \sqrt{-1}$.

$P(0) = \left|\frac{1}{\sqrt{2}}\right|^2 = \frac{1}{2}$
$P(1) = \left|\frac{i}{\sqrt{2}}\right|^2 = \frac{|i|^2}{2} = \frac{1}{2}$ (since $|i| = 1$)

Same measurement probabilities as the previous example! But the states are physically different — a quantum computer can tell them apart using interference. The complex phase $i$ carries information invisible to simple measurement.

Worked Example: Interference — Where Quantum Differs from Classical

Imagine two paths to an outcome, with amplitudes $\alpha_1 = +\frac{1}{\sqrt{2}}$ and $\alpha_2 = -\frac{1}{\sqrt{2}}$.

Classical: $P = P_1 + P_2 = \frac{1}{2} + \frac{1}{2} = 1$ (always happens)
Quantum: Total amplitude = $\alpha_1 + \alpha_2 = \frac{1}{\sqrt{2}} - \frac{1}{\sqrt{2}} = 0$
Quantum probability: $P = |0|^2 = 0$ (never happens!)

Destructive interference: The two paths cancel completely! Classically impossible — you can't add two positive probabilities and get zero. But amplitudes can be negative, so they cancel. This is the heart of quantum computing.
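The classical-vs-quantum contrast above comes down to the order of operations, square then add versus add then square; a minimal check:

```python
import numpy as np

a1, a2 = 1/np.sqrt(2), -1/np.sqrt(2)  # amplitudes of the two paths

classical = abs(a1)**2 + abs(a2)**2   # add probabilities:        0.5 + 0.5 = 1.0
quantum   = abs(a1 + a2)**2           # add amplitudes, then square: |0|^2 = 0.0

print(classical, quantum)  # 1.0 vs 0.0 -- destructive interference
```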

Multi-Qubit States and Entanglement

For 2 qubits, the state space has $2^2 = 4$ basis states: $|00\rangle, |01\rangle, |10\rangle, |11\rangle$.

$$|\psi\rangle = \alpha_{00}|00\rangle + \alpha_{01}|01\rangle + \alpha_{10}|10\rangle + \alpha_{11}|11\rangle$$ $$\sum_{i,j} |\alpha_{ij}|^2 = 1$$

Entanglement occurs when a multi-qubit state cannot be written as a product of individual qubit states. The classic example:

$$|\Phi^+\rangle = \frac{1}{\sqrt{2}}|00\rangle + \frac{1}{\sqrt{2}}|11\rangle$$

This is a Bell state. If you measure the first qubit and get 0, the second MUST be 0. If you get 1, the second MUST be 1. The outcomes are perfectly correlated — even though each individual qubit is random (50/50).

There's no classical probability analogy for entanglement. It's a genuinely new kind of correlation that violates Bell's inequality — a mathematical test that distinguishes quantum from classical correlations.
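The perfect correlation of the Bell state can be seen by sampling joint measurement outcomes from the Born-rule probabilities, a sketch (the seed and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Bell state |Phi+> amplitudes over the basis |00>, |01>, |10>, |11>.
amps = np.array([1/np.sqrt(2), 0, 0, 1/np.sqrt(2)])
probs = np.abs(amps) ** 2  # [0.5, 0, 0, 0.5]

# Sample joint measurement outcomes.
outcomes = rng.choice(["00", "01", "10", "11"], size=10_000, p=probs)

# The two bits always agree, even though each bit alone is 50/50.
print(all(o[0] == o[1] for o in outcomes))  # True
print(np.mean(outcomes == "00"))            # ~0.5
```

Note this only reproduces the correlation statistics; what makes entanglement non-classical is that no local hidden-variable model can reproduce the correlations across *all* measurement bases (Bell's inequality).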

QC Connection: Quantum Algorithms Exploit Interference

Grover's search: Searching an unsorted database of $N$ items classically takes $O(N)$ checks. Grover's algorithm uses amplitude amplification — constructive interference on the correct answer, destructive on wrong answers — to find it in $O(\sqrt{N})$.

Shor's algorithm: Factors large numbers exponentially faster by using the Quantum Fourier Transform to create constructive interference on the period of a modular function. This is what threatens RSA encryption.

Both algorithms work by carefully arranging amplitudes so the correct answer accumulates probability through constructive interference.

26. Quantum Entropy & Density Matrices

In quantum computing, we sometimes don't know the exact state of a system — just probabilities of different states. This requires extending our tools.

Density Matrices — Quantum Probability Distributions

A density matrix (or density operator) $\rho$ represents the state of a quantum system, generalizing the state vector. It's the quantum analogue of a probability distribution.

Pure state (known quantum state $|\psi\rangle$): $$\rho = |\psi\rangle\langle\psi|$$

$\langle\psi|$ is the "bra" — the conjugate transpose of $|\psi\rangle$. So $\rho$ is a matrix formed by the outer product.

Mixed state (classical uncertainty over quantum states): $$\rho = \sum_i p_i |\psi_i\rangle\langle\psi_i|$$

With probability $p_i$, the system is in state $|\psi_i\rangle$. This is classical probability (which state?) layered on top of quantum probability (measurement outcomes).

Worked Example: Density Matrix of a Pure State

For $|\psi\rangle = \frac{1}{\sqrt{2}}|0\rangle + \frac{1}{\sqrt{2}}|1\rangle = \frac{1}{\sqrt{2}}\begin{bmatrix}1\\1\end{bmatrix}$:

$\rho = |\psi\rangle\langle\psi| = \frac{1}{2}\begin{bmatrix}1\\1\end{bmatrix}\begin{bmatrix}1&1\end{bmatrix} = \frac{1}{2}\begin{bmatrix}1&1\\1&1\end{bmatrix}$

The off-diagonal elements (called coherences) are non-zero — this is the signature of quantum superposition. They encode the phase relationship between $|0\rangle$ and $|1\rangle$.

Worked Example: Density Matrix of a Mixed State

A machine prepares $|0\rangle$ with probability $\frac{1}{2}$ and $|1\rangle$ with probability $\frac{1}{2}$ (classical coin flip, not superposition).

$\rho = \frac{1}{2}|0\rangle\langle 0| + \frac{1}{2}|1\rangle\langle 1| = \frac{1}{2}\begin{bmatrix}1&0\\0&0\end{bmatrix} + \frac{1}{2}\begin{bmatrix}0&0\\0&1\end{bmatrix} = \frac{1}{2}\begin{bmatrix}1&0\\0&1\end{bmatrix}$

The off-diagonal elements are zero — no coherence, no superposition. This is just classical uncertainty. The measurement probabilities are the same (50/50) as the pure superposition above, but the state is fundamentally different — a quantum computer CAN distinguish them.
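Both density matrices above can be built directly from outer products; a NumPy sketch contrasting the pure superposition with the classical mixture:

```python
import numpy as np

ket0 = np.array([[1.0], [0.0]])  # |0> as a column vector
ket1 = np.array([[0.0], [1.0]])  # |1>

# Pure state |+> = (|0> + |1>)/sqrt(2): rho = |psi><psi| (outer product).
psi = (ket0 + ket1) / np.sqrt(2)
rho_pure = psi @ psi.T.conj()

# Mixed state: classical 50/50 over |0> and |1>.
rho_mixed = 0.5 * (ket0 @ ket0.T) + 0.5 * (ket1 @ ket1.T)

print(rho_pure)   # [[0.5 0.5]  <- off-diagonal coherences: superposition
                  #  [0.5 0.5]]
print(rho_mixed)  # [[0.5 0. ]  <- zero coherences: classical uncertainty
                  #  [0.  0.5]]
```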

Von Neumann Entropy — Quantum Entropy

Just as Shannon entropy measures uncertainty for classical distributions, von Neumann entropy measures uncertainty for quantum states:

$$S(\rho) = -\text{Tr}(\rho \log_2 \rho) = -\sum_i \lambda_i \log_2 \lambda_i$$

$\lambda_i$ are the eigenvalues of $\rho$. $\text{Tr}$ is the trace (sum of diagonal elements). By convention, $0 \log 0 = 0$.

State | $S(\rho)$ | Meaning
Pure state $|\psi\rangle$ | $0$ | No uncertainty — we know the exact state
Maximally mixed ($\frac{I}{d}$) | $\log_2 d$ | Maximum uncertainty — all states equally likely
Entangled subsystem | $> 0$ | Entanglement creates uncertainty in subsystems
Worked Example: Von Neumann Entropy

Pure state: $\rho = \frac{1}{2}\begin{bmatrix}1&1\\1&1\end{bmatrix}$. Eigenvalues: $\lambda_1 = 1, \lambda_2 = 0$.

$S = -(1\log_2 1 + 0\log_2 0) = -(0 + 0) = 0$ bits. No uncertainty.

Maximally mixed: $\rho = \frac{1}{2}\begin{bmatrix}1&0\\0&1\end{bmatrix}$. Eigenvalues: $\lambda_1 = \frac{1}{2}, \lambda_2 = \frac{1}{2}$.

$S = -(\frac{1}{2}\log_2 \frac{1}{2} + \frac{1}{2}\log_2 \frac{1}{2}) = -(-\frac{1}{2} - \frac{1}{2}) = 1$ bit. Maximum uncertainty for 1 qubit.

A von Neumann entropy of 1 bit matches the uncertainty of a fair coin flip; in general, $S(\rho)$ equals the Shannon entropy of the eigenvalue distribution of $\rho$.
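The eigenvalue computation above generalizes to any density matrix; a small sketch (the `1e-12` cutoff is an assumption implementing the $0 \log 0 = 0$ convention numerically):

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -sum_i lambda_i log2 lambda_i over the eigenvalues of rho."""
    lam = np.linalg.eigvalsh(rho)  # rho is Hermitian, so use eigvalsh
    lam = lam[lam > 1e-12]         # drop (numerically) zero eigenvalues: 0 log 0 = 0
    return float(-np.sum(lam * np.log2(lam)))

rho_pure  = 0.5 * np.array([[1.0, 1.0], [1.0, 1.0]])  # eigenvalues 1, 0
rho_mixed = 0.5 * np.eye(2)                           # eigenvalues 1/2, 1/2

print(von_neumann_entropy(rho_pure))   # 0.0 bits -- pure state, no uncertainty
print(von_neumann_entropy(rho_mixed))  # 1.0 bits -- maximally mixed qubit
```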

Key insight — Entanglement and Entropy: For a pure entangled state like $|\Phi^+\rangle = \frac{1}{\sqrt{2}}(|00\rangle + |11\rangle)$, the total 2-qubit system has $S = 0$ (pure state = no uncertainty). But if you look at just one qubit (by "tracing out" the other), its reduced density matrix is maximally mixed: $\rho_A = \frac{I}{2}$, with $S = 1$ bit. Entanglement creates local uncertainty from global certainty. This is measured by entanglement entropy — a key concept in quantum information theory.
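The "tracing out" step can be done with a reshape and a partial trace; a NumPy sketch computing the entanglement entropy of $|\Phi^+\rangle$:

```python
import numpy as np

# Bell state |Phi+> = (|00> + |11>)/sqrt(2) over the basis |00>, |01>, |10>, |11>.
phi = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
rho = np.outer(phi, phi.conj())  # full 2-qubit density matrix (pure, S = 0)

# Partial trace over qubit B: reshape indices to (a, b, a', b'), sum over b = b'.
rho_A = np.trace(rho.reshape(2, 2, 2, 2), axis1=1, axis2=3)
print(rho_A)  # [[0.5 0. ]
              #  [0.  0.5]] -- maximally mixed reduced state

# Entanglement entropy = von Neumann entropy of the reduced state.
lam = np.linalg.eigvalsh(rho_A)
S = float(-np.sum(lam * np.log2(lam)))
print(S)  # 1.0 bit: local uncertainty from global certainty
```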
QC Connection: Quantum Error Correction and Entropy

Quantum computers are noisy — interactions with the environment cause decoherence, turning pure states into mixed states (increasing entropy). Quantum error correction works by encoding information in entangled states that are robust to local noise. The theory of how much quantum information can be transmitted through a noisy channel is governed by quantum channel capacity, which uses von Neumann entropy — the quantum generalization of Shannon's channel capacity theorem.


Summary — The Big Picture

Concept | Formula | Where It Appears in ML/QC
Bayes' Theorem | $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$ | Naive Bayes, Bayesian inference, MAP
Expected Value | $E[X] = \sum x \cdot P(x)$ | Loss functions, reward in RL
Variance | $E[X^2] - (E[X])^2$ | Bias-variance tradeoff, risk
Normal Dist. | $\mathcal{N}(\mu, \sigma^2)$ | Weight init, noise, batch norm, GPs
Entropy | $-\sum p \log p$ | Cross-entropy loss, decision trees
KL Divergence | $\sum p \log(p/q)$ | VAE loss (ELBO), policy gradient
MLE | $\arg\max \sum \log P(x_i|\theta)$ | All model training (SGD on loss)
Sigmoid/Softmax | $\frac{1}{1+e^{-z}}$, $\frac{e^{z_k}}{\sum e^{z_j}}$ | Classification outputs
Markov Chains | $P(X_{t+1}|X_t)$ | HMMs, RL (MDPs), MCMC, PageRank
Monte Carlo | $E[f] \approx \frac{1}{N}\sum f(x_i)$ | Mini-batch SGD, sampling, MC dropout
Born Rule | $P = |\alpha|^2$ | Quantum measurement, QML
Von Neumann Entropy | $-\text{Tr}(\rho\log\rho)$ | Entanglement, quantum channels

Inspired by: Khan Academy, StatQuest, Chris Olah — Visual Information Theory, and Qiskit Textbook.

← Back to all guides