A complete interactive guide — from coin flips to ML loss functions to quantum measurement.
Estimated reading time: 2-3 hours.
Probability measures how likely something is to happen. It's a number between 0 and 1:
$P = 0$: impossible. $P = 1$: certain. $P = 0.5$: 50-50 coin flip.
That's the classical definition — works when all outcomes are equally likely (dice, coins, cards).
Key terminology:
A standard deck has 52 cards. What's $P(\text{heart})$?
What's $P(\text{face card})$? There are 12 face cards (J, Q, K in 4 suits).
When you report "model accuracy = 94%", you're saying: if you randomly pick a sample from the test set, $P(\text{correct prediction}) = 0.94$. This is the frequentist interpretation — accuracy is estimated from observed data.
Often it's easier to calculate the probability of what you DON'T want, then subtract it from 1: $P(A) = 1 - P(\text{not } A)$.
Subtract the overlap to avoid double-counting. If mutually exclusive: $P(A \text{ or } B) = P(A) + P(B)$.
If independent: $P(A \text{ and } B) = P(A) \times P(B)$.
Draw 2 cards from a deck without replacement. $P(\text{both aces})$?
Decision trees split data by asking sequential questions. At each leaf, the class probability is $P(Y=c | X_1, X_2, \ldots)$. The multiplication rule is how we chain conditional probabilities down the tree: $P(A \text{ and } B) = P(A) \cdot P(B|A)$.
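The two-aces answer can be checked with exact fractions — a minimal sketch of the multiplication rule in Python:

```python
from fractions import Fraction

# P(first ace) = 4/52; given that, P(second ace | first ace) = 3/51
p_both = Fraction(4, 52) * Fraction(3, 51)
print(p_both)         # 1/221
print(float(p_both))  # ≈ 0.0045
```

Using `Fraction` avoids floating-point surprises and makes the chain-rule structure explicit.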
Conditional probability is the probability of an event GIVEN that another event has occurred:
"If we know B happened, what fraction of the B-world also has A?"
Your inbox: 40% spam. 80% of spam contains "FREE". 10% of non-spam contains "FREE".
This uses the Law of Total Probability: $P(B) = P(B|A)P(A) + P(B|\bar{A})P(\bar{A})$.
If $A_1, A_2, \ldots$ partition the sample space (they cover everything and don't overlap), then the total probability of $B$ is the weighted sum across all partitions.
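Applied to the spam inbox above, the law gives the overall chance of seeing "FREE" — a quick sketch (variable names are ours):

```python
p_spam = 0.40
p_free_given_spam = 0.80
p_free_given_ham = 0.10

# Law of Total Probability: weight each conditional by its branch probability
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)
print(p_free)  # ≈ 0.38
```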
In ML, this appears as marginalisation. If a generative model has latent variable $Z$:
This is how Gaussian Mixture Models, Variational Autoencoders, and Hidden Markov Models compute the probability of observed data by summing over all possible hidden states.
Bayes' Theorem lets you flip conditional probabilities. If you know $P(B|A)$, it tells you $P(A|B)$:
The terms have names:
Disease affects 1% of people. Test is 99% accurate (99% true positive, 99% true negative). You test positive. What's the chance you actually have the disease?
Only 50%! Because the disease is rare, half of positive results are false positives. This is why anomaly detectors have high false alarm rates when anomalies are rare — a direct consequence of Bayes.
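The surprising 50% figure can be reproduced directly from Bayes' theorem — a short sketch:

```python
p_disease = 0.01
sensitivity = 0.99  # P(positive | disease)
specificity = 0.99  # P(negative | no disease)

# Denominator via the Law of Total Probability
p_positive = sensitivity * p_disease + (1 - specificity) * (1 - p_disease)

# Bayes' theorem: P(disease | positive)
p_disease_given_pos = sensitivity * p_disease / p_positive
print(round(p_disease_given_pos, 3))  # 0.5
```

Try lowering `p_disease` further: the posterior drops even more, which is exactly the rare-anomaly false-alarm effect described above.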
From Section 3: email contains "FREE". Is it spam?
This is exactly how Naive Bayes spam filters work (Section 20).
Two events are independent if one happening doesn't affect the other:
Roll two dice. Are "die 1 shows 6" and "sum is 7" independent?
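The dice question can be settled by brute force over all 36 equally likely outcomes — a sketch using exact fractions:

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(outcomes))  # each of the 36 outcomes equally likely

p_a = sum(p for d1, d2 in outcomes if d1 == 6)              # die 1 shows 6
p_b = sum(p for d1, d2 in outcomes if d1 + d2 == 7)         # sum is 7
p_ab = sum(p for d1, d2 in outcomes if d1 == 6 and d1 + d2 == 7)

print(p_a, p_b, p_ab)      # 1/6 1/6 1/36
print(p_ab == p_a * p_b)   # True -> the events are independent
```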
Naive Bayes classifiers assume all features are conditionally independent given the class label: $P(x_1, x_2, \ldots | y) = \prod_i P(x_i | y)$. This is almost never true in practice (word frequencies ARE correlated), but the classifier works surprisingly well anyway. This "naive" independence assumption is what makes computation tractable — instead of estimating a joint distribution over all features, you only need individual feature distributions.
A random variable $X$ is a number whose value depends on a random outcome. It maps outcomes to numbers.
CDF (Cumulative Distribution Function) — works for both:
$$F(x) = P(X \leq x)$$

For continuous random variables: $F(x) = \int_{-\infty}^{x} f(t) \, dt$. The CDF is non-decreasing and runs from 0 to 1.
A loaded die has: $P(1) = P(2) = P(3) = P(4) = P(5) = 0.1$, $P(6) = 0.5$.
The expected value (mean, first moment) is the long-run average:
Linearity of expectation — always holds, even for dependent variables:
You can never roll 3.5 — the expected value doesn't have to be a possible outcome.
Roll 3 dice. What's the expected sum?
No need to enumerate all $6^3 = 216$ outcomes! Linearity makes this trivial.
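Linearity can still be verified against the full enumeration — a small sketch comparing the two routes:

```python
from itertools import product

e_one = sum(range(1, 7)) / 6   # E[one die] = 3.5
e_sum_linear = 3 * e_one       # 10.5, by linearity of expectation

# Brute-force check over all 6^3 = 216 outcomes
rolls = list(product(range(1, 7), repeat=3))
e_sum_brute = sum(sum(r) for r in rolls) / len(rolls)

print(e_sum_linear, e_sum_brute)  # 10.5 10.5 -- both routes agree
```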
Training a model minimises the expected loss: $E_{\mathbf{x},y}[\mathcal{L}(f(\mathbf{x}), y)]$. Mean Squared Error, Cross-Entropy Loss, etc. are all expected values of a loss function over the data distribution. When we average the loss over a mini-batch, we're estimating this expected value.
Expected value is the center. Variance is the spread.
Key properties:
If $X$ and $Y$ are independent: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$.
$\lambda$ = average rate. Events per second, errors per hour, mutations per genome.
If events arrive at Poisson rate $\lambda$, the waiting time between them is exponential. "Memoryless" — past waiting doesn't affect future waiting.
| Scenario | Distribution | Why |
|---|---|---|
| 10 emails, each spam or not | Binomial(10, p) | Fixed $n$ trials, yes/no |
| Typos per page of code | Poisson($\lambda$) | Count of rare events in a window |
| Time until next server crash | Exponential($\lambda$) | Waiting time between Poisson events |
| Random number generator | Uniform(0, 1) | All values equally likely |
| Measurement errors | Normal($\mu$, $\sigma$) | Sum of many small random effects |
The most important distribution in all of statistics. The bell curve.
The 68-95-99.7 rule:
How many standard deviations is $x$ from the mean? $z = 2$ means "2 sigma above average" — roughly the top 2.5% of the distribution (more precisely, the top 2.3%).
1. Weight initialisation: Neural network weights are typically initialised from $\mathcal{N}(0, \sigma^2)$.
2. Gaussian noise: VAEs, diffusion models, and regularisation all add $\mathcal{N}(0, \sigma^2)$ noise.
3. MSE loss: Minimising MSE is equivalent to maximum likelihood under a Gaussian noise model.
4. Batch normalisation: Forces activations to be approximately $\mathcal{N}(0, 1)$.
5. Gaussian Processes: A powerful Bayesian method where everything is multivariate normal.
When you have two or more random variables, you need to describe their joint behavior — not just each one alone.
Weather ($X$) vs Ice Cream Sales ($Y$):
| | $Y$ = Low | $Y$ = High | Marginal $P(X)$ |
|---|---|---|---|
| $X$ = Cold | 0.40 | 0.10 | 0.50 |
| $X$ = Hot | 0.05 | 0.45 | 0.50 |
| Marginal $P(Y)$ | 0.45 | 0.55 | 1.00 |
Hot weather strongly predicts high ice cream sales. The joint distribution captures this relationship that the marginals alone can't.
Positive: $X$ and $Y$ move together. Negative: they move opposite. Zero: no linear relationship.
Mean: $\bar{x} = \frac{1}{n}\sum x_i$ — sensitive to outliers.
Median: Middle value when sorted — robust to outliers.
Mode: Most frequent value.
Variance: $s^2 = \frac{1}{n-1}\sum(x_i - \bar{x})^2$ (Bessel's correction: $n-1$ instead of $n$)
Standard Deviation: $s = \sqrt{s^2}$ — same units as data
IQR: $Q3 - Q1$ — range of middle 50%, robust to outliers
$r \in [-1, +1]$. It's the normalised covariance — covariance scaled to be unitless.
$R^2$ = fraction of variance explained by the model. For simple linear regression, $R^2 = r^2$.
| | $H_0$ true | $H_0$ false |
|---|---|---|
| Reject $H_0$ | Type I Error (false positive, $\alpha$) | Correct! |
| Don't reject | Correct! | Type II Error (false negative, $\beta$) |
90%: $z = 1.645$. 95%: $z = 1.960$. 99%: $z = 2.576$.
Key: to halve uncertainty, you need 4× the data (because $\sqrt{4n} = 2\sqrt{n}$).
More data means the sample average converges to the true mean (the Law of Large Numbers). There's no shortcut.
No matter what the original distribution, the average of many independent samples → Normal.
$$\bar{X}_n \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)$$

Usually $n \geq 30$ is enough. This is why the Normal distribution appears everywhere.
Entropy measures uncertainty or information content of a probability distribution. High entropy = high uncertainty = lots of surprise. Low entropy = predictable.
Measured in bits (with $\log_2$) or nats (with $\ln$). By convention, $0 \log 0 = 0$.
Intuition: Entropy answers "how many yes/no questions do I need, on average, to determine the outcome?"
Weather: $P(\text{sunny}) = 0.7$, $P(\text{rainy}) = 0.2$, $P(\text{snowy}) = 0.1$.
Compare: if all three were equally likely, $H = \log_2 3 = 1.585$ bits. The uneven distribution has less entropy (less uncertainty — we can usually guess "sunny").
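The weather entropy works out to about 1.157 bits — a minimal sketch of the computation:

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; terms with p = 0 are skipped (0 log 0 = 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

weather = [0.7, 0.2, 0.1]
print(round(entropy_bits(weather), 3))    # 1.157 bits
print(round(entropy_bits([1/3] * 3), 3))  # 1.585 bits -- uniform maximises entropy
```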
$p$ = true distribution, $q$ = model's predicted distribution. Cross-entropy measures how well $q$ approximates $p$.
True label: cat (class 0). One-hot: $p = [1, 0, 0]$ (cat, dog, bird).
Model predicts: $q = [0.7, 0.2, 0.1]$.
If model predicted $q = [0.95, 0.03, 0.02]$: $H(p, q) = -\log 0.95 \approx 0.051$ (much lower — better prediction!).
If model predicted $q = [0.01, 0.98, 0.01]$: $H(p, q) = -\log 0.01 \approx 4.605$ (terrible — very high loss).
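All three one-hot cases above reduce to $-\log q_{\text{true class}}$ — a sketch of the cross-entropy calculation (in nats, matching the numbers above):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [1, 0, 0]  # one-hot true label: cat
print(round(cross_entropy(p, [0.70, 0.20, 0.10]), 3))  # 0.357
print(round(cross_entropy(p, [0.95, 0.03, 0.02]), 3))  # 0.051 -- confident and right
print(round(cross_entropy(p, [0.01, 0.98, 0.01]), 3))  # 4.605 -- confident and wrong
```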
$D_{KL} \geq 0$, with equality iff $p = q$. NOT symmetric: $D_{KL}(p\|q) \neq D_{KL}(q\|p)$.
KL divergence = "extra bits needed because you used code $q$ instead of the optimal code $p$." It measures the "wasted information" from using the wrong distribution.
Decision trees pick the feature split that reduces entropy the most. Information gain = entropy before split − weighted entropy after split. A split that produces pure nodes (all same class) reduces entropy to 0 — maximum information gain. This is exactly the criterion used in ID3 and C4.5 algorithms.
Given observed data, what parameters make the data most probable? That's MLE — the most fundamental estimation method in statistics and ML.
Each data point is assumed independent, so the joint probability is the product. In practice, we maximize the log-likelihood (sums are easier than products):
$$\hat{\theta}_{MLE} = \arg\max_{\theta} \; \sum_{i=1}^{n} \log P(x_i | \theta)$$

Flip a coin 100 times. Get 62 heads. What's the MLE for $p$ (probability of heads)?
The MLE is just the sample proportion — intuitive! 62 heads out of 100 → $\hat{p} = 0.62$.
Data $y_i = f(x_i) + \epsilon_i$ where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.
So linear regression with MSE loss is secretly MLE under a Gaussian noise assumption.
MAP = MLE + prior. The prior $P(\theta)$ encodes your beliefs before seeing data. With lots of data, MAP → MLE.
When the prior says "parameters should be small" (e.g., Gaussian prior $P(\theta) \propto e^{-\lambda\|\theta\|^2}$), MAP becomes L2 regularisation (weight decay in neural networks).
Naive Bayes is Bayes' Theorem turned into a classifier. Despite its simplicity, it works remarkably well for spam filtering, text classification, and more.
The "naive" assumption: features are conditionally independent given the class. This makes the product factorize.
$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i | y)$$

Training data: $P(\text{spam}) = 0.4$, $P(\text{not spam}) = 0.6$.
Word probabilities:
| Word | $P(\text{word}|\text{spam})$ | $P(\text{word}|\text{not spam})$ |
|---|---|---|
| "FREE" | 0.80 | 0.10 |
| "meeting" | 0.05 | 0.50 |
New email contains both "FREE" and "meeting". Is it spam?
The independence assumption is wrong (words ARE correlated). But it has low variance — the model has very few parameters to estimate ($n \times k$ instead of $n^k$), so it doesn't overfit. For small datasets, this low variance often more than compensates for the bias from the wrong assumption. This is a classic bias-variance tradeoff win.
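The "FREE" + "meeting" example can be worked through directly — a sketch of the Naive Bayes scoring rule using the numbers above:

```python
import math

priors = {"spam": 0.4, "not spam": 0.6}
word_probs = {
    "spam":     {"FREE": 0.80, "meeting": 0.05},
    "not spam": {"FREE": 0.10, "meeting": 0.50},
}

email = ["FREE", "meeting"]

# Unnormalised score: prior * product of per-word likelihoods
scores = {
    cls: priors[cls] * math.prod(word_probs[cls][w] for w in email)
    for cls in priors
}
total = sum(scores.values())
posterior = {cls: s / total for cls, s in scores.items()}
print({cls: round(p, 3) for cls, p in posterior.items()})
# spam ≈ 0.348, not spam ≈ 0.652 -> classified as not spam,
# because "meeting" is strong evidence against spam
```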
Linear regression predicts a continuous value. But for classification, we need a probability between 0 and 1. Enter the sigmoid function:
Maps any real number to $(0, 1)$. $\sigma(0) = 0.5$, $\sigma(\infty) = 1$, $\sigma(-\infty) = 0$.
Logistic regression: Apply a linear model, then squeeze through the sigmoid:
Trained model: $w_1 = 2.0$, $w_2 = -1.5$, $b = 0.5$. New input: $x = (1, 2)$.
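The prediction for this input can be computed in a few lines — a sketch (weights and input taken from the example above):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

w, b = (2.0, -1.5), 0.5
x = (1, 2)

z = sum(wi * xi for wi, xi in zip(w, x)) + b  # 2.0*1 - 1.5*2 + 0.5 = -0.5
p = sigmoid(z)
print(round(p, 3))  # 0.378 -> below 0.5, so predict class 0
```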
For $K$ classes, softmax converts $K$ raw scores ("logits") into $K$ probabilities that sum to 1. Used in the final layer of almost every classification neural network.
Three classes, logits: $z = [2.0, 1.0, 0.5]$.
Probabilities sum to 1. Class 1 is most likely at 62.9%.
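Those numbers follow from a direct softmax computation — a minimal sketch:

```python
import math

def softmax(logits):
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.5])
print([round(p, 3) for p in probs])  # [0.629, 0.231, 0.14]
```

In production code, libraries subtract `max(logits)` before exponentiating to avoid overflow; the result is mathematically identical.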
In ML, data points have multiple features. The multivariate normal distribution is the multi-dimensional generalization of the bell curve.
$\boldsymbol{\mu}$ = mean vector ($d$-dimensional), $\boldsymbol{\Sigma}$ = covariance matrix ($d \times d$), $|\boldsymbol{\Sigma}|$ = its determinant.
$\boldsymbol{\Sigma}$ encodes both the spread of each variable (on the diagonal) and how they co-vary (off-diagonal):
$\rho$ = correlation coefficient. Diagonal entries = variances. Off-diagonal = covariances. Always symmetric: $\Sigma_{ij} = \Sigma_{ji}$.
$\boldsymbol{\Sigma} = \begin{bmatrix} 4 & 1.5 \\ 1.5 & 1 \end{bmatrix}$
The eigenvectors of $\boldsymbol{\Sigma}$ give the principal axes of the elliptical contours — this is exactly what PCA computes.
PCA: Finds the eigenvectors of the data's covariance matrix → principal components = directions of maximum variance.
GMMs: Model data as a mixture of $K$ multivariate normals, each with its own $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$.
Gaussian Processes: Define a distribution over functions. The covariance function (kernel) determines the shape of possible functions.
VAEs: The latent space is assumed to be $\mathcal{N}(\mathbf{0}, \mathbf{I})$, and the encoder outputs $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ for each input.
When you can't compute an expectation or integral analytically, sample from the distribution and average:
Draw $N$ samples from $p(x)$, evaluate $f$ at each, average. As $N \to \infty$, the estimate converges to the true expectation (by LLN).
Throw $N$ random darts at a unit square $[0,1] \times [0,1]$. Count how many land inside the quarter-circle $x^2 + y^2 \leq 1$.
With $N = 10{,}000$: typically get $\hat{\pi} \approx 3.14 \pm 0.02$. More samples → better estimate.
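A sketch of the dart-throwing estimator (the seed is arbitrary; results vary run to run without it):

```python
import random

def estimate_pi(n, seed=0):
    rng = random.Random(seed)
    # Count darts landing inside the quarter-circle x^2 + y^2 <= 1
    inside = sum(
        1 for _ in range(n)
        if rng.random() ** 2 + rng.random() ** 2 <= 1
    )
    # The quarter-circle has area pi/4, so scale the hit fraction by 4
    return 4 * inside / n

print(estimate_pi(10_000))  # ≈ 3.14, within the expected sampling error
```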
What if you can't directly sample from $p(x)$? MCMC constructs a Markov chain whose stationary distribution IS $p(x)$. After enough steps, samples from the chain are samples from $p(x)$.
MCMC is used in Bayesian deep learning, Bayesian neural networks, and probabilistic programming (Stan, PyMC).
A Markov chain is a sequence of states where the next state depends only on the current state, not the history:
This is the Markov property (memorylessness). $T$ is the transition matrix where $T_{ij}$ = probability of going from state $i$ to state $j$.
Two states: Sunny (S) and Rainy (R).
| From \ To | Sunny | Rainy |
|---|---|---|
| Sunny | 0.8 | 0.2 |
| Rainy | 0.4 | 0.6 |
$T = \begin{bmatrix} 0.8 & 0.2 \\ 0.4 & 0.6 \end{bmatrix}$
The stationary distribution $\boldsymbol{\pi}$ is the long-run fraction of time spent in each state. It's the eigenvector of $T^T$ with eigenvalue 1.
For the weather chain: $\pi_S (0.8) + \pi_R (0.4) = \pi_S$ and $\pi_S + \pi_R = 1$.
In the long run, it's sunny 2/3 of the time, rainy 1/3.
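The stationary distribution can also be found numerically by repeatedly applying the transition matrix (power iteration) — a sketch:

```python
# Transition matrix: rows = current state, columns = next state (Sunny, Rainy)
T = [[0.8, 0.2],
     [0.4, 0.6]]

pi = [0.5, 0.5]  # any starting distribution converges to the same answer
for _ in range(100):
    # One step of the chain: pi <- pi @ T
    pi = [sum(pi[i] * T[i][j] for i in range(2)) for j in range(2)]

print([round(p, 4) for p in pi])  # [0.6667, 0.3333] -> sunny 2/3, rainy 1/3
```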
Classical probability assigns a number $P(x) \in [0,1]$ to each outcome. Quantum mechanics is fundamentally different — and understanding how requires some new vocabulary.
This is called the Born rule. The probability is the squared magnitude of the amplitude.
A qubit in state $|\psi\rangle = \frac{1}{\sqrt{2}}|0\rangle + \frac{1}{\sqrt{2}}|1\rangle$ (called the "$|+\rangle$ state" or "Hadamard state").
This looks like a fair coin flip — but the qubit's state contains more information (the phase of the amplitudes) that can be exploited by quantum algorithms.
$|\psi\rangle = \frac{1}{\sqrt{2}}|0\rangle + \frac{i}{\sqrt{2}}|1\rangle$ where $i = \sqrt{-1}$.
Same measurement probabilities as the previous example! But the states are physically different — a quantum computer can tell them apart using interference. The complex phase $i$ carries information invisible to simple measurement.
Imagine two paths to an outcome, with amplitudes $\alpha_1 = +\frac{1}{\sqrt{2}}$ and $\alpha_2 = -\frac{1}{\sqrt{2}}$.
Destructive interference: The two paths cancel completely! Classically impossible — you can't add two positive probabilities and get zero. But amplitudes can be negative, so they cancel. This is the heart of quantum computing.
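The contrast between adding probabilities and adding amplitudes fits in a few lines of Python (complex numbers are built in):

```python
# Two paths with amplitudes +1/sqrt(2) and -1/sqrt(2)
a1, a2 = 1 / 2**0.5, -1 / 2**0.5

p_classical = abs(a1)**2 + abs(a2)**2  # probabilities add: ≈ 1.0
p_quantum = abs(a1 + a2)**2            # amplitudes add FIRST, then square: 0.0
print(p_classical, p_quantum)

# Born rule with a complex phase: 1/sqrt(2) and i/sqrt(2)
b1, b2 = 1 / 2**0.5, 1j / 2**0.5
print(abs(b1)**2, abs(b2)**2)  # ≈ 0.5 each -- the phase is invisible to measurement
```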
For 2 qubits, the state space has $2^2 = 4$ basis states: $|00\rangle, |01\rangle, |10\rangle, |11\rangle$.
Entanglement occurs when a multi-qubit state cannot be written as a product of individual qubit states. The classic example:
This is a Bell state. If you measure the first qubit and get 0, the second MUST be 0. If you get 1, the second MUST be 1. The outcomes are perfectly correlated — even though each individual qubit is random (50/50).
There's no classical probability analogy for entanglement. It's a genuinely new kind of correlation that violates Bell's inequality — a mathematical test that distinguishes quantum from classical correlations.
Grover's search: Searching an unsorted database of $N$ items classically takes $O(N)$ checks. Grover's algorithm uses amplitude amplification — constructive interference on the correct answer, destructive on wrong answers — to find it in $O(\sqrt{N})$.
Shor's algorithm: Factors large numbers exponentially faster by using the Quantum Fourier Transform to create constructive interference on the period of a modular function. This is what threatens RSA encryption.
Both algorithms work by carefully arranging amplitudes so the correct answer accumulates probability through constructive interference.
In quantum computing, we sometimes don't know the exact state of a system — just probabilities of different states. This requires extending our tools.
A density matrix (or density operator) $\rho$ represents the state of a quantum system, generalizing the state vector. It's the quantum analogue of a probability distribution.
$\langle\psi|$ is the "bra" — the conjugate transpose of $|\psi\rangle$. So $\rho$ is a matrix formed by the outer product.
Mixed state (classical uncertainty over quantum states):

$$\rho = \sum_i p_i |\psi_i\rangle\langle\psi_i|$$

With probability $p_i$, the system is in state $|\psi_i\rangle$. This is classical probability (which state?) layered on top of quantum probability (measurement outcomes).
For $|\psi\rangle = \frac{1}{\sqrt{2}}|0\rangle + \frac{1}{\sqrt{2}}|1\rangle = \frac{1}{\sqrt{2}}\begin{bmatrix}1\\1\end{bmatrix}$:
The off-diagonal elements (called coherences) are non-zero — this is the signature of quantum superposition. They encode the phase relationship between $|0\rangle$ and $|1\rangle$.
A machine prepares $|0\rangle$ with probability $\frac{1}{2}$ and $|1\rangle$ with probability $\frac{1}{2}$ (classical coin flip, not superposition).
The off-diagonal elements are zero — no coherence, no superposition. This is just classical uncertainty. The measurement probabilities are the same (50/50) as the pure superposition above, but the state is fundamentally different — a quantum computer CAN distinguish them.
Just as Shannon entropy measures uncertainty for classical distributions, von Neumann entropy measures uncertainty for quantum states:
$\lambda_i$ are the eigenvalues of $\rho$. $\text{Tr}$ is the trace (sum of diagonal elements). By convention, $0 \log 0 = 0$.
| State | $S(\rho)$ | Meaning |
|---|---|---|
| Pure state $|\psi\rangle$ | $0$ | No uncertainty — we know the exact state |
| Maximally mixed ($\frac{I}{d}$) | $\log_2 d$ | Maximum uncertainty — all states equally likely |
| Entangled subsystem | $> 0$ | Entanglement creates uncertainty in subsystems |
Pure state: $\rho = \frac{1}{2}\begin{bmatrix}1&1\\1&1\end{bmatrix}$. Eigenvalues: $\lambda_1 = 1, \lambda_2 = 0$.
Maximally mixed: $\rho = \frac{1}{2}\begin{bmatrix}1&0\\0&1\end{bmatrix}$. Eigenvalues: $\lambda_1 = \frac{1}{2}, \lambda_2 = \frac{1}{2}$.
Von Neumann entropy of 1 bit = equivalent to the uncertainty of a fair coin flip. It equals the Shannon entropy of the eigenvalue distribution.
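Since $S(\rho)$ is just the Shannon entropy of the eigenvalues, both examples reduce to a few lines — a sketch:

```python
import math

def von_neumann_entropy_bits(eigenvalues):
    """S(rho) = -sum_i lambda_i log2(lambda_i) over rho's eigenvalues (0 log 0 = 0)."""
    return sum(-l * math.log2(l) for l in eigenvalues if l > 0)

print(von_neumann_entropy_bits([1.0, 0.0]))  # 0.0 -- pure state, no uncertainty
print(von_neumann_entropy_bits([0.5, 0.5]))  # 1.0 -- maximally mixed qubit, one bit
```

For a general density matrix you would first diagonalise $\rho$ (e.g. with an eigensolver) and feed the eigenvalues in.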
Quantum computers are noisy — interactions with the environment cause decoherence, turning pure states into mixed states (increasing entropy). Quantum error correction works by encoding information in entangled states that are robust to local noise. The theory of how much quantum information can be transmitted through a noisy channel is governed by quantum channel capacity, which uses von Neumann entropy — the quantum generalization of Shannon's channel capacity theorem.
| Concept | Formula | Where It Appears in ML/QC |
|---|---|---|
| Bayes' Theorem | $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$ | Naive Bayes, Bayesian inference, MAP |
| Expected Value | $E[X] = \sum x \cdot P(x)$ | Loss functions, reward in RL |
| Variance | $E[X^2] - (E[X])^2$ | Bias-variance tradeoff, risk |
| Normal Dist. | $\mathcal{N}(\mu, \sigma^2)$ | Weight init, noise, batch norm, GPs |
| Entropy | $-\sum p \log p$ | Cross-entropy loss, decision trees |
| KL Divergence | $\sum p \log(p/q)$ | VAE loss (ELBO), policy gradient |
| MLE | $\arg\max \sum \log P(x_i|\theta)$ | All model training (SGD on loss) |
| Sigmoid/Softmax | $\frac{1}{1+e^{-z}}$, $\frac{e^{z_k}}{\sum e^{z_j}}$ | Classification outputs |
| Markov Chains | $P(X_{t+1}|X_t)$ | HMMs, RL (MDPs), MCMC, PageRank |
| Monte Carlo | $E[f] \approx \frac{1}{N}\sum f(x_i)$ | Mini-batch SGD, sampling, MC dropout |
| Born Rule | $P = |\alpha|^2$ | Quantum measurement, QML |
| Von Neumann Entropy | $-\text{Tr}(\rho\log\rho)$ | Entanglement, quantum channels |
Inspired by: Khan Academy, StatQuest, Chris Olah — Visual Information Theory, and Qiskit Textbook.