An interactive course for software engineers — prerequisite math for ML, robotics, quantum computing, and beyond
You write code for a living. You think in logic, loops, and data structures. This course translates the mathematics you need into that mindset. Every concept is grounded in something concrete, connected to systems you care about — machine learning, autonomous vehicles, sensor fusion, quantum computing — and packed with worked examples so you can practice until it clicks.
The Khan Academy philosophy here: Don't memorise formulas. Understand why they work. Every formula started as someone's clever observation. We'll rebuild that observation, step by step, with examples. If you understand the "why," you'll re-derive what you forget.
This course is structured sequentially — each chapter builds on the previous ones. Budget roughly 3 hours for a thorough first pass, more if you work through every practice problem.
Algebra is just using letters to represent unknown numbers, then finding those numbers using rules. You already do this in code: let x = totalPrice / quantity. Math does the same thing — with different notation.
Think of it this way: Algebra is the "programming language" of math. Variables are parameters. Equations are assertions. Solving is debugging — you manipulate until you isolate the unknown. Everything else in this course is written in this language.
Variables and Expressions
A variable is a placeholder for a number you don't know yet (like a function parameter). An expression is a recipe that uses variables:
$3x + 7$ means "take some number $x$, multiply by 3, add 7"
If $x = 4$: $3(4) + 7 = 12 + 7 = 19$
If $x = -2$: $3(-2) + 7 = -6 + 7 = 1$
If $x = 0$: $3(0) + 7 = 0 + 7 = 7$
Multiplication is often written without a sign: $3x$ means $3 \times x$. Parentheses work like code: evaluate the inside first.
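In code, an expression is just a pure function of its variables — a minimal sketch (the name `expr` is ours):

```javascript
// The expression 3x + 7 as a pure function of x.
const expr = (x) => 3 * x + 7;

expr(4);  // 19
expr(-2); // 1
expr(0);  // 7
```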
More examples of expressions:
$2x + 3y$ — two variables (like a function with two parameters)
$x^2 - 4x + 3$ — powers of a single variable
$\frac{a + b}{2}$ — the average of $a$ and $b$
Equations: Finding the Unknown
An equation says two expressions are equal. Solving means finding what value of the variable makes it true.
Golden rule: Whatever you do to one side, do to the other side. The equation stays balanced — like a scale. Add 5 to the left? Add 5 to the right. Multiply the left by 3? Multiply the right by 3.
Worked example 3 (rearranging a formula): Solve $\frac{v - v_0}{a} = t$ for $v$
Step 1: Multiply both sides by $a$ → $v - v_0 = at$
Step 2: Add $v_0$ to both sides → $v = v_0 + at$
This is the velocity equation from physics. You just derived it by rearranging. You'll see this exact equation in robotics and autonomous vehicles.
Worked example 4 (ML connection): In linear regression, you predict $\hat{y} = wx + b$. If you know two data points $(1, 3)$ and $(2, 5)$, find $w$ and $b$.
Step 1: From point $(1,3)$: $w(1) + b = 3$ → $w + b = 3$
Step 2: From point $(2,5)$: $w(2) + b = 5$ → $2w + b = 5$
Step 3: Subtract equation 1 from equation 2: $(2w + b) - (w + b) = 5 - 3$ → $w = 2$
This is the simplest form of "training" a model — solving equations to find weights. Real ML does this with millions of parameters using calculus (Chapter 4).
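The two-point fit translates directly to code — a sketch (`fitLine` is our name, and it assumes distinct $x$-values):

```javascript
// Fit y = w*x + b exactly through two points:
// subtracting the two equations gives w, then back-substitute for b.
function fitLine([x1, y1], [x2, y2]) {
  const w = (y2 - y1) / (x2 - x1); // slope from the subtracted equations
  const b = y1 - w * x1;           // back-substitution into equation 1
  return { w, b };
}

fitLine([1, 3], [2, 5]); // { w: 2, b: 1 }
```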
Interactive: Equation Solver
Solve $ax + b = c$ — enter coefficients and check your answer.
Systems of Equations
Sometimes you have multiple unknowns. You need as many equations as unknowns. This is called a system of equations.
Worked example: A drone's position depends on wind speed $w$ and motor thrust $m$. Given measurements:
$m + w = 10$ (moving forward, wind assists)
$m - w = 6$ (moving backward, wind opposes)
Step 1: Add both equations: $2m = 16$ → $m = 8$
Step 2: Substitute: $8 + w = 10$ → $w = 2$
Check: $8 + 2 = 10$ ✓, $8 - 2 = 6$ ✓
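The elimination above generalises to any 2×2 system — a sketch using Cramer's rule (`solve2x2` is our name; it assumes a nonzero determinant):

```javascript
// Solve  a1*x + b1*y = c1,  a2*x + b2*y = c2  by Cramer's rule.
function solve2x2(a1, b1, c1, a2, b2, c2) {
  const det = a1 * b2 - a2 * b1; // must be nonzero for a unique solution
  return { x: (c1 * b2 - c2 * b1) / det, y: (a1 * c2 - a2 * c1) / det };
}

// Drone example: m + w = 10, m - w = 6
solve2x2(1, 1, 10, 1, -1, 6); // { x: 8, y: 2 }  → m = 8, w = 2
```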
Why systems matter: In ML, training a neural network is solving a massive system of equations (millions of weights). In quantum computing, the state of $n$ qubits involves $2^n$ probability amplitudes that must satisfy constraints. Systems of equations are everywhere.
Subscripts and Greek Letters
Math uses subscripts to label related variables: $v_0$ means "the initial velocity", $v_f$ means "the final velocity". They're just names — like v_initial and v_final in code.
Greek letters appear because we run out of Roman ones — common conventions include $\eta$ (learning rate), $\theta$ (model parameters), $\sigma$ (standard deviation), and $\Sigma$ (summation). They combine in formulas like Mean Squared Error:
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
For each data point $i$: take the true value $y_i$, subtract your model's prediction $\hat{y}_i$, square it (so negatives don't cancel), and average. This single formula drives the training of most ML models.
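That recipe in code — a minimal sketch (`mse` is our helper name):

```javascript
// Mean squared error: average of squared differences.
function mse(yTrue, yPred) {
  const n = yTrue.length;
  let sum = 0;
  for (let i = 0; i < n; i++) {
    const err = yTrue[i] - yPred[i];
    sum += err * err; // squaring keeps negative errors from cancelling
  }
  return sum / n;
}

mse([1, -2, 3], [0, 0, 0]); // (1 + 4 + 9) / 3 ≈ 4.667
```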
Real-World Example: Quantum — Probability Amplitudes
A qubit's state is $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$ where $\alpha$ and $\beta$ are complex numbers. The constraint:
$$|\alpha|^2 + |\beta|^2 = 1$$
This is just algebra: the probabilities of measuring 0 or 1 must sum to 1. If $\alpha = \frac{1}{\sqrt{2}}$ and $\beta = \frac{1}{\sqrt{2}}$, then $|\alpha|^2 + |\beta|^2 = \frac{1}{2} + \frac{1}{2} = 1$ ✓. The qubit has equal probability of being measured as 0 or 1 — that's superposition.
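A sketch of the normalisation check, restricted to real amplitudes for simplicity (in general $\alpha$ and $\beta$ are complex; `isNormalised` is our name):

```javascript
// A (real-valued) qubit state is normalised when the measurement
// probabilities |alpha|^2 + |beta|^2 sum to 1.
function isNormalised(alpha, beta, tol = 1e-12) {
  return Math.abs(alpha * alpha + beta * beta - 1) < tol;
}

isNormalised(1 / Math.sqrt(2), 1 / Math.sqrt(2)); // true — equal superposition
isNormalised(0.5, 0.5);                           // false — sums to 0.5
```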
Practice Problems
Solve: $5x - 3 = 17$
Solve: $\frac{x + 4}{2} = 7$
Solve for $t$: $d = v_0 t + \frac{1}{2}at^2$ when $d = 100$, $v_0 = 0$, $a = 10$
Compute: $\displaystyle\sum_{i=1}^{4} (2i + 1)$
If MSE $= \frac{1}{3}\sum_{i=1}^{3}(y_i - \hat{y}_i)^2$ and the errors are $1, -2, 3$, find MSE.
The vending machine analogy: You put in a coin (input), press a button (function), and get exactly one item (output). You can't press one button and get two different things. That's what "function" means mathematically: one input always gives the same one output.
The input variable (here $x$) is called the independent variable. The output $f(x)$ is the dependent variable (it depends on what you feed in). We often write $y = f(x)$.
Evaluating Functions — Step by Step
To evaluate a function, replace every $x$ with the input value:
Notice: $(-1)^2 = 1$ (negative times negative = positive), and $-2(-1) = +2$. Signs trip up everyone at first — go slow.
Graphing a Function
A graph is a picture of all input-output pairs. The x-axis is the input, the y-axis is the output. Every point $(x, y)$ on the curve satisfies $y = f(x)$.
Reading a graph: Pick any x-value on the horizontal axis. Go straight up (or down) until you hit the curve. The y-value at that point is $f(x)$. That's it. A graph is just a visual lookup table.
Key Function Types You'll See Everywhere
Linear: $f(x) = mx + b$
A straight line. $m$ = slope (rise/run), $b$ = y-intercept. Constant rate of change.
Example: $f(x) = 2x + 3$ passes through $(0, 3)$ with slope 2 (rises 2 for every 1 step right).
ML: A neuron without activation is just $y = wx + b$.
Quadratic: $f(x) = ax^2 + bx + c$
A parabola (U-shape or ∩-shape). The rate of change itself changes.
Example: $f(x) = x^2$ makes a U-shape centered at origin.
ML: Mean Squared Error is quadratic in the predictions.
Exponential: $f(x) = a \cdot b^x$
Grows (or decays) by a constant percentage each step. Covered in depth next section.
Sinusoidal: $f(x) = A\sin(\omega x + \phi)$
Oscillates forever between $-A$ and $A$. Everything that repeats — waves, vibrations, AC power, seasonal patterns.
Quantum: Probability amplitudes are complex exponentials, which are sines and cosines.
ML-Specific Functions: Activation Functions
Neural networks need nonlinear functions between layers. Here are the big three:
ReLU: $f(x) = \max(0, x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } x \ge 0 \end{cases}$ Dead simple: zero out negatives. Used in ~90% of deep learning.
Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$ (squashes any number to the range $(0, 1)$) Output looks like probability. Used in binary classification.
Tanh: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ (squashes to $(-1, 1)$) Like sigmoid but centered at zero. Used in RNNs and LSTMs.
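All three translate directly from their formulas — a sketch:

```javascript
// The three activation functions, straight from their definitions.
const relu = (x) => Math.max(0, x);
const sigmoid = (x) => 1 / (1 + Math.exp(-x));
const tanh = (x) => (Math.exp(x) - Math.exp(-x)) / (Math.exp(x) + Math.exp(-x));

relu(-5);   // 0
sigmoid(0); // 0.5
tanh(0);    // 0
```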
Interactive: Function Explorer
Pick a function type and adjust parameters. See how the graph changes.
Composition: Functions Feeding Functions
If $f(x) = x^2$ and $g(x) = x + 3$, then $f(g(x)) = f(x+3) = (x+3)^2$. This is like piping in Unix: the output of $g$ becomes the input of $f$.
Example in ML: A two-layer neural network is function composition:
Layer 1: multiply by weights, add bias, apply ReLU. Layer 2: multiply by more weights, add bias, apply sigmoid. Function composition is all deep learning is.
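Composition as code — a sketch (`compose` is our helper):

```javascript
// Function composition: the output of g feeds into f.
const compose = (f, g) => (x) => f(g(x));

const f = (x) => x * x;   // f(x) = x^2
const g = (x) => x + 3;   // g(x) = x + 3
const fg = compose(f, g); // fg(x) = (x + 3)^2

fg(1); // 16
fg(2); // 25
```

Order matters: `compose(g, f)(2)` is $g(f(2)) = 4 + 3 = 7$, not $25$.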
Real-World Example: Sensor Processing Pipeline
A thermistor outputs resistance $R$. The processing pipeline is a chain of functions:
Each arrow is a function. The full pipeline is $f_4(f_3(f_2(f_1(R))))$. Every sensor in an autonomous vehicle, every feature extraction in ML, and every quantum circuit is a chain of functions.
If $f(x) = x^2$ and $g(x) = 2x + 1$, find $f(g(3))$ and $g(f(3))$.
Evaluate ReLU for inputs: $-5, -0.1, 0, 0.5, 3$.
For sigmoid $\sigma(x) = \frac{1}{1+e^{-x}}$: what is $\sigma(0)$? What happens as $x \to \infty$? As $x \to -\infty$?
Show Solutions
1. $f(0) = 1$, $f(1) = 0$, $f(-1) = 6$, $f(3) = 10$
2. $g(3) = 7$, so $f(g(3)) = 49$. $f(3) = 9$, so $g(f(3)) = 19$. Note: $f(g(x)) \neq g(f(x))$ in general!
3. ReLU: $0, 0, 0, 0.5, 3$ (all negatives become 0)
4. $\sigma(0) = \frac{1}{1+1} = 0.5$. As $x \to \infty$: $e^{-x} \to 0$, so $\sigma \to 1$. As $x \to -\infty$: $e^{-x} \to \infty$, so $\sigma \to 0$.
2. Limits: Getting Infinitely Close
Before we can define derivatives or integrals, we need one key idea: what value does a function approach as the input gets close to some number? This is a limit.
The hallway analogy: Imagine walking toward a door. You take a step that covers half the remaining distance. Then another half. Then another. You never reach the door, but you get infinitely close. A limit is the mathematical way to talk about that destination — the value you approach even if you never land exactly on it.
The Intuition
What is $f(x) = \frac{x^2 - 1}{x - 1}$ when $x = 1$? Plugging in gives $\frac{0}{0}$ — undefined. But watch what happens as $x$ gets close to 1:
The function approaches $2$. We write: $\displaystyle\lim_{x \to 1}\frac{x^2-1}{x-1} = 2$
Why? Factor the numerator: $\frac{x^2-1}{x-1} = \frac{(x-1)(x+1)}{x-1} = x + 1$ (when $x \neq 1$). As $x \to 1$, this approaches $1 + 1 = 2$. The function has a "hole" at $x = 1$, but the limit fills it in.
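You can watch the hole numerically — a quick sketch:

```javascript
// f(x) = (x^2 - 1)/(x - 1) is undefined at x = 1,
// but its values approach 2 from either side.
const f = (x) => (x * x - 1) / (x - 1);

f(0.999);           // ≈ 1.999
f(1.001);           // ≈ 2.001
Number.isNaN(f(1)); // true — 0/0 evaluates to NaN
```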
Formal Notation
$$\lim_{x \to a} f(x) = L$$
"As $x$ gets arbitrarily close to $a$ (but not equal to $a$), $f(x)$ gets arbitrarily close to $L$."
The function doesn't need to be defined at $a$ — limits care about what happens near $a$, not at $a$.
Computing Limits: The Toolkit
1. Direct Substitution
If plugging in gives a real number (no $0/0$, no blowup), that's your answer:
Worked example: $\displaystyle\lim_{x \to 2}(x^3 - 4x + 1)$
Try direct substitution: $(2)^3 - 4(2) + 1 = 8 - 8 + 1 = 1$. No issues, so the limit is $1$.
2. Factor and Cancel (resolving $0/0$)
Worked example: $\displaystyle\lim_{x \to 2}\frac{x^2 - 4}{x - 2}$
Step 1: Direct substitution gives $\frac{4-4}{2-2} = \frac{0}{0}$ — indeterminate.
Step 2: Factor: $\frac{(x-2)(x+2)}{x-2} = x + 2$ (valid when $x \neq 2$)
Step 3: Now substitute: $\displaystyle\lim_{x \to 2}(x+2) = 4$
Another example: $\displaystyle\lim_{x \to 3}\frac{x^2 - 9}{x - 3}$
Step 1: $\frac{0}{0}$ — indeterminate.
Step 2: $\frac{(x-3)(x+3)}{x-3} = x + 3$
Step 3: $\lim_{x \to 3}(x+3) = 6$
3. Multiply by Conjugate
Worked example: $\displaystyle\lim_{x \to 0}\frac{\sqrt{x+4}-2}{x}$
Step 1: Direct sub gives $\frac{2-2}{0} = \frac{0}{0}$.
Step 2: Multiply by conjugate: $\frac{\sqrt{x+4}-2}{x}\cdot\frac{\sqrt{x+4}+2}{\sqrt{x+4}+2} = \frac{x}{x(\sqrt{x+4}+2)} = \frac{1}{\sqrt{x+4}+2}$
Step 3: Now substitute: $\frac{1}{\sqrt{4}+2} = \frac{1}{4}$
4. Limits at Infinity
What happens as $x$ grows without bound? The highest-power terms dominate:
$\displaystyle\lim_{x \to \infty}\frac{3x^2 + 5x}{x^2 + 1}$: divide top and bottom by $x^2$ → $\frac{3+5/x}{1+1/x^2} \to \frac{3}{1} = 3$
Same idea as Big-O notation: keep the dominant term. $O(n^2 + n)$ simplifies to $O(n^2)$.
5. The Squeeze Theorem
If $g(x) \le f(x) \le h(x)$ near $a$, and both $g$ and $h$ approach $L$, then $f$ must also approach $L$. The function is "squeezed" to the limit.
Critical Limits (Memorise These)
$$\lim_{x \to 0}\frac{\sin x}{x} = 1$$
The sinc limit. Even though $\sin(0)/0$ is undefined, the ratio approaches exactly 1. This is why $\frac{d}{dx}\sin x = \cos x$ — it's the foundation of all trig calculus.
$$\lim_{n \to \infty}\left(1 + \frac{1}{n}\right)^n = e \approx 2.71828$$
The definition of $e$ — the natural base of growth and decay. Covered in depth next section.
$$\lim_{x \to \infty}\frac{x^n}{e^x} = 0 \quad\text{for any } n$$
Exponential always beats polynomial. This is why $O(2^n)$ algorithms are impractical.
One-Sided Limits and Continuity
Sometimes a function approaches different values from left vs right:
Since left ≠ right, the two-sided limit does not exist.
A function is continuous if: $\lim_{x \to a} f(x) = f(a)$. No holes, no jumps, no blowups.
Why this matters: ML optimisation (gradient descent) assumes the loss function is continuous and differentiable. ReLU has a corner at $x = 0$ (not differentiable there) — but it's continuous, and the "derivative" can be defined as either 0 or 1 at that point (subgradient). Understanding limits tells you where these edge cases arise.
Interactive: Limit Explorer
Watch what value $f(x)$ approaches as $x$ gets close to the critical point.
Real-World Example: Numerical Derivatives — Limits in Your Code
Every numerical derivative is an approximation of a limit:
You can't set $h = 0$. You make $h$ small and hope it's "close enough." But there's a tradeoff: too large → truncation error. Too small → floating-point cancellation. The sweet spot is $h \approx 10^{-8}$ for 64-bit floats. Understanding limits tells you why this tradeoff exists.
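The tradeoff is easy to demonstrate — a sketch of the forward difference (the helper name `numericDerivative` is ours):

```javascript
// Forward-difference derivative: a finite approximation of the limit.
const numericDerivative = (f, x, h = 1e-8) => (f(x + h) - f(x)) / h;

const f = (x) => x * x; // exact derivative: f'(x) = 2x

numericDerivative(f, 3);        // ≈ 6 — near the sweet spot
numericDerivative(f, 3, 1e-15); // far from 6 — floating-point cancellation
```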
Real-World Example: ML — Softmax Temperature
The softmax function with temperature $T$ is: $\text{softmax}_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$
Low temperature = confident, picks the highest score. High temperature = uncertain, spreads probability evenly. This is exactly a limit: as $T \to 0$, softmax approaches argmax (all probability on the top score); as $T \to \infty$, it approaches the uniform distribution. GPT's "temperature" slider controls this limit — $T = 0$ is greedy decoding, $T \to \infty$ is uniformly random.
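A sketch of softmax with temperature (production code also subtracts the max score before exponentiating, to avoid overflow):

```javascript
// Softmax with temperature T: exponentiate scaled scores, then normalise.
function softmax(scores, T = 1) {
  const exps = scores.map((z) => Math.exp(z / T));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

softmax([1, 2, 3], 1);   // moderate preference for the highest score
softmax([1, 2, 3], 0.1); // ≈ [0, 0, 1] — nearly greedy
softmax([1, 2, 3], 100); // ≈ [1/3, 1/3, 1/3] — nearly uniform
```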
Does $\displaystyle\lim_{x \to 0}\frac{|x|}{x}$ exist? (Hint: try from left and right separately)
Show Solutions
1. Direct sub: $3(5) - 7 = 8$
2. $\frac{0}{0}$ → factor: $\frac{(x-1)(x+1)}{x+1} = x - 1$ → $\lim = -1 - 1 = -2$
3. Divide by $x^3$: $\frac{5 + 2/x^2}{1 - 1/x^3} \to 5$
4. From right: $|x|/x = x/x = 1$. From left: $|x|/x = -x/x = -1$. Left ≠ right, so limit DNE.
3. Exponents, Logarithms & the Number $e$
This chapter is critical. Almost every differential equation solution, every ML activation function, every quantum time-evolution involves $e^{something}$.
Exponents: Repeated Multiplication
$2^3 = 2 \times 2 \times 2 = 8$ (three factors of 2 multiplied together)
$5^2 = 25, \quad 10^4 = 10000, \quad 3^1 = 3, \quad 7^0 = 1$ (anything to the 0th power = 1)
Logarithms: The Inverse of Exponents
$$\log_b(x) = y \quad\Longleftrightarrow\quad b^y = x$$
$\log_2(8) = 3$ because $2^3 = 8$
$\log_{10}(1000) = 3$ because $10^3 = 1000$
$\log_2(1) = 0$ because $2^0 = 1$
$\log_2(32) = 5$ because $2^5 = 32$
Think of it as "how many times do I multiply?" $\log_2(64)$ asks: "How many times do I multiply 2 by itself to get 64?" Answer: $2 \times 2 \times 2 \times 2 \times 2 \times 2 = 64$, so 6 times. $\log_2(64) = 6$. This is literally binary search depth: searching 64 items takes at most $\log_2(64) = 6$ steps.
The Number $e$
$e \approx 2.71828$ is what you get when you compound continuously.
Why $e$ is everywhere: $e^x$ is the unique function that is its own derivative: $\frac{d}{dx}e^x = e^x$. The rate of change equals the current value. This makes it the natural solution to "something changes proportionally to itself" — which is how nearly every physical, biological, and computational system works.
The natural logarithm $\ln(x) = \log_e(x)$. In code: Math.exp(x) = $e^x$, Math.log(x) = $\ln(x)$.
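For other bases, use change of base: $\log_b(x) = \frac{\ln x}{\ln b}$ — a sketch (`logBase` is our helper name; `Math.log2` and `Math.log10` also exist):

```javascript
// Any-base logarithm via the change-of-base rule.
const logBase = (b, x) => Math.log(x) / Math.log(b);

logBase(2, 64);        // 6 — binary search depth for 64 items
logBase(10, 1000);     // 3
Math.exp(Math.log(5)); // ≈ 5 — exp and ln undo each other
```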
Interactive: Exponential Growth & Decay
See how $y = a \cdot e^{kx}$ behaves. Positive $k$ = growth, negative $k$ = decay.
Real-World Example: ML — Cross-Entropy Loss & Log Probabilities
The cross-entropy loss for classification is:
$$L = -\sum_{i} y_i \log(\hat{y}_i)$$
Why $\log$? If the model predicts $\hat{y} = 0.99$ for the correct class, $-\log(0.99) \approx 0.01$ (tiny loss, good!). If it predicts $\hat{y} = 0.01$, $-\log(0.01) \approx 4.6$ (huge loss, bad!). The $\log$ turns multiplication-scale probabilities into addition-scale losses. It penalises confident wrong answers much more than uncertain ones.
This is also why we work in "log space" — multiplying tiny probabilities causes underflow; adding their logs doesn't.
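A sketch demonstrating the underflow (500 probabilities of $0.001$ — the values are illustrative):

```javascript
// Multiplying many small probabilities underflows to 0;
// summing their logs stays finite.
const probs = Array(500).fill(1e-3);

let product = 1;
for (const p of probs) product *= p;   // 10^-1500 is far below double range

let logSum = 0;
for (const p of probs) logSum += Math.log(p); // 500 * ln(0.001)

product; // 0 — underflowed
logSum;  // ≈ -3453.9 — perfectly usable
```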
Real-World Example: Quantum — Time Evolution
Solving the Schrödinger equation gives the time-evolution operator $U(t) = e^{-iHt/\hbar}$, where $H$ is the Hamiltonian (energy operator). The exponential of a matrix appears! This matrix exponential is what quantum gates implement. For instance, a rotation gate $R_z(\theta) = e^{-i\theta Z/2}$ rotates a qubit's phase by angle $\theta$. The math of $e^x$ extends from scalars to matrices to operators — same concept, bigger objects.
Practice Problems
Simplify: $3^4 \cdot 3^2$
Find: $\log_3(81)$
An ML model's accuracy doubles every 2 years. Starting at 50%, write an exponential model for accuracy $A(t)$, and find $A(6)$.
Compute $-\log_2(0.5)$ and $-\log_2(0.125)$. Which represents more "surprise" (information content)?
If a radioactive sample decays as $N(t) = 1000 \cdot e^{-0.1t}$, how much remains after $t = 10$? After $t = 23$ (hint: find the half-life first)?
The Problem: How Fast Is Something Changing Right Now?
You're driving. Your odometer reads 100 km at 2:00 PM and 160 km at 3:00 PM. Average speed = 60 km/h. But were you going exactly 60 the whole time? Probably not.
To know your speed at exactly 2:15 PM, shrink the time interval smaller and smaller. The derivative is the limit of this process — the instantaneous rate of change.
Coding analogy: If you log a value every millisecond, the derivative at time $t$ is approximately (values[t+1] - values[t]) / dt. The derivative is this finite difference in the limit as dt → 0.
The gradient is the vector of all partial derivatives: $\nabla f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right)$. It points in the direction of steepest ascent. Gradient descent goes the opposite way: $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla L(\mathbf{w})$. That's the entire idea behind training neural networks.
The sigmoid derivative is $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$, with its maximum at $x = 0$: $\sigma'(0) = 0.5 \times 0.5 = 0.25$. At the extremes, $\sigma' \to 0$ — the vanishing gradient problem. This is why deep networks with sigmoid activations train poorly; gradients shrink exponentially through layers.
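A quick sketch to see the vanishing gradient numerically, using the standard identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$:

```javascript
// Sigmoid and its derivative.
const sigmoid = (x) => 1 / (1 + Math.exp(-x));
const sigmoidPrime = (x) => sigmoid(x) * (1 - sigmoid(x));

sigmoidPrime(0);  // 0.25 — the maximum
sigmoidPrime(10); // ≈ 0.000045 — vanishing gradient territory
```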
Practice Problems
Differentiate: $f(x) = 6x^4 - 3x^2 + 2x - 9$
Use the chain rule: $f(x) = \sin(3x)$
Use the chain rule: $f(x) = e^{2x+1}$
Find the partial derivatives of $f(x,y) = x^2 + 3xy + y^3$
At what $x$ is $f(x) = x^2 - 4x + 3$ at its minimum? (Hint: set $f'(x) = 0$)
The ReLU function $f(x) = \max(0, x)$ has derivative $f'(x) = 0$ for $x < 0$ and $f'(x) = 1$ for $x > 0$. Why is this computationally efficient compared to sigmoid?
Show Solutions
1. $24x^3 - 6x + 2$
2. $3\cos(3x)$
3. $2e^{2x+1}$
4. $\frac{\partial f}{\partial x} = 2x + 3y$, $\frac{\partial f}{\partial y} = 3x + 3y^2$
5. $f'(x) = 2x - 4 = 0$ → $x = 2$. $f(2) = 4 - 8 + 3 = -1$. Minimum at $(2, -1)$.
6. ReLU derivative is just 0 or 1 — no exponential computation. Sigmoid needs $e^{-x}$ which is expensive. In a network with millions of neurons, this matters hugely.
5. Integrals: Accumulating Change
The integral is the reverse of the derivative. If the derivative chops a quantity into rates, the integral adds the rates back up.
The rainfall analogy: A derivative tells you "how fast is rain falling right now?" (mm/hour). An integral tells you "how much total rain accumulated over the day?" You add up tiny amounts of rain at each moment — that's integration.
The Problem: Adding Up Tiny Pieces
You have a graph of speed vs time. Distance = speed × time. But speed keeps changing! So chop time into tiny pieces $\Delta t$, multiply each by the speed at that moment, and add:
$$\text{distance} \approx \sum_{i=1}^{N} v(t_i) \cdot \Delta t \quad\xrightarrow{\Delta t \to 0}\quad \text{distance} = \int_{t_1}^{t_2} v(t)\,dt$$
The $\int$ is a stretched "S" for "Sum". The $dt$ tells you what variable you're summing over.
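The sum-becomes-integral idea as code — a sketch (`riemannSum` is our name):

```javascript
// Left Riemann sum: chop [a, b] into N slices, sum f(x_i) * dx.
function riemannSum(f, a, b, N) {
  const dx = (b - a) / N;
  let total = 0;
  for (let i = 0; i < N; i++) total += f(a + i * dx) * dx;
  return total;
}

riemannSum((x) => x * x, 0, 3, 1000);    // ≈ 8.987
riemannSum((x) => x * x, 0, 3, 1000000); // ≈ 9 — the exact integral
```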
Antiderivatives
We ask: "what function, when differentiated, gives me this?"
$$\int x^n\,dx = \frac{x^{n+1}}{n+1} + C \quad(n \neq -1) \qquad \int e^x\,dx = e^x + C$$
$$\int \frac{1}{x}\,dx = \ln|x| + C \qquad \int \cos x\,dx = \sin x + C \qquad \int \sin x\,dx = -\cos x + C$$
Worked example 1: $\int (6x^2 + 4x - 3)\,dx$
Step 1: Apply power rule to each term: $\frac{6x^3}{3} + \frac{4x^2}{2} - 3x + C = 2x^3 + 2x^2 - 3x + C$
Worked example 2: $\int 5e^{2x}\,dx$
Think: The derivative of $e^{2x}$ is $2e^{2x}$ (chain rule). So the antiderivative of $e^{2x}$ is $\frac{1}{2}e^{2x}$.
Result: $5 \cdot \frac{1}{2}e^{2x} + C = \frac{5}{2}e^{2x} + C$
Definite Integrals (Area Under the Curve)
$$\int_a^b f(x)\,dx = F(b) - F(a)$$
The Fundamental Theorem of Calculus: find the antiderivative $F$, evaluate at endpoints, subtract.
Worked example: $\int_0^3 x^2\,dx$
Step 1: Antiderivative: $F(x) = \frac{x^3}{3}$
Step 2: $F(3) - F(0) = \frac{27}{3} - 0 = 9$
Worked example: $\int_1^e \frac{1}{x}\,dx$
Step 1: Antiderivative: $F(x) = \ln(x)$
Step 2: $\ln(e) - \ln(1) = 1 - 0 = 1$
Interactive: Riemann Sum → Integral
Watch the rectangles approximate the area. More rectangles = better approximation.
Real-World Example: ML — Expected Value is an Integral
The expected value of a continuous random variable:
$$E[X] = \int_{-\infty}^{\infty} x \cdot p(x)\,dx$$
This is a weighted average where the weights are the probability density $p(x)$. In ML, expected loss over the data distribution is: $E[L] = \int L(\theta, x) \cdot p(x)\,dx$. Since we can't compute this integral exactly (we don't know $p(x)$), we approximate it with a sample average — the empirical risk. That's why training uses batches of data: they're Monte Carlo estimates of an integral.
Real-World Example: Quantum — Normalisation of Wave Functions
A particle's wave function $\psi(x)$ must satisfy:
$$\int_{-\infty}^{\infty} |\psi(x)|^2\,dx = 1$$
The probability of finding the particle somewhere must be 1 (it exists!). The integral of $|\psi|^2$ over all space is the total probability. If you solve the Schrödinger equation and get a solution $\phi(x)$, you normalise it: $\psi(x) = \frac{\phi(x)}{\sqrt{\int|\phi|^2\,dx}}$. This is integration in action.
Practice Problems
$\int (3x^2 + 2x + 1)\,dx$
$\int_0^2 (4x - 1)\,dx$
$\int_0^{\pi} \sin(x)\,dx$ (hint: antiderivative of $\sin$ is $-\cos$)
If velocity $v(t) = 3t^2$, find position $x(t)$ given $x(0) = 5$.
Write the Riemann sum (code-style) for $\int_0^1 e^x\,dx$ with $N = 4$ rectangles.
Now we combine derivatives with equations. A differential equation (DE) is an equation containing derivatives. Instead of telling you a value, it tells you the rule by which a value changes.
Ordinary equation: $x = 5$ (tells you the value)
Differential equation: $\frac{dx}{dt} = -2x$ (tells you the rule of change)
The SWE analogy: An ordinary equation is a constant: const x = 5. A differential equation is a loop that updates state: while(running) { x += -2*x*dt; }. You define the update rule, and the system evolves.
Why DEs, Not Direct Answers?
In the real world, we rarely know the answer directly. We know how things relate in the moment:
"Hot coffee cools proportionally to the temperature difference" → $\frac{dT}{dt} = -k(T - T_{\text{room}})$
"Force equals mass times acceleration" → $F = m\frac{d^2x}{dt^2}$
"A capacitor's voltage changes at a rate inversely proportional to RC" → $\frac{dV}{dt} = -\frac{V}{RC}$
"Neural network weights move opposite to the gradient" → $\frac{d\mathbf{w}}{dt} = -\eta\nabla L(\mathbf{w})$
"Quantum state evolves according to the Hamiltonian" → $i\hbar\frac{d|\psi\rangle}{dt} = H|\psi\rangle$
Terminology
Order = highest derivative. $\frac{dx}{dt} = \ldots$ is 1st order. $\frac{d^2x}{dt^2} = \ldots$ is 2nd order.
ODE = Ordinary DE (one independent variable, usually time $t$).
Solution = a function $x(t)$ that satisfies the equation when you plug it in.
Initial condition (IC) = the starting value, e.g. $x(0) = 5$. Without this, you have a family of solutions.
Worked example: $\frac{dv}{dt} = 9.8, \quad v(0) = 0$ (a ball dropped from rest)
Step 1: This says "velocity changes at a constant rate of 9.8 m/s²" (gravity).
Step 2: Integrate: $v(t) = 9.8t + C$. The initial condition $v(0) = 0$ forces $C = 0$, so $v(t) = 9.8t$.
Pick a real system. See its DE, the solution, and the curve.
Real-World Example: ML Training Loss is a DE
Gradient descent: $w_{n+1} = w_n - \eta \frac{\partial L}{\partial w}$. In the continuous limit:
$$\frac{dw}{dt} = -\eta \nabla L(w)$$
For quadratic loss $L = w^2$: $\frac{dw}{dt} = -2\eta w$, solution $w(t) = w_0 e^{-2\eta t}$. This is why training loss curves look like exponential decays — they literally are.
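The discrete iterates match that exponential — a sketch (names and values are ours):

```javascript
// Gradient descent on L(w) = w^2 is Euler's method on dw/dt = -2*eta*w,
// so the iterates decay geometrically: w_{n+1} = w_n * (1 - 2*eta).
function descend(w0, eta, steps) {
  let w = w0;
  for (let n = 0; n < steps; n++) w -= eta * 2 * w; // gradient of w^2 is 2w
  return w;
}

descend(1.0, 0.1, 10); // ≈ 0.107 — exponential-looking decay toward 0
```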
Real-World Example: Quantum — The Schrödinger Equation
$$i\hbar\frac{\partial|\psi\rangle}{\partial t} = \hat{H}|\psi\rangle$$
This says: "the rate of change of the quantum state is determined by the energy operator (Hamiltonian)." For a free particle, $\hat{H} = -\frac{\hbar^2}{2m}\frac{\partial^2}{\partial x^2}$, and the solutions are plane waves $\psi = Ae^{i(kx - \omega t)}$. Every quantum computer simulates this equation — it's the "physics engine" of reality.
Practice Problems
Solve: $\frac{dy}{dt} = 3$, $y(0) = 10$
Solve: $\frac{dy}{dt} = 2t$, $y(0) = 5$
What does the DE $\frac{dN}{dt} = rN$ model? What's the general solution?
A model's loss satisfies $\frac{dL}{dt} = -0.5L$. Starting from $L(0) = 10$, what is $L$ after 4 time units?
This is the discrete version of the first-order ODE $\frac{d\hat{\mu}}{dt} + \alpha\hat{\mu} = \alpha x$. The parameter $\alpha$ controls the time constant: small $\alpha$ → slow, smooth tracking (large $\tau$). Large $\alpha$ → fast, noisy tracking (small $\tau$). Same math as the RC circuit, different application.
Practice Problems
Solve: $\frac{dy}{dt} = -2y$, $y(0) = 10$. What is $y(3)$?
Solve: $\frac{dy}{dt} + 0.5y = 5$, $y(0) = 0$. What is the steady state? What is $y(4)$?
A learning rate decays as $\eta(t) = \eta_0 e^{-\lambda t}$. If $\eta_0 = 0.01$ and you want the rate to halve every 100 epochs, find $\lambda$.
Real systems have multiple quantities changing at once, often affecting each other. A car has position AND velocity AND heading. A neural network has millions of weights updating simultaneously.
From One DE to Many
Newton's law $m\frac{d^2x}{dt^2} = F$ can be split into two first-order DEs:
$$\frac{dx}{dt} = v \qquad \frac{dv}{dt} = \frac{F}{m}$$
State-Space Form
Pack all variables into a state vector and write the system as a matrix equation:
$$\dot{\mathbf{x}} = A\mathbf{x} + B\mathbf{u}$$
$\mathbf{x}$ = state vector, $A$ = system dynamics, $B$ = input coupling, $\mathbf{u}$ = control input
Row 1: $\dot{x} = v$ (position changes by velocity)
Row 2: $\dot{v} = -bv + F/m$ (velocity changes by force minus drag)
Why state space? It turns any system into a single matrix equation. Eigenvalues of $A$ determine stability. Matrix exponential $e^{At}$ gives the solution. The tools from linear algebra apply directly.
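A sketch of simulating this state-space model with a simple Euler loop (the parameter values are illustrative; Chapter 9 covers why this works):

```javascript
// Euler simulation of the car-with-drag state-space model:
//   dx/dt = v,   dv/dt = -b*v + F/m
function simulate(b, F, m, dt, tMax) {
  let x = 0, v = 0;
  for (let t = 0; t < tMax; t += dt) {
    const dxdt = v;
    const dvdt = -b * v + F / m;
    x += dt * dxdt;
    v += dt * dvdt;
  }
  return { x, v };
}

// With b = 1, F = 1, m = 1 the velocity settles at F/(m*b) = 1.
simulate(1, 1, 1, 0.01, 20).v; // ≈ 1
```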
Interactive: 2D Vehicle Model
A car with position and velocity. Adjust throttle and drag.
Real-World Example: ML — Neural ODEs
A ResNet layer computes $\mathbf{h}_{t+1} = \mathbf{h}_t + f(\mathbf{h}_t, \theta)$. In the continuous limit:
$$\frac{d\mathbf{h}}{dt} = f(\mathbf{h}, \theta)$$
This is a system of DEs! A "Neural ODE" (Chen et al., 2018) literally replaces discrete layers with a continuous DE solved by numerical integration (Chapter 9). The depth of the network becomes continuous. This connects state-space models directly to deep learning architecture design.
Real-World Example: Quantum — Multi-Qubit State Evolution
A 2-qubit system has a $4 \times 4$ Hamiltonian matrix $H$. The state vector $|\psi\rangle$ has 4 components (probability amplitudes for $|00\rangle, |01\rangle, |10\rangle, |11\rangle$). The system of DEs:
$$i\hbar\frac{d|\psi\rangle}{dt} = H|\psi\rangle$$
Same structure as $\dot{\mathbf{x}} = A\mathbf{x}$! For $n$ qubits, the state vector has $2^n$ components. Simulating this on a classical computer requires exponential resources — that's why quantum computers exist.
9. Numerical Methods: How Computers Solve DEs
As a SWE, this is where you live. You'll rarely solve a DE by hand in production. But you must understand how numerical solvers work, because the choice of method affects accuracy, stability, and speed.
Euler's Method
$$y_{n+1} = y_n + \Delta t \cdot f(t_n, y_n)$$
"Current value + step size × current slope = next value"
In code:
let y = y0;
for (let t = 0; t < tMax; t += dt) {
  y += dt * f(t, y);
}
Worked example: Solve $\frac{dy}{dt} = -y$, $y(0) = 1$, with $\Delta t = 0.5$.
Step 1: $y_1 = y_0 + 0.5 \cdot (-y_0) = 1 - 0.5 = 0.5$
Step 2: $y_2 = 0.5 + 0.5 \cdot (-0.5) = 0.25$
Step 3: $y_3 = 0.25 + 0.5 \cdot (-0.25) = 0.125$
Each step halves $y$. The exact solution is $y = e^{-t}$, so $y(1.5) = e^{-1.5} \approx 0.223$ — this coarse step undershoots (0.125). Smaller steps reduce the error.
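The worked example as runnable code — a sketch (`eulerDecay` is our name):

```javascript
// Euler's method for dy/dt = -y, y(0) = 1.
// Each step: y += dt * (-y), i.e. y *= (1 - dt).
function eulerDecay(y0, dt, steps) {
  let y = y0;
  for (let n = 0; n < steps; n++) y += dt * -y;
  return y;
}

eulerDecay(1, 0.5, 2);   // 0.25  (exact answer at t = 1: e^{-1} ≈ 0.368)
eulerDecay(1, 0.05, 20); // ≈ 0.358 — smaller steps, closer to e^{-1}
```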
Gradient descent, $w_{n+1} = w_n - \eta\nabla L(w_n)$, is literally Euler's method on $\frac{dw}{dt} = -\nabla L(w)$ with step size $\eta$!
Adam, RMSProp, and other optimisers are more sophisticated numerical methods — they use momentum (like RK4 uses multiple slope samples) and adaptive step sizes. Understanding numerical methods tells you why Adam works better than vanilla SGD: it's a better DE solver.
10. Control Systems & PID
A control system is a DE with a feedback loop. It measures where you are, compares to where you want to be, and applies a correction.
P (Proportional) — "The further I am, the harder I push." Can overshoot.
I (Integral) — "If I've been off for a while, push harder." Eliminates steady-state error but can cause sluggish oscillation.
D (Derivative) — "If I'm approaching the target fast, ease off." Provides damping but amplifies noise.
Driving analogy: P = "I'm far from the lane center, steer hard." I = "I've been drifting right for 10 seconds, add persistent left correction." D = "I'm swinging back fast, ease off before I overshoot."
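A sketch of a textbook discrete PID step (the gains and `dt` here are illustrative, not tuned):

```javascript
// Closure-based PID: keeps the integral and previous error as state.
function makePID(kp, ki, kd, dt) {
  let integral = 0, prevError = 0;
  return (error) => {
    integral += error * dt;                      // I: accumulated error
    const derivative = (error - prevError) / dt; // D: rate of change of error
    prevError = error;
    return kp * error + ki * integral + kd * derivative;
  };
}

const pid = makePID(2.0, 0.5, 0.1, 0.1);
pid(1.0); // first correction: P dominates
pid(0.5); // error shrinking: the D term now pushes back
```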
Interactive: PID Controller
A mass must reach the target. Try: P only (oscillates), P+D (fast, less overshoot), P+I (no steady-state error).
Real-World Example: ML — Learning Rate as Control
Training a neural network is a control problem:
Target: Minimum loss (zero gradient)
Error: Current gradient $\nabla L$
P (proportional): $-\eta \nabla L$ (vanilla SGD — step proportional to gradient)
I (integral): Momentum — accumulates past gradients. $v_{t+1} = \beta v_t + \nabla L$ is literally the integral of gradients
D (derivative): Gradient clipping — if the gradient changes too fast (high "derivative"), limit it
Adam optimiser is essentially a sophisticated PID controller for the loss landscape.
11. Probability & Statistics for Sensors
Every sensor lies. Every model is uncertain. Statistics quantifies uncertainty and tells you how much to trust each source of information.
$P(A \text{ or } B) = P(A) + P(B)$ (if mutually exclusive)
$P(A \text{ and } B) = P(A) \times P(B)$ (if independent)
Bayes' Theorem
$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$
Worked example (ML): A spam classifier flags 95% of spam and correctly passes 90% of non-spam. 20% of emails are spam. Given a "spam" classification, what's the probability it's actually spam?
Step 1: $P(\text{flag}|\text{spam}) = 0.95$, $P(\text{spam}) = 0.2$, $P(\text{flag}|\text{not spam}) = 0.1$
Step 2: $P(\text{flag}) = 0.95 \times 0.2 + 0.1 \times 0.8 = 0.19 + 0.08 = 0.27$
Step 3: $P(\text{spam}|\text{flag}) = \frac{0.19}{0.27} \approx 0.70$ — only about 70%, despite the classifier's "95% accuracy".
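The same computation in code — a sketch (`bayes` is our helper name):

```javascript
// Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B),
// with P(B) expanded by the law of total probability.
function bayes(pBgivenA, pA, pBgivenNotA) {
  const pB = pBgivenA * pA + pBgivenNotA * (1 - pA);
  return (pBgivenA * pA) / pB;
}

// 95% true-positive rate, 20% base rate, 10% false-positive rate:
bayes(0.95, 0.2, 0.1); // ≈ 0.704 — a "spam" flag is right about 70% of the time
```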
Real-World Example: ML — Weight Initialisation
Network weights are drawn from a Gaussian $\mathcal{N}(0, \sigma^2)$, with $\sigma$ set by schemes like He initialisation ($\sigma = \sqrt{2/n_{\text{in}}}$ for ReLU networks). If $\sigma$ is too large, activations explode. Too small, they vanish. These formulas come from variance analysis of matrix multiplications through layers — statistics meets linear algebra meets calculus.
Real-World Example: Quantum — Measurement is Probabilistic
A qubit in state $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$ gives outcome $0$ with probability $P(0) = |\alpha|^2$ and outcome $1$ with probability $P(1) = |\beta|^2$.
Unlike classical computing, quantum outcomes are inherently probabilistic. To estimate $P(0)$, you run the circuit many times and compute the frequency — the same statistical estimation as sensor averaging. The more measurements, the more precise your estimate (by $\sim 1/\sqrt{N}$, the same as GPS averaging). Statistical thinking is essential for quantum computing.
Practice Problems
If $P(A) = 0.3$ and $P(B) = 0.4$, with $A$ and $B$ independent, find $P(A \text{ and } B)$.
A sensor gives readings: 5.1, 4.9, 5.0, 5.2, 4.8. Find mean and standard deviation.
Using Bayes' theorem: A disease affects 1% of people. A test is 99% accurate (true positive and true negative). If you test positive, what's the probability you're actually sick?
If measurements follow $\mathcal{N}(100, 4)$ (mean 100, $\sigma = 2$), what percentage fall between 96 and 104?
Show Solutions
1. $P(A \cap B) = 0.3 \times 0.4 = 0.12$
2. Mean = 5.0. Deviations: 0.1, -0.1, 0, 0.2, -0.2. Variance = $(0.01+0.01+0+0.04+0.04)/5 = 0.02$. $\sigma \approx 0.141$.
3. $P(\text{sick}|+) = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.01 \times 0.99} = \frac{0.0099}{0.0099 + 0.0099} = 0.5$. Only 50%! The low base rate (1%) means false positives dominate.
4. 96 to 104 is $\mu \pm 2\sigma$, so approximately 95%.
12. The Kalman Filter
This uses everything before it. Algebra (equations), functions (state transitions), exponents (uncertainty growth), derivatives (system model), integrals (prediction), DEs (physics model), systems (state space), numerical methods (discrete propagation), control (correction), and statistics (Gaussian uncertainty). It is one of the most important algorithms in engineering.
The Core Problem
You have two imperfect sources of information:
A physics model (DE): "I was going 20 m/s north, so I should be 20 m further north." But the model drifts.
A sensor measurement (GPS, LiDAR, etc.): "You are at position X." But the sensor is noisy.
Neither is perfect. The Kalman filter answers: "What is the best estimate given both?"
The two lost friends analogy: Imagine two friends, both lost, each pointing in a different direction to "home." One has a wobbly compass (the model); the other has a blurry map (the sensor). Individually, neither is reliable. But by combining their estimates, weighting each by its reliability, you get a much better answer. That's the Kalman filter.
The Kalman gain $K$ is the magic: If sensor is precise ($R$ small), $K \to 1$ (trust sensor). If model is precise ($P$ small), $K \to 0$ (trust model). It automatically computes the optimal blend.
After just one measurement, uncertainty dropped from 4.1 to 2.02. Each subsequent measurement reduces it further. This is the power of sensor fusion.
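A scalar (1D) version makes the gain behaviour concrete. This is a minimal sketch with illustrative numbers (not the 4.1 → 2.02 values above): a prior estimate with variance $P$ is fused with measurements of variance $R$.

```python
def kalman_1d_update(x, P, z, R):
    """Fuse estimate x (variance P) with measurement z (variance R)."""
    K = P / (P + R)          # Kalman gain: how much to trust the sensor
    x_new = x + K * (z - x)  # blend the estimate toward the measurement
    P_new = (1 - K) * P      # uncertainty always shrinks after an update
    return x_new, P_new

# Illustrative numbers: prior variance 4.0, sensor variance 4.0.
x, P = 10.0, 4.0
for z in (12.0, 11.0, 11.5):
    x, P = kalman_1d_update(x, P, z, R=4.0)
    print(round(x, 3), round(P, 3))  # P falls 4.0 -> 2.0 -> 1.33 -> 1.0
```

Notice that $P$ drops after every update regardless of the measurement value — each fusion step genuinely reduces uncertainty, exactly as described above.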
Interactive: 1D Kalman Filter
A car moves at constant velocity. GPS (red dots) is noisy. The Kalman filter (green) fuses the motion model with measurements. Adjust noise levels.
Real-World Example: GPS/INS Fusion
A self-driving car runs an Extended Kalman Filter at 100 Hz:
Predict (100 Hz, from IMU): Integrate acceleration & gyro. Uncertainty grows.
Update (10 Hz, from GPS): Compare predicted position to GPS fix. Large correction. Uncertainty shrinks.
Update (30 Hz, from camera): Compare predicted lane position to detected lines. Another correction.
In a tunnel (no GPS), uncertainty grows and grows. When GPS returns, the first fix causes a large correction (high gain × big innovation).
Real-World Example: ML — Bayesian Updating
The Kalman filter is Bayesian inference for linear Gaussian systems. The predict step is the prior. The update step applies Bayes' theorem with a Gaussian likelihood. The result is the posterior.
In ML, Bayesian neural networks, Gaussian processes, and variational inference all use this same predict-update loop conceptually. The Kalman filter is the simplest, most elegant instance of this pattern.
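For the 1D Gaussian case, the equivalence can be shown in two lines. Multiplying a Gaussian prior $\mathcal{N}(\hat{x}, P)$ by a Gaussian likelihood $\mathcal{N}(z \mid x, R)$ gives a Gaussian posterior whose mean and variance are exactly the Kalman update:

```latex
\mu_{\text{post}} = \frac{R\,\hat{x} + P\,z}{P + R}
                  = \hat{x} + \underbrace{\frac{P}{P+R}}_{K}\,(z - \hat{x}),
\qquad
\sigma^2_{\text{post}} = \frac{P R}{P + R} = (1 - K)\,P .
```

So the gain $K$ is not a heuristic — it falls straight out of Bayes' theorem applied to two Gaussians.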
Real-World Example: Quantum — State Estimation
Quantum state tomography — reconstructing a quantum state from measurements — faces the same problem: noisy measurements of an uncertain state. Quantum Kalman filters exist for tracking continuously measured quantum systems, using the same predict-update structure but adapted for quantum noise (Heisenberg uncertainty principle adds a fundamental floor on $Q$ and $R$).
Practice Problems
If $P = 2$ and $R = 8$, what is the Kalman gain $K$? Does the filter trust the model or the sensor more?
If $Q$ is very large (model is bad), what happens to $K$ over time?
A Kalman filter has $\hat{x} = 50$, prediction $= 52$, measurement $= 58$, $K = 0.4$. What is the updated estimate?
Why can't the Kalman filter handle $\dot{x} = v\cos\theta$ directly? (Hint: what makes this nonlinear?)
Show Solutions
1. $K = P/(P+R) = 2/(2+8) = 0.2$. Low gain — trusts the model more (it has lower uncertainty).
2. Large $Q$ → $P$ grows quickly → $K$ stays high → trusts sensor more (model is unreliable).
3. $\hat{x} = 52 + 0.4(58 - 52) = 52 + 2.4 = 54.4$
4. $\cos\theta$ depends on state $\theta$, so the "A matrix" isn't constant — it changes with the state. This is nonlinear. The Extended Kalman Filter (EKF) linearises by computing the Jacobian (matrix of partial derivatives) at each timestep.
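The linearisation in solution 4 can be sketched directly. For a discrete unicycle model (a standard simple vehicle model; the state layout and timestep here are illustrative), the Jacobian the EKF recomputes every step is:

```python
import math

def unicycle_jacobian(theta, v, dt):
    """Jacobian A = df/d(state) of the discrete unicycle model
       x' = x + v*cos(theta)*dt,  y' = y + v*sin(theta)*dt,  theta' = theta,
    evaluated at the current heading theta. Because the entries depend on
    the state, the EKF must re-linearise at every timestep."""
    return [
        [1.0, 0.0, -v * math.sin(theta) * dt],
        [0.0, 1.0,  v * math.cos(theta) * dt],
        [0.0, 0.0,  1.0],
    ]

A = unicycle_jacobian(theta=math.pi / 2, v=20.0, dt=0.01)
print(A[0][2], A[1][2])  # the heading-dependence lives in these entries
```

A plain Kalman filter assumes a constant $A$ matrix; here $A$ changes whenever $\theta$ or $v$ changes, which is precisely why the problem is nonlinear.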
Every topic feeds the next. The math isn't abstract — it's the literal code running inside every GPS receiver, every autonomous vehicle, every drone, every ML training loop, and every quantum computer. You now have the complete foundation.
Where to go from here:
ML depth: Study linear algebra for matrix operations, backpropagation, SVD, PCA
Robotics: Study trigonometry for rotations, coordinate transforms, inverse kinematics
Quantum: Study complex numbers, linear algebra (Hilbert spaces), and Dirac notation
Statistics: Study the full probability & statistics guide for distributions, hypothesis testing, and Bayesian methods