Complete Engineering Mathematics

An interactive course for software engineers — prerequisite math for ML, robotics, quantum computing, and beyond

You write code for a living. You think in logic, loops, and data structures. This course translates the mathematics you need into that mindset. Every concept is grounded in something concrete, connected to systems you care about — machine learning, autonomous vehicles, sensor fusion, quantum computing — and packed with worked examples so you can practice until it clicks.

The Khan Academy philosophy here: Don't memorise formulas. Understand why they work. Every formula started as someone's clever observation. We'll rebuild that observation, step by step, with examples. If you understand the "why," you'll re-derive what you forget.

This course is structured sequentially — each chapter builds on the previous ones. Budget roughly 3 hours for a thorough first pass, more if you work through every practice problem.

Algebra → Functions → Limits → Exponents & Logs → Derivatives → Integrals → Diff Equations → Systems → Numerical Methods → Control → Statistics → Kalman Filter

0. Algebra: The Language

Algebra is just using letters to represent unknown numbers, then finding those numbers using rules. You already do this in code: let x = totalPrice / quantity. Math does the same thing — with different notation.

Think of it this way: Algebra is the "programming language" of math. Variables are parameters. Equations are assertions. Solving is debugging — you manipulate until you isolate the unknown. Everything else in this course is written in this language.

Variables and Expressions

A variable is a placeholder for a number you don't know yet (like a function parameter). An expression is a recipe that uses variables:

$3x + 7$ means "take some number $x$, multiply by 3, add 7"
If $x = 4$:   $3(4) + 7 = 12 + 7 = 19$
If $x = -2$:   $3(-2) + 7 = -6 + 7 = 1$
If $x = 0$:   $3(0) + 7 = 0 + 7 = 7$

Multiplication is often written without a sign: $3x$ means $3 \times x$. Parentheses work like code: evaluate the inside first.

More examples of expressions: $x^2 - 5$ ("square $x$, then subtract 5"), $\frac{a + b}{2}$ ("the average of $a$ and $b$"), $2\pi r$ ("the circumference of a circle with radius $r$").

Equations: Finding the Unknown

An equation says two expressions are equal. Solving means finding what value of the variable makes it true.

Golden rule: Whatever you do to one side, do to the other side. The equation stays balanced — like a scale. Add 5 to the left? Add 5 to the right. Multiply the left by 3? Multiply the right by 3.

Worked example 1: Solve $3x + 7 = 22$

Step 1: Subtract 7 from both sides → $3x + 7 - 7 = 22 - 7$ → $3x = 15$
Step 2: Divide both sides by 3 → $x = 5$
Check: $3(5) + 7 = 15 + 7 = 22$ ✓
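Those two steps are mechanical enough to code directly. A minimal sketch (`solveLinear` is a made-up helper name, not something from this course):

```javascript
// Solve a*x + b = c by undoing each operation, as in the steps above:
// subtract b from both sides, then divide both sides by a.
function solveLinear(a, b, c) {
  if (a === 0) throw new Error("a must be nonzero");
  return (c - b) / a;
}

console.log(solveLinear(3, 7, 22)); // x = 5, matching the worked example
console.log(solveLinear(2, -6, 10)); // x = 8 (worked example 2, after distributing)
```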

Worked example 2: Solve $2(x - 3) = 10$

Step 1: Distribute the 2: $2x - 6 = 10$
Step 2: Add 6 to both sides: $2x = 16$
Step 3: Divide by 2: $x = 8$
Check: $2(8 - 3) = 2(5) = 10$ ✓

Worked example 3 (rearranging a formula): Solve $\frac{v - v_0}{a} = t$ for $v$

Step 1: Multiply both sides by $a$ → $v - v_0 = at$
Step 2: Add $v_0$ to both sides → $v = v_0 + at$

This is the velocity equation from physics. You just derived it by rearranging. You'll see this exact equation in robotics and autonomous vehicles.

Worked example 4 (ML connection): In linear regression, you predict $\hat{y} = wx + b$. If you know two data points $(1, 3)$ and $(2, 5)$, find $w$ and $b$.

Step 1: From point $(1,3)$: $w(1) + b = 3$ → $w + b = 3$
Step 2: From point $(2,5)$: $w(2) + b = 5$ → $2w + b = 5$
Step 3: Subtract equation 1 from equation 2: $(2w + b) - (w + b) = 5 - 3$ → $w = 2$
Step 4: Substitute back: $2 + b = 3$ → $b = 1$
Result: $\hat{y} = 2x + 1$. Check: $f(1) = 3$ ✓, $f(2) = 5$ ✓

This is the simplest form of "training" a model — solving equations to find weights. Real ML does this with millions of parameters using calculus (Chapter 4).
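The elimination steps above translate directly to code. A sketch, with `fitLine` as a hypothetical helper name:

```javascript
// Fit y = w*x + b exactly through two points, mirroring the steps above:
// subtracting the two equations isolates w; substituting back gives b.
function fitLine([x1, y1], [x2, y2]) {
  const w = (y2 - y1) / (x2 - x1);
  const b = y1 - w * x1;
  return { w, b };
}

const { w, b } = fitLine([1, 3], [2, 5]);
console.log(w, b); // 2 1, i.e. the model ŷ = 2x + 1
```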

Interactive: Equation Solver

Solve $ax + b = c$ — enter coefficients and check your answer.


Systems of Equations

Sometimes you have multiple unknowns. You need as many independent equations as unknowns. This is called a system of equations.

Worked example: A drone's position depends on wind speed $w$ and motor thrust $m$. Given measurements:

$m + w = 10$   (moving forward, wind assists)
$m - w = 6$   (moving backward, wind opposes)
Step 1: Add both equations: $2m = 16$ → $m = 8$
Step 2: Substitute: $8 + w = 10$ → $w = 2$
Check: $8 + 2 = 10$ ✓, $8 - 2 = 6$ ✓
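The add/subtract elimination above, sketched in code (`solveSumDiff` is a made-up name):

```javascript
// Given m + w = s1 and m - w = s2:
// adding the equations cancels w; subtracting them cancels m.
function solveSumDiff(s1, s2) {
  const m = (s1 + s2) / 2;
  const w = (s1 - s2) / 2;
  return { m, w };
}

console.log(solveSumDiff(10, 6)); // { m: 8, w: 2 }
```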
Why systems matter: In ML, training a neural network is solving a massive system of equations (millions of weights). In quantum computing, the state of $n$ qubits involves $2^n$ probability amplitudes that must satisfy constraints. Systems of equations are everywhere.

Subscripts and Greek Letters

Math uses subscripts to label related variables: $v_0$ means "the initial velocity", $v_f$ means "the final velocity". They're just names — like v_initial and v_final in code.

Greek letters are used because we run out of Roman ones. The ones you'll meet in this course: $\alpha$ (alpha) and $\beta$ (beta) for amplitudes, $\eta$ (eta) for the learning rate, $\theta$ (theta) for angles and parameters, $\sigma$ (sigma) for the sigmoid, $\omega$ (omega) and $\phi$ (phi) for frequency and phase, $\psi$ (psi) for quantum states, and capital $\Sigma$ (sigma) for sums and $\Delta$ (delta) for changes.

Summation Notation (Σ)

This is literally a for-loop:

$$\sum_{i=1}^{5} i^2 = 1^2 + 2^2 + 3^2 + 4^2 + 5^2 = 1 + 4 + 9 + 16 + 25 = 55$$

In code: let sum = 0; for (let i = 1; i <= 5; i++) sum += i*i;

More examples:

$\displaystyle\sum_{i=1}^{4} i = 1 + 2 + 3 + 4 = 10$

$\displaystyle\sum_{i=0}^{3} 2^i = 1 + 2 + 4 + 8 = 15$

$\displaystyle\sum_{i=1}^{N} x_i$ just means "add up all the $x$ values from $x_1$ to $x_N$"
Real-World Example: GPS Averaging for Better Accuracy

Surveyors collect $N$ GPS readings and average them to reduce noise:

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

This is just algebra + summation. With 100 readings, the average position is ~10× more precise than a single reading.

Real-World Example: ML Loss Function

The Mean Squared Error loss function is pure algebra with summation:

$$\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$$

For each data point $i$: take the true value $y_i$, subtract your model's prediction $\hat{y}_i$, square it (so negatives don't cancel), and average. This single formula drives the training of most ML models.

In code: mse = predictions.map((p,i) => (y[i]-p)**2).reduce((a,b) => a+b) / N;

Real-World Example: Quantum — Probability Amplitudes

A qubit's state is $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$ where $\alpha$ and $\beta$ are complex numbers. The constraint:

$$|\alpha|^2 + |\beta|^2 = 1$$

This is just algebra: the probabilities of measuring 0 or 1 must sum to 1. If $\alpha = \frac{1}{\sqrt{2}}$ and $\beta = \frac{1}{\sqrt{2}}$, then $|\alpha|^2 + |\beta|^2 = \frac{1}{2} + \frac{1}{2} = 1$ ✓. The qubit has equal probability of being measured as 0 or 1 — that's superposition.

Practice Problems

  1. Solve: $5x - 3 = 17$
  2. Solve: $\frac{x + 4}{2} = 7$
  3. Solve for $t$: $d = v_0 t + \frac{1}{2}at^2$ when $d = 100$, $v_0 = 0$, $a = 10$
  4. Compute: $\displaystyle\sum_{i=1}^{4} (2i + 1)$
  5. If MSE $= \frac{1}{3}\sum_{i=1}^{3}(y_i - \hat{y}_i)^2$ and the errors are $1, -2, 3$, find MSE.
Show Solutions
1. $5x = 20$ → $x = 4$
2. $x + 4 = 14$ → $x = 10$
3. $100 = \frac{1}{2}(10)t^2$ → $t^2 = 20$ → $t = \sqrt{20} \approx 4.47$ s
4. $(2(1)+1) + (2(2)+1) + (2(3)+1) + (2(4)+1) = 3 + 5 + 7 + 9 = 24$
5. $\frac{1}{3}(1^2 + (-2)^2 + 3^2) = \frac{1}{3}(1 + 4 + 9) = \frac{14}{3} \approx 4.67$

1. Functions: Input → Output Machines

A function is a machine that takes an input and produces exactly one output. In code, it's literally a function:

Math: $f(x) = x^2 + 1$
Code: function f(x) { return x*x + 1; }

$f(3) = 10, \quad f(-2) = 5, \quad f(0) = 1, \quad f(1) = 2$
The vending machine analogy: You put in a coin (input), press a button (function), and get exactly one item (output). You can't press one button and get two different things. That's what "function" means mathematically: one input always gives the same one output.

The input variable (here $x$) is called the independent variable. The output $f(x)$ is the dependent variable (it depends on what you feed in). We often write $y = f(x)$.

Evaluating Functions — Step by Step

To evaluate a function, replace every $x$ with the input value:

Example: $g(x) = 3x^2 - 2x + 5$. Find $g(2)$ and $g(-1)$.

$g(2)$: $3(2)^2 - 2(2) + 5 = 3(4) - 4 + 5 = 12 - 4 + 5 = 13$
$g(-1)$: $3(-1)^2 - 2(-1) + 5 = 3(1) + 2 + 5 = 10$

Notice: $(-1)^2 = 1$ (negative times negative = positive), and $-2(-1) = +2$. Signs trip up everyone at first — go slow.

Graphing a Function

A graph is a picture of all input-output pairs. The x-axis is the input, the y-axis is the output. Every point $(x, y)$ on the curve satisfies $y = f(x)$.

Reading a graph: Pick any x-value on the horizontal axis. Go straight up (or down) until you hit the curve. The y-value at that point is $f(x)$. That's it. A graph is just a visual lookup table.

Key Function Types You'll See Everywhere

Linear: $f(x) = mx + b$

A straight line. $m$ = slope (rise/run), $b$ = y-intercept. Constant rate of change.

Example: $f(x) = 2x + 3$ passes through $(0, 3)$ with slope 2 (rises 2 for every 1 step right).

ML: A neuron without activation is just $y = wx + b$.

Quadratic: $f(x) = ax^2 + bx + c$

A parabola (U-shape or ∩-shape). The rate of change itself changes.

Example: $f(x) = x^2$ makes a U-shape centered at origin.

ML: Mean Squared Error is quadratic in the predictions.

Exponential: $f(x) = a \cdot b^x$

Grows (or decays) by a constant percentage each step. Covered in depth next section.

ML: Learning rate decay, vanishing/exploding gradients.

Sinusoidal: $f(x) = A\sin(\omega x + \phi)$

Oscillates forever between $-A$ and $A$. Everything that repeats — waves, vibrations, AC power, seasonal patterns.

Quantum: Probability amplitudes are complex exponentials, which are sines and cosines.

ML-Specific Functions: Activation Functions

Neural networks need nonlinear functions between layers. Here are the big three:

ReLU: $f(x) = \max(0, x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } x \ge 0 \end{cases}$
Dead simple: zero out negatives. The default activation in most modern deep networks.

Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$   (squashes any number to the range $(0, 1)$)
Output looks like probability. Used in binary classification.

Tanh: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$   (squashes to $(-1, 1)$)
Like sigmoid but centered at zero. Used in RNNs and LSTMs.
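All three activations are one-liners. A sketch:

```javascript
// The three activation functions from this section.
const relu = x => Math.max(0, x);
const sigmoid = x => 1 / (1 + Math.exp(-x));
const tanh = x => (Math.exp(x) - Math.exp(-x)) / (Math.exp(x) + Math.exp(-x));

console.log(relu(-5), relu(3)); // 0 3
console.log(sigmoid(0));        // 0.5
console.log(tanh(0));           // 0
```

(JavaScript has `Math.tanh` built in; writing it out here shows the formula.)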

Interactive: Function Explorer

Pick a function type and adjust parameters. See how the graph changes.


Composition: Functions Feeding Functions

If $f(x) = x^2$ and $g(x) = x + 3$, then $f(g(x)) = f(x+3) = (x+3)^2$. This is like piping in Unix: the output of $g$ becomes the input of $f$.

Example in ML: A two-layer neural network is function composition:

$$\text{output} = f_2(f_1(\mathbf{x})) = \sigma(W_2 \cdot \text{ReLU}(W_1 \cdot \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2)$$

Layer 1: multiply by weights, add bias, apply ReLU. Layer 2: multiply by more weights, add bias, apply sigmoid. Function composition is all deep learning is.
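A scalar sketch of that composition (real layers use matrices; the weight values 1.5, 0.5, 2.0, and -1.0 here are made up):

```javascript
const relu = x => Math.max(0, x);
const sigmoid = x => 1 / (1 + Math.exp(-x));

const layer1 = x => relu(1.5 * x + 0.5);    // W1·x + b1, then ReLU
const layer2 = h => sigmoid(2.0 * h - 1.0); // W2·h + b2, then sigmoid
const network = x => layer2(layer1(x));     // composition: f2(f1(x))

console.log(network(1)); // a probability-like output in (0, 1)
```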

Real-World Example: Sensor Processing Pipeline

A thermistor outputs resistance $R$. The processing pipeline is a chain of functions:

$$R \xrightarrow{f_1} \text{voltage} \xrightarrow{f_2} \text{digital count} \xrightarrow{f_3} \text{temperature in °C} \xrightarrow{f_4} \text{filtered temperature}$$

Each arrow is a function. The full pipeline is $f_4(f_3(f_2(f_1(R))))$. Every sensor in an autonomous vehicle, every feature extraction in ML, and every quantum circuit is a chain of functions.

Practice Problems

  1. If $f(x) = 2x^2 - 3x + 1$, find $f(0)$, $f(1)$, $f(-1)$, $f(3)$.
  2. If $f(x) = x^2$ and $g(x) = 2x + 1$, find $f(g(3))$ and $g(f(3))$.
  3. Evaluate ReLU for inputs: $-5, -0.1, 0, 0.5, 3$.
  4. For sigmoid $\sigma(x) = \frac{1}{1+e^{-x}}$: what is $\sigma(0)$? What happens as $x \to \infty$? As $x \to -\infty$?
Show Solutions
1. $f(0) = 1$, $f(1) = 0$, $f(-1) = 6$, $f(3) = 10$
2. $g(3) = 7$, so $f(g(3)) = 49$. $f(3) = 9$, so $g(f(3)) = 19$. Note: $f(g(x)) \neq g(f(x))$ in general!
3. ReLU: $0, 0, 0, 0.5, 3$ (all negatives become 0)
4. $\sigma(0) = \frac{1}{1+1} = 0.5$. As $x \to \infty$: $e^{-x} \to 0$, so $\sigma \to 1$. As $x \to -\infty$: $e^{-x} \to \infty$, so $\sigma \to 0$.

2. Limits: Getting Infinitely Close

Before we can define derivatives or integrals, we need one key idea: what value does a function approach as the input gets close to some number? This is a limit.

The hallway analogy: Imagine walking toward a door. You take a step that covers half the remaining distance. Then another half. Then another. You never reach the door, but you get infinitely close. A limit is the mathematical way to talk about that destination — the value you approach even if you never land exactly on it.

The Intuition

What is $f(x) = \frac{x^2 - 1}{x - 1}$ when $x = 1$? Plugging in gives $\frac{0}{0}$ — undefined. But watch what happens as $x$ gets close to 1:

$f(0.9) = 1.9 \quad f(0.99) = 1.99 \quad f(0.999) = 1.999$
$f(1.001) = 2.001 \quad f(1.01) = 2.01 \quad f(1.1) = 2.1$

The function approaches $2$. We write: $\displaystyle\lim_{x \to 1}\frac{x^2-1}{x-1} = 2$

Why? Factor the numerator: $\frac{x^2-1}{x-1} = \frac{(x-1)(x+1)}{x-1} = x + 1$ (when $x \neq 1$). As $x \to 1$, this approaches $1 + 1 = 2$. The function has a "hole" at $x = 1$, but the limit fills it in.
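You can watch this happen numerically. A sketch that tabulates the same values as above:

```javascript
// f has a hole at x = 1, but its values approach 2 from both sides.
const f = x => (x * x - 1) / (x - 1);
for (const x of [0.9, 0.99, 0.999, 1.001, 1.01, 1.1]) {
  console.log(x, f(x));
}
console.log(f(1)); // NaN: 0/0 is undefined; the limit is about nearby values
```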

Formal Notation

$$\lim_{x \to a} f(x) = L$$

"As $x$ gets arbitrarily close to $a$ (but not equal to $a$), $f(x)$ gets arbitrarily close to $L$."

The function doesn't need to be defined at $a$ — limits care about what happens near $a$, not at $a$.

Computing Limits: The Toolkit

1. Direct Substitution

If plugging in gives a real number (no $0/0$, no blowup), that's your answer:

$\displaystyle\lim_{x \to 3}(2x + 1) = 7 \qquad \lim_{x \to 0}\cos(x) = 1 \qquad \lim_{x \to 4}\sqrt{x} = 2$

Worked example: $\displaystyle\lim_{x \to 2}(x^3 - 4x + 1)$

Try direct substitution: $(2)^3 - 4(2) + 1 = 8 - 8 + 1 = 1$. No issues, so the limit is $1$.

2. Factor and Cancel (resolving $0/0$)

Worked example: $\displaystyle\lim_{x \to 2}\frac{x^2 - 4}{x - 2}$

Step 1: Direct substitution gives $\frac{4-4}{2-2} = \frac{0}{0}$ — indeterminate.
Step 2: Factor: $\frac{(x-2)(x+2)}{x-2} = x + 2$ (valid when $x \neq 2$)
Step 3: Now substitute: $\displaystyle\lim_{x \to 2}(x+2) = 4$

Another example: $\displaystyle\lim_{x \to 3}\frac{x^2 - 9}{x - 3}$

Step 1: $\frac{0}{0}$ — indeterminate.
Step 2: $\frac{(x-3)(x+3)}{x-3} = x + 3$
Step 3: $\lim_{x \to 3}(x+3) = 6$

3. Multiply by Conjugate

Worked example: $\displaystyle\lim_{x \to 0}\frac{\sqrt{x+4}-2}{x}$

Step 1: Direct sub gives $\frac{2-2}{0} = \frac{0}{0}$.
Step 2: Multiply by conjugate: $\frac{\sqrt{x+4}-2}{x}\cdot\frac{\sqrt{x+4}+2}{\sqrt{x+4}+2} = \frac{x}{x(\sqrt{x+4}+2)} = \frac{1}{\sqrt{x+4}+2}$
Step 3: Now substitute: $\frac{1}{\sqrt{4}+2} = \frac{1}{4}$

4. Limits at Infinity

What happens as $x$ grows without bound? The highest-power terms dominate:

$\displaystyle\lim_{x \to \infty}\frac{3x^2 + 5x}{x^2 + 1}$: divide top and bottom by $x^2$ → $\frac{3+5/x}{1+1/x^2} \to \frac{3}{1} = 3$

Same idea as Big-O notation: keep the dominant term. $O(n^2 + n)$ simplifies to $O(n^2)$.

5. The Squeeze Theorem

If $g(x) \le f(x) \le h(x)$ near $a$, and both $g$ and $h$ approach $L$, then $f$ must also approach $L$. The function is "squeezed" to the limit.

Critical Limits (Memorise These)

$$\lim_{x \to 0}\frac{\sin x}{x} = 1$$

The sinc limit. Even though $\sin(0)/0$ is undefined, the ratio approaches exactly 1. This is why $\frac{d}{dx}\sin x = \cos x$ — it's the foundation of all trig calculus.

$$\lim_{n \to \infty}\left(1 + \frac{1}{n}\right)^n = e \approx 2.71828$$

The definition of $e$ — the natural base of growth and decay. Covered in depth next section.

$$\lim_{x \to \infty}\frac{x^n}{e^x} = 0 \quad\text{for any } n$$

Exponential always beats polynomial. This is why $O(2^n)$ algorithms are impractical.

One-Sided Limits and Continuity

Sometimes a function approaches different values from left vs right:

$\displaystyle\lim_{x \to 0^+}\frac{1}{x} = +\infty \qquad \lim_{x \to 0^-}\frac{1}{x} = -\infty$

Since left ≠ right, the two-sided limit does not exist.

A function is continuous if: $\lim_{x \to a} f(x) = f(a)$. No holes, no jumps, no blowups.

Why this matters: ML optimisation (gradient descent) assumes the loss function is continuous and differentiable. ReLU has a corner at $x = 0$ (not differentiable there) — but it's continuous, and the "derivative" can be defined as either 0 or 1 at that point (subgradient). Understanding limits tells you where these edge cases arise.

Interactive: Limit Explorer

Watch what value $f(x)$ approaches as $x$ gets close to the critical point.

Real-World Example: Numerical Derivatives — Limits in Your Code

Every numerical derivative is an approximation of a limit:

const derivative = (f, x, h = 1e-8) => (f(x + h) - f(x)) / h;

You can't set $h = 0$. You make $h$ small and hope it's "close enough." But there's a tradeoff: too large → truncation error. Too small → floating-point cancellation. The sweet spot is $h \approx 10^{-8}$ for 64-bit floats. Understanding limits tells you why this tradeoff exists.
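You can see the tradeoff by sweeping $h$. A sketch using $f(x) = x^2$ at $x = 1$, where the true derivative is 2:

```javascript
// Forward-difference error for several step sizes: the error first shrinks
// with h (truncation error is about h here), then grows again as
// floating-point cancellation in f(x + h) - f(x) takes over.
const f = x => x * x;
for (const h of [1e-2, 1e-5, 1e-8, 1e-11, 1e-14]) {
  const approx = (f(1 + h) - f(1)) / h;
  console.log(h, Math.abs(approx - 2));
}
```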

Real-World Example: ML — Softmax Temperature

The softmax function with temperature $T$ is: $\text{softmax}_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$

$$\lim_{T \to 0^+} \text{softmax} \to \text{argmax (one-hot)} \qquad \lim_{T \to \infty} \text{softmax} \to \text{uniform}$$

Low temperature = confident, picks the highest score. High temperature = uncertain, spreads probability evenly. This is exactly a limit! GPT's "temperature" slider controls this limit — $T = 0$ is greedy decoding, $T = \infty$ is random.
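A sketch of softmax with temperature (this naive version can overflow for large scores; production code subtracts the max score first):

```javascript
const softmax = (z, T) => {
  const exps = z.map(zi => Math.exp(zi / T)); // scale scores by 1/T
  const total = exps.reduce((a, b) => a + b);
  return exps.map(e => e / total);            // normalise to sum to 1
};

console.log(softmax([2, 1, 0], 1));    // peaked on the highest score
console.log(softmax([2, 1, 0], 0.01)); // nearly one-hot: ≈ [1, 0, 0]
console.log(softmax([2, 1, 0], 1000)); // nearly uniform: ≈ [1/3, 1/3, 1/3]
```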

Practice Problems

  1. $\displaystyle\lim_{x \to 5}(3x - 7)$
  2. $\displaystyle\lim_{x \to -1}\frac{x^2 - 1}{x + 1}$
  3. $\displaystyle\lim_{x \to \infty}\frac{5x^3 + 2x}{x^3 - 1}$
  4. Does $\displaystyle\lim_{x \to 0}\frac{|x|}{x}$ exist? (Hint: try from left and right separately)
Show Solutions
1. Direct sub: $3(5) - 7 = 8$
2. $\frac{0}{0}$ → factor: $\frac{(x-1)(x+1)}{x+1} = x - 1$ → $\lim = -1 - 1 = -2$
3. Divide by $x^3$: $\frac{5 + 2/x^2}{1 - 1/x^3} \to 5$
4. From right: $|x|/x = x/x = 1$. From left: $|x|/x = -x/x = -1$. Left ≠ right, so limit DNE.

3. Exponents, Logarithms & the Number $e$

This chapter is critical. Almost every differential equation solution, every ML activation function, every quantum time-evolution involves $e^{something}$.

Exponents: Repeated Multiplication

$2^3 = 2 \times 2 \times 2 = 8$   ("2 multiplied by itself 3 times")
$5^2 = 25, \quad 10^4 = 10000, \quad 3^1 = 3, \quad 7^0 = 1$ (anything to the 0th power = 1)

Key rules (worth memorising):

$$a^m \cdot a^n = a^{m+n} \qquad \frac{a^m}{a^n} = a^{m-n} \qquad (a^m)^n = a^{mn}$$ $$a^{-n} = \frac{1}{a^n} \qquad a^{1/2} = \sqrt{a} \qquad a^{1/n} = \sqrt[n]{a}$$

Worked examples with the rules:

Example: $2^3 \times 2^4 = 2^{3+4} = 2^7 = 128$
Example: $\frac{10^6}{10^2} = 10^{6-2} = 10^4 = 10000$
Example: $(3^2)^3 = 3^{2 \times 3} = 3^6 = 729$
Example: $4^{-1} = \frac{1}{4} = 0.25$
Example: $9^{1/2} = \sqrt{9} = 3$

Logarithms: The Inverse of Exponents

A logarithm answers: "what power do I need?"

$$\log_b(x) = y \quad\Longleftrightarrow\quad b^y = x$$

$\log_2(8) = 3$ because $2^3 = 8$
$\log_{10}(1000) = 3$ because $10^3 = 1000$
$\log_2(1) = 0$ because $2^0 = 1$
$\log_2(32) = 5$ because $2^5 = 32$

Think of it as "how many times do I multiply?" $\log_2(64)$ asks: "How many times do I multiply 2 by itself to get 64?" Answer: $2 \times 2 \times 2 \times 2 \times 2 \times 2 = 64$, so 6 times. $\log_2(64) = 6$. This is literally binary search depth: searching 64 items takes at most $\log_2(64) = 6$ steps.
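Counting halvings in code makes the binary-search connection concrete (`halvings` is a made-up helper; this simple version assumes $n$ is a power of two):

```javascript
// "How many times can I halve n before reaching 1?" That's log₂(n).
const halvings = n => {
  let steps = 0;
  while (n > 1) { n /= 2; steps++; } // exact for powers of two
  return steps;
};

console.log(halvings(64));  // 6
console.log(Math.log2(64)); // 6
```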

Log rules (mirror the exponent rules):

$$\log(xy) = \log(x) + \log(y) \qquad \log(x/y) = \log(x) - \log(y) \qquad \log(x^n) = n\log(x)$$

Worked example: Simplify $\log_2(8 \times 32)$

Method 1 (direct): $8 \times 32 = 256 = 2^8$, so $\log_2(256) = 8$
Method 2 (using rules): $\log_2(8) + \log_2(32) = 3 + 5 = 8$ ✓

The Number $e$ ≈ 2.71828...

Imagine you invest $1 at 100% annual interest. Compounded once, you end the year with 2. Compounded twice (50% each half-year): $(1.5)^2 = 2.25$. Compounded $n$ times: $(1 + \frac{1}{n})^n$. As $n \to \infty$:

$$e = \lim_{n \to \infty}\left(1 + \frac{1}{n}\right)^n \approx 2.71828$$

$e$ is what you get when you compound continuously.

Why $e$ is everywhere: $e^x$ is the unique function that is its own derivative: $\frac{d}{dx}e^x = e^x$. The rate of change equals the current value. This makes it the natural solution to "something changes proportionally to itself" — which is how nearly every physical, biological, and computational system works.

The natural logarithm $\ln(x) = \log_e(x)$. In code: Math.exp(x) = $e^x$, Math.log(x) = $\ln(x)$.

Interactive: Exponential Growth & Decay

See how $y = a \cdot e^{kx}$ behaves. Positive $k$ = growth, negative $k$ = decay.

Real-World Example: ML — Cross-Entropy Loss & Log Probabilities

The cross-entropy loss for classification is:

$$L = -\sum_{i} y_i \log(\hat{y}_i)$$

Why $\log$? If the model predicts $\hat{y} = 0.99$ for the correct class, $-\log(0.99) \approx 0.01$ (tiny loss, good!). If it predicts $\hat{y} = 0.01$, $-\log(0.01) \approx 4.6$ (huge loss, bad!). The $\log$ turns multiplication-scale probabilities into addition-scale losses. It penalises confident wrong answers much more than uncertain ones.

This is also why we work in "log space" — multiplying tiny probabilities causes underflow; adding their logs doesn't.
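A quick sketch of the underflow problem and the log-space fix:

```javascript
// Multiply 500 probabilities of 0.01: the true value is 10^-1000, far below
// the smallest positive double (~5e-324), so the product underflows to 0.
const probs = Array(500).fill(0.01);
const product = probs.reduce((acc, p) => acc * p, 1);
const logSum = probs.reduce((acc, p) => acc + Math.log(p), 0);

console.log(product); // 0 (underflow)
console.log(logSum);  // ≈ -2302.6, perfectly representable
```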

Real-World Example: Information Theory — Entropy

Shannon's entropy measures uncertainty:

$$H = -\sum_{i} p_i \log_2(p_i) \quad\text{(bits)}$$

Fair coin: $H = -(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{2}\log_2\frac{1}{2}) = -(-\frac{1}{2} - \frac{1}{2}) = 1$ bit. Maximum uncertainty.

Biased coin ($p=0.99$): $H \approx 0.08$ bits. Almost no uncertainty.

This underpins data compression, information gain in decision trees, and the KL divergence used in variational autoencoders.
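Entropy is a few lines of code. A sketch (`entropy` is a made-up helper name):

```javascript
// H = -Σ p·log₂(p), skipping zero-probability outcomes
// (0·log 0 is taken as 0 by convention).
const entropy = ps =>
  ps.filter(p => p > 0).reduce((h, p) => h - p * Math.log2(p), 0);

console.log(entropy([0.5, 0.5]));   // 1 bit (fair coin)
console.log(entropy([0.99, 0.01])); // ≈ 0.08 bits (biased coin)
console.log(entropy([1]));          // 0 bits (no uncertainty)
```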

Real-World Example: Quantum — Unitary Evolution

In quantum computing, the time evolution of a quantum state is:

$$|\psi(t)\rangle = e^{-iHt/\hbar}|\psi(0)\rangle$$

where $H$ is the Hamiltonian (energy operator). The exponential of a matrix appears! This $e^{-iHt/\hbar}$ is what quantum gates implement. For instance, a rotation gate $R_z(\theta) = e^{-i\theta Z/2}$ rotates a qubit's phase by angle $\theta$. The math of $e^x$ extends from scalars to matrices to operators — same concept, bigger objects.

Practice Problems

  1. Simplify: $3^4 \cdot 3^2$
  2. Find: $\log_3(81)$
  3. An ML model's accuracy doubles every 2 years. Starting at 50%, write an exponential model for accuracy $A(t)$, and find $A(6)$.
  4. Compute $-\log_2(0.5)$ and $-\log_2(0.125)$. Which represents more "surprise" (information content)?
  5. If a radioactive sample decays as $N(t) = 1000 \cdot e^{-0.1t}$, how much remains after $t = 10$? After $t = 23$ (hint: find the half-life first)?
Show Solutions
1. $3^{4+2} = 3^6 = 729$
2. $3^? = 81 = 3^4$, so $\log_3(81) = 4$
3. Doubling every 2 years: $A(t) = 50 \cdot 2^{t/2}$. $A(6) = 50 \cdot 2^3 = 400\%$ (obviously capped — but the math works!)
4. $-\log_2(0.5) = -(-1) = 1$ bit. $-\log_2(0.125) = -(-3) = 3$ bits. The rarer event (0.125) carries more surprise/information.
5. $N(10) = 1000 \cdot e^{-1} \approx 368$. Half-life: $t_{1/2} = \frac{\ln 2}{0.1} \approx 6.93$. At $t=23$: $N = 1000 \cdot e^{-2.3} \approx 100$.

4. Derivatives: Measuring Change

The Problem: How Fast Is Something Changing Right Now?

You're driving. Your odometer reads 100 km at 2:00 PM and 160 km at 3:00 PM. Average speed = 60 km/h. But were you going exactly 60 the whole time? Probably not.

To know your speed at exactly 2:15 PM, shrink the time interval smaller and smaller. The derivative is the limit of this process — the instantaneous rate of change.

Coding analogy: If you log a value every millisecond, the derivative at time $t$ is approximately (values[t+1] - values[t]) / dt. The derivative is this finite difference in the limit as dt → 0.

The Derivative: Slope as $h \to 0$

$$f'(x) = \frac{df}{dx} = \lim_{h \to 0}\frac{f(x+h) - f(x)}{h}$$

Worked example from scratch: Find the derivative of $f(x) = x^2$.

Step 1: Compute $f(x+h) = (x+h)^2 = x^2 + 2xh + h^2$
Step 2: Compute the difference: $f(x+h) - f(x) = 2xh + h^2$
Step 3: Divide by $h$: $\frac{2xh + h^2}{h} = 2x + h$
Step 4: Let $h \to 0$: $\lim_{h \to 0}(2x + h) = 2x$
Result: $f'(x) = 2x$. At $x = 3$, the slope is $6$. At $x = -1$, the slope is $-2$.

One more from scratch: Derivative of $f(x) = 3x + 5$.

Step 1: $f(x+h) = 3(x+h) + 5 = 3x + 3h + 5$
Step 2: $f(x+h) - f(x) = 3h$
Step 3: $\frac{3h}{h} = 3$
Step 4: $\lim_{h \to 0} 3 = 3$. The derivative of a line is its slope. Makes sense!

Derivative Rules (Shortcuts)

Power rule: $\frac{d}{dx}x^n = nx^{n-1}$
  $x^2 \to 2x \quad x^3 \to 3x^2 \quad x^5 \to 5x^4 \quad x^1 \to 1 \quad x^0 \to 0$

Constant multiplier: $\frac{d}{dx}[cf(x)] = c \cdot f'(x)$
  $\frac{d}{dx}[5x^3] = 5 \cdot 3x^2 = 15x^2$

Sum rule: $\frac{d}{dx}[f + g] = f' + g'$
  $\frac{d}{dx}[x^3 + 4x^2 - 7x + 2] = 3x^2 + 8x - 7$

Product rule: $\frac{d}{dx}[fg] = f'g + fg'$
  $\frac{d}{dx}[x^2 \sin x] = 2x\sin x + x^2\cos x$

Quotient rule: $\frac{d}{dx}\left[\frac{f}{g}\right] = \frac{f'g - fg'}{g^2}$

Exponential: $\frac{d}{dx}e^x = e^x$   (this is why $e$ is special!)

Chain rule: $\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)$
  $\frac{d}{dx}e^{3x} = e^{3x} \cdot 3 = 3e^{3x}$   (outer derivative × inner derivative)

Trig: $\frac{d}{dx}\sin x = \cos x \qquad \frac{d}{dx}\cos x = -\sin x$

Worked Examples with Rules

Example 1 (power + sum): $f(x) = 4x^3 - 2x^2 + 7x - 5$

$f'(x) = 12x^2 - 4x + 7$

Example 2 (chain rule): $f(x) = (2x + 1)^5$

Outer: $(\text{something})^5 \to 5(\text{something})^4$. Inner: $2x + 1 \to 2$.
$f'(x) = 5(2x+1)^4 \cdot 2 = 10(2x+1)^4$

Example 3 (chain rule with $e$): $f(x) = e^{-x^2}$ (the Gaussian shape!)

Outer: $e^{(\text{something})} \to e^{(\text{something})}$. Inner: $-x^2 \to -2x$.
$f'(x) = e^{-x^2} \cdot (-2x) = -2xe^{-x^2}$

Partial Derivatives (Multiple Inputs)

When a function has multiple inputs, you can take the derivative with respect to each one, treating the others as constants:

$f(x, y) = x^2 y + 3xy^2$

$\frac{\partial f}{\partial x} = 2xy + 3y^2$   (treat $y$ as constant, differentiate w.r.t. $x$)

$\frac{\partial f}{\partial y} = x^2 + 6xy$   (treat $x$ as constant, differentiate w.r.t. $y$)
The gradient is the vector of all partial derivatives: $\nabla f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right)$. It points in the direction of steepest ascent. Gradient descent goes the opposite way: $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla L(\mathbf{w})$. That's the entire idea behind training neural networks.
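One gradient-descent step on the function above, as a sketch (the starting point and the learning rate $\eta = 0.01$ are made-up values):

```javascript
// f(x, y) = x²y + 3xy², with the gradient built from the partials above.
const gradF = (x, y) => [2 * x * y + 3 * y * y, x * x + 6 * x * y];

const eta = 0.01;    // learning rate
let [x, y] = [1, 1]; // starting point
const [gx, gy] = gradF(x, y);
[x, y] = [x - eta * gx, y - eta * gy]; // step against the gradient

console.log(x, y); // moved from (1, 1) toward lower f
```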

Physical Meanings

Position $x(t)$ → Velocity $v(t) = \frac{dx}{dt}$ → Acceleration $a(t) = \frac{dv}{dt} = \frac{d^2x}{dt^2}$

Temperature $T(t)$ → Cooling/heating rate $\frac{dT}{dt}$

Loss $L(\mathbf{w})$ → Gradient $\nabla L$ → Weight update direction

Interactive: Derivative Visualiser

See $f(x)$, its derivative $f'(x)$, and the tangent line at any point.

Real-World Example: ML — Backpropagation is Just the Chain Rule

Consider a simple network: input $x$ → hidden $h = \sigma(wx + b)$ → output $\hat{y} = vh + c$ → loss $L = (\hat{y} - y)^2$.

To train, we need $\frac{\partial L}{\partial w}$. By the chain rule:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial w}$$
$\frac{\partial L}{\partial \hat{y}}$: $2(\hat{y} - y)$   (power rule on loss)
$\frac{\partial \hat{y}}{\partial h}$: $v$   (linear layer)
$\frac{\partial h}{\partial w}$: $\sigma'(wx + b) \cdot x$   (chain rule on activation)

Multiply them together. That's backpropagation — the chain rule applied systematically from output to input. PyTorch's loss.backward() computes exactly this, automatically.
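The chain-rule product above can be computed by hand for one training pair. A sketch; all parameter values here are made up:

```javascript
const sigmoid = z => 1 / (1 + Math.exp(-z));
const [w, b, v, c] = [0.5, 0.1, 2.0, 0.0]; // made-up parameters
const [x, y] = [1.0, 1.0];                 // one training pair

const h = sigmoid(w * x + b);    // forward: hidden activation
const yHat = v * h + c;          // forward: prediction
const dL_dyHat = 2 * (yHat - y); // ∂L/∂ŷ
const dyHat_dh = v;              // ∂ŷ/∂h
const dh_dw = h * (1 - h) * x;   // ∂h/∂w, using σ' = σ(1 − σ)

const dL_dw = dL_dyHat * dyHat_dh * dh_dw; // the chain-rule product
console.log(dL_dw);
```

This is the computation `loss.backward()` automates, for millions of parameters at once.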

Real-World Example: The Sigmoid Derivative

The sigmoid $\sigma(x) = \frac{1}{1+e^{-x}}$ has an elegant derivative:

$$\sigma'(x) = \sigma(x)(1 - \sigma(x))$$

Derivation: Write $\sigma = (1+e^{-x})^{-1}$. Chain rule: $\sigma' = -(1+e^{-x})^{-2} \cdot (-e^{-x}) = \frac{e^{-x}}{(1+e^{-x})^2}$.

Factor: $= \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = \sigma \cdot \frac{(1+e^{-x})-1}{1+e^{-x}} = \sigma(1-\sigma)$.

Maximum at $x = 0$: $\sigma'(0) = 0.5 \times 0.5 = 0.25$. At the extremes, $\sigma' \to 0$ — the vanishing gradient problem. This is why deep networks with sigmoid activations train poorly; gradients shrink exponentially through layers.
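A numerical check of the identity, comparing a central-difference estimate of $\sigma'$ against $\sigma(1-\sigma)$:

```javascript
const sigmoid = x => 1 / (1 + Math.exp(-x));
const h = 1e-6;
for (const x of [-2, 0, 2]) {
  const numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h);
  const analytic = sigmoid(x) * (1 - sigmoid(x));
  console.log(x, numeric, analytic); // the two columns agree
}
```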

Practice Problems

  1. Differentiate: $f(x) = 6x^4 - 3x^2 + 2x - 9$
  2. Use the chain rule: $f(x) = \sin(3x)$
  3. Use the chain rule: $f(x) = e^{2x+1}$
  4. Find the partial derivatives of $f(x,y) = x^2 + 3xy + y^3$
  5. At what $x$ is $f(x) = x^2 - 4x + 3$ at its minimum? (Hint: set $f'(x) = 0$)
  6. The ReLU function $f(x) = \max(0, x)$ has derivative $f'(x) = 0$ for $x < 0$ and $f'(x) = 1$ for $x > 0$. Why is this computationally efficient compared to sigmoid?
Show Solutions
1. $24x^3 - 6x + 2$
2. $3\cos(3x)$
3. $2e^{2x+1}$
4. $\frac{\partial f}{\partial x} = 2x + 3y$, $\frac{\partial f}{\partial y} = 3x + 3y^2$
5. $f'(x) = 2x - 4 = 0$ → $x = 2$. $f(2) = 4 - 8 + 3 = -1$. Minimum at $(2, -1)$.
6. ReLU derivative is just 0 or 1 — no exponential computation. Sigmoid needs $e^{-x}$ which is expensive. In a network with millions of neurons, this matters hugely.

5. Integrals: Accumulating Change

The integral is the reverse of the derivative. If the derivative chops a quantity into rates, the integral adds the rates back up.

The rainfall analogy: A derivative tells you "how fast is rain falling right now?" (mm/hour). An integral tells you "how much total rain accumulated over the day?" You add up tiny amounts of rain at each moment — that's integration.

The Problem: Adding Up Tiny Pieces

You have a graph of speed vs time. Distance = speed × time. But speed keeps changing! So chop time into tiny pieces $\Delta t$, multiply each by the speed at that moment, and add:

$$\text{distance} \approx \sum_{i=1}^{N} v(t_i) \cdot \Delta t \quad\xrightarrow{\Delta t \to 0}\quad \text{distance} = \int_{t_1}^{t_2} v(t)\,dt$$

The $\int$ is a stretched "S" for "Sum". The $dt$ tells you what variable you're summing over.

Antiderivatives

We ask: "what function, when differentiated, gives me this?"

$$\int x^n\,dx = \frac{x^{n+1}}{n+1} + C \quad(n \neq -1) \qquad \int e^x\,dx = e^x + C$$ $$\int \frac{1}{x}\,dx = \ln|x| + C \qquad \int \cos x\,dx = \sin x + C \qquad \int \sin x\,dx = -\cos x + C$$

Worked example 1: $\int (6x^2 + 4x - 3)\,dx$

Step 1: Apply power rule to each term: $\frac{6x^3}{3} + \frac{4x^2}{2} - 3x + C = 2x^3 + 2x^2 - 3x + C$
Check: $\frac{d}{dx}(2x^3 + 2x^2 - 3x) = 6x^2 + 4x - 3$ ✓

Worked example 2: $\int 5e^{2x}\,dx$

Think: The derivative of $e^{2x}$ is $2e^{2x}$ (chain rule). So the antiderivative of $e^{2x}$ is $\frac{1}{2}e^{2x}$.
Result: $5 \cdot \frac{1}{2}e^{2x} + C = \frac{5}{2}e^{2x} + C$

Definite Integrals (Area Under the Curve)

$$\int_a^b f(x)\,dx = F(b) - F(a)$$

The Fundamental Theorem of Calculus: find the antiderivative $F$, evaluate at endpoints, subtract.

Worked example: $\int_0^3 x^2\,dx$

Step 1: Antiderivative: $F(x) = \frac{x^3}{3}$
Step 2: $F(3) - F(0) = \frac{27}{3} - 0 = 9$

Worked example: $\int_1^e \frac{1}{x}\,dx$

Step 1: Antiderivative: $F(x) = \ln(x)$
Step 2: $\ln(e) - \ln(1) = 1 - 0 = 1$
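The Riemann-sum picture from the start of this chapter is a few lines of code (`riemann` is a made-up helper; this is the left-endpoint version):

```javascript
// Approximate the integral of f from a to b with N left-endpoint rectangles.
const riemann = (f, a, b, N) => {
  const dx = (b - a) / N;
  let sum = 0;
  for (let i = 0; i < N; i++) sum += f(a + i * dx) * dx; // height × width
  return sum;
};

console.log(riemann(x => x * x, 0, 3, 10));   // rough
console.log(riemann(x => x * x, 0, 3, 1000)); // ≈ 9, the exact answer above
```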

Interactive: Riemann Sum → Integral

Watch the rectangles approximate the area. More rectangles = better approximation.

Real-World Example: ML — Expected Value is an Integral

The expected value of a continuous random variable:

$$E[X] = \int_{-\infty}^{\infty} x \cdot p(x)\,dx$$

This is a weighted average where the weights are the probability density $p(x)$. In ML, expected loss over the data distribution is: $E[L] = \int L(\theta, x) \cdot p(x)\,dx$. Since we can't compute this integral exactly (we don't know $p(x)$), we approximate it with a sample average — the empirical risk. That's why training uses batches of data: they're Monte Carlo estimates of an integral.
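A tiny Monte Carlo sketch: estimate $E[X^2]$ for $X$ uniform on $[0, 1]$, where the exact integral gives $\frac{1}{3}$:

```javascript
// Sample average standing in for the integral of x²·p(x) dx.
let sum = 0;
const N = 100000;
for (let i = 0; i < N; i++) {
  const x = Math.random(); // draw from the uniform distribution on [0, 1]
  sum += x * x;
}
console.log(sum / N); // ≈ 0.333, fluctuating slightly run to run
```

This is exactly what a training batch does: a sample average approximating an expectation.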

Real-World Example: Quantum — Normalisation of Wave Functions

A particle's wave function $\psi(x)$ must satisfy:

$$\int_{-\infty}^{\infty} |\psi(x)|^2\,dx = 1$$

The probability of finding the particle somewhere must be 1 (it exists!). The integral of $|\psi|^2$ over all space is the total probability. If you solve the Schrödinger equation and get a solution $\phi(x)$, you normalise it: $\psi(x) = \frac{\phi(x)}{\sqrt{\int|\phi|^2\,dx}}$. This is integration in action.

Practice Problems

  1. $\int (3x^2 + 2x + 1)\,dx$
  2. $\int_0^2 (4x - 1)\,dx$
  3. $\int_0^{\pi} \sin(x)\,dx$ (hint: antiderivative of $\sin$ is $-\cos$)
  4. If velocity $v(t) = 3t^2$, find position $x(t)$ given $x(0) = 5$.
  5. Write the Riemann sum (code-style) for $\int_0^1 e^x\,dx$ with $N = 4$ rectangles.
Show Solutions
1. $x^3 + x^2 + x + C$
2. $[2x^2 - x]_0^2 = (8 - 2) - 0 = 6$
3. $[-\cos x]_0^\pi = -\cos(\pi) - (-\cos(0)) = -(-1) + 1 = 2$
4. $x(t) = \int 3t^2\,dt = t^3 + C$. $x(0) = 5 \Rightarrow C = 5$. So $x(t) = t^3 + 5$.
5. $\Delta x = 0.25$. Sum $= 0.25[e^0 + e^{0.25} + e^{0.5} + e^{0.75}] = 0.25[1 + 1.284 + 1.649 + 2.117] \approx 1.512$. Exact = $e - 1 \approx 1.718$.

6. Differential Equations: Rules of Change

Now we combine derivatives with equations. A differential equation (DE) is an equation containing derivatives. Instead of telling you a value, it tells you the rule by which a value changes.

Ordinary equation: $x = 5$   (tells you the value)
Differential equation: $\frac{dx}{dt} = -2x$   (tells you the rule of change)
The SWE analogy: An ordinary equation is a constant: const x = 5. A differential equation is a loop that updates state: while(running) { x += -2*x*dt; }. You define the update rule, and the system evolves.

Why DEs, Not Direct Answers?

In the real world, we rarely know the answer directly. We know how things relate in the moment: heat flows faster when the temperature gap is bigger; a population grows faster when it is larger; a capacitor charges more slowly as it fills up. Each of those sentences is a rule of change — a DE.

Terminology

Order: the highest derivative that appears ($\frac{dv}{dt} = 9.8$ is first-order). ODE vs PDE: an ordinary DE has derivatives with respect to one variable; a partial DE has several. Initial condition (IC): a known value like $v(0) = 0$ that picks the one specific solution out of the infinite family.

Worked example: $\frac{dv}{dt} = 9.8, \quad v(0) = 0$ (a ball dropped from rest)

Step 1: This says "velocity changes at a constant rate of 9.8 m/s²" (gravity).
Step 2: Integrate: $v(t) = 9.8t + C$
Step 3: Use IC: $v(0) = 0 \Rightarrow C = 0$
Result: $v(t) = 9.8t$. After 3 seconds: $v(3) = 29.4$ m/s.
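The same integration as a state-update loop — a sketch, with the step size chosen arbitrarily:

```javascript
// dv/dt = 9.8 with v(0) = 0, integrated step by step
const dt = 0.001;
let v = 0;
for (let i = 0; i < 3000; i++) { // 3000 steps of 0.001 s = 3 s
  v += 9.8 * dt;                 // velocity grows at a constant 9.8 m/s²
}
console.log(v.toFixed(1)); // 29.4 — matches v(3) = 9.8 · 3
```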

Interactive: Physical System Modeler

Pick a real system. See its DE, the solution, and the curve.

Real-World Example: ML Training Loss is a DE

Gradient descent: $w_{n+1} = w_n - \eta \frac{\partial L}{\partial w}$. In the continuous limit:

$$\frac{dw}{dt} = -\eta \nabla L(w)$$

For quadratic loss $L = w^2$: $\frac{dw}{dt} = -2\eta w$, solution $w(t) = w_0 e^{-2\eta t}$. This is why training loss curves look like exponential decays — they literally are.
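You can watch this decay with plain gradient descent on $L = w^2$ (the learning rate and step count here are illustrative):

```javascript
// Gradient descent on L(w) = w², where dL/dw = 2w
const eta = 0.1;
let w = 1.0;
const trace = [w];
for (let n = 0; n < 20; n++) {
  w -= eta * 2 * w; // w ← w − η · dL/dw, i.e. w ← 0.8 · w
  trace.push(w);
}
// Each step multiplies w by (1 − 2η) = 0.8: a geometric decay,
// the discrete version of w(t) = w₀ e^(−2ηt)
console.log(trace[1].toFixed(2), trace[20].toFixed(4)); // 0.80 and ≈ 0.0115
```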

Real-World Example: Quantum — The Schrödinger Equation

The most important DE in quantum mechanics:

$$i\hbar\frac{\partial \psi}{\partial t} = \hat{H}\psi$$

This says: "the rate of change of the quantum state is determined by the energy operator (Hamiltonian)." For a free particle, $\hat{H} = -\frac{\hbar^2}{2m}\frac{\partial^2}{\partial x^2}$, and the solutions are plane waves $\psi = Ae^{i(kx - \omega t)}$. Every quantum computer simulates this equation — it's the "physics engine" of reality.

Practice Problems

  1. Solve: $\frac{dy}{dt} = 3$, $y(0) = 10$
  2. Solve: $\frac{dy}{dt} = 2t$, $y(0) = 5$
  3. What does the DE $\frac{dN}{dt} = rN$ model? What's the general solution?
  4. A model's loss satisfies $\frac{dL}{dt} = -0.5L$. Starting from $L(0) = 10$, what is $L$ after 4 time units?
Show Solutions
1. $y = 3t + 10$
2. $y = t^2 + 5$
3. Exponential growth/decay (populations, compound interest). Solution: $N(t) = N_0 e^{rt}$.
4. $L(t) = 10e^{-0.5t}$. $L(4) = 10e^{-2} \approx 1.35$. Loss dropped from 10 to 1.35.

7. First-Order ODEs — Step by Step

Two equation types cover the vast majority of first-order problems in engineering. Let's solve both completely.

Equation A: Exponential Growth/Decay — $\frac{dy}{dt} = ky$

"The rate of change of $y$ is proportional to $y$ itself." Big $y$ → big change.

Full solution by separation of variables:

Step 1 — Separate: $\frac{dy}{y} = k\,dt$
Step 2 — Integrate: $\ln|y| = kt + C_1$
Step 3 — Solve for $y$: $y = Ce^{kt}$
Step 4 — Apply IC: $y(0) = y_0 \Rightarrow C = y_0$
Result: $\boxed{y(t) = y_0 e^{kt}}$

$k > 0$: growth. $k < 0$: decay. Half-life: $t_{1/2} = \frac{\ln 2}{|k|}$.

Worked example (ML): A model's validation loss decays as $L' = -0.3L$, starting from $L(0) = 5$.

$L(t) = 5e^{-0.3t}$
After 10 epochs: $L(10) = 5e^{-3} \approx 0.25$. Half-life: $\frac{\ln 2}{0.3} \approx 2.3$ epochs.
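A quick numeric check of those numbers (plain evaluation of the formulas above):

```javascript
// L(t) = 5·e^(−0.3t): evaluate the decay and its half-life
const L = t => 5 * Math.exp(-0.3 * t);
const halfLife = Math.log(2) / 0.3;

console.log(L(10).toFixed(2));                // ≈ 0.25 after 10 epochs
console.log(halfLife.toFixed(2));             // ≈ 2.31 epochs
console.log((L(halfLife) / L(0)).toFixed(2)); // 0.50 — the loss has halved
```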

Equation B: Linear with Forcing — $\frac{dy}{dt} + ay = b$

"$y$ is being pulled toward the value $b/a$." Like a thermostat driving room temperature toward the set point.

Result: $y(t) = \underbrace{\frac{b}{a}}_{\text{steady state}} + \underbrace{\left(y_0 - \frac{b}{a}\right)e^{-at}}_{\text{decaying transient}}$

The time constant $\tau = 1/a$ — how long to reach ~63% of the way to steady state.

Worked example (RC circuit): $C = 100\,\mu$F charges through $R = 10\,$kΩ from 5V supply.

$\tau = RC = 1$ s. Steady state = $5$ V.
$V_C(t) = 5(1 - e^{-t})$
After $1\tau$: 63% → 3.15V. After $3\tau$: 95% → 4.75V. After $5\tau$: 99.3% → 4.97V.
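The same charging curve as a simulation sketch (Euler stepping; the step size is our choice):

```javascript
// RC charging: dV/dt = (Vs − V)/τ with τ = R·C, stepped with Euler
const R = 10e3, C = 100e-6, Vs = 5; // 10 kΩ, 100 µF, 5 V supply
const tau = R * C;                  // 1 second
const dt = 1e-4;
let V = 0;
for (let i = 0; i < 10000; i++) {   // 10000 steps of 0.1 ms = one time constant
  V += ((Vs - V) / tau) * dt;
}
console.log(V.toFixed(2)); // ≈ 3.16 V — about 63% of the way to 5 V
```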

Interactive: First-Order ODE Explorer

Adjust $a$, $b$, $y_0$. Watch the solution curve. Dashed = steady state. Vertical = time constant $\tau$.

1.0 3.0 0.0
Real-World Example: ML — Exponential Moving Average

Many ML frameworks use an EMA for tracking metrics or model weights (e.g., in batch normalisation):

$$\hat{\mu}_{n+1} = \alpha \cdot x_n + (1 - \alpha) \cdot \hat{\mu}_n$$

This is the discrete version of the first-order ODE $\frac{d\hat{\mu}}{dt} + \alpha\hat{\mu} = \alpha x$. The parameter $\alpha$ controls the time constant: small $\alpha$ → slow, smooth tracking (large $\tau$). Large $\alpha$ → fast, noisy tracking (small $\tau$). Same math as the RC circuit, different application.
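A minimal EMA sketch showing the time-constant trade-off (the helper name `emaTracker` is ours):

```javascript
// EMA: mu ← α·x + (1 − α)·mu — a discrete first-order low-pass filter
function emaTracker(alpha) {
  let mu = 0;
  return x => (mu = alpha * x + (1 - alpha) * mu);
}

const slow = emaTracker(0.05); // small α: large τ, smooth but laggy
const fast = emaTracker(0.5);  // large α: small τ, fast but noisy
let s = 0, f = 0;
for (let n = 0; n < 20; n++) { s = slow(1); f = fast(1); } // constant input of 1
console.log(s.toFixed(3), f.toFixed(3)); // ≈ 0.642 vs ≈ 1.000 — the fast tracker is almost there
```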

Practice Problems

  1. Solve: $\frac{dy}{dt} = -2y$, $y(0) = 10$. What is $y(3)$?
  2. Solve: $\frac{dy}{dt} + 0.5y = 5$, $y(0) = 0$. What is the steady state? What is $y(4)$?
  3. A learning rate decays as $\eta(t) = \eta_0 e^{-\lambda t}$. If $\eta_0 = 0.01$ and you want the rate to halve every 100 epochs, find $\lambda$.
Show Solutions
1. $y(t) = 10e^{-2t}$. $y(3) = 10e^{-6} \approx 0.0025$.
2. Steady state = $b/a = 5/0.5 = 10$. $y(t) = 10(1 - e^{-0.5t})$. $y(4) = 10(1 - e^{-2}) \approx 8.65$.
3. Half-life = 100 epochs. $\lambda = \ln(2)/100 \approx 0.00693$.

8. Systems of DEs & State Space

Real systems have multiple quantities changing at once, often affecting each other. A car has position AND velocity AND heading. A neural network has millions of weights updating simultaneously.

From One DE to Many

Newton's law $m\frac{d^2x}{dt^2} = F$ can be split into two first-order DEs:

$$\frac{dx}{dt} = v \qquad \frac{dv}{dt} = \frac{F}{m}$$

State-Space Form

Pack all variables into a state vector and write the system as a matrix equation:

$$\dot{\mathbf{x}} = A\mathbf{x} + B\mathbf{u}$$

$\mathbf{x}$ = state vector, $A$ = system dynamics, $B$ = input coupling, $\mathbf{u}$ = control input

Example (car with drag):

$\mathbf{x} = \begin{bmatrix} x \\ v \end{bmatrix}$, $A = \begin{bmatrix} 0 & 1 \\ 0 & -b \end{bmatrix}$, $B = \begin{bmatrix} 0 \\ 1/m \end{bmatrix}$, $u = F$

Row 1: $\dot{x} = v$ (position changes by velocity)
Row 2: $\dot{v} = -bv + F/m$ (velocity changes by force minus drag)
Why state space? It turns any system into a single matrix equation. Eigenvalues of $A$ determine stability. Matrix exponential $e^{At}$ gives the solution. The tools from linear algebra apply directly.
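Here is the car-with-drag model stepped forward as a sketch (the parameter values are illustrative):

```javascript
// ẋ = Ax + Bu for the car with drag, stepped forward with Euler
const b = 0.5, m = 1.0, F = 2.0, dt = 0.01;
const A = [[0, 1], [0, -b]];
const B = [0, 1 / m];
let X = [0, 0]; // state vector [position, velocity]
for (let i = 0; i < 1000; i++) { // 10 seconds
  const xdot = [
    A[0][0] * X[0] + A[0][1] * X[1] + B[0] * F, // ẋ = v
    A[1][0] * X[0] + A[1][1] * X[1] + B[1] * F, // v̇ = −bv + F/m
  ];
  X = [X[0] + xdot[0] * dt, X[1] + xdot[1] * dt];
}
console.log(X[1].toFixed(2)); // ≈ 3.97, approaching the steady state F/(m·b) = 4
```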

Interactive: 2D Vehicle Model

A car with position and velocity. Adjust throttle and drag.

Real-World Example: ML — Neural ODEs

A ResNet layer computes $\mathbf{h}_{t+1} = \mathbf{h}_t + f(\mathbf{h}_t, \theta)$. In the continuous limit:

$$\frac{d\mathbf{h}}{dt} = f(\mathbf{h}(t), \theta)$$

This is a system of DEs! A "Neural ODE" (Chen et al., 2018) literally replaces discrete layers with a continuous DE solved by numerical integration (Chapter 9). The depth of the network becomes continuous. This connects state-space models directly to deep learning architecture design.

Real-World Example: Quantum — Multi-Qubit State Evolution

A 2-qubit system has a $4 \times 4$ Hamiltonian matrix $H$. The state vector $|\psi\rangle$ has 4 components (probability amplitudes for $|00\rangle, |01\rangle, |10\rangle, |11\rangle$). The system of DEs:

$$i\hbar\frac{d}{dt}\begin{bmatrix}\alpha_{00}\\\alpha_{01}\\\alpha_{10}\\\alpha_{11}\end{bmatrix} = H\begin{bmatrix}\alpha_{00}\\\alpha_{01}\\\alpha_{10}\\\alpha_{11}\end{bmatrix}$$

Same structure as $\dot{\mathbf{x}} = A\mathbf{x}$! For $n$ qubits, the state vector has $2^n$ components. Simulating this on a classical computer requires exponential resources — that's why quantum computers exist.

9. Numerical Methods: How Computers Solve DEs

As a SWE, this is where you live. You'll rarely solve a DE by hand in production. But you must understand how numerical solvers work, because the choice of method affects accuracy, stability, and speed.

Euler's Method

$$y_{n+1} = y_n + \Delta t \cdot f(t_n, y_n)$$

"Current value + step size × current slope = next value"

In code:

// Solve dy/dt = f(t, y) from y = y0 at t = 0 up to tMax
let y = y0;
for (let t = 0; t < tMax; t += dt) {
  y += dt * f(t, y); // current value + step size × current slope
}

Worked example: Solve $\frac{dy}{dt} = -y$, $y(0) = 1$, with $\Delta t = 0.5$.

$t=0$: $y = 1$, slope $= -1$, $y_1 = 1 + 0.5(-1) = 0.5$
$t=0.5$: $y = 0.5$, slope $= -0.5$, $y_2 = 0.5 + 0.5(-0.5) = 0.25$
$t=1.0$: $y = 0.25$. Exact: $e^{-1} \approx 0.368$. Euler gives $0.25$ — off by 32%.

Big step size = big error. With $\Delta t = 0.1$, we'd get 0.349 — much better.
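Reproducing those numbers with the loop above wrapped in a function (the name `euler` is ours):

```javascript
// Euler on dy/dt = −y, y(0) = 1, solved up to t = 1 with a given step size
function euler(dt) {
  let y = 1;
  const steps = Math.round(1 / dt);
  for (let n = 0; n < steps; n++) {
    y += dt * -y; // slope at the current point is −y
  }
  return y;
}

console.log(euler(0.5).toFixed(3));   // 0.250 — off by 32%
console.log(euler(0.1).toFixed(3));   // 0.349 — much better
console.log(Math.exp(-1).toFixed(3)); // 0.368 — exact
```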

Runge-Kutta 4 (RK4)

Instead of one slope sample, take four and average them (weighted). Dramatically more accurate.

$$y_{n+1} = y_n + \frac{h}{6}(k_1 + 2k_2 + 2k_3 + k_4)$$

where $k_1 = f(t, y)$, $k_2 = f(t + h/2,\; y + hk_1/2)$, $k_3 = f(t + h/2,\; y + hk_2/2)$, $k_4 = f(t + h,\; y + hk_3)$.
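A sketch of one RK4 step, then applied to the same problem Euler struggled with (the function name is ours):

```javascript
// One RK4 step: four slope samples, weighted average
function rk4Step(f, t, y, h) {
  const k1 = f(t, y);
  const k2 = f(t + h / 2, y + (h / 2) * k1);
  const k3 = f(t + h / 2, y + (h / 2) * k2);
  const k4 = f(t + h, y + h * k3);
  return y + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4);
}

// dy/dt = −y, y(0) = 1, with the same big step (Δt = 0.5) that made Euler 32% off
let y = 1;
for (let t = 0; t < 1; t += 0.5) {
  y = rk4Step((t, y) => -y, t, y, 0.5);
}
console.log(y.toFixed(3)); // 0.368 — matches e⁻¹ to three decimals
```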

Interactive: Euler vs RK4

Solve $dy/dt = -2y + 4$, $y(0) = 0$. Increase $\Delta t$ to see Euler fail while RK4 stays close.

Real-World Example: ML — SGD is Euler's Method on the Loss Landscape

Stochastic gradient descent update: $w_{n+1} = w_n - \eta \nabla L(w_n)$

This is literally Euler's method on $\frac{dw}{dt} = -\nabla L(w)$ with step size $\eta$!

Adam, RMSProp, and other optimisers are more sophisticated numerical methods — they use momentum (like RK4 uses multiple slope samples) and adaptive step sizes. Understanding numerical methods tells you why Adam works better than vanilla SGD: it's a better DE solver.

10. Control Systems & PID

A control system is a DE with a feedback loop. It measures where you are, compares to where you want to be, and applies a correction.

The Feedback Loop

$$e(t) = \text{target} - \text{actual} \qquad\text{(error)}$$ $$u(t) = \text{controller}(e) \qquad\text{(correction)}$$

PID: Three-Part Correction

$$u(t) = \underbrace{K_p \cdot e(t)}_{\text{Proportional}} + \underbrace{K_i \int_0^t e(\tau)\,d\tau}_{\text{Integral}} + \underbrace{K_d \cdot \frac{de}{dt}}_{\text{Derivative}}$$

P (Proportional) — "The further I am, the harder I push." Can overshoot.

I (Integral) — "If I've been off for a while, push harder." Eliminates steady-state error but can cause overshoot and slow oscillation (integral windup).

D (Derivative) — "If I'm approaching the target fast, ease off." Provides damping but amplifies noise.

Driving analogy: P = "I'm far from the lane center, steer hard." I = "I've been drifting right for 10 seconds, add persistent left correction." D = "I'm swinging back fast, ease off before I overshoot."
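The three terms in one loop, controlling a unit mass — a sketch; the gains here are illustrative, not tuned for any real system:

```javascript
// PID controlling a unit mass toward a target position
const Kp = 5.0, Ki = 1.0, Kd = 3.0; // illustrative gains
const target = 1.0, dt = 0.01;
let pos = 0, vel = 0, integral = 0, prevError = target - pos;
for (let i = 0; i < 5000; i++) { // 50 seconds
  const error = target - pos;
  integral += error * dt;                      // I: accumulated error
  const derivative = (error - prevError) / dt; // D: rate of change of error
  const u = Kp * error + Ki * integral + Kd * derivative;
  prevError = error;
  vel += u * dt; // F = ma with m = 1: force changes velocity
  pos += vel * dt;
}
console.log(pos.toFixed(3)); // ≈ 1.000 — settled on the target, no steady-state error
```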

Interactive: PID Controller

A mass must reach the target. Try: P only (oscillates), P+D (fast, less overshoot), P+I (no steady-state error).

Real-World Example: ML — Learning Rate as Control

Training a neural network is a control problem: the target is minimal loss, the error signal comes from the gradient, and the optimiser is the controller deciding how hard to push each weight.

Loosely, the Adam optimiser is a sophisticated PID-like controller for the loss landscape: momentum plays the integral role, and adaptive scaling adjusts the gain per parameter.

11. Probability & Statistics for Sensors

Every sensor lies. Every model is uncertain. Statistics quantifies uncertainty and tells you how much to trust each source of information.

Probability Basics

$$P(\text{event}) = \frac{\text{favorable outcomes}}{\text{total outcomes}} \qquad\text{(for equally likely outcomes)}$$

Fair coin: $P(\text{heads}) = 1/2$. Fair die: $P(\text{six}) = 1/6$.

Key rules:

Independent events: $P(A \text{ and } B) = P(A) \cdot P(B)$
Mutually exclusive events: $P(A \text{ or } B) = P(A) + P(B)$
Complement: $P(\text{not } A) = 1 - P(A)$

Bayes' Theorem

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

Worked example (ML): A spam classifier has 95% accuracy on spam and 90% on non-spam. 20% of emails are spam. Given a "spam" classification, what's the probability it's actually spam?

$P(\text{spam}|\text{flagged})$ $= \frac{P(\text{flagged}|\text{spam}) \cdot P(\text{spam})}{P(\text{flagged})}$
$P(\text{flagged}) = 0.95 \times 0.20 + 0.10 \times 0.80 = 0.19 + 0.08 = 0.27$
$P(\text{spam}|\text{flagged}) = \frac{0.95 \times 0.20}{0.27} = \frac{0.19}{0.27} \approx 0.704$

Only 70% chance it's actually spam! The false positive rate matters a lot when the base rate (20%) is low.
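The same Bayes computation as a reusable function (the names are ours), which also shows how fast the posterior collapses when the base rate drops:

```javascript
// Posterior P(spam | flagged) via Bayes' theorem
function posterior(prior, pFlagGivenSpam, pFlagGivenHam) {
  const pFlag = pFlagGivenSpam * prior + pFlagGivenHam * (1 - prior);
  return (pFlagGivenSpam * prior) / pFlag;
}

console.log(posterior(0.20, 0.95, 0.10).toFixed(3)); // 0.704 — the worked example
console.log(posterior(0.01, 0.95, 0.10).toFixed(3)); // ≈ 0.088 — rare spam: false positives dominate
```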

Mean, Variance, Standard Deviation

Mean: $\mu = \frac{1}{N}\sum x_i$   (center of the data)

Variance: $\sigma^2 = \frac{1}{N}\sum (x_i - \mu)^2$   (average squared deviation)

Standard deviation: $\sigma = \sqrt{\sigma^2}$   (spread in original units)

Worked example: GPS readings (metres): 10.2, 9.8, 10.5, 10.1, 9.4

Mean: $(10.2 + 9.8 + 10.5 + 10.1 + 9.4)/5 = 10.0$ m
Variance: $(0.04 + 0.04 + 0.25 + 0.01 + 0.36)/5 = 0.14$ m²
Std dev: $\sigma = \sqrt{0.14} \approx 0.374$ m
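The same statistics as a sketch (population variance, dividing by $N$ as above):

```javascript
// Mean, population variance (÷N), and standard deviation
function stats(xs) {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance = xs.reduce((a, x) => a + (x - mean) ** 2, 0) / xs.length;
  return { mean, variance, std: Math.sqrt(variance) };
}

const gps = [10.2, 9.8, 10.5, 10.1, 9.4];
const { mean, variance, std } = stats(gps);
console.log(mean.toFixed(1), variance.toFixed(2), std.toFixed(3)); // 10.0 0.14 0.374
```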

The Gaussian (Normal) Distribution

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Two parameters define the entire shape: $\mu$ (center) and $\sigma$ (width).

Interactive: Gaussian Explorer

Adjust $\mu$ and $\sigma$. The shaded bands show 1σ, 2σ, 3σ intervals.

Real-World Example: ML — Why Gaussian Initialisation Matters

Neural network weights are typically initialised from a Gaussian: $w \sim \mathcal{N}(0, \sigma^2)$.

Xavier init: $\sigma = \sqrt{\frac{2}{n_{\text{in}} + n_{\text{out}}}}$
He init: $\sigma = \sqrt{\frac{2}{n_{\text{in}}}}$ (for ReLU)

If $\sigma$ is too large, activations explode. Too small, they vanish. These formulas come from variance analysis of matrix multiplications through layers — statistics meets linear algebra meets calculus.

Real-World Example: Quantum — Measurement is Probabilistic

A qubit in state $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$ gives:

$P(\text{measure } 0) = |\alpha|^2 \qquad P(\text{measure } 1) = |\beta|^2$

Unlike classical computing, quantum outcomes are inherently probabilistic. To estimate $P(0)$, you run the circuit many times and compute the frequency — the same statistical estimation as sensor averaging. The more measurements, the more precise your estimate (by $\sim 1/\sqrt{N}$, the same as GPS averaging). Statistical thinking is essential for quantum computing.
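A sketch of that shot-based estimation, using a tiny deterministic PRNG so runs are reproducible (the LCG constants are standard; everything else here is illustrative):

```javascript
// Estimate P(0) = |α|² by running many "shots" of a measurement
// (linear congruential PRNG for reproducibility)
function lcg(seed) {
  let s = seed >>> 0;
  return () => ((s = (1664525 * s + 1013904223) >>> 0) / 2 ** 32);
}

const p0 = 0.36; // |α|² for α = 0.6
function estimateP0(shots, rand) {
  let zeros = 0;
  for (let i = 0; i < shots; i++) if (rand() < p0) zeros++;
  return zeros / shots;
}

const rand = lcg(42);
console.log(estimateP0(100, rand));    // noisy: typical error ~ 1/√100 = 0.1
console.log(estimateP0(100000, rand)); // error shrinks like 1/√N, ≈ 0.003 here
```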

Practice Problems

  1. If $P(A) = 0.3$ and $P(B) = 0.4$, with $A$ and $B$ independent, find $P(A \text{ and } B)$.
  2. A sensor gives readings: 5.1, 4.9, 5.0, 5.2, 4.8. Find mean and standard deviation.
  3. Using Bayes' theorem: A disease affects 1% of people. A test is 99% accurate (true positive and true negative). If you test positive, what's the probability you're actually sick?
  4. If measurements follow $\mathcal{N}(100, 4)$ (mean 100, $\sigma = 2$), what percentage fall between 96 and 104?
Show Solutions
1. $P(A \cap B) = 0.3 \times 0.4 = 0.12$
2. Mean = 5.0. Deviations: 0.1, -0.1, 0, 0.2, -0.2. Variance = $(0.01+0.01+0+0.04+0.04)/5 = 0.02$. $\sigma \approx 0.141$.
3. $P(\text{sick}|+) = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.01 \times 0.99} = \frac{0.0099}{0.0099 + 0.0099} = 0.5$. Only 50%! The low base rate (1%) means false positives dominate.
4. 96 to 104 is $\mu \pm 2\sigma$, so approximately 95%.

12. The Kalman Filter

This uses everything before it. Algebra (equations), functions (state transitions), exponents (uncertainty growth), derivatives (system model), integrals (prediction), DEs (physics model), systems (state space), numerical methods (discrete propagation), control (correction), and statistics (Gaussian uncertainty). It is one of the most important algorithms in engineering.

The Core Problem

You have two imperfect sources of information:

  1. A physics model (DE): "I was going 20 m/s north, so I should be 20 m further north." But the model drifts.
  2. A sensor measurement (GPS, LiDAR, etc.): "You are at position X." But the sensor is noisy.

Neither is perfect. The Kalman filter answers: "What is the best estimate given both?"

The two drunk friends analogy: Imagine two friends, both lost, each pointing in a different direction to "home." One has a compass (but it's wobbly — the model), the other has a blurry map (the sensor). Individually, neither is reliable. But by combining their estimates, weighting each by their reliability, you get a much better answer. That's the Kalman filter.

The Two-Step Loop

Step 1: PREDICT (use the DE)

$$\hat{x}_{k|k-1} = A\hat{x}_{k-1} + Bu_k \qquad\text{(state prediction)}$$ $$P_{k|k-1} = AP_{k-1}A^T + Q \qquad\text{(uncertainty grows)}$$

Step 2: UPDATE (use the measurement)

$$K_k = \frac{P_{k|k-1}H^T}{HP_{k|k-1}H^T + R} \qquad\text{(Kalman gain)}$$ $$\hat{x}_k = \hat{x}_{k|k-1} + K_k(z_k - H\hat{x}_{k|k-1}) \qquad\text{(corrected estimate)}$$ $$P_k = (I - K_kH)P_{k|k-1} \qquad\text{(uncertainty shrinks)}$$
The Kalman gain $K$ is the magic: If sensor is precise ($R$ small), $K \to 1$ (trust sensor). If model is precise ($P$ small), $K \to 0$ (trust model). It automatically computes the optimal blend.

Predict → Measure → Compute Gain → Correct → Repeat forever

Worked 1D example: Tracking a car at constant velocity. State = position. $A = 1$ (position persists), $H = 1$ (we directly measure position), $Q = 0.1$ (small model noise), $R = 4$ (noisy GPS).

Init: $\hat{x}_0 = 0$, $P_0 = 4$ (initial uncertainty equals GPS noise)
Predict: $\hat{x}_{1|0} = 1 \cdot 0 + v \cdot dt = 1.0$. $P_{1|0} = 4 + 0.1 = 4.1$
Measure: GPS says $z_1 = 1.3$.
Gain: $K = \frac{4.1}{4.1 + 4} = 0.506$
Update: $\hat{x}_1 = 1.0 + 0.506(1.3 - 1.0) = 1.152$
New uncertainty: $P_1 = (1 - 0.506) \times 4.1 = 2.02$ (halved!)

After just one measurement, uncertainty dropped from 4.1 to 2.02. Each subsequent measurement reduces it further. This is the power of sensor fusion.
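One predict-update cycle from the worked example, as a sketch (the function name is ours):

```javascript
// One predict-update cycle of the 1D Kalman filter above
function kalmanStep(x, P, u, z, Q, R) {
  // Predict: move by the known motion u; uncertainty grows by Q
  const xPred = x + u;
  const PPred = P + Q;
  // Update: blend prediction and measurement z by the Kalman gain
  const K = PPred / (PPred + R);
  return { x: xPred + K * (z - xPred), P: (1 - K) * PPred, K };
}

const { x, P, K } = kalmanStep(0, 4, 1.0, 1.3, 0.1, 4);
console.log(K.toFixed(3), x.toFixed(3), P.toFixed(2)); // 0.506 1.152 2.02
```

Calling it again with the new `x` and `P` continues the loop: predict, measure, correct, forever.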

Interactive: 1D Kalman Filter

A car moves at constant velocity. GPS (red dots) is noisy. The Kalman filter (green) fuses the motion model with measurements. Adjust noise levels.

Real-World Example: GPS/INS Fusion

A self-driving car runs an Extended Kalman Filter at 100 Hz:

Predict (100 Hz, from IMU): Integrate acceleration & gyro. Uncertainty grows.
Update (10 Hz, from GPS): Compare predicted position to GPS fix. Large correction. Uncertainty shrinks.
Update (30 Hz, from camera): Compare predicted lane position to detected lines. Another correction.

In a tunnel (no GPS), uncertainty grows and grows. When GPS returns, the first fix causes a large correction (high gain × big innovation).

Real-World Example: ML — Bayesian Updating

The Kalman filter is Bayesian inference for linear Gaussian systems. The predict step is the prior. The update step applies Bayes' theorem with a Gaussian likelihood. The result is the posterior.

In ML, Bayesian neural networks, Gaussian processes, and variational inference all use this same predict-update loop conceptually. The Kalman filter is the simplest, most elegant instance of this pattern.

Real-World Example: Quantum — State Estimation

Quantum state tomography — reconstructing a quantum state from measurements — faces the same problem: noisy measurements of an uncertain state. Quantum Kalman filters exist for tracking continuously measured quantum systems, using the same predict-update structure but adapted for quantum noise (Heisenberg uncertainty principle adds a fundamental floor on $Q$ and $R$).

Practice Problems

  1. If $P = 2$ and $R = 8$, what is the Kalman gain $K$? Does the filter trust the model or the sensor more?
  2. If $Q$ is very large (model is bad), what happens to $K$ over time?
  3. A Kalman filter has $\hat{x} = 50$, prediction $= 52$, measurement $= 58$, $K = 0.4$. What is the updated estimate?
  4. Why can't the Kalman filter handle $\dot{x} = v\cos\theta$ directly? (Hint: what makes this nonlinear?)
Show Solutions
1. $K = P/(P+R) = 2/(2+8) = 0.2$. Low gain — trusts the model more (it has lower uncertainty).
2. Large $Q$ → $P$ grows quickly → $K$ stays high → trusts sensor more (model is unreliable).
3. $\hat{x} = 52 + 0.4(58 - 52) = 52 + 2.4 = 54.4$
4. $\cos\theta$ depends on state $\theta$, so the "A matrix" isn't constant — it changes with the state. This is nonlinear. The Extended Kalman Filter (EKF) linearises by computing the Jacobian (matrix of partial derivatives) at each timestep.

The Complete Map

Algebra (language) → Functions (I/O) → Limits (approaching) → Exp/Log (growth) → Derivatives (rates) → Integrals (accumulation) → DEs (change rules) → Systems (multi-var) → Numerical (computers) → Control (feedback) → Statistics (uncertainty) → Kalman (fusion)

Every topic feeds the next. The math isn't abstract — it's the literal code running inside every GPS receiver, every autonomous vehicle, every drone, every ML training loop, and every quantum computer. You now have the complete foundation.

Where to go from here: the companion guides in this series — Trigonometry, Linear Algebra, Probability & Statistics, and Set Theory.

Source: Engineering math curriculum for software engineers.