Empirical Risk and the Likelihood Game
Chapter 2 left us with a question: ERM minimises $\hat{R}(h)$, but we actually care about $R(h)$. The gap between them requires measuring how far apart two probability distributions are. This chapter builds that measuring instrument — from information theory to KL divergence — and reveals that minimising the gap between distributions and maximising likelihood are secretly the same operation.
The Question Left Open
End of Chapter 2: we know $\hat{R}(h) \to R(h)$ as $N \to \infty$ by the Law of Large Numbers. But we also noted this is true for any fixed $h$ — the convergence of the function values does not automatically guarantee that the minimiser of $\hat{R}$ converges to the minimiser of $R$. The central question of generalisation is precisely: when does $\hat{h}^* \to h^*$?
Answering this requires a language for measuring how close two probability distributions are. Specifically, we want to measure how close our parametric model $P_\theta$ is to the true unknown distribution $P$. Once we have that measuring instrument, ERM and maximum likelihood will turn out to be two descriptions of the same optimisation problem.
This chapter builds that instrument from the ground up — starting from a single question: how surprised should you be when something happens?
Surprisal: Quantifying How Shocked You Should Be
Before we can measure the distance between distributions, we need to measure the information content of a single event. The right notion is surprisal — how much new information does an event $A$ provide when it occurs?
Two properties we want from any reasonable surprisal measure $\mathcal{I}(A)$: rare events should be more surprising (carry more information) than common ones, and the surprisal of two independent events together should equal the sum of their individual surprisals — information should add when events are independent.
There is essentially one function that satisfies both requirements: $\mathcal{I}(A) = -\log P(A)$. With the natural logarithm, surprisal is measured in nats; with base-2 logarithms, in bits ($1$ bit $= \log 2 \approx 0.693$ nats).
$P(\text{heads}) = 0.5$. Surprisal $= -\log(0.5) = \log 2 \approx 0.693$ nats — exactly one bit. Neither outcome of a fair coin is very surprising.
$P(\text{heads}) = 0.99$. Surprisal $= -\log(0.99) \approx 0.01$ nats. A biased coin landing heads carries almost no information — you expected it.
On a normal trading day, $P(\text{VIX} > 30) \approx 0.05$. Surprisal $= -\log(0.05) \approx 3.0$ nats. A VIX spike is genuinely surprising — it carries significant information about regime change.
$P(\text{VIX between 10 and 20}) \approx 0.65$. Surprisal $\approx 0.43$ nats. Low-volatility regimes are routine — not much information when they persist.
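The four examples above take only a few lines to check. A minimal sketch in Python (the probabilities are the illustrative figures from the text, not market data; `surprisal` is a hypothetical helper name):

```python
import math

def surprisal(p: float) -> float:
    """Surprisal in nats: -log p. Rarer events carry more information."""
    return -math.log(p)

# Fair coin: log 2 nats = exactly one bit.
print(round(surprisal(0.5), 3))   # 0.693
# Biased coin landing on its likely side: almost no information.
print(round(surprisal(0.99), 4))  # 0.0101
# Rare VIX spike: genuinely surprising.
print(round(surprisal(0.05), 2))  # 3.0

# Additivity: the surprisal of two independent events equals the sum.
assert math.isclose(surprisal(0.5 * 0.05), surprisal(0.5) + surprisal(0.05))
```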
Entropy: The Average Surprise in a Distribution
A single event's surprisal is just one number. What we want is a property of the entire distribution $P_X$ — a measure of how uncertain or spread-out it is. The natural answer: take the expected surprisal, $H(P_X) = \mathbb{E}_{P_X}[\mathcal{I}] = -\sum_x P_X(x) \log P_X(x)$.
If both teams are evenly matched: $P(\text{win}) = P(\text{loss}) = 0.5$. $H = -2 \times 0.5 \log 0.5 = \log 2 \approx 0.693$ nats. Maximum uncertainty — maximum entropy.
If one team almost always wins: $P(\text{win}) = 0.95$, $P(\text{loss}) = 0.05$. $H \approx 0.199$ nats. Low entropy — little uncertainty before the match.
A market with $P(\text{up}) = P(\text{down}) = 0.5$ has maximum entropy — a completely unpredictable index. A market with $P(\text{up}) = 0.9$ has lower entropy — more predictable, less information generated per day.
High entropy = hard to predict = large Bayes error. Low entropy = easier to predict. Entropy is the Bayes error in disguise for classification.
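The entropy figures above can be reproduced directly. A quick sketch (the probabilities are the illustrative ones from the examples):

```python
import math

def entropy(probs):
    """Shannon entropy in nats: the expected surprisal under the distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

print(round(entropy([0.5, 0.5]), 3))    # 0.693 — evenly matched: maximum entropy
print(round(entropy([0.95, 0.05]), 3))  # 0.199 — lopsided match: little uncertainty
print(round(entropy([0.9, 0.1]), 3))    # 0.325 — predictable market: less information per day
```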
Cross-Entropy: When Your Model Disagrees with Reality
Entropy measures the average surprisal under the true distribution. But suppose we are using a model distribution $Q_X$ to describe a world that actually follows $P_X$. How surprised will we be on average, according to our model? That is the cross-entropy: $H(P, Q) = -\mathbb{E}_{P}[\log Q(x)] = -\sum_x P(x) \log Q(x)$.
The difference $H(P,Q) - H(P)$ is the extra surprise incurred by using the wrong model $Q$ instead of the true distribution $P$. This extra surprise is always non-negative — your model can never be more efficient at encoding reality than reality itself. And it has a name.
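The non-negativity of the extra surprise (Gibbs' inequality) is easy to verify numerically. A sketch with hypothetical two-outcome distributions:

```python
import math

def entropy(p):
    """Average surprisal under the true distribution p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average surprisal under model q when outcomes actually follow p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.6, 0.4]   # true distribution
q = [0.5, 0.5]   # misspecified model
assert cross_entropy(p, q) >= entropy(p)              # extra surprise is non-negative
assert math.isclose(cross_entropy(p, p), entropy(p))  # zero extra surprise iff model is exact
```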
KL Divergence: The Distance Between Distributions
The gap between cross-entropy and entropy is the most important divergence measure in all of information theory and machine learning: $D_{KL}(P \| Q) = H(P, Q) - H(P) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$.
KL divergence satisfies three critical properties — it is non-negative, it is zero if and only if $P = Q$, and it is $+\infty$ when $P$ puts mass where $Q$ assigns zero probability — and one critical non-property.
$D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$ in general. KL divergence is not symmetric. The "distance" from $P$ to $Q$ is not the same as the "distance" from $Q$ to $P$. It is technically a divergence, not a metric. In ML, we almost always minimise $D_{KL}(P \| Q)$ — where $P$ is the true distribution and $Q$ is our model — because this corresponds to MLE, as we're about to show.
True NIFTY daily returns follow a fat-tailed distribution $P$. You model them with a Gaussian $Q$. $D_{KL}(P \| Q) > 0$ — your model underestimates tail probabilities, leading to systematic surprisal whenever large moves occur. The KL divergence quantifies exactly how badly misspecified your model is.
$D_{KL}(P \| Q)$ is large when $P$ puts mass in regions where $Q$ assigns near-zero probability — your model is caught completely off-guard. $D_{KL}(Q \| P)$ is large when $Q$ spreads mass over regions $P$ considers unlikely. These measure different kinds of model failure.
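The asymmetry is concrete enough to check directly. A sketch with hypothetical three-outcome distributions, one concentrated and one spread out:

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats for discrete distributions on the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.8, 0.1, 0.1]   # concentrated
q = [0.4, 0.3, 0.3]   # spread out

assert kl(p, q) >= 0 and kl(q, p) >= 0       # both directions are non-negative
assert not math.isclose(kl(p, q), kl(q, p))  # but they are different numbers
print(round(kl(p, q), 4), round(kl(q, p), 4))
```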
The Big Connection: Minimising KL Divergence = Maximising Likelihood
Here is the single most important equivalence in all of statistical ML. We will derive it carefully, because it deserves to be understood and not just stated.
We have $N$ samples $D = \{v_1, v_2, \ldots, v_N\}$ drawn i.i.d. from an unknown distribution $P_v$. We choose a parametric model family $P_\theta$ and want the best $\theta$. "Best" means closest to the truth — so we minimise $D_{KL}(P_v \| P_\theta)$ over $\theta$.
We want:

$$\hat{\theta} = \underset{\theta}{\arg\min}\; D_{KL}(P_v \| P_\theta) = \underset{\theta}{\arg\min}\; \int P_v(v) \log \frac{P_v(v)}{P_\theta(v)}\, dv$$
Expand the $\log$ of a ratio into a difference of logs:

$$D_{KL}(P_v \| P_\theta) = \int P_v(v) \log P_v(v)\, dv - \int P_v(v) \log P_\theta(v)\, dv$$
The first integral is $-H(P_v)$ — the entropy of the true distribution. It does not depend on $\theta$ at all. So minimising over $\theta$:

$$\underset{\theta}{\arg\min}\; D_{KL}(P_v \| P_\theta) = \underset{\theta}{\arg\min}\; \left( -\int P_v(v) \log P_\theta(v)\, dv \right)$$
The remaining term is $-\mathbb{E}_{P_v}[\log P_\theta(v)]$. Minimising the negative is the same as maximising the positive:

$$\underset{\theta}{\arg\min}\; D_{KL}(P_v \| P_\theta) = \underset{\theta}{\arg\max}\; \mathbb{E}_{P_v}[\log P_\theta(v)]$$
We don't know $P_v$, so we can't compute $\mathbb{E}_{P_v}[\log P_\theta(v)]$ exactly. But by the LLN, with $N$ i.i.d. samples:

$$\mathbb{E}_{P_v}[\log P_\theta(v)] \approx \frac{1}{N} \sum_{i=1}^{N} \log P_\theta(v_i)$$
Plugging back in, the KL-minimising estimator is:

$$\hat{\theta} = \underset{\theta}{\arg\max}\; \frac{1}{N} \sum_{i=1}^{N} \log P_\theta(v_i)$$
Recognise $P_\theta(v_i)$ as the likelihood of observation $v_i$ under model $P_\theta$. The sum $\sum_i \log P_\theta(v_i) = \log \prod_i P_\theta(v_i)$ is the log of the joint likelihood of the entire dataset (since samples are i.i.d.). Therefore:
$$\underbrace{\underset{\theta}{\arg\min}\; D_{KL}(P_v \| P_\theta)}_{\text{distance to truth}} \;=\; \underbrace{\underset{\theta}{\arg\max}\; \frac{1}{N}\sum_i \log P_\theta(v_i)}_{\text{probability of the data}} \;=\; \underbrace{\underset{\theta}{\arg\min}\; -\frac{1}{N}\sum_i \log P_\theta(v_i)}_{\text{NLL as empirical risk}}$$
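The equivalence can be seen numerically: minimising the average NLL over a parameter grid recovers exactly the MLE. A sketch with synthetic data under a hypothetical Gaussian location model with known variance:

```python
import math
import random

random.seed(0)
data = [random.gauss(0.05, 1.0) for _ in range(5000)]  # synthetic i.i.d. "returns"

def avg_nll(mu, sigma=1.0):
    """Average negative log-likelihood of the data under N(mu, sigma^2)."""
    return sum(0.5 * math.log(2 * math.pi * sigma**2)
               + (v - mu)**2 / (2 * sigma**2) for v in data) / len(data)

# Grid search over mu: the NLL minimiser sits at the sample mean,
# which is exactly the maximum-likelihood estimate.
grid = [i / 1000 for i in range(-200, 300)]
best_mu = min(grid, key=avg_nll)
sample_mean = sum(data) / len(data)
assert abs(best_mu - sample_mean) < 1e-3   # agreement up to grid resolution
```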
Squared Error Is MLE in Disguise
The equivalence above isn't just abstract. It immediately explains why squared error loss is not an arbitrary choice for regression — it is the natural loss function when you assume Gaussian noise.
Consider supervised regression: $x \in \mathbb{R}^d$, $y \in \mathbb{R}$, dataset $D = \{(x_i, y_i)\}_{i=1}^N$ i.i.d. from $P_{XY}$. We want to estimate $P_\theta(y|x)$. The model choice:

$$P_\theta(y|x) = \mathcal{N}(y;\, h_\theta(x),\, \sigma^2)$$
We assume the label $y$ is the model's prediction $h_\theta(x)$ plus Gaussian noise of fixed variance $\sigma^2$. Under this model, the log-likelihood of a single observation $(x_i, y_i)$ is:

$$\log P_\theta(y_i|x_i) = -\frac{(y_i - h_\theta(x_i))^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)$$
Summing over the dataset and dropping the terms that do not depend on $\theta$, MLE becomes:

$$\hat{\theta} = \underset{\theta}{\arg\max}\; \sum_{i=1}^{N} \log P_\theta(y_i|x_i) = \underset{\theta}{\arg\min}\; \sum_{i=1}^{N} (y_i - h_\theta(x_i))^2$$
Maximising the Gaussian log-likelihood is identical to minimising the sum of squared errors. Linear regression is MLE under a Gaussian noise assumption. When you minimise squared error, you are implicitly assuming the residuals are Gaussian — whether you know it or not.
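Since the Gaussian log-likelihood is an affine function of the squared error, both objectives share the same optimiser. A sketch on hypothetical toy data with a one-parameter model $h_\theta(x) = \theta x$:

```python
import math

xs = [1.0, 2.0, 3.0, 4.0]   # toy inputs (hypothetical)
ys = [1.1, 1.9, 3.2, 3.9]   # toy labels

def sse(theta):
    """Sum of squared errors for the model h(x) = theta * x."""
    return sum((y - theta * x)**2 for x, y in zip(xs, ys))

def gaussian_loglik(theta, sigma=1.0):
    """Log-likelihood under y ~ N(theta * x, sigma^2)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (y - theta * x)**2 / (2 * sigma**2) for x, y in zip(xs, ys))

grid = [i / 10000 for i in range(8000, 12000)]
theta_sse = min(grid, key=sse)
theta_mle = max(grid, key=gaussian_loglik)
assert abs(theta_sse - theta_mle) < 1e-9   # same optimiser, two descriptions
```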
Suppose NIFTY daily returns fall into two regimes: up days and down days, each with three sub-outcomes — small, medium, large move. Let $P$ be the empirically estimated true distribution and $Q$ be a simplified model:
Sum the contributions $P(i)\log\frac{P(i)}{Q(i)}$ over all six outcomes:
$D_{KL} \approx 0.021$ nats. This is a relatively small divergence — the model $Q$ is a reasonable approximation to $P$. The largest contributor is "small up" days (0.0547), where $Q$ underestimates the probability (0.25 vs true 0.30). The model thinks small up-moves are less common than they are.
Suppose we model NIFTY daily log-returns as $v_i \sim \mathcal{N}(\mu, \sigma^2)$, i.e. $P_\theta(v) = \mathcal{N}(v;\,\mu,\,\sigma^2)$ where $\theta = (\mu, \sigma^2)$. We observe $N$ returns $v_1, \ldots, v_N$. Find the MLE for $\theta$.
Since samples are i.i.d., the joint log-likelihood is the sum of individual log-likelihoods:

$$\ell(\mu, \sigma^2) = \sum_{i=1}^{N} \log P_\theta(v_i) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(v_i - \mu)^2$$
Differentiate with respect to $\mu$ and set to zero:

$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{N}(v_i - \mu) = 0 \quad\Longrightarrow\quad \hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} v_i$$

Doing the same for $\sigma^2$ gives $\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(v_i - \hat{\mu})^2$. The Gaussian MLE is the sample mean and the (biased) sample variance.
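The standard Gaussian MLE (sample mean and biased sample variance) can be sanity-checked numerically: no perturbation of either parameter improves the likelihood. A sketch on synthetic data:

```python
import math
import random

random.seed(1)
data = [random.gauss(0.03, 0.02) for _ in range(10000)]  # synthetic log-returns

# Closed-form Gaussian MLE: sample mean and biased sample variance.
mu_hat = sum(data) / len(data)
var_hat = sum((v - mu_hat)**2 for v in data) / len(data)

def avg_loglik(mu, var):
    """Average Gaussian log-likelihood of the data under N(mu, var)."""
    return sum(-0.5 * math.log(2 * math.pi * var)
               - (v - mu)**2 / (2 * var) for v in data) / len(data)

best = avg_loglik(mu_hat, var_hat)
assert best >= avg_loglik(mu_hat + 0.01, var_hat)   # moving the mean hurts
assert best >= avg_loglik(mu_hat, var_hat * 1.5)    # inflating the variance hurts
```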
The framework is now complete for the unsupervised case. Chapter 4 takes this further — into the supervised setting, where we estimate $P_\theta(y|x)$ rather than $P_\theta(v)$ — and derives the MLE estimators for the canonical models: linear regression, logistic regression, and beyond. The tools are all here. Chapter 4 puts them to work.
A NIFTY analyst fits a Gaussian model to daily returns using maximum likelihood. She then claims she is "minimising the distance between her model and the true return distribution." Which distance measure is she implicitly minimising?
As the derivation showed, minimising $D_{KL}(P_{\text{true}} \| P_\theta)$ over $\theta$ drops the entropy term (constant in $\theta$) and reduces to maximising $\mathbb{E}_{P_{\text{true}}}[\log P_\theta(v)]$, which by the LLN is approximated by the sample log-likelihood $\frac{1}{N}\sum_i \log P_\theta(v_i)$. MLE therefore minimises $D_{KL}(P_{\text{true}} \| P_\theta)$ — the divergence measured from the true distribution to the model.
Option D is the reverse direction — $D_{KL}(P_\theta \| P_{\text{true}})$ — which is a different quantity and corresponds to a different estimator (the moment-matching or M-projection). The direction matters: $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$ in general.
A model $Q$ assigns probability $Q(x) = 0$ to an outcome $x$ that the true distribution $P$ assigns $P(x) = 0.05$. What happens to $D_{KL}(P \| Q)$?
The KL divergence term for this outcome is $P(x)\log\frac{P(x)}{Q(x)} = 0.05 \times \log\frac{0.05}{0}$. Since $\log(a/0) = +\infty$ for any $a > 0$, this single term drives the entire divergence to $+\infty$.
This is not a technical nuisance — it has deep practical meaning. A model that assigns zero probability to something that actually happens is infinitely wrong. It has been caught in a category error: declaring something impossible that the world regularly produces. In ML terms, such a model will assign $\log P_\theta(v_i) = -\infty$ to any observation $v_i$ in the zero-probability region, making the log-likelihood $-\infty$. This is why models must have support everywhere the data might live — a Gaussian model is often used precisely because it assigns non-zero probability to all of $\mathbb{R}$, making $D_{KL}$ finite even if large.
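A sketch of the blow-up, with the zero-probability conventions handled explicitly:

```python
import math

def kl(p, q):
    """D_KL(p || q), handling the zero-probability conventions explicitly."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue           # 0 * log(0 / q) = 0 by convention
        if qi == 0:
            return math.inf    # P puts mass where Q declares "impossible"
        total += pi * math.log(pi / qi)
    return total

p = [0.95, 0.05]   # the world produces the second outcome 5% of the time
q = [1.0, 0.0]     # the model calls it impossible: infinitely wrong
assert kl(p, q) == math.inf
assert kl(p, [0.99, 0.01]) < math.inf   # full support keeps the divergence finite
```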
Two Zomato delivery time models are being compared. $P$ = true distribution: $P(fast) = 0.6$, $P(slow) = 0.4$. Model $Q$: $Q(fast) = 0.5$, $Q(slow) = 0.5$. Model $R$: $R(fast) = 0.6$, $R(slow) = 0.4$. What are $D_{KL}(P \| Q)$ and $D_{KL}(P \| R)$? Which model would MLE select?
$D_{KL}(P \| Q) = 0.6\log\frac{0.6}{0.5} + 0.4\log\frac{0.4}{0.5} = 0.6\log(1.2) + 0.4\log(0.8)$
$= 0.6 \times 0.1823 + 0.4 \times (-0.2231) = 0.1094 - 0.0893 = 0.0201$ nats.
$D_{KL}(P \| R) = 0.6\log\frac{0.6}{0.6} + 0.4\log\frac{0.4}{0.4} = 0.6 \times 0 + 0.4 \times 0 = 0$.
Model $R$ is exactly equal to $P$, so its KL divergence from $P$ is zero. MLE will select $R$, because $R$ assigns the highest probability to data drawn from $P$. Option D is wrong: when a model exactly matches the true distribution, $D_{KL} = 0$. The KL divergence is zero if and only if $P = Q$ everywhere.
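The two divergences take a few lines to verify:

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.6, 0.4]   # true: fast / slow
Q = [0.5, 0.5]   # uniform model
R = [0.6, 0.4]   # exact model

print(round(kl(P, Q), 4))  # 0.0201
print(round(kl(P, R), 4))  # 0.0
assert kl(P, R) == 0.0     # D_KL is zero iff the model matches the truth
```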
A quant fits a linear regression model to NIFTY returns using squared error loss. A colleague says "you're assuming Gaussian noise." The quant replies "I'm just minimising prediction error, I haven't assumed anything probabilistic." Who is right, and why?
As the derivation showed: if $P_\theta(y|x) = \mathcal{N}(y;\, h_\theta(x),\, \sigma^2)$ with fixed $\sigma^2$, then maximising the conditional log-likelihood reduces exactly to minimising $\sum_i (y_i - h_\theta(x_i))^2$. The converse is also true: if you are minimising squared error, you are — whether knowingly or not — finding the MLE under a Gaussian noise assumption.
The quant's model has an implicit probabilistic interpretation. The Gaussian assumption has consequences: it implies residuals should be symmetric, homoscedastic (constant variance), and have light tails. For NIFTY returns, all three are violated. The model is not wrong to use — squared error is often a reasonable starting point — but the quant should know what they are assuming, because those assumptions determine when and how the model will fail. Implicit assumptions are the most dangerous kind.
Q1–3 test direct computation. Q4–7 require you to connect entropy, KL divergence, and MLE. Q8–10 are harder — they ask you to reason carefully about when the equivalences hold, break, or generalise.