Empirical Risk and the Likelihood Game
Chapter 2 left us with a question: ERM minimises $\hat{R}(h)$, but we actually care about $R(h)$. The gap between them requires measuring how far apart two probability distributions are. This chapter builds that measuring instrument — from information theory to KL divergence — and reveals that minimising the gap between distributions and maximising likelihood are secretly the same operation.
The Question Left Open
End of Chapter 2: we know $\hat{R}(h) \to R(h)$ as $N \to \infty$ by the Law of Large Numbers. But we also noted this is true for any fixed $h$ — the convergence of the function values does not automatically guarantee that the minimiser of $\hat{R}$ converges to the minimiser of $R$. The central question of generalisation is precisely: when does $\hat{h}^* \to h^*$?
Answering this requires a language for measuring how close two probability distributions are. Specifically, we want to measure how close our parametric model $P_\theta$ is to the true unknown distribution $P$. Once we have that measuring instrument, ERM and maximum likelihood will turn out to be two descriptions of the same optimisation problem.
This chapter builds that instrument from the ground up — starting from a single question: how surprised should you be when something happens?
Surprisal: Quantifying How Shocked You Should Be
Before we can measure the distance between distributions, we need to measure the information content of a single event. The right notion is surprisal — how much new information does an event $A$ provide when it occurs?
Two properties we want from any reasonable surprisal measure $\mathcal{I}(A)$: rare events should be more surprising (carry more information) than common ones, and the surprisal of two independent events together should equal the sum of their individual surprisals — information should add when events are independent.
There is essentially one function that satisfies both requirements: $\mathcal{I}(A) = -\log P(A)$. With the natural logarithm, surprisal is measured in nats; with base-2 logarithms, in bits ($1$ bit $= \log 2 \approx 0.693$ nats).
$P(\text{heads}) = 0.5$. Surprisal $= -\log(0.5) = \log 2 \approx 0.693$ nats — exactly one bit. Neither outcome of a fair coin is very surprising.
$P(\text{heads}) = 0.99$. Surprisal $= -\log(0.99) \approx 0.01$ nats. A biased coin landing heads carries almost no information — you expected it.
On a normal trading day, $P(\text{VIX} > 30) \approx 0.05$. Surprisal $= -\log(0.05) \approx 3.0$ nats. A VIX spike is genuinely surprising — it carries significant information about regime change.
$P(\text{VIX between 10 and 20}) \approx 0.65$. Surprisal $\approx 0.43$ nats. Low-volatility regimes are routine — not much information when they persist.
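The four examples above take only a few lines to check. A minimal sketch in Python (the probabilities are the illustrative figures from the text, not market data; `surprisal` is a hypothetical helper name):

```python
import math

def surprisal(p: float) -> float:
    """Surprisal in nats: -log p. Rarer events carry more information."""
    return -math.log(p)

# Fair coin: log 2 nats = exactly one bit.
print(round(surprisal(0.5), 3))   # 0.693
# Biased coin landing on its likely side: almost no information.
print(round(surprisal(0.99), 4))  # 0.0101
# Rare VIX spike: genuinely surprising.
print(round(surprisal(0.05), 2))  # 3.0

# Additivity: the surprisal of two independent events equals the sum.
assert math.isclose(surprisal(0.5 * 0.05), surprisal(0.5) + surprisal(0.05))
```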
Entropy: The Average Surprise in a Distribution
A single event's surprisal is just one number. What we want is a property of the entire distribution $P_X$ — a measure of how uncertain or spread-out it is. The natural answer: take the expected surprisal, $H(P_X) = \mathbb{E}_{P_X}[\mathcal{I}] = -\sum_x P_X(x) \log P_X(x)$.
If both teams are evenly matched: $P(\text{win}) = P(\text{loss}) = 0.5$. $H = -2 \times 0.5 \log 0.5 = \log 2 \approx 0.693$ nats. Maximum uncertainty — maximum entropy.
If one team almost always wins: $P(\text{win}) = 0.95$, $P(\text{loss}) = 0.05$. $H \approx 0.199$ nats. Low entropy — little uncertainty before the match.
A market with $P(\text{up}) = P(\text{down}) = 0.5$ has maximum entropy — a completely unpredictable index. A market with $P(\text{up}) = 0.9$ has lower entropy — more predictable, less information generated per day.
High entropy = hard to predict = large Bayes error. Low entropy = easier to predict. Entropy is the Bayes error in disguise for classification.
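The entropy figures above can be reproduced directly. A quick sketch (the probabilities are the illustrative ones from the examples):

```python
import math

def entropy(probs):
    """Shannon entropy in nats: the expected surprisal under the distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

print(round(entropy([0.5, 0.5]), 3))    # 0.693 — evenly matched: maximum entropy
print(round(entropy([0.95, 0.05]), 3))  # 0.199 — lopsided match: little uncertainty
print(round(entropy([0.9, 0.1]), 3))    # 0.325 — predictable market: less information per day
```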
Cross-Entropy: When Your Model Disagrees with Reality
Entropy measures the average surprisal under the true distribution. But suppose we are using a model distribution $Q_X$ to describe a world that actually follows $P_X$. How surprised will we be on average, according to our model? That is the cross-entropy: $H(P, Q) = -\mathbb{E}_{P}[\log Q(x)] = -\sum_x P(x) \log Q(x)$.
The difference $H(P,Q) - H(P)$ is the extra surprise incurred by using the wrong model $Q$ instead of the true distribution $P$. This extra surprise is always non-negative — your model can never be more efficient at encoding reality than reality itself. And it has a name.
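The non-negativity of the extra surprise (Gibbs' inequality) is easy to verify numerically. A sketch with hypothetical two-outcome distributions:

```python
import math

def entropy(p):
    """Average surprisal under the true distribution p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average surprisal under model q when outcomes actually follow p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.6, 0.4]   # true distribution
q = [0.5, 0.5]   # misspecified model
assert cross_entropy(p, q) >= entropy(p)              # extra surprise is non-negative
assert math.isclose(cross_entropy(p, p), entropy(p))  # zero extra surprise iff model is exact
```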
KL Divergence: The Distance Between Distributions
The gap between cross-entropy and entropy is the most important divergence measure in all of information theory and machine learning: $D_{KL}(P \| Q) = H(P, Q) - H(P) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$.
KL divergence satisfies three critical properties — it is non-negative, it is zero if and only if $P = Q$, and it is $+\infty$ when $P$ puts mass where $Q$ assigns zero probability — and one critical non-property.
$D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$ in general. KL divergence is not symmetric. The "distance" from $P$ to $Q$ is not the same as the "distance" from $Q$ to $P$. It is technically a divergence, not a metric. In ML, we almost always minimise $D_{KL}(P \| Q)$ — where $P$ is the true distribution and $Q$ is our model — because this corresponds to MLE, as we're about to show.
True NIFTY daily returns follow a fat-tailed distribution $P$. You model them with a Gaussian $Q$. $D_{KL}(P \| Q) > 0$ — your model underestimates tail probabilities, leading to systematic surprisal whenever large moves occur. The KL divergence quantifies exactly how badly misspecified your model is.
$D_{KL}(P \| Q)$ is large when $P$ puts mass in regions where $Q$ assigns near-zero probability — your model is caught completely off-guard. $D_{KL}(Q \| P)$ is large when $Q$ spreads mass over regions $P$ considers unlikely. These measure different kinds of model failure.
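The asymmetry is concrete enough to check directly. A sketch with hypothetical three-outcome distributions, one concentrated and one spread out:

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats for discrete distributions on the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.8, 0.1, 0.1]   # concentrated
q = [0.4, 0.3, 0.3]   # spread out

assert kl(p, q) >= 0 and kl(q, p) >= 0       # both directions are non-negative
assert not math.isclose(kl(p, q), kl(q, p))  # but they are different numbers
print(round(kl(p, q), 4), round(kl(q, p), 4))
```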
The Big Connection: Minimising KL Divergence = Maximising Likelihood
Here is the single most important equivalence in all of statistical ML. We will derive it carefully, because it deserves to be understood and not just stated.
We have $N$ samples $D = \{v_1, v_2, \ldots, v_N\}$ drawn i.i.d. from an unknown distribution $P_v$. We choose a parametric model family $P_\theta$ and want the best $\theta$. "Best" means closest to the truth — so we minimise $D_{KL}(P_v \| P_\theta)$ over $\theta$.
We want:

$$\hat{\theta} = \underset{\theta}{\arg\min}\; D_{KL}(P_v \| P_\theta) = \underset{\theta}{\arg\min}\; \int P_v(v) \log \frac{P_v(v)}{P_\theta(v)}\, dv$$
Expand the $\log$ of a ratio into a difference of logs:

$$D_{KL}(P_v \| P_\theta) = \int P_v(v) \log P_v(v)\, dv - \int P_v(v) \log P_\theta(v)\, dv$$
The first integral is $-H(P_v)$ — the entropy of the true distribution. It does not depend on $\theta$ at all. So minimising over $\theta$:

$$\underset{\theta}{\arg\min}\; D_{KL}(P_v \| P_\theta) = \underset{\theta}{\arg\min}\; \left( -\int P_v(v) \log P_\theta(v)\, dv \right)$$
The remaining term is $-\mathbb{E}_{P_v}[\log P_\theta(v)]$. Minimising the negative is the same as maximising the positive:

$$\underset{\theta}{\arg\min}\; D_{KL}(P_v \| P_\theta) = \underset{\theta}{\arg\max}\; \mathbb{E}_{P_v}[\log P_\theta(v)]$$
We don't know $P_v$, so we can't compute $\mathbb{E}_{P_v}[\log P_\theta(v)]$ exactly. But by the LLN, with $N$ i.i.d. samples:

$$\mathbb{E}_{P_v}[\log P_\theta(v)] \approx \frac{1}{N} \sum_{i=1}^{N} \log P_\theta(v_i)$$
Plugging back in, the KL-minimising estimator is:

$$\hat{\theta} = \underset{\theta}{\arg\max}\; \frac{1}{N} \sum_{i=1}^{N} \log P_\theta(v_i)$$
Recognise $P_\theta(v_i)$ as the likelihood of observation $v_i$ under model $P_\theta$. The sum $\sum_i \log P_\theta(v_i) = \log \prod_i P_\theta(v_i)$ is the log of the joint likelihood of the entire dataset (since samples are i.i.d.). Therefore:
$$\underbrace{\underset{\theta}{\arg\min}\; D_{KL}(P_v \| P_\theta)}_{\text{distance to truth}} \;=\; \underbrace{\underset{\theta}{\arg\max}\; \frac{1}{N}\sum_i \log P_\theta(v_i)}_{\text{probability of the data}} \;=\; \underbrace{\underset{\theta}{\arg\min}\; -\frac{1}{N}\sum_i \log P_\theta(v_i)}_{\text{NLL as empirical risk}}$$
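The equivalence can be seen numerically: minimising the average NLL over a parameter grid recovers exactly the MLE. A sketch with synthetic data under a hypothetical Gaussian location model with known variance:

```python
import math
import random

random.seed(0)
data = [random.gauss(0.05, 1.0) for _ in range(5000)]  # synthetic i.i.d. "returns"

def avg_nll(mu, sigma=1.0):
    """Average negative log-likelihood of the data under N(mu, sigma^2)."""
    return sum(0.5 * math.log(2 * math.pi * sigma**2)
               + (v - mu)**2 / (2 * sigma**2) for v in data) / len(data)

# Grid search over mu: the NLL minimiser sits at the sample mean,
# which is exactly the maximum-likelihood estimate.
grid = [i / 1000 for i in range(-200, 300)]
best_mu = min(grid, key=avg_nll)
sample_mean = sum(data) / len(data)
assert abs(best_mu - sample_mean) < 1e-3   # agreement up to grid resolution
```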
Squared Error Is MLE in Disguise
The equivalence above isn't just abstract. It immediately explains why squared error loss is not an arbitrary choice for regression — it is the natural loss function when you assume Gaussian noise.
Consider supervised regression: $x \in \mathbb{R}^d$, $y \in \mathbb{R}$, dataset $D = \{(x_i, y_i)\}_{i=1}^N$ i.i.d. from $P_{XY}$. We want to estimate $P_\theta(y|x)$. The model choice:

$$P_\theta(y|x) = \mathcal{N}(y;\, h_\theta(x),\, \sigma^2)$$
We assume the label $y$ is the model's prediction $h_\theta(x)$ plus Gaussian noise of fixed variance $\sigma^2$. Under this model, the log-likelihood of a single observation $(x_i, y_i)$ is:

$$\log P_\theta(y_i|x_i) = -\frac{(y_i - h_\theta(x_i))^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)$$
Summing over the dataset and dropping the terms that do not depend on $\theta$, MLE becomes:

$$\hat{\theta} = \underset{\theta}{\arg\max}\; \sum_{i=1}^{N} \log P_\theta(y_i|x_i) = \underset{\theta}{\arg\min}\; \sum_{i=1}^{N} (y_i - h_\theta(x_i))^2$$
Maximising the Gaussian log-likelihood is identical to minimising the sum of squared errors. Linear regression is MLE under a Gaussian noise assumption. When you minimise squared error, you are implicitly assuming the residuals are Gaussian — whether you know it or not.
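Since the Gaussian log-likelihood is an affine function of the squared error, both objectives share the same optimiser. A sketch on hypothetical toy data with a one-parameter model $h_\theta(x) = \theta x$:

```python
import math

xs = [1.0, 2.0, 3.0, 4.0]   # toy inputs (hypothetical)
ys = [1.1, 1.9, 3.2, 3.9]   # toy labels

def sse(theta):
    """Sum of squared errors for the model h(x) = theta * x."""
    return sum((y - theta * x)**2 for x, y in zip(xs, ys))

def gaussian_loglik(theta, sigma=1.0):
    """Log-likelihood under y ~ N(theta * x, sigma^2)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (y - theta * x)**2 / (2 * sigma**2) for x, y in zip(xs, ys))

grid = [i / 10000 for i in range(8000, 12000)]
theta_sse = min(grid, key=sse)
theta_mle = max(grid, key=gaussian_loglik)
assert abs(theta_sse - theta_mle) < 1e-9   # same optimiser, two descriptions
```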
Suppose NIFTY daily returns fall into two regimes: up days and down days, each with three sub-outcomes — small, medium, large move. Let $P$ be the empirically estimated true distribution and $Q$ be a simplified model:
Sum the contributions $P(i)\log\frac{P(i)}{Q(i)}$ over all six outcomes:
$D_{KL} \approx 0.021$ nats. This is a relatively small divergence — the model $Q$ is a reasonable approximation to $P$. The largest contributor is "small up" days (0.0547), where $Q$ underestimates the probability (0.25 vs true 0.30). The model thinks small up-moves are less common than they are.
Suppose we model NIFTY daily log-returns as $v_i \sim \mathcal{N}(\mu, \sigma^2)$, i.e. $P_\theta(v) = \mathcal{N}(v;\,\mu,\,\sigma^2)$ where $\theta = (\mu, \sigma^2)$. We observe $N$ returns $v_1, \ldots, v_N$. Find the MLE for $\theta$.
Since samples are i.i.d., the joint log-likelihood is the sum of individual log-likelihoods:

$$\ell(\mu, \sigma^2) = \sum_{i=1}^{N} \log P_\theta(v_i) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(v_i - \mu)^2$$
Differentiate with respect to $\mu$ and set to zero:

$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{N}(v_i - \mu) = 0 \quad\Longrightarrow\quad \hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} v_i$$

Doing the same for $\sigma^2$ gives $\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(v_i - \hat{\mu})^2$. The Gaussian MLE is the sample mean and the (biased) sample variance.
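The standard Gaussian MLE (sample mean and biased sample variance) can be sanity-checked numerically: no perturbation of either parameter improves the likelihood. A sketch on synthetic data:

```python
import math
import random

random.seed(1)
data = [random.gauss(0.03, 0.02) for _ in range(10000)]  # synthetic log-returns

# Closed-form Gaussian MLE: sample mean and biased sample variance.
mu_hat = sum(data) / len(data)
var_hat = sum((v - mu_hat)**2 for v in data) / len(data)

def avg_loglik(mu, var):
    """Average Gaussian log-likelihood of the data under N(mu, var)."""
    return sum(-0.5 * math.log(2 * math.pi * var)
               - (v - mu)**2 / (2 * var) for v in data) / len(data)

best = avg_loglik(mu_hat, var_hat)
assert best >= avg_loglik(mu_hat + 0.01, var_hat)   # moving the mean hurts
assert best >= avg_loglik(mu_hat, var_hat * 1.5)    # inflating the variance hurts
```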
The framework is now complete for the unsupervised case. Chapter 4 takes this further — into the supervised setting, where we estimate $P_\theta(y|x)$ rather than $P_\theta(v)$ — and derives the MLE estimators for the canonical models: linear regression, logistic regression, and beyond. The tools are all here. Chapter 4 puts them to work.
A NIFTY analyst fits a Gaussian model to daily returns using maximum likelihood. She then claims she is "minimising the distance between her model and the true return distribution." Which distance measure is she implicitly minimising?
As the derivation showed, minimising $D_{KL}(P_{\text{true}} \| P_\theta)$ over $\theta$ drops the entropy term (constant in $\theta$) and reduces to maximising $\mathbb{E}_{P_{\text{true}}}[\log P_\theta(v)]$, which by the LLN is approximated by the sample log-likelihood $\frac{1}{N}\sum_i \log P_\theta(v_i)$. MLE therefore minimises $D_{KL}(P_{\text{true}} \| P_\theta)$ — the divergence measured from the true distribution to the model.
Option D is the reverse direction — $D_{KL}(P_\theta \| P_{\text{true}})$ — which is a different quantity and corresponds to a different estimator (the moment-matching or M-projection). The direction matters: $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$ in general.
A model $Q$ assigns probability $Q(x) = 0$ to an outcome $x$ that the true distribution $P$ assigns $P(x) = 0.05$. What happens to $D_{KL}(P \| Q)$?
The KL divergence term for this outcome is $P(x)\log\frac{P(x)}{Q(x)} = 0.05 \times \log\frac{0.05}{0}$. Since $\log(a/0) = +\infty$ for any $a > 0$, this single term drives the entire divergence to $+\infty$.
This is not a technical nuisance — it has deep practical meaning. A model that assigns zero probability to something that actually happens is infinitely wrong. It has been caught in a category error: declaring something impossible that the world regularly produces. In ML terms, such a model will assign $\log P_\theta(v_i) = -\infty$ to any observation $v_i$ in the zero-probability region, making the log-likelihood $-\infty$. This is why models must have support everywhere the data might live — a Gaussian model is often used precisely because it assigns non-zero probability to all of $\mathbb{R}$, making $D_{KL}$ finite even if large.
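A sketch of the blow-up, with the zero-probability conventions handled explicitly:

```python
import math

def kl(p, q):
    """D_KL(p || q), handling the zero-probability conventions explicitly."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue           # 0 * log(0 / q) = 0 by convention
        if qi == 0:
            return math.inf    # P puts mass where Q declares "impossible"
        total += pi * math.log(pi / qi)
    return total

p = [0.95, 0.05]   # the world produces the second outcome 5% of the time
q = [1.0, 0.0]     # the model calls it impossible: infinitely wrong
assert kl(p, q) == math.inf
assert kl(p, [0.99, 0.01]) < math.inf   # full support keeps the divergence finite
```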
Two Zomato delivery time models are being compared. $P$ = true distribution: $P(fast) = 0.6$, $P(slow) = 0.4$. Model $Q$: $Q(fast) = 0.5$, $Q(slow) = 0.5$. Model $R$: $R(fast) = 0.6$, $R(slow) = 0.4$. What are $D_{KL}(P \| Q)$ and $D_{KL}(P \| R)$? Which model would MLE select?
$D_{KL}(P \| Q) = 0.6\log\frac{0.6}{0.5} + 0.4\log\frac{0.4}{0.5} = 0.6\log(1.2) + 0.4\log(0.8)$
$= 0.6 \times 0.1823 + 0.4 \times (-0.2231) = 0.1094 - 0.0893 = 0.0201$ nats.
$D_{KL}(P \| R) = 0.6\log\frac{0.6}{0.6} + 0.4\log\frac{0.4}{0.4} = 0.6 \times 0 + 0.4 \times 0 = 0$.
Model $R$ is exactly equal to $P$, so its KL divergence from $P$ is zero. MLE will select $R$, because $R$ assigns the highest probability to data drawn from $P$. Option D is wrong: when a model exactly matches the true distribution, $D_{KL} = 0$. The KL divergence is zero if and only if $P = Q$ everywhere.
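The two divergences take a few lines to verify:

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.6, 0.4]   # true: fast / slow
Q = [0.5, 0.5]   # uniform model
R = [0.6, 0.4]   # exact model

print(round(kl(P, Q), 4))  # 0.0201
print(round(kl(P, R), 4))  # 0.0
assert kl(P, R) == 0.0     # D_KL is zero iff the model matches the truth
```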
A quant fits a linear regression model to NIFTY returns using squared error loss. A colleague says "you're assuming Gaussian noise." The quant replies "I'm just minimising prediction error, I haven't assumed anything probabilistic." Who is right, and why?
As the derivation showed: if $P_\theta(y|x) = \mathcal{N}(y;\, h_\theta(x),\, \sigma^2)$ with fixed $\sigma^2$, then maximising the conditional log-likelihood reduces exactly to minimising $\sum_i (y_i - h_\theta(x_i))^2$. The converse is also true: if you are minimising squared error, you are — whether knowingly or not — finding the MLE under a Gaussian noise assumption.
The quant's model has an implicit probabilistic interpretation. The Gaussian assumption has consequences: it implies residuals should be symmetric, homoscedastic (constant variance), and have light tails. For NIFTY returns, all three are violated. The model is not wrong to use — squared error is often a reasonable starting point — but the quant should know what they are assuming, because those assumptions determine when and how the model will fail. Implicit assumptions are the most dangerous kind.
Q1–3 test direct computation. Q4–7 require you to connect entropy, KL divergence, and MLE. Q8–10 are harder — they ask you to reason carefully about when the equivalences hold, break, or generalise.