Maximum Likelihood: Let the Data Speak
The Bayes classifier is the theoretical ceiling. But implementing it demands something we don't have: the actual densities $P_X$, $P_Y$, and $P_{X|Y}$. This chapter is about estimating those densities from data, and showing that the principled way to do it always leads to the same place.
Why We Need Density Estimation
Chapter 2 gave us the Bayes classifier, the theoretically optimal decision rule under zero-one loss: $$h_B(x) = \underset{y}{\arg\max}\; P(y \mid x)$$ Beautiful in theory. To use it in practice, we need $P(y|x)$, the posterior probability of each class given the input. And we don't have it. We have data.
To implement the Bayes classifier from data $D = \{(x_i, y_i)\}$, we need to estimate three quantities: the class prior $P_Y$, the class-conditional density $P_{X|Y}$, and the marginal $P_X$.
From these, Bayes' theorem gives us $P(y|x) \propto P_{X|Y}(x|y) \cdot P_Y(y)$. All three quantities are density estimation problems. Maximum Likelihood Estimation is the principled tool for all of them, and this chapter works out what MLE actually gives us for the most important model families.
MLE: The Formal Setup
Chapter 3 established the key equivalence: maximising the average log-likelihood is the same as minimising the KL divergence from the empirical distribution to the model. We restate it cleanly as our working definition for this chapter: $$\theta^* = \underset{\theta}{\arg\max}\; \ell(\theta), \qquad \ell(\theta) = \frac{1}{N}\sum_{i=1}^N \log P_\theta(v_i)$$
Three model families dominate ML in practice. We work out $\theta^*$ for each. The derivations are different in technique but identical in spirit: write $\ell(\theta)$, differentiate, set to zero, solve.
MLE for a Gaussian Model
The simplest continuous case. Model: $P_\theta(v) = \mathcal{N}(v;\, \theta,\, I)$ where $v \in \mathbb{R}^d$ and $\theta \in \mathbb{R}^d$ is the mean. Covariance is fixed at identity for now.
The Gaussian density is $P_\theta(v) = \frac{1}{(2\pi)^{d/2}}\exp\!\left(-\frac{1}{2}(v-\theta)^\top(v-\theta)\right)$. Taking the log: $$\log P_\theta(v) = -\frac{d}{2}\log(2\pi) - \frac{1}{2}\|v - \theta\|_2^2 \quad\Longrightarrow\quad \ell(\theta) = -\frac{d}{2}\log(2\pi) - \frac{1}{2N}\sum_{i=1}^N \|v_i - \theta\|_2^2$$
Maximising $\ell(\theta)$ over $\theta$ is equivalent to minimising $\frac{1}{N}\sum_i \|v_i - \theta\|_2^2$, the mean squared distance from $\theta$ to the data. The constant $-\frac{d}{2}\log(2\pi)$ drops out.
Ages assumed $\mathcal{N}(\mu, \sigma^2)$. After observing 200 patients, MLE gives $\hat{\mu} = \bar{\text{age}}$, the average age in the dataset. Nothing more, nothing less. The sample mean is the principled answer.
Model $v_i \sim \mathcal{N}(\mu, \sigma^2)$. MLE: $\hat{\mu} = \frac{1}{N}\sum v_i$ (sample mean return), $\hat{\sigma}^2 = \frac{1}{N}\sum(v_i - \hat{\mu})^2$ (sample variance). Every quant who fits a normal distribution to returns is doing MLE, whether they use that language or not.
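The closed-form MLE takes only a couple of lines to compute. Below is a minimal sketch with made-up return figures; the numbers and variable names are illustrative, not from any dataset in this chapter.

```python
import numpy as np

# Hypothetical daily returns in percent (illustrative numbers only).
returns = np.array([0.5, -0.2, 0.9, 0.1, -0.4, 0.3])

# MLE for N(mu, sigma^2): the sample mean and the 1/N sample variance.
mu_hat = returns.mean()
sigma2_hat = np.mean((returns - mu_hat) ** 2)  # same as np.var(returns, ddof=0)
```

Note the `ddof=0` convention: the MLE divides by $N$, not the $N-1$ of the unbiased estimator.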
MLE for a Discrete Distribution
When $v_i$ takes values in a finite set $\{a_1, a_2, \ldots, a_m\}$ with unknown probabilities $\{p_1, p_2, \ldots, p_m\}$, the MLE problem is constrained: we need $p_j \geq 0$ and $\sum_j p_j = 1$.
To write the likelihood cleanly, we use the one-hot encoding trick. For each observation $v_i$, define an auxiliary vector $z_i \in \{0,1\}^m$ where $z_i^j = 1$ if $v_i = a_j$ and 0 otherwise. Exactly one entry of $z_i$ is 1: the one corresponding to the observed outcome. Then: $$P_\theta(v_i) = \prod_{j=1}^m p_j^{z_i^j}$$
The log-likelihood becomes: $$\ell(\theta) = \frac{1}{N}\sum_{i=1}^N \sum_{j=1}^m z_i^j \log p_j$$
Maximise $\ell(\theta) = \frac{1}{N}\sum_i \sum_j z_i^j \log p_j$ subject to $\sum_j p_j = 1$ and $p_j \geq 0$. The normalisation constraint is binding: without it, $\log p_j$ increases without bound, so the maximiser would push every $p_j$ that appears in the data toward infinity. We enforce the constraint with a Lagrange multiplier $\lambda$.
Stationarity of the Lagrangian $\ell(\theta) + \lambda\big(\sum_j p_j - 1\big)$ with respect to $p_j$ gives $\frac{1}{N p_j}\sum_i z_i^j + \lambda = 0$, so $p_j = -\frac{1}{N\lambda}\sum_i z_i^j$. Summing over $j$: $\sum_j p_j = -\frac{1}{N\lambda}\sum_j \sum_i z_i^j = -\frac{1}{N\lambda}\cdot N = -\frac{1}{\lambda} = 1$, so $\lambda = -1$ and $$\hat{p}_j = \frac{1}{N}\sum_{i=1}^N z_i^j,$$ the empirical frequency of outcome $a_j$.
Observed outcomes from 8 rolls: $\{2, 3, 4, 1, 6, 3, 2, 3\}$. MLE: $\hat{p}_3 = 3/8$, $\hat{p}_2 = 2/8$, $\hat{p}_1 = \hat{p}_4 = \hat{p}_6 = 1/8$, $\hat{p}_5 = 0$. The die appears biased toward 3. (With only 8 rolls, this estimate has high variance, but it is the principled MLE.)
Categorise 40 weekly expiries: Large Up 8, Small Up 14, Flat 6, Small Down 7, Large Down 5. MLE: $\hat{p}_{\text{LU}} = 0.20$, $\hat{p}_{\text{SU}} = 0.35$, $\hat{p}_{\text{F}} = 0.15$, $\hat{p}_{\text{SD}} = 0.175$, $\hat{p}_{\text{LD}} = 0.125$. This empirical distribution is the real-world measure: what a physical-measure options analysis would use as $\mathbb{P}$ instead of the risk-neutral $\mathbb{Q}$.
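The discrete MLE is just counting. A minimal sketch using the expiry counts from the example above (the short category labels are abbreviations introduced for this snippet only):

```python
# Counts of the 40 weekly expiries from the example above.
counts = {"LU": 8, "SU": 14, "F": 6, "SD": 7, "LD": 5}
N = sum(counts.values())

# MLE for a discrete distribution: empirical frequencies p_j = n_j / N.
p_hat = {label: n / N for label, n in counts.items()}
```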
The Multimodal Problem: When One Gaussian Is Not Enough
Both MLE derivations above assumed a single parametric family: one Gaussian, one discrete distribution. Real data is often more complex. Many datasets are multimodal; they have multiple clusters, subpopulations, or regimes. A single Gaussian, however well fit by MLE, cannot capture this structure.
Fit a single Gaussian to a bimodal dataset. The MLE places the mean somewhere between the two modes, a location where almost no actual data points live. The model assigns high probability to a region the data avoids, and underestimates probability in both actual clusters. $D_{KL}(P_{\text{true}} \| P_\theta)$ remains large no matter how many data points you use, because the Gaussian family simply cannot express the true shape.
This is not a data problem. It is a model family problem. The solution: expand $\mathcal{H}$.
Patients split between young adults (viral infections, mean age 28) and elderly (chronic conditions, mean age 68). Single Gaussian MLE: mean $\approx$ 48, which describes nobody in the ward. The bimodal truth requires two Gaussians.
VIX spends most of its time in a calm regime (12–16) and occasionally spikes into a fear regime (25–45). A single Gaussian fit puts the mean around 16 with large variance, missing both regimes. A two-component mixture captures the structure naturally: one narrow Gaussian for calm, one wide Gaussian for fear.
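This failure mode is easy to reproduce. The sketch below draws a synthetic VIX-like sample from two regimes (all parameters are invented for illustration) and shows the single-Gaussian MLE landing between them:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic bimodal sample: a calm regime near 14 and a fear regime near 30.
calm = rng.normal(14.0, 1.0, size=800)
fear = rng.normal(30.0, 4.0, size=200)
x = np.concatenate([calm, fear])

# Single-Gaussian MLE: the sample mean lands between the two modes,
# above almost every calm observation and below almost every fear one.
mu_hat = x.mean()
```

With these mixing proportions the population mean is $0.8 \cdot 14 + 0.2 \cdot 30 = 17.2$, a value the data almost never takes.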
Mixture Density Models
The solution to the multimodal problem is to define the model as a weighted sum of simpler component densities: $$P_\theta(v) = \sum_{j=1}^M \alpha_j\, P_{\theta_j}(v), \qquad \alpha_j \geq 0, \quad \sum_{j=1}^M \alpha_j = 1$$ When each component is Gaussian, $P_{\theta_j}(v) = \mathcal{N}(v;\, \mu_j,\, \Sigma_j)$, this is a Gaussian Mixture Model (GMM).
The mixing weight $\alpha_j$ is the prior probability that a randomly drawn data point came from component $j$. If you think of the data as generated by first choosing a component (a "type" of data point โ a regime, a subpopulation) with probability $\alpha_j$, then drawing from that component's distribution, you recover the mixture density. This generative story introduces a hidden variable: which component each data point came from.
Why Mixture MLE Is Hard
Take the GMM log-likelihood and try to maximise it directly: $$\ell(\theta) = \frac{1}{N}\sum_{i=1}^N \log \sum_{j=1}^M \alpha_j\, \mathcal{N}(v_i;\, \mu_j,\, \Sigma_j)$$
The problem is the log of a sum. For a single Gaussian, the log collapsed the product into a tractable quadratic. Here the sum inside the log couples all $M$ components together. Setting $\frac{\partial \ell}{\partial \mu_j} = 0$ gives: $$\mu_j = \frac{\sum_i \frac{\alpha_j\, \mathcal{N}(v_i;\,\mu_j,\,\Sigma_j)}{\sum_k \alpha_k\, \mathcal{N}(v_i;\,\mu_k,\,\Sigma_k)}\, v_i}{\sum_i \frac{\alpha_j\, \mathcal{N}(v_i;\,\mu_j,\,\Sigma_j)}{\sum_k \alpha_k\, \mathcal{N}(v_i;\,\mu_k,\,\Sigma_k)}}$$
This equation involves $\mu_j$ on both sides: in the numerator directly, and in the denominator through the sum over all components. There is no closed-form solution. The MLE problem for a GMM has no analytical answer.
Why it's fundamental, not numerical: If we knew which component each $v_i$ came from, i.e. if there were hidden labels $z_i \in \{1, \ldots, M\}$ telling us "this point came from component $j$", the log-likelihood would factorise into $M$ independent Gaussian MLE problems, each with a closed form. The difficulty is entirely caused by the missing labels.
Direct gradient ascent fails: The objective is non-convex in $\theta$. Gradient ascent gets trapped in local optima. The updates have no clean form. A fundamentally different algorithmic idea is needed.
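To make the coupling concrete, the sketch below evaluates a toy 1-D GMM log-likelihood together with the ratio $\frac{\alpha_j \mathcal{N}(v_i;\mu_j,\sigma_j)}{\sum_k \alpha_k \mathcal{N}(v_i;\mu_k,\sigma_k)}$ that appears in the gradient. Every one of those ratios depends on all component parameters at once, which is exactly why the stationarity condition has no closed form. All numbers are illustrative.

```python
import numpy as np

def gmm_loglik_and_resp(v, alphas, mus, sigmas):
    """Average log-likelihood of a 1-D GMM, plus the per-point ratios
    r[i, j] that couple the gradient across all components."""
    v = np.asarray(v)[:, None]                       # shape (N, 1)
    dens = np.exp(-0.5 * ((v - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    weighted = alphas * dens                         # shape (N, M)
    mix = weighted.sum(axis=1)                       # the sum inside the log
    resp = weighted / mix[:, None]                   # depends on ALL components
    return np.mean(np.log(mix)), resp

v = [-1.2, -0.8, 0.1, 0.9, 1.3]
ll, r = gmm_loglik_and_resp(v, np.array([0.5, 0.5]),
                            np.array([-1.0, 1.0]), np.array([0.5, 0.5]))
```

Changing any single $\mu_k$ changes every row of `r`, so no component can be optimised in isolation.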
Five daily NIFTY log-returns observed: $v = \{+0.4\%,\; -0.3\%,\; +1.1\%,\; -0.2\%,\; +0.6\%\}$. Model: $P_\theta(v) = \mathcal{N}(v;\, \mu,\, \sigma^2)$. Find the MLE for both $\mu$ and $\sigma^2$.
From the derivation above, $\hat{\mu} = \frac{1}{N}\sum v_i$: $$\hat{\mu} = \tfrac{1}{5}(0.4 - 0.3 + 1.1 - 0.2 + 0.6)\% = 0.32\%$$
Differentiating $\ell(\theta)$ w.r.t. $\sigma^2$ and setting to zero gives the sample variance (dividing by $N$, not $N-1$): $$\hat{\sigma}^2 = \frac{1}{5}\sum_{i=1}^5 (v_i - 0.32\%)^2 \approx 0.27\,(\%)^2$$
So $\hat{\sigma} \approx 0.52\%$ daily volatility.
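The worked numbers can be verified directly:

```python
import numpy as np

v = np.array([0.4, -0.3, 1.1, -0.2, 0.6])  # daily log-returns, in percent

mu_hat = v.mean()                                # the sample mean
sigma_hat = np.sqrt(np.mean((v - mu_hat) ** 2))  # the 1/N MLE, not the 1/(N-1) estimate
```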
Over 40 weekly NIFTY expiries, outcomes are categorised as before: Large Up 8, Small Up 14, Flat 6, Small Down 7, Large Down 5. Compute the MLE (empirical distribution) and compare to what a Black-Scholes Gaussian model would imply.
The empirical distribution shows positive skew: large up moves (20%) are more common than large down moves (12.5%) in this sample. The Gaussian model used by Black-Scholes with the same $\hat{\mu}$ and $\hat{\sigma}$ would assign roughly equal probability to symmetric tails. This discrepancy, a skewed empirical distribution against a symmetric Gaussian, is one source of the volatility skew in NIFTY option markets: wherever the market's implied distribution differs in shape from the Gaussian, option prices deviate systematically from their Black-Scholes values.
We now have MLE in hand for the two fundamental model families. But the GMM, the model we need for multimodal data, has no closed-form MLE. The obstacle is the log-of-sum: without knowing which component each data point came from, the optimisation couples all components together in an intractable way.
What if we introduced the hidden component assignments $z_i$ explicitly as a latent variable, derived a lower bound on $\ell(\theta)$ that is actually tractable (the ELBO), and then alternated between inferring the latent variables and updating the parameters? This is the Expectation-Maximisation algorithm, one of the most elegant ideas in all of statistical ML. The machinery to understand it (Jensen's inequality, the ELBO, and the geometry of latent variable models) is all in Chapter 5.
A quant fits a Gaussian to NIFTY daily log-returns and reports $\hat{\mu} = 0.04\%$ and $\hat{\sigma} = 1.2\%$. A colleague says "that's just the sample mean and variance; you haven't done any real estimation." Which response is most accurate?
The colleague is not wrong: $\hat{\mu}$ and $\hat{\sigma}$ are indeed the sample mean and variance. But calling them "just" sample statistics misses the principled reason they are the right quantities: they are the MLE for the Gaussian model, derived by maximising the log-likelihood (equivalently, minimising KL divergence). The sample mean is the unique solution to $\frac{\partial \ell}{\partial \mu} = 0$; the sample variance solves $\frac{\partial \ell}{\partial \sigma^2} = 0$.
Option D is wrong: MLE for a Gaussian uses mean and variance. Median and MAD minimise different objectives: the median minimises mean absolute error and is the MLE for the location of a Laplace model, while MAD is a robust scale estimator, not the MLE of any Gaussian parameter.
Six coin flips are observed: $\{H, H, T, H, T, H\}$. Model: $P_\theta(H) = p$, $P_\theta(T) = 1-p$. What is the MLE for $p$, and why is the Lagrangian needed?
With 4 heads and 2 tails, the log-likelihood is $\ell(p) = \frac{1}{6}(4\log p + 2\log(1-p))$. Setting $\ell'(p) = 0$ gives $\frac{4}{p} = \frac{2}{1-p}$, so $\hat{p} = 4/6 = 2/3$, the empirical frequency of heads. For a two-outcome case, we have the implicit constraint $p + (1-p) = 1$, which is automatically satisfied by parameterising with a single $p$. But for $m > 2$ outcomes, the Lagrangian is essential to enforce $\sum_j p_j = 1$.
Option C is partially right about the answer but wrong about the Lagrangian: even for the two-outcome case, the Lagrangian framework gives the same answer and generalises naturally to $m$ outcomes. Option D is wrong: unconstrained maximisation of $\log p$ alone (without the normalisation constraint enforced by the Lagrangian or the $1-p$ term) would indeed push $p \to 1$, but the constraint prevents this.
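The closed-form answer $\hat{p} = \text{heads}/(\text{heads}+\text{tails})$ can be checked against a brute-force grid search over $p$ (a sketch, not part of the exercise):

```python
import numpy as np

heads, tails = 4, 2

# Evaluate the (unnormalised) log-likelihood on a fine grid of p values.
p_grid = np.linspace(0.01, 0.99, 9801)
loglik = heads * np.log(p_grid) + tails * np.log(1 - p_grid)

p_hat = p_grid[np.argmax(loglik)]  # should sit at heads / (heads + tails)
```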
Why does the log-likelihood for a Gaussian Mixture Model have no closed-form maximiser, while the log-likelihood for a single Gaussian does? What is the structural difference?
The key structural difference is the log-of-sum. For a single Gaussian: $\log P_\theta(v_i) = \log \mathcal{N}(v_i;\mu,\Sigma) = \text{const} - \frac{1}{2}(v_i-\mu)^\top \Sigma^{-1}(v_i-\mu)$. The log converts the exponential into a quadratic, which is tractable, with a closed-form stationary point. For a GMM: $\log P_\theta(v_i) = \log \sum_j \alpha_j \mathcal{N}(v_i;\mu_j,\Sigma_j)$. The sum is inside the log. The gradient w.r.t. $\mu_j$ involves the ratio $\frac{\alpha_j \mathcal{N}(v_i;\mu_j,\Sigma_j)}{\sum_k \alpha_k \mathcal{N}(v_i;\mu_k,\Sigma_k)}$, which depends on all other component parameters simultaneously. Setting this to zero gives an equation where $\mu_j$ appears on both sides, so there is no closed form.
Option B has it backwards: the Gaussian log-likelihood is concave (which makes it easy to maximise). The GMM log-likelihood is non-concave (which makes it hard). Option A is wrong: the difficulty is structural, not a matter of computational complexity.
A NIFTY return model fit as a single Gaussian consistently underprices deep OTM puts. Which property of the true return distribution is the Gaussian failing to capture, and what model would address it?
Deep OTM puts are underpriced by a Gaussian model because the Gaussian assigns too little probability to large negative returns: its tails decay as $e^{-x^2/2}$, much faster than empirically observed. NIFTY returns exhibit: (1) negative skew, meaning large down moves are more frequent than large up moves of the same magnitude; (2) excess kurtosis, meaning extreme moves are more common than the Gaussian predicts. Both properties require a fatter-tailed model.
A two-component GMM, with one narrow Gaussian for normal trading days and one wide Gaussian for volatile/crash days, captures both features. The crash component's wide variance assigns meaningful probability to the $-3\%$ to $-5\%$ daily moves that the single Gaussian treats as near-impossible. This is exactly why the volatility smile exists: the market prices options using an implied distribution that is fatter-tailed and left-skewed relative to any Gaussian.
Option B is wrong: simply increasing $\sigma$ in the same Gaussian would overprice ATM options while still underfitting the tail shape. The problem is distributional form, not parameter value.
Q1–3 are direct computations. Q4–7 require connecting MLE to the broader framework. Q8–10 ask you to reason about model misspecification, consistency, and the limits of MLE.