Foundational Chapter 04 of 08 · Mathematical Foundations of ML

Maximum Likelihood — Let the Data Speak

The Bayes classifier is the theoretical ceiling. But implementing it demands something we don't have: the actual densities $P_X$, $P_Y$, and $P_{X|Y}$. This chapter is about estimating those densities from data — and showing that the principled way to do it always leads to the same place.

Why We Need Density Estimation

Chapter 2 gave us the Bayes classifier — the theoretically optimal decision rule under zero-one loss: $$h_B(x) = \underset{y}{\arg\max}\; P(y \mid x)$$ Beautiful in theory. To use it in practice, we need $P(y|x)$ — the posterior probability of each class given the input. And we don't have it. We have data.

To implement the Bayes classifier from data $D = \{(x_i, y_i)\}$, we need to estimate three quantities:

$$P_X \;\text{— marginal density of inputs} \qquad P_Y \;\text{— class priors} \qquad P_{X|Y} \;\text{— class-conditional densities}$$

From these, Bayes' theorem gives us $P(y|x) \propto P_{X|Y}(x|y) \cdot P_Y(y)$. All three quantities are density estimation problems. Maximum Likelihood Estimation is the principled tool for all of them — and this chapter works out what MLE actually gives us for the most important model families.

Pipelines of this shape underpin fraud detection at Indian payment platforms such as PhonePe and Paytm and at banks like HDFC and ICICI. Given a transaction $x$ (amount, merchant category, location, time, device), estimate $P_{X|Y=\text{fraud}}$ and $P_{X|Y=\text{legit}}$ from historical labelled data using MLE, then use the Bayes classifier to flag suspicious transactions. The Naive Bayes classifier — which assumes features are conditionally independent given the class — makes this tractable at scale, letting enormous transaction volumes be scored in real time against density estimates fit by MLE on historical fraud data.

MLE: The Formal Setup

Chapter 3 established the key equivalence. We restate it cleanly as our working definition for this chapter.

Given $D = \{v_1, \ldots, v_N\}$ i.i.d. from unknown $P_v$, and a parametric model family $\{P_\theta : \theta \in \Theta\}$, the Maximum Likelihood Estimator is: $$\theta^* = \underset{\theta}{\arg\max}\; \frac{1}{N}\sum_{i=1}^N \log P_\theta(v_i) = \underset{\theta}{\arg\max}\; \ell(\theta)$$ where $\ell(\theta)$ is the average log-likelihood. Equivalently, $\theta^*$ minimises $D_{KL}(P_v \| P_\theta)$ — it is the closest model in the parametric family to the true distribution, measured by KL divergence.

Three model families dominate ML in practice. We work out $\theta^*$ for each. The derivations are different in technique but identical in spirit: write $\ell(\theta)$, differentiate, set to zero, solve.

MLE for a Gaussian Model

The simplest continuous case. Model: $P_\theta(v) = \mathcal{N}(v;\, \theta,\, I)$ where $v \in \mathbb{R}^d$ and $\theta \in \mathbb{R}^d$ is the mean. Covariance is fixed at identity for now.

Derivation · MLE for Gaussian Model Closed Form · Sample Mean
Step 1 — Write the log-likelihood

The Gaussian density is $P_\theta(v) = \frac{1}{(2\pi)^{d/2}}\exp\!\left(-\frac{1}{2}(v-\theta)^\top(v-\theta)\right)$. Taking the log:

$$\ell(\theta) = \frac{1}{N}\sum_{i=1}^N \log P_\theta(v_i) = -\frac{d}{2}\log(2\pi) - \frac{1}{N}\cdot\frac{1}{2}\sum_{i=1}^N \|v_i - \theta\|_2^2$$

Maximising $\ell(\theta)$ over $\theta$ is equivalent to minimising $\frac{1}{N}\sum_i \|v_i - \theta\|_2^2$ — the mean squared distance from $\theta$ to the data. The constant $-\frac{d}{2}\log(2\pi)$ drops out.

Maximising log-likelihood = minimising sum of squared distances from the mean. MLE and least squares are the same problem here.
Step 2 — Differentiate and set to zero
$$\frac{\partial \ell}{\partial \theta} = \frac{1}{N}\sum_{i=1}^N \frac{\partial}{\partial \theta}\left[-\frac{1}{2}(v_i - \theta)^\top(v_i - \theta)\right] = \frac{1}{N}\sum_{i=1}^N (v_i - \theta) = 0$$
Step 3 — Solve
$$\sum_{i=1}^N v_i - N\theta = 0 \implies \boxed{\theta^* = \frac{1}{N}\sum_{i=1}^N v_i}$$
The MLE for the mean of a Gaussian is the sample mean. Not surprising intuitively — but now we know exactly why: the sample mean is the parameter that maximises the probability of the observed data under the Gaussian model, and equivalently, the closest Gaussian (in KL divergence) to whatever distribution actually generated the data.
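As a quick numerical sanity check — a sketch on synthetic data, with arbitrary generating parameters — the sample mean should beat every perturbed candidate on average log-likelihood under $\mathcal{N}(\theta, I)$:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(loc=2.0, scale=1.0, size=(500, 3))  # synthetic data, d = 3

def avg_log_likelihood(theta, v):
    """Average log-likelihood (1/N) sum_i log N(v_i; theta, I)."""
    d = v.shape[1]
    sq = np.sum((v - theta) ** 2, axis=1)
    return -d / 2 * np.log(2 * np.pi) - 0.5 * np.mean(sq)

theta_mle = v.mean(axis=0)  # closed-form MLE: the sample mean

# Every perturbation of the sample mean strictly lowers the log-likelihood,
# because l(theta) is a concave quadratic with its unique maximum there.
for _ in range(100):
    candidate = theta_mle + rng.normal(scale=0.1, size=3)
    assert avg_log_likelihood(candidate, v) < avg_log_likelihood(theta_mle, v)
```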
๐Ÿฅ Hospital Patient Ages

Ages assumed $\mathcal{N}(\mu, \sigma^2)$. After observing 200 patients, MLE gives $\hat{\mu} = \bar{\text{age}}$ โ€” the average age in the dataset. Nothing more, nothing less. The sample mean is the principled answer.

๐Ÿ‡ฎ๐Ÿ‡ณ NIFTY Log-Returns

Model $v_i \sim \mathcal{N}(\mu, \sigma^2)$. MLE: $\hat{\mu} = \frac{1}{N}\sum v_i$ (sample mean return), $\hat{\sigma}^2 = \frac{1}{N}\sum(v_i - \hat{\mu})^2$ (sample variance). Every quant who fits a normal distribution to returns is doing MLE, whether they use that language or not.

The MLE gives the best Gaussian fit to the data — best within the Gaussian family, measured by KL divergence. If the true distribution is fat-tailed (as NIFTY daily returns demonstrably are), the MLE Gaussian is still wrong. It is the least wrong Gaussian, not the right model. The sample mean and variance are correct estimates of the Gaussian parameters — the question is whether a Gaussian was the right model to begin with. Model misspecification is always the first thing to check.

MLE for a Discrete Distribution

When $v_i$ takes values in a finite set $\{a_1, a_2, \ldots, a_m\}$ with unknown probabilities $\{p_1, p_2, \ldots, p_m\}$, the MLE problem is constrained: we need $p_j \geq 0$ and $\sum_j p_j = 1$.

To write the likelihood cleanly, we use the one-hot encoding trick. For each observation $v_i$, define an auxiliary vector $z_i \in \{0,1\}^m$ where $z_i^j = 1$ if $v_i = a_j$ and 0 otherwise. Exactly one entry of $z_i$ is 1 — the one corresponding to the observed outcome. Then:

$$P_\theta(v_i) = \prod_{j=1}^m p_j^{z_i^j} \qquad \log P_\theta(v_i) = \sum_{j=1}^m z_i^j \log p_j$$

The log-likelihood becomes:

$$\ell(\theta) = \frac{1}{N}\sum_{i=1}^N \sum_{j=1}^m z_i^j \log p_j$$
Derivation · MLE for Discrete Distribution via Lagrangian Constrained Optimisation · Empirical Frequency
Step 1 — Set up the constrained problem

Maximise $\ell(\theta) = \frac{1}{N}\sum_i \sum_j z_i^j \log p_j$ subject to $\sum_j p_j = 1$ and $p_j \geq 0$. The normalisation constraint is binding — without it, nothing stops every $p_j$ with a nonzero count from growing without bound, since $\log p_j$ is increasing in $p_j$. We enforce it with a Lagrange multiplier $\lambda$.

Step 2 — Write the Lagrangian
$$\mathcal{L}(p_1,\ldots,p_m,\lambda) = \frac{1}{N}\sum_{i=1}^N\sum_{j=1}^m z_i^j \log p_j + \lambda\!\left(\sum_{j=1}^m p_j - 1\right)$$
Step 3 — Differentiate w.r.t. $p_j$ and set to zero
$$\frac{\partial \mathcal{L}}{\partial p_j} = \frac{1}{N}\cdot\frac{\sum_{i=1}^N z_i^j}{p_j} + \lambda = 0 \implies p_j = -\frac{\sum_i z_i^j}{N\lambda}$$
Step 4 — Use the normalisation constraint to find $\lambda$

Summing over $j$: $\sum_j p_j = -\frac{1}{N\lambda}\sum_j \sum_i z_i^j = -\frac{1}{N\lambda}\cdot N = -\frac{1}{\lambda} = 1$, so $\lambda = -1$.

Step 5 — Substitute back
$$\boxed{p_j^* = \frac{\sum_{i=1}^N z_i^j}{N} = \frac{\text{number of times } a_j \text{ occurred in } D}{N}}$$
The MLE for a discrete distribution is the empirical frequency. The fraction of times each outcome occurred in the data. The Lagrangian does the real work: it enforces probability normalisation and automatically distributes the mass proportionally to observed counts. Without the constraint, this result would not hold.
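The closed form is one line of code. A minimal sketch with a made-up three-outcome alphabet, using the same one-hot encoding as the derivation:

```python
import numpy as np

outcomes = ["a1", "a2", "a3"]                    # the finite support
data = ["a2", "a3", "a2", "a1", "a2", "a3"]      # N = 6 observations

# One-hot encode each observation, then average: p_j* = (1/N) sum_i z_i^j
Z = np.array([[1.0 if v == a else 0.0 for a in outcomes] for v in data])
p_mle = Z.mean(axis=0)

assert np.allclose(p_mle, [1 / 6, 3 / 6, 2 / 6])  # empirical frequencies
assert np.isclose(p_mle.sum(), 1.0)               # normalisation holds
```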
🎲 Biased Die

Observed outcomes from 8 rolls: $\{2, 3, 4, 1, 6, 3, 2, 3\}$. MLE: $\hat{p}_3 = 3/8$, $\hat{p}_2 = 2/8$, $\hat{p}_1 = \hat{p}_4 = \hat{p}_6 = 1/8$, $\hat{p}_5 = 0$. The die appears biased toward 3. (With only 8 rolls, this estimate has high variance — but it is the principled MLE.)

🇮🇳 NIFTY Expiry Outcomes

Categorise 40 weekly expiries: Large Up 8, Small Up 14, Flat 6, Small Down 7, Large Down 5. MLE: $\hat{p}_{\text{LU}} = 0.20$, $\hat{p}_{\text{SU}} = 0.35$, $\hat{p}_{\text{F}} = 0.15$, $\hat{p}_{\text{SD}} = 0.175$, $\hat{p}_{\text{LD}} = 0.125$. This empirical distribution is the real-world measure — what a physical-measure model would use as $\mathbb{P}$ instead of the risk-neutral $\mathbb{Q}$.

The Lagrangian is the standard technique for constrained optimisation: to maximise $f(x)$ subject to $g(x) = 0$, form $\mathcal{L}(x, \lambda) = f(x) + \lambda g(x)$ and solve $\nabla_x \mathcal{L} = 0$ and $\nabla_\lambda \mathcal{L} = 0$ simultaneously. The multiplier $\lambda$ enforces the constraint at the optimum. Here the constraint is probability normalisation — $\sum_j p_j = 1$ — and the Lagrangian automatically distributes the total probability mass according to observed frequencies. This same technique appears in SVMs, regularised regression, and portfolio optimisation. It is worth understanding once and recognising everywhere.

The Multimodal Problem: When One Gaussian Is Not Enough

Both MLE derivations above assumed a single parametric family — one Gaussian, one discrete distribution. Real data is often more complex. Many datasets are multimodal — they have multiple clusters, subpopulations, or regimes. A single Gaussian, however well fit by MLE, cannot capture this structure.

🔴 The Single-Gaussian Failure

Fit a single Gaussian to a bimodal dataset. The MLE places the mean somewhere between the two modes — a location where almost no actual data points live. The model assigns high probability to a region the data avoids, and underestimates probability in both actual clusters. $D_{KL}(P_{\text{true}} \| P_\theta)$ remains large no matter how many data points you use — the Gaussian family simply cannot express the true shape.

This is not a data problem. It is a model family problem. The solution: expand $\mathcal{H}$.

๐Ÿฅ Hospital Age Distribution

Patients split between young adults (viral infections, mean age 28) and elderly (chronic conditions, mean age 68). Single Gaussian MLE: mean ≈ 48 — describes nobody in the ward. The bimodal truth requires two Gaussians.

🇮🇳 INDIA VIX Regimes

VIX spends most of its time in a calm regime (12–16) and occasionally spikes into a fear regime (25–45). A single Gaussian fit puts the mean around 16 with large variance — missing both regimes. A two-component mixture captures the structure naturally: one narrow Gaussian for calm, one wide Gaussian for fear.

NIFTY daily returns are not unimodal. In trending bull markets, the return distribution is right-skewed with a positive mean. In volatile, mean-reverting regimes, it is approximately symmetric with much higher variance. A Gaussian Mixture Model with two or three components captures this regime structure far better than any single Gaussian. The mixing weights $\alpha_j$ become the estimated probabilities of being in each regime — a natural volatility regime detector built directly from return data.

Mixture Density Models

The solution to the multimodal problem is to define the model as a weighted sum of simpler component densities.

A mixture density model with $M$ components is: $$P_\theta(v) = \sum_{j=1}^M \alpha_j\, P_{\theta_j}(v)$$ where $\alpha_j \in [0,1]$, $\sum_j \alpha_j = 1$, and each $P_{\theta_j}$ is a component density with its own parameters $\theta_j$. The full parameter set is $\theta = \{\alpha_j, \theta_j\}_{j=1}^M$. For a Gaussian Mixture Model (GMM): each component is $P_{\theta_j}(v) = \mathcal{N}(v;\, \mu_j,\, \Sigma_j)$, so $\theta = \{\alpha_j, \mu_j, \Sigma_j\}_{j=1}^M$.

The mixing weight $\alpha_j$ is the prior probability that a randomly drawn data point came from component $j$. If you think of the data as generated by first choosing a component (a "type" of data point — a regime, a subpopulation) with probability $\alpha_j$, then drawing from that component's distribution, you recover the mixture density. This generative story introduces a hidden variable: which component each data point came from.
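The generative story translates directly into a sampler. A sketch with two 1-D components; the parameter values are illustrative, not fitted to anything:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-component 1-D Gaussian mixture: a narrow "calm" component and a
# wide "volatile" component. All numbers are illustrative.
alphas = np.array([0.8, 0.2])    # mixing weights, sum to 1
mus    = np.array([0.05, -0.10])
sigmas = np.array([0.8, 2.5])

def sample_mixture(n):
    """First choose a component j with probability alpha_j (the hidden
    variable z_i), then draw from that component's Gaussian."""
    z = rng.choice(len(alphas), size=n, p=alphas)   # hidden labels
    v = rng.normal(mus[z], sigmas[z])
    return v, z

v, z = sample_mixture(100_000)
# The hidden labels recover the mixing weights empirically.
assert abs((z == 0).mean() - alphas[0]) < 0.01
```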

GMMs are universal density approximators for continuous random variables. With enough components, a Gaussian mixture can approximate any smooth continuous density to arbitrary precision. This is their power: they can model any shape — unimodal, bimodal, skewed, fat-tailed — given enough components and enough data. Neural networks have a similar universality property for function approximation; GMMs have it for density estimation. The trade-off is computational: fitting a GMM requires solving a much harder optimisation problem than fitting a single Gaussian.

Why Mixture MLE Is Hard

Take the GMM log-likelihood and try to maximise it directly:

$$\ell(\theta) = \frac{1}{N}\sum_{i=1}^N \log P_\theta(v_i) = \frac{1}{N}\sum_{i=1}^N \log \sum_{j=1}^M \alpha_j\, \mathcal{N}(v_i;\, \mu_j,\, \Sigma_j)$$

The problem is the log of a sum. For a single Gaussian, the log collapsed the product into a tractable quadratic. Here the sum inside the log couples all $M$ components together. Setting $\frac{\partial \ell}{\partial \mu_j} = 0$ gives:

$$\frac{\partial \ell}{\partial \mu_j} = \frac{1}{N}\sum_{i=1}^N \frac{\alpha_j\, \mathcal{N}(v_i;\mu_j,\Sigma_j)}{\sum_{k} \alpha_k\, \mathcal{N}(v_i;\mu_k,\Sigma_k)} \cdot \Sigma_j^{-1}(v_i - \mu_j) = 0$$

This equation involves $\mu_j$ on both sides โ€” in the numerator directly, and in the denominator through the sum over all components. There is no closed-form solution. The MLE problem for a GMM has no analytical answer.
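We can still evaluate $\ell(\theta)$ numerically; we just cannot solve $\partial \ell / \partial \mu_j = 0$ in closed form. A sketch of the evaluation for 1-D data, computing the sum inside the log with the standard log-sum-exp trick (the parameters passed in are arbitrary):

```python
import numpy as np

def gmm_avg_loglik(v, alphas, mus, sigmas):
    """(1/N) sum_i log sum_j alpha_j N(v_i; mu_j, sigma_j^2) for 1-D data.
    The sum over components sits INSIDE the log, so we evaluate it with
    log-sum-exp for numerical stability -- no algebraic simplification exists."""
    v = np.asarray(v, dtype=float)[:, None]          # shape (N, 1) vs (M,)
    log_terms = (np.log(alphas)
                 - 0.5 * np.log(2 * np.pi * sigmas**2)
                 - 0.5 * ((v - mus) / sigmas) ** 2)  # log(alpha_j N(v_i; ...))
    m = log_terms.max(axis=1, keepdims=True)
    return float(np.mean(m[:, 0] + np.log(np.exp(log_terms - m).sum(axis=1))))

# With M = 1 the mixture collapses to a single Gaussian, and the value
# matches the closed-form single-Gaussian log-likelihood.
v = np.array([0.4, -0.3, 1.1, -0.2, 0.6])
lone = gmm_avg_loglik(v, np.array([1.0]), np.array([0.0]), np.array([1.0]))
closed = np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * v**2)
assert np.isclose(lone, closed)
```

With $M > 1$, maximising this function over $\{\alpha_j, \mu_j, \sigma_j\}$ has to be done iteratively — which is exactly the problem Chapter 5 takes up.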

🔴 The Log-of-Sum Obstacle

Why it's fundamental, not numerical: If we knew which component each $v_i$ came from — if there were hidden labels $z_i \in \{1, \ldots, M\}$ telling us "this point came from component $j$" — the log-likelihood would factorise into $M$ independent Gaussian MLEs, each with a closed form. The difficulty is entirely caused by the missing labels.

Direct gradient ascent fails: The objective is non-convex in $\theta$. Gradient ascent gets trapped in local optima. The updates have no clean form. A fundamentally different algorithmic idea is needed.

Worked Example 1 · Gaussian MLE on NIFTY Returns 🇮🇳 Full Derivation

Five daily NIFTY log-returns observed: $v = \{+0.4\%,\; -0.3\%,\; +1.1\%,\; -0.2\%,\; +0.6\%\}$. Model: $P_\theta(v) = \mathcal{N}(v;\, \mu,\, \sigma^2)$. Find the MLE for both $\mu$ and $\sigma^2$.

Step 1 — MLE for $\mu$

From the derivation above, $\hat{\mu} = \frac{1}{N}\sum v_i$:

$$\hat{\mu} = \frac{0.4 + (-0.3) + 1.1 + (-0.2) + 0.6}{5} = \frac{1.6}{5} = 0.32\%$$
Step 2 — MLE for $\sigma^2$

Differentiating $\ell(\theta)$ w.r.t. $\sigma^2$ and setting to zero gives the sample variance (dividing by $N$, not $N-1$):

$$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^N (v_i - \hat{\mu})^2$$ $$= \frac{(0.4-0.32)^2 + (-0.3-0.32)^2 + (1.1-0.32)^2 + (-0.2-0.32)^2 + (0.6-0.32)^2}{5}$$ $$= \frac{0.0064 + 0.3844 + 0.6084 + 0.2704 + 0.0784}{5} = \frac{1.348}{5} = 0.2696\%^2$$

So $\hat{\sigma} \approx 0.52\%$ daily volatility.

Interpret the result
The fitted model: $\hat{P}(v) = \mathcal{N}(v;\; 0.32\%,\; 0.2696\%^2)$. This is the closest Gaussian to the true NIFTY return distribution, measured by KL divergence — given just these 5 observations. With 5 data points, the estimates have enormous uncertainty. With 250 trading days (one year), the estimates would be far more reliable. The MLE is consistent: as $N \to \infty$, $\hat{\mu} \to \mu_{\text{true}}$ and $\hat{\sigma}^2 \to \sigma^2_{\text{true}}$ — but only if the true distribution is actually Gaussian. If it isn't, the MLE converges to the best Gaussian approximation, not to the truth.
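The arithmetic in this example can be checked in a few lines. (Note that NumPy's `np.var` defaults to `ddof=0`, i.e. it divides by $N$ — the MLE convention, not the $N-1$ of the unbiased estimator.)

```python
import numpy as np

v = np.array([0.4, -0.3, 1.1, -0.2, 0.6])   # daily log-returns, in percent

mu_hat = v.mean()                            # MLE for mu: the sample mean
var_hat = np.mean((v - mu_hat) ** 2)         # MLE for sigma^2: divide by N

assert np.isclose(mu_hat, 0.32)
assert np.isclose(var_hat, 0.2696)
assert np.isclose(np.sqrt(var_hat), 0.52, atol=0.005)  # sigma_hat ~ 0.52%
assert np.isclose(var_hat, np.var(v))        # np.var defaults to ddof=0
```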
Worked Example 2 · Discrete MLE on NIFTY Weekly Expiry Outcomes 🇮🇳 Empirical Distribution vs Black-Scholes

Over 40 weekly NIFTY expiries, outcomes are categorised as follows. Compute the MLE (empirical distribution) and compare to what a Black-Scholes Gaussian model would imply.

Observed data
$$\begin{array}{lcc} \text{Outcome} & \text{Count} & \hat{p}_j^{\text{MLE}} \\ \hline \text{Large Up } (>+1\%) & 8 & 8/40 = 0.200 \\ \text{Small Up } (0\%\text{ to }+1\%) & 14 & 14/40 = 0.350 \\ \text{Flat } (-0.5\%\text{ to }0\%) & 6 & 6/40 = 0.150 \\ \text{Small Down } (-1\%\text{ to }-0.5\%) & 7 & 7/40 = 0.175 \\ \text{Large Down } (<-1\%) & 5 & 5/40 = 0.125 \\ \hline \text{Total} & 40 & 1.000 \end{array}$$
What does this tell us?

The empirical distribution shows positive skew — large up moves (20%) are more common than large down moves (12.5%) in this sample. A Gaussian with the same $\hat{\mu}$ and $\hat{\sigma}$, as Black-Scholes assumes, would assign roughly equal probability to symmetric tails. The discrepancy — real data is skewed, the Gaussian is symmetric — is one source of the volatility skew in NIFTY option markets: implied volatilities for OTM puts sit above the flat Black-Scholes level because the market prices crash risk at a premium rather than at the symmetric-Gaussian probability of a large down move.

The key insight: the MLE empirical distribution is the real-world measure $\mathbb{P}$. Options are priced under the risk-neutral measure $\mathbb{Q}$, which differs from $\mathbb{P}$ by the market price of risk. The gap between $\hat{\mathbb{P}}$ (what the data says) and $\mathbb{Q}$ (what the options market prices) is where systematic trading strategies live. MLE on historical data gives you $\hat{\mathbb{P}}$. The market gives you $\mathbb{Q}$. Compare the two.

We now have MLE in hand for the two fundamental model families. But the GMM — the model we need for multimodal data — has no closed-form MLE. The obstacle is the log-of-sum: without knowing which component each data point came from, the optimisation couples all components together in an intractable way.

Chapter 5 Preview

What if we introduced the hidden component assignments $z_i$ explicitly as a latent variable, derived a lower bound on $\ell(\theta)$ that is actually tractable (the ELBO), and then alternated between inferring the latent variables and updating the parameters? This is the Expectation-Maximisation algorithm — one of the most elegant ideas in all of statistical ML. The machinery to understand it — Jensen's inequality, the ELBO, and the geometry of latent variable models — is all in Chapter 5.


Practice Problems
4 questions · Chapter 04
ml / mathematical-foundations / ch04 / q01 ★ Conceptual

A quant fits a Gaussian to NIFTY daily log-returns and reports $\hat{\mu} = 0.04\%$ and $\hat{\sigma} = 1.2\%$. A colleague says "that's just the sample mean and variance โ€” you haven't done any real estimation." Which response is most accurate?

A
The colleague is right โ€” sample statistics are not proper statistical estimators.
B
The quant is right, but the Gaussian assumption needs to be stated explicitly to be valid.
C
The colleague is technically correct but misses the deeper point: the sample mean and variance are the MLE for a Gaussian model, derived from minimising KL divergence to the true distribution.
D
Neither is right โ€” the MLE for a Gaussian uses median and MAD, not mean and variance.
Answer: C.

The colleague is not wrong — $\hat{\mu}$ and $\hat{\sigma}$ are indeed the sample mean and variance. But calling them "just" sample statistics misses the principled reason they are the right quantities: they are the MLE for the Gaussian model, derived by maximising the log-likelihood (equivalently, minimising KL divergence). The sample mean is the unique solution to $\frac{\partial \ell}{\partial \mu} = 0$; the sample variance solves $\frac{\partial \ell}{\partial \sigma^2} = 0$.

Option D is wrong: MLE for a Gaussian uses mean and variance. The median minimises absolute error and is the MLE for the location of a Laplace distribution; MAD is a robust scale estimator and does not arise as the MLE of any standard family.
ml / mathematical-foundations / ch04 / q02 ★ Computational

Six coin flips are observed: $\{H, H, T, H, T, H\}$. Model: $P_\theta(H) = p$, $P_\theta(T) = 1-p$. What is the MLE for $p$, and why is the Lagrangian needed?

A
$\hat{p} = 0.5$ — the MLE always defaults to the uniform distribution.
B
$\hat{p} = 4/6 = 2/3$ — the empirical frequency of heads. The Lagrangian enforces $p + (1-p) = 1$, which gives $\lambda = -1$ and yields the frequency formula directly.
C
$\hat{p} = 4/6 = 2/3$, but no Lagrangian is needed since this is a single-parameter problem.
D
$\hat{p} = 1$ — the MLE assigns all probability to the most common outcome.
Answer: B.

With 4 heads and 2 tails, the log-likelihood is $\ell(p) = \frac{1}{6}(4\log p + 2\log(1-p))$. For a two-outcome case, we have the implicit constraint $p + (1-p) = 1$, which is automatically satisfied by parameterising with a single $p$. But for $m > 2$ outcomes, the Lagrangian is essential to enforce $\sum_j p_j = 1$.

Option C is partially right about the answer but wrong about the Lagrangian: even for the two-outcome case, the Lagrangian framework gives the same answer and generalises naturally to $m$ outcomes. Option D is wrong: unconstrained maximisation of $\log p$ alone (without the normalisation constraint enforced by the Lagrangian or the $1-p$ term) would indeed push $p \to 1$, but the constraint prevents this.
ml / mathematical-foundations / ch04 / q03 ★★ Conceptual

Why does the log-likelihood for a Gaussian Mixture Model have no closed-form maximiser, while the log-likelihood for a single Gaussian does? What is the structural difference?

A
GMMs have more parameters, making the gradient computation too complex to solve analytically.
B
The Gaussian log-likelihood is concave everywhere, while the GMM log-likelihood is convex, making the GMM harder to maximise.
C
For a single Gaussian, the log converts the product into a tractable sum. For a GMM, a sum appears inside the log, which cannot be simplified — the gradient equations involve the parameters on both sides and have no analytical solution.
D
GMMs require computing matrix inverses, which have no closed form in high dimensions.
Answer: C.

The key structural difference is the log-of-sum. For a single Gaussian: $\log P_\theta(v_i) = \log \mathcal{N}(v_i;\mu,\Sigma) = \text{const} - \frac{1}{2}(v_i-\mu)^\top \Sigma^{-1}(v_i-\mu)$. The log converts the exponential into a quadratic โ€” tractable, with a closed-form gradient zero. For a GMM: $\log P_\theta(v_i) = \log \sum_j \alpha_j \mathcal{N}(v_i;\mu_j,\Sigma_j)$. The sum is inside the log. The gradient w.r.t. $\mu_j$ involves the ratio $\frac{\alpha_j \mathcal{N}(v_i;\mu_j,\Sigma_j)}{\sum_k \alpha_k \mathcal{N}(v_i;\mu_k,\Sigma_k)}$, which depends on all other component parameters simultaneously. Setting this to zero gives an equation where $\mu_j$ appears on both sides โ€” no closed form.

Option B has it backwards: the Gaussian log-likelihood is concave (which makes it easy to maximise). The GMM log-likelihood is non-concave (which makes it hard). Option A is wrong: the difficulty is structural, not a matter of computational complexity.
ml / mathematical-foundations / ch04 / q04 ★★ Conceptual

A NIFTY return model fit as a single Gaussian consistently underprices deep OTM puts. Which property of the true return distribution is the Gaussian failing to capture, and what model would address it?

A
The Gaussian overestimates mean return. A model with lower $\mu$ would fix the put pricing.
B
The Gaussian underestimates variance. Increasing $\hat{\sigma}$ in the same Gaussian model would fix it.
C
The Gaussian has thin tails and cannot model the excess kurtosis and negative skew of true NIFTY returns. A GMM with a low-volatility component and a high-volatility crash component would assign more probability to large negative moves.
D
The Gaussian is symmetric, so OTM puts and calls would both be mispriced equally. The problem must be in the interest rate assumption, not the return model.
Answer: C.

Deep OTM puts are underpriced by a Gaussian model because the Gaussian assigns too little probability to large negative returns — its tails decay as $e^{-x^2/2}$, much faster than empirically observed. NIFTY returns exhibit: (1) negative skew — large down moves are more frequent than large up moves of the same magnitude; (2) excess kurtosis — extreme moves are more common than the Gaussian predicts. Both properties require a fatter-tailed model.

A two-component GMM — one narrow Gaussian for normal trading days and one wide Gaussian for volatile/crash days — captures both features. The crash component's wide variance assigns meaningful probability to the $-3\%$ to $-5\%$ daily moves that the single Gaussian treats as near-impossible. This is exactly why the volatility smile exists: the market prices options using an implied distribution that is fatter-tailed and left-skewed relative to any Gaussian.

Option B is wrong: simply increasing $\sigma$ in the same Gaussian would overprice ATM options while still underfitting the tail shape. The problem is distributional form, not parameter value.

Terminal Questions — Chapter 04 · 10 problems · No answers given

Q1–3 are direct computations. Q4–7 require connecting MLE to the broader framework. Q8–10 ask you to reason about model misspecification, consistency, and the limits of MLE.

1
Compute the MLE for $\mu$ and $\sigma^2$ given NIFTY weekly returns: $\{+2.1\%, -0.8\%, +1.4\%, -1.2\%, +0.3\%, +1.9\%, -0.5\%\}$. State the fitted Gaussian model and interpret $\hat{\sigma}$ in plain English. Easy
2
In 20 IPL matches, team A won 13 and lost 7. Treating wins as i.i.d. Bernoulli($p$), compute the MLE for $p$. Then write down the log-likelihood $\ell(p)$, differentiate it, and verify your answer emerges from setting $\frac{d\ell}{dp} = 0$. Easy
3
A discrete RV takes values $\{1, 2, 3, 4, 5, 6\}$. Dataset of 12 observations: $\{3, 1, 4, 1, 5, 6, 2, 6, 5, 3, 5, 2\}$. (a) Write the one-hot encoding for observation $v_1 = 3$. (b) Compute the MLE probabilities for all six outcomes. (c) Verify $\sum_j \hat{p}_j = 1$. Easy
4
Derive the MLE for $\sigma^2$ in the Gaussian model $\mathcal{N}(\mu, \sigma^2 I)$ when both $\mu$ and $\sigma^2$ are unknown. Your derivation should differentiate $\ell(\theta)$ w.r.t. $\sigma^2$ and solve. Note that the MLE divides by $N$, not $N-1$ — why? What is the consequence for small samples? Medium
5
Show that the MLE for an exponential distribution $P_\theta(v) = \theta e^{-\theta v}$, $v \geq 0$ is $\hat{\theta} = 1/\bar{v}$ where $\bar{v}$ is the sample mean. Interpret this: if average inter-trade time on NSE is 0.5 seconds, what does MLE say about the rate parameter of the exponential model? Medium
6
You observe 5 NIFTY returns. Without computing anything, explain why a GMM with $M = 5$ components can achieve $\ell(\theta) \to \infty$ (i.e., infinite log-likelihood). Why is this a problem, and what does it say about the MLE for GMMs when the number of components is not constrained? Medium
7
Explain why implementing the Bayes classifier in practice requires density estimation. Given dataset $D = \{(x_i, y_i)\}$ with $y_i \in \{0, 1\}$, write down exactly which density estimation problems need to be solved, and state the MLE estimator for each under a Gaussian class-conditional model $P_{X|Y=k} = \mathcal{N}(\mu_k, \Sigma_k)$. Medium
8
The MLE is consistent: $\hat{\theta}^* \xrightarrow{p} \theta^*$ as $N \to \infty$ when the true distribution $P_v$ is in the model family $\{P_\theta\}$. What does $\hat{\theta}^*$ converge to when $P_v$ is not in the family — i.e., under model misspecification? State your answer precisely, and illustrate with the case of fitting a Gaussian to fat-tailed NIFTY returns. Hard
9
A Gaussian Mixture Model with $M = 2$ components is proposed for NIFTY daily returns: a "calm" regime $\mathcal{N}(\mu_1, \sigma_1^2)$ and a "volatile" regime $\mathcal{N}(\mu_2, \sigma_2^2)$ with mixing weights $(\alpha, 1-\alpha)$. Propose plausible values for all five parameters $\{\mu_1, \sigma_1, \mu_2, \sigma_2, \alpha\}$ based on your knowledge of NIFTY market behaviour. Justify each choice. Then write out the full log-likelihood $\ell(\theta)$ for $N$ observations and explain why the log-of-sum prevents a closed-form MLE. Hard
10
The MLE for a discrete distribution over $m$ outcomes can assign $\hat{p}_j = 0$ to outcomes that never appeared in the training data. In the context of a NIFTY regime classifier trained on 2 years of data, explain why $\hat{p}_j = 0$ for unseen regimes is catastrophically wrong in live trading — and describe two principled solutions (hint: Laplace smoothing and Bayesian priors both address this). Which solution does Ch5's MAP framework formalise? Hard