Hidden Variables,
Prior Beliefs, and the EM Algorithm
Some structure in data is visible. Some is not. The EM algorithm is the principled answer to learning from data when half of what matters is hidden, and MAP estimation is the answer to learning when you have more prior knowledge than data.
The Problem Left Open
Chapter 4 ended with an obstacle. The GMM log-likelihood $\ell(\theta) = \frac{1}{N}\sum_i \log \sum_j \alpha_j \mathcal{N}(v_i; \mu_j, \Sigma_j)$ has no closed-form maximiser because the log sits outside a sum, coupling all components together. The root cause: we don't know which component generated each data point. That hidden information, the component assignment $z_i$, is the latent variable.
This chapter builds the machinery to handle it, and introduces a second, related problem: what happens when MLE gives catastrophically overconfident answers because $N$ is too small? Both problems share the same cure: bringing in information beyond what the data alone provides.
Latent Variable Models
A latent variable model posits that data is generated by a two-stage process. First, a hidden variable $Z$ is drawn from some distribution. Then, the observed data $x$ is drawn from a distribution conditioned on $Z$. The data $x$ is what we see. The variable $Z$ is never observed.
For a GMM specifically: $Z \in \{1, \ldots, M\}$ is the component label, $P_\theta(Z=j) = \alpha_j$ is the mixing weight, and $P_\theta(x \mid Z=j) = \mathcal{N}(x; \mu_j, \Sigma_j)$ is the component density. Marginalising over $Z$ recovers the mixture: $$P_\theta(x) = \sum_{j=1}^M \alpha_j\, \mathcal{N}(x; \mu_j, \Sigma_j)$$
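The two-stage story can be sketched directly in code. The parameters below are made up for illustration; the point is that the generative process draws the hidden $Z$ first and then $x$ given $Z$, while the marginal density sums $Z$ out:

```python
import math
import random

# Illustrative 2-component 1-D GMM (parameters are made up for this sketch).
alphas = [0.3, 0.7]    # mixing weights P(Z=j); sum to 1
mus    = [-2.0, 1.0]   # component means
sigmas = [0.5, 1.0]    # component standard deviations

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def sample():
    """Two-stage generative process: draw the hidden Z, then x given Z."""
    z = random.choices([0, 1], weights=alphas)[0]
    x = random.gauss(mus[z], sigmas[z])
    return z, x

def marginal(x):
    """Marginalise over Z: P(x) = sum_j alpha_j * N(x; mu_j, sigma_j)."""
    return sum(a * normal_pdf(x, m, s) for a, m, s in zip(alphas, mus, sigmas))

random.seed(0)
draws = [sample() for _ in range(10000)]
xs = [x for _, x in draws]   # in practice only x is observed; z is discarded
```

Roughly 30% of the hidden labels come out as component 0, matching $\alpha_1 = 0.3$, even though a real dataset would hand us only `xs`.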
$Z$ = disease subtype (never directly observed on admission). $x$ = symptom profile. The mixture over undiagnosed subtypes generates the observed symptom distribution. Knowing $Z$ would make diagnosis straightforward; the whole problem is that we don't.
$Z$ = market regime: calm trending, normal, high-volatility crash. $x$ = daily log-return. The regime is never announced; you infer it from returns. A 3-component GMM models the three regimes, with $\alpha_j$ as the prior probability of each regime.
The ELBO: A Tractable Lower Bound
The MLE objective $\ell(\theta) = \log P_\theta(x) = \log \sum_z P_\theta(x,z)$ has the log outside the sum and is therefore intractable. The key move: introduce an arbitrary distribution $q(z)$ over the latent variable, and use it to construct a lower bound that is tractable.
For any distribution $q(z)$ with the same support as $P_\theta(z|x)$:
$$\ell(\theta) = \log \sum_z P_\theta(x,z) = \log \sum_z q(z)\,\frac{P_\theta(x,z)}{q(z)} = \log\, \mathbb{E}_{z\sim q}\!\left[\frac{P_\theta(x,z)}{q(z)}\right]$$
Since $\log$ is concave, Jensen's inequality gives $\log\,\mathbb{E}[Y] \geq \mathbb{E}[\log Y]$:
$$\ell(\theta) \;\geq\; \mathbb{E}_{z\sim q}\!\left[\log \frac{P_\theta(x,z)}{q(z)}\right] \;=:\; F_\theta(q)$$
Expand $F_\theta(q)$ into two interpretable terms:
$$F_\theta(q) = \mathbb{E}_{z\sim q}\big[\log P_\theta(x,z)\big] + H(q), \qquad H(q) = -\sum_z q(z)\log q(z)$$
The first term measures how well $\theta$ explains both observed and latent data. The second term is the entropy of $q$; it rewards spreading probability mass, preventing $q$ from collapsing to a point.
When Is the ELBO Tight?
The gap between the ELBO and the true log-likelihood has an exact expression, and it tells us precisely what $q$ to choose.
Use $P_\theta(x,z) = P_\theta(x)\cdot P_\theta(z|x)$ to split the joint:
$$F_\theta(q) = \mathbb{E}_{z\sim q}\!\left[\log \frac{P_\theta(x)\,P_\theta(z|x)}{q(z)}\right] = \log P_\theta(x) - D_{KL}\big(q(z)\,\|\,P_\theta(z|x)\big)$$
Since $D_{KL}(q \| P_\theta(z|x)) \geq 0$ with equality if and only if $q = P_\theta(z|x)$:
$$\ell(\theta) - F_\theta(q) = D_{KL}\big(q(z)\,\|\,P_\theta(z|x)\big) \;\geq\; 0$$
The ELBO equals the log-likelihood exactly when $q$ is the true posterior $P_\theta(z|x)$.
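Both the bound and its tightness can be verified on a toy discrete example. The joint probabilities below are arbitrary illustrative numbers, with a single observed $x$ and a binary latent $z$:

```python
import math

# Arbitrary joint P(x, z) for one fixed observed x and latent z in {0, 1}.
joint = [0.12, 0.28]                 # P(x, z=0), P(x, z=1)
log_lik = math.log(sum(joint))       # l(theta) = log P(x) = log 0.4

def elbo(q):
    """F(q) = E_q[log P(x,z)] + H(q) = sum_z q(z) * log(P(x,z) / q(z))."""
    return sum(qz * math.log(pxz / qz) for qz, pxz in zip(q, joint) if qz > 0)

# Any valid q gives a lower bound on the log-likelihood ...
for q in ([0.5, 0.5], [0.9, 0.1], [0.2, 0.8]):
    assert elbo(q) <= log_lik + 1e-12

# ... and the bound is tight exactly at the posterior q(z) = P(z | x).
posterior = [p / sum(joint) for p in joint]    # [0.3, 0.7]
assert abs(elbo(posterior) - log_lik) < 1e-12
```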
The EM Algorithm
Instead of maximising $\ell(\theta)$ directly (intractable), EM maximises the ELBO $F_\theta(q)$ jointly over both $\theta$ and $q$ by alternating between two tractable steps.
The E-step sets $q$ to the true posterior under the current $\theta^t$, making the ELBO tight: $F_{\theta^t}(q^{t+1}) = \ell(\theta^t)$. The M-step then finds the $\theta$ that maximises this tight bound, necessarily increasing $\ell(\theta)$ or leaving it unchanged. At every iteration, the log-likelihood is non-decreasing.
Convergence Proof
The E-step sets $q^{t+1}(z) = P_{\theta^t}(z|x)$. By the tightness condition derived above, the KL gap vanishes:
$$F_{\theta^t}(q^{t+1}) = \ell(\theta^t)$$
The M-step finds $\theta^{t+1} = \arg\max_\theta F_\theta(q^{t+1})$, so by definition of the argmax:
$$F_{\theta^{t+1}}(q^{t+1}) \geq F_{\theta^t}(q^{t+1})$$
By Jensen's inequality, the ELBO is a lower bound on the true log-likelihood for any $q$:
$$\ell(\theta^{t+1}) \geq F_{\theta^{t+1}}(q^{t+1})$$
Chaining the three facts: $\ell(\theta^{t+1}) \geq F_{\theta^{t+1}}(q^{t+1}) \geq F_{\theta^t}(q^{t+1}) = \ell(\theta^t)$.
EM for GMMs: The Responsibility Formula
The general EM framework becomes concrete and beautiful when applied to Gaussian Mixture Models.
E-step: compute responsibilities. The posterior $P_{\theta^t}(Z=j \mid x_i)$ is computed by Bayes' theorem, prior times likelihood, normalised:
$$r_{ij}^t = \frac{\alpha_j^t\, \mathcal{N}(x_i;\, \mu_j^t, \Sigma_j^t)}{\sum_{k=1}^M \alpha_k^t\, \mathcal{N}(x_i;\, \mu_k^t, \Sigma_k^t)}$$
$r_{ij}^t$ is the responsibility of component $j$ for data point $i$ at iteration $t$. It is a soft assignment: every component has some responsibility for every data point. $\sum_j r_{ij}^t = 1$ for each $i$.
M-step: update parameters. With responsibilities fixed, the ELBO separates into $M$ independent weighted problems, each with a closed-form solution:
$$\alpha_j^{t+1} = \frac{1}{N}\sum_{i=1}^N r_{ij}^t, \qquad \mu_j^{t+1} = \frac{\sum_i r_{ij}^t\, x_i}{\sum_i r_{ij}^t}, \qquad \Sigma_j^{t+1} = \frac{\sum_i r_{ij}^t\,(x_i - \mu_j^{t+1})(x_i - \mu_j^{t+1})^\top}{\sum_i r_{ij}^t}$$
Each component's new mean is the responsibility-weighted average of all data points. Each component's new mixing weight is its average responsibility. Each component's new covariance is the responsibility-weighted scatter matrix.
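Putting the E-step and M-step together for a 1-D, 2-component GMM gives the following sketch (synthetic data and scalar variances instead of full covariance matrices); the monotonicity guarantee from the convergence proof is checked at the end:

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, alphas, mus, vars_):
    return sum(math.log(sum(a * normal_pdf(x, m, v)
                            for a, m, v in zip(alphas, mus, vars_)))
               for x in xs)

def em_step(xs, alphas, mus, vars_):
    # E-step: responsibilities r[i][j] = P(Z=j | x_i) via Bayes' theorem.
    r = []
    for x in xs:
        w = [a * normal_pdf(x, m, v) for a, m, v in zip(alphas, mus, vars_)]
        tot = sum(w)
        r.append([wj / tot for wj in w])
    # M-step: closed-form responsibility-weighted updates.
    N, M = len(xs), len(mus)
    Nj = [sum(r[i][j] for i in range(N)) for j in range(M)]
    alphas = [nj / N for nj in Nj]
    mus = [sum(r[i][j] * xs[i] for i in range(N)) / Nj[j] for j in range(M)]
    vars_ = [sum(r[i][j] * (xs[i] - mus[j]) ** 2 for i in range(N)) / Nj[j]
             for j in range(M)]
    return alphas, mus, vars_

# Synthetic data: two well-separated clusters.
random.seed(1)
xs = ([random.gauss(-2, 0.7) for _ in range(200)]
      + [random.gauss(3, 1.0) for _ in range(200)])
alphas, mus, vars_ = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]

lls = [log_likelihood(xs, alphas, mus, vars_)]
for _ in range(30):
    alphas, mus, vars_ = em_step(xs, alphas, mus, vars_)
    lls.append(log_likelihood(xs, alphas, mus, vars_))

# The convergence guarantee: log-likelihood never decreases.
assert all(b >= a - 1e-7 for a, b in zip(lls, lls[1:]))
```

Different initialisations can land on different local optima, which is why the guarantee is about monotonicity, not global optimality.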
When MLE Fails: The Small Data Problem
Everything we have built so far treats $\theta$ as a fixed unknown number, a deterministic parameter to be estimated from data. This works beautifully when $N$ is large. With small $N$, MLE can be dangerously overconfident.
The canonical example: you observe $D = \{1, 1, 1, 1\}$, four coin flips, all heads. The MLE is: $$\hat{\theta}_{ML} = \frac{\text{number of heads}}{n} = \frac{4}{4} = 1$$
The MLE declares the coin is perfectly biased: every future flip will land heads. If your prior knowledge says $\theta \approx 0.5$ (a fair coin is plausible), this is a dangerously wrong answer. Four flips is not enough evidence to completely overturn that belief.
4 patients treated, all recover. MLE: the drug cures 100% of patients. No doctor would conclude this. The prior that drugs rarely cure every patient matters โ you need far more data before abandoning it.
NIFTY has 5 consecutive up-days. MLE: $P(\text{up}) = 1$. No trader bets their book on this. The prior that daily returns are roughly 50-50 directionally should survive 5 observations. MLE ignores that prior entirely.
Bayesian Estimation: Bringing In Prior Knowledge
The Bayesian fix is conceptually simple: instead of treating $\theta$ as a fixed unknown, treat it as a random variable with a distribution, the prior $P(\theta)$. The prior encodes your beliefs about $\theta$ before seeing any data. Bayes' theorem combines prior and likelihood to give the posterior:
$$P(\theta \mid x) = \frac{P(x \mid \theta)\,P(\theta)}{P(x)} \;\propto\; P(x \mid \theta)\,P(\theta)$$
The posterior is a complete distribution over $\theta$ given the data. It is the full Bayesian answer. Two common ways to turn the posterior into a point estimate:
| Estimator | Definition | When to use |
|---|---|---|
| MLE | $\arg\max_\theta\, P(x|\theta)$ | Large $N$; no reliable prior |
| MAP | $\arg\max_\theta\, P(\theta|x)$ | Small $N$; prior knowledge available |
| Posterior mean | $\mathbb{E}[\theta|x]$ | Full Bayesian; minimises squared error |
MAP Estimation and Conjugate Priors
A prior is called conjugate to a likelihood if the posterior has the same functional form as the prior, just with updated parameters. Conjugate priors make MAP computation analytically tractable and are the primary tool for Bayesian ML in practice.
The Beta-Bernoulli example. For $x \sim \text{Bernoulli}(\theta)$, the conjugate prior is the Beta distribution: $$P(\theta) = \text{Beta}(\theta;\, \alpha,\, \beta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}, \qquad \theta \in [0,1]$$
After observing $n$ flips with $S = \sum_i x_i$ heads, the posterior is: $$P(\theta \mid x) \propto P(x|\theta)\cdot P(\theta) \propto \theta^{S}(1-\theta)^{n-S} \cdot \theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{S+\alpha-1}(1-\theta)^{n-S+\beta-1}$$
This is again a Beta distribution, Beta$(S+\alpha,\, n-S+\beta)$. The MAP estimate (the mode of Beta$(a,b)$ is $\frac{a-1}{a+b-2}$):
$$\hat{\theta}_{MAP} = \frac{S + \alpha - 1}{n + \alpha + \beta - 2}$$
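The Beta-Bernoulli MAP formula is two lines of code. Applied to the four-heads example from earlier with a Beta(2,2) prior (one pseudo-head, one pseudo-tail), it pulls the estimate back from certainty (a sketch):

```python
def mle_bernoulli(S, n):
    """MLE: the raw fraction of heads."""
    return S / n

def map_bernoulli(S, n, alpha, beta):
    """MAP: mode of the Beta(S + alpha, n - S + beta) posterior."""
    return (S + alpha - 1) / (n + alpha + beta - 2)

# Four flips, all heads: MLE declares certainty, MAP does not.
assert mle_bernoulli(4, 4) == 1.0
assert abs(map_bernoulli(4, 4, 2, 2) - 5 / 6) < 1e-12   # 5/6 ~ 0.833
```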
Six daily NIFTY log-returns observed: $x = \{-1.2, -0.8, -0.5, +0.9, +1.3, +1.8\}$ (in %). We fit a 2-component GMM with initial parameters: $\alpha_1^0 = \alpha_2^0 = 0.5$, $\mu_1^0 = -0.5\%$, $\mu_2^0 = +1.0\%$, $\sigma_1 = \sigma_2 = 1.0\%$ (fixed for this example).
For each data point $x_i$, compute $r_{i1}^0 = \frac{\alpha_1^0 \mathcal{N}(x_i;\mu_1^0,\sigma^2)}{\alpha_1^0 \mathcal{N}(x_i;\mu_1^0,\sigma^2) + \alpha_2^0 \mathcal{N}(x_i;\mu_2^0,\sigma^2)}$. With equal mixing weights, the $\alpha$ terms cancel and this reduces to the ratio of Gaussian densities:
$$r_{i1}^0 = \frac{\mathcal{N}(x_i;\mu_1^0,\sigma^2)}{\mathcal{N}(x_i;\mu_1^0,\sigma^2) + \mathcal{N}(x_i;\mu_2^0,\sigma^2)}$$
The three negative returns are assigned mostly to component 1 (calm/negative regime); the three positive returns are assigned mostly to component 2 (positive regime). The soft assignments reflect proximity to each component mean.
Responsibility-weighted means:
$$\mu_1^1 = \frac{\sum_i r_{i1}^0\, x_i}{\sum_i r_{i1}^0}, \qquad \mu_2^1 = \frac{\sum_i r_{i2}^0\, x_i}{\sum_i r_{i2}^0}$$
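The six-point example can be checked numerically; a sketch of one EM iteration for the means, using the parameters stated above:

```python
import math

xs = [-1.2, -0.8, -0.5, 0.9, 1.3, 1.8]   # daily log-returns, in %
mu1, mu2, sigma = -0.5, 1.0, 1.0          # initial parameters from the example

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# E-step: responsibilities of component 1 (component 2 gets 1 - r).
r1 = [normal_pdf(x, mu1, sigma)
      / (normal_pdf(x, mu1, sigma) + normal_pdf(x, mu2, sigma)) for x in xs]

# Negative returns lean toward component 1, positive toward component 2.
assert all(r > 0.5 for r, x in zip(r1, xs) if x < 0)
assert all(r < 0.5 for r, x in zip(r1, xs) if x > 0)

# M-step: responsibility-weighted means.
mu1_new = sum(r * x for r, x in zip(r1, xs)) / sum(r1)
mu2_new = sum((1 - r) * x for r, x in zip(r1, xs)) / sum(1 - r for r in r1)
```

The updated means should stay close to the initial $-0.5\%$ and $+1.0\%$ here, because the initialisation already separates the two clusters well.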
Over the last 6 trading days, NIFTY closed up on 5 and down on 1. A trader wants to estimate $\theta = P(\text{NIFTY up next day})$.
The MLE is $\hat{\theta}_{ML} = \frac{5}{6} \approx 0.83$: an 83% chance of an up-day tomorrow, based on just 6 observations.
A Beta$(3,3)$ prior peaks at $\theta = 0.5$ and encodes a moderate belief that the market is roughly balanced. It contributes 2 pseudo-heads and 2 pseudo-tails:
$$\hat{\theta}_{MAP} = \frac{5 + 3 - 1}{6 + 3 + 3 - 2} = \frac{7}{10} = 0.70$$
A weaker Beta$(2,2)$ prior contributes only 1 pseudo-head and 1 pseudo-tail:
$$\hat{\theta}_{MAP} = \frac{5 + 2 - 1}{6 + 2 + 2 - 2} = \frac{6}{8} = 0.75$$
Suppose after 100 trading days we observe 58 up-days ($S = 58$). Under the Beta$(3,3)$ prior:
$$\hat{\theta}_{ML} = \frac{58}{100} = 0.58, \qquad \hat{\theta}_{MAP} = \frac{58 + 2}{100 + 4} = \frac{60}{104} \approx 0.577$$
With large $n$, MAP and MLE converge: the data overwhelms the prior.
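The small-$n$ versus large-$n$ behaviour can be checked in a few lines, assuming the moderate prior with 2 pseudo-heads and 2 pseudo-tails is Beta(3,3):

```python
def map_est(S, n, a, b):
    """MAP estimate under a Beta(a, b) prior: mode of Beta(S + a, n - S + b)."""
    return (S + a - 1) / (n + a + b - 2)

# Small sample: 5 up-days out of 6. The prior pulls the estimate toward 0.5.
mle_small, map_small = 5 / 6, map_est(5, 6, 3, 3)        # ~0.833 vs 0.70
assert abs(map_small - mle_small) > 0.1    # noticeable shrinkage

# Large sample: 58 up-days out of 100. The prior barely matters.
mle_large, map_large = 58 / 100, map_est(58, 100, 3, 3)  # 0.58 vs ~0.577
assert abs(map_large - mle_large) < 0.01   # data overwhelms the prior
```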
We have now seen two paradigms for density estimation: parametric (MLE/MAP: assume a model family, estimate parameters) and latent variable (EM: handle hidden structure within a parametric family). Chapter 6 abandons the parametric assumption entirely. What if we estimate the density at any point directly from the data, with no model family assumed, letting the data speak its own shape?
In the E-step of EM, what is being computed, and why does it make the ELBO exactly equal to the log-likelihood at the current $\theta^t$?
The E-step computes the posterior $q^{t+1}(z) = P_{\theta^t}(z|x)$. We showed that $\ell(\theta) - F_\theta(q) = D_{KL}(q(z) \| P_\theta(z|x))$. When $q = P_{\theta^t}(z|x)$, the KL divergence between $q$ and $P_{\theta^t}(z|x)$ is zero (they are equal). Therefore the gap $\ell(\theta^t) - F_{\theta^t}(q^{t+1}) = 0$, making the ELBO tight: $F_{\theta^t}(q^{t+1}) = \ell(\theta^t)$.
Option A describes the M-step, not the E-step. Option C describes computing the marginal directly, which is what we're trying to avoid (it involves the intractable log-sum). Option D is a Monte Carlo approach (used in variational inference variants) but not standard EM.
EM guarantees $\ell(\theta^{t+1}) \geq \ell(\theta^t)$ at every iteration. Does this mean EM always finds the global maximum of $\ell(\theta)$?
The convergence guarantee ($\ell(\theta^{t+1}) \geq \ell(\theta^t)$) says EM never moves downhill. But non-decreasing does not imply convergence to the global maximum; it only means EM climbs toward whatever local maximum or saddle point it can reach from its starting point. The GMM log-likelihood has many local optima: different initialisations can produce completely different fitted mixture components.
Option A is wrong: bounded and non-decreasing guarantees convergence to a limit, but that limit may be a local maximum. Option B is wrong: the ELBO being tight at each iterate does not mean the global optimum is found. Option D is wrong: the convergence proof rigorously shows EM never decreases the log-likelihood; this is guaranteed, not approximate.
In a 2-component GMM, the current responsibilities for 4 data points are $r_{i1} = \{0.9, 0.8, 0.2, 0.1\}$ and the data is $x = \{-2, -1, +3, +4\}$. What is the M-step update for $\mu_1$?
The M-step update is: $\mu_1^{\text{new}} = \frac{\sum_i r_{i1} x_i}{\sum_i r_{i1}}$. Every data point contributes to the update, weighted by its responsibility. Numerator: $0.9(-2) + 0.8(-1) + 0.2(3) + 0.1(4) = -1.8 - 0.8 + 0.6 + 0.4 = -1.6$. Denominator: $0.9 + 0.8 + 0.2 + 0.1 = 2.0$. Result: $-0.8$.
Option A is wrong: it ignores data points with low (but nonzero) responsibility. Option D is wrong: EM uses all data points with their responsibility weights; hard thresholding is k-means, not EM. The soft assignment is the core distinction between EM and k-means.
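The arithmetic in this answer takes one line to verify:

```python
r1 = [0.9, 0.8, 0.2, 0.1]    # responsibilities of component 1
xs = [-2, -1, 3, 4]          # data points
mu1_new = sum(r * x for r, x in zip(r1, xs)) / sum(r1)   # weighted mean
assert abs(mu1_new - (-0.8)) < 1e-9
```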
A Bernoulli model is fit to $n = 3$ observations, all 1s. Under MLE $\hat{\theta} = 1$. Under MAP with a Beta$(2,2)$ prior, $\hat{\theta} = 3/5 = 0.6$. A colleague says "the MAP estimate is wrong because the data clearly shows $\theta = 1$." What is the best response?
The colleague commits the classic small-sample error: treating the MLE as the truth because it matches the data exactly. But the MLE is a statistic, a function of the data, not the true parameter. With $n = 3$ observations, the MLE has enormous variance: if the true $\theta = 0.7$, three consecutive heads would occur with probability $0.7^3 = 0.343$, over a third of the time. The data is not conclusive.
The Beta(2,2) prior adds one pseudo-head and one pseudo-tail, contributing the equivalent of 2 "imaginary" observations near the centre. With only 3 real observations, this is a reasonable amount of regularisation. As $n$ grows, the MAP converges to the MLE and the prior becomes irrelevant: with $n = 1000$ and 1000 ones, $\theta_{MAP} = 1001/1002 \approx 0.999$, essentially the same as the MLE of 1. Option D is wrong: the choice has concrete consequences for estimation quality and is not arbitrary.
Q1โ3 are direct computations. Q4โ7 ask you to connect EM and MAP to the broader framework. Q8โ10 require careful reasoning about convergence, model selection, and the prior-as-regulariser connection.