Hidden Variables,
Prior Beliefs, and the EM Algorithm
Some structure in data is visible. Some is not. The EM algorithm is the principled answer to learning from data when half of what matters is hidden, and MAP estimation is the answer to learning when you have more prior knowledge than data.
The Problem Left Open
Chapter 4 ended with an obstacle. The GMM log-likelihood $\ell(\theta) = \frac{1}{N}\sum_i \log \sum_j \alpha_j \mathcal{N}(v_i; \mu_j, \Sigma_j)$ has no closed-form maximiser because the log sits outside a sum, coupling all components together. The root cause: we don't know which component generated each data point. That hidden information, the component assignment $z_i$, is the latent variable.
This chapter builds the machinery to handle it, and introduces a second, related problem: what happens when MLE gives catastrophically overconfident answers because $N$ is too small? Both problems share the same cure: bringing in information beyond what the data alone provides.
Latent Variable Models
A latent variable model posits that data is generated by a two-stage process. First, a hidden variable $Z$ is drawn from some distribution. Then, the observed data $x$ is drawn from a distribution conditioned on $Z$. The data $x$ is what we see. The variable $Z$ is never observed.
For a GMM specifically: $Z \in \{1, \ldots, M\}$ is the component label, $P_\theta(Z=j) = \alpha_j$ is the mixing weight, and $P_\theta(x \mid Z=j) = \mathcal{N}(x; \mu_j, \Sigma_j)$ is the component density. Marginalising over $Z$ recovers the mixture: $$P_\theta(x) = \sum_{j=1}^M \alpha_j\, \mathcal{N}(x; \mu_j, \Sigma_j)$$
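The two-stage story can be sketched directly in code. The parameters below are made up for illustration; the point is that the generative process draws the hidden $Z$ first and then $x$ given $Z$, while the marginal density sums $Z$ out:

```python
import math
import random

# Illustrative 2-component 1-D GMM (parameters are made up for this sketch).
alphas = [0.3, 0.7]    # mixing weights P(Z=j); sum to 1
mus    = [-2.0, 1.0]   # component means
sigmas = [0.5, 1.0]    # component standard deviations

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def sample():
    """Two-stage generative process: draw the hidden Z, then x given Z."""
    z = random.choices([0, 1], weights=alphas)[0]
    x = random.gauss(mus[z], sigmas[z])
    return z, x

def marginal(x):
    """Marginalise over Z: P(x) = sum_j alpha_j * N(x; mu_j, sigma_j)."""
    return sum(a * normal_pdf(x, m, s) for a, m, s in zip(alphas, mus, sigmas))

random.seed(0)
draws = [sample() for _ in range(10000)]
xs = [x for _, x in draws]   # in practice only x is observed; z is discarded
```

Roughly 30% of the hidden labels come out as component 0, matching $\alpha_1 = 0.3$, even though a real dataset would hand us only `xs`.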
$Z$ = disease subtype (never directly observed on admission). $x$ = symptom profile. The mixture over undiagnosed subtypes generates the observed symptom distribution. Knowing $Z$ would make diagnosis straightforward; the whole problem is that we don't.
$Z$ = market regime: calm trending, normal, high-volatility crash. $x$ = daily log-return. The regime is never announced; you infer it from returns. A 3-component GMM models the three regimes, with $\alpha_j$ as the prior probability of each regime.
The ELBO: A Tractable Lower Bound
The MLE objective $\ell(\theta) = \log P_\theta(x) = \log \sum_z P_\theta(x,z)$ has the log outside the sum and is therefore intractable. The key move: introduce an arbitrary distribution $q(z)$ over the latent variable, and use it to construct a lower bound that is tractable.
For any distribution $q(z)$ with the same support as $P_\theta(z|x)$:
$$\ell(\theta) = \log \sum_z P_\theta(x,z) = \log \sum_z q(z)\,\frac{P_\theta(x,z)}{q(z)} = \log\, \mathbb{E}_{z\sim q}\!\left[\frac{P_\theta(x,z)}{q(z)}\right]$$
Since $\log$ is concave, Jensen's inequality gives $\log\,\mathbb{E}[Y] \geq \mathbb{E}[\log Y]$:
$$\ell(\theta) \;\geq\; \mathbb{E}_{z\sim q}\!\left[\log \frac{P_\theta(x,z)}{q(z)}\right] \;=:\; F_\theta(q)$$
Expand $F_\theta(q)$ into two interpretable terms:
$$F_\theta(q) = \mathbb{E}_{z\sim q}\big[\log P_\theta(x,z)\big] + H(q), \qquad H(q) = -\sum_z q(z)\log q(z)$$
The first term measures how well $\theta$ explains both observed and latent data. The second term is the entropy of $q$; it rewards spreading probability mass, preventing $q$ from collapsing to a point.
When Is the ELBO Tight?
The gap between the ELBO and the true log-likelihood has an exact expression, and it tells us precisely what $q$ to choose.
Use $P_\theta(x,z) = P_\theta(x)\cdot P_\theta(z|x)$ to split the joint:
$$F_\theta(q) = \mathbb{E}_{z\sim q}\!\left[\log \frac{P_\theta(x)\,P_\theta(z|x)}{q(z)}\right] = \log P_\theta(x) - D_{KL}\big(q(z)\,\|\,P_\theta(z|x)\big)$$
Since $D_{KL}(q \| P_\theta(z|x)) \geq 0$ with equality if and only if $q = P_\theta(z|x)$:
$$\ell(\theta) - F_\theta(q) = D_{KL}\big(q(z)\,\|\,P_\theta(z|x)\big) \;\geq\; 0$$
The ELBO equals the log-likelihood exactly when $q$ is the true posterior $P_\theta(z|x)$.
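Both the bound and its tightness can be verified on a toy discrete example. The joint probabilities below are arbitrary illustrative numbers, with a single observed $x$ and a binary latent $z$:

```python
import math

# Arbitrary joint P(x, z) for one fixed observed x and latent z in {0, 1}.
joint = [0.12, 0.28]                 # P(x, z=0), P(x, z=1)
log_lik = math.log(sum(joint))       # l(theta) = log P(x) = log 0.4

def elbo(q):
    """F(q) = E_q[log P(x,z)] + H(q) = sum_z q(z) * log(P(x,z) / q(z))."""
    return sum(qz * math.log(pxz / qz) for qz, pxz in zip(q, joint) if qz > 0)

# Any valid q gives a lower bound on the log-likelihood ...
for q in ([0.5, 0.5], [0.9, 0.1], [0.2, 0.8]):
    assert elbo(q) <= log_lik + 1e-12

# ... and the bound is tight exactly at the posterior q(z) = P(z | x).
posterior = [p / sum(joint) for p in joint]    # [0.3, 0.7]
assert abs(elbo(posterior) - log_lik) < 1e-12
```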
The EM Algorithm
Instead of maximising $\ell(\theta)$ directly (intractable), EM maximises the ELBO $F_\theta(q)$ jointly over both $\theta$ and $q$ by alternating between two tractable steps.
The E-step sets $q$ to the true posterior under the current $\theta^t$, making the ELBO tight: $F_{\theta^t}(q^{t+1}) = \ell(\theta^t)$. The M-step then finds the $\theta$ that maximises this tight bound, necessarily increasing $\ell(\theta)$ or leaving it unchanged. At every iteration, the log-likelihood is non-decreasing.
Convergence Proof
The E-step sets $q^{t+1}(z) = P_{\theta^t}(z|x)$. By the tightness condition derived above, the KL gap vanishes:
$$F_{\theta^t}(q^{t+1}) = \ell(\theta^t)$$
The M-step finds $\theta^{t+1} = \arg\max_\theta F_\theta(q^{t+1})$, so by definition of the argmax:
$$F_{\theta^{t+1}}(q^{t+1}) \geq F_{\theta^t}(q^{t+1})$$
By Jensen's inequality, the ELBO is a lower bound on the true log-likelihood for any $q$:
$$\ell(\theta^{t+1}) \geq F_{\theta^{t+1}}(q^{t+1})$$
Chaining the three facts: $\ell(\theta^{t+1}) \geq F_{\theta^{t+1}}(q^{t+1}) \geq F_{\theta^t}(q^{t+1}) = \ell(\theta^t)$.
EM for GMMs: The Responsibility Formula
The general EM framework becomes concrete and beautiful when applied to Gaussian Mixture Models.
E-step: compute responsibilities. The posterior $P_{\theta^t}(Z=j \mid x_i)$ is computed by Bayes' theorem, prior times likelihood, normalised:
$$r_{ij}^t = \frac{\alpha_j^t\, \mathcal{N}(x_i;\, \mu_j^t, \Sigma_j^t)}{\sum_{k=1}^M \alpha_k^t\, \mathcal{N}(x_i;\, \mu_k^t, \Sigma_k^t)}$$
$r_{ij}^t$ is the responsibility of component $j$ for data point $i$ at iteration $t$. It is a soft assignment: every component has some responsibility for every data point. $\sum_j r_{ij}^t = 1$ for each $i$.
M-step: update parameters. With responsibilities fixed, the ELBO separates into $M$ independent weighted problems, each with a closed-form solution:
$$\alpha_j^{t+1} = \frac{1}{N}\sum_{i=1}^N r_{ij}^t, \qquad \mu_j^{t+1} = \frac{\sum_i r_{ij}^t\, x_i}{\sum_i r_{ij}^t}, \qquad \Sigma_j^{t+1} = \frac{\sum_i r_{ij}^t\,(x_i - \mu_j^{t+1})(x_i - \mu_j^{t+1})^\top}{\sum_i r_{ij}^t}$$
Each component's new mean is the responsibility-weighted average of all data points. Each component's new mixing weight is its average responsibility. Each component's new covariance is the responsibility-weighted scatter matrix.
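Putting the E-step and M-step together for a 1-D, 2-component GMM gives the following sketch (synthetic data and scalar variances instead of full covariance matrices); the monotonicity guarantee from the convergence proof is checked at the end:

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, alphas, mus, vars_):
    return sum(math.log(sum(a * normal_pdf(x, m, v)
                            for a, m, v in zip(alphas, mus, vars_)))
               for x in xs)

def em_step(xs, alphas, mus, vars_):
    # E-step: responsibilities r[i][j] = P(Z=j | x_i) via Bayes' theorem.
    r = []
    for x in xs:
        w = [a * normal_pdf(x, m, v) for a, m, v in zip(alphas, mus, vars_)]
        tot = sum(w)
        r.append([wj / tot for wj in w])
    # M-step: closed-form responsibility-weighted updates.
    N, M = len(xs), len(mus)
    Nj = [sum(r[i][j] for i in range(N)) for j in range(M)]
    alphas = [nj / N for nj in Nj]
    mus = [sum(r[i][j] * xs[i] for i in range(N)) / Nj[j] for j in range(M)]
    vars_ = [sum(r[i][j] * (xs[i] - mus[j]) ** 2 for i in range(N)) / Nj[j]
             for j in range(M)]
    return alphas, mus, vars_

# Synthetic data: two well-separated clusters.
random.seed(1)
xs = ([random.gauss(-2, 0.7) for _ in range(200)]
      + [random.gauss(3, 1.0) for _ in range(200)])
alphas, mus, vars_ = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]

lls = [log_likelihood(xs, alphas, mus, vars_)]
for _ in range(30):
    alphas, mus, vars_ = em_step(xs, alphas, mus, vars_)
    lls.append(log_likelihood(xs, alphas, mus, vars_))

# The convergence guarantee: log-likelihood never decreases.
assert all(b >= a - 1e-7 for a, b in zip(lls, lls[1:]))
```

Different initialisations can land on different local optima, which is why the guarantee is about monotonicity, not global optimality.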
When MLE Fails: The Small Data Problem
Everything we have built so far treats $\theta$ as a fixed unknown number, a deterministic parameter to be estimated from data. This works beautifully when $N$ is large. With small $N$, MLE can be dangerously overconfident.
The canonical example: you observe $D = \{1, 1, 1, 1\}$, four coin flips, all heads. The MLE is: $$\hat{\theta}_{ML} = \frac{\text{number of heads}}{n} = \frac{4}{4} = 1$$
The MLE declares the coin is perfectly biased: every future flip will land heads. If your prior knowledge says $\theta \approx 0.5$ (a fair coin is plausible), this is a dangerously wrong answer. Four flips is not enough evidence to completely overturn that belief.
4 patients treated, all recover. MLE: the drug cures 100% of patients. No doctor would conclude this. The prior that drugs rarely cure every patient matters โ you need far more data before abandoning it.
NIFTY has 5 consecutive up-days. MLE: $P(\text{up}) = 1$. No trader bets their book on this. The prior that daily returns are roughly 50-50 directionally should survive 5 observations. MLE ignores that prior entirely.
Bayesian Estimation: Bringing In Prior Knowledge
The Bayesian fix is conceptually simple: instead of treating $\theta$ as a fixed unknown, treat it as a random variable with a distribution, the prior $P(\theta)$. The prior encodes your beliefs about $\theta$ before seeing any data. Bayes' theorem combines prior and likelihood to give the posterior:
$$P(\theta \mid x) = \frac{P(x \mid \theta)\,P(\theta)}{P(x)} \;\propto\; P(x \mid \theta)\,P(\theta)$$
The posterior is a complete distribution over $\theta$ given the data. It is the full Bayesian answer. Two common ways to turn the posterior into a point estimate:
| Estimator | Definition | When to use |
|---|---|---|
| MLE | $\arg\max_\theta\, P(x|\theta)$ | Large $N$; no reliable prior |
| MAP | $\arg\max_\theta\, P(\theta|x)$ | Small $N$; prior knowledge available |
| Posterior mean | $\mathbb{E}[\theta|x]$ | Full Bayesian; minimises squared error |
MAP Estimation and Conjugate Priors
A prior is called conjugate to a likelihood if the posterior has the same functional form as the prior, just with updated parameters. Conjugate priors make MAP computation analytically tractable and are the primary tool for Bayesian ML in practice.
The Beta-Bernoulli example. For $x \sim \text{Bernoulli}(\theta)$, the conjugate prior is the Beta distribution: $$P(\theta) = \text{Beta}(\theta;\, \alpha,\, \beta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}, \qquad \theta \in [0,1]$$
After observing $n$ flips with $S = \sum_i x_i$ heads, the posterior is: $$P(\theta \mid x) \propto P(x|\theta)\cdot P(\theta) \propto \theta^{S}(1-\theta)^{n-S} \cdot \theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{S+\alpha-1}(1-\theta)^{n-S+\beta-1}$$
This is again a Beta distribution, Beta$(S+\alpha,\, n-S+\beta)$. The MAP estimate (the mode of Beta$(a,b)$ is $\frac{a-1}{a+b-2}$):
$$\hat{\theta}_{MAP} = \frac{S + \alpha - 1}{n + \alpha + \beta - 2}$$
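The Beta-Bernoulli MAP formula is two lines of code. Applied to the four-heads example from earlier with a Beta(2,2) prior (one pseudo-head, one pseudo-tail), it pulls the estimate back from certainty (a sketch):

```python
def mle_bernoulli(S, n):
    """MLE: the raw fraction of heads."""
    return S / n

def map_bernoulli(S, n, alpha, beta):
    """MAP: mode of the Beta(S + alpha, n - S + beta) posterior."""
    return (S + alpha - 1) / (n + alpha + beta - 2)

# Four flips, all heads: MLE declares certainty, MAP does not.
assert mle_bernoulli(4, 4) == 1.0
assert abs(map_bernoulli(4, 4, 2, 2) - 5 / 6) < 1e-12   # 5/6 ~ 0.833
```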
Six daily NIFTY log-returns observed: $x = \{-1.2, -0.8, -0.5, +0.9, +1.3, +1.8\}$ (in %). We fit a 2-component GMM with initial parameters: $\alpha_1^0 = \alpha_2^0 = 0.5$, $\mu_1^0 = -0.5\%$, $\mu_2^0 = +1.0\%$, $\sigma_1 = \sigma_2 = 1.0\%$ (fixed for this example).
For each data point $x_i$, compute $r_{i1}^0 = \frac{\alpha_1^0 \mathcal{N}(x_i;\mu_1^0,\sigma^2)}{\alpha_1^0 \mathcal{N}(x_i;\mu_1^0,\sigma^2) + \alpha_2^0 \mathcal{N}(x_i;\mu_2^0,\sigma^2)}$. With equal mixing weights, the $\alpha$ terms cancel and this reduces to the ratio of Gaussian densities:
$$r_{i1}^0 = \frac{\mathcal{N}(x_i;\mu_1^0,\sigma^2)}{\mathcal{N}(x_i;\mu_1^0,\sigma^2) + \mathcal{N}(x_i;\mu_2^0,\sigma^2)}$$
The three negative returns are assigned mostly to component 1 (calm/negative regime); the three positive returns are assigned mostly to component 2 (positive regime). The soft assignments reflect proximity to each component mean.
Responsibility-weighted means:
$$\mu_1^1 = \frac{\sum_i r_{i1}^0\, x_i}{\sum_i r_{i1}^0}, \qquad \mu_2^1 = \frac{\sum_i r_{i2}^0\, x_i}{\sum_i r_{i2}^0}$$
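The six-point example can be checked numerically; a sketch of one EM iteration for the means, using the parameters stated above:

```python
import math

xs = [-1.2, -0.8, -0.5, 0.9, 1.3, 1.8]   # daily log-returns, in %
mu1, mu2, sigma = -0.5, 1.0, 1.0          # initial parameters from the example

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# E-step: responsibilities of component 1 (component 2 gets 1 - r).
r1 = [normal_pdf(x, mu1, sigma)
      / (normal_pdf(x, mu1, sigma) + normal_pdf(x, mu2, sigma)) for x in xs]

# Negative returns lean toward component 1, positive toward component 2.
assert all(r > 0.5 for r, x in zip(r1, xs) if x < 0)
assert all(r < 0.5 for r, x in zip(r1, xs) if x > 0)

# M-step: responsibility-weighted means.
mu1_new = sum(r * x for r, x in zip(r1, xs)) / sum(r1)
mu2_new = sum((1 - r) * x for r, x in zip(r1, xs)) / sum(1 - r for r in r1)
```

The updated means should stay close to the initial $-0.5\%$ and $+1.0\%$ here, because the initialisation already separates the two clusters well.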
Over the last 6 trading days, NIFTY closed up on 5 and down on 1. A trader wants to estimate $\theta = P(\text{NIFTY up next day})$.
The MLE is $\hat{\theta}_{ML} = \frac{5}{6} \approx 0.83$: an 83% chance of an up-day tomorrow, based on just 6 observations.
A Beta$(3,3)$ prior peaks at $\theta = 0.5$ and encodes a moderate belief that the market is roughly balanced. It contributes 2 pseudo-heads and 2 pseudo-tails:
$$\hat{\theta}_{MAP} = \frac{5 + 3 - 1}{6 + 3 + 3 - 2} = \frac{7}{10} = 0.70$$
A weaker Beta$(2,2)$ prior contributes only 1 pseudo-head and 1 pseudo-tail:
$$\hat{\theta}_{MAP} = \frac{5 + 2 - 1}{6 + 2 + 2 - 2} = \frac{6}{8} = 0.75$$
Suppose after 100 trading days we observe 58 up-days ($S = 58$). Under the Beta$(3,3)$ prior:
$$\hat{\theta}_{ML} = \frac{58}{100} = 0.58, \qquad \hat{\theta}_{MAP} = \frac{58 + 2}{100 + 4} = \frac{60}{104} \approx 0.577$$
With large $n$, MAP and MLE converge: the data overwhelms the prior.
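The small-$n$ versus large-$n$ behaviour can be checked in a few lines, assuming the moderate prior with 2 pseudo-heads and 2 pseudo-tails is Beta(3,3):

```python
def map_est(S, n, a, b):
    """MAP estimate under a Beta(a, b) prior: mode of Beta(S + a, n - S + b)."""
    return (S + a - 1) / (n + a + b - 2)

# Small sample: 5 up-days out of 6. The prior pulls the estimate toward 0.5.
mle_small, map_small = 5 / 6, map_est(5, 6, 3, 3)        # ~0.833 vs 0.70
assert abs(map_small - mle_small) > 0.1    # noticeable shrinkage

# Large sample: 58 up-days out of 100. The prior barely matters.
mle_large, map_large = 58 / 100, map_est(58, 100, 3, 3)  # 0.58 vs ~0.577
assert abs(map_large - mle_large) < 0.01   # data overwhelms the prior
```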
We have now seen two paradigms for density estimation: parametric (MLE/MAP: assume a model family, estimate parameters) and latent variable (EM: handle hidden structure within a parametric family). Chapter 6 abandons the parametric assumption entirely. What if we estimate the density at any point directly from the data, with no model family assumed, letting the data speak its own shape?
In the E-step of EM, what is being computed, and why does it make the ELBO exactly equal to the log-likelihood at the current $\theta^t$?
The E-step computes the posterior $q^{t+1}(z) = P_{\theta^t}(z|x)$. We showed that $\ell(\theta) - F_\theta(q) = D_{KL}(q(z) \| P_\theta(z|x))$. When $q = P_{\theta^t}(z|x)$, the KL divergence between $q$ and $P_{\theta^t}(z|x)$ is zero (they are equal). Therefore the gap $\ell(\theta^t) - F_{\theta^t}(q^{t+1}) = 0$, making the ELBO tight: $F_{\theta^t}(q^{t+1}) = \ell(\theta^t)$.
Option A describes the M-step, not the E-step. Option C describes computing the marginal directly, which is what we're trying to avoid (it involves the intractable log-sum). Option D is a Monte Carlo approach (used in variational inference variants) but not standard EM.
EM guarantees $\ell(\theta^{t+1}) \geq \ell(\theta^t)$ at every iteration. Does this mean EM always finds the global maximum of $\ell(\theta)$?
The convergence guarantee ($\ell(\theta^{t+1}) \geq \ell(\theta^t)$) says EM never moves downhill. But non-decreasing does not imply convergence to the global maximum; it only means EM climbs toward whatever local maximum or saddle point it can reach from its starting point. The GMM log-likelihood has many local optima: different initialisations can produce completely different fitted mixture components.
Option A is wrong: bounded and non-decreasing guarantees convergence to a limit, but that limit may be a local maximum. Option B is wrong: the ELBO being tight at each iterate does not mean the global optimum is found. Option D is wrong: the convergence proof rigorously shows EM never decreases the log-likelihood; this is guaranteed, not approximate.
In a 2-component GMM, the current responsibilities for 4 data points are $r_{i1} = \{0.9, 0.8, 0.2, 0.1\}$ and the data is $x = \{-2, -1, +3, +4\}$. What is the M-step update for $\mu_1$?
The M-step update is: $\mu_1^{\text{new}} = \frac{\sum_i r_{i1} x_i}{\sum_i r_{i1}}$. Every data point contributes to the update, weighted by its responsibility. Numerator: $0.9(-2) + 0.8(-1) + 0.2(3) + 0.1(4) = -1.8 - 0.8 + 0.6 + 0.4 = -1.6$. Denominator: $0.9 + 0.8 + 0.2 + 0.1 = 2.0$. Result: $-0.8$.
Option A is wrong: it ignores data points with low (but nonzero) responsibility. Option D is wrong: EM uses all data points with their responsibility weights; hard thresholding is k-means, not EM. The soft assignment is the core distinction between EM and k-means.
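The arithmetic in this answer takes one line to verify:

```python
r1 = [0.9, 0.8, 0.2, 0.1]    # responsibilities of component 1
xs = [-2, -1, 3, 4]          # data points
mu1_new = sum(r * x for r, x in zip(r1, xs)) / sum(r1)   # weighted mean
assert abs(mu1_new - (-0.8)) < 1e-9
```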
A Bernoulli model is fit to $n = 3$ observations, all 1s. Under MLE $\hat{\theta} = 1$. Under MAP with a Beta$(2,2)$ prior, $\hat{\theta} = 3/5 = 0.6$. A colleague says "the MAP estimate is wrong because the data clearly shows $\theta = 1$." What is the best response?
The colleague commits the classic small-sample error: treating the MLE as the truth because it matches the data exactly. But the MLE is a statistic, a function of the data, not the true parameter. With $n = 3$ observations, the MLE has enormous variance: if the true $\theta = 0.7$, three consecutive heads would occur with probability $0.7^3 = 0.343$, over a third of the time. The data is not conclusive.
The Beta(2,2) prior adds one pseudo-head and one pseudo-tail, contributing the equivalent of 2 "imaginary" observations near the centre. With only 3 real observations, this is a reasonable amount of regularisation. As $n$ grows, the MAP converges to the MLE and the prior becomes irrelevant: with $n = 1000$ and 1000 ones, $\theta_{MAP} = 1001/1002 \approx 0.999$, essentially the same as the MLE of 1. Option D is wrong: the choice has concrete consequences for estimation quality and is not arbitrary.
Q1โ3 are direct computations. Q4โ7 ask you to connect EM and MAP to the broader framework. Q8โ10 require careful reasoning about convergence, model selection, and the prior-as-regulariser connection.