Linear Models, Generalisation,
and the Bias-Variance Tradeoff
We return to parametric models with everything we've learned. Linear regression is not a simple heuristic: it is ERM, MLE, and the Bayes-optimal predictor under one roof. And the bias-variance decomposition gives us the first rigorous answer to the question every practitioner asks: why does my model work in training and fail in deployment?
ERM and Distribution Estimation Are the Same Thing
The course has presented two seemingly different perspectives on learning. Chapter 2 framed it as ERM: minimise $\hat{R}(h) = \frac{1}{N}\sum_i L(h(x_i), y_i)$ over a hypothesis class $\mathcal{H}$. Chapters 3-5 framed it as distribution estimation: minimise $D_{KL}(P_{\text{data}} \| P_\theta)$ over a parametric family. It is time to make explicit what we have been hinting at all along: these are the same optimisation problem.
This equivalence is the reason we can move freely between "minimise the loss" and "maximise the likelihood": they describe the same parameter. The choice of loss function is the choice of noise model.
Linear Regression: ERM with a Linear Hypothesis Class
The simplest and most important parametric family. The hypothesis class is all affine functions of $x$: $$\mathcal{H} = \{h_\theta : h_\theta(x) = \theta^\top x + \theta_0,\; \theta \in \mathbb{R}^d,\; \theta_0 \in \mathbb{R}\}$$
Using augmented notation (append a 1 to each input, $x \leftarrow [x;\, 1] \in \mathbb{R}^{d+1}$, and fold $\theta_0$ into $\theta \in \mathbb{R}^{d+1}$), we write $h_\theta(x) = \theta^\top x$ cleanly. The ERM objective with squared error is: $$\hat{R}(\theta) = \frac{1}{N}\sum_{i=1}^N \left(\theta^\top x_i - y_i\right)^2$$
This has a closed-form solution. Stack all inputs as rows of a matrix $X \in \mathbb{R}^{N \times (d+1)}$ and all labels as $y \in \mathbb{R}^N$. The objective becomes $\frac{1}{N}\|X\theta - y\|_2^2$. Differentiating and setting to zero gives the normal equations $X^\top X \theta = X^\top y$, so (when $X^\top X$ is invertible): $$\hat{\theta} = (X^\top X)^{-1} X^\top y$$
This is the ordinary least squares (OLS) solution: the closed-form global minimiser of the squared error over all linear hypotheses.
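A minimal sketch of OLS in NumPy, on synthetic data with known coefficients (the data-generating values here are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2*x1 - x2 + 0.5 + noise (coefficients chosen for illustration)
N, d = 200, 2
X_raw = rng.normal(size=(N, d))
y = 2.0 * X_raw[:, 0] - 1.0 * X_raw[:, 1] + 0.5 + 0.1 * rng.normal(size=N)

# Augmented design matrix: append a column of ones so theta_0 folds into theta
X = np.hstack([X_raw, np.ones((N, 1))])

# Solve the normal equations X^T X theta = X^T y directly
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)  # close to [2.0, -1.0, 0.5]
```

In practice `np.linalg.lstsq(X, y)` is preferred over forming $X^\top X$ explicitly, since it is more numerically stable when $X$ is ill-conditioned.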
Generalised Linear Models: Basis Functions
Linear regression is limited to linear relationships between $x$ and $y$. What if the true relationship is nonlinear? The hypothesis class of linear functions may then have too much bias.
The fix: apply a fixed nonlinear transformation $\Phi: \mathbb{R}^d \to \mathbb{R}^p$ to the inputs, then fit a linear model in the transformed space. The transformed dataset is $D_\Phi = \{(\Phi(x_i), y_i)\}$, and we solve: $$\hat{\theta} = (\Phi^\top \Phi)^{-1} \Phi^\top y$$
where $\Phi \in \mathbb{R}^{N \times p}$ is the design matrix with rows $\Phi(x_i)^\top$. The model is linear in $\hat{\theta}$ but nonlinear in the original $x$: a generalised linear model.
Polynomial features. $\Phi(x) = [1, x, x^2, x^3, \ldots, x^p]$. Fit a degree-$p$ polynomial by solving linear regression in the feature space. Choose $p$ larger for more flexibility, smaller for less. This is exactly where bias-variance lives.
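A sketch of the polynomial case: build the design matrix of powers of $x$ and reuse the OLS machinery. The cubic ground truth and noise level here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples from an illustrative cubic: y = 1 - 2x + 0.5x^3 + noise
x = rng.uniform(-2, 2, size=100)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.1 * rng.normal(size=100)

# Feature map Phi(x) = [1, x, x^2, x^3]: linear regression in feature space
p = 3
Phi = np.vander(x, p + 1, increasing=True)  # columns 1, x, x^2, x^3

# Same normal equations as before, now in the transformed space
theta_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(theta_hat)  # close to [1.0, -2.0, 0.0, 0.5]
```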
Market features. $\Phi(x) = [\text{VIX}, \text{VIX}^2, \text{PCR}, \log(\text{OI}), \ldots]$. Adding nonlinear transformations of raw features allows the model to capture regime-dependent relationships. The OLS solution still applies in the transformed space.
Logistic Regression: Linear Models for Classification
For classification with $y \in \{0, 1\}$, a linear model $h_\theta(x) = \theta^\top x$ can predict any real value, but probabilities must lie in $[0,1]$. We need to squash the output into a probability. The standard choice is the sigmoid: $$\sigma(t) = \frac{1}{1+e^{-t}}, \qquad h_\theta(x) = \sigma(\theta^\top x) \in (0, 1)$$
The sigmoid has a natural probabilistic interpretation. If $P_\theta(y=1|x) = \sigma(\theta^\top x)$, then the log-likelihood under this model is the cross-entropy loss: $$-\frac{1}{N}\sum_{i=1}^N \left[y_i \log \sigma(\theta^\top x_i) + (1-y_i)\log(1-\sigma(\theta^\top x_i))\right]$$
Minimising cross-entropy is MLE for the Bernoulli model with sigmoid link โ logistic regression is ERM with cross-entropy loss and a linear hypothesis class. Unlike OLS, this has no closed-form solution and requires numerical optimisation (gradient descent).
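A minimal sketch of that numerical optimisation: plain gradient descent on the average cross-entropy, using synthetic Bernoulli data with an illustrative true parameter vector.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Synthetic binary labels from a known linear model (theta_true is illustrative)
N = 500
X = np.hstack([rng.normal(size=(N, 2)), np.ones((N, 1))])  # augmented inputs
theta_true = np.array([1.5, -2.0, 0.3])
y = (rng.uniform(size=N) < sigmoid(X @ theta_true)).astype(float)

# Gradient descent on the average cross-entropy loss (a convex objective)
theta = np.zeros(3)
lr = 0.5
for _ in range(2000):
    p = sigmoid(X @ theta)
    grad = X.T @ (p - y) / N   # gradient of cross-entropy w.r.t. theta
    theta -= lr * grad

print(theta)  # roughly recovers theta_true
```

The gradient has the same form as in OLS, `X.T @ (prediction - y)`, with the sigmoid inserted; this is a general feature of exponential-family models.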
K-class classification. For $y \in \{0, 1, \ldots, K-1\}$, replace the sigmoid with the softmax function. Define $z = \Theta x$ where $\Theta \in \mathbb{R}^{K \times d}$: $$\text{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
Softmax transforms a $K$-dimensional real vector into a $K$-dimensional probability vector: all entries in $[0,1]$, summing to 1. The model outputs a full probability distribution over classes. Training: minimise cross-entropy loss numerically. This is the multinomial logistic classifier โ the standard linear classifier for multi-class problems.
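A short sketch of softmax in NumPy (the score vector is made up). Subtracting the maximum before exponentiating leaves the result unchanged but avoids overflow:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: shift by max(z), exponentiate, normalise."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, -1.0])  # illustrative scores Theta @ x for K = 3 classes
p = softmax(z)
print(p, p.sum())  # all entries in (0, 1), summing to 1
```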
Generalisation: The Train-Test Gap
We now have models. The question we deferred from Chapter 2 is: how do we know if a model learned from $D_{\text{train}}$ will perform well on new data?
Formally: sample two datasets i.i.d. from $P_{XY}$: a training set $D_{\text{train}} = \{(x_i, y_i)\}_{i=1}^N$ and a test set $D_{\text{test}} = \{(x_i', y_i')\}_{i=1}^{N'}$. Perform ERM on $D_{\text{train}}$ to obtain $\hat{h}^*$. The performance of $\hat{h}^*$ on $D_{\text{test}}$ measures its generalisation.
Define the hypothesis learned from dataset $D$ as $h_D$ (the output of ERM on $D$). The average hypothesis over all possible training sets: $$\bar{h}(x) = \mathbb{E}_{P_D}\!\left[h_D(x)\right]$$
This is not a single model: it is the average prediction at $x$ across all possible training datasets drawn from $P_{XY}$. It captures what the learning algorithm systematically predicts, independent of the specific data it happened to see.
The key quantity is the expected risk of the learned hypothesis: $$\mathbb{E}_{P_D}\!\left[R(h_D)\right] = \mathbb{E}_{P_D}\, \mathbb{E}_{P_{XY}}\!\left[(h_D(x) - y)^2\right]$$
This is what a learning algorithm achieves on average, over all possible training sets. The bias-variance decomposition breaks this into three interpretable components.
The Bias-Variance Decomposition
We decompose the expected squared error $\mathbb{E}_{P_D}\,\mathbb{E}_{P_{XY}}[(h_D(x)-y)^2]$. The Bayes-optimal predictor under squared error is $h^*(x) = \mathbb{E}[y|x]$ (proved in Ch2). Define $\bar{h}(x) = \mathbb{E}_{P_D}[h_D(x)]$ as the average hypothesis.
Write $h_D(x) - y = (h_D(x) - \bar{h}(x)) + (\bar{h}(x) - y)$, expand the square, and take the expectation over $P_D$. The cross term $2\,\mathbb{E}_{P_D}[(h_D(x)-\bar{h}(x))](\bar{h}(x)-y) = 0$ because $\mathbb{E}_{P_D}[h_D(x)-\bar{h}(x)] = 0$ by definition of $\bar{h}$. So: $$\mathbb{E}_{P_D}\!\left[(h_D(x)-y)^2\right] = \mathbb{E}_{P_D}\!\left[(h_D(x)-\bar{h}(x))^2\right] + (\bar{h}(x)-y)^2$$
Expand the second term $(\bar{h}(x)-y)^2$ by adding and subtracting $h^*(x) = \mathbb{E}[y|x]$: $$(\bar{h}(x)-y)^2 = (\bar{h}(x)-h^*(x))^2 + 2(\bar{h}(x)-h^*(x))(h^*(x)-y) + (h^*(x)-y)^2$$
Taking expectation over $P_{XY}$, the cross term involves $\mathbb{E}_{y|x}[h^*(x) - y] = \mathbb{E}[y|x] - \mathbb{E}[y|x] = 0$. So: $$\mathbb{E}_{P_D}\,\mathbb{E}_{P_{XY}}\!\left[(h_D(x)-y)^2\right] = \underbrace{\mathbb{E}_{P_{XY}}\!\left[(\bar{h}(x)-h^*(x))^2\right]}_{\text{Bias}^2} + \underbrace{\mathbb{E}_{P_{XY}}\,\mathbb{E}_{P_D}\!\left[(h_D(x)-\bar{h}(x))^2\right]}_{\text{Variance}} + \underbrace{\mathbb{E}_{P_{XY}}\!\left[(h^*(x)-y)^2\right]}_{\text{Noise}}$$
What Each Term Means
Bias quantifies the effect of the hypothesis class $\mathcal{H}$. If $\mathcal{H}$ contains only linear functions but $h^*$ is quadratic, then $\bar{h}$ is the best linear approximation to $h^*$, which may be far from $h^*$ everywhere. This is the error that persists even with infinite data.
Variance quantifies the effect of finite data. Even if $\mathcal{H}$ is perfectly expressive, a complex model fit to a small dataset will overfit: $h_D$ changes dramatically with the training set, while $\bar{h}$ averages out to something reasonable. The gap between $h_D$ and $\bar{h}$ is variance.
Noise is the Bayes error: the irreducible randomness in $P_{XY}$. For NIFTY daily direction prediction, even a perfect model knowing all relevant information still faces genuine uncertainty about the next day's direction. This is the floor that no algorithm can beat.
Predict next-day NIFTY return $y$ from INDIA VIX $x$.
Suppose the true relationship is $y = 0.5x^2 - x + \epsilon$ where $\epsilon \sim \mathcal{N}(0, 0.25)$. We compare three hypothesis classes on samples of size $N = 20$:
Degree 1 (linear). Cannot express the quadratic truth. The average hypothesis $\bar{h}$ is the best linear approximation to $0.5x^2 - x$: a line with positive slope through the middle of the parabola. Bias is high, as the model systematically misses the curve. But across different training sets of 20 points, the fitted line barely changes. Variance is low.
Degree 2 (quadratic). Matches the true function class exactly. The average hypothesis $\bar{h} \approx h^*$. Bias is near zero. With 20 points and only 3 parameters, the quadratic is well-determined. Variance is moderate but acceptable.
Degree 15. 16 parameters for 20 data points. The model can pass through or near every training point, so training error approaches zero. But different training sets of 20 points produce wildly different degree-15 polynomials. Variance is enormous. The average hypothesis is roughly right, but individual $h_D$'s oscillate violently.
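A Monte Carlo sketch of this comparison, using the same synthetic truth $y = 0.5x^2 - x + \epsilon$ with $\epsilon \sim \mathcal{N}(0, 0.25)$. The input range, number of trials, and test grid are illustrative choices; for each degree we estimate $\text{Bias}^2$ and Variance by refitting on many fresh samples of $N = 20$ points.

```python
import warnings
import numpy as np

warnings.simplefilter("ignore")  # degree-15 fits on 20 points trigger conditioning warnings
rng = np.random.default_rng(3)

def true_f(x):
    return 0.5 * x**2 - x

x_test = np.linspace(-2, 2, 50)   # illustrative evaluation grid
trials, N = 500, 20
results = {}

for degree in (1, 2, 15):
    preds = np.empty((trials, x_test.size))
    for t in range(trials):
        x = rng.uniform(-2, 2, size=N)
        y = true_f(x) + rng.normal(0, 0.5, size=N)   # noise variance 0.25
        preds[t] = np.polyval(np.polyfit(x, y, degree), x_test)
    h_bar = preds.mean(axis=0)                       # the average hypothesis
    bias2 = np.mean((h_bar - true_f(x_test))**2)     # Bias^2, averaged over x
    variance = np.mean(preds.var(axis=0))            # Variance, averaged over x
    results[degree] = (bias2, variance)
    print(f"degree {degree:2d}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```

The printed table should show the pattern the text describes: degree 1 dominated by bias, degree 2 small in both terms, degree 15 dominated by variance.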
We've seen that complex models overfit: their variance dominates. Chapter 2 introduced this as the generalisation gap. Chapter 8 gives the surgical fix: regularisation. By adding a penalty $\Omega(\theta)$ to the ERM objective, we constrain the hypothesis class and control variance, at the cost of slightly higher bias. Ridge regression, Lasso, and SVMs are all regularised ERM. And they all have a Bayesian interpretation through MAP estimation. The last chapter ties every thread together.
A linear regression model trained on 3 years of NIFTY data achieves training MSE of 0.12 and test MSE of 0.15. A degree-10 polynomial model achieves training MSE of 0.02 and test MSE of 0.89. Which statement best diagnoses each model?
The linear model's train-test gap is $0.15 - 0.12 = 0.03$, which is small. The model generalises well, though its training error of 0.12 suggests it may have some bias (missing nonlinear structure). The polynomial model's gap is $0.89 - 0.02 = 0.87$, which is enormous. This is the classic signature of high-variance overfitting: the model memorised the training data (near-zero training error) but learned nothing generalisable.
In bias-variance terms: the linear model has moderate bias, low variance. The polynomial has very low bias but catastrophically high variance. For deployment, you would choose the linear model: its test error is 0.15 versus 0.89.
In the bias-variance decomposition, the noise term $\mathbb{E}[(h^*(x) - y)^2]$ is called "irreducible." Why can no learning algorithm push the expected risk below it?
$h^*(x) = \mathbb{E}[y|x]$ is the Bayes-optimal predictor: it achieves the minimum possible expected squared error for any given $x$. But the conditional distribution $P(y|x)$ has variance: even knowing $x$ exactly, $y$ fluctuates around its conditional mean. This variance $\text{Var}(y|x) = \mathbb{E}[(y - \mathbb{E}[y|x])^2 \mid x]$ is the noise term, and it is a property of the data-generating process, not of the learning algorithm.
For NIFTY daily returns: even if we know today's VIX, PCR, FII flow, and every other relevant feature, tomorrow's return still has genuine randomness; the market hasn't "decided" yet. That randomness is irreducible. The noise floor is the Bayes error; no algorithm, regardless of complexity or data volume, can do better than it.
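A quick simulation makes the floor concrete (reusing the chapter's synthetic truth as a stand-in for real returns): even the Bayes predictor $h^*(x) = \mathbb{E}[y|x]$, which knows the data-generating process exactly, incurs expected squared error equal to $\text{Var}(y|x)$.

```python
import numpy as np

rng = np.random.default_rng(4)

# y = 0.5*x^2 - x + eps, eps ~ N(0, 0.25), so Var(y|x) = 0.25 everywhere
n = 200_000
x = rng.uniform(-2, 2, size=n)
y = 0.5 * x**2 - x + rng.normal(0, 0.5, size=n)

h_star = 0.5 * x**2 - x          # the Bayes-optimal predictor E[y|x]
mse = np.mean((h_star - y)**2)
print(mse)  # close to 0.25: the irreducible noise floor
```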
A logistic classifier predicts $P(\text{NIFTY up} | \text{VIX}=18) = \sigma(\theta^\top x)$ where $\theta^\top x = 0.4$. What is the predicted probability of an up-day, and what is the hard classification at threshold $\tau = 0.5$?
The sigmoid function: $\sigma(t) = \frac{1}{1+e^{-t}}$. At $t = 0.4$: $e^{-0.4} \approx 0.670$, so $\sigma(0.4) = \frac{1}{1+0.670} = \frac{1}{1.670} \approx 0.599$.
Since $0.599 > \tau = 0.5$, the hard classification is "NIFTY Up." A positive score $\theta^\top x = 0.4 > 0$ always maps to a probability $> 0.5$ via the sigmoid; the decision boundary is exactly at $\theta^\top x = 0$.
Option C is the formula for softmax with one class, not sigmoid. Note that $\sigma(t) = e^t/(1+e^t) = 1/(1+e^{-t})$; these are equivalent, but the numerical evaluation gives $\approx 0.599$, not 0.401.
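A two-line check of the arithmetic:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

p = sigmoid(0.4)
print(round(p, 3))                  # 0.599
print("Up" if p > 0.5 else "Down")  # positive score => probability above 0.5
```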
A quant uses a degree-1 polynomial (linear model) to predict NIFTY intraday momentum. The model performs similarly on both training and test data, but both errors are large. What does the bias-variance framework say about this situation, and what should the quant do?
The signature of high bias (underfitting): both training and test errors are large, and they are similar in magnitude (small generalisation gap). The model is too simple; it cannot capture the true structure of the data regardless of how much data you give it. The average hypothesis $\bar{h}$ is far from $h^*$ (high $\text{Bias}^2$), but $h_D$ is close to $\bar{h}$ regardless of the specific training set (low Variance).
The remedy is to increase model expressiveness: add polynomial features $\Phi(x) = [x, x^2, x^3, \ldots]$, include interaction terms, add more informative features, or use a non-linear model. This reduces bias at the cost of some variance, but the net effect is a reduction in expected error if the increase in complexity is appropriate for the available data.
Q1-3 are direct computations. Q4-7 require connecting linear models, ERM, and the bias-variance framework. Q8-10 are harder: they ask you to reason precisely about generalisation.