Linear Models, Generalisation,
and the Bias-Variance Tradeoff
We return to parametric models with everything we've learned. Linear regression is not a simple heuristic: it is ERM, MLE, and the Bayes-optimal predictor under one roof. And the bias-variance decomposition gives us the first rigorous answer to the question every practitioner asks: why does my model work in training and fail in deployment?
ERM and Distribution Estimation Are the Same Thing
The course has presented two seemingly different perspectives on learning. Chapter 2 framed it as ERM: minimise $\hat{R}(h) = \frac{1}{N}\sum_i L(h(x_i), y_i)$ over a hypothesis class $\mathcal{H}$. Chapters 3-5 framed it as distribution estimation: minimise $D_{KL}(P_{\text{data}} \| P_\theta)$ over a parametric family. It is time to make explicit what we have been hinting at all along: these are the same optimisation problem.
This equivalence is the reason we can move freely between "minimise the loss" and "maximise the likelihood": they describe the same parameter. The choice of loss function is the choice of noise model.
Linear Regression: ERM with a Linear Hypothesis Class
The simplest and most important parametric family. The hypothesis class is all affine functions of $x$: $$\mathcal{H} = \{h_\theta : h_\theta(x) = \theta^\top x + \theta_0,\; \theta \in \mathbb{R}^d,\; \theta_0 \in \mathbb{R}\}$$
Using augmented notation (append a 1 to each input, $x \leftarrow [x;\, 1] \in \mathbb{R}^{d+1}$, and fold $\theta_0$ into $\theta \in \mathbb{R}^{d+1}$), we write $h_\theta(x) = \theta^\top x$ cleanly. The ERM objective with squared error is: $$\hat{R}(\theta) = \frac{1}{N}\sum_{i=1}^N \left(\theta^\top x_i - y_i\right)^2$$
This has a closed-form solution. Stack all inputs as rows of a matrix $X \in \mathbb{R}^{N \times (d+1)}$ and all labels as $y \in \mathbb{R}^N$. The objective becomes $\frac{1}{N}\|X\theta - y\|_2^2$. Differentiating and setting to zero gives the normal equations $X^\top X \theta = X^\top y$, so (when $X^\top X$ is invertible): $$\hat{\theta} = (X^\top X)^{-1} X^\top y$$
This is the ordinary least squares (OLS) solution: the closed-form global minimiser of the squared error over all linear hypotheses.
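A minimal sketch of OLS in NumPy, on synthetic data with known coefficients (the data-generating values here are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2*x1 - x2 + 0.5 + noise (coefficients chosen for illustration)
N, d = 200, 2
X_raw = rng.normal(size=(N, d))
y = 2.0 * X_raw[:, 0] - 1.0 * X_raw[:, 1] + 0.5 + 0.1 * rng.normal(size=N)

# Augmented design matrix: append a column of ones so theta_0 folds into theta
X = np.hstack([X_raw, np.ones((N, 1))])

# Solve the normal equations X^T X theta = X^T y directly
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)  # close to [2.0, -1.0, 0.5]
```

In practice `np.linalg.lstsq(X, y)` is preferred over forming $X^\top X$ explicitly, since it is more numerically stable when $X$ is ill-conditioned.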
Generalised Linear Models: Basis Functions
Linear regression is limited to linear relationships between $x$ and $y$. What if the true relationship is nonlinear? The hypothesis class of linear functions may then have too much bias.
The fix: apply a fixed nonlinear transformation $\Phi: \mathbb{R}^d \to \mathbb{R}^p$ to the inputs, then fit a linear model in the transformed space. The transformed dataset is $D_\Phi = \{(\Phi(x_i), y_i)\}$, and we solve: $$\hat{\theta} = (\Phi^\top \Phi)^{-1} \Phi^\top y$$
where $\Phi \in \mathbb{R}^{N \times p}$ is the design matrix with rows $\Phi(x_i)^\top$. The model is linear in $\hat{\theta}$ but nonlinear in the original $x$: a generalised linear model.
Polynomial features. $\Phi(x) = [1, x, x^2, x^3, \ldots, x^p]$. Fit a degree-$p$ polynomial by solving linear regression in the feature space. Choose $p$ larger for more flexibility, smaller for less. This is exactly where bias-variance lives.
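A sketch of the polynomial case: build the design matrix of powers of $x$ and reuse the OLS machinery. The cubic ground truth and noise level here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples from an illustrative cubic: y = 1 - 2x + 0.5x^3 + noise
x = rng.uniform(-2, 2, size=100)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.1 * rng.normal(size=100)

# Feature map Phi(x) = [1, x, x^2, x^3]: linear regression in feature space
p = 3
Phi = np.vander(x, p + 1, increasing=True)  # columns 1, x, x^2, x^3

# Same normal equations as before, now in the transformed space
theta_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(theta_hat)  # close to [1.0, -2.0, 0.0, 0.5]
```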
Market features. $\Phi(x) = [\text{VIX}, \text{VIX}^2, \text{PCR}, \log(\text{OI}), \ldots]$. Adding nonlinear transformations of raw features allows the model to capture regime-dependent relationships. The OLS solution still applies in the transformed space.
Logistic Regression: Linear Models for Classification
For classification with $y \in \{0, 1\}$, a linear model $h_\theta(x) = \theta^\top x$ can predict any real value, but probabilities must lie in $[0,1]$. We need to squash the output into a probability. The standard choice is the sigmoid: $$\sigma(t) = \frac{1}{1+e^{-t}}, \qquad h_\theta(x) = \sigma(\theta^\top x) \in (0, 1)$$
The sigmoid has a natural probabilistic interpretation. If $P_\theta(y=1|x) = \sigma(\theta^\top x)$, then the log-likelihood under this model is the cross-entropy loss: $$-\frac{1}{N}\sum_{i=1}^N \left[y_i \log \sigma(\theta^\top x_i) + (1-y_i)\log(1-\sigma(\theta^\top x_i))\right]$$
Minimising cross-entropy is MLE for the Bernoulli model with sigmoid link โ logistic regression is ERM with cross-entropy loss and a linear hypothesis class. Unlike OLS, this has no closed-form solution and requires numerical optimisation (gradient descent).
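A minimal sketch of that numerical optimisation: plain gradient descent on the average cross-entropy, using synthetic Bernoulli data with an illustrative true parameter vector.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Synthetic binary labels from a known linear model (theta_true is illustrative)
N = 500
X = np.hstack([rng.normal(size=(N, 2)), np.ones((N, 1))])  # augmented inputs
theta_true = np.array([1.5, -2.0, 0.3])
y = (rng.uniform(size=N) < sigmoid(X @ theta_true)).astype(float)

# Gradient descent on the average cross-entropy loss (a convex objective)
theta = np.zeros(3)
lr = 0.5
for _ in range(2000):
    p = sigmoid(X @ theta)
    grad = X.T @ (p - y) / N   # gradient of cross-entropy w.r.t. theta
    theta -= lr * grad

print(theta)  # roughly recovers theta_true
```

The gradient has the same form as in OLS, `X.T @ (prediction - y)`, with the sigmoid inserted; this is a general feature of exponential-family models.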
K-class classification. For $y \in \{0, 1, \ldots, K-1\}$, replace the sigmoid with the softmax function. Define $z = \Theta x$ where $\Theta \in \mathbb{R}^{K \times d}$: $$\text{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
Softmax transforms a $K$-dimensional real vector into a $K$-dimensional probability vector: all entries in $[0,1]$, summing to 1. The model outputs a full probability distribution over classes. Training: minimise cross-entropy loss numerically. This is the multinomial logistic classifier โ the standard linear classifier for multi-class problems.
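A short sketch of softmax in NumPy (the score vector is made up). Subtracting the maximum before exponentiating leaves the result unchanged but avoids overflow:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: shift by max(z), exponentiate, normalise."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, -1.0])  # illustrative scores Theta @ x for K = 3 classes
p = softmax(z)
print(p, p.sum())  # all entries in (0, 1), summing to 1
```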
Generalisation: The Train-Test Gap
We now have models. The question we deferred from Chapter 2 is: how do we know if a model learned from $D_{\text{train}}$ will perform well on new data?
Formally: sample two datasets i.i.d. from $P_{XY}$: a training set $D_{\text{train}} = \{(x_i, y_i)\}_{i=1}^N$ and a test set $D_{\text{test}} = \{(x_i', y_i')\}_{i=1}^{N'}$. Perform ERM on $D_{\text{train}}$ to obtain $\hat{h}^*$. The performance of $\hat{h}^*$ on $D_{\text{test}}$ measures its generalisation.
Define the hypothesis learned from dataset $D$ as $h_D$ (the output of ERM on $D$). The average hypothesis over all possible training sets: $$\bar{h}(x) = \mathbb{E}_{P_D}\!\left[h_D(x)\right]$$
This is not a single model: it is the average prediction at $x$ across all possible training datasets drawn from $P_{XY}$. It captures what the learning algorithm systematically predicts, independent of the specific data it happened to see.
The key quantity is the expected risk of the learned hypothesis: $$\mathbb{E}_{P_D}\!\left[R(h_D)\right] = \mathbb{E}_{P_D}\, \mathbb{E}_{P_{XY}}\!\left[(h_D(x) - y)^2\right]$$
This is what a learning algorithm achieves on average, over all possible training sets. The bias-variance decomposition breaks this into three interpretable components.
The Bias-Variance Decomposition
We decompose the expected squared error $\mathbb{E}_{P_D}\,\mathbb{E}_{P_{XY}}[(h_D(x)-y)^2]$. The Bayes-optimal predictor under squared error is $h^*(x) = \mathbb{E}[y|x]$ (proved in Ch2). Define $\bar{h}(x) = \mathbb{E}_{P_D}[h_D(x)]$ as the average hypothesis.
Write $h_D(x) - y = (h_D(x) - \bar{h}(x)) + (\bar{h}(x) - y)$, expand the square, and take the expectation over $P_D$. The cross term $2\,\mathbb{E}_{P_D}[(h_D(x)-\bar{h}(x))](\bar{h}(x)-y) = 0$ because $\mathbb{E}_{P_D}[h_D(x)-\bar{h}(x)] = 0$ by definition of $\bar{h}$. So: $$\mathbb{E}_{P_D}\!\left[(h_D(x)-y)^2\right] = \mathbb{E}_{P_D}\!\left[(h_D(x)-\bar{h}(x))^2\right] + (\bar{h}(x)-y)^2$$
Expand the second term $(\bar{h}(x)-y)^2$ by adding and subtracting $h^*(x) = \mathbb{E}[y|x]$: $$(\bar{h}(x)-y)^2 = (\bar{h}(x)-h^*(x))^2 + 2(\bar{h}(x)-h^*(x))(h^*(x)-y) + (h^*(x)-y)^2$$
Taking expectation over $P_{XY}$, the cross term involves $\mathbb{E}_{y|x}[h^*(x) - y] = \mathbb{E}[y|x] - \mathbb{E}[y|x] = 0$. So: $$\mathbb{E}_{P_D}\,\mathbb{E}_{P_{XY}}\!\left[(h_D(x)-y)^2\right] = \underbrace{\mathbb{E}_{P_{XY}}\!\left[(\bar{h}(x)-h^*(x))^2\right]}_{\text{Bias}^2} + \underbrace{\mathbb{E}_{P_{XY}}\,\mathbb{E}_{P_D}\!\left[(h_D(x)-\bar{h}(x))^2\right]}_{\text{Variance}} + \underbrace{\mathbb{E}_{P_{XY}}\!\left[(h^*(x)-y)^2\right]}_{\text{Noise}}$$
What Each Term Means
Bias quantifies the effect of the hypothesis class $\mathcal{H}$. If $\mathcal{H}$ contains only linear functions but $h^*$ is quadratic, then $\bar{h}$ is the best linear approximation to $h^*$, which may be far from $h^*$ everywhere. This is the error that persists even with infinite data.
Variance quantifies the effect of finite data. Even if $\mathcal{H}$ is perfectly expressive, a complex model fit to a small dataset will overfit: $h_D$ changes dramatically with the training set, while $\bar{h}$ averages out to something reasonable. The gap between $h_D$ and $\bar{h}$ is variance.
Noise is the Bayes error: the irreducible randomness in $P_{XY}$. For NIFTY daily direction prediction, even a perfect model knowing all relevant information still faces genuine uncertainty about the next day's direction. This is the floor that no algorithm can beat.
Predict next-day NIFTY return $y$ from INDIA VIX $x$.
Suppose the true relationship is $y = 0.5x^2 - x + \epsilon$ where $\epsilon \sim \mathcal{N}(0, 0.25)$. We compare three hypothesis classes on samples of size $N = 20$:
Degree 1 (linear). Cannot express the quadratic truth. The average hypothesis $\bar{h}$ is the best linear approximation to $0.5x^2 - x$: a line with positive slope through the middle of the parabola. Bias is high, as the model systematically misses the curve. But across different training sets of 20 points, the fitted line barely changes. Variance is low.
Degree 2 (quadratic). Matches the true function class exactly. The average hypothesis $\bar{h} \approx h^*$. Bias is near zero. With 20 points and only 3 parameters, the quadratic is well-determined. Variance is moderate but acceptable.
Degree 15. 16 parameters for 20 data points. The model can pass through or near every training point, so training error approaches zero. But different training sets of 20 points produce wildly different degree-15 polynomials. Variance is enormous. The average hypothesis is roughly right, but individual $h_D$'s oscillate violently.
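A Monte Carlo sketch of this comparison, using the same synthetic truth $y = 0.5x^2 - x + \epsilon$ with $\epsilon \sim \mathcal{N}(0, 0.25)$. The input range, number of trials, and test grid are illustrative choices; for each degree we estimate $\text{Bias}^2$ and Variance by refitting on many fresh samples of $N = 20$ points.

```python
import warnings
import numpy as np

warnings.simplefilter("ignore")  # degree-15 fits on 20 points trigger conditioning warnings
rng = np.random.default_rng(3)

def true_f(x):
    return 0.5 * x**2 - x

x_test = np.linspace(-2, 2, 50)   # illustrative evaluation grid
trials, N = 500, 20
results = {}

for degree in (1, 2, 15):
    preds = np.empty((trials, x_test.size))
    for t in range(trials):
        x = rng.uniform(-2, 2, size=N)
        y = true_f(x) + rng.normal(0, 0.5, size=N)   # noise variance 0.25
        preds[t] = np.polyval(np.polyfit(x, y, degree), x_test)
    h_bar = preds.mean(axis=0)                       # the average hypothesis
    bias2 = np.mean((h_bar - true_f(x_test))**2)     # Bias^2, averaged over x
    variance = np.mean(preds.var(axis=0))            # Variance, averaged over x
    results[degree] = (bias2, variance)
    print(f"degree {degree:2d}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```

The printed table should show the pattern the text describes: degree 1 dominated by bias, degree 2 small in both terms, degree 15 dominated by variance.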
We've seen that complex models overfit: their variance dominates. Chapter 2 introduced this as the generalisation gap. Chapter 8 gives the surgical fix: regularisation. By adding a penalty $\Omega(\theta)$ to the ERM objective, we constrain the hypothesis class and control variance, at the cost of slightly higher bias. Ridge regression, Lasso, and SVMs are all regularised ERM. And they all have a Bayesian interpretation through MAP estimation. The last chapter ties every thread together.
A linear regression model trained on 3 years of NIFTY data achieves training MSE of 0.12 and test MSE of 0.15. A degree-10 polynomial model achieves training MSE of 0.02 and test MSE of 0.89. Which statement best diagnoses each model?
The linear model's train-test gap is $0.15 - 0.12 = 0.03$, which is small. The model generalises well, though its training error of 0.12 suggests it may have some bias (missing nonlinear structure). The polynomial model's gap is $0.89 - 0.02 = 0.87$, which is enormous. This is the classic signature of high-variance overfitting: the model memorised the training data (near-zero training error) but learned nothing generalisable.
In bias-variance terms: the linear model has moderate bias, low variance. The polynomial has very low bias but catastrophically high variance. For deployment, you would choose the linear model: its test error is 0.15 versus 0.89.
In the bias-variance decomposition, the noise term $\mathbb{E}[(h^*(x) - y)^2]$ is called "irreducible." Why can no learning algorithm push the expected risk below it?
$h^*(x) = \mathbb{E}[y|x]$ is the Bayes-optimal predictor: it achieves the minimum possible expected squared error for any given $x$. But the conditional distribution $P(y|x)$ has variance: even knowing $x$ exactly, $y$ fluctuates around its conditional mean. This variance $\text{Var}(y|x) = \mathbb{E}[(y - \mathbb{E}[y|x])^2 \mid x]$ is the noise term, and it is a property of the data-generating process, not of the learning algorithm.
For NIFTY daily returns: even if we know today's VIX, PCR, FII flow, and every other relevant feature, tomorrow's return still has genuine randomness; the market hasn't "decided" yet. That randomness is irreducible. The noise floor is the Bayes error; no algorithm, regardless of complexity or data volume, can do better than it.
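A quick simulation makes the floor concrete (reusing the chapter's synthetic truth as a stand-in for real returns): even the Bayes predictor $h^*(x) = \mathbb{E}[y|x]$, which knows the data-generating process exactly, incurs expected squared error equal to $\text{Var}(y|x)$.

```python
import numpy as np

rng = np.random.default_rng(4)

# y = 0.5*x^2 - x + eps, eps ~ N(0, 0.25), so Var(y|x) = 0.25 everywhere
n = 200_000
x = rng.uniform(-2, 2, size=n)
y = 0.5 * x**2 - x + rng.normal(0, 0.5, size=n)

h_star = 0.5 * x**2 - x          # the Bayes-optimal predictor E[y|x]
mse = np.mean((h_star - y)**2)
print(mse)  # close to 0.25: the irreducible noise floor
```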
A logistic classifier predicts $P(\text{NIFTY up} | \text{VIX}=18) = \sigma(\theta^\top x)$ where $\theta^\top x = 0.4$. What is the predicted probability of an up-day, and what is the hard classification at threshold $\tau = 0.5$?
The sigmoid function: $\sigma(t) = \frac{1}{1+e^{-t}}$. At $t = 0.4$: $e^{-0.4} \approx 0.670$, so $\sigma(0.4) = \frac{1}{1+0.670} = \frac{1}{1.670} \approx 0.599$.
Since $0.599 > \tau = 0.5$, the hard classification is "NIFTY Up." A positive score $\theta^\top x = 0.4 > 0$ always maps to a probability $> 0.5$ via the sigmoid; the decision boundary is exactly at $\theta^\top x = 0$.
Option C is the formula for softmax with one class, not sigmoid. Note that $\sigma(t) = e^t/(1+e^t) = 1/(1+e^{-t})$; these are equivalent, but the numerical evaluation gives $\approx 0.599$, not 0.401.
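A two-line check of the arithmetic:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

p = sigmoid(0.4)
print(round(p, 3))                  # 0.599
print("Up" if p > 0.5 else "Down")  # positive score => probability above 0.5
```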
A quant uses a degree-1 polynomial (linear model) to predict NIFTY intraday momentum. The model performs similarly on both training and test data, but both errors are large. What does the bias-variance framework say about this situation, and what should the quant do?
The signature of high bias (underfitting): both training and test errors are large, and they are similar in magnitude (small generalisation gap). The model is too simple; it cannot capture the true structure of the data regardless of how much data you give it. The average hypothesis $\bar{h}$ is far from $h^*$ (high $\text{Bias}^2$), but $h_D$ is close to $\bar{h}$ regardless of the specific training set (low Variance).
The remedy is to increase model expressiveness: add polynomial features $\Phi(x) = [x, x^2, x^3, \ldots]$, include interaction terms, add more informative features, or use a non-linear model. This reduces bias at the cost of some variance, but the net effect is a reduction in expected error if the increase in complexity is appropriate for the available data.
Q1-3 are direct computations. Q4-7 require connecting linear models, ERM, and the bias-variance framework. Q8-10 are harder: they ask you to reason precisely about generalisation.