The Learning Machine
You can't improve what you can't measure. Before a machine can learn anything, we need to answer one precise question: what does it mean for a hypothesis to be good? The answer involves a loss function, an expectation, and one of the most important theorems in all of statistics.
The Space of All Possible Answers
Chapter 1 ended with a goal: find a function $\hat{f}$ that approximates the unknown $f: \mathcal{X} \to \mathcal{Y}$. But we were deliberately vague about where we look. Do we search over all possible functions? A function from $\mathbb{R}^5$ to $\mathbb{R}$ could be anything — a polynomial, a sine wave, a piecewise constant monster with discontinuities at every rational number. Searching over all of them is computationally hopeless and, as we'll see, statistically dangerous.
The first design decision in any ML problem is to restrict the search to a manageable set — the hypothesis class $\mathcal{H}$.
The hypothesis class is your prior belief about the shape of the world, expressed mathematically. Choosing $\mathcal{H}$ = linear functions says "I believe the relationship between features and labels is roughly linear." Choosing $\mathcal{H}$ = deep neural networks says "I believe the relationship is complex and I want maximum flexibility."
$\mathcal{H}$ = all linear classifiers on match conditions. $h_\theta(x) = \mathbb{1}[\theta^\top x > 0]$. You're betting the boundary between "team scores above 180" and "below 180" is a hyperplane in the feature space of pitch report, toss, team form.
$\mathcal{H}$ = all functions of the form $h_\theta(x) = \theta^\top x$. You're betting tomorrow's NIFTY return is a linear combination of today's VIX, FII flow, PCR, and other features. Simple. Auditable. Probably wrong. But a principled start.
Measuring Goodness: The Loss Function
Once we have a hypothesis $h \in \mathcal{H}$, we need to measure how good it is on a single data point $(x, y)$. We need a number: "this prediction was this bad."
Two loss functions dominate most of machine learning. Both deserve to be understood, not just memorised.
Squared error loss — for regression problems where $\mathcal{Y} = \mathbb{R}$:
$$\ell(h(x), y) = (h(x) - y)^2$$
Predict 67, actual is 82. Loss $= (67-82)^2 = 225$. Predict 79, actual is 82. Loss $= (79-82)^2 = 9$. Being off by 15 runs costs 25 times more than being off by 3.
Predict 23,800, actual 24,150. Loss $= (23800-24150)^2 = 122{,}500$. Predict 24,100, actual 24,150. Loss $= 2{,}500$. A 350-point error is penalised 49 times more than a 50-point error.
Zero-one loss — for classification problems where $\mathcal{Y} = \{0, 1\}$:
$$\ell(h(x), y) = \mathbb{1}[h(x) \neq y]$$
Every wrong prediction costs 1, regardless of how wrong. A perfect prediction costs 0. Clean and simple.
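Both losses are one-liners in code. A minimal sketch that reproduces the worked numbers above:

```python
def squared_error(pred: float, actual: float) -> float:
    """Squared error loss for regression: large misses cost disproportionately more."""
    return (pred - actual) ** 2

def zero_one(pred: int, actual: int) -> int:
    """Zero-one loss for classification: every mistake costs exactly 1."""
    return int(pred != actual)

# The worked examples from the text:
print(squared_error(67, 82))        # 225 -- off by 15 runs
print(squared_error(79, 82))        # 9   -- off by 3 runs
print(squared_error(23800, 24150))  # 122500 -- a 350-point miss
print(squared_error(24100, 24150))  # 2500   -- a 50-point miss
print(zero_one(1, 0))               # 1 -- wrong class, flat cost
print(zero_one(1, 1))               # 0 -- correct class
```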
True Risk: The Number We Actually Care About
A loss on a single data point is a noisy signal. What we really want is the expected loss over the entire distribution $P_{XY}$ — how bad is $h$ on average, across all possible inputs the world might throw at it?
True risk is the population-level performance of $h$:
$$R(h) = \mathbb{E}_{(x,y) \sim P_{XY}}[\ell(h(x), y)]$$
It answers: "If I deploy this model in the real world, over all possible inputs it will ever encounter, how well does it do on average?"
$R(h)$ under zero-one loss = the probability of making an incorrect diagnosis on a randomly selected patient from the population. $R(h) = 0.05$ means 5% of patients are misclassified.
$R(h)$ under zero-one loss = the fraction of trading days the model predicts the wrong direction, averaged over all possible market conditions — not just the days you tested on.
This is exactly what we want to minimise. And exactly what we cannot compute. Because $R(h)$ requires integrating over $P_{XY}$, which is unknown. We only have $D$ — $n$ samples from it.
Empirical Risk Minimisation — The Principled Substitute
We can't compute $R(h)$. But we can approximate it. Replace the expectation over the unknown $P_{XY}$ with an average over the known dataset $D$:
$$\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$$
This is the empirical risk — the average loss on the training data. The algorithm that minimises it is called Empirical Risk Minimisation (ERM):
$$\hat{h}^* = \arg\min_{h \in \mathcal{H}} \hat{R}(h)$$
This single equation is the skeleton of nearly every supervised learning algorithm ever invented. Linear regression: ERM with squared error and $\mathcal{H}$ = linear functions. Logistic regression: ERM with cross-entropy loss and $\mathcal{H}$ = sigmoid-linear functions. Neural networks: ERM with various losses and $\mathcal{H}$ = deep compositions of nonlinear functions. The framework is universal. The design choices are what differ.
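The skeleton really is that small. A toy sketch of ERM on synthetic 1-D data (the dataset and the discretised hypothesis class are illustrative assumptions, not from the text): brute-force search over $\mathcal{H} = \{h_\theta(x) = \theta x\}$ for the $\theta$ with the lowest empirical risk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression data for illustration: y is roughly 2x plus noise.
x = rng.normal(size=100)
y = 2.0 * x + 0.1 * rng.normal(size=100)

def empirical_risk(theta: float) -> float:
    """Average squared error of h_theta(x) = theta * x over the dataset."""
    return float(np.mean((theta * x - y) ** 2))

# ERM by brute force over a discretised hypothesis class.
candidates = np.linspace(-5, 5, 1001)
theta_hat = min(candidates, key=empirical_risk)
print(theta_hat)  # close to the true slope 2.0
```

Real algorithms replace the grid search with calculus or gradient descent, but the objective being minimised is exactly this $\hat{R}$.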
The Generalisation Gap — Why Training Error Lies
Here is the most important distinction in all of applied machine learning — two quantities that look alike but behave very differently:
Training error $= \hat{R}(\hat{h}^*)$ — how well the model performs on the data it was trained on.
Generalisation error $= R(\hat{h}^*)$ — how well it performs on new, unseen data from $P_{XY}$.
The generalisation gap is $R(\hat{h}^*) - \hat{R}(\hat{h}^*)$. A large gap means the model learned the training data but not the underlying pattern — it has overfit.
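The gap is easy to manufacture on purpose. A sketch with assumed synthetic data: fit a degree-1 and a degree-12 polynomial to 15 noisy points drawn from a linear ground truth, then compare training and test error. The flexible model drives training error toward zero while its test error stays pinned by the noise it memorised.

```python
import numpy as np

rng = np.random.default_rng(1)

# Small noisy dataset from a linear ground truth (synthetic illustration).
x_train = np.linspace(0, 1, 15)
y_train = 3 * x_train + rng.normal(scale=0.3, size=15)
x_test = np.linspace(0.02, 0.98, 200)
y_test = 3 * x_test + rng.normal(scale=0.3, size=200)

def fit_and_errors(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

tr1, te1 = fit_and_errors(1)    # rigid hypothesis class
tr12, te12 = fit_and_errors(12)  # flexible hypothesis class: overfits
print(f"degree 1:  train {tr1:.4f}, test {te1:.4f}, gap {te1 - tr1:.4f}")
print(f"degree 12: train {tr12:.4f}, test {te12:.4f}, gap {te12 - tr12:.4f}")
```

Because the degree-12 class contains every degree-1 polynomial, its training error can only be lower — the gap, not the training error, reveals the overfit.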
For each scenario, identify the loss function, write out the empirical risk $\hat{R}(h)$ explicitly, and state what ERM reduces to computationally.
$\mathcal{X} \subset \mathbb{R}^{30}$ (match conditions), $\mathcal{Y} = \{0,1\}$, zero-one loss. The dataset has $N = 500$ innings. The empirical risk is:
$$\hat{R}(h) = \frac{1}{500} \sum_{i=1}^{500} \mathbb{1}[h(x_i) \neq y_i]$$
ERM = find the classifier in $\mathcal{H}$ that minimises the fraction of training innings it gets wrong. This is literally minimising training misclassification rate.
$\mathcal{X} \subset \mathbb{R}^5$ (VIX, FII, PCR, SGX, prev return), $\mathcal{Y} = \mathbb{R}$, squared error loss, $\mathcal{H}$ = linear functions $h_\theta(x) = \theta^\top x$. Dataset: $N = 500$ trading days. The empirical risk is:
$$\hat{R}(h_\theta) = \frac{1}{500} \sum_{i=1}^{500} (\theta^\top x_i - y_i)^2$$
ERM = find $\theta^*$ that minimises the mean squared prediction error on training days. This is exactly ordinary least squares linear regression — the oldest and most widely used ML algorithm, now derived from first principles.
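For this loss and class, ERM has a closed-form solution — the normal equations $\theta^* = (X^\top X)^{-1} X^\top y$. A sketch on synthetic data (the features are illustrative stand-ins, not real market data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic design matrix: 500 "days" x 5 features (hypothetical stand-ins
# for VIX, FII flow, PCR, SGX, previous return).
X = rng.normal(size=(500, 5))
theta_true = np.array([0.5, -0.2, 0.1, 0.3, -0.4])
y = X @ theta_true + 0.05 * rng.normal(size=500)

# ERM under squared error with a linear class: solve the normal equations.
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Sanity check against NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_closed, theta_lstsq))  # True
```

Minimising the empirical risk and running ordinary least squares are the same computation.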
The Bayes Classifier — The Theoretical Ceiling
We've been asking: given $D$, what is the best $h$ we can find? Now ask a different question: if we knew $P_{XY}$ exactly, what is the best possible classifier? Is there a theoretical upper bound on how well any algorithm can ever do?
Fix binary classification ($\mathcal{Y} = \{0,1\}$), zero-one loss, and allow $\mathcal{H}$ = all possible functions (no restriction). The optimal classifier is:
$$h_B(x) = \arg\max_{i \in \{0,1\}} P(y = i \mid x)$$
At every input $x$, predict the more probable class. This is the Bayes classifier. Its risk $R(h_B)$ is called the Bayes error — the irreducible minimum error that no algorithm can beat.
Proving the Bayes Classifier Is Optimal
A claim this strong deserves a proof. We show $R(h_B) \leq R(h)$ for every $h \in \mathcal{H}$ — the Bayes classifier has the lowest possible true risk.
For any classifier $h$, partition $\mathbb{R}^d$ into two regions: $$S_i(h) = \{x \in \mathbb{R}^d : h(x) = i\}, \quad i = 0, 1$$ These partition the input space: $S_0(h) \cap S_1(h) = \emptyset$ and $S_0(h) \cup S_1(h) = \mathbb{R}^d$.
Under zero-one loss, $R(h) = P(h(x) \neq y)$. A mistake happens in exactly two ways: predicting 1 when $y=0$, or predicting 0 when $y=1$. Expanding:
$$R(h) = P(h(x)=1, y=0) + P(h(x)=0, y=1) = \int_{S_1(h)} P(y=0)\, P_{X|y=0}(x)\, dx + \int_{S_0(h)} P(y=1)\, P_{X|y=1}(x)\, dx$$
For any fixed $x \in \mathbb{R}^d$, the classifier must assign $x$ to either $S_0$ or $S_1$. The contribution of this point to the risk is:
$$P(y=0)\, P_{X|y=0}(x) \ \text{ if } x \in S_1(h), \qquad P(y=1)\, P_{X|y=1}(x) \ \text{ if } x \in S_0(h)$$
The Bayes classifier assigns each $x$ to whichever class has the lower cost — it takes the pointwise minimum. By Bayes' theorem, for fixed $x$, $P(y=0)\,P_{X|y=0}(x) = P(y=0 \mid x)\, p(x)$, and likewise for class 1; since $p(x)$ is common to both, comparing joint likelihoods is the same as comparing posteriors — predict the more probable class at every $x$.
Since $h_B$ minimises the integrand at every point $x$, its total risk is:
$$R(h_B) = \int_{\mathbb{R}^d} \min\bigl\{P(y=0)\, P_{X|y=0}(x),\; P(y=1)\, P_{X|y=1}(x)\bigr\}\, dx$$
For any other classifier $h$, the integrand at each $x$ is one of the two costs, which is at least the pointwise minimum — so no choice of regions can do better. Therefore:
$$R(h_B) \leq R(h) \quad \text{for every classifier } h$$
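The optimality can be checked numerically. A sketch under a hypothetical model — unit-variance Gaussian class-conditionals $X \mid y{=}0 \sim \mathcal{N}(0,1)$, $X \mid y{=}1 \sim \mathcal{N}(2,1)$ with equal priors (assumed numbers, chosen only so the risks are exactly computable): the zero-one risk of any threshold rule is an explicit expression in the normal CDF, and the Bayes boundary at the midpoint $t=1$ beats every other threshold.

```python
from math import erf, sqrt

def Phi(z: float) -> float:
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Hypothetical model: X | y=0 ~ N(0,1), X | y=1 ~ N(2,1), equal priors.
# A threshold rule predicts 1 when x > t. Its exact zero-one risk:
def risk(t: float) -> float:
    return 0.5 * (1 - Phi(t)) + 0.5 * Phi(t - 2)

# Equal priors and equal variances put the Bayes boundary at the midpoint t = 1.
bayes_error = risk(1.0)
print(round(bayes_error, 4))  # 0.1587 = Phi(-1): the irreducible minimum

# Every other threshold does worse.
for t in (0.0, 0.5, 1.5, 2.0):
    assert risk(t) >= bayes_error
```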
The Regression Oracle: What Minimises Squared Error?
The companion result for regression is equally important. For squared error loss and unrestricted $\mathcal{H}$, what $h$ minimises the true risk?
The optimal predictor under squared error is the conditional expectation of $Y$ given $X$:
$$h^*(x) = \mathbb{E}[Y \mid X = x]$$
Not the most likely value of $Y$ — the average value of $Y$ over the conditional distribution of $Y$ given that $x$.
The best prediction of Rohit Sharma's score given today's match conditions is his average score across all historical matches with those exact conditions. Not his most common score — his mean score.
The best NIFTY return prediction given today's features is $\mathbb{E}[\text{return} \mid \text{VIX}=18, \text{FII}=+2000\text{Cr}, \ldots]$ — the historical average return on all days with those features.
This result explains why linear regression is not naive. It is the best possible estimator of $\mathbb{E}[Y|X]$ within the linear hypothesis class — and $\mathbb{E}[Y|X]$ is the theoretical target for all regression. Every regression algorithm, from the simplest linear model to the deepest neural network, is attempting to estimate this conditional expectation. The battle is fought in hypothesis class size and data quantity, not in the target itself.
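The "mean, not mode or median" point is easy to verify on a sample. A sketch with an assumed skewed distribution (hypothetical gamma-distributed "scores under fixed conditions", purely for illustration): among constant predictions, the sample mean wins under squared error while the sample median wins under absolute error.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical skewed sample: "scores under the same match conditions".
scores = rng.gamma(shape=2.0, scale=20.0, size=10_000)

def avg_squared_error(c: float) -> float:
    """Average squared error of the constant prediction c."""
    return float(np.mean((c - scores) ** 2))

mean, median = scores.mean(), float(np.median(scores))

# The mean minimises squared error; the median minimises absolute error.
print(avg_squared_error(mean) <= avg_squared_error(median))                    # True
print(np.mean(np.abs(median - scores)) <= np.mean(np.abs(mean - scores)))      # True
```

Each loss function implies its own optimal summary of the conditional distribution — the choice of loss is a modelling decision, not a formality.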
Suppose we model the relationship between INDIA VIX and NIFTY direction using known distributions. Let $Y = 1$ if NIFTY closes down, $Y = 0$ if up. Assume:
Compute the joint likelihoods at $x = 18$:
At VIX = 18: the "up" joint likelihood (0.0414) exceeds the "down" joint likelihood (0.0240). The Bayes classifier predicts $h_B(18) = 0$ — NIFTY up.
The Bayes classifier switches prediction at the $x$ where both joint likelihoods are equal. Setting them equal and solving gives the decision boundary — the VIX level above which the model predicts a down day. In this case, the boundary is near VIX $\approx 19.8$. Below 19.8, predict up. Above 19.8, predict down.
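Finding such a boundary is a one-dimensional root-finding problem: locate the $x$ where the two joint likelihoods cross. A sketch with hypothetical Gaussian class-conditionals and priors (illustrative parameters, not the exact distributions behind the 19.8 figure above):

```python
from math import exp, pi, sqrt

def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Hypothetical model (assumed numbers, for illustration only):
# VIX | up-day   ~ N(15, 3),  prior P(up)   = 0.55
# VIX | down-day ~ N(21, 4),  prior P(down) = 0.45
def joint_up(x: float) -> float:
    return 0.55 * gaussian_pdf(x, 15.0, 3.0)

def joint_down(x: float) -> float:
    return 0.45 * gaussian_pdf(x, 21.0, 4.0)

# Bisection between the class means for the crossing point of the joints.
lo, hi = 15.0, 21.0
for _ in range(60):
    mid = (lo + hi) / 2
    if joint_up(mid) > joint_down(mid):
        lo = mid
    else:
        hi = mid
boundary = (lo + hi) / 2
print(round(boundary, 2))  # predict "up" below this VIX level, "down" above
```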
We now have a complete framework: hypothesis class, loss function, true risk, empirical risk, ERM, and the theoretical optima — Bayes classifier and conditional expectation. The architecture is clear. But there is a critical question we deferred: the law of large numbers guarantees $\hat{R}(h) \to R(h)$ for fixed $h$, but does minimising $\hat{R}$ over all of $\mathcal{H}$ give us the minimiser of $R$? The answer involves measuring how "far apart" two probability distributions are — which requires a new mathematical tool. That tool is KL divergence, and it is the subject of Chapter 3.
A NIFTY direction model achieves 3% training error and 48% test error. A second model achieves 38% training error and 40% test error. Which of the following is the most accurate diagnosis?
Model 1 has a generalisation gap of $48\% - 3\% = 45$ percentage points — a massive discrepancy between training and test performance. This is the signature of severe overfitting: the model memorised the training data, including its noise, and learned nothing generalisable. Model 2 has a gap of only $40\% - 38\% = 2$ percentage points — it generalises almost as well as it trains, which is what we want.
The correct conclusion: for live trading, Model 2 is far preferable despite its higher training error. Training error is advertising. Test error is the truth.
Option C is wrong to dismiss both — 40% test error for direction prediction is actually competitive in many market regimes (random is 50%), while 48% is barely better than random. Option D is wrong: a 3% training error is not underfitting — underfitting would show high training error too.
For an unrestricted hypothesis class ($\mathcal{H}$ = all measurable functions) and squared error loss, which $h$ minimises the true risk $R(h) = \mathbb{E}_{P_{XY}}[(h(x)-y)^2]$?
The conditional expectation $\mathbb{E}[Y \mid X=x]$ is the unique minimiser of squared error loss. To see why: write $y = \mathbb{E}[Y|X=x] + \epsilon$ where $\epsilon$ is the residual noise with $\mathbb{E}[\epsilon|X=x] = 0$. Then $(h(x) - y)^2 = (h(x) - \mathbb{E}[Y|X=x] - \epsilon)^2$. Taking expectation over $Y$, the cross term vanishes and we get $\mathbb{E}[(h(x)-y)^2|X=x] = (h(x) - \mathbb{E}[Y|X=x])^2 + \text{Var}(Y|X=x)$. The variance term is irreducible — it is the Bayes error for regression. The first term is minimised by setting $h(x) = \mathbb{E}[Y|X=x]$.
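The decomposition can be verified by Monte Carlo at a single fixed $x$. A sketch with assumed numbers ($\mathbb{E}[Y|x] = 5$, noise std $2$ — hypothetical, for illustration): for any prediction $h$, the simulated $\mathbb{E}[(h-y)^2 \mid x]$ matches $(h - \mathbb{E}[Y|x])^2 + \text{Var}(Y|x)$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Fix one input x. Assume Y | X=x has conditional mean m = 5 and noise std 2
# (hypothetical numbers, purely for illustration).
m, sigma = 5.0, 2.0
y = m + sigma * rng.normal(size=1_000_000)

pairs = []
for h in (3.0, 5.0, 7.5):
    mc = np.mean((h - y) ** 2)              # Monte Carlo estimate of E[(h - y)^2 | x]
    decomposed = (h - m) ** 2 + sigma ** 2  # (h - E[Y|x])^2 + Var(Y|x)
    pairs.append((mc, decomposed))
    print(h, round(float(mc), 3), decomposed)
```

The variance term $\sigma^2 = 4$ is present for every $h$ — that is the irreducible floor — and the total is smallest exactly at $h = m$.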
Options A (mode) and B (median) minimise different objectives. Mode minimises zero-one loss. Median minimises absolute error loss. Squared error gives the mean. Each loss function implies a different optimal predictor — the loss function choice is not arbitrary.
At a particular market state $x$, a classifier estimates $P(y=1 \mid x) = 0.35$. Under zero-one loss, what does the Bayes classifier predict, and what is the Bayes error at this specific point?
$P(y=1|x) = 0.35$ implies $P(y=0|x) = 0.65$. The Bayes classifier predicts the more probable class: $h_B(x) = 0$ (predict "NIFTY up" in this context).
The Bayes error at this point is the probability of a mistake even under the optimal decision. Since we predict class 0, we're wrong whenever the true label is 1, which happens with probability $P(y=1|x) = 0.35$. So the local Bayes error is $\min(P(y=0|x), P(y=1|x)) = \min(0.65, 0.35) = 0.35$.
Intuitively: even the best classifier makes mistakes at ambiguous points where both classes have significant probability. At $x$ where $P(y=1|x) = 0.35$, the world itself is uncertain — 35% of the time this market state leads to a down day. No algorithm can eliminate this uncertainty. It is irreducible noise in the data-generating process.
A cancer detection model achieves zero-one loss of 0.04 on a held-out test set from AIIMS Delhi. A NIFTY direction model achieves zero-one loss of 0.44 on unseen trading days. A colleague claims the cancer model is "clearly better." What is the most sophisticated response?
Raw error rates are meaningless without knowing the Bayes error — the irreducible theoretical minimum — for each domain. Cancer diagnosis: the disease has clear physical signatures (measurable biomarkers, imaging features), so the Bayes error is likely very low, perhaps under 1%. A 4% error model is therefore performing 3–4 percentage points above the ceiling — there is significant room for improvement.
NIFTY direction: markets are close to efficient — prices already incorporate available information. The Bayes error for daily direction prediction may be near 47–49%. A 44% error model is therefore operating very close to the theoretical ceiling. The 6-percentage-point edge over random may represent most of the extractable signal.
This is why quant traders celebrate a sustained 52% win rate on daily direction calls as extraordinary — the Bayes error for that problem is near 50%. A 96% accuracy in cancer detection sounds impressive but may still be far from optimal for that domain. Always ask: what is the Bayes error?
Work through these independently. Q1–3 are direct applications of the definitions. Q4–7 require connecting concepts. Q8–10 will make you think carefully.