The Learning Machine
You can't improve what you can't measure. Before a machine can learn anything, we need to answer one precise question: what does it mean for a hypothesis to be good? The answer involves a loss function, an expectation, and one of the most important theorems in all of statistics.
The Space of All Possible Answers
Chapter 1 ended with a goal: find a function $\hat{f}$ that approximates the unknown $f: \mathcal{X} \to \mathcal{Y}$. But we were deliberately vague about where we look. Do we search over all possible functions? A function from $\mathbb{R}^5$ to $\mathbb{R}$ could be anything — a polynomial, a sine wave, a piecewise constant monster with discontinuities at every rational number. Searching over all of them is computationally hopeless and, as we'll see, statistically dangerous.
The first design decision in any ML problem is to restrict the search to a manageable set — the hypothesis class $\mathcal{H}$.
The hypothesis class is your prior belief about the shape of the world, expressed mathematically. Choosing $\mathcal{H}$ = linear functions says "I believe the relationship between features and labels is roughly linear." Choosing $\mathcal{H}$ = deep neural networks says "I believe the relationship is complex and I want maximum flexibility."
$\mathcal{H}$ = all linear classifiers on match conditions. $h_\theta(x) = \mathbb{1}[\theta^\top x > 0]$. You're betting the boundary between "team scores above 180" and "below 180" is a hyperplane in the feature space of pitch report, toss, team form.
$\mathcal{H}$ = all functions of the form $h_\theta(x) = \theta^\top x$. You're betting tomorrow's NIFTY return is a linear combination of today's VIX, FII flow, PCR, and other features. Simple. Auditable. Probably wrong. But a principled start.
Measuring Goodness: The Loss Function
Once we have a hypothesis $h \in \mathcal{H}$, we need to measure how good it is on a single data point $(x, y)$. We need a number: "this prediction was this bad."
Two loss functions dominate most of machine learning. Both deserve to be understood, not just memorised.
Squared error loss — for regression problems where $\mathcal{Y} = \mathbb{R}$:
$$\ell(h(x), y) = (h(x) - y)^2$$
Predict 67, actual is 82. Loss $= (67-82)^2 = 225$. Predict 79, actual is 82. Loss $= (79-82)^2 = 9$. Being off by 15 runs costs 25 times more than being off by 3.
Predict 23,800, actual 24,150. Loss $= (23800-24150)^2 = 122{,}500$. Predict 24,100, actual 24,150. Loss $= 2{,}500$. A 350-point error is penalised 49 times more than a 50-point error.
Zero-one loss — for classification problems where $\mathcal{Y} = \{0, 1\}$:
$$\ell(h(x), y) = \mathbb{1}[h(x) \neq y]$$
Every wrong prediction costs 1, regardless of how wrong. A perfect prediction costs 0. Clean and simple.
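Both losses are one-liners in code. A minimal sketch that reproduces the worked numbers above:

```python
def squared_error(pred: float, actual: float) -> float:
    """Squared error loss for regression: large misses cost disproportionately more."""
    return (pred - actual) ** 2

def zero_one(pred: int, actual: int) -> int:
    """Zero-one loss for classification: every mistake costs exactly 1."""
    return int(pred != actual)

# The worked examples from the text:
print(squared_error(67, 82))        # 225 -- off by 15 runs
print(squared_error(79, 82))        # 9   -- off by 3 runs
print(squared_error(23800, 24150))  # 122500 -- a 350-point miss
print(squared_error(24100, 24150))  # 2500   -- a 50-point miss
print(zero_one(1, 0))               # 1 -- wrong class, flat cost
print(zero_one(1, 1))               # 0 -- correct class
```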
True Risk: The Number We Actually Care About
A loss on a single data point is a noisy signal. What we really want is the expected loss over the entire distribution $P_{XY}$ — how bad is $h$ on average, across all possible inputs the world might throw at it?
True risk is the population-level performance of $h$:
$$R(h) = \mathbb{E}_{(x,y) \sim P_{XY}}[\ell(h(x), y)]$$
It answers: "If I deploy this model in the real world, over all possible inputs it will ever encounter, how well does it do on average?"
$R(h)$ under zero-one loss = the probability of making an incorrect diagnosis on a randomly selected patient from the population. $R(h) = 0.05$ means 5% of patients are misclassified.
$R(h)$ under zero-one loss = the fraction of trading days the model predicts the wrong direction, averaged over all possible market conditions — not just the days you tested on.
This is exactly what we want to minimise. And exactly what we cannot compute. Because $R(h)$ requires integrating over $P_{XY}$, which is unknown. We only have $D$ — $n$ samples from it.
Empirical Risk Minimisation — The Principled Substitute
We can't compute $R(h)$. But we can approximate it. Replace the expectation over the unknown $P_{XY}$ with an average over the known dataset $D$:
$$\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$$
This is the empirical risk — the average loss on the training data. The algorithm that minimises it is called Empirical Risk Minimisation (ERM):
$$\hat{h}^* = \arg\min_{h \in \mathcal{H}} \hat{R}(h)$$
This single equation is the skeleton of nearly every supervised learning algorithm ever invented. Linear regression: ERM with squared error and $\mathcal{H}$ = linear functions. Logistic regression: ERM with cross-entropy loss and $\mathcal{H}$ = sigmoid-linear functions. Neural networks: ERM with various losses and $\mathcal{H}$ = deep compositions of nonlinear functions. The framework is universal. The design choices are what differ.
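The skeleton really is that small. A toy sketch of ERM on synthetic 1-D data (the dataset and the discretised hypothesis class are illustrative assumptions, not from the text): brute-force search over $\mathcal{H} = \{h_\theta(x) = \theta x\}$ for the $\theta$ with the lowest empirical risk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression data for illustration: y is roughly 2x plus noise.
x = rng.normal(size=100)
y = 2.0 * x + 0.1 * rng.normal(size=100)

def empirical_risk(theta: float) -> float:
    """Average squared error of h_theta(x) = theta * x over the dataset."""
    return float(np.mean((theta * x - y) ** 2))

# ERM by brute force over a discretised hypothesis class.
candidates = np.linspace(-5, 5, 1001)
theta_hat = min(candidates, key=empirical_risk)
print(theta_hat)  # close to the true slope 2.0
```

Real algorithms replace the grid search with calculus or gradient descent, but the objective being minimised is exactly this $\hat{R}$.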
The Generalisation Gap — Why Training Error Lies
Here is the most important distinction in all of applied machine learning — two quantities that look alike but behave very differently:
Training error $= \hat{R}(\hat{h}^*)$ — how well the model performs on the data it was trained on.
Generalisation error $= R(\hat{h}^*)$ — how well it performs on new, unseen data from $P_{XY}$.
The generalisation gap is $R(\hat{h}^*) - \hat{R}(\hat{h}^*)$. A large gap means the model learned the training data but not the underlying pattern — it has overfit.
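The gap is easy to manufacture on purpose. A sketch with assumed synthetic data: fit a degree-1 and a degree-12 polynomial to 15 noisy points drawn from a linear ground truth, then compare training and test error. The flexible model drives training error toward zero while its test error stays pinned by the noise it memorised.

```python
import numpy as np

rng = np.random.default_rng(1)

# Small noisy dataset from a linear ground truth (synthetic illustration).
x_train = np.linspace(0, 1, 15)
y_train = 3 * x_train + rng.normal(scale=0.3, size=15)
x_test = np.linspace(0.02, 0.98, 200)
y_test = 3 * x_test + rng.normal(scale=0.3, size=200)

def fit_and_errors(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

tr1, te1 = fit_and_errors(1)    # rigid hypothesis class
tr12, te12 = fit_and_errors(12)  # flexible hypothesis class: overfits
print(f"degree 1:  train {tr1:.4f}, test {te1:.4f}, gap {te1 - tr1:.4f}")
print(f"degree 12: train {tr12:.4f}, test {te12:.4f}, gap {te12 - tr12:.4f}")
```

Because the degree-12 class contains every degree-1 polynomial, its training error can only be lower — the gap, not the training error, reveals the overfit.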
For each scenario, identify the loss function, write out the empirical risk $\hat{R}(h)$ explicitly, and state what ERM reduces to computationally.
$\mathcal{X} \subset \mathbb{R}^{30}$ (match conditions), $\mathcal{Y} = \{0,1\}$, zero-one loss. The dataset has $N = 500$ innings. The empirical risk is:
$$\hat{R}(h) = \frac{1}{500} \sum_{i=1}^{500} \mathbb{1}[h(x_i) \neq y_i]$$
ERM = find the classifier in $\mathcal{H}$ that minimises the fraction of training innings it gets wrong. This is literally minimising training misclassification rate.
$\mathcal{X} \subset \mathbb{R}^5$ (VIX, FII, PCR, SGX, prev return), $\mathcal{Y} = \mathbb{R}$, squared error loss, $\mathcal{H}$ = linear functions $h_\theta(x) = \theta^\top x$. Dataset: $N = 500$ trading days. The empirical risk is:
$$\hat{R}(h_\theta) = \frac{1}{500} \sum_{i=1}^{500} (\theta^\top x_i - y_i)^2$$
ERM = find $\theta^*$ that minimises the mean squared prediction error on training days. This is exactly ordinary least squares linear regression — the oldest and most widely used ML algorithm, now derived from first principles.
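For this loss and class, ERM has a closed-form solution — the normal equations $\theta^* = (X^\top X)^{-1} X^\top y$. A sketch on synthetic data (the features are illustrative stand-ins, not real market data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic design matrix: 500 "days" x 5 features (hypothetical stand-ins
# for VIX, FII flow, PCR, SGX, previous return).
X = rng.normal(size=(500, 5))
theta_true = np.array([0.5, -0.2, 0.1, 0.3, -0.4])
y = X @ theta_true + 0.05 * rng.normal(size=500)

# ERM under squared error with a linear class: solve the normal equations.
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Sanity check against NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_closed, theta_lstsq))  # True
```

Minimising the empirical risk and running ordinary least squares are the same computation.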
The Bayes Classifier — The Theoretical Ceiling
We've been asking: given $D$, what is the best $h$ we can find? Now ask a different question: if we knew $P_{XY}$ exactly, what is the best possible classifier? Is there a theoretical upper bound on how well any algorithm can ever do?
Fix binary classification ($\mathcal{Y} = \{0,1\}$), zero-one loss, and allow $\mathcal{H}$ = all possible functions (no restriction). The optimal classifier is:
$$h_B(x) = \arg\max_{i \in \{0,1\}} P(y = i \mid x)$$
At every input $x$, predict the more probable class. This is the Bayes classifier. Its risk $R(h_B)$ is called the Bayes error — the irreducible minimum error that no algorithm can beat.
Proving the Bayes Classifier Is Optimal
A claim this strong deserves a proof. We show $R(h_B) \leq R(h)$ for every $h \in \mathcal{H}$ — the Bayes classifier has the lowest possible true risk.
For any classifier $h$, partition $\mathbb{R}^d$ into two regions: $$S_i(h) = \{x \in \mathbb{R}^d : h(x) = i\}, \quad i = 0, 1$$ These partition the input space: $S_0(h) \cap S_1(h) = \emptyset$ and $S_0(h) \cup S_1(h) = \mathbb{R}^d$.
Under zero-one loss, $R(h) = P(h(x) \neq y)$. A mistake happens in exactly two ways: predicting 1 when $y=0$, or predicting 0 when $y=1$. Expanding:
$$R(h) = P(h(x)=1, y=0) + P(h(x)=0, y=1) = \int_{S_1(h)} P(y=0)\, P_{X|y=0}(x)\, dx + \int_{S_0(h)} P(y=1)\, P_{X|y=1}(x)\, dx$$
For any fixed $x \in \mathbb{R}^d$, the classifier must assign $x$ to either $S_0$ or $S_1$. The contribution of this point to the risk is:
$$P(y=0)\, P_{X|y=0}(x) \ \text{ if } x \in S_1(h), \qquad P(y=1)\, P_{X|y=1}(x) \ \text{ if } x \in S_0(h)$$
The Bayes classifier assigns each $x$ to whichever class has the lower cost — it takes the pointwise minimum. By Bayes' theorem, for fixed $x$, $P(y=0)\,P_{X|y=0}(x) = P(y=0 \mid x)\, p(x)$, and likewise for class 1; since $p(x)$ is common to both, comparing joint likelihoods is the same as comparing posteriors — predict the more probable class at every $x$.
Since $h_B$ minimises the integrand at every point $x$, its total risk is:
$$R(h_B) = \int_{\mathbb{R}^d} \min\bigl\{P(y=0)\, P_{X|y=0}(x),\; P(y=1)\, P_{X|y=1}(x)\bigr\}\, dx$$
For any other classifier $h$, the integrand at each $x$ is one of the two costs, which is at least the pointwise minimum — so no choice of regions can do better. Therefore:
$$R(h_B) \leq R(h) \quad \text{for every classifier } h$$
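The optimality can be checked numerically. A sketch under a hypothetical model — unit-variance Gaussian class-conditionals $X \mid y{=}0 \sim \mathcal{N}(0,1)$, $X \mid y{=}1 \sim \mathcal{N}(2,1)$ with equal priors (assumed numbers, chosen only so the risks are exactly computable): the zero-one risk of any threshold rule is an explicit expression in the normal CDF, and the Bayes boundary at the midpoint $t=1$ beats every other threshold.

```python
from math import erf, sqrt

def Phi(z: float) -> float:
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Hypothetical model: X | y=0 ~ N(0,1), X | y=1 ~ N(2,1), equal priors.
# A threshold rule predicts 1 when x > t. Its exact zero-one risk:
def risk(t: float) -> float:
    return 0.5 * (1 - Phi(t)) + 0.5 * Phi(t - 2)

# Equal priors and equal variances put the Bayes boundary at the midpoint t = 1.
bayes_error = risk(1.0)
print(round(bayes_error, 4))  # 0.1587 = Phi(-1): the irreducible minimum

# Every other threshold does worse.
for t in (0.0, 0.5, 1.5, 2.0):
    assert risk(t) >= bayes_error
```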
The Regression Oracle: What Minimises Squared Error?
The companion result for regression is equally important. For squared error loss and unrestricted $\mathcal{H}$, what $h$ minimises the true risk?
The optimal predictor under squared error is the conditional expectation of $Y$ given $X$:
$$h^*(x) = \mathbb{E}[Y \mid X = x]$$
Not the most likely value of $Y$ — the average value of $Y$ over the conditional distribution of $Y$ given that $x$.
The best prediction of Rohit Sharma's score given today's match conditions is his average score across all historical matches with those exact conditions. Not his most common score — his mean score.
The best NIFTY return prediction given today's features is $\mathbb{E}[\text{return} \mid \text{VIX}=18, \text{FII}=+2000\text{Cr}, \ldots]$ — the historical average return on all days with those features.
This result explains why linear regression is not naive. It is the best possible estimator of $\mathbb{E}[Y|X]$ within the linear hypothesis class — and $\mathbb{E}[Y|X]$ is the theoretical target for all regression. Every regression algorithm, from the simplest linear model to the deepest neural network, is attempting to estimate this conditional expectation. The battle is fought in hypothesis class size and data quantity, not in the target itself.
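The "mean, not mode or median" point is easy to verify on a sample. A sketch with an assumed skewed distribution (hypothetical gamma-distributed "scores under fixed conditions", purely for illustration): among constant predictions, the sample mean wins under squared error while the sample median wins under absolute error.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical skewed sample: "scores under the same match conditions".
scores = rng.gamma(shape=2.0, scale=20.0, size=10_000)

def avg_squared_error(c: float) -> float:
    """Average squared error of the constant prediction c."""
    return float(np.mean((c - scores) ** 2))

mean, median = scores.mean(), float(np.median(scores))

# The mean minimises squared error; the median minimises absolute error.
print(avg_squared_error(mean) <= avg_squared_error(median))                    # True
print(np.mean(np.abs(median - scores)) <= np.mean(np.abs(mean - scores)))      # True
```

Each loss function implies its own optimal summary of the conditional distribution — the choice of loss is a modelling decision, not a formality.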
Suppose we model the relationship between INDIA VIX and NIFTY direction using known distributions. Let $Y = 1$ if NIFTY closes down, $Y = 0$ if up. Assume:
Compute the joint likelihoods at $x = 18$:
At VIX = 18: the "up" joint likelihood (0.0414) exceeds the "down" joint likelihood (0.0240). The Bayes classifier predicts $h_B(18) = 0$ — NIFTY up.
The Bayes classifier switches prediction at the $x$ where both joint likelihoods are equal. Setting them equal and solving gives the decision boundary — the VIX level above which the model predicts a down day. In this case, the boundary is near VIX $\approx 19.8$. Below 19.8, predict up. Above 19.8, predict down.
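Finding such a boundary is a one-dimensional root-finding problem: locate the $x$ where the two joint likelihoods cross. A sketch with hypothetical Gaussian class-conditionals and priors (illustrative parameters, not the exact distributions behind the 19.8 figure above):

```python
from math import exp, pi, sqrt

def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Hypothetical model (assumed numbers, for illustration only):
# VIX | up-day   ~ N(15, 3),  prior P(up)   = 0.55
# VIX | down-day ~ N(21, 4),  prior P(down) = 0.45
def joint_up(x: float) -> float:
    return 0.55 * gaussian_pdf(x, 15.0, 3.0)

def joint_down(x: float) -> float:
    return 0.45 * gaussian_pdf(x, 21.0, 4.0)

# Bisection between the class means for the crossing point of the joints.
lo, hi = 15.0, 21.0
for _ in range(60):
    mid = (lo + hi) / 2
    if joint_up(mid) > joint_down(mid):
        lo = mid
    else:
        hi = mid
boundary = (lo + hi) / 2
print(round(boundary, 2))  # predict "up" below this VIX level, "down" above
```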
We now have a complete framework: hypothesis class, loss function, true risk, empirical risk, ERM, and the theoretical optima — Bayes classifier and conditional expectation. The architecture is clear. But there is a critical question we deferred: the law of large numbers guarantees $\hat{R}(h) \to R(h)$ for fixed $h$, but does minimising $\hat{R}$ over all of $\mathcal{H}$ give us the minimiser of $R$? The answer involves measuring how "far apart" two probability distributions are — which requires a new mathematical tool. That tool is KL divergence, and it is the subject of Chapter 3.
A NIFTY direction model achieves 3% training error and 48% test error. A second model achieves 38% training error and 40% test error. Which of the following is the most accurate diagnosis?
Model 1 has a generalisation gap of $48\% - 3\% = 45$ percentage points — a massive discrepancy between training and test performance. This is the signature of severe overfitting: the model memorised the training data, including its noise, and learned nothing generalisable. Model 2 has a gap of only $40\% - 38\% = 2$ percentage points — it generalises almost as well as it trains, which is what we want.
The correct conclusion: for live trading, Model 2 is far preferable despite its higher training error. Training error is advertising. Test error is the truth.
Option C is wrong to dismiss both — 40% test error for direction prediction is actually competitive in many market regimes (random is 50%), while 48% is barely better than random. Option D is wrong: a 3% training error is not underfitting — underfitting would show high training error too.
For an unrestricted hypothesis class ($\mathcal{H}$ = all measurable functions) and squared error loss, which $h$ minimises the true risk $R(h) = \mathbb{E}_{P_{XY}}[(h(x)-y)^2]$?
The conditional expectation $\mathbb{E}[Y \mid X=x]$ is the unique minimiser of squared error loss. To see why: write $y = \mathbb{E}[Y|X=x] + \epsilon$ where $\epsilon$ is the residual noise with $\mathbb{E}[\epsilon|X=x] = 0$. Then $(h(x) - y)^2 = (h(x) - \mathbb{E}[Y|X=x] - \epsilon)^2$. Taking expectation over $Y$, the cross term vanishes and we get $\mathbb{E}[(h(x)-y)^2|X=x] = (h(x) - \mathbb{E}[Y|X=x])^2 + \text{Var}(Y|X=x)$. The variance term is irreducible — it is the Bayes error for regression. The first term is minimised by setting $h(x) = \mathbb{E}[Y|X=x]$.
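The decomposition can be verified by Monte Carlo at a single fixed $x$. A sketch with assumed numbers ($\mathbb{E}[Y|x] = 5$, noise std $2$ — hypothetical, for illustration): for any prediction $h$, the simulated $\mathbb{E}[(h-y)^2 \mid x]$ matches $(h - \mathbb{E}[Y|x])^2 + \text{Var}(Y|x)$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Fix one input x. Assume Y | X=x has conditional mean m = 5 and noise std 2
# (hypothetical numbers, purely for illustration).
m, sigma = 5.0, 2.0
y = m + sigma * rng.normal(size=1_000_000)

pairs = []
for h in (3.0, 5.0, 7.5):
    mc = np.mean((h - y) ** 2)              # Monte Carlo estimate of E[(h - y)^2 | x]
    decomposed = (h - m) ** 2 + sigma ** 2  # (h - E[Y|x])^2 + Var(Y|x)
    pairs.append((mc, decomposed))
    print(h, round(float(mc), 3), decomposed)
```

The variance term $\sigma^2 = 4$ is present for every $h$ — that is the irreducible floor — and the total is smallest exactly at $h = m$.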
Options A (mode) and B (median) minimise different objectives. Mode minimises zero-one loss. Median minimises absolute error loss. Squared error gives the mean. Each loss function implies a different optimal predictor — the loss function choice is not arbitrary.
At a particular market state $x$, a classifier estimates $P(y=1 \mid x) = 0.35$. Under zero-one loss, what does the Bayes classifier predict, and what is the Bayes error at this specific point?
$P(y=1|x) = 0.35$ implies $P(y=0|x) = 0.65$. The Bayes classifier predicts the more probable class: $h_B(x) = 0$ (predict "NIFTY up" in this context).
The Bayes error at this point is the probability of a mistake even under the optimal decision. Since we predict class 0, we're wrong whenever the true label is 1, which happens with probability $P(y=1|x) = 0.35$. So the local Bayes error is $\min(P(y=0|x), P(y=1|x)) = \min(0.65, 0.35) = 0.35$.
Intuitively: even the best classifier makes mistakes at ambiguous points where both classes have significant probability. At $x$ where $P(y=1|x) = 0.35$, the world itself is uncertain — 35% of the time this market state leads to a down day. No algorithm can eliminate this uncertainty. It is irreducible noise in the data-generating process.
A cancer detection model achieves zero-one loss of 0.04 on a held-out test set from AIIMS Delhi. A NIFTY direction model achieves zero-one loss of 0.44 on unseen trading days. A colleague claims the cancer model is "clearly better." What is the most sophisticated response?
Raw error rates are meaningless without knowing the Bayes error — the irreducible theoretical minimum — for each domain. Cancer diagnosis: the disease has clear physical signatures (measurable biomarkers, imaging features), so the Bayes error is likely very low, perhaps under 1%. A 4% error model is therefore performing 3–4 percentage points above the ceiling — there is significant room for improvement.
NIFTY direction: markets are close to efficient — prices already incorporate available information. The Bayes error for daily direction prediction may be near 47–49%. A 44% error model is therefore operating very close to the theoretical ceiling. The 6-percentage-point edge over random may represent most of the extractable signal.
This is why quant traders celebrate a sustained 52% win rate on daily direction calls as extraordinary — the Bayes error for that problem is near 50%. A 96% accuracy in cancer detection sounds impressive but may still be far from optimal for that domain. Always ask: what is the Bayes error?
Work through these independently. Q1–3 are direct applications of the definitions. Q4–7 require connecting concepts. Q8–10 will make you think carefully.