What Is Learning?
Every prediction you have ever made — about tomorrow's weather, the next delivery, NIFTY's close — was an act of learning. You didn't derive any equations. You observed, you updated, you predicted. Machine learning is just that process made precise, principled, and scalable.
The Prediction Instinct
This morning, before leaving home, you glanced at the sky. Grey clouds, a certain heaviness in the air — you grabbed your umbrella. You didn't solve a system of differential equations modelling atmospheric pressure and moisture convection. You looked at past rainy days stored somewhere in your memory, matched the current sky to that pattern, and made a call. You were, in the precise mathematical sense we are about to build, doing machine learning.
A doctor looking at an X-ray develops a feel — after thousands of images — for what a malignant shadow looks like versus a benign one. She can't write down the rule. She couldn't derive it from the physics of how tissue absorbs radiation. But she's learned it, reliably, from data.
A NIFTY trader who has watched the market for five years develops a feel for when the 10:30 AM spike is a genuine breakout versus a trap set by operators. She can't articulate the rule precisely. But she acts on it, and over time, she's more right than wrong.
Both the doctor and the trader are doing the same thing mathematically: mapping inputs to outputs using accumulated experience. Machine learning is the discipline that makes this process explicit — not to replace human judgment, but to scale it, formalise it, and make it auditable.
Formalising the Problem — The Function Approximation View
Let's give the prediction instinct a precise mathematical skeleton. We have two sets: the input space $\mathcal{X}$ (everything we observe) and the output space $\mathcal{Y}$ (everything we want to predict). There is some true relationship between inputs and outputs — a function we'll call $f$:

$$
f : \mathcal{X} \to \mathcal{Y}
$$
$f$ is the oracle. It is the perfect answer to every question we want to ask. For the doctor, $f$ maps an X-ray image (a vector of pixel intensities) to a diagnosis ($0$ = healthy, $1$ = diseased). For the NIFTY trader, $f$ maps today's market state (returns, volatility, open interest, FII flows) to tomorrow's direction. For a Zomato engineer, $f$ maps order details (restaurant, location, time of day, weather) to estimated delivery time.
The problem: $f$ is unknown. We don't have access to the oracle. What we do have is a dataset — a finite collection of input-output pairs that $f$ produced:

$$
D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}
$$
where each $x_i \in \mathcal{X}$ is an observed input, and $y_i \in \mathcal{Y}$ is the corresponding output. The fundamental goal of supervised machine learning is: given $D$, find a function $\hat{f}$ that approximates $f$ well — not just on the data we've seen, but on new inputs we haven't seen yet.
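A minimal sketch makes this concrete. The oracle `f` and the delivery-time features below are hypothetical, invented purely for illustration; in reality `f` exists only implicitly, and it appears here just so we can generate a dataset `D` from it. The learner `f_hat` sees only `D` and approximates `f` by pure pattern matching (a 1-nearest-neighbour rule), the same move as the umbrella decision: match the current input to remembered cases.

```python
import math

# Hypothetical "oracle" f we pretend not to know: delivery time (minutes)
# as a function of distance (km) and hour of day. It exists in this sketch
# only so we can generate a dataset D from it.
def f(distance_km, hour):
    rush = 10 if 18 <= hour <= 21 else 0
    return 15 + 4 * distance_km + rush

# The dataset D = {(x_i, y_i)}: the only thing the learner actually sees.
D = [((d, h), f(d, h)) for d in (1, 2, 4, 6, 8) for h in (12, 15, 19, 21)]

# A crude f_hat: 1-nearest-neighbour. Predict the output of the closest
# observed input. No equations derived -- pure pattern matching.
def f_hat(x):
    nearest = min(D, key=lambda pair: math.dist(pair[0], x))
    return nearest[1]

print(f_hat((3, 19)))   # prints 33; the true value f(3, 19) is 37
```

The approximation is imperfect (33 versus the true 37) and gets better with more data — which is exactly the regime this chapter formalises.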
Why We Can't Just Derive $f$ — The Case for Statistics
A reasonable question: why not derive $f$ from first principles? We have physics, chemistry, biology. Can't we just model the system?
Sometimes yes. The trajectory of a cricket ball follows Newtonian mechanics well enough to be computed analytically. But the moment the system involves human beings at scale — markets, healthcare outcomes, consumer behaviour — the first-principles approach collapses under complexity.
Consider the NIFTY closing price tomorrow. It is the result of buy and sell orders placed by millions of participants — retail investors in Patna, proprietary traders in Mumbai, foreign institutional investors rebalancing in New York, algorithmic systems reacting to news in microseconds. No equation captures this. The system has too many interacting agents, too many feedback loops, too much dependence on news that hasn't happened yet.
The medical diagnosis problem is equally intractable from physics. The relationship between pixel intensities in a lung X-ray and whether the patient has cancer involves the entire biochemistry of tumour formation — a system of staggering complexity that is nowhere close to being derivable from first principles.
When $f$ is too complex to derive analytically, we do the only reasonable thing: we observe it repeatedly and build a statistical approximation. This is the philosophical leap from deterministic to probabilistic thinking, and it is the foundation of everything that follows.
The Probabilistic Worldview — Why Random Variables?
To make the statistical approach rigorous, we need a language. That language is probability theory. Specifically, we need to ask: where do the data points $(x_i, y_i)$ come from?
The dataset $D$ didn't have to come out exactly this way. If the doctor had seen different patients that month, the X-rays would be different. If we looked at a different set of trading days, the NIFTY returns would differ. The data is the outcome of a random process — and the mathematical object that captures a random process is a random variable. To define one rigorously, we first need a probability space $(\Omega, \mathcal{F}, P)$, built from three ingredients:

- $\Omega$ — the sample space: the set of all possible outcomes of the random experiment. If the experiment is "observe a patient," then $\Omega$ is the set of all possible patients in the world. If the experiment is "observe today's NIFTY," then $\Omega = \mathbb{R}_{>0}$.
- $\mathcal{F}$ — a sigma-algebra: the collection of events we are allowed to assign probabilities to. For a coin flip, $\mathcal{F} = \{\emptyset, \{H\}, \{T\}, \{H,T\}\}$. For NIFTY's price, $\mathcal{F}$ contains events like "NIFTY closes between 23,000 and 24,000."
- $P : \mathcal{F} \to [0,1]$ — the probability measure: assigns a number to each event. $P(\Omega) = 1$, $P(\emptyset) = 0$, and probabilities add up correctly for disjoint events.
Now we can define a random variable properly. A random variable $X$ is a function from the sample space $\Omega$ to $\mathbb{R}^d$:

$$
X : \Omega \to \mathbb{R}^d
$$
Every data point $x_i$ you observe is one realisation of $X$ — one outcome of the random experiment made concrete. The random variable $X$ lives in the abstract probability space $(\Omega, \mathcal{F}, P)$. Its realisations $x_1, x_2, \ldots, x_n$ live in $\mathbb{R}^d$, where $d$ is the number of features.
The distribution function of $X$ — written $P_X$ or $F_X$ — tells us the probability of $X$ taking values in any region of $\mathbb{R}^d$. It completely characterises the random variable: if you know $P_X$, you know everything about the statistical behaviour of $X$.
Similarly, the label $Y$ is a random variable $Y : \Omega \to \mathcal{Y}$. Crucially, $X$ and $Y$ are both defined on the same sample space $\Omega$ — they are random variables arising from the same underlying random experiment. Observing a patient produces both an X-ray $x$ and a diagnosis $y$ simultaneously. Observing a trading day produces both a feature vector $x$ and a future return $y$ simultaneously.
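The "same sample space" point is easy to miss, so here is a toy simulation (every name and number is hypothetical): each outcome $\omega$ is one patient, and $X$ and $Y$ are literally two functions reading different aspects of the same $\omega$.

```python
import random

random.seed(0)  # fixed seed so this illustration is reproducible

# Model the sample space Omega as a hypothetical population of patients.
# Each outcome omega carries everything about one patient.
population = [
    {"pixels": [random.random() for _ in range(4)],  # toy 4-pixel "X-ray"
     "diseased": random.random() < 0.1}              # toy ground truth
    for _ in range(1000)
]

def X(omega):
    """X : Omega -> R^d, the feature map."""
    return omega["pixels"]

def Y(omega):
    """Y : Omega -> {0, 1}, the label map."""
    return 1 if omega["diseased"] else 0

# One run of the random experiment picks one omega -- and therefore yields
# one (x, y) pair simultaneously. X and Y are coupled through omega.
omega = random.choice(population)
x, y = X(omega), Y(omega)
print(len(x), y)   # a 4-dimensional realisation and its label
```

Nothing about $X$ or $Y$ is random in itself; all the randomness lives in which $\omega$ the experiment produces.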
The Joint Distribution $P_{XY}$ — The God-Object of ML
Since $X$ and $Y$ are both defined on the same $\Omega$, we can define their joint distribution $P_{XY}$ — a single mathematical object that encodes the complete statistical relationship between inputs and outputs.
This is the god-object of machine learning. Everything — every task, every model, every algorithm — is a different question asked of $P_{XY}$. Want to classify? Ask for the conditional $P_{Y|X}$. Want to predict a number? Compute $\mathbb{E}[Y | X = x]$. Want to generate realistic data? Sample from the marginal $P_X$. Want to understand which inputs are informative? Look at the mutual information between $X$ and $Y$.
The tragedy: we never get to see $P_{XY}$ directly. We only see $D$ — $n$ samples drawn from it. All of ML is the art of drawing conclusions about $P_{XY}$ from this finite, noisy, incomplete glimpse.
| ML Task | What you're really estimating | Example |
|---|---|---|
| Classification | $P(Y \mid X)$ — probability of label given features | Is this NIFTY move a breakout or a trap? |
| Regression | $\mathbb{E}[Y \mid X]$ — expected output given features | What will NIFTY close at tomorrow? |
| Density Estimation | $P(X)$ — marginal distribution of features | How unusual is today's INDIA VIX reading? |
| Anomaly Detection | Regions where $P(X)$ is very low | Flag suspicious order flow on NSE |
| Generation | Sample new $x \sim P(X)$ | Simulate a realistic NIFTY return path |
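A sketch of two rows of this table in action, using a synthetic joint distribution (the simulator and all its parameters are invented for illustration, not calibrated to any market): the same bag of samples answers both a classification question and a density question.

```python
import random

random.seed(42)

# Draw n samples from a synthetic joint P_XY: x is a toy scalar "signal",
# y = 1 if the (noisy) next-day move is up. We know the joint only because
# we wrote the simulator; the estimators below see samples alone.
n = 50_000
samples = []
for _ in range(n):
    x = random.gauss(0.0, 1.0)                       # feature realisation
    up_prob = 0.5 + 0.15 * max(-1.0, min(1.0, x))    # true P(Y=1 | X=x)
    y = 1 if random.random() < up_prob else 0
    samples.append((x, y))

# Classification question: estimate P(Y=1 | X near 1) from the samples.
near_one = [y for x, y in samples if 0.9 < x < 1.1]
p_up = sum(near_one) / len(near_one)     # true value is 0.65

# Density question: estimate the marginal P(X > 2) -- how unusual is x?
p_tail = sum(1 for x, _ in samples if x > 2) / n   # true value ~0.023

print(round(p_up, 2), round(p_tail, 3))
```

Same data, two different questions asked of $P_{XY}$ — one conditional, one marginal. That is the whole table in miniature.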
For each scenario below, identify $\mathcal{X}$, $\mathcal{Y}$, what $P_{XY}$ represents, and which ML task is being performed.
Before each IPL match, you observe: pitch report (5 categorical features), toss result, playing XI on both sides (22 binary indicators), recent team form (win/loss over last 5 games). You want to predict whether the batting-first team scores above 180.
$\mathcal{X} \subset \mathbb{R}^{30}$ approximately (encoding all the features above). $\mathcal{Y} = \{0, 1\}$ (below 180, above 180). This is binary classification. We're estimating $P(Y=1 \mid X=x)$ — the probability of a high score given the match conditions. $P_{XY}$ encodes everything about how pitch conditions, team composition, and form collectively determine scoring outcomes.
At 9:15 AM each day, you observe: overnight SGX Nifty change, previous day's close-to-close return, INDIA VIX opening level, net FII cash market activity (previous session), and the PCR (put-call ratio) on the front-month contract. You want to predict the NIFTY return from 9:15 AM to 3:30 PM.
$\mathcal{X} \subset \mathbb{R}^5$. $\mathcal{Y} = \mathbb{R}$ (a continuous return). This is regression. We're estimating $\mathbb{E}[Y \mid X=x]$. $P_{XY}$ encodes the statistical relationship between opening market conditions and the intraday return that follows.
The i.i.d. Assumption — Powerful, Convenient, and Often Wrong
We said the dataset $D$ consists of observations drawn from $P_{XY}$. But we haven't said how they're drawn. The standard assumption — the one that makes the mathematics tractable — is that the data points are drawn independent and identically distributed, which we write as:

$$
(x_i, y_i) \stackrel{\text{i.i.d.}}{\sim} P_{XY}, \qquad i = 1, \ldots, n
$$
Three letters, two assumptions, and enormous weight riding on them. Let's earn each one.
Identically distributed means every $(x_i, y_i)$ is drawn from the same $P_{XY}$. The rules of the game don't change between observations. The distribution that generated the first data point is the same one that generated the last.
Independent means knowing one data point tells you nothing about another. Formally, for $i \neq j$ and any events $A$ and $B$:

$$
P\big((X_i, Y_i) \in A,\ (X_j, Y_j) \in B\big) = P\big((X_i, Y_i) \in A\big) \cdot P\big((X_j, Y_j) \in B\big)
$$
The observations don't talk to each other. They have no memory of each other. Whether patient 47 in the X-ray dataset has cancer tells you nothing about patient 48's diagnosis — they're unconnected strangers. Independence is a very reasonable assumption there. It is far less reasonable for NIFTY daily returns: volatility clusters, so a turbulent day makes the next day more likely to be turbulent, and the distribution itself drifts as market regimes change. Markets remember; patients in a queue do not.
So why do we use i.i.d. at all? Because without it, the mathematics of learning theory — error bounds, sample complexity, generalisation guarantees — becomes vastly harder to derive. The i.i.d. assumption is the price of tractability. It gives us clean theorems and usable algorithms. The practitioner's job is to know when that price is too high for the problem at hand, and to reach for tools designed for dependent data (time series models, recurrent networks, state space models) when it is.
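One way to see the independence failure numerically — a toy sketch, not a market model: generate a genuinely i.i.d. series and a volatility-clustered series, then compare the lag-1 autocorrelation of their squared values. Under i.i.d., squared values carry no memory; under clustering, they do.

```python
import random

random.seed(7)

def lag1_autocorr(xs):
    # Sample lag-1 autocorrelation: corr(x_t, x_{t+1}).
    m = sum(xs) / len(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs[:-1], xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    return num / den

n = 20_000

# Series A: genuinely i.i.d. Gaussian "returns".
iid = [random.gauss(0, 1.0) for _ in range(n)]

# Series B: volatility clusters. sigma toggles between a calm regime (0.5)
# and a turbulent one (2.5) -- toy dynamics, not calibrated to any market.
clustered, sigma = [], 0.5
for _ in range(n):
    if random.random() < 0.01:   # occasional regime switch
        sigma = 3.0 - sigma      # toggles 0.5 <-> 2.5
    clustered.append(random.gauss(0, sigma))

# If the draws were i.i.d., squared values would show no autocorrelation.
a_iid = lag1_autocorr([x * x for x in iid])
a_clust = lag1_autocorr([x * x for x in clustered])
print(round(a_iid, 3), round(a_clust, 3))   # near zero vs clearly positive
```

A diagnostic like this is often the first thing to run before trusting any i.i.d.-based guarantee on sequential data.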
The Fundamental Problem of ML, Stated Precisely
We now have all the ingredients for a precise statement of what ML is trying to do. It is worth reading slowly.
Given a dataset $D = \{(x_i, y_i)\}_{i=1}^n$ drawn i.i.d. from an unknown joint distribution $P_{XY}$ over $\mathcal{X} \times \mathcal{Y}$, estimate $P_{XY}$ — or the relevant conditional or marginal of it — well enough to make accurate predictions on new, unseen inputs drawn from the same $P_{XY}$.
This is, at its core, an estimation problem. We want to estimate a probability distribution from samples. The challenge: $P_{XY}$ could be any distribution over $\mathcal{X} \times \mathcal{Y}$ — an infinite-dimensional space of possibilities. Estimating an arbitrary distribution from $n$ finite samples is impossible in general. We need to constrain the problem.
The Parametric Approach — Making a Bet on the Shape of the World
The standard way to make the estimation problem tractable is to assume that $P_{XY}$ belongs to some parametric family — a specific collection of distributions indexed by a finite-dimensional parameter vector $\theta \in \Theta$:

$$
\mathcal{P} = \{ P_\theta : \theta \in \Theta \}, \qquad \Theta \subseteq \mathbb{R}^k
$$
For example: assume the data is Gaussian, $P_\theta = \mathcal{N}(\mu, \Sigma)$, so $\theta = (\mu, \Sigma)$ — a mean vector and a covariance matrix. Or assume the conditional $P_{Y|X}$ is a linear function of $X$ plus Gaussian noise, which leads to linear regression. The parametric assumption shrinks the problem from "find an arbitrary distribution over an infinite-dimensional space" to "find the right $k$-dimensional parameter vector."
With a parametric family in hand, the estimation problem becomes an optimisation problem. Define a distance $d(P_{XY}, P_\theta)$ that measures how far the parametric distribution $P_\theta$ is from the true unknown $P_{XY}$. Then find the parameter that minimises this distance:

$$
\theta^* = \arg\min_{\theta \in \Theta} \, d(P_{XY}, P_\theta)
$$
There's one immediate subtlety: we can't actually compute $d(P_{XY}, P_\theta)$, because $P_{XY}$ is unknown. We only have samples $D$. So in practice, we approximate $d(P_{XY}, P_\theta)$ using the data. How we do that approximation — and what it means mathematically — is the subject of Chapter 2.
Suppose we observe $n = 500$ daily NIFTY log-returns $\{x_1, \ldots, x_{500}\}$ and want to estimate their distribution $P_X$. We decide to use a parametric Gaussian family:

$$
\mathcal{P} = \{ \mathcal{N}(\mu, \sigma^2) : \mu \in \mathbb{R},\ \sigma > 0 \}, \qquad \theta = (\mu, \sigma^2)
$$
We are betting that NIFTY daily log-returns are drawn from some Gaussian distribution — that the true $P_X$ is one member of the family $\{\mathcal{N}(\mu, \sigma^2) : \mu \in \mathbb{R}, \sigma > 0\}$. This reduces the infinite-dimensional problem of estimating $P_X$ to a two-dimensional problem: find $(\mu^*, \sigma^{*2})$.
We want $\theta^* = (\mu^*, \sigma^{*2})$ such that $\mathcal{N}(\mu^*, \sigma^{*2})$ is "closest" to the true $P_X$ in some distance $d$. Using the sample mean and variance as an initial guess:

$$
\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2
$$
In Chapter 4, we'll see that this is exactly the Maximum Likelihood Estimator — the $\theta$ that minimises the KL divergence from $P_X$ to $P_\theta$ using only the samples in $D$.
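The estimator itself is a few lines. This sketch simulates stand-in returns (real data would be loaded from market history; the simulator's parameters are illustrative) and computes the sample mean and the biased sample variance — exactly the pair that Chapter 4 will identify as the Gaussian MLE.

```python
import math
import random

random.seed(1)

# Simulated stand-in for 500 daily log-returns: mean 0.05%, std 1% per day.
# (Real returns are fat-tailed; Gaussian here just exercises the estimator.)
returns = [random.gauss(0.0005, 0.01) for _ in range(500)]

# The parametric bet: P_X is some N(mu, sigma^2). The sample mean and the
# biased sample variance (divide by n, not n-1) are the Gaussian MLE.
n = len(returns)
mu_hat = sum(returns) / n
var_hat = sum((r - mu_hat) ** 2 for r in returns) / n
sigma_hat = math.sqrt(var_hat)

print(round(mu_hat, 4), round(sigma_hat, 4))   # near the simulator's (0.0005, 0.01)
```

Two numbers summarise 500 observations — that is the compression the parametric bet buys, and the next worked example shows what it costs.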
NIFTY returns are well-known to have fat tails — the probability of a 4-sigma daily move is much higher in reality than the Gaussian predicts. On 12 March 2020, NIFTY fell 8.3% in a single session. Under a Gaussian with mean $0.05\%$ and standard deviation $1\%$ per day (roughly calibrated to historical data), an 8.3% drop is a move of more than 8 standard deviations — its probability is astronomically small, effectively zero. The Gaussian family simply cannot express this.
We now have the full picture of what machine learning is, mathematically. Data is samples from an unknown $P_{XY}$. Learning is estimating $P_{XY}$, or the right conditional of it, by fitting a parametric family using an optimisation criterion. Generalisation — performing well on new, unseen data — is the proof that we've learned the right thing. Everything else is detail.
But there is one question we haven't answered: what should the distance $d$ be? How do you measure the gap between two probability distributions when you can only see samples from one of them, and the other is your own parametric guess? That is the question that unlocks everything. It is the subject of Chapter 2.
A Zomato engineer builds a model to predict delivery time. She trains it on data collected in Delhi during October–November 2024 and deploys it across India in June 2025. In Mumbai during the monsoon, the model performs terribly. Which assumption of the i.i.d. framework is most directly violated?
The identical distribution assumption requires that training and deployment data come from the same $P_{XY}$. Here, the training data was collected in a specific city, season, and time period. The deployment context — different city, monsoon conditions, different traffic patterns, different restaurant density — is a genuinely different $P_{XY}$. The relationship between features (distance, time of day, restaurant) and delivery time is structurally different in monsoon Mumbai versus autumn Delhi.
This is called distribution shift or covariate shift, and it is one of the most common failure modes of deployed ML systems. Option A (independence) might also be mildly violated — nearby deliveries during flooding may be correlated — but the primary failure is the distributional mismatch between training and deployment.
An ML practitioner wants to predict whether an IPL batsman will score above 50 in a given match. She defines $\mathcal{X}$ as the batsman's career statistics and $\mathcal{Y} = \{0, 1\}$. What is she actually estimating from $P_{XY}$?
Classification is always about estimating $P(Y \mid X)$ — the probability of a particular label given the observed features. When the model sees Virat Kohli's career statistics (average 58, strike rate 130, ...) and outputs "72% probability of scoring above 50," it is evaluating $P(Y=1 \mid X = x_\text{Kohli})$.
Option B — estimating the full joint $P_{XY}$ — would also work and is more powerful, but it's harder and often unnecessary. If you only need to classify, you only need the conditional. Option A ($P_X$) is the distribution of career statistics themselves — this is what you'd estimate if you were doing density estimation or anomaly detection, not classification. Option D ($P_Y$) is just the base rate — the fraction of innings above 50 regardless of who's batting — which ignores the feature information entirely.
A trader trains a NIFTY direction model on daily data from 2015–2022 and achieves 58% accuracy on a held-out test set from the same period. In live trading from January 2023 onwards, the model achieves only 51% — barely above random. Which of the following is the most likely explanation, and why?
The held-out test set comes from the same 2015–2022 period as the training data, so if the model performs well there, overfitting is less likely to be the primary issue. The sharp drop in live performance starting 2023 is the hallmark of distribution shift: the $P_{XY}$ underlying NIFTY returns in 2023–2024 (rising interest rate environment, post-COVID normalisation, changed FII patterns) is measurably different from 2015–2022.
The model learned to exploit statistical regularities that were genuinely present in the training period but faded or reversed in the deployment period. This is also known as non-stationarity in financial time series, and it is arguably the central challenge of all quantitative trading. The correct response is not to build a better model on the old data, but to continuously retrain on recent data, use shorter lookback windows, or build models that explicitly account for regime changes.
A researcher fits a Gaussian model $\mathcal{N}(\hat\mu, \hat\sigma^2)$ to 10 years of NIFTY daily log-returns and uses it to estimate the probability of a daily move worse than $-5\%$. The Gaussian gives a probability of $3.2 \times 10^{-7}$. In reality, such moves have occurred 4 times in 10 years (roughly 2,500 trading days). What is happening, and what is its name?
The empirical frequency of $-5\%$ days is $4/2500 \approx 1.6 \times 10^{-3}$ — nearly four orders of magnitude larger than the Gaussian prediction of $3.2 \times 10^{-7}$, a factor of roughly 5,000. This is not a matter of data quantity or changing regimes. It is a fundamental mismatch between the shape of the parametric family and the shape of the true distribution.
The Gaussian distribution has exponentially thin tails — it places essentially zero probability on events beyond 3–4 standard deviations. But financial returns have fat tails (formally, their kurtosis exceeds 3): extreme events occur far more frequently than any Gaussian would predict. This is documented in every liquid market on earth and goes by names like leptokurtosis or heavy-tailed distributions.
This is model misspecification: the true $P_X$ lies outside your parametric family $\{\mathcal{N}(\mu, \sigma^2)\}$. No amount of data, no clever optimisation, will fix this. The fix is to choose a richer family — Student-t distributions, mixture models, or non-parametric approaches — that can actually express fat tails.
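The mismatch is easy to quantify without any libraries: the Gaussian CDF can be written through the error function, so we can compare the model's tail probability with the empirical frequency quoted above. The fitted parameters here (mean 0.05%, std 1% per day) are illustrative stand-ins, not an actual fit.

```python
import math

def norm_cdf(x, mu, sigma):
    # Gaussian CDF via the error function -- pure standard library.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Illustrative Gaussian fit: mean 0.05% and standard deviation 1% per day.
mu, sigma = 0.0005, 0.01

p_gauss = norm_cdf(-0.05, mu, sigma)   # P(daily return <= -5%) under the fit
p_empirical = 4 / 2500                 # observed frequency from the text

print(f"Gaussian tail:  {p_gauss:.1e}")     # on the order of 1e-7
print(f"Empirical tail: {p_empirical:.1e}") # 1.6e-3
print(f"Underestimated by a factor of about {p_empirical / p_gauss:,.0f}")
```

No re-estimation of $(\mu, \sigma)$ closes a gap this large; only changing the family — Student-t, mixtures, non-parametric estimators — can.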
Work through these independently. Q1–3 test direct understanding. Q4–7 require connecting ideas across sections. Q8–10 will make you think beyond the chapter.