What Is Learning?
Every prediction you have ever made — about tomorrow's weather, the next delivery, NIFTY's close — was an act of learning. You didn't derive any equations. You observed, you updated, you predicted. Machine learning is just that process made precise, principled, and scalable.
The Prediction Instinct
This morning, before leaving home, you glanced at the sky. Grey clouds, a certain heaviness in the air — you grabbed your umbrella. You didn't solve a system of differential equations modelling atmospheric pressure and moisture convection. You looked at past rainy days stored somewhere in your memory, matched the current sky to that pattern, and made a call. You were, in the precise mathematical sense we are about to build, doing machine learning.
A doctor looking at an X-ray develops a feel — after thousands of images — for what a malignant shadow looks like versus a benign one. She can't write down the rule. She couldn't derive it from the physics of how tissue absorbs radiation. But she's learned it, reliably, from data.
A NIFTY trader who has watched the market for five years develops a feel for when the 10:30 AM spike is a genuine breakout versus a trap set by operators. She can't articulate the rule precisely. But she acts on it, and over time, she's more right than wrong.
Both the doctor and the trader are doing the same thing mathematically: mapping inputs to outputs using accumulated experience. Machine learning is the discipline that makes this process explicit — not to replace human judgment, but to scale it, formalise it, and make it auditable.
Formalising the Problem — The Function Approximation View
Let's give the prediction instinct a precise mathematical skeleton. We have two sets: the input space $\mathcal{X}$ (everything we observe) and the output space $\mathcal{Y}$ (everything we want to predict). There is some true relationship between inputs and outputs — a function we'll call $f$:

$$
f : \mathcal{X} \to \mathcal{Y}
$$
$f$ is the oracle. It is the perfect answer to every question we want to ask. For the doctor, $f$ maps an X-ray image (a vector of pixel intensities) to a diagnosis ($0$ = healthy, $1$ = diseased). For the NIFTY trader, $f$ maps today's market state (returns, volatility, open interest, FII flows) to tomorrow's direction. For a Zomato engineer, $f$ maps order details (restaurant, location, time of day, weather) to estimated delivery time.
The problem: $f$ is unknown. We don't have access to the oracle. What we do have is a dataset — a finite collection of input-output pairs that $f$ produced:

$$
D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}
$$
where each $x_i \in \mathcal{X}$ is an observed input, and $y_i \in \mathcal{Y}$ is the corresponding output. The fundamental goal of supervised machine learning is: given $D$, find a function $\hat{f}$ that approximates $f$ well — not just on the data we've seen, but on new inputs we haven't seen yet.
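A minimal sketch makes this concrete. The oracle `f` and the delivery-time features below are hypothetical, invented purely for illustration; in reality `f` exists only implicitly, and it appears here just so we can generate a dataset `D` from it. The learner `f_hat` sees only `D` and approximates `f` by pure pattern matching (a 1-nearest-neighbour rule), the same move as the umbrella decision: match the current input to remembered cases.

```python
import math

# Hypothetical "oracle" f we pretend not to know: delivery time (minutes)
# as a function of distance (km) and hour of day. It exists in this sketch
# only so we can generate a dataset D from it.
def f(distance_km, hour):
    rush = 10 if 18 <= hour <= 21 else 0
    return 15 + 4 * distance_km + rush

# The dataset D = {(x_i, y_i)}: the only thing the learner actually sees.
D = [((d, h), f(d, h)) for d in (1, 2, 4, 6, 8) for h in (12, 15, 19, 21)]

# A crude f_hat: 1-nearest-neighbour. Predict the output of the closest
# observed input. No equations derived -- pure pattern matching.
def f_hat(x):
    nearest = min(D, key=lambda pair: math.dist(pair[0], x))
    return nearest[1]

print(f_hat((3, 19)))   # prints 33; the true value f(3, 19) is 37
```

The approximation is imperfect (33 versus the true 37) and gets better with more data — which is exactly the regime this chapter formalises.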
Why We Can't Just Derive $f$ — The Case for Statistics
A reasonable question: why not derive $f$ from first principles? We have physics, chemistry, biology. Can't we just model the system?
Sometimes yes. The trajectory of a cricket ball follows Newtonian mechanics well enough to be computed analytically. But the moment the system involves human beings at scale — markets, healthcare outcomes, consumer behaviour — the first-principles approach collapses under complexity.
Consider the NIFTY closing price tomorrow. It is the result of buy and sell orders placed by millions of participants — retail investors in Patna, proprietary traders in Mumbai, foreign institutional investors rebalancing in New York, algorithmic systems reacting to news in microseconds. No equation captures this. The system has too many interacting agents, too many feedback loops, too much dependence on news that hasn't happened yet.
The medical diagnosis problem is equally intractable from physics. The relationship between pixel intensities in a lung X-ray and whether the patient has cancer involves the entire biochemistry of tumour formation — a system of staggering complexity that is nowhere close to being derivable from first principles.
When $f$ is too complex to derive analytically, we do the only reasonable thing: we observe it repeatedly and build a statistical approximation. This is the philosophical leap from deterministic to probabilistic thinking, and it is the foundation of everything that follows.
The Probabilistic Worldview — Why Random Variables?
To make the statistical approach rigorous, we need a language. That language is probability theory. Specifically, we need to ask: where do the data points $(x_i, y_i)$ come from?
The dataset $D$ didn't have to come out exactly this way. If the doctor had seen different patients that month, the X-rays would be different. If we looked at a different set of trading days, the NIFTY returns would differ. The data is the outcome of a random process — and the mathematical object that captures a random process is a random variable. To define one rigorously, we first need a probability space $(\Omega, \mathcal{F}, P)$, built from three ingredients:

- $\Omega$ — the sample space: the set of all possible outcomes of the random experiment. If the experiment is "observe a patient," then $\Omega$ is the set of all possible patients in the world. If the experiment is "observe today's NIFTY," then $\Omega = \mathbb{R}_{>0}$.
- $\mathcal{F}$ — a sigma-algebra: the collection of events we are allowed to assign probabilities to. For a coin flip, $\mathcal{F} = \{\emptyset, \{H\}, \{T\}, \{H,T\}\}$. For NIFTY's price, $\mathcal{F}$ contains events like "NIFTY closes between 23,000 and 24,000."
- $P : \mathcal{F} \to [0,1]$ — the probability measure: assigns a number to each event. $P(\Omega) = 1$, $P(\emptyset) = 0$, and probabilities add up correctly for disjoint events.
Now we can define a random variable properly. A random variable $X$ is a function from the sample space $\Omega$ to $\mathbb{R}^d$:

$$
X : \Omega \to \mathbb{R}^d
$$
Every data point $x_i$ you observe is one realisation of $X$ — one outcome of the random experiment made concrete. The random variable $X$ lives in the abstract probability space $(\Omega, \mathcal{F}, P)$. Its realisations $x_1, x_2, \ldots, x_n$ live in $\mathbb{R}^d$, where $d$ is the number of features.
The distribution function of $X$ — written $P_X$ or $F_X$ — tells us the probability of $X$ taking values in any region of $\mathbb{R}^d$. It completely characterises the random variable: if you know $P_X$, you know everything about the statistical behaviour of $X$.
Similarly, the label $Y$ is a random variable $Y : \Omega \to \mathcal{Y}$. Crucially, $X$ and $Y$ are both defined on the same sample space $\Omega$ — they are random variables arising from the same underlying random experiment. Observing a patient produces both an X-ray $x$ and a diagnosis $y$ simultaneously. Observing a trading day produces both a feature vector $x$ and a future return $y$ simultaneously.
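The "same sample space" point is easy to miss, so here is a toy simulation (every name and number is hypothetical): each outcome $\omega$ is one patient, and $X$ and $Y$ are literally two functions reading different aspects of the same $\omega$.

```python
import random

random.seed(0)  # fixed seed so this illustration is reproducible

# Model the sample space Omega as a hypothetical population of patients.
# Each outcome omega carries everything about one patient.
population = [
    {"pixels": [random.random() for _ in range(4)],  # toy 4-pixel "X-ray"
     "diseased": random.random() < 0.1}              # toy ground truth
    for _ in range(1000)
]

def X(omega):
    """X : Omega -> R^d, the feature map."""
    return omega["pixels"]

def Y(omega):
    """Y : Omega -> {0, 1}, the label map."""
    return 1 if omega["diseased"] else 0

# One run of the random experiment picks one omega -- and therefore yields
# one (x, y) pair simultaneously. X and Y are coupled through omega.
omega = random.choice(population)
x, y = X(omega), Y(omega)
print(len(x), y)   # a 4-dimensional realisation and its label
```

Nothing about $X$ or $Y$ is random in itself; all the randomness lives in which $\omega$ the experiment produces.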
The Joint Distribution $P_{XY}$ — The God-Object of ML
Since $X$ and $Y$ are both defined on the same $\Omega$, we can define their joint distribution $P_{XY}$ — a single mathematical object that encodes the complete statistical relationship between inputs and outputs.
This is the god-object of machine learning. Everything — every task, every model, every algorithm — is a different question asked of $P_{XY}$. Want to classify? Ask for the conditional $P_{Y|X}$. Want to predict a number? Compute $\mathbb{E}[Y | X = x]$. Want to generate realistic data? Sample from the marginal $P_X$. Want to understand which inputs are informative? Look at the mutual information between $X$ and $Y$.
The tragedy: we never get to see $P_{XY}$ directly. We only see $D$ — $n$ samples drawn from it. All of ML is the art of drawing conclusions about $P_{XY}$ from this finite, noisy, incomplete glimpse.
| ML Task | What you're really estimating | Example |
|---|---|---|
| Classification | $P(Y \mid X)$ — probability of label given features | Is this NIFTY move a breakout or a trap? |
| Regression | $\mathbb{E}[Y \mid X]$ — expected output given features | What will NIFTY close at tomorrow? |
| Density Estimation | $P(X)$ — marginal distribution of features | How unusual is today's INDIA VIX reading? |
| Anomaly Detection | Regions where $P(X)$ is very low | Flag suspicious order flow on NSE |
| Generation | Sample new $x \sim P(X)$ | Simulate a realistic NIFTY return path |
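A sketch of two rows of this table in action, using a synthetic joint distribution (the simulator and all its parameters are invented for illustration, not calibrated to any market): the same bag of samples answers both a classification question and a density question.

```python
import random

random.seed(42)

# Draw n samples from a synthetic joint P_XY: x is a toy scalar "signal",
# y = 1 if the (noisy) next-day move is up. We know the joint only because
# we wrote the simulator; the estimators below see samples alone.
n = 50_000
samples = []
for _ in range(n):
    x = random.gauss(0.0, 1.0)                       # feature realisation
    up_prob = 0.5 + 0.15 * max(-1.0, min(1.0, x))    # true P(Y=1 | X=x)
    y = 1 if random.random() < up_prob else 0
    samples.append((x, y))

# Classification question: estimate P(Y=1 | X near 1) from the samples.
near_one = [y for x, y in samples if 0.9 < x < 1.1]
p_up = sum(near_one) / len(near_one)     # true value is 0.65

# Density question: estimate the marginal P(X > 2) -- how unusual is x?
p_tail = sum(1 for x, _ in samples if x > 2) / n   # true value ~0.023

print(round(p_up, 2), round(p_tail, 3))
```

Same data, two different questions asked of $P_{XY}$ — one conditional, one marginal. That is the whole table in miniature.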
For each scenario below, identify $\mathcal{X}$, $\mathcal{Y}$, what $P_{XY}$ represents, and which ML task is being performed.
Before each IPL match, you observe: pitch report (5 categorical features), toss result, playing XI on both sides (22 binary indicators), recent team form (win/loss over last 5 games). You want to predict whether the batting-first team scores above 180.
$\mathcal{X} \subset \mathbb{R}^{30}$ approximately (encoding all the features above). $\mathcal{Y} = \{0, 1\}$ (below 180, above 180). This is binary classification. We're estimating $P(Y=1 \mid X=x)$ — the probability of a high score given the match conditions. $P_{XY}$ encodes everything about how pitch conditions, team composition, and form collectively determine scoring outcomes.
At 9:15 AM each day, you observe: overnight SGX Nifty change, previous day's close-to-close return, INDIA VIX opening level, net FII cash market activity (previous session), and the PCR (put-call ratio) on the front-month contract. You want to predict the NIFTY return from 9:15 AM to 3:30 PM.
$\mathcal{X} \subset \mathbb{R}^5$. $\mathcal{Y} = \mathbb{R}$ (a continuous return). This is regression. We're estimating $\mathbb{E}[Y \mid X=x]$. $P_{XY}$ encodes the statistical relationship between opening market conditions and the intraday return that follows.
The i.i.d. Assumption — Powerful, Convenient, and Often Wrong
We said the dataset $D$ consists of observations drawn from $P_{XY}$. But we haven't said how they're drawn. The standard assumption — the one that makes the mathematics tractable — is that the data points are drawn independent and identically distributed, which we write as:

$$
(x_i, y_i) \stackrel{\text{i.i.d.}}{\sim} P_{XY}, \qquad i = 1, \ldots, n
$$
Three letters, two assumptions, and enormous weight riding on them. Let's earn each one.
Identically distributed means every $(x_i, y_i)$ is drawn from the same $P_{XY}$. The rules of the game don't change between observations. The distribution that generated the first data point is the same one that generated the last.
Independent means knowing one data point tells you nothing about another. Formally, for $i \neq j$ and any events $A$ and $B$:

$$
P\big((X_i, Y_i) \in A,\ (X_j, Y_j) \in B\big) = P\big((X_i, Y_i) \in A\big) \cdot P\big((X_j, Y_j) \in B\big)
$$
The observations don't talk to each other. They have no memory of each other. Whether patient 47 in the X-ray dataset has cancer tells you nothing about patient 48's diagnosis — they're unconnected strangers. Independence is a very reasonable assumption there. It is far less reasonable for NIFTY daily returns: volatility clusters, so a turbulent day makes the next day more likely to be turbulent, and the distribution itself drifts as market regimes change. Markets remember; patients in a queue do not.
So why do we use i.i.d. at all? Because without it, the mathematics of learning theory — error bounds, sample complexity, generalisation guarantees — becomes vastly harder to derive. The i.i.d. assumption is the price of tractability. It gives us clean theorems and usable algorithms. The practitioner's job is to know when that price is too high for the problem at hand, and to reach for tools designed for dependent data (time series models, recurrent networks, state space models) when it is.
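One way to see the independence failure numerically — a toy sketch, not a market model: generate a genuinely i.i.d. series and a volatility-clustered series, then compare the lag-1 autocorrelation of their squared values. Under i.i.d., squared values carry no memory; under clustering, they do.

```python
import random

random.seed(7)

def lag1_autocorr(xs):
    # Sample lag-1 autocorrelation: corr(x_t, x_{t+1}).
    m = sum(xs) / len(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs[:-1], xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    return num / den

n = 20_000

# Series A: genuinely i.i.d. Gaussian "returns".
iid = [random.gauss(0, 1.0) for _ in range(n)]

# Series B: volatility clusters. sigma toggles between a calm regime (0.5)
# and a turbulent one (2.5) -- toy dynamics, not calibrated to any market.
clustered, sigma = [], 0.5
for _ in range(n):
    if random.random() < 0.01:   # occasional regime switch
        sigma = 3.0 - sigma      # toggles 0.5 <-> 2.5
    clustered.append(random.gauss(0, sigma))

# If the draws were i.i.d., squared values would show no autocorrelation.
a_iid = lag1_autocorr([x * x for x in iid])
a_clust = lag1_autocorr([x * x for x in clustered])
print(round(a_iid, 3), round(a_clust, 3))   # near zero vs clearly positive
```

A diagnostic like this is often the first thing to run before trusting any i.i.d.-based guarantee on sequential data.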
The Fundamental Problem of ML, Stated Precisely
We now have all the ingredients for a precise statement of what ML is trying to do. It is worth reading slowly.
Given a dataset $D = \{(x_i, y_i)\}_{i=1}^n$ drawn i.i.d. from an unknown joint distribution $P_{XY}$ over $\mathcal{X} \times \mathcal{Y}$, estimate $P_{XY}$ — or the relevant conditional or marginal of it — well enough to make accurate predictions on new, unseen inputs drawn from the same $P_{XY}$.
This is, at its core, an estimation problem. We want to estimate a probability distribution from samples. The challenge: $P_{XY}$ could be any distribution over $\mathcal{X} \times \mathcal{Y}$ — an infinite-dimensional space of possibilities. Estimating an arbitrary distribution from $n$ finite samples is impossible in general. We need to constrain the problem.
The Parametric Approach — Making a Bet on the Shape of the World
The standard way to make the estimation problem tractable is to assume that $P_{XY}$ belongs to some parametric family — a specific collection of distributions indexed by a finite-dimensional parameter vector $\theta \in \Theta$:

$$
\mathcal{P} = \{ P_\theta : \theta \in \Theta \}, \qquad \Theta \subseteq \mathbb{R}^k
$$
For example: assume the data is Gaussian, $P_\theta = \mathcal{N}(\mu, \Sigma)$, so $\theta = (\mu, \Sigma)$ — a mean vector and a covariance matrix. Or assume the conditional $P_{Y|X}$ is a linear function of $X$ plus Gaussian noise, which leads to linear regression. The parametric assumption shrinks the problem from "find an arbitrary distribution over an infinite-dimensional space" to "find the right $k$-dimensional parameter vector."
With a parametric family in hand, the estimation problem becomes an optimisation problem. Define a distance $d(P_{XY}, P_\theta)$ that measures how far the parametric distribution $P_\theta$ is from the true unknown $P_{XY}$. Then find the parameter that minimises this distance:

$$
\theta^* = \arg\min_{\theta \in \Theta} \, d(P_{XY}, P_\theta)
$$
There's one immediate subtlety: we can't actually compute $d(P_{XY}, P_\theta)$, because $P_{XY}$ is unknown. We only have samples $D$. So in practice, we approximate $d(P_{XY}, P_\theta)$ using the data. How we do that approximation — and what it means mathematically — is the subject of Chapter 2.
Suppose we observe $n = 500$ daily NIFTY log-returns $\{x_1, \ldots, x_{500}\}$ and want to estimate their distribution $P_X$. We decide to use a parametric Gaussian family:

$$
\mathcal{P} = \{ \mathcal{N}(\mu, \sigma^2) : \mu \in \mathbb{R},\ \sigma > 0 \}, \qquad \theta = (\mu, \sigma^2)
$$
We are betting that NIFTY daily log-returns are drawn from some Gaussian distribution — that the true $P_X$ is one member of the family $\{\mathcal{N}(\mu, \sigma^2) : \mu \in \mathbb{R}, \sigma > 0\}$. This reduces the infinite-dimensional problem of estimating $P_X$ to a two-dimensional problem: find $(\mu^*, \sigma^{*2})$.
We want $\theta^* = (\mu^*, \sigma^{*2})$ such that $\mathcal{N}(\mu^*, \sigma^{*2})$ is "closest" to the true $P_X$ in some distance $d$. Using the sample mean and variance as an initial guess:

$$
\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2
$$
In Chapter 4, we'll see that this is exactly the Maximum Likelihood Estimator — the $\theta$ that minimises the KL divergence from $P_X$ to $P_\theta$ using only the samples in $D$.
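The estimator itself is a few lines. This sketch simulates stand-in returns (real data would be loaded from market history; the simulator's parameters are illustrative) and computes the sample mean and the biased sample variance — exactly the pair that Chapter 4 will identify as the Gaussian MLE.

```python
import math
import random

random.seed(1)

# Simulated stand-in for 500 daily log-returns: mean 0.05%, std 1% per day.
# (Real returns are fat-tailed; Gaussian here just exercises the estimator.)
returns = [random.gauss(0.0005, 0.01) for _ in range(500)]

# The parametric bet: P_X is some N(mu, sigma^2). The sample mean and the
# biased sample variance (divide by n, not n-1) are the Gaussian MLE.
n = len(returns)
mu_hat = sum(returns) / n
var_hat = sum((r - mu_hat) ** 2 for r in returns) / n
sigma_hat = math.sqrt(var_hat)

print(round(mu_hat, 4), round(sigma_hat, 4))   # near the simulator's (0.0005, 0.01)
```

Two numbers summarise 500 observations — that is the compression the parametric bet buys, and the next worked example shows what it costs.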
NIFTY returns are well-known to have fat tails — the probability of a 4-sigma daily move is much higher in reality than the Gaussian predicts. On 12 March 2020, NIFTY fell 8.3% in a single session. Under a Gaussian with mean $0.05\%$ and standard deviation $1\%$ per day (roughly calibrated to historical data), an 8.3% drop is a move of more than 8 standard deviations — its probability is astronomically small, effectively zero. The Gaussian family simply cannot express this.
We now have the full picture of what machine learning is, mathematically. Data is samples from an unknown $P_{XY}$. Learning is estimating $P_{XY}$, or the right conditional of it, by fitting a parametric family using an optimisation criterion. Generalisation — performing well on new, unseen data — is the proof that we've learned the right thing. Everything else is detail.
But there is one question we haven't answered: what should the distance $d$ be? How do you measure the gap between two probability distributions when you can only see samples from one of them, and the other is your own parametric guess? That is the question that unlocks everything. It is the subject of Chapter 2.
A Zomato engineer builds a model to predict delivery time. She trains it on data collected in Delhi during October–November 2024 and deploys it across India in June 2025. In Mumbai during the monsoon, the model performs terribly. Which assumption of the i.i.d. framework is most directly violated?
The identical distribution assumption requires that training and deployment data come from the same $P_{XY}$. Here, the training data was collected in a specific city, season, and time period. The deployment context — different city, monsoon conditions, different traffic patterns, different restaurant density — is a genuinely different $P_{XY}$. The relationship between features (distance, time of day, restaurant) and delivery time is structurally different in monsoon Mumbai versus autumn Delhi.
This is called distribution shift or covariate shift, and it is one of the most common failure modes of deployed ML systems. Option A (independence) might also be mildly violated — nearby deliveries during flooding may be correlated — but the primary failure is the distributional mismatch between training and deployment.
An ML practitioner wants to predict whether an IPL batsman will score above 50 in a given match. She defines $\mathcal{X}$ as the batsman's career statistics and $\mathcal{Y} = \{0, 1\}$. What is she actually estimating from $P_{XY}$?
Classification is always about estimating $P(Y \mid X)$ — the probability of a particular label given the observed features. When the model sees Virat Kohli's career statistics (average 58, strike rate 130, ...) and outputs "72% probability of scoring above 50," it is evaluating $P(Y=1 \mid X = x_\text{Kohli})$.
Option B — estimating the full joint $P_{XY}$ — would also work and is more powerful, but it's harder and often unnecessary. If you only need to classify, you only need the conditional. Option A ($P_X$) is the distribution of career statistics themselves — this is what you'd estimate if you were doing density estimation or anomaly detection, not classification. Option D ($P_Y$) is just the base rate — the fraction of innings above 50 regardless of who's batting — which ignores the feature information entirely.
A trader trains a NIFTY direction model on daily data from 2015–2022 and achieves 58% accuracy on a held-out test set from the same period. In live trading from January 2023 onwards, the model achieves only 51% — barely above random. Which of the following is the most likely explanation, and why?
The held-out test set comes from the same 2015–2022 period as the training data, so if the model performs well there, overfitting is less likely to be the primary issue. The sharp drop in live performance starting 2023 is the hallmark of distribution shift: the $P_{XY}$ underlying NIFTY returns in 2023–2024 (rising interest rate environment, post-COVID normalisation, changed FII patterns) is measurably different from 2015–2022.
The model learned to exploit statistical regularities that were genuinely present in the training period but faded or reversed in the deployment period. This is also known as non-stationarity in financial time series, and it is arguably the central challenge of all quantitative trading. The correct response is not to build a better model on the old data, but to continuously retrain on recent data, use shorter lookback windows, or build models that explicitly account for regime changes.
A researcher fits a Gaussian model $\mathcal{N}(\hat\mu, \hat\sigma^2)$ to 10 years of NIFTY daily log-returns and uses it to estimate the probability of a daily move worse than $-5\%$. The Gaussian gives a probability of $3.2 \times 10^{-7}$. In reality, such moves have occurred 4 times in 10 years (roughly 2,500 trading days). What is happening, and what is its name?
The empirical frequency of $-5\%$ days is $4/2500 \approx 1.6 \times 10^{-3}$ — nearly four orders of magnitude larger than the Gaussian prediction of $3.2 \times 10^{-7}$, a factor of roughly 5,000. This is not a matter of data quantity or changing regimes. It is a fundamental mismatch between the shape of the parametric family and the shape of the true distribution.
The Gaussian distribution has exponentially thin tails — it places essentially zero probability on events beyond 3–4 standard deviations. But financial returns have fat tails (formally, their kurtosis exceeds 3): extreme events occur far more frequently than any Gaussian would predict. This is documented in every liquid market on earth and goes by names like leptokurtosis or heavy-tailed distributions.
This is model misspecification: the true $P_X$ lies outside your parametric family $\{\mathcal{N}(\mu, \sigma^2)\}$. No amount of data, no clever optimisation, will fix this. The fix is to choose a richer family — Student-t distributions, mixture models, or non-parametric approaches — that can actually express fat tails.
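The mismatch is easy to quantify without any libraries: the Gaussian CDF can be written through the error function, so we can compare the model's tail probability with the empirical frequency quoted above. The fitted parameters here (mean 0.05%, std 1% per day) are illustrative stand-ins, not an actual fit.

```python
import math

def norm_cdf(x, mu, sigma):
    # Gaussian CDF via the error function -- pure standard library.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Illustrative Gaussian fit: mean 0.05% and standard deviation 1% per day.
mu, sigma = 0.0005, 0.01

p_gauss = norm_cdf(-0.05, mu, sigma)   # P(daily return <= -5%) under the fit
p_empirical = 4 / 2500                 # observed frequency from the text

print(f"Gaussian tail:  {p_gauss:.1e}")     # on the order of 1e-7
print(f"Empirical tail: {p_empirical:.1e}") # 1.6e-3
print(f"Underestimated by a factor of about {p_empirical / p_gauss:,.0f}")
```

No re-estimation of $(\mu, \sigma)$ closes a gap this large; only changing the family — Student-t, mixtures, non-parametric estimators — can.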
Work through these independently. Q1–3 test direct understanding. Q4–7 require connecting ideas across sections. Q8–10 will make you think beyond the chapter.