Intermediate Chapter 06 of 08 · Mathematical Foundations of ML

Non-parametric Density Estimation

Every chapter so far has assumed a parametric family — Gaussian, Bernoulli, mixture of Gaussians. What if you refuse to make that assumption? What if you let the data itself determine the shape of the density, with no model imposed? This chapter builds the answer from a single counting argument.

Abandoning the Parametric Assumption

Chapters 3 through 5 all operated within the parametric framework: choose a model family $\{P_\theta\}$, then find the best $\theta$ by MLE, MAP, or EM. The assumption that the true distribution belongs to your chosen family is always implicit — and always potentially wrong.

Non-parametric density estimation makes no such assumption. Given data $D = \{x_1, \ldots, x_n\}$ drawn i.i.d. from an unknown $P_x$, we want to estimate $P_x(x)$ at any query point $x$ โ€” directly from the data, with no parametric model in between. The data determines the shape.

Think of the difference this way. A parametric estimator says: "I believe the data is Gaussian. Let me find the best Gaussian." A non-parametric estimator says: "I make no assumption about the shape. I'll estimate the density at any point $x$ by counting how many data points are near $x$." The first approach can be very efficient when the assumption is right. The second is more flexible but requires more data to achieve the same accuracy. The bias-variance tradeoff again, in a new form.

The Fundamental Counting Argument

The core idea of all non-parametric density estimation is elegant and elementary. Fix a region $R$ in the input space around the query point $x$. Let $P$ be the probability that a randomly drawn data point falls within $R$:

$$P = \int_R P_x(x)\, dx$$

Now suppose we have $n$ data points drawn i.i.d. from $P_x$. Whether each point falls in $R$ is a Bernoulli trial with success probability $P$. If $k$ points fall in $R$, the MLE for $P$ is simply the empirical frequency:

$$\hat{P} = \frac{k}{n}$$

So we have $\frac{k}{n} \approx \int_R P_x(x)\, dx$. If $R$ is small enough that $P_x(x)$ is approximately constant within $R$ — a reasonable assumption for a sufficiently small region — then:

$$\int_R P_x(x)\, dx \approx P_x(x) \cdot V$$

where $V$ is the volume of region $R$. Combining these two approximations gives the fundamental non-parametric density estimate:

$$\boxed{P_x(x) \approx \frac{k}{n \cdot V}}$$

Density at $x$ $\approx$ fraction of data points near $x$, divided by the volume of the neighbourhood. This single formula is the foundation of every non-parametric density estimator.

The formula $P_x(x) \approx k/(nV)$ has a beautiful physical interpretation. Imagine dropping $n$ marbles randomly onto the floor according to distribution $P_x$. To estimate the density at a point $x$, draw a circle of area $V$ around $x$ and count how many marbles landed inside ($k$). The density estimate is $k/(nV)$ — the number of marbles per unit area near $x$. More marbles in the region means higher density. Larger region means less resolution but more stable estimates.
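The counting argument is short enough to check directly in code. A minimal 1D sketch (the sampling distribution, sample size, and function name here are ours, chosen purely for illustration): at the mode of a standard normal the true density is $1/\sqrt{2\pi} \approx 0.399$, and the count-based estimate lands close to it.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=10_000)  # n i.i.d. draws from an "unknown" P_x

def density_estimate(x, data, half_width):
    """P_x(x) ~ k / (n * V) for the 1D region [x - half_width, x + half_width]."""
    k = np.sum(np.abs(data - x) <= half_width)  # k: data points that fall in the region
    V = 2 * half_width                          # V: length (1D volume) of the region
    return k / (len(data) * V)

# True standard-normal density at 0 is 1/sqrt(2*pi) = 0.3989...
print(density_estimate(0.0, data, half_width=0.1))
```

Shrinking `half_width` towards zero makes the estimate noisier; growing it smooths the estimate out — the bias-variance tradeoff discussed below.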

Two Strategies: Fix $V$ or Fix $k$

The formula $P_x(x) \approx k/(nV)$ has two free quantities: $k$ (the count) and $V$ (the volume). We can only choose one — the other is determined by the data. This gives rise to two families of non-parametric estimators:

$$\begin{array}{llll} \text{Strategy} & \text{Fix} & \text{Determine} & \text{Method} \\ \hline \text{Parzen windows} & V \text{ (volume of region)} & k \text{ (count points inside)} & \text{Fix a window, count neighbours} \\ \text{K-nearest neighbours} & k \text{ (number of neighbours)} & V \text{ (grow until } k \text{ points inside)} & \text{Fix count, grow neighbourhood} \end{array}$$

Both give density estimates via the same formula. They differ in what they hold fixed and what they let the data determine.

Parzen Window Estimation

Fix a region $R$ of volume $V$ centred at the query point $x$, and count how many of the $n$ training points fall inside. The simplest choice: a $d$-dimensional hypercube with side length $h$, giving $V = h^d$.

To count the points inside, define the window function:

$$\phi(u) = \begin{cases} 1 & \text{if } |u_j| \leq \frac{1}{2} \quad \text{for all } j = 1, \ldots, d \\ 0 & \text{otherwise} \end{cases}$$

Then $\phi\!\left(\frac{x - x_i}{h}\right) = 1$ if and only if $x_i$ is within the hypercube of side $h$ centred at $x$. Summing over all training points gives $k = \sum_i \phi\!\left(\frac{x-x_i}{h}\right)$. The Parzen density estimate is:

The Parzen window density estimate at query point $x$ is: $$P_x(x) = \frac{1}{n \cdot h^d} \sum_{i=1}^n \phi\!\left(\frac{x - x_i}{h}\right)$$ where $h$ is the bandwidth (side length of the hypercube) and $\phi$ is the window function. The bandwidth $h$ is the single hyperparameter controlling the smoothness of the estimate.
The hypercube window function is discontinuous — the density estimate jumps sharply at the boundaries of each hypercube. This produces density estimates that are not smooth and can have artefacts. The fix: replace the hard indicator $\phi$ with a smooth kernel. The Gaussian kernel $\phi(u) = \frac{1}{(2\pi)^{d/2}}\exp\!\left(-\tfrac{1}{2}\|u\|_2^2\right)$ assigns soft weights to each training point based on its distance from $x$, producing smooth density estimates. This is the basis of Kernel Density Estimation (KDE) — widely used in practice for everything from financial return distributions to election modelling.
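A minimal 1D sketch of both windows, assuming the hypercube indicator above and a normalised Gaussian kernel (the function names and the five sample values are ours):

```python
import numpy as np

def parzen_hypercube(x, data, h):
    """Hard-window Parzen estimate: k / (n * h) in one dimension."""
    inside = np.abs((x - data) / h) <= 0.5              # phi(u) = 1 iff |u| <= 1/2
    return np.sum(inside) / (len(data) * h)

def parzen_gaussian(x, data, h):
    """Smooth KDE: each training point contributes a Gaussian bump of width h."""
    u = (x - data) / h
    weights = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # standard normal density
    return np.sum(weights) / (len(data) * h)

data = np.array([-1.5, -0.5, 0.0, 0.8, 1.6])  # five returns, in percent
print(parzen_hypercube(0.5, data, h=1.0))     # hard count: 2 / (5 * 1) = 0.4
print(parzen_gaussian(0.5, data, h=1.0))      # smooth estimate at the same point
```

The Gaussian version never returns exactly zero and varies smoothly in $x$, which is why it is preferred in practice.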
📊 Bandwidth Too Small ($h \to 0$)

Each training point gets its own spike. The estimate perfectly reproduces the training data but has no generalisability — it cannot estimate density between training points. Variance is very high.

Analogous to memorising training data in ERM. Zero training error, useless generalisation.

📊 Bandwidth Too Large ($h \to \infty$)

Every training point contributes to every query point. The estimate becomes flat — approximately uniform — regardless of the true shape. Bias is very high.

Analogous to a hypothesis class that is too simple. The model cannot capture the true structure.

KDE is used routinely in Indian quantitative finance. When a derivatives desk at a Mumbai prop firm wants to estimate the implied return distribution from 3 years of NIFTY daily returns, they run KDE with a Gaussian kernel — not a parametric Gaussian fit. The KDE faithfully captures the fat tails, the negative skew, and the volatility clustering that a single Gaussian would miss. The bandwidth is typically chosen by cross-validation or Silverman's rule of thumb: $h = 1.06\,\hat{\sigma}\, n^{-1/5}$.
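Silverman's rule is one line of code. A sketch on simulated returns (the function name and the simulated data are ours; plugging in a real return series would work the same way):

```python
import numpy as np

def silverman_bandwidth(returns):
    """Silverman's rule of thumb for 1D KDE: h = 1.06 * sigma_hat * n^(-1/5)."""
    n = len(returns)
    sigma_hat = np.std(returns, ddof=1)   # sample standard deviation
    return 1.06 * sigma_hat * n ** (-1 / 5)

# Illustrative: 250 simulated daily returns with roughly 1.2% volatility
returns = np.random.default_rng(1).normal(0.0, 0.012, size=250)
print(silverman_bandwidth(returns))
```

Note the $n^{-1/5}$ factor: the optimal bandwidth shrinks slowly as data accumulates, so the estimator sharpens without collapsing into spikes.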

K-Nearest Neighbour Density Estimation

The second strategy fixes $k$ and grows the volume $V$ until it captures exactly $k$ training points. At a dense region of the input space, $V$ will be small — many points are nearby. At a sparse region, $V$ will be large — you need to look further to find $k$ neighbours.

The $k$-nearest neighbour (KNN) density estimate at query point $x$ is: $$P_x(x) = \frac{k}{n \cdot V(x)}$$ where $V(x)$ is the volume of the smallest ball around $x$ that contains exactly $k$ training points. Unlike Parzen windows, the neighbourhood size $V(x)$ adapts to the local data density — it contracts in dense regions and expands in sparse ones.

This adaptive neighbourhood is the key advantage of KNN over Parzen windows. In regions where the data is dense, KNN uses a small neighbourhood and achieves high resolution. In sparse regions, it automatically widens the neighbourhood to maintain a stable estimate. Parzen windows use the same $V$ everywhere, which can lead to over-smoothing in dense regions or under-smoothing in sparse ones.

KNN density estimation has a simple mental model. To estimate the density at $x$: find the $k$ nearest training points. The density is inversely proportional to how far you had to look. If your 10 nearest neighbours are all within 0.1 units of $x$, the density is high. If you have to go 5 units away to find your 10th neighbour, the density is low. The density estimate adapts automatically — no bandwidth to tune, just one hyperparameter $k$.
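A minimal 1D sketch of the fixed-$k$ strategy (the function name and the small sample are ours): sort the distances, read off the distance to the $k$-th neighbour, and that radius defines $V(x)$.

```python
import numpy as np

def knn_density(x, data, k):
    """KNN density estimate in 1D: grow an interval around x until it holds k points."""
    dists = np.sort(np.abs(data - x))   # distances to all n training points
    radius = dists[k - 1]               # distance to the k-th nearest neighbour
    V = 2 * radius                      # length of the smallest interval holding k points
    return k / (len(data) * V)

data = np.array([-1.5, -0.5, 0.0, 0.8, 1.6])  # small 1D sample, in percent
print(knn_density(0.5, data, k=2))
```

One caveat this sketch ignores: if the query coincides with $k$ or more data points, the radius is zero and the estimate blows up — one reason the KNN density estimate is not a proper density.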

The KNN Classifier: Bayes Optimal from Counting

The most important application of KNN density estimation in this course is what happens when you use it to implement the Bayes classifier. This is where the chapter's payoff lives.

Consider $m$-class classification. Given a query point $x$, place a volume $V$ around it and capture $k$ training points. Let $k_i$ be the number of those $k$ points with class label $i$. Then $\sum_{i=1}^m k_i = k$.

Derivation · KNN Classifier as the Bayes Classifier · Non-parametric → Plug into Bayes
Step 1 — KNN density estimate for the joint distribution

Using the KNN estimate $P_x(x) = k/(nV)$, the joint density of $x$ and class label $y_i$ is estimated by counting only points of class $i$ within the volume:

$$\hat{P}(x, y_i) = \frac{k_i}{n \cdot V}$$
Step 2 — Estimate the posterior $P(y_i \mid x)$ via Bayes' theorem
$$\hat{P}(y_i \mid x) = \frac{\hat{P}(x, y_i)}{\sum_{j=1}^m \hat{P}(x, y_j)} = \frac{k_i / (nV)}{\sum_{j=1}^m k_j / (nV)} = \frac{k_i}{\sum_j k_j} = \frac{k_i}{k}$$

The $nV$ factors cancel. The posterior estimate is simply the fraction of the $k$ nearest neighbours that belong to class $i$.

Step 3 — Apply the Bayes classifier rule

The Bayes classifier predicts the class with the highest posterior:

$$h_B(x) = \underset{i}{\arg\max}\; \hat{P}(y_i \mid x) = \underset{i}{\arg\max}\; \frac{k_i}{k} = \underset{i}{\arg\max}\; k_i$$
The KNN classifier predicts the class that has the most representatives among the $k$ nearest neighbours. This is the Bayes classifier with KNN density estimates. The derivation shows that KNN classification is not a heuristic — it is a principled implementation of the optimal Bayes decision rule using non-parametric density estimation. Majority vote among neighbours is the Bayes decision rule in disguise.
This derivation should feel satisfying. We started in Chapter 2 with the abstract Bayes classifier $h_B(x) = \arg\max_y P(y|x)$ — theoretically optimal, practically useless without knowing $P_{XY}$. We then spent Chapters 4 and 5 learning to estimate $P_{XY}$ parametrically. Now KNN gives us a completely different path to the same destination: estimate $P(y|x)$ non-parametrically by counting class representatives among neighbours, then apply Bayes. The KNN classifier is the Bayes classifier all the way down.
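The whole derivation collapses into a few lines of code. A sketch (the function name and toy data are ours): compute distances, take the labels of the $k$ nearest points, and read off $\hat{P}(y_i \mid x) = k_i / k$.

```python
import numpy as np
from collections import Counter

def knn_classify(x, X, y, k):
    """Predict via the posterior P(y_i | x) = k_i / k over the k nearest neighbours."""
    dists = np.linalg.norm(X - x, axis=1)        # distance to every training point
    nearest = y[np.argsort(dists)[:k]]           # labels of the k nearest neighbours
    posteriors = {c: n / k for c, n in Counter(nearest).items()}
    return max(posteriors, key=posteriors.get), posteriors  # arg max of the posterior

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9], [0.9, 1.1]])
y = np.array(["A", "A", "B", "B", "B"])
pred, post = knn_classify(np.array([0.95, 0.95]), X, y, k=3)
print(pred, post)   # all 3 nearest neighbours are "B"
```

Note that $n$ and $V$ never appear: they cancelled in Step 2, so classification needs only the neighbour counts.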
Worked Example 1 · Parzen Window Density Estimate on NIFTY Returns 🇮🇳 · Bandwidth Selection

Five daily NIFTY log-returns: $D = \{-1.5\%, -0.5\%, 0.0\%, +0.8\%, +1.6\%\}$. We want to estimate the density at $x = 0.5\%$ using a 1D Parzen window with bandwidth $h$.

Case A — Bandwidth $h = 1.0\%$

The window at $x = 0.5\%$ covers $[0.5 - 0.5, 0.5 + 0.5] = [0.0\%, 1.0\%]$. Count points in this interval: $\{0.0\%, 0.8\%\}$ โ€” two points. $k = 2$, $V = h = 1.0\%$.

$$\hat{P}_x(0.5\%) = \frac{k}{n \cdot h} = \frac{2}{5 \times 1.0\%} = 0.40 \text{ per percent}$$
Case B — Bandwidth $h = 0.4\%$

The window covers $[0.3\%, 0.7\%]$. Check each data point: $-1.5$, $-0.5$, $0.0$, $+0.8$, $+1.6$ — none falls in $[0.3, 0.7]$ (the query $0.5\%$ is itself not a data point), so $k = 0$.

$$\hat{P}_x(0.5\%) = \frac{0}{5 \times 0.4\%} = 0$$

The narrow window finds no data and estimates zero density — clearly wrong. The true density is positive in this region.

Case C — Bandwidth $h = 3.0\%$

The window covers $[-1.0\%, 2.0\%]$. Count points: $\{-0.5\%, 0.0\%, +0.8\%, +1.6\%\}$ — four points. $k = 4$.

$$\hat{P}_x(0.5\%) = \frac{4}{5 \times 3.0\%} = 0.267 \text{ per percent}$$
Key lesson: The optimal bandwidth is neither too small (zero estimates, high variance) nor too large (flat estimates, high bias). Case A with $h = 1.0\%$ gives the most sensible estimate for this small dataset. In practice, bandwidth is chosen by cross-validation: hold out each point, estimate the density at it using the remaining points, and choose $h$ to maximise the held-out log-likelihood.
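The three cases can be checked numerically; a short sketch with all returns in percent:

```python
import numpy as np

data = np.array([-1.5, -0.5, 0.0, 0.8, 1.6])   # NIFTY log-returns, in percent
x = 0.5                                        # query point

for h in (1.0, 0.4, 3.0):                      # Cases A, B, C
    k = int(np.sum(np.abs(x - data) <= h / 2)) # points inside the window of side h
    print(f"h = {h}%: k = {k}, estimate = {k / (len(data) * h):.3f} per percent")
```

The loop reproduces the hand counts: $k = 2$, $0$, and $4$ for the three bandwidths.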
Worked Example 2 · KNN Classifier for NIFTY Market State 🇮🇳 · 3-Class Classification · $K = 3$

We classify today's market state as Bullish (B), Neutral (N), or Bearish (R) based on two features: $x_1$ = INDIA VIX level, $x_2$ = NIFTY 5-day momentum (%). Six labelled training days:

$$\begin{array}{lccc} \text{Day} & x_1\text{ (VIX)} & x_2\text{ (Mom\%)} & \text{Label} \\ \hline 1 & 13 & +2.1 & B \\ 2 & 14 & +1.5 & B \\ 3 & 16 & -0.3 & N \\ 4 & 17 & -0.8 & N \\ 5 & 24 & -2.5 & R \\ 6 & 26 & -3.1 & R \end{array}$$
Query point: VIX = 15, Momentum = +0.6%

Compute Euclidean distances from the query $(15, 0.6)$ to all training points. (In practice you would normalise the features to a common scale first — here the VIX range is $\approx 13$ while the momentum range is $\approx 5.2$ — but we use the raw values to keep the arithmetic transparent.)

$$d_1 = \sqrt{(15-13)^2 + (0.6-2.1)^2} = \sqrt{4 + 2.25} = 2.50$$ $$d_2 = \sqrt{(15-14)^2 + (0.6-1.5)^2} = \sqrt{1 + 0.81} = 1.35$$ $$d_3 = \sqrt{(15-16)^2 + (0.6+0.3)^2} = \sqrt{1 + 0.81} = 1.35$$ $$d_4 = \sqrt{(15-17)^2 + (0.6+0.8)^2} = \sqrt{4 + 1.96} = 2.44$$ $$d_5 = \sqrt{(15-24)^2 + (0.6+2.5)^2} = \sqrt{81 + 9.61} = 9.52$$ $$d_6 = \sqrt{(15-26)^2 + (0.6+3.1)^2} = \sqrt{121 + 13.69} = 11.61$$
$K = 3$ nearest neighbours

Sorted distances: Day 2 (1.35, B), Day 3 (1.35, N), Day 4 (2.44, N). The 3 nearest neighbours are: B, N, N.

$$k_B = 1, \quad k_N = 2, \quad k_R = 0$$ $$\hat{P}(B|x) = 1/3, \quad \hat{P}(N|x) = 2/3, \quad \hat{P}(R|x) = 0$$
KNN prediction: Neutral. With VIX at 15 (mildly elevated) and positive momentum of +0.6%, the market state is classified as Neutral — sitting between the clearly Bullish (low VIX, high momentum) and clearly Bearish (high VIX, negative momentum) training examples. The 2/3 posterior probability for Neutral reflects genuine uncertainty — the query point is in a transitional zone between regimes.
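The distance table and the majority vote can be verified in a few lines (a sketch; as in the worked example, the features are left unnormalised):

```python
import numpy as np

X = np.array([[13, 2.1], [14, 1.5], [16, -0.3],
              [17, -0.8], [24, -2.5], [26, -3.1]])  # (VIX, momentum %) per training day
y = np.array(["B", "B", "N", "N", "R", "R"])
query = np.array([15.0, 0.6])

dists = np.linalg.norm(X - query, axis=1)           # d_1 .. d_6 from the table
nearest = y[np.argsort(dists)[:3]]                  # labels of the 3 nearest days
posteriors = {c: float(np.mean(nearest == c)) for c in ("B", "N", "R")}
print(np.round(dists, 2))
print(posteriors)   # B: 1/3, N: 2/3, R: 0 -> predict Neutral
```

Days 2 and 3 tie at distance 1.35 to two decimal places, but the tie is harmless here: both orderings leave the same three labels in the neighbourhood.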
Chapter 7 Preview

Non-parametric methods are powerful but have a fundamental limitation: they have no learnable parameters. At test time, every prediction requires searching through all training data — computation scales with $n$. They also scale poorly with dimension $d$ (the curse of dimensionality). Chapter 7 returns to parametric models — but now with a complete understanding of what we want from them: linear regression, logistic regression, and the bias-variance tradeoff made fully rigorous.


Practice Problems
4 questions · Chapter 06
ml / mathematical-foundations / ch06 / q01 ★ Conceptual

In a Parzen window estimator, increasing the bandwidth $h$ has what effect on the density estimate?

A
Decreases bias and increases variance — more data points contribute, improving accuracy.
B
Increases bias and decreases variance — the estimate becomes smoother and flatter, averaging over more data but losing local structure.
C
Has no effect on bias or variance — only the number of training points matters.
D
Increases both bias and variance — more data in the window makes the estimate less stable.
Answer: B.

Increasing $h$ makes the window larger, so more training points contribute to the estimate at every query point. This reduces variance — the estimate is more stable because it averages over more data. But it also increases bias: by averaging over a wider region, the estimate loses the ability to capture local structure. In the extreme $h \to \infty$, every training point contributes equally everywhere, producing a flat estimate regardless of the true shape — maximum bias, minimum variance.

This is the bias-variance tradeoff in density estimation, directly analogous to hypothesis class complexity in ERM. The bandwidth $h$ is the "complexity parameter" of the Parzen estimator.
ml / mathematical-foundations / ch06 / q02 ★ Conceptual

The KNN density estimator uses an adaptive neighbourhood — the volume $V(x)$ changes with the query point $x$. What is the key advantage of this over a fixed-volume Parzen window?

A
KNN is always more accurate than Parzen windows regardless of the data distribution.
B
KNN requires no hyperparameter, while Parzen windows require choosing $h$.
C
KNN automatically uses finer resolution in dense regions and coarser resolution in sparse regions, giving better estimates in both. A fixed-volume window can over-smooth dense regions or give zero estimates in sparse ones.
D
KNN is computationally cheaper because it only looks at $k$ points instead of all $n$.
Answer: C.

The adaptive neighbourhood is KNN's defining advantage. In a dense region, the $k$ nearest neighbours are very close to the query point — $V(x)$ is small, giving a fine-resolution estimate. In a sparse region, you need to search further — $V(x)$ is large, giving a smoother but stable estimate. A fixed-bandwidth Parzen window cannot do this: it either has $h$ too small (zero estimates in sparse regions) or $h$ too large (over-smoothed estimates in dense regions).

Option B is wrong: KNN still has a hyperparameter — $k$. Choosing $k$ is analogous to choosing $h$. Option D is wrong: KNN still requires computing distances to all $n$ training points to find the $k$ nearest ones.
ml / mathematical-foundations / ch06 / q03 ★★ Computational

A KNN classifier with $K = 5$ is applied to a query point. Among the 5 nearest neighbours: 3 have label "NIFTY Up" and 2 have label "NIFTY Down". What does the classifier predict, and what are the posterior probability estimates for each class?

A
Predicts Up with $P(\text{Up}|x) = 1.0$, $P(\text{Down}|x) = 0.0$ — majority vote means the minority is ignored.
B
Predicts Up with $P(\text{Up}|x) = 3/5 = 0.6$ and $P(\text{Down}|x) = 2/5 = 0.4$ — the posterior is the fraction of neighbours in each class.
C
Predicts Up with $P(\text{Up}|x) = 0.5$ — with $K = 5$, the classifier always outputs 50-50 for binary problems.
D
Cannot determine the posterior without knowing the volume $V$ of the neighbourhood.
Answer: B.

From the derivation: $\hat{P}(y_i|x) = k_i/k$. With $k_{\text{Up}} = 3$ and $k_{\text{Down}} = 2$ out of $k = 5$ total neighbours, the posterior estimates are $\hat{P}(\text{Up}|x) = 3/5 = 0.6$ and $\hat{P}(\text{Down}|x) = 2/5 = 0.4$. The classifier predicts the class with the higher posterior — Up — which corresponds to simple majority vote.

Option A is wrong: the minority class is not ignored — it contributes to the posterior estimate. The KNN posterior gives probability estimates, not just hard labels. Option D is wrong: the volume $V$ cancels in the ratio $k_i/(nV) \div \sum_j k_j/(nV) = k_i/k$. The volume is irrelevant for classification.
ml / mathematical-foundations / ch06 / q04 ★★ Conceptual

A quant wants to estimate the density of NIFTY daily returns at the point $x = -3\%$ (a large down day) using 500 historical observations. She has very few observations near $-3\%$. Which non-parametric approach is most appropriate, and why?

A
Parzen window with small bandwidth — to avoid contaminating the estimate with returns far from $-3\%$.
B
Parzen window with large bandwidth — since there are few observations, a wider window captures more data.
C
KNN with an appropriate $k$ — the adaptive neighbourhood automatically expands to find $k$ neighbours in the sparse tail region, giving a stable estimate without manual bandwidth tuning for this specific region.
D
Neither — non-parametric methods cannot estimate density in tail regions with few observations.
Answer: C.

The tail region around $-3\%$ is sparse — few observations, so any fixed-bandwidth Parzen window has a problem. Small $h$: finds almost no points, gives an estimate near zero (and exactly zero for some query points). Large $h$: captures many returns far from $-3\%$, contaminating the tail estimate with the bulk of the return distribution.

KNN handles this naturally: it grows the neighbourhood until it finds $k$ points, however far that requires searching. The resulting $V$ is large in the tail (reflecting sparsity), and the estimate $k/(nV)$ is correspondingly small — correctly indicating low tail density. The adaptive neighbourhood is exactly what's needed in regions with non-uniform data density, which describes financial return distributions perfectly: dense near the mean, sparse in the tails.

Option D is wrong: non-parametric methods can estimate tail density; they just need enough total data and the right approach.

Terminal Questions — Chapter 06 · 10 problems · No answers given

Q1–3 are direct applications of the $k/(nV)$ formula. Q4–7 require connecting non-parametric estimation to the broader framework. Q8–10 explore the limits of non-parametric methods.

1
Given 8 data points in 1D: $\{-3, -1, 0, 0.5, 1, 2, 4, 6\}$. Estimate the density at $x = 1$ using a Parzen window with $h = 2$. Show your counting of $k$ and compute $\hat{P}_x(1)$. (Easy)
2
For the same dataset, estimate the density at $x = 1$ using KNN with $k = 3$. Find the 3 nearest neighbours, compute $V$ (the volume of the smallest interval containing them), and apply the formula $P_x(x) = k/(nV)$. (Easy)
3
A KNN classifier with $K = 7$ is applied to a query point. Among the 7 nearest neighbours: 4 labelled "Up", 2 "Down", 1 "Flat". (a) What does the classifier predict? (b) What are $\hat{P}(\text{Up}|x)$, $\hat{P}(\text{Down}|x)$, $\hat{P}(\text{Flat}|x)$? (c) Verify that the posteriors sum to 1. (Easy)
4
The Parzen window estimate can be written as $\hat{P}_x(x) = \frac{1}{n}\sum_i \frac{1}{h^d}\phi\!\left(\frac{x-x_i}{h}\right)$. Show that this is a valid density estimate: it integrates to 1 over all of $\mathbb{R}^d$, provided $\phi$ itself integrates to 1 (i.e. $\phi$ is a proper density). What does this tell you about the choice of Gaussian vs hypercube window functions? (Medium)
5
Explain the "curse of dimensionality" for non-parametric density estimation. If $n = 1000$ data points are spread uniformly over $[0,1]^d$, how many points fall in a hypercube of side $h = 0.1$ centred at a query point? Compute this for $d = 1, 2, 5, 10$ and comment on what happens to $k$ and therefore the density estimate as $d$ grows. (Medium)
6
The KNN density estimate $P_x(x) = k/(nV(x))$ is not a proper density — it does not integrate to 1 over all of $\mathbb{R}^d$. Show this by constructing a simple 1D example and computing the integral numerically. Does this limitation affect the KNN classifier? Why or why not? (Medium)
7
The KNN classifier with $K = 1$ is called the "1-NN classifier". Show that the 1-NN classifier achieves zero training error on any dataset with no duplicate points. Then explain why this makes the 1-NN classifier highly susceptible to overfitting, and what the relationship is between $K$ and the bias-variance tradeoff in KNN classification. (Medium)
8
Compare Parzen window density estimation to MLE with a Gaussian model for NIFTY daily returns. Under what conditions would you prefer the non-parametric approach? Under what conditions would the parametric MLE be better? Your answer should address: sample size, tail behaviour, computational cost, and interpretability. (Hard)
9
Silverman's rule of thumb for 1D KDE bandwidth selection is $h^* = 1.06\,\hat{\sigma}\, n^{-1/5}$, where $\hat{\sigma}$ is the sample standard deviation. (a) Show that this bandwidth goes to zero as $n \to \infty$ — meaning the Parzen estimator becomes consistent. (b) Explain intuitively why the optimal bandwidth decreases with more data. (c) Compute $h^*$ for $n = 250$ NIFTY daily returns with $\hat{\sigma} = 1.2\%$. (Hard)
10
The Cover-Hart theorem states that as $n \to \infty$, the 1-NN classifier achieves error rate at most $2 \times R^*$ where $R^*$ is the Bayes error. This means the 1-NN classifier is within a factor of 2 of optimal, using no assumptions about $P_{XY}$. (a) Argue intuitively why the nearest neighbour approaches the query point as $n \to \infty$. (b) Explain why the 1-NN error approaches $2R^*$ rather than $R^*$ — what is the source of the factor of 2? (c) What does this result say about the value of parametric assumptions when data is plentiful? (Hard)