Non-parametric Density Estimation
Every chapter so far has assumed a parametric family: Gaussian, Bernoulli, mixture of Gaussians. What if you refuse to make that assumption? What if you let the data itself determine the shape of the density, with no model imposed? This chapter builds the answer from a single counting argument.
Abandoning the Parametric Assumption
Chapters 3 through 5 all operated within the parametric framework: choose a model family $\{P_\theta\}$, then find the best $\theta$ by MLE, MAP, or EM. The assumption that the true distribution belongs to your chosen family is always implicit, and always potentially wrong.
Non-parametric density estimation makes no such assumption. Given data $D = \{x_1, \ldots, x_n\}$ drawn i.i.d. from an unknown $P_x$, we want to estimate $P_x(x)$ at any query point $x$, directly from the data, with no parametric model in between. The data determines the shape.
The Fundamental Counting Argument
The core idea of all non-parametric density estimation is elegant and elementary. Fix a region $R$ in the input space around the query point $x$. Let $P$ be the probability that a randomly drawn data point falls within $R$:

$$P = \int_R P_x(x)\, dx$$
Now suppose we have $n$ data points drawn i.i.d. from $P_x$. Whether each point falls in $R$ is a Bernoulli trial with success probability $P$. If $k$ points fall in $R$, the MLE for $P$ is simply the empirical frequency:

$$\hat{P} = \frac{k}{n}$$
So we have $\frac{k}{n} \approx \int_R P_x(x)\, dx$. If $R$ is small enough that $P_x(x)$ is approximately constant within $R$ (a reasonable assumption for a sufficiently small region), then:

$$\int_R P_x(x)\, dx \approx P_x(x) \cdot V$$
where $V$ is the volume of region $R$. Combining these two approximations gives the fundamental non-parametric density estimate:

$$P_x(x) \approx \frac{k}{nV}$$
Density at $x$ $\approx$ fraction of data points near $x$, divided by the volume of the neighbourhood. This single formula is the foundation of every non-parametric density estimator.
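The formula fits in one line of code. A minimal sketch (the function name is illustrative):

```python
def density_estimate(k: int, n: int, V: float) -> float:
    """Fundamental non-parametric density estimate: the fraction of the
    n data points that fall in the region, divided by its volume V."""
    return k / (n * V)

# e.g. 2 of 5 points fall in a region of volume 1.0
print(density_estimate(2, 5, 1.0))  # 0.4
```

Every estimator in this chapter is a recipe for choosing $k$ and $V$ before applying this one formula.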
Two Strategies: Fix $V$ or Fix $k$
The formula $P_x(x) = k/(nV)$ has two free quantities: $k$ (the count) and $V$ (the volume). We can only choose one; the other is determined by the data. This gives rise to two families of non-parametric estimators:
| Strategy | Fix | Determine | Method |
|---|---|---|---|
| Parzen Windows | $V$ (volume of region) | $k$ (count points inside) | Fix a window, count neighbours |
| K-Nearest Neighbours | $k$ (number of neighbours) | $V$ (grow until $k$ points inside) | Fix count, grow neighbourhood |
Both give density estimates via the same formula. They differ in what they hold fixed and what they let the data determine.
Parzen Window Estimation
Fix a region $R$ of volume $V$ centred at the query point $x$, and count how many of the $n$ training points fall inside. The simplest choice: a $d$-dimensional hypercube with side length $h$, giving $V = h^d$.
To count the points inside, define the window function:

$$\phi(u) = \begin{cases} 1 & \text{if } |u_j| \le \tfrac{1}{2} \text{ for all } j = 1, \ldots, d \\ 0 & \text{otherwise} \end{cases}$$
Then $\phi\!\left(\frac{x - x_i}{h}\right) = 1$ if and only if $x_i$ is within the hypercube of side $h$ centred at $x$. Summing over all training points gives $k = \sum_i \phi\!\left(\frac{x-x_i}{h}\right)$. The Parzen density estimate is:

$$P_x(x) \approx \frac{k}{nV} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^d}\, \phi\!\left(\frac{x - x_i}{h}\right)$$
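A short sketch of the hypercube estimator, assuming NumPy (the function name and the closed window boundary are illustrative choices):

```python
import numpy as np

def parzen_estimate(x, X, h):
    """Hypercube Parzen window: p_hat(x) = k / (n * h^d), where k counts
    the training points whose every coordinate is within h/2 of the query."""
    X = np.atleast_2d(X)
    n, d = X.shape
    inside = np.all(np.abs((x - X) / h) <= 0.5, axis=1)  # phi((x - x_i)/h)
    k = int(inside.sum())
    return k / (n * h ** d)

# 1D sanity check: all three points lie in the window [0.0, 1.0] of side h = 1.0
X = np.array([[0.1], [0.2], [0.9]])
print(parzen_estimate(np.array([0.5]), X, h=1.0))  # 3 / (3 * 1.0) = 1.0
```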
As $h \to 0$: each training point gets its own spike. The estimate perfectly reproduces the training data but has no generalisability: it cannot estimate density between training points. Variance is very high.
Analogous to memorising training data in ERM. Zero training error, useless generalisation.
As $h \to \infty$: every training point contributes to every query point. The estimate becomes flat (approximately uniform) regardless of the true shape. Bias is very high.
Analogous to a hypothesis class that is too simple. The model cannot capture the true structure.
K-Nearest Neighbour Density Estimation
The second strategy fixes $k$ and grows the volume $V$ until it captures exactly $k$ training points. At a dense region of the input space, $V$ will be small, since many points are nearby. At a sparse region, $V$ will be large: you need to look further to find $k$ neighbours.
This adaptive neighbourhood is the key advantage of KNN over Parzen windows. In regions where the data is dense, KNN uses a small neighbourhood and achieves high resolution. In sparse regions, it automatically widens the neighbourhood to maintain a stable estimate. Parzen windows use the same $V$ everywhere, which can lead to over-smoothing in dense regions or under-smoothing in sparse ones.
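The adaptive neighbourhood is easy to sketch in one dimension (names are illustrative): sort the distances to the query, take the $k$-th smallest as the radius, and apply $k/(nV)$:

```python
import numpy as np

def knn_density_1d(x, X, k):
    """1D KNN density estimate: grow an interval around x until it
    contains exactly k training points, then apply k / (n * V)."""
    X = np.asarray(X, dtype=float)
    n = X.size
    r = np.sort(np.abs(X - x))[k - 1]  # distance to the k-th nearest neighbour
    V = 2.0 * r                        # the interval [x - r, x + r]
    return k / (n * V)

X = [0.0, 0.1, 0.2, 1.0, 2.0]
print(knn_density_1d(0.1, X, k=3))  # dense region: small V, estimate 3.0
print(knn_density_1d(1.5, X, k=3))  # sparse region: large V, smaller estimate
```

Note the sketch assumes the query does not coincide with $k$ or more data points (which would give $r = 0$); a production implementation would guard against that.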
The KNN Classifier: Bayes Optimal from Counting
The most important application of KNN density estimation in this course is what happens when you use it to implement the Bayes classifier. This is where the chapter's payoff lives.
Consider $m$-class classification. Given a query point $x$, place a volume $V$ around it and capture $k$ training points. Let $k_i$ be the number of those $k$ points with class label $i$. Then $\sum_{i=1}^m k_i = k$.
Using the KNN estimate $P_x(x) = k/(nV)$, the joint density of $x$ and class label $y_i$ is estimated by counting only points of class $i$ within the volume:

$$P(x, y_i) \approx \frac{k_i}{nV}$$
Applying Bayes' rule with the joint estimate $k_i/(nV)$ and the marginal estimate $k/(nV)$:

$$P(y_i \mid x) \approx \frac{k_i/(nV)}{k/(nV)} = \frac{k_i}{k}$$

The $nV$ factors cancel. The posterior estimate is simply the fraction of the $k$ nearest neighbours that belong to class $i$.
The Bayes classifier predicts the class with the highest posterior:

$$\hat{y} = \arg\max_i \frac{k_i}{k} = \arg\max_i k_i$$

That is, KNN classification is a majority vote among the $k$ nearest neighbours.
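The whole classifier reduces to counting labels among the nearest neighbours. A minimal sketch, with illustrative toy data:

```python
from collections import Counter
import numpy as np

def knn_classify(x, X, y, k):
    """KNN classifier from the counting argument: posterior estimate
    P(y_i | x) = k_i / k, prediction = argmax, i.e. majority vote."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    counts = Counter(y[i] for i in nearest)
    posteriors = {label: c / k for label, c in counts.items()}
    return max(posteriors, key=posteriors.get), posteriors

X = [[0.0], [0.1], [0.2], [1.0]]
y = ["Up", "Up", "Down", "Down"]
pred, post = knn_classify(np.array([0.05]), X, y, k=3)
print(pred, post)  # prediction "Up", posteriors 2/3 Up and 1/3 Down
```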
Five daily NIFTY log-returns: $D = \{-1.5\%, -0.5\%, 0.0\%, +0.8\%, +1.6\%\}$. We want to estimate the density at $x = 0.5\%$ using a 1D Parzen window with bandwidth $h$.
With $h = 1.0\%$, the window at $x = 0.5\%$ covers $[0.5 - 0.5,\ 0.5 + 0.5] = [0.0\%, 1.0\%]$. Two points fall in this interval: $\{0.0\%, 0.8\%\}$. So $k = 2$, $V = h = 1.0\%$, and $P_x(0.5\%) \approx \frac{k}{nh} = \frac{2}{5 \times 1.0} = 0.40$.
With $h = 0.4\%$, the window covers $[0.3\%, 0.7\%]$. Checking each point ($-1.5$, $-0.5$, $0.0$, $+0.8$, $+1.6$): none falls in $[0.3\%, 0.7\%]$, so $k = 0$.
The narrow window finds no data and estimates zero density, which is clearly wrong: the true density is positive in this region.
With $h = 3.0\%$, the window covers $[-1.0\%, 2.0\%]$. Four points fall inside: $\{-0.5\%, 0.0\%, +0.8\%, +1.6\%\}$, so $k = 4$ and $P_x(0.5\%) \approx \frac{4}{5 \times 3.0} \approx 0.267$.
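The three bandwidth cases can be checked mechanically. A sketch (the closed window boundary is an assumption chosen to match the counts above):

```python
import numpy as np

returns = np.array([-1.5, -0.5, 0.0, 0.8, 1.6])  # the five log-returns, in %
x, n = 0.5, 5

for h in (0.4, 1.0, 3.0):
    # 1D hypercube window: the closed interval [x - h/2, x + h/2]
    k = int(np.sum(np.abs(returns - x) <= h / 2))
    print(f"h = {h}%: k = {k}, estimate = {k / (n * h):.3f}")
```

The run reproduces $k = 0$, $2$, $4$ for the narrow, medium, and wide windows respectively.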
We classify today's market state as Bullish (B), Neutral (N), or Bearish (R) based on two features: $x_1$ = INDIA VIX level, $x_2$ = NIFTY 5-day momentum (%). Six labelled training days:
Compute Euclidean distances from query $(15, 0.6)$ to all training points (normalise features to the same scale first; here VIX range $\approx 13$, Mom range $\approx 5.2$, so the distances are approximate):
Sorted distances: Day 2 (1.35, B), Day 3 (1.35, N), Day 4 (2.44, N). The 3 nearest neighbours are B, N, N, so the majority vote predicts Neutral, with posterior estimates $P(N \mid x) \approx 2/3$ and $P(B \mid x) \approx 1/3$.
Non-parametric methods are powerful but have a fundamental limitation: they have no learnable parameters. At test time, every prediction requires searching through all training data, so computation scales with $n$. They also scale poorly with dimension $d$ (the curse of dimensionality). Chapter 7 returns to parametric models, but now with a complete understanding of what we want from them: linear regression, logistic regression, and the bias-variance tradeoff made fully rigorous.
In a Parzen window estimator, increasing the bandwidth $h$ has what effect on the density estimate?
Increasing $h$ makes the window larger, so more training points contribute to the estimate at every query point. This reduces variance: the estimate is more stable because it averages over more data. But it also increases bias: by averaging over a wider region, the estimate loses the ability to capture local structure. In the extreme $h \to \infty$, every training point contributes equally everywhere, producing a flat estimate regardless of the true shape (maximum bias, minimum variance).
This is the bias-variance tradeoff in density estimation, directly analogous to hypothesis class complexity in ERM. The bandwidth $h$ is the "complexity parameter" of the Parzen estimator.
The KNN density estimator uses an adaptive neighbourhood: the volume $V(x)$ changes with the query point $x$. What is the key advantage of this over a fixed-volume Parzen window?
The adaptive neighbourhood is KNN's defining advantage. In a dense region, the $k$ nearest neighbours are very close to the query point, so $V(x)$ is small, giving a fine-resolution estimate. In a sparse region, you need to search further, so $V(x)$ is large, giving a smoother but stable estimate. A fixed-bandwidth Parzen window cannot do this: it either has $h$ too small (zero estimates in sparse regions) or $h$ too large (over-smoothed estimates in dense regions).
Option B is wrong: KNN still has a hyperparameter, namely $k$, and choosing $k$ is analogous to choosing $h$. Option D is wrong: KNN still requires computing distances to all $n$ training points to find the $k$ nearest ones.
A KNN classifier with $K = 5$ is applied to a query point. Among the 5 nearest neighbours: 3 have label "NIFTY Up" and 2 have label "NIFTY Down". What does the classifier predict, and what are the posterior probability estimates for each class?
From the derivation: $\hat{P}(y_i|x) = k_i/k$. With $k_{\text{Up}} = 3$ and $k_{\text{Down}} = 2$ out of $k = 5$ total neighbours, the posterior estimates are $\hat{P}(\text{Up}|x) = 3/5 = 0.6$ and $\hat{P}(\text{Down}|x) = 2/5 = 0.4$. The classifier predicts the class with the higher posterior, Up, which corresponds to simple majority vote.
Option A is wrong: the minority class is not ignored; it contributes to the posterior estimate. The KNN posterior gives calibrated probability estimates, not just hard labels. Option D is wrong: the volume $V$ cancels in the ratio $k_i/(nV) \div \sum_j k_j/(nV) = k_i/k$. The volume is irrelevant for classification.
A quant wants to estimate the density of NIFTY daily returns at the point $x = -3\%$ (a large down day) using 500 historical observations. She has very few observations near $-3\%$. Which non-parametric approach is most appropriate, and why?
The tail region around $-3\%$ is sparse: with few observations there, any fixed-bandwidth Parzen window has a problem. Small $h$: it finds almost no points and gives an estimate near zero (exactly zero for some query points). Large $h$: it captures many returns far from $-3\%$, contaminating the tail estimate with the bulk of the return distribution.
KNN handles this naturally: it grows the neighbourhood until it finds $k$ points, however far that requires searching. The resulting $V$ is large in the tail (reflecting sparsity), and the estimate $k/(nV)$ is correspondingly small, correctly indicating low tail density. The adaptive neighbourhood is exactly what's needed in regions with non-uniform data density, which describes financial return distributions perfectly: dense near the mean, sparse in the tails.
Option D is wrong: non-parametric methods can estimate tail density; they just need enough total data and the right approach.
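A small simulation makes the contrast concrete. The returns below are synthetic standard-normal draws used purely for illustration, not real market data:

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(0.0, 1.0, size=500)  # synthetic "daily returns", in %

x = -3.0          # tail query point
n = returns.size

# Fixed-bandwidth Parzen: a narrow window may capture no points at all
h = 0.1
k_parzen = int(np.sum(np.abs(returns - x) <= h / 2))
p_parzen = k_parzen / (n * h)

# KNN: grow the interval until it holds k points; V adapts to the sparsity
k = 10
r = np.sort(np.abs(returns - x))[k - 1]
p_knn = k / (n * 2 * r)

print(p_parzen, p_knn)  # KNN gives a small but strictly positive tail estimate
```

The KNN estimate is always positive and small in the tail, whereas the narrow Parzen window can return exactly zero.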
Q1–3 are direct applications of the $k/(nV)$ formula. Q4–7 require connecting non-parametric estimation to the broader framework. Q8–10 explore the limits of non-parametric methods.