Regularisation, SVMs,
and the Kernel Trick
Every thread from the last seven chapters converges here. Regularisation fixes the overfitting problem with a Bayesian prior in disguise. SVMs find the provably optimal decision boundary by solving a beautifully structured constrained optimisation. Kernels dissolve the boundary between linear and nonlinear, letting you operate in infinite-dimensional feature spaces at finite cost.
The Overfitting Problem, Revisited
Chapter 7 gave us the diagnosis: when the hypothesis class $\mathcal{H}$ is too expressive relative to the training set size $N$, the learned model $h_D$ has high variance: it overfits. The training error is small but the generalisation gap is large. The ERM objective $\hat{R}(\theta) = \frac{1}{N}\sum_i L(h_\theta(x_i), y_i)$ finds the parameters that best explain the training data, with no regard for whether those parameters generalise.
Regularisation adds a second term to the objective: a penalty $\Omega(\theta)$ that discourages parameters from taking extreme values. The regularised ERM objective is:
$$\hat{R}_\lambda(\theta) = \frac{1}{N}\sum_{i=1}^N L(h_\theta(x_i), y_i) + \lambda\,\Omega(\theta), \qquad \lambda \geq 0$$
Ridge and Lasso: Two Choices of Penalty
The two dominant choices of $\Omega(\theta)$ lead to the two most widely used regularised regression methods:
| Name | Penalty $\Omega(\theta)$ | Objective | Key Property |
|---|---|---|---|
| Ridge (L2) | $\|\theta\|_2^2 = \sum_j \theta_j^2$ | $\frac{1}{N}\sum(y_i - \theta^\top x_i)^2 + \lambda\|\theta\|_2^2$ | Shrinks all parameters toward zero; closed-form solution |
| Lasso (L1) | $\|\theta\|_1 = \sum_j |\theta_j|$ | $\frac{1}{N}\sum(y_i - \theta^\top x_i)^2 + \lambda\|\theta\|_1$ | Sets many parameters exactly to zero; automatic feature selection |
Ridge has a closed-form solution: $\theta^*_\text{Ridge} = (X^\top X + \lambda I)^{-1} X^\top y$. The $\lambda I$ term ensures invertibility even when $X^\top X$ is singular, solving the collinearity problem in OLS. Lasso has no closed-form solution (the $|\cdot|$ function is not differentiable at zero) but is solvable by convex optimisation methods such as coordinate descent.
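The Ridge closed form is one line of linear algebra. A minimal numpy sketch (the helper name is illustrative; solving the linear system is preferable to forming the inverse explicitly):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge solution theta* = (X^T X + lam * I)^{-1} X^T y,
    computed via np.linalg.solve for numerical stability."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Sanity check on noiseless synthetic data: lam = 0 reduces to OLS,
# and lam > 0 shrinks the solution toward zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true
theta_ols = ridge_closed_form(X, y, lam=0.0)
theta_ridge = ridge_closed_form(X, y, lam=10.0)
```

With noiseless data the $\lambda = 0$ solution recovers $\theta$ exactly, while any $\lambda > 0$ trades a little bias for a smaller parameter norm.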
Regularisation Is MAP Estimation in Disguise
The deep reason regularisation works is revealed by connecting it to Chapter 5's MAP estimation. Consider the MAP objective for a linear regression model:
With model $P_\theta(y|x) = \mathcal{N}(y;\, h_\theta(x),\, \sigma^2)$ and prior $P(\theta)$, the MAP estimate maximises:
$$\ell_r(\theta) = \sum_{i=1}^N \log P_\theta(y_i \mid x_i) + \log P(\theta)$$
Set $P(\theta) = \mathcal{N}(\theta;\, 0,\, \tau^2 I)$, so $\log P(\theta) = -\frac{1}{2\tau^2}\|\theta\|_2^2 + \text{const}$. Maximising $\ell_r$ is equivalent (up to constants absorbed into the penalty weight) to minimising:
$$\frac{1}{N}\sum_{i=1}^N \big(y_i - h_\theta(x_i)\big)^2 + \frac{\sigma^2}{\tau^2}\,\|\theta\|_2^2$$
With $\lambda = \sigma^2/\tau^2$, this is exactly Ridge regression. The regularisation parameter $\lambda$ is the ratio of noise variance to prior variance: a small prior variance (strong prior belief that $\theta \approx 0$) gives large $\lambda$ (strong regularisation).
Set $P(\theta) = \text{Laplace}(\theta;\, 0,\, b)$, so $\log P(\theta) = -\frac{1}{b}\|\theta\|_1 + \text{const}$. The MAP objective becomes (again up to absorbed constants):
$$\frac{1}{N}\sum_{i=1}^N \big(y_i - h_\theta(x_i)\big)^2 + \lambda\,\|\theta\|_1, \qquad \lambda \propto \frac{\sigma^2}{b}$$
This is exactly Lasso. The Laplace prior has heavier tails than the Gaussian: it places more probability mass both near zero and in the tails, encouraging sparse solutions (many $\theta_j = 0$, with a few large ones).
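Coordinate descent for Lasso, mentioned above, cycles through coordinates and applies soft-thresholding to each. A minimal sketch, assuming a $\frac{1}{2N}$ scaling of the squared loss (library implementations such as scikit-learn differ in scaling conventions, which rescales the meaning of $\lambda$):

```python
import numpy as np

def soft_threshold(a, t):
    """S(a, t) = sign(a) * max(|a| - t, 0): the proximal map of t*|.|"""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def lasso_cd(X, y, lam, n_iters=200):
    """Coordinate descent for (1/2N)||y - X theta||^2 + lam * ||theta||_1."""
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        for j in range(d):
            # Residual with feature j's contribution removed
            r = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ r / N
            z = X[:, j] @ X[:, j] / N
            theta[j] = soft_threshold(rho, lam) / z
    return theta

# Sparse ground truth: only 2 of 10 features are active.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
theta_true = np.zeros(10)
theta_true[0], theta_true[3] = 3.0, -2.0
y = X @ theta_true + 0.01 * rng.normal(size=100)
theta_hat = lasso_cd(X, y, lam=0.1)
```

On this synthetic problem the estimate recovers the two active coefficients (slightly shrunk toward zero, the Lasso bias) and zeroes out the irrelevant ones: automatic feature selection in action.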
The module's objectives so far, and where each was introduced:

| Objective | Chapter |
|---|---|
| $\min \hat{R}(\theta)$ (ERM) | Ch 2 |
| $\max \ell(\theta)$ (maximum likelihood) | Ch 3–4 |
| $\min D_{KL}(P \Vert P_\theta)$ | Ch 3 |
| $\max \log P(\theta \mid D)$ (MAP) | Ch 5 |
| $\min \hat{R}(\theta) + \lambda\Omega(\theta)$ (regularised ERM) | Ch 8 |
Support Vector Machines: The Maximum Margin Classifier
Regularisation fixes overfitting for regression. For classification, there is a different approach to finding the "best" hypothesis, one based on geometry rather than probabilistic priors. When data is linearly separable, infinitely many hyperplanes separate the two classes. The SVM asks: which one is optimal?
Consider binary classification: $x \in \mathbb{R}^d$, $y \in \{-1, +1\}$. A linear classifier is defined by a discriminant function $g(x) = w^\top x + b$, predicting $\hat{y} = \text{sign}(g(x))$. A dataset is linearly separable if there exist $w, b$ such that: $$w^\top x_i + b > 0 \quad \text{for all } y_i = +1 \qquad \text{and} \qquad w^\top x_i + b < 0 \quad \text{for all } y_i = -1$$
The margin of a separating hyperplane is the distance from the hyperplane to the nearest training point on either side. With the canonical scaling $\min_i |w^\top x_i + b| = 1$, the margin width is $\frac{2}{\|w\|}$. The SVM finds the hyperplane that maximises this margin. Maximising $\frac{2}{\|w\|}$ is equivalent to minimising $\|w\|^2$:
$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^\top x_i + b) \geq 1 \ \text{ for all } i$$
Solving the SVM: KKT Conditions
The SVM primal is a constrained quadratic optimisation problem โ convex objective, linear constraints. It is solved via the Lagrangian and KKT (Karush-Kuhn-Tucker) conditions.
Introduce multipliers $\mu_i \geq 0$ for each constraint $[1 - y_i(w^\top x_i + b)] \leq 0$:
$$\mathcal{L}(w, b, \mu) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^N \mu_i\left[1 - y_i(w^\top x_i + b)\right]$$
Set gradients to zero:
$$\nabla_w \mathcal{L} = w - \sum_i \mu_i y_i x_i = 0 \implies w^* = \sum_i \mu_i y_i x_i, \qquad \frac{\partial \mathcal{L}}{\partial b} = -\sum_i \mu_i y_i = 0 \implies \sum_i \mu_i y_i = 0$$
The optimal $w^*$ is a linear combination of the training points, weighted by $\mu_i y_i$. Only points with $\mu_i > 0$ contribute; these are the support vectors $S = \{x_i : \mu_i > 0\}$.
Complementary slackness requires $\mu_i^*\left[1 - y_i(w^\top x_i + b)\right] = 0$ for all $i$. A point either lies exactly on the margin boundary ($y_i(w^\top x_i + b) = 1$, making it a support vector) or has $\mu_i = 0$ (an interior point, irrelevant to the solution).
Substitute $w^* = \sum_i \mu_i y_i x_i$ back into $\mathcal{L}$ and simplify:
$$q(\mu) = \sum_{i=1}^N \mu_i - \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \mu_i \mu_j y_i y_j\, x_i^\top x_j$$
The dual problem maximises $q(\mu)$ subject to $\mu_i \geq 0$ and $\sum_i \mu_i y_i = 0$. It is quadratic in $\mu$ with linear constraints โ standard QP solvers apply.
Once $\mu^*$ is found, the optimal weight vector is $w^* = \sum_{s \in S} \mu_s y_s x_s$, a sparse combination of support vectors only. The bias $b^*$ is recovered from any support vector $x_s$: $y_s(w^{*\top} x_s + b^*) = 1 \implies b^* = y_s - w^{*\top} x_s$.
Inference on a new point $x$: $\hat{y} = \text{sign}(w^{*\top} x + b^*) = \text{sign}\!\left(\sum_{s \in S} \mu_s y_s\, x_s^\top x + b^*\right)$.
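Dual-form inference can be sketched on a toy problem small enough to solve the KKT conditions by hand: two 1-D points $x = \pm 1$ with labels $y = \pm 1$. Symmetry and the stationarity conditions give $\mu_1 = \mu_2 = \tfrac{1}{2}$, hence $w^* = 1$ and $b^* = 0$ (a hand-derived illustration, not an example from the text):

```python
import numpy as np

# Toy problem: x = +1 with y = +1 and x = -1 with y = -1.
# The KKT conditions give mu = [0.5, 0.5], so w* = sum_i mu_i y_i x_i = 1, b* = 0.
support_x = np.array([1.0, -1.0])
support_y = np.array([1.0, -1.0])
mu = np.array([0.5, 0.5])

def svm_predict(x):
    """Dual-form inference: sign(sum_s mu_s y_s <x_s, x> + b*)."""
    # Recover b* from the first support vector: b* = y_s - w* . x_s
    b_star = support_y[0] - np.sum(mu * support_y * support_x * support_x[0])
    return np.sign(np.sum(mu * support_y * support_x * x) + b_star)
```

Every prediction touches only the support vectors; the rest of the training set never enters the computation.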
Soft-Margin SVM: When Data Is Not Linearly Separable
Real data is rarely linearly separable. The hard-margin SVM has no solution when the classes overlap. The fix: introduce slack variables $\xi_i \geq 0$ that allow points to violate the margin by a controlled amount:
$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^N \xi_i \quad \text{subject to} \quad y_i(w^\top x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
The hyperparameter $C > 0$ controls the tradeoff between margin maximisation and margin violation. Large $C$: few violations allowed, narrow margin, potentially overfits. Small $C$: many violations tolerated, wider margin, better generalisation. The dual problem is identical to the hard-margin case except $\mu_i$ is now bounded: $0 \leq \mu_i \leq C$.
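At the optimum each slack is $\xi_i = \max(0, 1 - y_i(w^\top x_i + b))$, so the soft-margin primal can equivalently be written as unconstrained hinge-loss minimisation, $\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_i \max(0, 1 - y_i(w^\top x_i + b))$, which a simple subgradient method can attack. A minimal sketch on synthetic data (not the QP route described above, and not a production solver):

```python
import numpy as np

def soft_margin_svm(X, y, C=1.0, lr=0.01, n_iters=2000):
    """Subgradient descent on 0.5*||w||^2 + C * sum_i hinge(1 - y_i (w.x_i + b))."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        viol = margins < 1                      # points violating the margin
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w = w - lr * grad_w
        b = b - lr * grad_b
    return w, b

# Two well-separated Gaussian clusters: class +1 near (2, 2), class -1 near (-2, -2).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)
w, b = soft_margin_svm(X, y, C=1.0)
acc = np.mean(np.sign(X @ w + b) == y)
```

On separable data like this the learned hyperplane classifies the training set perfectly; raising $C$ narrows the margin, lowering it widens the margin at the cost of more hinge violations.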
The Kernel Trick: Nonlinearity at Zero Extra Cost
The SVM dual problem depends on the data only through inner products $x_i^\top x_j$. Now consider mapping the data to a higher-dimensional space: $\phi: \mathbb{R}^d \to \mathbb{R}^{d'}$ with $d' \gg d$. The dual becomes:
$$q(\mu) = \sum_i \mu_i - \frac{1}{2}\sum_{i,j} \mu_i \mu_j y_i y_j\, \phi(x_i)^\top \phi(x_j)$$
Computing $\phi(x_i)^\top \phi(x_j)$ directly requires materialising the high-dimensional features, which is potentially expensive. The kernel trick: if there exists a function $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$ that can be evaluated using only the original features, we obtain the inner product in the high-dimensional space without ever computing $\phi$ explicitly.
The polynomial kernel $K(x,z) = (x^\top z + 1)^2$ in $\mathbb{R}^2$ implicitly maps to the feature space $\phi(x) = [x_1^2, x_2^2, \sqrt{2}x_1 x_2, \sqrt{2}x_1, \sqrt{2}x_2, 1]$: six dimensions from two, capturing all quadratic interactions. Computing $K$ costs $O(d)$; computing $\phi(x)^\top \phi(z)$ directly costs $O(d^2)$.
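The identity $K(x,z) = \phi(x)^\top \phi(z)$ for this kernel is easy to verify numerically:

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for K(x, z) = (x.z + 1)^2 in R^2."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

def K(x, z):
    """The kernel: one O(d) inner product plus a square."""
    return (x @ z + 1.0) ** 2

x = np.array([0.5, -1.2])
z = np.array([2.0, 0.3])
# K(x, z) equals the 6-dimensional inner product phi(x).phi(z),
# computed without ever building phi.
```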
The RBF (Gaussian) kernel $K(x,z) = \exp(-\|x-z\|^2/2\sigma^2)$ corresponds to an infinite-dimensional feature map. The SVM with an RBF kernel operates in a Hilbert space of infinite dimension, yet computes the inner product as a simple exponential of the squared distance. $\sigma$ is the bandwidth hyperparameter.
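A valid kernel must produce a positive semi-definite Gram matrix (Mercer's condition). A quick numerical check for the RBF kernel (a sketch on random data, not a proof):

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
G = rbf_gram(X, sigma=1.5)
# Eigenvalues of a PSD matrix are non-negative (up to floating-point noise),
# and K(x, x) = exp(0) = 1 on the diagonal.
eigvals = np.linalg.eigvalsh(G)
```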
A quant fits a 2-factor model for NIFTY returns: $y = \theta_1 x_1 + \theta_2 x_2 + \epsilon$ where $x_1$ = FII buying and $x_2 \approx x_1$ (DII selling, highly correlated with FII). With $N = 10$ observations, $X^\top X \approx \begin{bmatrix} 100 & 98 \\ 98 & 100 \end{bmatrix}$, $X^\top y = \begin{bmatrix} 5 \\ 4.9 \end{bmatrix}$.
With the original data, OLS gives $\theta^* = (X^\top X)^{-1} X^\top y = [0.05,\ 0.0]$. Looks fine, but a tiny change in the data (replace $4.9$ with $5.0$ in $X^\top y$) gives $\theta^* \approx [0.025,\ 0.025]$: a completely different factor attribution from nearly identical data. OLS is wildly unstable here because $X^\top X$ is nearly singular.
With $\lambda = 2$, $(X^\top X + \lambda I) = \begin{bmatrix}102 & 98 \\ 98 & 102\end{bmatrix}$, determinant $= 102^2 - 98^2 = (102+98)(102-98) = 800$:
$$\theta^*_\text{Ridge} = \frac{1}{800}\begin{bmatrix}102 & -98 \\ -98 & 102\end{bmatrix}\begin{bmatrix}5 \\ 4.9\end{bmatrix} \approx \begin{bmatrix}0.0373 \\ 0.0123\end{bmatrix}$$
The same perturbation ($4.9 \to 5.0$) now gives $\theta^* = [0.025,\ 0.025]$: the attribution shifts smoothly instead of flipping.
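The arithmetic in this example can be checked numerically with numpy's linear solver:

```python
import numpy as np

XtX = np.array([[100.0, 98.0], [98.0, 100.0]])
Xty = np.array([5.0, 4.9])
Xty_p = np.array([5.0, 5.0])            # the tiny perturbation: 4.9 -> 5.0

ols = np.linalg.solve(XtX, Xty)         # ~ [0.05,   0.0]
ols_p = np.linalg.solve(XtX, Xty_p)     # ~ [0.0253, 0.0253]

lam = 2.0
A = XtX + lam * np.eye(2)
ridge = np.linalg.solve(A, Xty)         # ~ [0.0373, 0.0123]
ridge_p = np.linalg.solve(A, Xty_p)     # ~ [0.025,  0.025]
```

The Ridge estimates move far less under the perturbation than the OLS estimates: regularisation buys stability.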
Four NIFTY trading days, 2 features: $x_1$ = INDIA VIX, $x_2$ = 5-day momentum (%). Labels: $y = +1$ (market up), $y = -1$ (market down).
The two classes are separated: low VIX + positive momentum = up, high VIX + negative momentum = down. The SVM finds the hyperplane $w_1 x_1 + w_2 x_2 + b = 0$ that maximises the margin. By symmetry, the boundary passes midway between the two classes, roughly along $x_1 \approx 17.5$ (or equivalently a line in the 2D feature space).
The support vectors are the points closest to the decision boundary: the points that lie on the margin boundaries $w^\top x + b = \pm 1$. In this dataset, Day B $(15, +1)$ and Day C $(20, -1)$ are the closest points to the boundary from each class. These are the only points with $\mu_i > 0$. Days A and D lie well inside their respective half-spaces, so $\mu_A = \mu_D = 0$.
SVMs with kernels showed that the right feature map can make any problem linearly separable, but we had to specify the kernel manually. What if the feature map itself were learned from data? That is the deep learning idea. Chapter 9 introduces Convolutional Neural Networks, where the feature extractor is not hand-designed but optimised end-to-end by gradient descent.
A quant uses Ridge regression ($L_2$ regularisation) on a NIFTY factor model. A colleague says "Ridge just adds noise to the matrix; it's a numerical hack." What is the precise Bayesian interpretation that refutes this?
As derived in this chapter: with Gaussian likelihood $P(y|x,\theta) = \mathcal{N}(y; h_\theta(x), \sigma^2)$ and Gaussian prior $P(\theta) = \mathcal{N}(0, \tau^2 I)$, the MAP objective is $\arg\min \frac{1}{N}\sum \|y_i - h_\theta(x_i)\|^2 + \frac{\sigma^2}{\tau^2}\|\theta\|_2^2$. This is exactly Ridge regression with $\lambda = \sigma^2/\tau^2$. The regularisation parameter $\lambda$ encodes the ratio of noise to prior uncertainty: not a numerical trick but a precise prior belief that parameters are unlikely to be very large.
Option C is wrong: a Laplace prior gives Lasso (L1), not Ridge (L2). Option D is wrong: MAP is a frequentist-Bayesian bridge; it computes a point estimate from the posterior, unifying both frameworks.
In a trained SVM on 500 NIFTY trading days, only 35 days are identified as support vectors. What happens to the decision boundary if the remaining 465 non-support-vector training points are removed?
From the KKT complementary slackness condition: $\mu_i^*[1 - y_i(w^\top x_i + b)] = 0$. For non-support-vector points that lie strictly inside their margin ($y_i(w^\top x_i + b) > 1$), the constraint is satisfied strictly, forcing $\mu_i^* = 0$. Since $w^* = \sum_i \mu_i y_i x_i$, these points contribute zero to $w^*$ and zero to $b^*$. Removing them changes nothing.
This is one of the most practically important properties of SVMs: the solution is sparse in the training data. You can store only the 35 support vectors and discard the other 465 points entirely; the model is perfectly preserved. This makes SVMs memory-efficient at deployment time.
A data scientist wants to classify NIFTY trading days as trending or mean-reverting. The two classes are not linearly separable in the original 5-dimensional feature space. She applies an RBF kernel SVM. What is the RBF kernel implicitly computing, and why is this more powerful than expanding features manually?
By Mercer's theorem, the RBF kernel $K(x,z) = \exp(-\|x-z\|^2/2\sigma^2)$ corresponds to an inner product $\phi(x)^\top \phi(z)$ where $\phi$ maps to a Reproducing Kernel Hilbert Space (RKHS) of infinite dimension. The SVM with an RBF kernel finds the maximum-margin hyperplane in this infinite-dimensional space: a nonlinear decision boundary in the original 5D feature space.
The power of the kernel trick: computing $K(x_i, x_j) = \exp(-\|x_i-x_j\|^2/2\sigma^2)$ costs $O(d)$, just the distance between two 5-dimensional vectors. Computing the explicit inner product $\phi(x_i)^\top\phi(x_j)$ in the infinite-dimensional space would be impossible. The kernel gives you the answer without the computation.
A quant asks: "Should I use Lasso or Ridge for my NIFTY factor model? I have 200 macro features but suspect only 10 are genuinely predictive." Which regulariser is more appropriate, and what is the precise Bayesian reason?
The prior belief "only 10 of 200 features are relevant" is a sparsity belief: most $\theta_j$ should be exactly zero. The Laplace prior $P(\theta_j) \propto \exp(-|\theta_j|/b)$ encodes this: it has a sharp peak at zero (pulling coefficients to exactly zero) and heavier tails than the Gaussian (allowing the 10 relevant coefficients to be large). This produces the Lasso penalty $\lambda\|\theta\|_1$, which shrinks irrelevant coefficients to exactly zero, performing automatic feature selection.
Ridge's Gaussian prior shrinks all coefficients toward zero proportionally but never sets any exactly to zero; with 200 features, you get 200 small nonzero coefficients, harder to interpret and more prone to overfitting than Lasso's sparse solution.
Option C is partially valid practically but misses the probabilistic reasoning. Option D is wrong: linear models with Lasso regularisation are often preferable to neural networks when $N$ is moderate and interpretability matters.
Q1–3 are direct computations and definitions. Q4–7 connect regularisation, SVMs, and the probabilistic framework. Q8–10 are synthesis questions that span the entire module.