Advanced Chapter 08 of 13 · Mathematical Foundations of ML

Regularisation, SVMs,
and the Kernel Trick

Every thread from the last seven chapters converges here. Regularisation fixes the overfitting problem with a Bayesian prior in disguise. SVMs find the provably optimal decision boundary by solving a beautifully structured constrained optimisation. Kernels dissolve the boundary between linear and nonlinear — letting you operate in infinite-dimensional feature spaces at finite cost.

The Overfitting Problem, Revisited

Chapter 7 gave us the diagnosis: when hypothesis class $\mathcal{H}$ is too expressive relative to the training set size $N$, the learned model $h_D$ has high variance — it overfits. The training error is small but the generalisation gap is large. The ERM objective $\hat{R}(\theta) = \frac{1}{N}\sum_i L(h_\theta(x_i), y_i)$ finds the parameters that best explain the training data, with no regard for whether those parameters generalise.

Regularisation adds a second term to the objective — a penalty $\Omega(\theta)$ that discourages parameters from taking extreme values. The regularised ERM objective is:

Regularised ERM (Reg-ERM): $$\hat{h}^*_\theta = \underset{h_\theta}{\arg\min}\;\frac{1}{N}\sum_{i=1}^N L(h_\theta(x_i), y_i) \quad \text{subject to}\quad \Omega(\theta) < \kappa$$ Equivalently, via Lagrangian with multiplier $\lambda$: $$\hat{R}_r(\theta) = \frac{1}{N}\sum_{i=1}^N L(h_\theta(x_i), y_i) + \lambda\,\Omega(\theta)$$ The hyperparameter $\lambda > 0$ controls the regularisation strength. Large $\lambda$: strong penalty, high bias, low variance. Small $\lambda$: weak penalty, low bias, higher variance.
Regularisation is a principled way to encode the belief that "large parameters are a priori unlikely." A model with very large $\theta$ is making extreme predictions based on the specific noise pattern in the training data. Penalising $\|\theta\|$ says: among all hypotheses that fit the data well, prefer the simpler one with smaller parameters. This is Occam's Razor implemented as a convex penalty.

Ridge and Lasso: Two Choices of Penalty

The two dominant choices of $\Omega(\theta)$ lead to the two most widely used regularised regression methods:

$$\begin{array}{llll}
\text{Name} & \text{Penalty } \Omega(\theta) & \text{Objective} & \text{Key property} \\ \hline
\text{Ridge (L2)} & \|\theta\|_2^2 = \sum_j \theta_j^2 & \frac{1}{N}\sum_i (y_i - \theta^\top x_i)^2 + \lambda\|\theta\|_2^2 & \text{Shrinks all parameters toward zero; closed-form solution} \\
\text{Lasso (L1)} & \|\theta\|_1 = \sum_j |\theta_j| & \frac{1}{N}\sum_i (y_i - \theta^\top x_i)^2 + \lambda\|\theta\|_1 & \text{Sets many parameters exactly to zero; automatic feature selection}
\end{array}$$

Ridge has a closed-form solution: $\theta^*_\text{Ridge} = (X^\top X + \lambda I)^{-1} X^\top y$. The $\lambda I$ term ensures invertibility even when $X^\top X$ is singular — solving the collinearity problem in OLS. Lasso has no closed-form solution (the $|\cdot|$ function is not differentiable at zero) but is solvable by convex optimisation methods such as coordinate descent.
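The closed form is a one-line linear solve. A minimal NumPy sketch on a tiny made-up design matrix (the data is illustrative, not from the chapter):

```python
import numpy as np

# Tiny illustrative dataset (made up): 3 observations, 2 features.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])

def ridge(X, y, lam):
    """Closed-form Ridge: solve (X^T X + lam I) theta = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

theta_ols = ridge(X, y, 0.0)     # lam = 0 recovers OLS
theta_ridge = ridge(X, y, 1.0)   # lam > 0 shrinks the solution toward zero

print(theta_ols, theta_ridge)
print(np.linalg.norm(theta_ridge) < np.linalg.norm(theta_ols))
```

Using `np.linalg.solve` rather than explicitly inverting $X^\top X + \lambda I$ is the standard numerically stable choice.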

Ridge regression is the standard tool for fitting multi-factor NIFTY return models. When 50 factor exposures are correlated (large-cap funds tend to hold similar stocks), $X^\top X$ is nearly singular and OLS estimates are unstable. Ridge's $\lambda I$ regularisation stabilises the solution. The parameter $\lambda$ is chosen by cross-validation over rolling windows. Lasso goes further: if only 5 of 50 factors are genuinely predictive, Lasso automatically zeroes out the other 45 — performing feature selection and parameter estimation simultaneously. This is why Lasso-based factor models are standard at quantitative funds.
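Coordinate descent for Lasso is short enough to sketch. The version below minimises the unnormalised objective $\frac{1}{2}\|y - X\theta\|^2 + \alpha\|\theta\|_1$ (the $1/N$ is absorbed into $\alpha$); each coordinate update is a soft-thresholding step. A minimal sketch on a made-up orthogonal design, not a production solver:

```python
import numpy as np

def soft_threshold(z, alpha):
    """S(z, alpha) = sign(z) * max(|z| - alpha, 0): the Lasso shrinkage operator."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def lasso_cd(X, y, alpha, n_sweeps=100):
    """Coordinate descent for (1/2)||y - X theta||^2 + alpha ||theta||_1."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_sweeps):
        for j in range(d):
            # Partial residual: remove feature j's current contribution.
            r = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ r
            theta[j] = soft_threshold(rho, alpha) / (X[:, j] @ X[:, j])
    return theta

# Orthogonal design: the Lasso solution is per-coordinate soft-thresholding.
X = np.eye(3)
y = np.array([3.0, 0.5, -2.0])
theta = lasso_cd(X, y, alpha=1.0)
print(theta)   # the small |0.5| coefficient is zeroed out: sparsity
```

On this orthogonal design the exact solution is $\theta_j = S(y_j, \alpha)$, so the zeroed middle coefficient illustrates exactly the feature-selection behaviour described above.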

Regularisation Is MAP Estimation in Disguise

The deep reason regularisation works is revealed by connecting it to Chapter 5's MAP estimation. Consider the MAP objective for a linear regression model:

Derivation · Reg-ERM = MAP with a Specific Prior · Ridge = Gaussian Prior · Lasso = Laplace Prior
MAP objective for linear regression

With model $P_\theta(y|x) = \mathcal{N}(y;\, h_\theta(x),\, \sigma^2)$ and prior $P(\theta)$, the MAP estimate maximises:

$$\ell_r(\theta) = \log P(\theta|y,x) = \log P(y|x,\theta) + \log P(\theta) + \text{const}$$ $$= -\frac{1}{2\sigma^2}\sum_{i=1}^N \|y_i - h_\theta(x_i)\|_2^2 + \log P(\theta) + \text{const}$$
Case 1 — Gaussian prior: Ridge regression

Set $P(\theta) = \mathcal{N}(\theta;\, 0,\, \tau^2 I)$, so $\log P(\theta) = -\frac{1}{2\tau^2}\|\theta\|_2^2 + \text{const}$. Maximising $\ell_r$ is equivalent to minimising:

$$\underset{\theta}{\arg\min}\;\sum_{i=1}^N \|y_i - h_\theta(x_i)\|_2^2 + \frac{\sigma^2}{\tau^2}\|\theta\|_2^2$$

With $\lambda = \sigma^2/\tau^2$, this is exactly Ridge regression (an overall factor of $1/N$ on the loss merely rescales $\lambda$). The regularisation parameter $\lambda$ is the ratio of noise variance to prior variance — a small prior variance (strong prior belief that $\theta \approx 0$) gives large $\lambda$ (strong regularisation).
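The equivalence can be checked numerically: run plain gradient descent on the (rescaled) negative log-posterior and compare with the Ridge closed form at $\lambda = \sigma^2/\tau^2$. A minimal sketch with made-up data and $\sigma^2 = \tau^2 = 1$:

```python
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
sigma2, tau2 = 1.0, 1.0          # noise variance, prior variance
lam = sigma2 / tau2              # lambda = sigma^2 / tau^2

# Closed-form Ridge solution.
theta_closed = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

# Gradient descent on the rescaled negative log-posterior:
#   f(theta) = ||y - X theta||^2 + lam ||theta||^2
theta = np.zeros(2)
lr = 0.05
for _ in range(2000):
    grad = 2 * X.T @ (X @ theta - y) + 2 * lam * theta
    theta -= lr * grad

print(theta_closed, theta)  # the two estimates agree
```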

Case 2 — Laplace prior: Lasso regression

Set $P(\theta) = \text{Laplace}(\theta;\, 0,\, b)$, so $\log P(\theta) = -\frac{1}{b}\|\theta\|_1 + \text{const}$. The MAP objective becomes:

$$\underset{\theta}{\arg\min}\;\sum_{i=1}^N \|y_i - h_\theta(x_i)\|_2^2 + \frac{2\sigma^2}{b}\|\theta\|_1$$

This is exactly Lasso. The Laplace prior has heavier tails than the Gaussian — it places more probability mass near zero and also in the tails, encouraging sparse solutions (many $\theta_j = 0$, with a few large ones).

Ridge = MAP with Gaussian prior. Lasso = MAP with Laplace prior. Every regularised regression is a Bayesian estimate with a specific prior on the parameters. The choice of regulariser is the choice of prior. Regularisation is not a hack — it is principled Bayesian inference.
The Grand Unification — All Roads Lead Here

$$\text{ERM: } \min \hat{R}(\theta)\;(\text{Ch 2}) \;\equiv\; \text{MLE: } \max \ell(\theta)\;(\text{Ch 3–4}) \;\equiv\; \text{KL Min: } \min D_{KL}(P\|P_\theta)\;(\text{Ch 3}) \;\equiv\; \text{Reg-ERM: } \min \hat{R} + \lambda\Omega\;(\text{Ch 8}) \;\equiv\; \text{MAP: } \max \log P(\theta|D)\;(\text{Ch 5})$$

One optimisation problem. Eight chapters. Five equivalent descriptions.

Support Vector Machines: The Maximum Margin Classifier

Regularisation fixes overfitting for regression. For classification, there is a different approach to finding the "best" hypothesis — one based on geometry rather than probabilistic priors. When data is linearly separable, infinitely many hyperplanes separate the two classes. The SVM asks: which one is optimal?

Consider binary classification: $x \in \mathbb{R}^d$, $y \in \{-1, +1\}$. A linear classifier is defined by a discriminant function $g(x) = w^\top x + b$, predicting $\hat{y} = \text{sign}(g(x))$. A dataset is linearly separable if there exist $w, b$ such that: $$w^\top x_i + b > 0 \quad \text{for all } y_i = +1 \qquad \text{and} \qquad w^\top x_i + b < 0 \quad \text{for all } y_i = -1$$

The margin of a linear classifier $(w, b)$ is the perpendicular distance from the decision boundary $w^\top x + b = 0$ to the closest training point. Under the canonical normalisation $|w^\top x_i + b| \geq 1$ for all training points, that distance is $\frac{1}{\|w\|}$, and the full margin width — the distance between the two parallel supporting hyperplanes $w^\top x + b = +1$ and $w^\top x + b = -1$ — equals $\frac{2}{\|w\|}$.

The SVM finds the hyperplane that maximises this margin. Maximising $\frac{2}{\|w\|}$ is equivalent to minimising $\|w\|^2$:

$$\min_{w,b}\; \frac{1}{2} w^\top w \qquad \text{subject to}\quad (w^\top x_i + b)\,y_i \geq 1 \quad \forall\, i$$
Why maximise the margin? Intuitively, a classifier with a large margin is more robust — small perturbations in the data (noise, new test points near the boundary) are less likely to cross the decision boundary. The maximum-margin classifier is the most "confident" separator: it commits as little as possible to any specific training point, staying as far as possible from all of them. This geometric principle turns out to be equivalent to minimising a specific regularised risk — SVMs are regularised classifiers with a particular structure.

Solving the SVM: KKT Conditions

The SVM primal is a constrained quadratic optimisation problem — convex objective, linear constraints. It is solved via the Lagrangian and KKT (Karush-Kuhn-Tucker) conditions.

Derivation · SVM via KKT Conditions · Primal → Dual · Support Vectors
Step 1 โ€” Write the Lagrangian

Introduce multipliers $\mu_i \geq 0$ for each constraint $[1 - y_i(w^\top x_i + b)] \leq 0$:

$$\mathcal{L}(w, b, \mu) = \frac{w^\top w}{2} + \sum_{i=1}^n \mu_i\left[1 - y_i(w^\top x_i + b)\right]$$
Step 2 โ€” KKT stationarity conditions

Set gradients to zero:

$$\nabla_w \mathcal{L} = 0 \implies w^* = \sum_{i=1}^n \mu_i y_i x_i$$ $$\nabla_b \mathcal{L} = 0 \implies \sum_{i=1}^n \mu_i y_i = 0$$

The optimal $w^*$ is a linear combination of the training points, weighted by $\mu_i y_i$. Only points with $\mu_i > 0$ contribute — these are the support vectors $S = \{x_i : \mu_i > 0\}$.

Step 3 โ€” KKT complementary slackness

$\mu_i^*\left[1 - y_i(w^\top x_i + b)\right] = 0$ for all $i$. A point either lies exactly on the margin boundary ($y_i(w^\top x_i + b) = 1$, called a support vector) or has $\mu_i = 0$ (interior point, irrelevant to the solution).

Step 4 โ€” Derive the dual problem

Substitute $w^* = \sum_i \mu_i y_i x_i$ back into $\mathcal{L}$ and simplify:

$$q(\mu) = \sum_{i=1}^n \mu_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \mu_i \mu_j y_i y_j\, x_i^\top x_j$$

The dual problem maximises $q(\mu)$ subject to $\mu_i \geq 0$ and $\sum_i \mu_i y_i = 0$. It is quadratic in $\mu$ with linear constraints — standard QP solvers apply.

Key observation: the data $x$ appears only through inner products $x_i^\top x_j$. This is the crucial structural property that enables the kernel trick.

Once $\mu^*$ is found, the optimal weight vector is $w^* = \sum_{s \in S} \mu_s y_s x_s$ — a sparse combination of support vectors only. The bias $b^*$ is recovered from any support vector $x_s$: $y_s(w^{*\top} x_s + b^*) = 1 \implies b^* = y_s - w^{*\top} x_s$.

Inference on a new point $x$: $\hat{y} = \text{sign}(w^{*\top} x + b^*) = \text{sign}\!\left(\sum_{s \in S} \mu_s y_s\, x_s^\top x + b^*\right)$.

Soft-Margin SVM: When Data Is Not Linearly Separable

Real data is rarely linearly separable. The hard-margin SVM has no solution when the classes overlap. The fix: introduce slack variables $\xi_i \geq 0$ that allow points to violate the margin by a controlled amount.

$$\min_{w,b,\xi}\; \frac{1}{2}w^\top w + C\sum_{i=1}^n \xi_i \qquad \text{subject to}\quad (w^\top x_i + b)y_i \geq 1 - \xi_i,\quad \xi_i \geq 0$$

The hyperparameter $C > 0$ controls the tradeoff between margin maximisation and margin violation. Large $C$: few violations allowed, narrow margin, potentially overfits. Small $C$: many violations tolerated, wider margin, better generalisation. The dual problem is identical to the hard-margin case except $\mu_i$ is now bounded: $0 \leq \mu_i \leq C$.

$C$ is the most important hyperparameter in a soft-margin SVM — it directly controls the bias-variance tradeoff. Choosing $C$ by cross-validation on a held-out set is mandatory. In financial data, where classes almost always overlap (up-days and down-days have similar feature distributions), the soft-margin SVM is the only practically usable version.
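The soft-margin objective can be attacked directly with batch subgradient descent on the hinge losses. This is a crude sketch on made-up separable 2D data (practical solvers work on the dual QP or use SMO), just to make the objective concrete:

```python
import numpy as np

# Made-up 2D data, linearly separable for illustration.
X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C, lr = 1.0, 0.01

w, b = np.zeros(2), 0.0
for _ in range(5000):
    margins = y * (X @ w + b)
    viol = margins < 1   # points currently violating the margin
    # Subgradient of (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))
    grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b, np.sign(X @ w + b))  # all four points end up classified correctly
```

With a constant step size the iterates only oscillate in a small neighbourhood of the optimum, which is enough for a demonstration; decaying steps (as in Pegasos-style solvers) give convergence guarantees.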

The Kernel Trick: Nonlinearity at Zero Extra Cost

The SVM dual problem only depends on the data through inner products $x_i^\top x_j$. Now consider mapping data to a higher-dimensional space: $\phi: \mathbb{R}^d \to \mathbb{R}^{d'}$ with $d' \gg d$. The dual becomes:

$$\max_\mu \sum_i \mu_i - \frac{1}{2}\sum_i\sum_j \mu_i \mu_j y_i y_j\, \phi(x_i)^\top \phi(x_j)$$

Computing $\phi(x_i)^\top \phi(x_j)$ directly requires materialising the high-dimensional features — potentially expensive. The kernel trick: if there exists a function $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$, we can compute the inner product in the high-dimensional space using only original features, without ever computing $\phi$ explicitly.

A kernel function $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ satisfies $K(x, z) = \phi(x)^\top \phi(z)$ for some feature map $\phi$. By Mercer's theorem, a symmetric function $K$ is a valid kernel if and only if the $n \times n$ kernel matrix with entries $K_{ij} = K(x_i, x_j)$ is positive semi-definite for all datasets. Common valid kernels: $$\text{Polynomial: } K(x,z) = (x^\top z + c)^d \qquad \text{Gaussian/RBF: } K(x,z) = \exp\!\left(-\frac{\|x-z\|^2}{2\sigma^2}\right)$$
๐Ÿ“ Polynomial Kernel

$K(x,z) = (x^\top z + 1)^2$ in $\mathbb{R}^2$ implicitly maps to the feature space $\phi(x) = [x_1^2, x_2^2, \sqrt{2}x_1 x_2, \sqrt{2}x_1, \sqrt{2}x_2, 1]$ — 6 dimensions from 2, capturing all quadratic interactions. Computing $K$ costs $O(d)$; computing $\phi(x)^\top \phi(z)$ directly costs $O(d^2)$.
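The expansion is easy to verify numerically: compute $K(x,z) = (x^\top z + 1)^2$ directly and via the explicit 6-dimensional map $\phi$. A minimal check on arbitrary made-up vectors:

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel, computed in O(d)."""
    return (x @ z + 1.0) ** 2

def phi(x):
    """Explicit feature map for (x^T z + 1)^2 on R^2: six quadratic features."""
    s = np.sqrt(2.0)
    return np.array([x[0]**2, x[1]**2, s * x[0] * x[1], s * x[0], s * x[1], 1.0])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
k_direct = poly_kernel(x, z)      # cheap: one dot product, one square
k_explicit = phi(x) @ phi(z)      # expensive route: same number
print(k_direct, k_explicit)
```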

๐Ÿ“ Gaussian (RBF) Kernel

$K(x,z) = \exp(-\|x-z\|^2/2\sigma^2)$ corresponds to an infinite-dimensional feature map. The SVM with an RBF kernel operates in a Hilbert space of infinite dimension — computing the inner product as a simple exponential of the distance. $\sigma$ is the bandwidth hyperparameter.
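Mercer's condition can be spot-checked: build the RBF Gram matrix on a handful of points and confirm its eigenvalues are non-negative. A small numerical check with made-up points and an assumed bandwidth $\sigma = 1$:

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian/RBF kernel: exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

# A handful of fixed 2D points (made up).
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5]])
n = len(pts)
K = np.array([[rbf_kernel(pts[i], pts[j]) for j in range(n)] for i in range(n)])

eigs = np.linalg.eigvalsh(K)   # K is symmetric, so eigvalsh applies
print(eigs)                    # all >= 0 (up to floating-point noise): PSD
```

One check on one dataset does not prove Mercer's condition, of course; for the RBF kernel positive semi-definiteness holds for every dataset, which is what the theorem guarantees.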

The kernel trick is one of the most elegant ideas in all of ML. You want to classify with a nonlinear boundary — which requires a high-dimensional feature map. But you don't need the features explicitly; you only need their inner products. The kernel function computes those inner products cheaply, as if you had done the full feature expansion, without actually doing it. The SVM with an RBF kernel effectively finds the maximum-margin hyperplane in an infinite-dimensional space — something that sounds computationally impossible but is achieved in $O(n^2)$ kernel evaluations.
SVMs with RBF kernels are used in credit scoring at Indian NBFCs and banks. The feature space of a loan applicant — income, assets, bureau score, transaction history — is nonlinearly separable between default and non-default. An RBF kernel SVM finds a nonlinear boundary without specifying its form in advance. The support vectors are the most informative applicant profiles — those on the boundary between creditworthy and risky. Every bank's credit model has some version of this idea, even if implemented as gradient-boosted trees rather than explicit SVMs.
Worked Example 1 · Ridge vs OLS on a Collinear NIFTY Factor Model 🇮🇳 · Regularisation in Practice

A quant fits a 2-factor model for NIFTY returns: $y = \theta_1 x_1 + \theta_2 x_2 + \epsilon$ where $x_1$ = FII buying and $x_2 \approx x_1$ (DII selling, highly correlated with FII). With $N = 10$ observations, $X^\top X \approx \begin{bmatrix} 100 & 98 \\ 98 & 100 \end{bmatrix}$, $X^\top y = \begin{bmatrix} 5 \\ 4.9 \end{bmatrix}$.

OLS solution — unstable
$$(X^\top X)^{-1} = \frac{1}{100^2 - 98^2}\begin{bmatrix}100 & -98 \\ -98 & 100\end{bmatrix} = \frac{1}{396}\begin{bmatrix}100 & -98 \\ -98 & 100\end{bmatrix}$$ $$\theta^*_{\text{OLS}} = \frac{1}{396}\begin{bmatrix}100(5) - 98(4.9) \\ -98(5) + 100(4.9)\end{bmatrix} = \frac{1}{396}\begin{bmatrix}500 - 480.2 \\ -490 + 490\end{bmatrix} = \begin{bmatrix}0.05 \\ 0.0\end{bmatrix}$$

Looks fine — but a tiny change in the data (replace $4.9$ with $5.0$ in $X^\top y$) gives $\theta^* = [0.025, 0.025]$ instead of $[0.05, 0.0]$ — a completely different factor attribution from nearly identical data. OLS is wildly unstable due to near-singularity.

Ridge solution — stable

With $\lambda = 2$, $(X^\top X + \lambda I) = \begin{bmatrix}102 & 98 \\ 98 & 102\end{bmatrix}$, determinant $= 102^2 - 98^2 = (102+98)(102-98) = 800$:

$$\theta^*_{\text{Ridge}} = \frac{1}{800}\begin{bmatrix}102 & -98 \\ -98 & 102\end{bmatrix}\begin{bmatrix}5 \\ 4.9\end{bmatrix} = \frac{1}{800}\begin{bmatrix}510 - 480.2 \\ -490 + 499.8\end{bmatrix} = \begin{bmatrix}0.037 \\ 0.012\end{bmatrix}$$
Ridge shrinks both coefficients toward zero and distributes the attribution more evenly. More importantly, the solution is stable: small changes in data produce small changes in $\theta^*$. The $\lambda I$ term in $(X^\top X + \lambda I)$ ensures the matrix is well-conditioned regardless of collinearity. In a real factor model with 20 correlated macro variables, Ridge regression is essential — OLS would produce absurd, unstable factor loadings.
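The arithmetic above is easy to reproduce. The sketch below works directly from the stated $X^\top X$ and $X^\top y$ and measures how far each estimator moves when $4.9$ is perturbed to $5.0$:

```python
import numpy as np

XtX = np.array([[100.0, 98.0], [98.0, 100.0]])
Xty = np.array([5.0, 4.9])
Xty_perturbed = np.array([5.0, 5.0])   # tiny change in the data

def ols(XtX, Xty):
    return np.linalg.solve(XtX, Xty)

def ridge(XtX, Xty, lam):
    return np.linalg.solve(XtX + lam * np.eye(2), Xty)

d_ols = np.linalg.norm(ols(XtX, Xty) - ols(XtX, Xty_perturbed))
d_ridge = np.linalg.norm(ridge(XtX, Xty, 2.0) - ridge(XtX, Xty_perturbed, 2.0))
print(d_ols, d_ridge)   # Ridge moves far less: the solution is stable
```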
Worked Example 2 · SVM Support Vectors on NIFTY Direction Data 🇮🇳 · Maximum Margin · 2D Feature Space

Four NIFTY trading days, 2 features: $x_1$ = INDIA VIX, $x_2$ = 5-day momentum (%). Labels: $y = +1$ (market up), $y = -1$ (market down).

$$\begin{array}{lccc} \text{Day} & x_1\text{ (VIX)} & x_2\text{ (Mom\%)} & y \\ \hline A & 13 & +2 & +1 \\ B & 15 & +1 & +1 \\ C & 20 & -1 & -1 \\ D & 22 & -2 & -1 \end{array}$$
Identify the decision boundary

The two classes are separated: low VIX + positive momentum = up, high VIX + negative momentum = down. The SVM finds the hyperplane $w_1 x_1 + w_2 x_2 + b = 0$ that maximises the margin. By symmetry, the boundary passes midway between the two classes — roughly along $x_1 \approx 17.5$ (or equivalently a line in the 2D feature space).

Identify support vectors

The support vectors are the points closest to the decision boundary — the points that lie on the margin boundaries $w^\top x + b = \pm 1$. In this dataset: Day B $(15, +1)$ and Day C $(20, -1)$ are the closest points to the boundary from each class. These are the only points with $\mu_i > 0$. Days A and D lie well inside their respective half-spaces — $\mu_A = \mu_D = 0$.

The key implication
Only the support vectors matter. If you remove days A and D from the training set, the SVM solution is identical — they contribute nothing to $w^*$ since $\mu_A = \mu_D = 0$. The maximum-margin hyperplane is fully determined by the most informative points on the boundary. This sparsity is what makes SVMs computationally tractable and robust: a model trained on 1000 days may have only 50 support vectors determining its decision boundary.
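A numerical check of this example, assuming (as the example states) that days B and C are the only support vectors. With exactly two support vectors $x_+$ and $x_-$, the dual collapses to $\mu = 2/\|x_+ - x_-\|^2$, $w^* = \mu(x_+ - x_-)$, and $b^* = 1 - w^{*\top} x_+$:

```python
import numpy as np

# Four NIFTY days: (VIX, momentum %) and label, in order A, B, C, D.
X = np.array([[13.0, 2.0], [15.0, 1.0], [20.0, -1.0], [22.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Assume B and C are the support vectors (one from each class).
x_pos, x_neg = X[1], X[2]
mu = 2.0 / np.sum((x_pos - x_neg) ** 2)   # dual solution for two support vectors
w = mu * (x_pos - x_neg)
b = 1.0 - w @ x_pos                        # support vector sits on the margin

margins = y * (X @ w + b)
print(w, b)
print(margins)   # B and C sit exactly at 1; A and D are strictly clear (> 1)
```

The margins of A and D exceed 1, confirming that they satisfy the constraints strictly and would receive $\mu = 0$, consistent with the assumption.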
Chapter 9 Preview

SVMs with kernels showed that the right feature map can make any problem linearly separable — but we had to specify the kernel manually. What if the feature map itself were learned from data? That is the deep learning idea. Chapter 9 introduces Convolutional Neural Networks — where the feature extractor is not hand-designed but optimised end-to-end by gradient descent.


Practice Problems
4 questions · Chapter 08
ml / mathematical-foundations / ch08 / q01 ★ Conceptual

A quant uses Ridge regression ($L_2$ regularisation) on a NIFTY factor model. A colleague says "Ridge just adds noise to the matrix — it's a numerical hack." What is the precise Bayesian interpretation that refutes this?

A
Ridge regression is equivalent to MLE with a larger dataset — it is not Bayesian.
B
Ridge regression is MAP estimation with a Gaussian prior $P(\theta) = \mathcal{N}(0, \tau^2 I)$ on the parameters. The $\lambda I$ term encodes the prior belief that parameters are likely small. It is principled inference, not a hack.
C
Ridge regression is MAP estimation with a Laplace prior, which encourages sparsity in the coefficients.
D
Ridge regression is a frequentist method with no Bayesian interpretation — the two frameworks are incompatible.
Answer: B.

As derived in this chapter: with Gaussian likelihood $P(y|x,\theta) = \mathcal{N}(y; h_\theta(x), \sigma^2)$ and Gaussian prior $P(\theta) = \mathcal{N}(0, \tau^2 I)$, the MAP objective is $\arg\min \sum_i \|y_i - h_\theta(x_i)\|^2 + \frac{\sigma^2}{\tau^2}\|\theta\|_2^2$. This is exactly Ridge regression with $\lambda = \sigma^2/\tau^2$. The regularisation parameter $\lambda$ encodes the ratio of noise variance to prior variance — not a numerical trick but a precise prior belief that parameters are unlikely to be very large.

Option C is wrong: Laplace prior gives Lasso (L1), not Ridge (L2). Option D is wrong: MAP is a frequentist-Bayesian bridge — it computes a point estimate from the posterior, unifying both frameworks.
ml / mathematical-foundations / ch08 / q02 ★ Conceptual

In a trained SVM on 500 NIFTY trading days, only 35 days are identified as support vectors. What happens to the decision boundary if the remaining 465 non-support-vector training points are removed?

A
The decision boundary changes significantly — fewer training points means less information.
B
The decision boundary becomes undefined — SVMs require all training points to compute $w^*$.
C
The decision boundary is completely unchanged — since $w^* = \sum_{s \in S} \mu_s y_s x_s$ depends only on support vectors, the 465 non-support-vector points contribute $\mu_i = 0$ and are irrelevant.
D
The decision boundary shifts slightly — removing points always affects the margin calculation.
Answer: C.

From the KKT complementary slackness condition: $\mu_i^*[1 - y_i(w^\top x_i + b)] = 0$. For non-support-vector points that lie strictly inside their margin ($y_i(w^\top x_i + b) > 1$), the constraint is satisfied strictly, forcing $\mu_i^* = 0$. Since $w^* = \sum_i \mu_i y_i x_i$, these points contribute zero to $w^*$ and zero to $b^*$. Removing them changes nothing.

This is one of the most practically important properties of SVMs: the solution is sparse in the training data. You can store only the 35 support vectors and discard the other 465 points entirely — the model is perfectly preserved. This makes SVMs memory-efficient at deployment time.
ml / mathematical-foundations / ch08 / q03 ★★ Conceptual

A data scientist wants to classify NIFTY trading days as trending or mean-reverting. The two classes are not linearly separable in the original 5-dimensional feature space. She applies an RBF kernel SVM. What is the RBF kernel implicitly computing, and why is this more powerful than expanding features manually?

A
The RBF kernel computes the Euclidean distance between points, which is equivalent to a linear classifier in the original space.
B
The RBF kernel squares each feature, mapping to a 25-dimensional space of all pairwise products.
C
The RBF kernel implicitly computes an inner product in an infinite-dimensional feature space, where the two classes may be linearly separable. The kernel trick evaluates this inner product as a simple exponential $\exp(-\|x_i - x_j\|^2/2\sigma^2)$ without materialising the infinite-dimensional features.
D
The RBF kernel is a smoothing operation that averages nearby feature vectors, reducing noise before linear classification.
Answer: C.

By Mercer's theorem, the RBF kernel $K(x,z) = \exp(-\|x-z\|^2/2\sigma^2)$ corresponds to an inner product $\phi(x)^\top \phi(z)$ where $\phi$ maps to a Reproducing Kernel Hilbert Space (RKHS) of infinite dimension. The SVM with an RBF kernel finds the maximum-margin hyperplane in this infinite-dimensional space — a nonlinear decision boundary in the original 5D feature space.

The power of the kernel trick: computing $K(x_i, x_j) = \exp(-\|x_i-x_j\|^2/2\sigma^2)$ costs $O(d)$ — just the distance between two 5-dimensional vectors. Computing the explicit inner product $\phi(x_i)^\top\phi(x_j)$ in the infinite-dimensional space would be impossible. The kernel gives you the answer without the computation.
ml / mathematical-foundations / ch08 / q04 ★★ Synthesis

A quant asks: "Should I use Lasso or Ridge for my NIFTY factor model? I have 200 macro features but suspect only 10 are genuinely predictive." Which regulariser is more appropriate, and what is the precise Bayesian reason?

A
Ridge — it produces smaller coefficients overall, which is always better for financial models.
B
Lasso — the belief that only 10 of 200 features are relevant is encoded by a Laplace prior, which concentrates mass near zero and has heavier tails, naturally producing sparse solutions where most coefficients are exactly zero.
C
Ridge — it has a closed-form solution, making it more reliable than Lasso's iterative solver for financial applications.
D
Neither — with 200 features you need a neural network, not a linear model with regularisation.
Answer: B.

The prior belief "only 10 of 200 features are relevant" is a sparsity belief — most $\theta_j$ should be exactly zero. The Laplace prior $P(\theta_j) \propto \exp(-|\theta_j|/b)$ encodes this: it has a sharp peak at zero (pulling coefficients to exactly zero) and heavier tails than Gaussian (allowing the 10 relevant coefficients to be large). This produces the Lasso penalty $\lambda\|\theta\|_1$, which shrinks irrelevant coefficients to exactly zero, performing automatic feature selection.

Ridge's Gaussian prior shrinks all coefficients toward zero proportionally but never sets any exactly to zero — with 200 features, you get 200 small nonzero coefficients, harder to interpret and more prone to overfitting than Lasso's sparse solution.

Option C is partially valid practically but misses the probabilistic reasoning. Option D is wrong: linear models with Lasso regularisation are often preferable to neural networks when $N$ is moderate and interpretability matters.

Terminal Questions — Chapter 08 · 10 problems · No answers given

Q1–3 are direct computations and definitions. Q4–7 connect regularisation, SVMs, and the probabilistic framework. Q8–10 are synthesis questions that span the entire module.

1. For the Ridge regression problem with $X^\top X = \begin{bmatrix}4 & 2 \\ 2 & 3\end{bmatrix}$, $X^\top y = \begin{bmatrix}6 \\ 5\end{bmatrix}$, and $\lambda = 1$, compute the Ridge solution $\theta^* = (X^\top X + \lambda I)^{-1}X^\top y$. Compare to the OLS solution and describe how Ridge shrinks the coefficients. (Easy)

2. Verify that $K(x, z) = (x^\top z)^2$ is a valid kernel for $x, z \in \mathbb{R}^2$ by: (a) finding the feature map $\phi(x)$ such that $K(x,z) = \phi(x)^\top \phi(z)$, and (b) showing the $2\times 2$ kernel matrix $K = \begin{bmatrix}K(x_1,x_1) & K(x_1,x_2) \\ K(x_2,x_1) & K(x_2,x_2)\end{bmatrix}$ is positive semi-definite for $x_1 = [1,0]^\top$, $x_2 = [0,1]^\top$. (Easy)

3. For a hard-margin SVM with two support vectors at $x_+ = [2, 1]^\top$ ($y=+1$) and $x_- = [0, 1]^\top$ ($y=-1$), find $w^*$ and $b^*$ analytically. What is the margin width $2/\|w^*\|$? (Easy)

4. Derive the Ridge regression solution $\theta^*_\text{Ridge} = (X^\top X + \lambda I)^{-1}X^\top y$ from the Reg-ERM objective $\min_\theta \|X\theta - y\|^2 + \lambda\|\theta\|^2$ by differentiating and setting to zero. Show that Ridge always has a unique solution even when $X^\top X$ is singular. (Medium)

5. Explain geometrically why Lasso ($L_1$) produces sparse solutions while Ridge ($L_2$) does not. Draw the constraint regions $\|\theta\|_1 \leq \kappa$ and $\|\theta\|_2^2 \leq \kappa$ in 2D and show why the $L_1$ ball has corners that align with the coordinate axes, causing the OLS loss ellipses to first contact the $L_1$ constraint at a corner (setting one coordinate to zero). (Medium)

6. In the soft-margin SVM, the hyperparameter $C$ controls the bias-variance tradeoff. (a) Explain what happens as $C \to \infty$ (reverts to hard-margin). (b) Explain what happens as $C \to 0$ (all points become support vectors). (c) In terms of the bias-variance decomposition from Ch7, which direction is high-bias and which is high-variance? (Medium)

7. The SVM dual problem shows that data only enters through inner products $x_i^\top x_j$. Show that the same is true for the prediction function: $h(x) = \text{sign}(w^{*\top}x + b^*) = \text{sign}(\sum_{s\in S} \mu_s y_s x_s^\top x + b^*)$. Why does this mean that replacing $x_i^\top x_j$ with $K(x_i, x_j)$ in both the training and prediction steps is sufficient to "kernelise" the SVM? (Medium)

8. Trace the following path through the entire module: a quant wants to classify NIFTY trading days as "good for selling straddles" vs "bad for selling straddles." For each of the following models, identify which chapter(s) provide the mathematical foundations: (a) logistic regression, (b) KNN classifier, (c) Gaussian Mixture Model + Bayes classifier, (d) SVM with RBF kernel. For each, state the key assumption about $P_{XY}$ and the computational procedure for finding the optimal classifier. (Hard)

9. The hinge loss is defined as $L_\text{hinge}(h(x), y) = \max(0, 1 - y\cdot h(x))$. Show that ERM with hinge loss and an $L_2$ regulariser is equivalent to the soft-margin SVM primal problem. This demonstrates that the SVM is a regularised ERM with a specific loss function — completing the unification of all classifiers under the ERM framework. (Hard)

10. Module capstone. The probabilistic view (Ch1–5) and the geometric/optimisation view (Ch6–8) of ML are different languages for the same ideas. For each pair below, identify the correspondence and explain it in one precise sentence: (a) MLE ↔ ERM, (b) MAP ↔ Reg-ERM, (c) KL divergence ↔ cross-entropy loss, (d) latent variable model ↔ GMM density estimation, (e) ELBO lower bound ↔ Jensen's inequality. Then state: is there any fundamentally new idea in Ch6–8 that does not have a probabilistic counterpart, or are the two frameworks truly unified? (Hard)