Machine Learning · Mathematical Foundations

The Mathematics
Behind Every ML Algorithm

Most ML courses teach you to call model.fit(). This one teaches you what's actually happening inside it — with full mathematical rigour, genuine intuition, and examples that matter to Indian traders and practitioners.

13
Chapters
10 Live
3 More Coming
40+
MCQ Problems
100+
Terminal Questions
0
Forever Free
Prerequisites
✓ +2 Mathematics
✓ Basic Probability
✓ Calculus (differentiation)
Linear Algebra (helpful, not required)
No ML background needed
Course Chapters
Each chapter builds on the last. Start from Chapter 1.
01
What Is Learning? ● Live
The function approximation view, the probabilistic worldview, random variables and $P_{XY}$ as the god-object of ML, the i.i.d. assumption fully earned, and the parametric estimation framework.
$P_{XY}$ i.i.d. Parametric Estimation Distribution Shift
4 MCQ · 10 Terminal
02
The Learning Machine ● Live
Hypothesis classes, loss functions, true risk, empirical risk minimisation, the generalisation gap, the Bayes classifier and its proof, and the conditional expectation as the regression oracle.
Hypothesis Class Loss Functions ERM Bayes Classifier Generalisation Gap
4 MCQ · 10 Terminal
03
Empirical Risk and the Likelihood Game ● Live
Surprisal, entropy, cross-entropy, KL divergence from scratch. The central equivalence: minimising KL divergence = maximising likelihood = ERM with NLL loss. Squared error as MLE under a Gaussian assumption.
KL Divergence Entropy MLE Cross-Entropy Model Misspecification
4 MCQ · 10 Terminal
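The central equivalence can be stated in one line. A sketch in notation of my own choosing ($\hat p$ for the empirical distribution of the sample, $p_\theta$ for the model), not a substitute for the chapter's derivation:

```latex
\arg\min_\theta \,\mathrm{KL}\bigl(\hat p \,\|\, p_\theta\bigr)
  \;=\; \arg\max_\theta \,\frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x_i)
  \;=\; \arg\min_\theta \,\frac{1}{n}\sum_{i=1}^{n} \bigl(-\log p_\theta(x_i)\bigr),
\qquad
-\log \mathcal{N}\bigl(y;\, f_\theta(x),\, \sigma^2\bigr)
  \;=\; \frac{\bigl(y - f_\theta(x)\bigr)^2}{2\sigma^2} + \tfrac{1}{2}\log\bigl(2\pi\sigma^2\bigr).
```

The first chain is the KL = MLE = ERM-with-NLL equivalence; the second line is why squared error is MLE under a Gaussian noise assumption.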
04
Maximum Likelihood — Let the Data Speak ● Live
Density estimation for the Bayes classifier, MLE for Gaussian and discrete models with full derivations, the multimodal problem, Gaussian Mixture Models as universal approximators, and why mixture MLE has no closed form.
MLE Density Estimation GMM Lagrangian Multimodal Data
4 MCQ · 10 Terminal
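As a taste of what "full derivations" means here: the Gaussian MLE has a closed form (sample mean and sample variance), which a few lines of NumPy can verify numerically. A minimal sketch of my own, not course code; the data and function name are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=5000)

def gaussian_nll(mu, sigma2, x):
    """Average negative log-likelihood of data x under N(mu, sigma2)."""
    return 0.5 * np.log(2 * np.pi * sigma2) + np.mean((x - mu) ** 2) / (2 * sigma2)

# Closed-form MLE: sample mean and (biased) sample variance.
mu_hat = x.mean()
sigma2_hat = x.var()

# The NLL at the MLE is no larger than at nearby perturbed parameters.
for d in (-0.1, 0.1):
    assert gaussian_nll(mu_hat, sigma2_hat, x) <= gaussian_nll(mu_hat + d, sigma2_hat, x)
    assert gaussian_nll(mu_hat, sigma2_hat, x) <= gaussian_nll(mu_hat, sigma2_hat + d, x)
```

The chapter derives why these closed forms fall out of setting the gradient of the log-likelihood to zero.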
05
Hidden Variables, Prior Beliefs, and the EM Algorithm ● Live
Latent variable models, the ELBO derived via Jensen's inequality, KL gap and tightness condition, the EM algorithm with convergence proof, responsibilities for GMMs, Bayesian estimation, MAP with conjugate priors, and the Beta-Bernoulli example.
EM Algorithm ELBO Latent Variables MAP Estimation Conjugate Priors
4 MCQ · 10 Terminal
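The "responsibilities" step is concrete enough to sketch. A hypothetical E-step for a two-component 1-D GMM (my own illustration, not course code; the parameter values are made up):

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def e_step(x, pis, mus, sigma2s):
    """E-step: responsibility r[i, k] = P(component k | x_i) under current params."""
    # Unnormalised posterior: prior weight times component likelihood.
    r = np.stack(
        [pi * normal_pdf(x, mu, s2) for pi, mu, s2 in zip(pis, mus, sigma2s)],
        axis=1,
    )
    return r / r.sum(axis=1, keepdims=True)

x = np.array([-2.0, -1.9, 2.0, 2.1])
r = e_step(x, pis=[0.5, 0.5], mus=[-2.0, 2.0], sigma2s=[1.0, 1.0])
assert np.allclose(r.sum(axis=1), 1.0)    # each row is a posterior over components
assert r[0, 0] > 0.99 and r[2, 1] > 0.99  # points near a mean are assigned to it
```

The M-step then re-estimates each component's parameters from the responsibility-weighted data, which is where the convergence proof in the chapter comes in.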
06
Non-parametric Density Estimation ● Live
The $k/nV$ counting argument derived from scratch, Parzen windows with hypercube and Gaussian kernels, KNN density estimation with adaptive neighbourhoods, and the KNN classifier derived as the Bayes classifier with non-parametric density estimates.
Parzen Windows KNN Kernel Density Adaptive Neighbourhood Bayes Classifier
4 MCQ · 10 Terminal
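The $k/nV$ idea fits in a few lines. A 1-D sketch under assumptions of my own (uniform toy data, invented function name), not course code:

```python
import numpy as np

def knn_density(x0, data, k):
    """kNN density estimate p(x0) ≈ k / (n · V), where V is the volume
    (in 1-D, the length) of the smallest ball around x0 containing k points."""
    dists = np.sort(np.abs(data - x0))
    V = 2 * dists[k - 1]  # 1-D "ball": radius = distance to the k-th neighbour
    return k / (len(data) * V)

rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, size=4000)  # true density is 1 on [0, 1]
est = knn_density(0.5, data, k=50)
assert 0.6 < est < 1.5  # close to the true value 1.0
```

Note the adaptive behaviour the chapter emphasises: $V$ shrinks where data is dense and grows where it is sparse, with $k$ fixed.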
07
Linear Models, Generalisation, and the Bias-Variance Tradeoff ● Live
ERM ↔ MLE equivalence, linear regression closed form $(X^\top X)^{-1}X^\top y$, generalised linear models with basis functions, logistic regression and softmax, train-test generalisation, and the full bias-variance-noise decomposition derived from scratch.
Linear Regression Logistic Regression Softmax Bias-Variance Generalisation
4 MCQ · 10 Terminal
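The closed form $(X^\top X)^{-1}X^\top y$ can be checked directly against a library solver. A minimal sketch with synthetic data of my choosing, not course code:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one feature
w_true = np.array([0.5, 2.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Normal-equations solution: w = (XᵀX)⁻¹ Xᵀ y.
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Agrees with the numerically preferred least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_hat, w_lstsq)
assert np.allclose(w_hat, w_true, atol=0.1)
```

The explicit inverse is written out only to mirror the formula; in practice you would use `lstsq` or a QR factorisation, since forming $(X^\top X)^{-1}$ is numerically fragile.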
08
Regularisation, SVMs, and the Kernel Trick ● Live
Regularised ERM with L1/L2 penalties, Ridge and Lasso as MAP estimation with Gaussian/Laplace priors, max-margin classifiers, KKT conditions and support vectors, soft-margin SVM, and the kernel trick with Mercer's theorem.
Ridge · Lasso SVM KKT Conditions Kernel Trick
4 MCQ · 10 Terminal
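The kernel trick has a two-line demonstration: a polynomial kernel evaluates an inner product in a feature space you never construct. A toy example of my own, not course material:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the kernel k(x, z) = (x·z)² in 2-D:
    φ(x) = (x1², √2·x1·x2, x2²)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k_poly(x, z):
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
# The kernel computes the feature-space inner product without visiting that space.
assert np.isclose(k_poly(x, z), phi(x) @ phi(z))
```

For the RBF kernel the implicit feature space is infinite-dimensional, which is exactly why Mercer's theorem, covered in the chapter, matters.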
09
Convolutional Neural Networks ● Live
Local receptive fields, parameter sharing, convolution output dimensions, hyperparameters (filters, stride, padding), pooling, and end-to-end classification architectures including LeNet and AlexNet.
Convolution Pooling Feature Maps LeNet · AlexNet
4 MCQ · 10 Terminal
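The convolution output-dimension rule is the standard formula $\lfloor (n + 2p - f)/s \rfloor + 1$. A sketch with an invented helper name, using the classic LeNet and AlexNet first-layer sizes as sanity checks:

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Spatial output size of a convolution: floor((n + 2p − f) / s) + 1."""
    return (n + 2 * padding - f) // stride + 1

# LeNet-style first layer: 32×32 input, 5×5 filters, no padding, stride 1 → 28×28.
assert conv_output_size(32, 5) == 28
# AlexNet-style first layer: 227×227 input, 11×11 filters, stride 4 → 55×55.
assert conv_output_size(227, 11, padding=0, stride=4) == 55
```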
10
Recurrent Neural Networks ● Live
Sequential data, parameter sharing across time, BPTT and vanishing gradients — the mathematical cause and the additive update solution, residual connections, and the road to LSTMs.
RNN BPTT Vanishing Gradients Additive Updates
4 MCQ · 10 Terminal
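The "mathematical cause" of vanishing gradients is a repeated matrix product: BPTT multiplies the gradient by the recurrent Jacobian once per time step, so any factor with spectral radius below 1 shrinks it geometrically. A toy sketch of my own, with an invented weight matrix:

```python
import numpy as np

T = 50
W = 0.5 * np.eye(4)  # recurrent weight matrix with spectral radius 0.5
grad = np.ones(4)

norms = []
for _ in range(T):
    grad = W.T @ grad  # BPTT: one Jacobian multiplication per step back in time
    norms.append(np.linalg.norm(grad))

# The gradient norm decays geometrically: after 50 steps, roughly 0.5⁵⁰ ≈ 1e-15.
assert norms[-1] < 1e-10
assert all(b < a for a, b in zip(norms, norms[1:]))
```

Additive updates (as in LSTMs and residual connections) change this Jacobian to be close to the identity, which is the fix the chapter builds toward.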
11
Attention and Transformers ○ Coming Soon
The attention mechanism, self-attention vs recurrence, encoder-decoder architecture, and how transformers replaced RNNs as the dominant sequence model.
Attention Self-Attention Transformers
4 MCQ · 10 Terminal 🔒
12
Ensembles and Decision Trees ○ Coming Soon
Decision trees, bagging, random forests, boosting — AdaBoost and XGBoost — and why ensemble methods are the dominant approach in tabular data ML.
Random Forests Boosting XGBoost
4 MCQ · 10 Terminal 🔒
13
Unsupervised Learning and Generative Models ○ Coming Soon
k-Means, Gaussian mixtures, PCA, dimensionality reduction, and a preview of generative models — GANs, VAEs, and diffusion models.
k-Means PCA GANs · VAEs Diffusion
4 MCQ · 10 Terminal 🔒
A note on pace. This module assumes no prior ML knowledge — only solid +2 mathematics and basic probability. Each chapter is self-contained but deliberately sequential: Chapter 3 (ERM) builds directly on Chapter 2 (loss functions), and Chapter 4 (MLE) is the payoff for both. If you've seen some of this material before, use the terminal questions at the end of each chapter to find your gaps. New chapters are added regularly — check back for updates.
What you will actually understand by the end
Why every ML problem is distribution estimation — and what $P_{XY}$ has to do with it
What i.i.d. actually means, why it's almost never exactly true in finance, and when to worry
Why MLE, ridge regression, and logistic regression are all the same idea in different clothes
The bias-variance tradeoff — mathematically, not just as a cartoon
What the EM algorithm is doing geometrically, and why it always increases the likelihood
Why SVMs find the maximum margin, and what the kernel trick actually does to the input space
Backpropagation as repeated chain rule — not magic, just calculus applied carefully
Why model misspecification, distribution shift, and overfitting are the three ways every ML model fails in practice
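To give one of those claims its precise form (standard notation, mine rather than necessarily the course's): for $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, averaging over random training sets, the bias-variance tradeoff reads

```latex
\mathbb{E}\!\left[\bigl(y - \hat f(x)\bigr)^2\right]
  \;=\; \underbrace{\sigma^2}_{\text{noise}}
  \;+\; \underbrace{\bigl(f(x) - \mathbb{E}[\hat f(x)]\bigr)^2}_{\text{bias}^2}
  \;+\; \underbrace{\mathrm{Var}\bigl(\hat f(x)\bigr)}_{\text{variance}}.
```

Chapter 7 derives every term of this decomposition from scratch.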