Machine Learning · Mathematical Foundations

The Mathematics
Behind Every ML Algorithm

Most ML courses teach you to call model.fit(). This one teaches you what's actually happening inside it — with full mathematical rigour, genuine intuition, and examples that matter to Indian traders and practitioners.

13
Chapters
10 Live
3 More Coming
40+
MCQ Problems
100+
Terminal Questions
0
Forever Free
Prerequisites
✓ +2 Mathematics
✓ Basic Probability
✓ Calculus (differentiation)
Linear Algebra (helpful, not required)
No ML background needed
Course Chapters
Each chapter builds on the last. Start from Chapter 1.
01
What Is Learning? ● Live
The function approximation view, the probabilistic worldview, random variables and $P_{XY}$ as the god-object of ML, the i.i.d. assumption fully earned, and the parametric estimation framework.
$P_{XY}$ i.i.d. Parametric Estimation Distribution Shift
4 MCQ · 10 Terminal
02
The Learning Machine ● Live
Hypothesis classes, loss functions, true risk, empirical risk minimisation, the generalisation gap, the Bayes classifier and its proof, and the conditional expectation as the regression oracle.
Hypothesis Class Loss Functions ERM Bayes Classifier Generalisation Gap
4 MCQ · 10 Terminal
03
Empirical Risk and the Likelihood Game ● Live
Surprisal, entropy, cross-entropy, KL divergence from scratch. The central equivalence: minimising KL divergence = maximising likelihood = ERM with NLL loss. Squared error as MLE under a Gaussian assumption.
KL Divergence Entropy MLE Cross-Entropy Model Misspecification
4 MCQ · 10 Terminal
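The central equivalence can be stated in one line. A sketch in notation of my own choosing ($\hat p$ for the empirical distribution of the sample, $p_\theta$ for the model), not a substitute for the chapter's derivation:

```latex
\arg\min_\theta \,\mathrm{KL}\bigl(\hat p \,\|\, p_\theta\bigr)
  \;=\; \arg\max_\theta \,\frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x_i)
  \;=\; \arg\min_\theta \,\frac{1}{n}\sum_{i=1}^{n} \bigl(-\log p_\theta(x_i)\bigr),
\qquad
-\log \mathcal{N}\bigl(y;\, f_\theta(x),\, \sigma^2\bigr)
  \;=\; \frac{\bigl(y - f_\theta(x)\bigr)^2}{2\sigma^2} + \tfrac{1}{2}\log\bigl(2\pi\sigma^2\bigr).
```

The first chain is the KL = MLE = ERM-with-NLL equivalence; the second line is why squared error is MLE under a Gaussian noise assumption.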
04
Maximum Likelihood — Let the Data Speak ● Live
Density estimation for the Bayes classifier, MLE for Gaussian and discrete models with full derivations, the multimodal problem, Gaussian Mixture Models as universal approximators, and why mixture MLE has no closed form.
MLE Density Estimation GMM Lagrangian Multimodal Data
4 MCQ · 10 Terminal
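As a taste of what "full derivations" means here: the Gaussian MLE has a closed form (sample mean and sample variance), which a few lines of NumPy can verify numerically. A minimal sketch of my own, not course code; the data and function name are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=5000)

def gaussian_nll(mu, sigma2, x):
    """Average negative log-likelihood of data x under N(mu, sigma2)."""
    return 0.5 * np.log(2 * np.pi * sigma2) + np.mean((x - mu) ** 2) / (2 * sigma2)

# Closed-form MLE: sample mean and (biased) sample variance.
mu_hat = x.mean()
sigma2_hat = x.var()

# The NLL at the MLE is no larger than at nearby perturbed parameters.
for d in (-0.1, 0.1):
    assert gaussian_nll(mu_hat, sigma2_hat, x) <= gaussian_nll(mu_hat + d, sigma2_hat, x)
    assert gaussian_nll(mu_hat, sigma2_hat, x) <= gaussian_nll(mu_hat, sigma2_hat + d, x)
```

The chapter derives why these closed forms fall out of setting the gradient of the log-likelihood to zero.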
05
Hidden Variables, Prior Beliefs, and the EM Algorithm ● Live
Latent variable models, the ELBO derived via Jensen's inequality, KL gap and tightness condition, the EM algorithm with convergence proof, responsibilities for GMMs, Bayesian estimation, MAP with conjugate priors, and the Beta-Bernoulli example.
EM Algorithm ELBO Latent Variables MAP Estimation Conjugate Priors
4 MCQ · 10 Terminal
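The "responsibilities" step is concrete enough to sketch. A hypothetical E-step for a two-component 1-D GMM (my own illustration, not course code; the parameter values are made up):

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def e_step(x, pis, mus, sigma2s):
    """E-step: responsibility r[i, k] = P(component k | x_i) under current params."""
    # Unnormalised posterior: prior weight times component likelihood.
    r = np.stack(
        [pi * normal_pdf(x, mu, s2) for pi, mu, s2 in zip(pis, mus, sigma2s)],
        axis=1,
    )
    return r / r.sum(axis=1, keepdims=True)

x = np.array([-2.0, -1.9, 2.0, 2.1])
r = e_step(x, pis=[0.5, 0.5], mus=[-2.0, 2.0], sigma2s=[1.0, 1.0])
assert np.allclose(r.sum(axis=1), 1.0)    # each row is a posterior over components
assert r[0, 0] > 0.99 and r[2, 1] > 0.99  # points near a mean are assigned to it
```

The M-step then re-estimates each component's parameters from the responsibility-weighted data, which is where the convergence proof in the chapter comes in.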
06
Non-parametric Density Estimation ● Live
The $k/nV$ counting argument derived from scratch, Parzen windows with hypercube and Gaussian kernels, KNN density estimation with adaptive neighbourhoods, and the KNN classifier derived as the Bayes classifier with non-parametric density estimates.
Parzen Windows KNN Kernel Density Adaptive Neighbourhood Bayes Classifier
4 MCQ · 10 Terminal
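The $k/nV$ idea fits in a few lines. A 1-D sketch under assumptions of my own (uniform toy data, invented function name), not course code:

```python
import numpy as np

def knn_density(x0, data, k):
    """kNN density estimate p(x0) ≈ k / (n · V), where V is the volume
    (in 1-D, the length) of the smallest ball around x0 containing k points."""
    dists = np.sort(np.abs(data - x0))
    V = 2 * dists[k - 1]  # 1-D "ball": radius = distance to the k-th neighbour
    return k / (len(data) * V)

rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, size=4000)  # true density is 1 on [0, 1]
est = knn_density(0.5, data, k=50)
assert 0.6 < est < 1.5  # close to the true value 1.0
```

Note the adaptive behaviour the chapter emphasises: $V$ shrinks where data is dense and grows where it is sparse, with $k$ fixed.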
07
Linear Models, Generalisation, and the Bias-Variance Tradeoff ● Live
ERM ↔ MLE equivalence, linear regression closed form $(X^\top X)^{-1}X^\top y$, generalised linear models with basis functions, logistic regression and softmax, train-test generalisation, and the full bias-variance-noise decomposition derived from scratch.
Linear Regression Logistic Regression Softmax Bias-Variance Generalisation
4 MCQ · 10 Terminal
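The closed form $(X^\top X)^{-1}X^\top y$ can be checked directly against a library solver. A minimal sketch with synthetic data of my choosing, not course code:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one feature
w_true = np.array([0.5, 2.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Normal-equations solution: w = (XᵀX)⁻¹ Xᵀ y.
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Agrees with the numerically preferred least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_hat, w_lstsq)
assert np.allclose(w_hat, w_true, atol=0.1)
```

The explicit inverse is written out only to mirror the formula; in practice you would use `lstsq` or a QR factorisation, since forming $(X^\top X)^{-1}$ is numerically fragile.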
08
Regularisation, SVMs, and the Kernel Trick ● Live
Regularised ERM with L1/L2 penalties, Ridge and Lasso as MAP estimation with Gaussian/Laplace priors, max-margin classifiers, KKT conditions and support vectors, soft-margin SVM, and the kernel trick with Mercer's theorem.
Ridge · Lasso SVM KKT Conditions Kernel Trick
4 MCQ · 10 Terminal
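The kernel trick has a two-line demonstration: a polynomial kernel evaluates an inner product in a feature space you never construct. A toy example of my own, not course material:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the kernel k(x, z) = (x·z)² in 2-D:
    φ(x) = (x1², √2·x1·x2, x2²)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k_poly(x, z):
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
# The kernel computes the feature-space inner product without visiting that space.
assert np.isclose(k_poly(x, z), phi(x) @ phi(z))
```

For the RBF kernel the implicit feature space is infinite-dimensional, which is exactly why Mercer's theorem, covered in the chapter, matters.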
09
Convolutional Neural Networks ● Live
Local receptive fields, parameter sharing, convolution output dimensions, hyperparameters (filters, stride, padding), pooling, and end-to-end classification architectures including LeNet and AlexNet.
Convolution Pooling Feature Maps LeNet · AlexNet
4 MCQ · 10 Terminal
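The convolution output-dimension rule is the standard formula $\lfloor (n + 2p - f)/s \rfloor + 1$. A sketch with an invented helper name, using the classic LeNet and AlexNet first-layer sizes as sanity checks:

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Spatial output size of a convolution: floor((n + 2p − f) / s) + 1."""
    return (n + 2 * padding - f) // stride + 1

# LeNet-style first layer: 32×32 input, 5×5 filters, no padding, stride 1 → 28×28.
assert conv_output_size(32, 5) == 28
# AlexNet-style first layer: 227×227 input, 11×11 filters, stride 4 → 55×55.
assert conv_output_size(227, 11, padding=0, stride=4) == 55
```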
10
Recurrent Neural Networks ● Live
Sequential data, parameter sharing across time, BPTT and vanishing gradients — the mathematical cause and the additive update solution, residual connections, and the road to LSTMs.
RNN BPTT Vanishing Gradients Additive Updates
4 MCQ · 10 Terminal
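The "mathematical cause" of vanishing gradients is a repeated matrix product: BPTT multiplies the gradient by the recurrent Jacobian once per time step, so any factor with spectral radius below 1 shrinks it geometrically. A toy sketch of my own, with an invented weight matrix:

```python
import numpy as np

T = 50
W = 0.5 * np.eye(4)  # recurrent weight matrix with spectral radius 0.5
grad = np.ones(4)

norms = []
for _ in range(T):
    grad = W.T @ grad  # BPTT: one Jacobian multiplication per step back in time
    norms.append(np.linalg.norm(grad))

# The gradient norm decays geometrically: after 50 steps, roughly 0.5⁵⁰ ≈ 1e-15.
assert norms[-1] < 1e-10
assert all(b < a for a, b in zip(norms, norms[1:]))
```

Additive updates (as in LSTMs and residual connections) change this Jacobian to be close to the identity, which is the fix the chapter builds toward.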
11
Attention and Transformers ○ Coming Soon
The attention mechanism, self-attention vs recurrence, encoder-decoder architecture, and how transformers replaced RNNs as the dominant sequence model.
Attention Self-Attention Transformers
4 MCQ · 10 Terminal 🔒
12
Ensembles and Decision Trees ○ Coming Soon
Decision trees, bagging, random forests, boosting — AdaBoost and XGBoost — and why ensemble methods are the dominant approach in tabular data ML.
Random Forests Boosting XGBoost
4 MCQ · 10 Terminal 🔒
13
Unsupervised Learning and Generative Models ○ Coming Soon
k-Means, Gaussian mixtures, PCA, dimensionality reduction, and a preview of generative models — GANs, VAEs, and diffusion models.
k-Means PCA GANs · VAEs Diffusion
4 MCQ · 10 Terminal 🔒
A note on pace. This module assumes no prior ML knowledge — only solid +2 mathematics and basic probability. Each chapter is self-contained but deliberately sequential: Chapter 3 (ERM) builds directly on Chapter 2 (loss functions), and Chapter 4 (MLE) is the payoff for both. If you've seen some of this material before, use the terminal questions at the end of each chapter to find your gaps. New chapters are added regularly — check back for updates.
What you will actually understand by the end
Why every ML problem is distribution estimation — and what $P_{XY}$ has to do with it
What i.i.d. actually means, why it's almost never exactly true in finance, and when to worry
Why MLE, ridge regression, and logistic regression are all the same idea in different clothes
The bias-variance tradeoff — mathematically, not just as a cartoon
What the EM algorithm is doing geometrically, and why it always increases the likelihood
Why SVMs find the maximum margin, and what the kernel trick actually does to the input space
Backpropagation as repeated chain rule — not magic, just calculus applied carefully
Why model misspecification, distribution shift, and overfitting are the three ways every ML model fails in practice
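To give one of those claims its precise form (standard notation, mine rather than necessarily the course's): for $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, averaging over random training sets, the bias-variance tradeoff reads

```latex
\mathbb{E}\!\left[\bigl(y - \hat f(x)\bigr)^2\right]
  \;=\; \underbrace{\sigma^2}_{\text{noise}}
  \;+\; \underbrace{\bigl(f(x) - \mathbb{E}[\hat f(x)]\bigr)^2}_{\text{bias}^2}
  \;+\; \underbrace{\mathrm{Var}\bigl(\hat f(x)\bigr)}_{\text{variance}}.
```

Chapter 7 derives every term of this decomposition from scratch.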