Intermediate · Chapter 09 of 13 · Mathematical Foundations of ML

Convolutional Neural Networks

Chapter 8 showed that the right feature map can make any problem linearly separable, but the kernel had to be hand-designed. CNNs remove that constraint entirely: the feature extractor is not chosen by the practitioner; it is learned from data by gradient descent. The key insight is that for grid-structured data like images, two structural constraints, local receptive fields and parameter sharing, make this learning both tractable and powerful.

The MLP's Problem with Images

A fully connected MLP with input $x \in \mathbb{R}^d$ connects every input unit to every unit in the first hidden layer. For a modest $256 \times 256$ greyscale image, $d = 65{,}536$. With 1000 hidden units in the first layer, that is $65{,}536 \times 1{,}000 \approx 65.5$ million parameters from the input layer alone. For a colour image, multiply by 3.

This is wasteful for two reasons. First, spatial locality: in natural images, nearby pixels are correlated and meaningful patterns are local (an edge, a curve, a texture patch). Connecting every pixel to every neuron ignores this structure. Second, translation invariance: a cat in the top-left corner of an image is the same cat in the bottom-right. The MLP has no way to share the knowledge of "what a cat looks like" across spatial positions; it must relearn it from scratch for every location.

CNNs address both problems with two structural constraints applied to the MLP: local receptive fields and parameter sharing.

A Convolutional Neural Network (CNN) is a regularised MLP designed for data with grid-like topology (images, time-series, spectrograms). It enforces two inductive biases:

(i) Local receptive field: each neuron in a hidden layer connects only to a small spatial region of the input, not the entire input.

(ii) Parameter sharing: the weights of each local filter are shared across all spatial positions. The same filter is applied everywhere.

Local Receptive Fields and the Convolution Operation

Fix a filter of size $k \times k$ (e.g. $3 \times 3$). Slide it across a $P \times Q$ input image. At each position, compute the dot product between the filter weights $w \in \mathbb{R}^{k \times k}$ and the local patch of the image. The result at each position is one element of the output feature map.

The convolution operation for a single filter $w$ applied to input $x$ produces a feature map $z$ where each element is: $$z_{i,j} = \sum_{m=0}^{k-1}\sum_{n=0}^{k-1} w_{m,n}\cdot x_{i+m,\; j+n} + b$$ The filter $w$ and bias $b$ are the learnable parameters. The same $w$ is used at every position $(i,j)$: this is parameter sharing.
Think of the filter as a pattern detector. A filter with a horizontal gradient pattern will fire strongly wherever the input has a horizontal edge. A filter with a circular pattern fires at circular objects. Crucially, because the same filter is applied everywhere, it detects its pattern regardless of where in the image it appears. In a deep CNN, early layers learn edge detectors, middle layers learn texture and shape detectors, and deep layers learn object-level features, all from gradient descent on the training data, with no hand-engineering.
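This behaviour can be demonstrated with a minimal NumPy sketch. The toy edge image, filter values, and function name below are illustrative choices, not from the chapter:

```python
import numpy as np

def conv2d_valid(x, w, b=0.0):
    """Stride-1, no-padding convolution (cross-correlation): z_{i,j} as defined above."""
    k = w.shape[0]
    P, Q = x.shape
    z = np.zeros((P - k + 1, Q - k + 1))
    for i in range(P - k + 1):
        for j in range(Q - k + 1):
            # Dot product of the filter with the local patch, plus bias
            z[i, j] = np.sum(w * x[i:i + k, j:j + k]) + b
    return z

# Toy input: left half dark (0), right half bright (1) -> a vertical edge
x = np.zeros((5, 5))
x[:, 3:] = 1.0

# Horizontal-gradient filter: fires where intensity increases left to right
w = np.array([[-1.0, 0.0, 1.0]] * 3)
z = conv2d_valid(x, w)   # shape (3, 3); large values mark the edge location
```

Because the same $w$ slides over every position, shifting the edge in the input simply shifts the response in the output: this is the translation behaviour that parameter sharing buys.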

Output Dimensions: The Formula

Given an input of size $P \times Q$, a filter of size $k \times k$, stride $s$, and padding $p$, the output feature map has dimensions:

$$\text{Output height} = \left\lfloor\frac{P + 2p - k}{s}\right\rfloor + 1 \qquad \text{Output width} = \left\lfloor\frac{Q + 2p - k}{s}\right\rfloor + 1$$

With no padding ($p=0$) and stride $s=1$, the simplest case, a $P \times Q$ input with a $k \times k$ filter gives output of size $(P - k + 1) \times (Q - k + 1)$. Each convolution slightly reduces spatial dimensions unless padding compensates.

Derivation · Output Dimension Formula · Stride, Padding, No Magic
Setup

Input has $P$ positions along one dimension. Filter has size $k$. Stride $s$ means the filter jumps $s$ positions at a time. Padding $p$ adds $p$ zeros on each side, making the padded input have $P + 2p$ positions.

Count valid positions

The filter of size $k$ can start at positions $0, s, 2s, \ldots$ The last valid start position is where the filter still fits: $\text{start} + k - 1 \leq P + 2p - 1$, i.e. $\text{start} \leq P + 2p - k$. The number of valid start positions (counting from 0 in steps of $s$) is:

$$\left\lfloor\frac{P + 2p - k}{s}\right\rfloor + 1$$
Common cases

"Valid" padding ($p=0$, $s=1$): output = $P - k + 1$. Each convolution shrinks the input.

"Same" padding ($p = \lfloor k/2 \rfloor$, $s=1$): output = $P$. Spatial dimensions preserved. This is the standard choice in modern networks.

The formula counts how many times a window of size $k$ fits into a padded input of size $P + 2p$, in steps of $s$. No memorisation needed: just count valid filter positions.
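The counting argument translates directly into a one-line helper; a sketch (the function name is ours):

```python
def conv_output_size(n, k, s=1, p=0):
    """Valid filter positions along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

assert conv_output_size(32, 5) == 28        # "valid": output shrinks
assert conv_output_size(32, 3, p=1) == 32   # "same": p = k // 2 preserves size
assert conv_output_size(7, 3, s=2) == 3     # stride 2 roughly halves
```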

Multiple Filters and Depth

A single filter produces a single feature map: one detection channel. In practice, a convolutional layer uses $F$ filters simultaneously, producing $F$ feature maps. Each filter learns to detect a different pattern. The output of a convolutional layer is a 3D tensor of shape: $$\left(\left\lfloor\frac{P+2p-k}{s}\right\rfloor + 1\right) \times \left(\left\lfloor\frac{Q+2p-k}{s}\right\rfloor + 1\right) \times F$$

For a colour (RGB) input with $C=3$ channels, each filter is $k \times k \times C$: it spans all input channels. The number of parameters per filter is $k \times k \times C + 1$ (including bias). With $F$ filters, the total parameter count for one layer is $F \times (k^2 C + 1)$.

📊 Parameter Count: MLP vs CNN

Input: $32 \times 32 \times 3$ (CIFAR image). First hidden layer: 100 neurons.

MLP: $32 \times 32 \times 3 \times 100 = 307{,}200$ parameters.

CNN ($3\times3$ filter, 32 filters): $3^2 \times 3 \times 32 + 32 = 896$ parameters. Roughly 343× fewer.
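Both counts in this box can be reproduced with a short sketch (the helper name is ours):

```python
def conv_layer_params(k, c_in, f):
    """f filters, each with k x k x c_in weights plus one bias."""
    return f * (k * k * c_in + 1)

mlp_weights = 32 * 32 * 3 * 100          # fully connected first layer, weights only
cnn_params = conv_layer_params(3, 3, 32)  # 3x3 filters over 3 channels, 32 filters

assert mlp_weights == 307_200
assert cnn_params == 896
assert round(mlp_weights / cnn_params) == 343
```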

🇮🇳 Chart Pattern Recognition

A CNN trained on NIFTY candlestick charts learns filters that detect head-and-shoulders, double-bottoms, and engulfing patterns, with the same filters applied at every time position. Parameter sharing means the model can recognise a pattern regardless of when it appeared in the chart window.

Hyperparameters of a Convolutional Layer

| Hyperparameter | Meaning | Typical values |
| --- | --- | --- |
| Filter size $k$ | Spatial extent of each filter | $3\times3$ (standard), $1\times1$ (channel mixing), $5\times5$ (larger context) |
| Number of filters $F$ | How many feature maps to produce | 32, 64, 128, 256, increasing with depth |
| Stride $s$ | Step size when sliding the filter | 1 (full resolution), 2 (halves spatial dims) |
| Padding $p$ | Zeros added to border before convolution | 0 (valid), $\lfloor k/2 \rfloor$ (same) |
| Depth (layers) | How many convolutional layers to stack | 4–100+ in modern networks |
These hyperparameters are not learned; they are chosen by the architect. The filter weights are learned; the filter size is not. Choosing $k=3$ vs $k=5$ vs $k=7$ affects the receptive field (how much of the input each neuron can "see"), the parameter count, and the computational cost. Modern practice: almost exclusively use $3\times3$ filters, stack more layers for larger receptive fields, and use "same" padding to preserve spatial dimensions until you want to downsample.

Pooling: Reducing Spatial Dimensions

After several convolutional layers, the feature maps are downsampled using a pooling operation. Pooling reduces spatial dimensions without learnable parameters: it is a fixed aggregation over a local neighbourhood.

๐Ÿ“ Max Pooling

Take the maximum value in each $k \times k$ window. Preserves the strongest activation, answering "was this feature present anywhere in this region?" Discards exact location, keeps existence information.

Most common. Makes the model more invariant to small translations of the input.

๐Ÿ“ Average Pooling

Take the mean value in each $k \times k$ window. Retains information about the average activation strength. More commonly used in the final global pooling step before the classifier.

Global Average Pooling (GAP) collapses each entire feature map to a single number, replacing fully connected layers at the end of modern networks.

Pooling does two things: it reduces the spatial resolution (fewer positions means fewer parameters in subsequent layers) and it provides a form of translation invariance (a feature shifted by one pixel in the input often produces the same max-pool output). After $n$ rounds of $2\times2$ max-pooling, the network can tolerate shifts of up to roughly $2^n$ pixels without changing its output, which is crucial for robust recognition.
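A minimal max-pooling sketch (the function name and toy input are ours), showing the fixed, parameter-free aggregation:

```python
import numpy as np

def max_pool2d(x, k=2, s=2):
    """k x k max pooling with stride s; no learnable parameters."""
    out_h = (x.shape[0] - k) // s + 1
    out_w = (x.shape[1] - k) // s + 1
    z = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Keep only the strongest activation in each window
            z[i, j] = x[i * s:i * s + k, j * s:j * s + k].max()
    return z

x = np.arange(16, dtype=float).reshape(4, 4)
z = max_pool2d(x)   # 4x4 -> 2x2; exact location within each window is discarded
```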

End-to-End Architecture: The Classification Pipeline

A standard CNN classifier stacks convolutional and pooling layers to progressively extract features, then flattens and uses fully connected layers for the final classification:

Input ($H \times W \times C$) → Conv + ReLU ($F$ filters) → Pool ($\div 2$) → Conv + ReLU ($2F$ filters) → Pool ($\div 2$) → Flatten (1D vector) → FC + Softmax ($K$ classes)

The convolutional layers act as a learned feature extractor: they transform the raw input into a compact, semantically meaningful representation. The fully connected layers act as a classifier on that representation. This modularity is one of the most powerful ideas in deep learning: the same backbone (convolutional feature extractor) can be used with different classifiers for different tasks.

Famous Architectures

LeNet-5 (1998, LeCun et al.): the architecture that started it all. Designed for handwritten digit recognition (MNIST). Two convolutional layers with average pooling, followed by three fully connected layers. Input: $32 \times 32$ greyscale. Showed that convolutional networks could be trained end-to-end by backpropagation on real tasks.

AlexNet (2012, Krizhevsky et al.): won ImageNet (1.2 million images, 1000 classes) by a large margin, launching the modern deep learning era. Key innovations: ReLU activations (replacing sigmoid/tanh and mitigating vanishing gradients in deep networks), dropout for regularisation, data augmentation, and GPU training. Five convolutional layers, three fully connected layers, roughly 60 million parameters.

The ImageNet moment in 2012 is directly relevant to Indian fintech. The same convolutional architectures that classified ImageNet photos now power NACH mandate verification (signature matching at every bank), PAN card OCR (NSDL processes millions of documents), and cheque truncation (the CTS system at RBI). When a UPI payment is verified by QR code or facial scan, a CNN is typically doing the reading or matching. The mathematical structure is identical to what we derived above: local filters, parameter sharing, pooling, softmax.

Beyond Classification: Semantic Segmentation

Classification assigns a single label to the entire image. Semantic segmentation assigns a label to every pixel: "what class does this pixel belong to?" This requires the network's output to have the same spatial resolution as the input.

Fully Convolutional Networks (FCN) replace the final fully connected layers with convolutional layers, allowing any input size and producing spatial output maps. But repeated pooling reduces resolution, so the output is coarser than the input. The fix is upsampling (transposed convolutions or bilinear interpolation) to restore resolution.

U-Net (2015) is the dominant architecture for biomedical and satellite image segmentation. It has an encoder-decoder structure:

📉 Encoder (Contracting Path)

Standard CNN: Conv → Pool → Conv → Pool. Spatial resolution decreases, number of feature maps increases. The encoder extracts semantics at the cost of spatial detail.

📈 Decoder (Expanding Path)

Upsample → Conv → Upsample → Conv. Spatial resolution is restored. Skip connections from encoder layers are concatenated at each decoder stage, restoring fine spatial detail lost during pooling.

U-Net's skip connections are the key idea. The encoder "compresses" the image into a semantic representation (what is present) but loses spatial precision (where exactly). The decoder reconstructs spatial detail using the skip connections: direct copies of the encoder's feature maps at each resolution. This allows the network to simultaneously know what something is (from the deep encoder) and where it is (from the shallow skip connections). The same idea appears in ResNets as residual connections: gradient highways through the network.
Worked Example 1 · Computing Output Dimensions Through a CNN · Full Forward Pass

Trace a $32 \times 32 \times 1$ greyscale image through a small CNN with the following layers. Compute the output shape at each stage.

Layer 1 โ€” Conv: 8 filters, $5 \times 5$, stride 1, no padding
$$\text{Output} = \left(\left\lfloor\frac{32-5}{1}\right\rfloor+1\right) \times \left(\left\lfloor\frac{32-5}{1}\right\rfloor+1\right) \times 8 = 28 \times 28 \times 8$$

Parameters: $5 \times 5 \times 1 \times 8 + 8 = 208$ (25 weights per filter × 8 filters + 8 biases).

Layer 2 โ€” Max Pool: $2 \times 2$, stride 2
$$\text{Output} = 14 \times 14 \times 8$$

No learnable parameters. Spatial dimensions halved.

Layer 3 โ€” Conv: 16 filters, $3 \times 3$, stride 1, no padding
$$\text{Output} = \left(\left\lfloor\frac{14-3}{1}\right\rfloor+1\right) \times \left(\left\lfloor\frac{14-3}{1}\right\rfloor+1\right) \times 16 = 12 \times 12 \times 16$$

Parameters: $3 \times 3 \times 8 \times 16 + 16 = 1{,}168$.

Layer 4 โ€” Max Pool: $2 \times 2$, stride 2
$$\text{Output} = 6 \times 6 \times 16 = 576 \text{ values}$$
Layer 5 โ€” Flatten + FC: 10 output classes

Flatten: $6 \times 6 \times 16 = 576$ inputs. FC: $576 \times 10 + 10 = 5{,}770$ parameters.

Total parameters: $208 + 0 + 1{,}168 + 0 + 5{,}770 = 7{,}146$. An equivalent MLP would need $32 \times 32 \times 100 + 100 \times 10 = 103{,}400$ parameters just for two layers: roughly 14× more, with no spatial inductive bias. The CNN uses far fewer parameters precisely because the same filter weights are shared across all spatial positions.
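The whole trace can be checked mechanically with a few lines; a sketch reusing the output-size formula from earlier in the chapter (helper names are ours):

```python
def conv_out(n, k, s=1, p=0):
    """floor((n + 2p - k) / s) + 1 valid filter positions along one dimension."""
    return (n + 2 * p - k) // s + 1

h = conv_out(32, 5)                 # Layer 1: 8 filters, 5x5, valid -> 28
p1 = 8 * (5 * 5 * 1 + 1)            # 208 parameters
h //= 2                             # Layer 2: 2x2 max pool, stride 2 -> 14
h = conv_out(h, 3)                  # Layer 3: 16 filters, 3x3, valid -> 12
p3 = 16 * (3 * 3 * 8 + 1)           # 1,168 parameters
h //= 2                             # Layer 4: 2x2 max pool, stride 2 -> 6
flat = h * h * 16                   # Layer 5: flatten -> 576 values
fc = flat * 10 + 10                 # FC to 10 classes -> 5,770 parameters

assert (p1, p3, flat, fc) == (208, 1168, 576, 5770)
assert p1 + p3 + fc == 7146
```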
Worked Example 2 · What a Convolution Filter Detects · 🇮🇳 Candlestick Pattern Recognition

Consider a 1D convolutional filter applied to a time series of NIFTY closes $x = [x_1, x_2, \ldots, x_T]$ (treated as a 1D "image"). Filter size $k = 3$, single filter $w = [-1, 0, +1]$.

Apply the filter

At each position $t$, the filter output is:

$$z_t = w_1 x_t + w_2 x_{t+1} + w_3 x_{t+2} = -x_t + x_{t+2}$$

This is the two-step price change $x_{t+2} - x_t$ (an absolute return over two days). The filter detects local upward momentum.

What different filters detect
$$w = [+1, -2, +1] \implies z_t = x_t - 2x_{t+1} + x_{t+2} \;\text{(second difference: detects curvature, reversals)}$$ $$w = [+1, +1, +1] \implies z_t = x_t + x_{t+1} + x_{t+2} \;\text{(local sum, a scaled 3-day moving average: detects locally elevated levels)}$$
In a trained CNN on financial time series, the filters are not hand-designed. Gradient descent finds the filter weights that minimise the loss, and they naturally converge to momentum detectors, reversal detectors, volatility detectors, and more complex patterns depending on the task. This is the power of end-to-end learning: the feature engineering is done by optimisation, not by the practitioner.
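The hand-designed filters above can be applied to a toy price series with a short sketch (the prices and function name are hypothetical, ours):

```python
import numpy as np

def conv1d_valid(x, w):
    """Stride-1, no-padding 1D convolution (cross-correlation)."""
    k = len(w)
    return np.array([np.dot(w, x[t:t + k]) for t in range(len(x) - k + 1)])

x = np.array([100.0, 101.0, 103.0, 102.0, 105.0, 104.0])  # hypothetical closes

momentum = conv1d_valid(x, np.array([-1.0, 0.0, 1.0]))   # x_{t+2} - x_t
curvature = conv1d_valid(x, np.array([1.0, -2.0, 1.0]))  # second difference

assert momentum.tolist() == [3.0, 1.0, 2.0, 2.0]
assert curvature.tolist() == [1.0, -3.0, 4.0, -4.0]
```

Note that curvature flips sign at every local kink in the series, which is exactly the reversal-detecting behaviour described above.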
Chapter 10 Preview

CNNs exploit spatial structure in grid data. But many problems, such as language, clinical records, and financial time series, are sequential: the order matters, and inputs can have variable length. Chapter 10 introduces Recurrent Neural Networks, where the network maintains a hidden state that persists across time steps. Then we hit the fundamental obstacle: training long RNNs by backpropagation causes gradients to vanish exponentially, and we'll prove exactly why.


Practice Problems
4 questions · Chapter 09
ml / mathematical-foundations / ch09 / q01 ★ Computational

A convolutional layer receives an input of size $64 \times 64 \times 3$. It applies 32 filters of size $5 \times 5$ with stride 1 and same padding. What is the output shape, and how many learnable parameters does this layer have?

A. Output: $60 \times 60 \times 32$, Parameters: $2{,}400$
B. Output: $64 \times 64 \times 32$, Parameters: $2{,}432$
C. Output: $64 \times 64 \times 32$, Parameters: $75$
D. Output: $60 \times 60 \times 32$, Parameters: $2{,}432$
Answer: B.

Output shape: Same padding preserves spatial dimensions. With $p = \lfloor 5/2 \rfloor = 2$, $s=1$: output height $= \lfloor(64 + 4 - 5)/1\rfloor + 1 = 64$. Same for width. With 32 filters: output = $64 \times 64 \times 32$.

Parameters: Each filter has shape $5 \times 5 \times 3$ (height × width × input channels) + 1 bias = 76 parameters. With 32 filters: $76 \times 32 = 2{,}432$.

Option A uses valid padding (no padding, output shrinks to $60 \times 60$). Option C only counts one filter's weights without biases or multiplying by 32.
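Both parts of the answer can be verified numerically; a sketch (helper names are ours):

```python
def conv_out(n, k, s=1, p=0):
    """floor((n + 2p - k) / s) + 1 valid filter positions along one dimension."""
    return (n + 2 * p - k) // s + 1

p = 5 // 2                          # "same" padding for a 5x5 filter
spatial = conv_out(64, 5, p=p)      # spatial size preserved at 64
params = 32 * (5 * 5 * 3 + 1)       # 32 filters over 3 input channels, with biases

assert spatial == 64
assert params == 2432
```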
ml / mathematical-foundations / ch09 / q02 ★ Conceptual

A colleague says "CNN parameter sharing means every pixel uses the same weights โ€” so a CNN can't learn different features at different locations." What is wrong with this reasoning?

A. They are right: CNNs cannot distinguish features by location, which is why they fail on non-grid data.
B. Parameter sharing means the same filter is applied at every location, detecting the same feature type everywhere. But with multiple filters per layer and multiple layers, different feature maps capture different aspects. Later layers combine local features into complex location-specific patterns.
C. Parameter sharing only applies to the first layer; deeper layers have independent weights per position.
D. The colleague is wrong because pooling breaks translation invariance, compensating for parameter sharing.
Answer: B.

Parameter sharing means filter $w$ is the same at every spatial position: it detects the same local pattern (e.g. a horizontal edge) wherever it appears. This is a feature, not a bug: it is exactly what enables translation invariance and reduces parameter count. The network doesn't need separate edge detectors for every pixel location.

Different spatial information is captured because: (1) multiple filters in each layer detect different patterns, (2) pooling summarises spatial regions, and (3) deeper layers combine activations from many spatial positions, effectively integrating information over larger and larger regions of the input (the receptive field grows with depth). A unit in a deep layer has a very large effective receptive field and encodes complex, location-dependent context โ€” even though each individual filter was shared.
ml / mathematical-foundations / ch09 / q03 ★★ Conceptual

In U-Net's encoder-decoder architecture, why are skip connections from encoder to decoder necessary? What problem do they solve that a plain encoder-decoder without skip connections cannot?

A. Skip connections speed up training by providing shorter gradient paths, but have no effect on output quality.
B. Skip connections prevent overfitting by adding noise to the decoder's input at each stage.
C. Repeated pooling in the encoder destroys fine spatial detail. Skip connections pass high-resolution feature maps directly to the decoder, allowing it to recover precise pixel-level boundaries that cannot be reconstructed from the compressed bottleneck alone.
D. Skip connections allow the network to use both RGB and greyscale inputs simultaneously.
Answer: C.

The encoder progressively applies convolution and pooling, building deep semantic representations while halving spatial resolution at each pooling step. By the bottleneck, a $512\times512$ input might be compressed to $16\times16$, a 1024× reduction in spatial positions. The semantic content is preserved but the precise pixel locations are not.

The decoder upsamples back to full resolution, but upsampling alone cannot recover the fine spatial details lost during pooling; it only interpolates from the coarse bottleneck. Skip connections provide the decoder with the encoder's feature maps at each resolution scale, which still carry the precise spatial information before pooling. The decoder can then use both "what is here" (from the deep path) and "exactly where is it" (from the skip connection) to produce accurate pixel-wise segmentation masks. Without skip connections, boundaries are blurry and spatially imprecise.
ml / mathematical-foundations / ch09 / q04 ★★ Conceptual

A CNN trained on NIFTY candlestick chart images achieves 72% accuracy predicting next-day direction. A quant argues this is just "pattern matching with no theoretical basis." How does the framework of this course, specifically Ch8 and Ch9, respond to that critique?

A. The quant is right: CNNs are purely empirical and have no connection to statistical learning theory.
B. The CNN is performing ERM with squared error loss, which has a clear theoretical basis.
C. The CNN is performing ERM with cross-entropy loss over a hypothesis class of convolutional networks. The feature extractor (the convolutional layers) is a learned kernel map, exactly what Ch8's kernel trick did manually, now learned end-to-end. The theoretical basis is identical; only the hypothesis class changed.
D. CNNs require a different theoretical framework entirely; ERM does not apply to neural networks.
Answer: C.

A CNN classifier is ERM: minimise $\frac{1}{N}\sum_i L(h_\theta(x_i), y_i)$ where $h_\theta$ is a composition of convolutional layers followed by a softmax classifier, and $L$ is cross-entropy loss. The hypothesis class $\mathcal{H}$ is the set of all functions representable by networks with the given architecture. This is exactly the ERM framework from Ch2.

The connection to Ch8: the SVM with an RBF kernel implicitly maps inputs to a high-dimensional feature space, then finds the maximum-margin hyperplane in that space. The kernel (feature map) was fixed and hand-chosen. The CNN's convolutional layers perform a learned, nonlinear feature map $\phi_\theta(x)$ that is optimised jointly with the classifier. The theoretical structure is identical ($\phi$ is the feature map, the final FC layer is the linear classifier) but the feature map is no longer fixed: it is the learnable part.

Terminal Questions · Chapter 09 · 10 problems · No answers given

Q1–3 are direct dimension calculations. Q4–7 require connecting CNN structure to the learning framework. Q8–10 push into architecture design and limitations.

1. An input of size $128 \times 128 \times 1$ passes through: Conv(16 filters, $3\times3$, stride 1, same padding) → MaxPool($2\times2$, stride 2) → Conv(32 filters, $3\times3$, stride 1, same padding) → MaxPool($2\times2$, stride 2). Compute the output shape after each layer and the total parameter count (excluding biases). (Easy)
2. A $7 \times 7$ input is convolved with a $3 \times 3$ filter, stride 2, no padding. (a) Compute the output size using the formula. (b) Verify by manually counting how many positions the filter can be placed. (c) What padding would be needed to keep the output size equal to $\lceil 7/2 \rceil = 4$? (Easy)
3. Apply the 1D filter $w = [1, -1]$ to the sequence $x = [3, 5, 2, 8, 1, 4]$ with stride 1, no padding. Write out all output values and interpret what this filter computes (hint: finite difference). What financial quantity does this correspond to if $x$ represents daily NIFTY closes? (Easy)
4. Show that a $1 \times 1$ convolution with $F$ filters applied to an input of depth $C$ is equivalent to applying a fully connected layer to each spatial position independently. What is the parameter count? Why are $1 \times 1$ convolutions useful in deep networks like Inception? (Medium)
5. The receptive field of a unit in layer $L$ is the region of the input that influences its value. For a network with $L$ convolutional layers each using $k \times k$ filters with stride 1 and no pooling, show that the receptive field size is $k + (L-1)(k-1) = L(k-1) + 1$. Why does this grow linearly with depth, and what does it imply for how deep networks "see" large-scale patterns? (Medium)
6. CNNs exploit two properties of natural images: locality (nearby pixels are correlated) and translation invariance (a pattern is the same regardless of location). Identify a financial dataset where these properties hold and one where they do not. For the one where they do not hold, explain why a plain MLP or RNN might be more appropriate than a CNN. (Medium)
7. In terms of the bias-variance tradeoff from Ch7, explain the effect of: (a) increasing the number of filters $F$, (b) increasing filter size $k$, (c) increasing depth (number of layers), (d) adding dropout after convolutional layers. For each, state whether it primarily affects bias or variance and why. (Medium)
8. Global Average Pooling (GAP) collapses each $h \times w$ feature map to a single scalar (the spatial average). Suppose the final convolutional layer produces 256 feature maps of size $4 \times 4$, and there are 10 output classes. Compare the parameter counts for: (a) flatten + FC classification, (b) GAP + FC classification. What is the regularisation effect of GAP, and why does it reduce overfitting? (Hard)
9. Backpropagation through a convolutional layer requires computing gradients w.r.t. the filter weights. Show that the gradient of the loss $\mathcal{L}$ with respect to filter weight $w_{m,n}$ is: $\frac{\partial \mathcal{L}}{\partial w_{m,n}} = \sum_{i,j} \frac{\partial \mathcal{L}}{\partial z_{i,j}} \cdot x_{i+m,\; j+n}$. Interpret this: what operation is being performed, and why is it another convolution (cross-correlation)? (Hard)
10. Design a CNN architecture for classifying NIFTY options expiry outcomes (Large Up / Small Up / Flat / Small Down / Large Down) from a $20$-day window of 5 daily features (open, high, low, close, volume). Specify: input shape, all layer configurations with output dimensions, parameter counts, and justify each design choice using the principles from this chapter. How would you modify the architecture for the regression task of predicting the exact expiry return? (Hard)