Convolutional Neural Networks
Chapter 8 showed that the right feature map can make any problem linearly separable, but the kernel had to be hand-designed. CNNs remove that constraint entirely. The feature extractor is not chosen by the practitioner; it is learned from data by gradient descent. The key insight: for grid-structured data like images, two structural constraints (local receptive fields and parameter sharing) make this learning both tractable and powerful.
The MLP's Problem with Images
A fully connected MLP with input $x \in \mathbb{R}^d$ connects every input unit to every unit in the first hidden layer. For a modest $256 \times 256$ greyscale image, $d = 65{,}536$. With 1000 hidden units in the first layer, that is $65{,}536 \times 1{,}000 = 65.5$ million parameters from the first layer's weights alone. For a colour image, multiply by 3.
This is wasteful for two reasons. First, spatial locality: in natural images, nearby pixels are correlated and meaningful patterns are local: an edge, a curve, a texture patch. Connecting every pixel to every neuron ignores this structure. Second, translation invariance: a cat in the top-left corner of an image is the same cat in the bottom-right. The MLP has no way to share the knowledge about "what a cat looks like" across spatial positions; it must relearn it from scratch for every location.
CNNs address both problems with two structural constraints applied to the MLP: local receptive fields and parameter sharing.
(i) Local receptive field: each neuron in a hidden layer connects only to a small spatial region of the input, not the entire input.
(ii) Parameter sharing: the weights of each local filter are shared across all spatial positions. The same filter is applied everywhere.
Local Receptive Fields and the Convolution Operation
Fix a filter of size $k \times k$ (e.g. $3 \times 3$). Slide it across a $P \times Q$ input image. At each position, compute the dot product between the filter weights $w \in \mathbb{R}^{k \times k}$ and the local patch of the image. The result at each position is one element of the output feature map.
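As a concrete sketch of this sliding dot product, here is a minimal valid (no padding, stride 1) 2D convolution in NumPy; the image, the edge filter, and all names are illustrative, not from a particular library:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid convolution: slide a k x k kernel over the image and take
    the dot product with each local patch (no padding, stride 1)."""
    P, Q = image.shape
    k = kernel.shape[0]
    out = np.empty((P - k + 1, Q - k + 1))
    for i in range(P - k + 1):
        for j in range(Q - k + 1):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

# Toy example: a vertical-edge filter on a tiny image whose right half is bright.
img = np.zeros((5, 5))
img[:, 3:] = 1.0
edge = np.array([[-1.0, 0.0, 1.0]] * 3)   # responds to left-to-right increase
out = conv2d_valid(img, edge)
print(out.shape)  # (3, 3): each axis shrinks from 5 to 5 - 3 + 1
```

The output map peaks exactly where the dark-to-bright transition sits under the filter, which is the sense in which each filter is a local pattern detector.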
Output Dimensions: The Formula
Given an input of size $P \times Q$, a filter of size $k \times k$, stride $s$, and padding $p$, the output feature map has dimensions:
$$\left(\left\lfloor\frac{P+2p-k}{s}\right\rfloor + 1\right) \times \left(\left\lfloor\frac{Q+2p-k}{s}\right\rfloor + 1\right)$$
With no padding ($p=0$) and stride $s=1$ (the simplest case), a $P \times Q$ input with a $k \times k$ filter gives output of size $(P - k + 1) \times (Q - k + 1)$. Each convolution slightly reduces spatial dimensions unless padding compensates.
Input has $P$ positions along one dimension. Filter has size $k$. Stride $s$ means the filter jumps $s$ positions at a time. Padding $p$ adds $p$ zeros on each side, making the padded input have $P + 2p$ positions.
The filter of size $k$ can start at positions $0, s, 2s, \ldots$ The last valid start position is where the filter still fits: $\text{start} + k - 1 \leq P + 2p - 1$, i.e. $\text{start} \leq P + 2p - k$. The number of valid start positions (counting from 0 in steps of $s$) is:
$$\left\lfloor\frac{P+2p-k}{s}\right\rfloor + 1$$
"Valid" padding ($p=0$, $s=1$): output = $P - k + 1$. Each convolution shrinks the input.
"Same" padding ($p = \lfloor k/2 \rfloor$, $s=1$): output = $P$. Spatial dimensions preserved. This is the standard choice in modern networks.
Multiple Filters and Depth
A single filter produces a single feature map: one detection channel. In practice, a convolutional layer uses $F$ filters simultaneously, producing $F$ feature maps. Each filter learns to detect a different pattern. The output of a convolutional layer is a 3D tensor of shape: $$\left(\left\lfloor\frac{P+2p-k}{s}\right\rfloor + 1\right) \times \left(\left\lfloor\frac{Q+2p-k}{s}\right\rfloor + 1\right) \times F$$
For a colour (RGB) input with $C=3$ channels, each filter is $k \times k \times C$; it spans all input channels. The number of parameters per filter is $k \times k \times C + 1$ (including bias). With $F$ filters, the total parameter count for one layer is $F \times (k^2 C + 1)$.
Input: $32 \times 32 \times 3$ (CIFAR image). First hidden layer: 100 neurons.
MLP: $32 \times 32 \times 3 \times 100 = 307{,}200$ parameters.
CNN ($3\times3$ filter, 32 filters): $3^2 \times 3 \times 32 + 32 = 896$ parameters. 343× fewer.
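The comparison above is pure arithmetic, sketched here as a check (the helper name is ours):

```python
def conv_params(k, C, F):
    """Parameters of one conv layer: F filters of k x k x C weights,
    plus one bias per filter."""
    return F * (k * k * C + 1)

mlp = 32 * 32 * 3 * 100           # every pixel connected to every hidden unit
cnn = conv_params(k=3, C=3, F=32)
print(mlp, cnn, round(mlp / cnn))  # 307200 896 343
```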
A CNN trained on NIFTY candlestick charts learns filters that detect head-and-shoulders, double-bottoms, and engulfing patterns, with the same filters applied at every time position. Parameter sharing means the model can recognise a pattern regardless of when it appeared in the chart window.
Hyperparameters of a Convolutional Layer
| Hyperparameter | Meaning | Typical values |
|---|---|---|
| Filter size $k$ | Spatial extent of each filter | $3\times3$ (standard), $1\times1$ (channel mixing), $5\times5$ (larger context) |
| Number of filters $F$ | How many feature maps to produce | 32, 64, 128, 256, increasing with depth |
| Stride $s$ | Step size when sliding the filter | 1 (full resolution), 2 (halves spatial dims) |
| Padding $p$ | Zeros added to border before convolution | 0 (valid), $\lfloor k/2 \rfloor$ (same) |
| Depth (layers) | How many convolutional layers to stack | 4–100+ in modern networks |
Pooling: Reducing Spatial Dimensions
After several convolutional layers, the feature maps are downsampled using a pooling operation. Pooling reduces spatial dimensions without learnable parameters; it is a fixed aggregation over a local neighbourhood.
Take the maximum value in each $k \times k$ window. Preserves the strongest activation: "was this feature present anywhere in this region?" Discards exact location, keeps existence information.
Most common. Makes the model more invariant to small translations of the input.
Take the mean value in each $k \times k$ window. Retains information about the average activation strength. More commonly used in the final global pooling step before the classifier.
Global Average Pooling (GAP) collapses each entire feature map to a single number, replacing fully connected layers at the end of modern networks.
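Both operations can be sketched in a few lines of NumPy (a minimal illustration assuming non-overlapping windows and spatial dims divisible by $k$; names are ours):

```python
import numpy as np

def max_pool2d(fmap, k=2):
    """Non-overlapping k x k max pooling (stride = k)."""
    P, Q = fmap.shape
    blocks = fmap.reshape(P // k, k, Q // k, k)
    return blocks.max(axis=(1, 3))   # max over each k x k window

def global_avg_pool(fmaps):
    """Collapse a P x Q x F stack of feature maps to a length-F vector."""
    return fmaps.mean(axis=(0, 1))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x))                        # [[ 5.  7.] [13. 15.]]
print(global_avg_pool(np.ones((4, 4, 3))))  # [1. 1. 1.]
```

Note that neither function has a single learnable weight, which is exactly the point made above.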
End-to-End Architecture: The Classification Pipeline
A standard CNN classifier stacks convolutional and pooling layers to progressively extract features, then flattens and uses fully connected layers for the final classification:
The convolutional layers act as a learned feature extractor: they transform the raw input into a compact, semantically meaningful representation. The fully connected layers act as a classifier on that representation. This modularity is one of the most powerful ideas in deep learning: the same backbone (convolutional feature extractor) can be used with different classifiers for different tasks.
Famous Architectures
LeNet-5 (1998, LeCun et al.): the architecture that started it all. Designed for handwritten digit recognition (MNIST). Two convolutional layers with average pooling, followed by three fully connected layers. Input: $32 \times 32$ greyscale. Showed that convolutional networks could be trained end-to-end by backpropagation on real tasks.
AlexNet (2012, Krizhevsky et al.): won ImageNet (1.2 million images, 1000 classes) by a large margin, launching the modern deep learning era. Key innovations: ReLU activations (replacing sigmoid/tanh and mitigating vanishing gradients), dropout for regularisation, data augmentation, and GPU training. Five convolutional layers, three fully connected layers, 60 million parameters.
Beyond Classification: Semantic Segmentation
Classification assigns a single label to the entire image. Semantic segmentation assigns a label to every pixel: "what class does this pixel belong to?" This requires the network's output to have the same spatial resolution as the input.
Fully Convolutional Networks (FCN) replace the final fully connected layers with convolutional layers, allowing any input size and producing spatial output maps. But repeated pooling reduces resolution, so the output is coarser than the input. The fix is upsampling (transposed convolutions or bilinear interpolation) to restore resolution.
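The simplest form of upsampling, nearest-neighbour interpolation, just repeats each value along both spatial axes. A minimal sketch (a cheap stand-in for a learned transposed convolution; the function name is ours):

```python
import numpy as np

def upsample_nearest(fmap, factor=2):
    """Nearest-neighbour upsampling: repeat each value `factor` times
    along both spatial axes, growing P x Q to (factor*P) x (factor*Q)."""
    return np.repeat(np.repeat(fmap, factor, axis=0), factor, axis=1)

x = np.array([[1.0, 2.0], [3.0, 4.0]])
print(upsample_nearest(x))
# [[1. 1. 2. 2.]
#  [1. 1. 2. 2.]
#  [3. 3. 4. 4.]
#  [3. 3. 4. 4.]]
```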
U-Net (2015): the dominant architecture for biomedical and satellite image segmentation. An encoder-decoder structure:
Standard CNN: Conv → Pool → Conv → Pool. Spatial resolution decreases, number of feature maps increases. The encoder extracts semantics at the cost of spatial detail.
Upsample → Conv → Upsample → Conv. Spatial resolution restored. Skip connections from encoder layers are concatenated at each decoder stage, restoring fine spatial detail lost during pooling.
Trace a $32 \times 32 \times 1$ greyscale image through a small CNN with the following layers. Compute the output shape at each stage.
Parameters: $5 \times 5 \times 1 \times 8 + 8 = 208$ (25 weights per filter × 8 filters + 8 biases).
No learnable parameters. Spatial dimensions halved.
Parameters: $3 \times 3 \times 8 \times 16 + 16 = 1{,}168$.
Flatten: $6 \times 6 \times 16 = 576$ inputs. FC: $576 \times 10 + 10 = 5{,}770$ parameters.
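The whole trace can be reproduced programmatically. A sketch, assuming valid padding and stride 1 for the convolutions and non-overlapping $2\times2$ pooling (all names are ours):

```python
def trace_shapes(P, layers):
    """Trace spatial size and parameter count through a layer stack.
    A layer is ('conv', k, C_in, F) with valid padding, stride 1,
    or ('pool', k) with stride k (non-overlapping)."""
    params = 0
    for layer in layers:
        if layer[0] == 'conv':
            _, k, C_in, F = layer
            P = P - k + 1                      # valid conv shrinks by k - 1
            params += F * (k * k * C_in + 1)   # weights + one bias per filter
        else:
            P = P // layer[1]                  # pooling divides by k
        print(layer, '->', (P, P))
    return P, params

P_out, conv_total = trace_shapes(32, [('conv', 5, 1, 8), ('pool', 2),
                                      ('conv', 3, 8, 16), ('pool', 2)])
fc = (P_out * P_out * 16) * 10 + 10            # flatten, then FC to 10 classes
print(P_out, conv_total, fc)  # 6 1376 5770
```

The final spatial size 6, the 208 + 1,168 = 1,376 convolutional parameters, and the 5,770 FC parameters match the hand computation above.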
Consider a 1D convolutional filter applied to a time series of NIFTY closes $x = [x_1, x_2, \ldots, x_T]$ (treated as a 1D "image"). Filter size $k = 3$, single filter $w = [-1, 0, +1]$.
At each position $t$, the filter output is:
$$y_t = w_1 x_t + w_2 x_{t+1} + w_3 x_{t+2} = -x_t + 0 \cdot x_{t+1} + x_{t+2} = x_{t+2} - x_t$$
This is the two-step return: $x_{t+2} - x_t$. The filter detects local upward momentum.
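This can be checked with NumPy's 1D cross-correlation on a few hypothetical closing prices (the numbers are illustrative, not real NIFTY data):

```python
import numpy as np

x = np.array([100.0, 101.0, 103.0, 102.0, 105.0])  # hypothetical closes
w = np.array([-1.0, 0.0, 1.0])

# np.correlate slides w over x without flipping it, so
# y[t] = -x[t] + x[t+2], the two-step return.
y = np.correlate(x, w, mode='valid')
print(y)  # [3. 1. 2.]
```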
CNNs exploit spatial structure in grid data. But many problems (language, clinical records, financial time series) are sequential: the order matters, and inputs can have variable length. Chapter 10 introduces Recurrent Neural Networks, where the network maintains a hidden state that persists across time steps. And then we hit the fundamental obstacle: training long RNNs by backpropagation causes gradients to vanish exponentially, and we'll prove exactly why.
A convolutional layer receives an input of size $64 \times 64 \times 3$. It applies 32 filters of size $5 \times 5$ with stride 1 and same padding. What is the output shape, and how many learnable parameters does this layer have?
Output shape: Same padding preserves spatial dimensions. With $p = \lfloor 5/2 \rfloor = 2$, $s=1$: output height $= \lfloor(64 + 4 - 5)/1\rfloor + 1 = 64$. Same for width. With 32 filters: output = $64 \times 64 \times 32$.
Parameters: Each filter has shape $5 \times 5 \times 3$ (height × width × input channels) + 1 bias = 76 parameters. With 32 filters: $76 \times 32 = 2{,}432$.
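A quick sketch confirming both numbers from the output-size formula and the per-filter parameter count:

```python
# Same padding, stride 1, 32 filters of 5 x 5 x 3 on a 64 x 64 x 3 input.
P, k, s, p, C, F = 64, 5, 1, 5 // 2, 3, 32
out = (P + 2 * p - k) // s + 1   # spatial output size per dimension
params = F * (k * k * C + 1)     # weights plus one bias per filter
print(out, params)  # 64 2432
```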
Option A uses valid padding (no padding, output shrinks to $60 \times 60$). Option C only counts one filter's weights without biases or multiplying by 32.
A colleague says "CNN parameter sharing means every pixel uses the same weights โ so a CNN can't learn different features at different locations." What is wrong with this reasoning?
Parameter sharing means filter $w$ is the same at every spatial position: it detects the same local pattern (e.g. a horizontal edge) wherever it appears. This is a feature, not a bug: it is exactly what enables translation invariance and reduces parameter count. The network doesn't need separate edge detectors for every pixel location.
Different spatial information is captured because: (1) multiple filters in each layer detect different patterns, (2) pooling summarises spatial regions, and (3) deeper layers combine activations from many spatial positions, effectively integrating information over larger and larger regions of the input (the receptive field grows with depth). A unit in a deep layer has a very large effective receptive field and encodes complex, location-dependent context, even though each individual filter was shared.
In U-Net's encoder-decoder architecture, why are skip connections from encoder to decoder necessary? What problem do they solve that a plain encoder-decoder without skip connections cannot?
The encoder progressively applies convolution and pooling, building deep semantic representations while halving spatial resolution at each pooling step. By the bottleneck, a $512\times512$ input might be compressed to $16\times16$, a 1024× reduction in spatial positions. The semantic content is preserved but the precise pixel locations are not.
The decoder upsamples back to full resolution, but upsampling alone cannot recover the fine spatial details lost during pooling; it only interpolates from the coarse bottleneck. Skip connections provide the decoder with the encoder's feature maps at each resolution scale, which still carry the precise spatial information before pooling. The decoder can then use both "what is here" (from the deep path) and "exactly where is it" (from the skip connection) to produce accurate pixel-wise segmentation masks. Without skip connections, boundaries are blurry and spatially imprecise.
A CNN trained on NIFTY candlestick chart images achieves 72% accuracy predicting next-day direction. A quant argues this is just "pattern matching with no theoretical basis." How does the framework of this course โ specifically Ch8 and Ch9 โ respond to that critique?
A CNN classifier is ERM: minimise $\frac{1}{N}\sum_i L(h_\theta(x_i), y_i)$ where $h_\theta$ is a composition of convolutional layers followed by a softmax classifier, and $L$ is cross-entropy loss. The hypothesis class $\mathcal{H}$ is the set of all functions representable by networks with the given architecture. This is exactly the ERM framework from Ch2.
The connection to Ch8: the SVM with an RBF kernel implicitly maps inputs to a high-dimensional feature space, then finds the maximum-margin hyperplane in that space. The kernel (feature map) was fixed and hand-chosen. The CNN's convolutional layers perform a learned, nonlinear feature map $\phi_\theta(x)$ that is optimised jointly with the classifier. The theoretical structure is identical ($\phi$ is the feature map, the final FC layer is the linear classifier) but the feature map is no longer fixed: it is the learnable part.
Q1–3 are direct dimension calculations. Q4–7 require connecting CNN structure to the learning framework. Q8–10 push into architecture design and limitations.