Recurrent Neural Networks
CNNs exploit spatial structure. But language, clinical time-series, and financial tick data have a different kind of structure: order matters, inputs have variable length, and the meaning of any element depends on what came before. Recurrent Neural Networks process sequences by maintaining a hidden state that persists across time; we then discover, with a full mathematical proof, why training them on long sequences is fundamentally broken.
Why Sequential Data Breaks the MLP
An MLP takes a fixed-size input $x \in \mathbb{R}^d$ and produces an output. This is fine for images (fixed dimensions) and tabular data (fixed features). But many real-world problems involve sequences of variable length:
A sentence has variable length. "NIFTY fell" has 2 tokens; "NIFTY fell sharply after RBI policy announcement" has 7. An MLP cannot take inputs of different sizes. And the meaning of "fell" depends on the context built by the preceding words.
A tick-by-tick sequence of NIFTY trades is variable length: some days have 2 million ticks, some have 3 million. More importantly, whether a price move at time $t$ is meaningful depends on the preceding sequence of prices, volumes, and order book states.
The CNN's solution (local filters and parameter sharing) helps with fixed-size sequences but still can't handle variable-length inputs naturally, and its local receptive field may miss long-range dependencies. A fundamentally different architecture is needed: one that reads a sequence one element at a time, maintaining a summary of everything seen so far in a hidden state.
The RNN: State Across Time
Let $X = \{x_1, x_2, \ldots, x_\tau\}$ be a sequence of inputs where each $x_t \in \mathbb{R}^d$. An RNN processes this sequence step by step, maintaining a hidden state $h_t \in \mathbb{R}^m$ that is updated at each time step:
$$h_t = \sigma(W_h h_{t-1} + W_x x_t + b_1), \qquad \hat{y}_t = W_y h_t + b_2$$
with $h_0$ typically initialised to $\mathbf{0}$.
The hidden state $h_t$ is the RNN's memory: it encodes a summary of the entire input history $\{x_1, \ldots, x_t\}$ in a fixed-size vector of dimension $m$, regardless of how long the sequence is. The output $\hat{y}_t$ at each step is computed from this summary.
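This recurrence can be sketched in a few lines of pure Python. The sigmoid activation, the helper names, and the restriction to a scalar input ($d = 1$) are illustrative assumptions; real implementations use a tensor library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One step of h_t = sigmoid(W_h h_{t-1} + W_x x_t + b) for a scalar input x_t."""
    m = len(h_prev)
    return [
        sigmoid(sum(W_h[i][j] * h_prev[j] for j in range(m)) + W_x[i] * x_t + b[i])
        for i in range(m)
    ]

def rnn_forward(xs, W_h, W_x, b, h0):
    """Fold a whole sequence into hidden states: len(hs) == len(xs), but each
    h_t is a fixed-size vector no matter how long the sequence is."""
    hs, h = [], h0
    for x_t in xs:
        h = rnn_step(h, x_t, W_h, W_x, b)
        hs.append(h)
    return hs
```

For sequence classification only the final state `hs[-1]` feeds the classifier; for sequence-to-sequence tasks, every `hs[t]` produces an output.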
RNN Tasks: Sequence Classification and Sequence-to-Sequence
The same RNN architecture supports different tasks depending on which outputs you use:
Read the entire sequence, then use the final hidden state $h_\tau$ to make a single prediction. Examples: sentiment classification of an earnings call transcript (positive/negative), classifying a NIFTY intraday session as trending/ranging/reversing from the full tick sequence.
Output: one label per sequence, produced from $h_\tau$.
Produce an output at every time step. Examples: machine translation (input: Hindi sequence, output: English sequence), next-step prediction of NIFTY price from the history of returns up to time $t$.
Output: one prediction $\hat{y}_t$ per input $x_t$, produced at every step.
Backpropagation Through Time (BPTT)
Training an RNN requires computing gradients of the loss with respect to parameters $W_h$, $W_x$, $W_y$. Since the same parameters are used at every step, the gradient must be accumulated across all time steps. This is called Backpropagation Through Time (BPTT): unroll the RNN across $\tau$ steps, treat it as a very deep feedforward network with shared weights, and apply the chain rule.
The key gradient involves the loss $\mathcal{L}_T$ at time $T$ with respect to the hidden state at time $t < T$. By the chain rule:
$$\frac{\partial \mathcal{L}_T}{\partial h_t} = \frac{\partial \mathcal{L}_T}{\partial h_T} \cdot \frac{\partial h_T}{\partial h_t}$$
The term $\frac{\partial h_T}{\partial h_t}$ requires propagating the gradient back through all intermediate time steps $t, t+1, \ldots, T-1$. Each step contributes one Jacobian:
$$\frac{\partial h_{k+1}}{\partial h_k} = \text{diag}(\sigma'(z_{k+1})) \, W_h, \qquad z_{k+1} = W_h h_k + W_x x_{k+1} + b_1$$
The full product over all intermediate steps is:
$$\frac{\partial h_T}{\partial h_t} = \prod_{k=t}^{T-1} \text{diag}(\sigma'(z_{k+1})) \, W_h$$
Vanishing Gradients: The Mathematical Proof
This product of matrices is where the problem lives. Taking the norm:
$$\left\| \frac{\partial h_T}{\partial h_t} \right\| = \left\| \prod_{k=t}^{T-1} \text{diag}(\sigma'(z_{k+1})) \, W_h \right\|$$
Using the submultiplicativity of matrix norms ($\|AB\| \leq \|A\| \cdot \|B\|$):
$$\left\| \frac{\partial h_T}{\partial h_t} \right\| \leq \prod_{k=t}^{T-1} \left\| \text{diag}(\sigma'(z_{k+1})) \right\| \cdot \| W_h \|$$
For the sigmoid activation $\sigma(z) = 1/(1+e^{-z})$, the derivative satisfies $0 < \sigma'(z) \leq 1/4$ for all $z$. So $\|\text{diag}(\sigma'(z_{k+1}))\| \leq 1/4$. Therefore:
$$\left\| \frac{\partial h_T}{\partial h_t} \right\| \leq \left( \frac{\|W_h\|}{4} \right)^{T-t}$$
If $\|W_h\| < 4$, the bound $\left(\|W_h\|/4\right)^{T-t} \to 0$ exponentially as $T - t$ grows. With $T - t = 50$ (a gap of 50 steps) and $\|W_h\| = 2$: the bound is $(2/4)^{50} = (0.5)^{50} \approx 10^{-15}$. The gradient has numerically vanished.
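The decay is easy to check numerically. A minimal sketch, where the `vanishing_bound` helper is a hypothetical name and $\gamma = 1/4$ is the sigmoid's maximum derivative:

```python
def vanishing_bound(w_norm, gap, gamma=0.25):
    """Upper bound (gamma * ||W_h||)^(T - t) on ||dh_T/dh_t||; gamma = 1/4 for sigmoid."""
    return (gamma * w_norm) ** gap

# With ||W_h|| = 2 the bound halves with every extra step of separation:
for gap in (10, 25, 50):
    print(gap, vanishing_bound(2.0, gap))
```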
The vanishing gradient is not a bug that better initialisation or a different learning rate can fix. It is a structural consequence of multiplying many matrices together. For a sequence of length 100, the gradient from step 1 to step 100 has passed through 99 matrix multiplications. Each multiplication either shrinks or explodes the gradient exponentially.
In practice: vanilla RNNs can only learn dependencies spanning a few time steps. For financial time series where a pattern from 20 trading days ago is relevant to today's prediction, vanilla RNNs simply cannot capture it: the gradient signal has decayed to numerical zero before reaching those early steps.
The Root Cause: Multiplicative State Updates
The vanishing gradient problem arises because the vanilla RNN's state update is multiplicative: $$h_t = \sigma(W_h h_{t-1} + W_x x_t + b_1)$$
To get from $h_t$ to $h_T$, the gradient must pass through $T - t$ applications of $\sigma' \cdot W_h$, a product that decays exponentially. The fix is conceptually simple: make the state update additive rather than purely multiplicative. If $h_t$ is constructed partly as $h_{t-1} + \text{something}$, then $\frac{\partial h_T}{\partial h_t}$ contains a direct path with a multiplicative factor of 1 at each step, and hence no exponential decay.
The Solution: Additive Updates with Modulators
The additive update principle: instead of completely overwriting the hidden state at each step, interpolate between the previous state and a new candidate state, controlled by learned gating signals:
$$h_t = \alpha_t \odot h_{t-1} + \beta_t \odot \tilde{h}_t$$
where $\alpha_t, \beta_t \in (0, 1)^m$ are gate vectors computed from $h_{t-1}$ and $x_t$, $\tilde{h}_t$ is a candidate state, and $\odot$ denotes element-wise multiplication.
The key change: $h_{t-1}$ now contributes directly to $h_t$ via the $\alpha_t \odot h_{t-1}$ term, without passing through $\sigma(W_h \cdot)$. The gradient $\frac{\partial h_T}{\partial h_t}$ now expands as:
$$\frac{\partial h_T}{\partial h_t} = \prod_{k=t}^{T-1} \left( \text{diag}(\alpha_{k+1}) + \epsilon_{k+1} \right)$$
where $\epsilon_{k+1}$ contains the gradient through the $\beta \odot \tilde{h}$ path. If $\alpha_{k+1} \approx 1$ (the gate is open, preserve the state), the dominant term in the product is $\text{diag}(\alpha_{k+1}) \approx I$, the identity matrix. A product of near-identity matrices does not decay exponentially. Long-range gradients can survive.
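A scalar sketch makes the contrast concrete. The per-step factors below are illustrative stand-ins: 0.225 for the vanilla factor $\sigma'(z) \cdot w_h$ (its maximum with $w_h = 0.9$) and 0.99 for an open gate $\alpha$.

```python
def gradient_products(T, alpha=0.99, vanilla_factor=0.225):
    """Dominant gradient magnitude after T steps: the vanilla RNN multiplies a
    factor like sigma'(z) * w_h (at most 0.225 here) per step, while the
    additive/gated update multiplies a near-one gate value alpha per step."""
    return vanilla_factor ** T, alpha ** T

vanilla, gated = gradient_products(50)
# vanilla has collapsed to ~1e-33; gated is still ~0.6
```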
Residual Connections: The Same Idea in a Different Form
The additive update principle is not unique to RNNs. In deep feedforward networks and CNNs, the same problem (gradient decay through many layers) is solved by residual connections:
$$y = x + F(x)$$
The output is the input $x$ plus a learned residual $F(x)$. The gradient of the loss with respect to the input has a direct path: $\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} \cdot \left(1 + \frac{\partial F(x)}{\partial x}\right)$. The $+1$ ensures that even if $\frac{\partial F}{\partial x}$ vanishes, the gradient still flows with factor 1. This is how ResNets (Residual Networks) enabled training of networks with 100+ layers: the same additive structure prevents gradient decay with depth.
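The $+1$ identity path can be checked with a finite-difference sketch. The `flat` branch is a hypothetical near-zero residual function, chosen so that $F'(x) \approx 0$.

```python
import math

def residual(x, f):
    """Residual block y = x + F(x) for scalar x."""
    return x + f(x)

def numeric_grad(fn, x, eps=1e-6):
    """Central finite-difference estimate of fn'(x)."""
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)

# Even though F'(x) ~ 0, dy/dx stays pinned near 1 by the identity path.
flat = lambda x: 1e-9 * math.sin(x)
g_flat = numeric_grad(lambda x: residual(x, flat), 0.3)

# A non-trivial branch F(x) = x^2 gives dy/dx = 1 + 2x.
g_quad = numeric_grad(lambda x: residual(x, lambda t: t * t), 0.3)
```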
Consider a scalar vanilla RNN: $h_t = \sigma(w_h h_{t-1} + w_x x_t)$ with $w_h = 0.9$ (a typical initialisation). Compute the gradient $\frac{\partial h_T}{\partial h_1}$ for several sequence lengths $T$.
In the scalar case, $\frac{\partial h_T}{\partial h_1} = \prod_{k=1}^{T-1} \sigma'(z_{k+1}) \cdot w_h$. With $\sigma'(z) \leq 0.25$ and $w_h = 0.9$, each factor is at most $0.25 \times 0.9 = 0.225$.
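Tabulating this bound for several sequence lengths (the helper name is illustrative):

```python
def scalar_grad_bound(T, w_h=0.9, sp_max=0.25):
    """Bound on |dh_T/dh_1| for the scalar RNN: (sigma'_max * w_h)^(T-1)."""
    return (sp_max * w_h) ** (T - 1)

# Each additional step multiplies the bound by at most 0.225:
for T in (5, 10, 20, 50):
    print(T, scalar_grad_bound(T))
```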
A tiny RNN with hidden size $m = 2$ processes 3 NIFTY intraday returns (in %) to classify the session as "trending" ($+1$) or "mean-reverting" ($-1$). Parameters: $W_h = \begin{bmatrix}0.5 & 0.1 \\ 0.1 & 0.5\end{bmatrix}$, $W_x = \begin{bmatrix}0.8 \\ -0.3\end{bmatrix}$ (1D input), $b_1 = \mathbf{0}$. Initial state $h_0 = \mathbf{0}$. Input sequence: $x_1 = +0.5\%$, $x_2 = +0.8\%$, $x_3 = +0.6\%$ (consistent positive returns, a trending session).
The final hidden state $h_3$ is fed to a linear classifier $\hat{y} = w_y^\top h_3 + b_y$. Under the sigmoid activation, $h_3 \approx [0.71, 0.54]^\top$; with $w_y = [1, 1]^\top$ and $b_y = 0$, $\hat{y} \approx 0.71 + 0.54 = 1.25 > 0$, so predict "trending."
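The forward pass can be verified in a few lines of pure Python, assuming the sigmoid activation used in the rest of the chapter:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

W_h = [[0.5, 0.1], [0.1, 0.5]]   # recurrent weights
W_x = [0.8, -0.3]                # input weights (1-D input)
xs = [0.5, 0.8, 0.6]             # intraday returns in %

h = [0.0, 0.0]                   # h_0 = 0
for x in xs:
    z = [sum(W_h[i][j] * h[j] for j in range(2)) + W_x[i] * x for i in range(2)]
    h = [sigmoid(zi) for zi in z]

# h after 3 steps is roughly [0.710, 0.537]
y_hat = h[0] + h[1]              # w_y = [1, 1], b_y = 0
label = "trending" if y_hat > 0 else "mean-reverting"
```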
The additive update principle (modulators, gating, residual connections) solves the vanishing gradient problem. But there is a deeper architectural question: does sequential processing (one step at a time) need to be the default? Attention mechanisms allow the network to directly connect any two positions in a sequence, regardless of distance, in a single operation. Chapter 11 derives the attention mechanism from first principles, shows why self-attention replaces recurrence, and builds the Transformer encoder-decoder from the ground up.
An RNN is trained to predict the next NIFTY return given the last 100 daily returns. After training, the model performs well on short-range patterns but seems to ignore patterns from more than 10 days ago. What is the most likely mathematical explanation?
The vanishing gradient proof shows that $\|\partial h_T/\partial h_t\| \leq (\|W_h\|/4)^{T-t}$. For $T - t = 10$ and typical $\|W_h\|$, this bound is already very small. For $T - t = 90$ (patterns from 90 days ago), the gradient is numerically zero. The parameters $W_h$, $W_x$ are updated by gradients that sum over all time steps, but the terms for distant steps are zero, so those dependencies are never learned regardless of training duration or learning rate.
Option B is wrong: a hidden state of even modest dimension (say 64) could theoretically encode 100 days of information; the issue is not capacity but trainability. Option D is wrong: the vanishing gradient is independent of learning rate; it is a structural problem, not an optimisation parameter problem.
In the additive state update $h_t = \alpha_t \odot h_{t-1} + \beta_t \odot \tilde{h}_t$, what happens when $\alpha_t \approx \mathbf{1}$ and $\beta_t \approx \mathbf{0}$ for several consecutive time steps? What does this represent semantically?
With $\alpha_t \approx \mathbf{1}$ and $\beta_t \approx \mathbf{0}$: $h_t \approx \mathbf{1} \odot h_{t-1} + \mathbf{0} \odot \tilde{h}_t = h_{t-1}$. The hidden state is preserved almost exactly. Semantically, this means the network has decided that the current input $x_t$ is not informative enough to update its memory: it holds its state, waiting for a relevant input.
This is precisely the "memory" property that vanilla RNNs lack. A well-trained LSTM (which implements exactly this gating) learns to hold relevant information over long stretches of uninformative input. For example, in processing an earnings call transcript, the network might latch onto "revenue beat" early in the call and hold that information through many sentences of boilerplate text, until it becomes relevant again at the sentiment prediction step.
The vanishing gradient bound is $\|\partial h_T / \partial h_t\| \leq (\|W_h\| \cdot \gamma)^{T-t}$ where $\gamma = \max_k \|\text{diag}(\sigma'(z_k))\|$. For the ReLU activation $\sigma(z) = \max(0, z)$, what is $\gamma$, and does the vanishing gradient problem disappear?
ReLU: $\sigma'(z) = 1$ if $z > 0$ (active), $0$ if $z \leq 0$ (inactive). So $\gamma = 1$ for active units. The bound becomes $(\|W_h\| \cdot 1)^{T-t} = \|W_h\|^{T-t}$. If $\|W_h\| < 1$, gradients still vanish (just more slowly than with sigmoid). If $\|W_h\| = 1$, gradients neither vanish nor explode for active neurons, a significant improvement.
But ReLU introduces the "dead neuron" problem: a unit with $z \leq 0$ has $\sigma'(z) = 0$, completely blocking gradient flow through that unit. If many units become permanently inactive (which can happen with poor initialisation or large learning rates), the effective $\gamma$ approaches 0 and gradients vanish just as badly. ReLU helps but does not solve the structural problem; that solution requires additive updates (LSTMs/GRUs) or attention mechanisms.
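A scalar sketch of the three ReLU regimes (the `relu_grad_product` helper and the sample pre-activation sequences are illustrative):

```python
def relu_grad_product(z_seq, w_h):
    """Product of per-step factors relu'(z_k) * w_h along a scalar RNN path."""
    prod = 1.0
    for z in z_seq:
        prod *= (1.0 if z > 0 else 0.0) * w_h
    return prod

all_active = relu_grad_product([0.2] * 40, 1.0)         # gamma = 1, |w_h| = 1: stays 1
one_dead = relu_grad_product([0.2] * 39 + [-0.1], 1.0)  # one dead step zeroes the path
shrinking = relu_grad_product([0.2] * 40, 0.9)          # |w_h| < 1: still decays
```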
A residual connection $y = x + F(x)$ and an RNN additive update $h_t = \alpha_t \odot h_{t-1} + \beta_t \odot \tilde{h}_t$ both solve the vanishing gradient problem. What is the common mathematical principle they share, and how does it guarantee gradient flow?
For residual connections: $\frac{\partial y}{\partial x} = 1 + \frac{\partial F(x)}{\partial x}$. Even if $\frac{\partial F}{\partial x} \to 0$ (the learned transformation vanishes), the gradient is at least 1; it cannot drop below 1 thanks to the identity term. Stacking $L$ such layers: $\frac{\partial \mathcal{L}}{\partial x_0} = \prod_l (1 + \frac{\partial F_l}{\partial x_l})$. Each factor is $\geq 1$ when $\frac{\partial F_l}{\partial x_l} \geq 0$, preventing decay.
For the RNN additive update: $\frac{\partial h_T}{\partial h_t} = \prod_{k=t}^{T-1}(\text{diag}(\alpha_{k+1}) + \epsilon_{k+1})$. When $\alpha_{k+1} \approx 1$, each factor is near the identity $I$, so the product does not decay exponentially. The common principle: add the input directly to the output, creating an identity shortcut that the gradient can always travel through, regardless of what the learned transformation does.
Q1–3 are direct computations on the RNN equations. Q4–7 require connecting BPTT, vanishing gradients, and the additive update principle. Q8–10 are synthesis questions spanning this chapter and Ch9.