Intermediate · Chapter 10 of 13 · Mathematical Foundations of ML

Recurrent Neural Networks

CNNs exploit spatial structure. But language, clinical time-series, and financial tick data have a different kind of structure: order matters, inputs have variable length, and the meaning of any element depends on what came before. Recurrent Neural Networks process sequences by maintaining a hidden state that persists across time; we then discover, with full mathematical proof, why training them on long sequences is fundamentally broken.

Why Sequential Data Breaks the MLP

An MLP takes a fixed-size input $x \in \mathbb{R}^d$ and produces an output. This is fine for images (fixed dimensions) and tabular data (fixed features). But many real-world problems involve sequences of variable length:

๐Ÿ“ Language and Text

A sentence has variable length. "NIFTY fell" has 2 tokens; "NIFTY fell sharply after RBI policy announcement" has 7. An MLP cannot take inputs of different sizes. And the meaning of "fell" depends on the context built by the preceding words.

Financial Time Series

A tick-by-tick sequence of NIFTY trades is variable length: some days have 2 million ticks, some have 3 million. More importantly, whether a price move at time $t$ is meaningful depends on the preceding sequence of prices, volumes, and order book states.

The CNN's solution (local filters and parameter sharing) helps with fixed-size sequences but still can't handle variable-length inputs naturally, and its local receptive field may miss long-range dependencies. A fundamentally different architecture is needed: one that reads a sequence one element at a time, maintaining a summary of everything seen so far in a hidden state.

The RNN: State Across Time

Let $X = (x_1, x_2, \ldots, x_\tau)$ be a sequence of inputs, where each $x_t \in \mathbb{R}^d$. An RNN processes this sequence step by step, maintaining a hidden state $h_t \in \mathbb{R}^m$ that is updated at each time step:

The vanilla RNN update equations are: $$z_t = W_h h_{t-1} + W_x x_t + b_1$$ $$h_t = \sigma(z_t)$$ $$\hat{y}_t = W_y h_t + b_2$$ Parameters: $W_h \in \mathbb{R}^{m \times m}$ (hidden-to-hidden), $W_x \in \mathbb{R}^{m \times d}$ (input-to-hidden), $W_y \in \mathbb{R}^{k \times m}$ (hidden-to-output), biases $b_1, b_2$. The same parameters are used at every time step; this is parameter sharing across time, the temporal analogue of the CNN's spatial parameter sharing.

The hidden state $h_t$ is the RNN's memory: it encodes a summary of the entire input history $\{x_1, \ldots, x_t\}$ in a fixed-size vector of dimension $m$, regardless of how long the sequence is. The output $\hat{y}_t$ at each step is computed from this summary.

Think of the hidden state as a running summary being updated as the sequence is read. At each step, the RNN decides: what from the previous summary $h_{t-1}$ is worth keeping, and what from the new input $x_t$ is worth incorporating? The update $h_t = \sigma(W_h h_{t-1} + W_x x_t + b_1)$ blends the two. The weight matrix $W_h$ governs how much of the previous memory persists; $W_x$ governs how strongly new inputs update the memory. These are learned from data.
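The recurrence is only a few lines of code. A minimal NumPy sketch (the dimensions, the weight scale, and the random inputs are illustrative, not prescribed by the chapter):

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b1):
    """One vanilla RNN update: h_t = sigmoid(W_h @ h_prev + W_x @ x_t + b1)."""
    z_t = W_h @ h_prev + W_x @ x_t + b1
    return 1.0 / (1.0 + np.exp(-z_t))

def rnn_forward(xs, W_h, W_x, b1, h0):
    """Apply the SAME parameters at every step: parameter sharing across time."""
    h, states = h0, []
    for x_t in xs:
        h = rnn_step(h, x_t, W_h, W_x, b1)
        states.append(h)
    return states

# Hidden size m = 3, input size d = 2, a sequence of length 5.
rng = np.random.default_rng(0)
m, d = 3, 2
W_h = 0.1 * rng.standard_normal((m, m))
W_x = 0.1 * rng.standard_normal((m, d))
b1, h0 = np.zeros(m), np.zeros(m)
xs = [rng.standard_normal(d) for _ in range(5)]
states = rnn_forward(xs, W_h, W_x, b1, h0)
print(states[-1])   # fixed-size summary of the whole variable-length sequence
```

The same `rnn_forward` handles a sequence of any length, which is exactly what the MLP could not do.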

RNN Tasks: Sequence Classification and Sequence-to-Sequence

The same RNN architecture supports different tasks depending on which outputs you use:

Sequence Classification

Read the entire sequence, then use the final hidden state $h_\tau$ to make a single prediction. Examples: sentiment classification of an earnings call transcript (positive/negative), classifying a NIFTY intraday session as trending/ranging/reversing from the full tick sequence.

Output: one label per sequence, produced from $h_\tau$.

Sequence-to-Sequence Regression

Produce an output at every time step. Examples: machine translation (input: Hindi sequence, output: English sequence), next-step prediction of NIFTY price from the history of returns up to time $t$.

Output: one prediction $\hat{y}_t$ per input $x_t$, produced at every step.

NSE's surveillance system uses sequence models to detect spoofing and layering: manipulative order patterns where large orders are placed and cancelled to create artificial price pressure. The system reads the sequence of order book events (place, cancel, modify, execute) for each security and flags anomalous patterns. An RNN processes the variable-length order event sequence and produces a suspicion score, exactly the sequence classification task described above. The same architecture underlies fraud detection in UPI transaction sequences at NPCI.

Backpropagation Through Time (BPTT)

Training an RNN requires computing gradients of the loss with respect to parameters $W_h$, $W_x$, $W_y$. Since the same parameters are used at every step, the gradient must be accumulated across all time steps. This is called Backpropagation Through Time (BPTT): unroll the RNN across $\tau$ steps, treat it as a very deep feedforward network with shared weights, and apply the chain rule.

The key gradient involves the loss at time $T$ with respect to the hidden state at time $t < T$. By the chain rule:

$$\frac{\partial L_T}{\partial h_t} = \frac{\partial L_T}{\partial h_T} \cdot \frac{\partial h_T}{\partial h_t}$$

The term $\frac{\partial h_T}{\partial h_t}$ requires propagating the gradient back through all intermediate time steps $t, t+1, \ldots, T-1$. Each step contributes one Jacobian:

$$\frac{\partial h_{k+1}}{\partial h_k} = \text{diag}(\sigma'(z_{k+1})) \cdot W_h$$

The full product over all intermediate steps is:

$$\frac{\partial h_T}{\partial h_t} = \prod_{k=t}^{T-1} \frac{\partial h_{k+1}}{\partial h_k} = \prod_{k=t}^{T-1} \text{diag}(\sigma'(z_{k+1})) \cdot W_h$$
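This Jacobian product can be formed explicitly for a small network. A sketch, assuming a sigmoid RNN with illustrative dimensions, weight scale, and random inputs; it runs a forward pass, stores the pre-activations, then multiplies the per-step Jacobians:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
m, d, T = 4, 2, 30
W_h = 0.4 * rng.standard_normal((m, m))
W_x = rng.standard_normal((m, d))

# Forward pass, storing pre-activations z_1, ..., z_T (zs[i] holds z_{i+1}).
h, zs = np.zeros(m), []
for _ in range(T):
    z = W_h @ h + W_x @ rng.standard_normal(d)
    h = sigmoid(z)
    zs.append(z)

def jacobian_product(t):
    """dh_T/dh_t = prod_{k=t}^{T-1} diag(sigma'(z_{k+1})) @ W_h."""
    J = np.eye(m)
    for k in range(t, T):
        s = sigmoid(zs[k])                    # zs[k] stores z_{k+1}
        J = (np.diag(s * (1.0 - s)) @ W_h) @ J
    return J

for t in (T - 1, T - 10, 1):
    print(t, np.linalg.norm(jacobian_product(t), 2))   # norm shrinks as T - t grows
```

Printing the spectral norm for increasing gaps $T - t$ shows the exponential decay derived next.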

Vanishing Gradients: The Mathematical Proof

This product of matrices is where the problem lives. Taking the norm:

Derivation · Exponential Decay of Gradients in Vanilla RNNs (Chain Rule · Norm Bound · $T - t$ steps)
Step 1: Bound the norm of the product

Using the submultiplicativity of matrix norms ($\|AB\| \leq \|A\| \cdot \|B\|$):

$$\left\|\frac{\partial h_T}{\partial h_t}\right\| \leq \prod_{k=t}^{T-1} \left\|\text{diag}(\sigma'(z_{k+1}))\right\| \cdot \|W_h\|$$
Step 2: Bound the sigmoid derivative

For the sigmoid activation $\sigma(z) = 1/(1+e^{-z})$, the derivative satisfies $0 < \sigma'(z) \leq 1/4$ for all $z$. So $\|\text{diag}(\sigma'(z_{k+1}))\| \leq 1/4$. Therefore:

$$\left\|\frac{\partial h_T}{\partial h_t}\right\| \leq \left(\frac{\|W_h\|}{4}\right)^{T-t}$$
Step 3: Exponential decay

If $\|W_h\| < 4$, the bound $\left(\|W_h\|/4\right)^{T-t} \to 0$ exponentially as $T - t$ grows. With $T - t = 50$ (a gap of 50 time steps) and $\|W_h\| = 2$: the bound is $(2/4)^{50} = (0.5)^{50} \approx 10^{-15}$. The gradient has numerically vanished.

Vanishing gradient theorem: In a vanilla RNN with bounded activation derivatives and $\|W_h\| < 4$, the gradient $\frac{\partial L_T}{\partial h_t}$ decays exponentially in the distance $T - t$. For long sequences, gradients from early time steps are effectively zero: the network cannot learn long-range dependencies.
Why This Is a Fundamental Problem

The vanishing gradient is not a bug that better initialisation or a different learning rate can fix. It is a structural consequence of multiplying many matrices together. For a sequence of length 100, the gradient from step 1 to step 100 has passed through 99 matrix multiplications. Each multiplication either shrinks or explodes the gradient exponentially.

In practice: vanilla RNNs can only learn dependencies spanning a few time steps. For financial time series where a pattern from 20 trading days ago is relevant to today's prediction, vanilla RNNs simply cannot capture it: the gradient signal has decayed to numerical zero before reaching those early steps.

There is also an exploding gradient problem when $\|W_h\| > 4$: the gradient grows exponentially and training diverges. The standard fix for exploding gradients is gradient clipping: if $\|\nabla \mathcal{L}\| > \text{threshold}$, rescale the gradient to have norm equal to the threshold. This prevents divergence but does not solve the vanishing problem.
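Clipping by global norm is a few lines. A minimal sketch (the threshold value is illustrative):

```python
import numpy as np

def clip_gradient(grad, threshold):
    """If ||grad|| exceeds the threshold, rescale to have norm == threshold;
    otherwise leave it untouched. The direction is preserved either way."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

g = np.array([3.0, 4.0])            # ||g|| = 5: too large
print(clip_gradient(g, 1.0))        # rescaled to unit norm
print(clip_gradient(0.1 * g, 1.0))  # ||g|| = 0.5 < 1: unchanged
```

Note that a vanished gradient of norm $10^{-15}$ passes through untouched, which is exactly why clipping cannot fix vanishing.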

The Root Cause: Multiplicative State Updates

The vanishing gradient problem arises because the vanilla RNN's state update is multiplicative: $$h_t = \sigma(W_h h_{t-1} + W_x x_t + b_1)$$

To get from $h_t$ to $h_T$, the gradient must pass through $T - t$ applications of $\sigma' \cdot W_h$, a product that decays exponentially. The fix is conceptually simple: make the state update additive rather than purely multiplicative. If $h_t$ is constructed partly as $h_{t-1} + \text{something}$, then $\frac{\partial h_T}{\partial h_t}$ contains a direct path with a multiplicative factor of 1 at each step, with no exponential decay.

The Solution: Additive Updates with Modulators

The additive update principle: instead of completely overwriting the hidden state at each step, interpolate between the previous state and a new candidate state, controlled by learned gating signals:

The additive state update with modulators $\alpha_t$ and $\beta_t$: $$h_t = \alpha_t \odot h_{t-1} + \beta_t \odot \tilde{h}_t$$ where $\tilde{h}_t = \sigma(W_2 x_t + W_1 h_{t-1})$ is the candidate new state, and $\odot$ denotes element-wise multiplication. The modulators $\alpha_t, \beta_t \in [0,1]^m$ are vectors of gating values, learned functions of the current input and previous state.

The key change: $h_{t-1}$ now contributes directly to $h_t$ via the $\alpha_t \odot h_{t-1}$ term, without passing through $\sigma(W_h \cdot)$. The gradient $\frac{\partial h_T}{\partial h_t}$ now expands as:

$$\frac{\partial h_T}{\partial h_t} = \prod_{k=t}^{T-1}\left(\text{diag}(\alpha_{k+1}) + \epsilon_{k+1}\right)$$

where $\epsilon_{k+1}$ contains the gradient through the $\beta \odot \tilde{h}$ path. If $\alpha_{k+1} \approx 1$ (the gate is open, preserve the state), the dominant term in the product is $\text{diag}(\alpha_{k+1}) \approx I$, the identity matrix. A product of near-identity matrices does not decay exponentially. Long-range gradients can survive.

The modulator $\alpha_t$ is a forget gate: $\alpha_t \approx 1$ means "keep the previous state almost intact," $\alpha_t \approx 0$ means "reset; this is a new context." The modulator $\beta_t$ is an input gate: $\beta_t \approx 1$ means "incorporate the new input fully," $\beta_t \approx 0$ means "ignore this input." The network learns when to remember and when to forget, adapting to the structure of the specific sequence it is processing. This gating mechanism is the core idea of the LSTM (Long Short-Term Memory) network, which instantiates exactly this principle with specific parametric forms for $\alpha_t$ and $\beta_t$.
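The chapter leaves the exact form of $\alpha_t$ and $\beta_t$ open. A hypothetical sigmoid parametrisation (the weight names, scales, and gate biases below are illustrative assumptions) makes the "hold memory" regime concrete: pushing the forget-gate bias high and the input-gate bias low carries the state through almost unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(h_prev, x_t, p):
    """Additive update h_t = alpha * h_prev + beta * h_tilde (elementwise),
    with alpha (forget gate) and beta (input gate) in (0,1)^m."""
    alpha = sigmoid(p["Wa"] @ x_t + p["Ua"] @ h_prev + p["ba"])
    beta = sigmoid(p["Wb"] @ x_t + p["Ub"] @ h_prev + p["bb"])
    h_tilde = sigmoid(p["W2"] @ x_t + p["W1"] @ h_prev)   # candidate state
    return alpha * h_prev + beta * h_tilde

m, d = 3, 2
rng = np.random.default_rng(0)
p = {
    "Wa": 0.1 * rng.standard_normal((m, d)), "Ua": 0.1 * rng.standard_normal((m, m)),
    "Wb": 0.1 * rng.standard_normal((m, d)), "Ub": 0.1 * rng.standard_normal((m, m)),
    "W1": 0.1 * rng.standard_normal((m, m)), "W2": 0.1 * rng.standard_normal((m, d)),
    "ba": np.full(m, 5.0),    # large positive bias: alpha ~ 1 (keep the state)
    "bb": np.full(m, -5.0),   # large negative bias: beta ~ 0 (ignore the input)
}
h = rng.standard_normal(m)
h_next = gated_step(h, rng.standard_normal(d), p)
print(np.max(np.abs(h_next - h)))   # tiny: memory carried forward almost unchanged
```

In a trained network these biases are learned, so the gates open and close in response to the data rather than being fixed.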

Residual Connections: The Same Idea in a Different Form

The additive update principle is not unique to RNNs. In deep feedforward networks and CNNs, the same problem (gradient decay through many layers) is solved by residual connections:

$$y = x + F(x)$$

The output is the input $x$ plus a learned residual $F(x)$. The gradient of the loss with respect to the input has a direct path: $\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} \cdot \left(1 + \frac{\partial F(x)}{\partial x}\right)$. The $+1$ ensures that even if $\frac{\partial F}{\partial x}$ vanishes, the gradient still flows with factor 1. This is how ResNets (Residual Networks) enabled training of networks with 100+ layers: the same additive structure prevents gradient decay with depth.

The additive update in RNNs and the residual connection in deep CNNs are the same mathematical idea expressed in different architectures. In both cases: instead of transforming the signal $x \to F(x)$, transform it as $x \to x + F(x)$. The identity path $x \to x$ is always available for gradients to travel through, bypassing the potentially vanishing transformation $F(x)$. This principle is universal: any time you stack many transformations, adding identity shortcuts prevents gradient decay.
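The effect is easy to measure numerically. A sketch comparing a plain 50-layer scalar chain with its residual counterpart, using $F(x) = 0.5\tanh(x)$ so that $0 < F'(x) \le 0.5$ (the depth and the choice of $F$ are illustrative):

```python
import numpy as np

L = 50                      # depth of both chains
xp = xr = 0.3               # same scalar input for both
grad_plain = grad_res = 1.0

for _ in range(L):
    # Plain chain: x <- F(x); gradient picks up a factor F'(x) <= 0.5 per layer.
    dp = 0.5 * (1.0 - np.tanh(xp) ** 2)
    grad_plain *= dp
    xp = 0.5 * np.tanh(xp)

    # Residual chain: x <- x + F(x); gradient picks up 1 + F'(x) >= 1 per layer.
    dr = 0.5 * (1.0 - np.tanh(xr) ** 2)
    grad_res *= 1.0 + dr
    xr = xr + 0.5 * np.tanh(xr)

print(grad_plain)   # vanishingly small after 50 multiplications
print(grad_res)     # at least 1: the identity path kept the gradient alive
```

The plain chain's gradient is bounded by $0.5^{50} \approx 10^{-15}$, while every residual factor is at least 1.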
Worked Example 1 · Vanishing Gradient: Numerical Demonstration (Exponential Decay in Practice)

Consider a scalar vanilla RNN: $h_t = \sigma(w_h h_{t-1} + w_x x_t)$ with $w_h = 0.9$ (a typical initialisation). Compute the gradient $\frac{\partial h_T}{\partial h_1}$ for several sequence lengths $T$.

Gradient formula (scalar case)

In the scalar case, $\frac{\partial h_T}{\partial h_1} = \prod_{k=1}^{T-1} \sigma'(z_{k+1}) \cdot w_h$. With $\sigma'(z) \leq 0.25$ and $w_h = 0.9$, each factor is at most $0.25 \times 0.9 = 0.225$.

Compute for various $T$
$$\begin{array}{lcc} T & \text{Steps back} & \text{Gradient bound}\\ \hline 2 & 1 & 0.225^1 = 0.225\\ 5 & 4 & 0.225^4 \approx 0.0026\\ 10 & 9 & 0.225^9 \approx 1.5 \times 10^{-6}\\ 20 & 19 & 0.225^{19} \approx 5 \times 10^{-13}\\ 50 & 49 & 0.225^{49} \approx 10^{-31} \end{array}$$
By step 10, the gradient from step 1 is already down at $10^{-6}$. By step 20 it is effectively zero. A NIFTY intraday model trying to learn that a gap-up opening at 9:15 AM is relevant to the 3:30 PM close, 50+ time steps later, will find that gradient signal completely extinguished before it reaches the first few time steps. The RNN simply cannot learn this dependency.
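The table above can be reproduced in a few lines, using the per-step bound $\sigma'_{\max} \cdot w_h = 0.25 \times 0.9 = 0.225$:

```python
# Per backward step the scalar gradient picks up at most sigma'_max * w_h.
per_step = 0.25 * 0.9   # = 0.225
for T in (2, 5, 10, 20, 50):
    print(f"T={T:2d}  steps back={T - 1:2d}  bound={per_step ** (T - 1):.3g}")
```

Each extra step back multiplies the bound by 0.225, which is the exponential decay in tabular form.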
Worked Example 2 · RNN Forward Pass on a Financial Sequence (NIFTY Intraday: Sequence Classification)

A tiny RNN with hidden size $m = 2$ processes 3 NIFTY intraday returns (in %) to classify the session as "trending" (+1) or "mean-reverting" (−1). Parameters: $W_h = \begin{bmatrix}0.5 & 0.1 \\ 0.1 & 0.5\end{bmatrix}$, $W_x = \begin{bmatrix}0.8 \\ -0.3\end{bmatrix}$ (1D input), $b_1 = \mathbf{0}$. Initial state $h_0 = \mathbf{0}$. Input sequence: $x_1 = +0.5\%$, $x_2 = +0.8\%$, $x_3 = +0.6\%$ (consistent positive returns: a trending session).

Step 1 ($t = 1$): $z_1 = W_h h_0 + W_x x_1$
$$z_1 = \begin{bmatrix}0.5&0.1\\0.1&0.5\end{bmatrix}\begin{bmatrix}0\\0\end{bmatrix} + \begin{bmatrix}0.8\\-0.3\end{bmatrix}(0.5) = \begin{bmatrix}0.4\\-0.15\end{bmatrix}$$ $$h_1 = \sigma(z_1) = \begin{bmatrix}\sigma(0.4)\\\sigma(-0.15)\end{bmatrix} \approx \begin{bmatrix}0.599\\0.463\end{bmatrix}$$
Step 2 ($t = 2$): $z_2 = W_h h_1 + W_x x_2$
$$z_2 = \begin{bmatrix}0.5&0.1\\0.1&0.5\end{bmatrix}\begin{bmatrix}0.599\\0.463\end{bmatrix} + \begin{bmatrix}0.8\\-0.3\end{bmatrix}(0.8) \approx \begin{bmatrix}0.986\\0.051\end{bmatrix}$$ $$h_2 \approx \begin{bmatrix}\sigma(0.986)\\\sigma(0.051)\end{bmatrix} \approx \begin{bmatrix}0.728\\0.513\end{bmatrix}$$
Step 3 ($t = 3$): $z_3 = W_h h_2 + W_x x_3 \approx [0.895,\; 0.149]^\top$, so the final state is $h_3 \approx [0.710,\; 0.537]^\top$

The final hidden state $h_3$ is fed to a linear classifier $\hat{y} = w_y^\top h_3 + b_y$. With $w_y = [1, 1]^\top$ and $b_y = 0$, $\hat{y} \approx 0.710 + 0.537 = 1.25 > 0$ → predict "trending."

The hidden state accumulates evidence of the trending pattern: starting from zero, both components rise sharply and stay elevated across the three positive-return steps, tracking the momentum. With a longer run of positive returns, this evidence would accumulate further. Note how $W_h$ allows the previous state to influence the next: $h_1$'s large first component (0.599) contributes $0.5 \times 0.599 \approx 0.30$ to $z_2$'s first component, carrying forward the memory of step 1.
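The whole worked example fits in a few lines of NumPy, which is a handy way to check the arithmetic:

```python
import numpy as np

W_h = np.array([[0.5, 0.1], [0.1, 0.5]])
W_x = np.array([0.8, -0.3])
h = np.zeros(2)                          # h_0 = 0
for x in (0.5, 0.8, 0.6):                # the three intraday returns (in %)
    h = 1.0 / (1.0 + np.exp(-(W_h @ h + W_x * x)))
    print(h.round(3))                    # h_1, h_2, h_3

w_y = np.array([1.0, 1.0])               # linear read-out, b_y = 0
y_hat = w_y @ h
print("trending" if y_hat > 0 else "mean-reverting")
```

Running the loop prints each hidden state in turn and ends with the positive classifier score.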
Chapter 11 Preview

The additive update principle (modulators, gating, residual connections) solves the vanishing gradient problem. But there is a deeper architectural question: does sequential processing (one step at a time) need to be the default? Attention mechanisms allow the network to directly connect any two positions in a sequence, regardless of distance, in a single operation. Chapter 11 derives the attention mechanism from first principles, shows why self-attention replaces recurrence, and builds the Transformer encoder-decoder from the ground up.


Practice Problems
4 questions · Chapter 10
ml / mathematical-foundations / ch10 / q01 ★ Conceptual

An RNN is trained to predict the next NIFTY return given the last 100 daily returns. After training, the model performs well on short-range patterns but seems to ignore patterns from more than 10 days ago. What is the most likely mathematical explanation?

A
The model has overfit: it memorised the training data and ignores useful patterns.
B
The hidden state dimension is too small to store 100 days of information.
C
Vanishing gradients: during BPTT, the gradient from the loss at time $T$ with respect to hidden states more than 10 steps back has decayed exponentially to near zero, so the parameters received no useful signal to learn those long-range dependencies.
D
The learning rate is too high: a smaller learning rate would allow the model to learn longer-range patterns.
Answer: C.

The vanishing gradient proof shows that $\|\partial h_T/\partial h_t\| \leq (\|W_h\|/4)^{T-t}$. For $T - t = 10$ and typical $\|W_h\|$, this bound is already very small. For $T - t = 90$ (patterns from 90 days ago), the gradient is numerically zero. The parameters $W_h$, $W_x$ are updated by gradients that sum over all time steps, but the terms for distant steps are zero, so those dependencies are never learned regardless of training duration or learning rate.

Option B is wrong: a hidden state of even modest dimension (say 64) could theoretically encode 100 days of information; the issue is not capacity but trainability. Option D is wrong: the vanishing gradient is independent of learning rate; it is a structural problem, not an optimisation parameter problem.
ml / mathematical-foundations / ch10 / q02 ★ Conceptual

In the additive state update $h_t = \alpha_t \odot h_{t-1} + \beta_t \odot \tilde{h}_t$, what happens when $\alpha_t \approx \mathbf{1}$ and $\beta_t \approx \mathbf{0}$ for several consecutive time steps? What does this represent semantically?

A
The hidden state resets to zero: the network forgets all previous context.
B
The hidden state is nearly unchanged: the network carries its existing memory forward without updating it, effectively ignoring uninformative inputs during those steps.
C
The network enters an unstable regime where gradients explode.
D
The candidate state $\tilde{h}_t$ dominates and completely overwrites the previous memory.
Answer: B.

With $\alpha_t \approx \mathbf{1}$ and $\beta_t \approx \mathbf{0}$: $h_t \approx \mathbf{1} \odot h_{t-1} + \mathbf{0} \odot \tilde{h}_t = h_{t-1}$. The hidden state is preserved almost exactly. Semantically, this means the network has decided that the current inputs $x_t$ are not informative enough to update its memory: it holds its state, waiting for a relevant input.

This is precisely the "memory" property that vanilla RNNs lack. A well-trained LSTM (which implements exactly this gating) learns to hold relevant information over long stretches of uninformative input. For example, in processing an earnings call transcript, the network might latch onto "revenue beat" early in the call and hold that information through many sentences of boilerplate text, until it becomes relevant again at the sentiment prediction step.
ml / mathematical-foundations / ch10 / q03 ★★ Mathematical

The vanishing gradient bound is $\|\partial h_T / \partial h_t\| \leq (\|W_h\| \cdot \gamma)^{T-t}$ where $\gamma = \max_k \|\text{diag}(\sigma'(z_k))\|$. For the ReLU activation $\sigma(z) = \max(0, z)$, what is $\gamma$, and does the vanishing gradient problem disappear?

A
$\gamma = 0.25$ for ReLU: same as sigmoid, vanishing gradients are equally severe.
B
$\gamma = \infty$ for ReLU: the gradient explodes rather than vanishes.
C
$\gamma = 1$ where ReLU is active (derivative = 1) and $\gamma = 0$ where inactive (derivative = 0). Vanishing is reduced for active units but dead neurons (permanently inactive) still block gradient flow entirely.
D
ReLU has no derivative so BPTT cannot be applied to ReLU RNNs.
Answer: C.

ReLU: $\sigma'(z) = 1$ if $z > 0$ (active), $0$ if $z \leq 0$ (inactive). So $\gamma = 1$ for active units. The bound becomes $(\|W_h\| \cdot 1)^{T-t} = \|W_h\|^{T-t}$. If $\|W_h\| < 1$, gradients still vanish (just more slowly than with sigmoid). If $\|W_h\| = 1$, gradients neither vanish nor explode for active neurons, a significant improvement.

But the "dead neuron" problem: a ReLU unit with $z \leq 0$ has $\sigma'(z) = 0$, completely blocking gradient flow through that unit. If many units become permanently inactive (which can happen with poor initialisation or large learning rates), the effective $\gamma$ approaches 0 and gradients vanish just as badly. ReLU helps but does not solve the structural problem; that solution requires additive updates (LSTMs/GRUs) or attention mechanisms.
ml / mathematical-foundations / ch10 / q04 ★★ Synthesis

A residual connection $y = x + F(x)$ and an RNN additive update $h_t = \alpha_t \odot h_{t-1} + \beta_t \odot \tilde{h}_t$ both solve the vanishing gradient problem. What is the common mathematical principle they share, and how does it guarantee gradient flow?

A
Both use normalisation layers to keep activations in a range where gradients are large.
B
Both increase the learning rate for early layers, compensating for gradient decay.
C
Both create an additive identity path: the input (or previous state) is added directly to the output, giving the gradient a direct route with multiplicative factor $\geq 1$ even when the learned transformation vanishes.
D
Both use gating signals to selectively zero out uninformative gradients, keeping only the strongest signals.
Answer: C.

For residual connections: $\frac{\partial y}{\partial x} = 1 + \frac{\partial F(x)}{\partial x}$. Even if $\frac{\partial F}{\partial x} \to 0$ (the learned transformation vanishes), the gradient is at least 1: the identity term prevents it from vanishing below 1. Stacking $L$ such layers: $\frac{\partial \mathcal{L}}{\partial x_0} = \prod_l (1 + \frac{\partial F_l}{\partial x_l})$. Each factor is $\geq 1$ when $\frac{\partial F_l}{\partial x_l} \geq 0$, preventing decay.

For the RNN additive update: $\frac{\partial h_T}{\partial h_t} = \prod_{k=t}^{T-1}(\text{diag}(\alpha_{k+1}) + \epsilon_{k+1})$. When $\alpha_{k+1} \approx 1$, each factor is near the identity $I$, so the product does not decay exponentially. The common principle: add the input directly to the output, creating an identity shortcut that the gradient can always travel through, regardless of what the learned transformation does.

Terminal Questions: Chapter 10 (10 problems · no answers given)

Q1–3 are direct computations on the RNN equations. Q4–7 require connecting BPTT, vanishing gradients, and the additive update principle. Q8–10 are synthesis questions spanning this chapter and Ch9.

1
Count the total number of learnable parameters in a vanilla RNN with input size $d = 5$, hidden size $m = 32$, and output size $k = 3$. Show your working for each weight matrix and bias vector. (Easy)
2
For a scalar RNN $h_t = \tanh(w_h h_{t-1} + w_x x_t)$ with $w_h = 1.5$: (a) What is the maximum value of $|\tanh'(z)|$? (b) Compute the bound on $|\partial h_{10}/\partial h_1|$. (c) Does the gradient vanish or explode? Compare to the case $w_h = 0.5$. (Easy)
3
Perform one full forward pass of a vanilla RNN on the sequence $x = [1.0, -0.5, 0.8]$ with scalar hidden state, $w_h = 0.7$, $w_x = 0.4$, $b = 0$, $h_0 = 0$, $\sigma = \tanh$. Then write out the chain rule expression for $\partial h_3 / \partial h_1$ and compute it numerically. (Easy)
4
Explain precisely why BPTT treats an RNN as a deep feedforward network with shared weights. For an RNN unrolled over $\tau$ steps, what is the "depth" of this equivalent network? Why does sharing weights across layers create a different optimisation landscape compared to a feedforward network of the same depth with independent weights? (Medium)
5
Gradient clipping addresses the exploding gradient problem but not vanishing gradients. (a) Describe the gradient clipping procedure mathematically. (b) Explain why clipping prevents explosion. (c) Explain why clipping cannot fix vanishing: what would "reverse clipping" (amplifying small gradients) do, and why is it not done in practice? (Medium)
6
The additive update $h_t = \alpha_t \odot h_{t-1} + \beta_t \odot \tilde{h}_t$ has $\alpha_t, \beta_t$ as learnable functions of $(h_{t-1}, x_t)$. Write out a specific parametric form for $\alpha_t$ using a sigmoid output (so that $\alpha_t \in [0,1]^m$). Count the additional parameters this gating introduces compared to a vanilla RNN. How does this relate to the LSTM architecture? (Medium)
7
In a sequence-to-sequence NIFTY return prediction model, the RNN reads 20 daily returns and predicts the next day's return at each step. (a) Draw the computational graph showing inputs, hidden states, outputs, and loss terms at each step. (b) During BPTT, which loss term's gradient reaches $h_1$ with the least decay? (c) If you could only supervise at one time step, which would give the best training signal for learning long-range dependencies? (Medium)
8
Compare RNNs and CNNs as sequence models for a 100-step financial time series. For each, state: (a) how long-range dependencies are captured, (b) the computational complexity for a sequence of length $T$, (c) whether the model can handle variable-length sequences, (d) the primary training difficulty. Conclude with a recommendation for modelling high-frequency NIFTY tick data and justify it. (Hard)
9
The vanishing gradient problem in RNNs and the bias problem in Chapter 7 are both about information loss: one during training, one during inference. Draw a precise analogy: what plays the role of "hypothesis class complexity" in the RNN context, and what plays the role of "noise floor"? Is there an irreducible lower bound on how well an RNN can learn long-range dependencies, analogous to Bayes error? (Hard)
10
Design an RNN-based system to classify NSE intraday sessions as "trending" or "mean-reverting" using 5-minute OHLCV bars. Specify: input features at each time step, hidden state size, task formulation (sequence classification or sequence-to-sequence), loss function, and how you would address the vanishing gradient problem. Then identify two specific failure modes this system would have on NSE data that a human expert would not, and connect each failure mode to a mathematical property of the RNN discussed in this chapter. (Hard)