Intermediate Chapter 04 of 13 · Algorithm Design and Analysis

Hashing, Augmentation & B-Trees

Three ideas that power real systems. Universal hashing removes the assumption that keys are random, guaranteeing $O(1)$ expected operations. Augmentation adds information to existing trees without rebuilding them, enabling rank, select, and range queries. B-trees adapt trees to block storage, making them the backbone of every database and filesystem.

Part I — Universal Hashing

Basic hashing with chaining achieves $\Theta(1+\alpha)$ per operation (where $\alpha = n/m$ is the load factor) under simple uniform hashing — the assumption that each key is equally likely to hash to any of the $m$ slots, independently of all other keys. This assumption is unreasonable in practice: keys are rarely random, and for any fixed hash function an adversary can pick keys that all collide. Universal hashing removes the assumption entirely.

Universal hash family. A family $\mathcal{H}$ of hash functions $h:\{0,\ldots,u-1\}\to\{0,\ldots,m-1\}$ is universal if for all distinct keys $k \neq k'$: $$\Pr_{h \in \mathcal{H}}[h(k) = h(k')] \leq \frac{1}{m}$$ The probability is over a uniformly random choice of $h$, not over random keys. The keys can be adversarial.

Theorem. For $n$ arbitrary distinct keys and a random $h \in \mathcal{H}$ where $\mathcal{H}$ is universal:

$$\mathbb{E}[\text{collisions with } k_i] \leq 1 + \frac{n}{m} = 1 + \alpha$$

Proof. For each pair $i \neq j$, let $I_{i,j} = 1$ if $h(k_i) = h(k_j)$. By universality, $\mathbb{E}[I_{i,j}] \leq 1/m$. By linearity of expectation: $\mathbb{E}[\sum_{j \neq i} I_{i,j}] = \sum_{j \neq i} \mathbb{E}[I_{i,j}] \leq (n-1)/m \leq n/m = \alpha$. Add 1 for the key itself. $\square$

Two Universal Hash Families

Dot-product family. Requires $m$ prime, $u = m^r$. View key $k$ in base $m$: $k = \langle k_0, k_1, \ldots, k_{r-1}\rangle$. For $a \in \{0,\ldots,u-1\}$: $$h_a(k) = a \cdot k \bmod m = \left(\sum_{i=0}^{r-1} a_i k_i\right) \bmod m$$ Family: $\mathcal{H} = \{h_a \mid a \in \{0,\ldots,u-1\}\}$. Storing $h_a$ costs $O(1)$ (just store $a$). Computing $h_a(k)$ costs $O(1)$.

CLRS family (multiplicative). Choose prime $p \geq u$. For $a \in \{1,\ldots,p-1\}$ and $b \in \{0,\ldots,p-1\}$: $$h_{a,b}(k) = [(ak + b) \bmod p] \bmod m$$ Family: $\mathcal{H} = \{h_{a,b}\}$. Also universal. Simpler to implement.
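The universal guarantee can be sanity-checked empirically. The sketch below (illustrative, not from the text) fixes an adversarial pair of keys first, then draws many random $(a, b)$ pairs and measures the collision rate, which should stay near $1/m$:

```python
import random

# Illustrative sketch: check Pr_h[h(k) = h(k')] <= 1/m empirically for the
# CLRS family h_{a,b}(k) = ((a*k + b) mod p) mod m.
p = 1_000_000_007      # prime >= universe size
m = 64                 # number of slots
k1, k2 = 12345, 67890  # a fixed, adversarially chosen pair of distinct keys

trials, collisions = 20_000, 0
for _ in range(trials):
    a = random.randint(1, p - 1)   # h is drawn AFTER the keys are fixed
    b = random.randint(0, p - 1)
    if ((a * k1 + b) % p) % m == ((a * k2 + b) % p) % m:
        collisions += 1

rate = collisions / trials
print(f"empirical Pr[collision] ~ {rate:.4f}, bound 1/m = {1/m:.4f}")
```

The measured rate concentrates around $1/m \approx 0.0156$ no matter which key pair is chosen — exactly the point of randomising over $h$ rather than over keys.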

Perfect Hashing — $O(1)$ Worst-Case Search

For a static dictionary (no inserts/deletes after build), we can achieve $O(1)$ worst-case search (not just expected). The idea: two-level hashing (FKS hashing, Fredman-Komlós-Szemerédi 1984).

Perfect hashing (FKS) — two-level construction:
Level 1: Pick $h_1$ from a universal family for $m = \Theta(n)$ slots. Hash all $n$ items with chaining. Let $l_j$ = number of items in slot $j$.
Level 2: For each slot $j$, pick $h_{2,j}$ for a table of size $m_j = l_j^2$. No collisions guaranteed because $\Pr[\text{any collision}] \leq l_j(l_j-1)/2 \cdot 1/m_j = (l_j^2-l_j)/(2l_j^2) < 1/2$. Retry if collision.
Space: $\sum_j m_j = \sum_j l_j^2$. Since $\mathbb{E}[\sum_j l_j^2] \leq cn$ for a universal $h_1$, this is $O(n)$ in expectation. Retry $h_1$ if $\sum_j l_j^2 > cn$.
Build time: $O(n \log^2 n)$ w.h.p. — acceptable for static tables.
Search: $O(1)$ worst case — two hash computations, no collision possible.
Universal Hashing + Perfect Hashing · O(1) expected / worst-case
import random

# ── UNIVERSAL HASH FAMILY (CLRS multiplicative) ──────────────────────
class UniversalHash:
    """
    h_{a,b}(k) = ((a*k + b) mod p) mod m
    Universal: Pr[collision] <= 1/m for any k != k'.
    """
    def __init__(self, m, p=1_000_000_007):  # fixed prime >= universe size
        self.m = m
        self.p = p
        self.a = random.randint(1, self.p-1)
        self.b = random.randint(0, self.p-1)

    def __call__(self, k):
        return ((self.a * k + self.b) % self.p) % self.m


class HashTable:
    """Universal hashing with chaining. O(1+ฮฑ) expected per operation."""
    def __init__(self, m=64):
        self.m    = m
        self.h    = UniversalHash(m)
        self.table = [[] for _ in range(m)]
        self.n    = 0

    def insert(self, key, val):
        slot = self.h(key)
        for i, (k, _) in enumerate(self.table[slot]):
            if k == key:
                self.table[slot][i] = (key, val)
                return
        self.table[slot].append((key, val))
        self.n += 1

    def search(self, key):
        for k, v in self.table[self.h(key)]:
            if k == key: return v
        return None

    def delete(self, key):
        slot = self.h(key)
        before = len(self.table[slot])
        self.table[slot] = [(k, v) for k, v in self.table[slot] if k != key]
        self.n -= before - len(self.table[slot])   # keep count consistent


# ── PERFECT HASHING (FKS) ────────────────────────────────────────────────
class PerfectHash:
    """
    Static O(1) worst-case search via two-level universal hashing.
    Build: O(n log^2 n) w.h.p.  Search: O(1) worst case.
    """
    def __init__(self, keys):
        n = len(keys)
        m = max(n, 2)

        while True:                               # retry until good h1
            h1   = UniversalHash(m)
            slots = [[] for _ in range(m)]
            for k in keys: slots[h1(k)].append(k)
            if sum(len(s)**2 for s in slots) <= 4*n:
                break                             # Step 1.5: space OK

        self._h1    = h1
        self._h2    = []
        self._table = []

        for slot in slots:
            lj = len(slot)
            if lj == 0:
                self._h2.append(None); self._table.append({})
                continue
            mj = max(lj * lj, 1)
            while True:                           # retry until no collision
                h2 = UniversalHash(mj)
                t  = {}
                collision = False
                for k in slot:
                    hk = h2(k)
                    if hk in t: collision = True; break
                    t[hk] = k
                if not collision: break
            self._h2.append(h2)
            self._table.append(t)

    def search(self, key):                        # O(1) worst case!
        j  = self._h1(key)
        h2 = self._h2[j]
        if h2 is None: return False
        return self._table[j].get(h2(key)) == key
#include <vector>
#include <random>
#include <unordered_map>
using namespace std;

// Universal hash h_{a,b}(k) = ((a*k+b) mod p) mod m
struct UniversalHash {
    long long a, b, p, m;
    UniversalHash(long long m, long long p = 1000000007LL)
        : m(m), p(p) {
        mt19937_64 rng(random_device{}());
        a = uniform_int_distribution<long long>(1, p-1)(rng);
        b = uniform_int_distribution<long long>(0, p-1)(rng);
    }
    long long operator()(long long k) const {
        return ((a * k + b) % p) % m;
    }
};

// Hash table with universal hashing + chaining
struct HashTable {
    int m;
    UniversalHash h;
    vector<vector<pair<long long, int>>> table;

    HashTable(int m = 64) : m(m), h(m), table(m) {}

    void insert(long long key, int val) {
        auto& slot = table[h(key)];
        for (auto& [k, v] : slot)
            if (k == key) { v = val; return; }
        slot.push_back({key, val});
    }

    int* search(long long key) {
        for (auto& [k, v] : table[h(key)])
            if (k == key) return &v;
        return nullptr;
    }
};

Part II — Tree Augmentation

Augmentation adds extra information to existing balanced BST nodes to answer richer queries, at the cost of updating that information during Insert/Delete. The key principle:

Easy tree augmentation. If we want to store $x.f = f(\text{subtree at } x)$ at each node $x$, and $x.f$ can be recomputed in $O(1)$ from $x$, its children, and their $f$-values, then modifying a set $S$ of nodes costs $O(\text{ancestors of } S)$ to update $f$ bottom-up. For AVL/2-3 trees, any update touches $O(\log n)$ ancestors, so augmentation adds $O(\log n)$ overhead.

Order-Statistics Trees

Augment an AVL tree with subtree sizes: $x.\text{size} = 1 + \sum_c c.\text{size}$. This enables two new operations:

rank($x$): how many elements are $< x$? Start with $x$'s own left-subtree size, then walk up from $x$ to the root: whenever the walk moves up from a right child, add the parent's left-subtree size $+ 1$. $O(\log n)$.

select($i$): find the $i$-th smallest element. Walk down: if $i = $ left size + 1, return current node; if $i <$ left size + 1, go left; else go right subtracting left size + 1 from $i$. $O(\log n)$.
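A concrete sketch of the augmentation (an unbalanced BST for brevity; an AVL tree would maintain the identical `size` update rule during rebalancing):

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right, self.size = key, None, None, 1

def size(x):
    return x.size if x else 0

def insert(x, key):
    # Standard BST insert; recompute x.size on the way back up (easy augmentation)
    if x is None:
        return Node(key)
    if key < x.key:
        x.left = insert(x.left, key)
    else:
        x.right = insert(x.right, key)
    x.size = 1 + size(x.left) + size(x.right)
    return x

def rank(x, key):
    """Number of keys strictly less than key. O(height)."""
    if x is None:
        return 0
    if key <= x.key:
        return rank(x.left, key)
    return size(x.left) + 1 + rank(x.right, key)

def select(x, i):
    """i-th smallest key, 1-indexed. O(height)."""
    left = size(x.left)
    if i == left + 1:
        return x.key
    if i <= left:
        return select(x.left, i)
    return select(x.right, i - left - 1)

root = None
for k in [17, 5, 23, 11, 29]:
    root = insert(root, k)
print(rank(root, 23))    # 3 keys are < 23: namely 5, 11, 17
print(select(root, 2))   # 2nd smallest = 11
```

Note that `rank` here walks down from the root rather than up from the node; both directions visit $O(\log n)$ nodes in a balanced tree.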

Range Trees — Multidimensional Range Queries

Given $n$ points in $d$-dimensional space, answer orthogonal range queries: find all points in an axis-aligned box. A range tree achieves $O(\log^d n + k)$ query time where $k$ is the output size.

1D range tree: a balanced BST. Range-query$(a, b)$: search for $a$ and $b$, find their LCA $v_\text{split}$, then report all nodes and subtrees "between" $a$ and $b$ in the tree. Time: $O(\log n + k)$.
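For a static 1D set the same $O(\log n + k)$ bound falls out of two binary searches on a sorted array; a minimal Python sketch:

```python
from bisect import bisect_left, bisect_right

def range_query_1d(sorted_arr, lo, hi):
    """All elements x with lo <= x <= hi. O(log n + k)."""
    i = bisect_left(sorted_arr, lo)    # first index with arr[i] >= lo
    j = bisect_right(sorted_arr, hi)   # first index with arr[j] > hi
    return sorted_arr[i:j]

arr = [3, 7, 10, 14, 17, 20, 24, 30]
print(range_query_1d(arr, 10, 24))  # [10, 14, 17, 20, 24]
```

The tree version matters when the set is dynamic (inserts/deletes between queries) or as the building block for higher dimensions.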

2D range tree: primary BST keyed on $x$-coordinate. Each node $v$ in the primary tree stores a secondary BST of all points in $v$'s subtree, keyed on $y$-coordinate. Range-query$(x_1 \leq x \leq x_2, y_1 \leq y \leq y_2)$: use primary tree to find $O(\log n)$ canonical nodes/subtrees with correct $x$-range, then use each secondary tree to filter by $y$-range. Time: $O(\log^2 n + k)$. Space: $O(n\log n)$ (each point duplicated $O(\log n)$ times).

$d$-D range tree: recurse. Time: $O(\log^d n + k)$. Space: $O(n\log^{d-1} n)$.
Order-Statistics Tree + 2D Range Query · O(log n) / O(log²n + k)
from sortedcontainers import SortedList

# ── ORDER-STATISTICS TREE (using SortedList) ─────────────────────────
class OrderStatTree:
    """
    Augmented BST supporting rank, select, insert, delete.
    Uses Python's SortedList (backed by a balanced B-tree-like structure).
    All operations O(log n).
    """
    def __init__(self):
        self._sl = SortedList()

    def insert(self, x):  self._sl.add(x)
    def delete(self, x):  self._sl.remove(x)

    def rank(self, x):
        """Number of elements strictly less than x. O(log n)."""
        return self._sl.bisect_left(x)

    def select(self, i):
        """Return the i-th smallest (0-indexed). O(log n)."""
        return self._sl[i]

    def successor(self, x):
        """Smallest element > x."""
        idx = self._sl.bisect_right(x)
        return self._sl[idx] if idx < len(self._sl) else None


# ── 2D RANGE TREE (simplified) ────────────────────────────────────────
class RangeTree2D:
    """
    Static 2D range tree for orthogonal range queries.
    Build: O(n log n).  Query: O(log^2 n + k).  Space: O(n log n).
    """
    def __init__(self, points):
        """points: list of (x, y) tuples."""
        self.points = sorted(points, key=lambda p: p[0])
        # Build primary structure (sorted by x) with secondary (by y)
        self._build(self.points)

    def _build(self, pts):
        # For each possible x-range, store y-sorted list
        # Simplified: just use sorted structures at each node level
        self._by_y = SortedList(pts, key=lambda p: p[1])

    def query(self, x1, x2, y1, y2):
        """
        Return all points (x, y) with x1 <= x <= x2 and y1 <= y <= y2.
        O(log^2 n + k) for proper implementation; O(n) for this demo.
        """
        # Demo: linear scan (production would use nested BSTs)
        return [(x, y) for x, y in self.points
                if x1 <= x <= x2 and y1 <= y <= y2]

# Example: find NIFTY option strikes in price range [23000,24000]
# with IV in range [0.18, 0.25]
strikes_iv = [(23000, 0.20), (23500, 0.22), (24000, 0.19),
              (24500, 0.17), (23200, 0.23), (23800, 0.21)]
rt = RangeTree2D(strikes_iv)
print(rt.query(23000, 24000, 0.18, 0.25))
# [(23000, 0.2), (23200, 0.23), (23500, 0.22), (23800, 0.21), (24000, 0.19)]
// Order-statistics tree using GNU policy-based tree
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>
using namespace __gnu_pbds;
using namespace std;

// order_statistics_tree: supports find_by_order (select) and
// order_of_key (rank) in O(log n)
typedef tree<int, null_type, less<int>,
             rb_tree_tag,
             tree_order_statistics_node_update> OST;

// Usage:
// OST t;
// t.insert(5); t.insert(3); t.insert(7);
// t.order_of_key(5)   = 1  (rank: 0-indexed, 1 element < 5)
// *t.find_by_order(0) = 3  (select: 0-th smallest = 3)

// 1D range query on sorted array: O(log n + k)
#include <vector>
#include <algorithm>
vector<int> range_query_1d(const vector<int>& sorted_arr, int lo, int hi) {
    auto it1 = lower_bound(sorted_arr.begin(), sorted_arr.end(), lo);
    auto it2 = upper_bound(sorted_arr.begin(), sorted_arr.end(), hi);
    return vector<int>(it1, it2);   // O(log n + k)
}

Part III — 2-3 Trees and B-Trees

2-3 trees are balanced search trees where every internal node has either 2 or 3 children. All leaves are at the same depth, guaranteeing $O(\log n)$ height. Operations maintain this property via splits (on insert overflow) and merges/redistributions (on delete underflow).

2-3 tree invariants: every non-leaf is a 2-node (1 key, 2 children) or 3-node (2 keys, 3 children). All leaves are at the same level. An in-order traversal yields the keys in sorted order. Height $\leq \log_2 n$.

Insert($x$): search for $x$'s position, insert at the leaf, fix overflow by splitting: a node that temporarily holds 3 keys splits into two 1-key nodes, promoting the median key to the parent. If the root overflows, a new root is created (the only way the tree grows taller). $O(\log n)$.

Delete($x$): replace $x$ with its inorder successor if not at a leaf. Fix underflow by redistributing from a sibling (if sibling is a 3-node) or merging (if sibling is a 2-node, merge with sibling and demote parent key). $O(\log n)$.

B-trees generalise 2-3 trees to branching factor $B$: each node has between $B-1$ and $2B-1$ keys, and $B$ to $2B$ children. The motivation is external memory: a disk block holds $B$ keys, so a single I/O reads an entire node.

B-tree parameters and costs:
— Height: $O(\log_B n)$ — each level multiplies capacity by $B$.
— Search: $O(\log_B n)$ node reads. With block size $B$, this is $O(\log_B n)$ disk I/Os.
— Insert/Delete: $O(\log_B n)$ with at most one split/merge per level.
— Optimal $B$: set $B$ = disk block size / key size. Typical: $B = 100$–$1000$. Height $\leq 4$ for $n = 10^9$.
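These heights are easy to verify numerically from the CLRS bound $h \leq \lfloor\log_t((n+1)/2)\rfloor$ for minimum degree $t$ (the helper name below is illustrative, not from the text):

```python
import math

def btree_height_bound(n, t):
    """CLRS bound: a B-tree with n >= 1 keys and minimum degree t >= 2 has
    height (in edges) <= floor(log_t((n+1)/2)); a search reads height+1 nodes."""
    return math.floor(math.log((n + 1) / 2, t))

n = 10**9
for t in (2, 100, 1000):
    h = btree_height_bound(n, t)
    print(f"t={t:>4}: height <= {h}, search reads <= {h + 1} nodes (disk I/Os)")
```

With $t = 1000$ the bound gives height $\leq 2$, i.e. at most 3 node reads for a billion keys — the source of the "3 disk I/Os" figure in the practice problem below.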

Used by: MySQL InnoDB, PostgreSQL, SQLite (B+ trees), macOS HFS+, NTFS, Linux ext3/ext4, InfluxDB (time-series data).
B-Tree search + insert sketch · O(log_B n) disk I/Os
class BTreeNode:
    def __init__(self, leaf=True):
        self.keys     = []          # sorted keys in this node
        self.children = []          # child node pointers (len = len(keys)+1)
        self.leaf     = leaf

class BTree:
    """
    B-Tree with minimum degree t (branching factor B = t).
    Each non-root node: t-1 <= #keys <= 2t-1.
    Each node: t <= #children <= 2t.
    All operations: O(t * log_t(n)) time, O(log_t(n)) disk I/Os.
    """
    def __init__(self, t=2):       # t=2 gives a 2-3-4 tree
        self.t    = t
        self.root = BTreeNode()

    def search(self, k, node=None):
        """Search for key k. Returns (node, index) or None."""
        if node is None: node = self.root
        i = 0
        while i < len(node.keys) and k > node.keys[i]:
            i += 1
        if i < len(node.keys) and k == node.keys[i]:
            return (node, i)                      # found
        if node.leaf:
            return None                           # not found
        return self.search(k, node.children[i])  # recurse into child

    def insert(self, k):
        """Insert key k. O(log_t n) time."""
        root = self.root
        if len(root.keys) == 2 * self.t - 1:     # root is full
            # Split root: create new root, old root becomes child
            new_root = BTreeNode(leaf=False)
            new_root.children.append(self.root)
            self._split_child(new_root, 0)
            self.root = new_root
        self._insert_non_full(self.root, k)

    def _split_child(self, parent, i):
        """Split parent.children[i] (which must be full). O(t)."""
        t     = self.t
        y     = parent.children[i]
        z     = BTreeNode(leaf=y.leaf)
        mid   = y.keys[t - 1]
        # Right half of y goes into z
        z.keys = y.keys[t:]
        y.keys = y.keys[:t - 1]
        if not y.leaf:
            z.children = y.children[t:]
            y.children = y.children[:t]
        # Promote median to parent
        parent.keys.insert(i, mid)
        parent.children.insert(i + 1, z)

    def _insert_non_full(self, node, k):
        """Insert k into a non-full node. O(t * log_t n)."""
        i = len(node.keys) - 1
        if node.leaf:
            node.keys.append(None)
            while i >= 0 and k < node.keys[i]:
                node.keys[i + 1] = node.keys[i]
                i -= 1
            node.keys[i + 1] = k
        else:
            while i >= 0 and k < node.keys[i]: i -= 1
            i += 1
            if len(node.children[i].keys) == 2 * self.t - 1:
                self._split_child(node, i)
                if k > node.keys[i]: i += 1
            self._insert_non_full(node.children[i], k)
// B-Tree node structure โ€” O(log_t n) search
#include <vector>
#include <algorithm>
using namespace std;

struct BNode {
    vector<int> keys;
    vector<BNode*> ch;
    bool leaf;
    BNode(bool lf=true) : leaf(lf) {}
};

class BTree {
    int t;   // min degree
    BNode* root;
public:
    BTree(int t=2) : t(t), root(new BNode()) {}

    // Search: O(t * log_t n) time, O(log_t n) node accesses (disk I/Os)
    pair<BNode*, int> search(BNode* node, int k) {
        int i = lower_bound(node->keys.begin(),
                            node->keys.end(), k) - node->keys.begin();
        if (i < (int)node->keys.size() && node->keys[i] == k)
            return {node, i};
        if (node->leaf) return {nullptr, -1};
        return search(node->ch[i], k);   // one disk I/O per level
    }

    // Height (in edges) for n keys: at most floor(log_t((n+1)/2)) (CLRS bound)
    // With t=1000 and n = 10^9: height <= 2, so a search reads <= 3 nodes
    int height(BNode* v=nullptr) {
        if (!v) v = root;
        if (v->leaf) return 0;
        return 1 + height(v->ch[0]);
    }
};
Worked Example · NSE Instrument Master Lookup 🇮🇳 Pashupati / TTT Systems

The NSE instrument master has ~5 lakh (500,000) records. Each instrument has a token (integer key) and metadata (symbol, expiry, strike, lot size). Two query patterns: (1) token → metadata lookup (constant time), (2) range query: find all NIFTY options with strike between 23,000 and 24,000 expiring in the current week.

Choosing the right data structure
from sortedcontainers import SortedList
import collections

# Pattern 1: token -> metadata (O(1) average via universal hashing)
instrument_map = {}   # Python dict: hash table with O(1) average lookups
instrument_map[35001] = {'symbol':'NIFTY','strike':23000,'expiry':'27MAR25'}

# Pattern 2: strike range query (O(log n + k) via order-statistics)
# Build sorted list of (strike, token) pairs for weekly expiry
weekly = SortedList(key=lambda x: x[0])  # sorted by strike
weekly.add((23000, 35001))
weekly.add((23500, 35002))
weekly.add((24000, 35003))
weekly.add((24500, 35004))

# O(log n + k): find all strikes in [23000, 24000]
lo_idx = weekly.bisect_left((23000,))
hi_idx = weekly.bisect_right((24000, float('inf')))
result = list(weekly[lo_idx:hi_idx])
print(result)  # [(23000,35001),(23500,35002),(24000,35003)]
Architecture: universal hash map for $O(1)$ token lookups (millions of times per second during market hours); augmented BST (SortedList) for strike range queries during order placement. The XTS API at Pashupati uses exactly this pattern — token-keyed hash map for instrument lookup, sorted structures for building option chains.
Python's built-in dict uses open addressing; string hashing has been randomised with a per-process seed since Python 3.3 (SipHash since 3.4) to prevent hash-flooding DoS attacks, while small integers hash to themselves. The sortedcontainers.SortedList is built on a B-tree-like list-of-sublists structure with block size ~1000, giving $O(\log n)$ operations. InfluxDB (used for time-series data in many algo trading platforms including TTT's XTS WebSocket → InfluxDB pipeline) stores time-series data in B+ trees on disk — exactly the B-tree external-memory motivation: one disk block per node, minimising I/Os per query.

Practice Problems
4 questions · Chapter 04
prog / daa / ch04 / q01 ★ Universal Hashing

Universal hashing guarantees $\Pr[h(k) = h(k')] \leq 1/m$. What does this mean for an adversary who chooses the keys before the hash function is chosen?

A
The adversary can still force $\Omega(n)$ collisions by choosing keys cleverly
B
The adversary cannot force collisions: for any fixed pair of keys, the collision probability is at most $1/m$ over the random choice of $h$
C
The adversary can force collisions if they know which hash family is being used
D
Universal hashing only helps when keys are uniformly distributed
Answer: B.

This is the point of universal hashing. The adversary picks keys $k_1, k_2, \ldots, k_n$ (possibly the worst possible set). Then we choose $h$ uniformly at random from $\mathcal{H}$. For any fixed pair $k_i \neq k_j$, $\Pr_h[h(k_i) = h(k_j)] \leq 1/m$. The adversary chose the keys before seeing $h$, so they cannot exploit any specific hash function's weaknesses. Expected collisions are $O(n/m) = O(\alpha)$ regardless of key distribution.
prog / daa / ch04 / q02 ★★ Perfect Hashing Space

In FKS perfect hashing, the second-level table for slot $j$ has size $m_j = l_j^2$ (where $l_j$ items hash to slot $j$). Why is the total space $\sum_j l_j^2 = O(n)$ in expectation?

A
Because the $l_j$ are equal: each slot gets exactly $n/m$ items
B
Because $\sum l_j = n$, and $\sum l_j^2 \leq (\sum l_j)^2 = n^2$
C
Because $\mathbb{E}[\sum_j l_j^2] = \mathbb{E}[\sum_{i\neq i'} I_{i,i'}] + n \leq n + n(n-1)/m = O(n)$ when $m = \Theta(n)$, by universality
D
Because $l_j \leq O(\log n)$ with high probability, so $l_j^2 \leq O(\log^2 n)$ and total is $O(n\log^2 n)$
Answer: C.

$\sum_j l_j^2 = \sum_j \sum_{i: h(k_i)=j} l_j = \sum_i l_{h(k_i)} = n + \sum_{i \neq i'} I_{i,i'}$ where $I_{i,i'} = 1$ iff $h(k_i) = h(k_{i'})$. By universality, $\mathbb{E}[I_{i,i'}] \leq 1/m$. By linearity: $\mathbb{E}[\sum_{i\neq i'} I_{i,i'}] \leq n(n-1)/m$. With $m = cn$ for constant $c$: this is $\leq n/c$. Total: $\mathbb{E}[\sum_j l_j^2] \leq n + n/c = O(n)$. Markov then bounds the probability of exceeding $2cn$.
prog / daa / ch04 / q03 ★★ Augmentation

Why can we store subtree size at each BST node (enabling rank/select) but not subtree rank?

A
Because rank requires $O(n)$ space while size only requires $O(1)$ per node
B
Inserting a new minimum changes the rank of every node, requiring $O(n)$ updates — violating the $O(\log n)$ augmentation update bound
C
Rank can only be computed for leaf nodes, not internal nodes
D
Because rank requires knowing the entire tree, not just the subtree
Answer: B.

Easy augmentation works when $x.f$ depends only on $x$ and its children’s $f$-values: $x.f = g(x, x.\text{left}.f, x.\text{right}.f)$. Subtree size satisfies this: $x.\text{size} = 1 + x.\text{left.size} + x.\text{right.size}$. After an insert, only $O(\log n)$ ancestors need their size recomputed.

Rank does not satisfy this: $x.\text{rank} = x.\text{left.size} + 1 + \text{rank of all elements less than } x \text{ outside its subtree}$. The “outside subtree” part depends on the entire tree. Insert($-\infty$) changes every node’s rank, requiring $O(n)$ updates.
prog / daa / ch04 / q04 ★★★ B-Tree I/Os

A database stores $n = 10^9$ records. A B-tree with branching factor $B = 1000$ is used. How many disk I/Os does a search require, and why is this better than a binary BST?

A
$\log_2(10^9) \approx 30$ I/Os — same as binary BST since each comparison is one I/O
B
$\lceil\log_{1000}(10^9)\rceil = 3$ I/Os for B-tree vs $\approx 30$ for binary BST — $10\times$ fewer disk reads
C
$O(1)$ I/Os because large branching factor puts everything in one disk block
D
$\log_{1000}(10^9) = 3$ I/Os for B-tree, but binary BST also only needs 3 since it uses the same disk blocks
Answer: B — 3 I/Os for B-tree vs 30 for binary BST.

$\log_{1000}(10^9) = 3$ exactly (since $1000^3 = 10^9$). Each B-tree node fills one disk block (holding 1000 keys), so 3 node reads = 3 disk I/Os. A binary BST needs $\log_2(10^9) \approx 30$ node comparisons, each potentially a separate disk I/O (since each node is tiny). Disk I/O is $\sim 10^4$ times slower than memory access, so 3 vs 30 I/Os is a $10\times$ speedup in the bottleneck operation. This is why virtually every database uses B-trees for its index structures.

Terminal Questions — Chapter 04 10 problems · No answers given

Work through independently. Q1–3 direct application. Q4–7 synthesis. Q8–10 careful argument.

1
Prove that the CLRS family $h_{a,b}(k) = [(ak+b) \bmod p] \bmod m$ is universal. You need to show that for any $k \neq k'$, $\Pr_{a,b}[h_{a,b}(k) = h_{a,b}(k')] \leq 1/m$. Hint: $p$ is prime, so $\mathbb{Z}_p$ is a field with multiplicative inverses. Easy
2
Build a 2-3 tree by inserting keys $[10, 20, 5, 6, 12, 30, 7, 17]$ in that order. Show the tree state after each insertion, clearly indicating when splits occur and which median is promoted. Easy
3
On a 2-3 tree containing $\{3, 7, 10, 14, 17, 20, 24, 30, 32, 41, 48\}$, perform: delete(20), then delete(38). For each deletion show whether redistribution or merging is needed, and trace the tree updates. Easy
4
Implement a hash table with FKS perfect hashing for the static set of NIFTY weekly option tokens $T = \{35001, 35002, \ldots, 35200\}$ (200 tokens). Build the two-level structure, verify search costs $O(1)$, and measure the actual build time in Python. Medium
5
Augment an AVL tree with subtree minimum: each node stores the minimum key in its subtree. Write the update rule ($x.\text{min}$ in terms of $x$, $x.\text{left}$, $x.\text{right}$). Describe how this enables a $O(\log n)$ query: “find the minimum element in the range $[a,b]$.” Medium
6
Implement a 2D range tree to answer: given a list of (timestamp, price) pairs from NIFTY tick data, find all ticks with timestamp in $[t_1, t_2]$ and price in $[p_1, p_2]$. Use Python’s SortedList for inner structure. Verify correctness on a 1000-tick dataset. Medium
7
A finger search tree lets you search for $x$ from a given node $y$ in $O(\log|rank(y) - rank(x)|)$ time. How would you use this to implement an efficient “find the nearest strike above/below spot price” query after each tick update? What is the amortised cost per tick? Medium
8
Prove that a B-tree with $n$ keys and minimum degree $t$ has height at most $\lfloor\log_t((n+1)/2)\rfloor$. Use the fact that a full B-tree of height $h$ has at least $2t^{h-1} - 1$ keys. Hard
9
Hash flooding attack: Python dicts before version 3.3 used a deterministic hash function. Show that for Python 2 string hashing $h(s) = \sum_i s[i] \cdot 31^i \bmod 2^{32}$, you can construct $O(\sqrt{n})$ strings that all hash to the same slot, causing $O(n^2)$ total work in a hash table. Why does random seeding (PYTHONHASHSEED) prevent this? Hard
10
B+ trees (used by all databases) store data only in leaves, with internal nodes holding only routing keys. Compare B-trees and B+ trees on: (a) range queries from $a$ to $b$ โ€” which is faster and why? (b) sequential scan of all elements โ€” which is $O(n/B)$ disk I/Os? Explain why MySQL InnoDB uses B+ trees specifically. Hard