Foundational Chapter 04 of 13 · Algorithms and Data Structures

Hashing

We proved in Chapter 3 that no comparison-based algorithm can find a key faster than $\Theta(\log n)$. This chapter breaks that barrier — not by being cleverer about comparisons, but by abandoning them entirely.

The Comparison Lower Bound — Why $\Theta(\log n)$ Is the Ceiling

Any deterministic comparison search algorithm can be modelled as a binary decision tree. Each internal node is a comparison between two items — the outcome is True or False, so the tree branches in exactly two directions. Each leaf represents a distinct output: either "found item $x$" or "not found." With $n$ items stored, there are $n + 1$ possible outputs, so the decision tree needs at least $n + 1$ leaves.

A binary tree with at least $n + 1$ leaves has height at least $\lceil \log_2(n+1) \rceil = \Omega(\log n)$, since a tree of height $h$ has at most $2^h$ leaves. The height is the length of the longest root-to-leaf path — which is the worst-case number of comparisons. Therefore:
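
The bound can be checked concretely (a quick sketch; the numbers are illustrative):

```python
from math import ceil, log2

# With n = 1000 stored items there are n + 1 = 1001 distinct outputs,
# so the decision tree needs >= 1001 leaves and height >= ceil(log2(1001)).
n = 1000
min_comparisons = ceil(log2(n + 1))
print(min_comparisons)   # 10 — binary search on a sorted array achieves exactly this
```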

$$\text{Any comparison search algorithm requires } \Omega(\log n) \text{ time in the worst case.}$$

Sorted arrays and balanced BSTs achieve this bound. But what if we are not limited to comparisons? The Word-RAM model has one operation with super-constant branching factor: random memory access. Given an address, we can read the value there in $\Theta(1)$ — branching factor equal to the size of memory itself. This is the doorway hashing walks through.

Direct Access Arrays — $\Theta(1)$ Find, Catastrophic Space

The simplest exploitation of random access: give every item an integer key $k \in \{0, \ldots, u-1\}$ and store item $x$ at index $x.\text{key}$ in an array of length $u$. Finding any item is exactly one array access — $\Theta(1)$. No comparisons at all.

DirectAccessArray · O(1) find, O(u) space
class DirectAccessArray:
    def __init__(self, u):          # O(u) space — the problem
        self.A = [None] * u

    def find(self, k):              # O(1) — just index
        return self.A[k]

    def insert(self, x):            # O(1)
        self.A[x.key] = x

    def delete(self, k):            # O(1)
        self.A[k] = None

    def find_next(self, k):         # O(u) — must scan
        for i in range(k, len(self.A)):
            if self.A[i] is not None:
                return self.A[i]

    def find_max(self):             # O(u) — must scan
        for i in range(len(self.A) - 1, -1, -1):
            if self.A[i] is not None:
                return self.A[i]

# Problem: if keys are 10-letter strings, u = 26^10 ≈ 1.4 × 10^14
# Storing even 1 bit per slot = 17.6 TB. Completely unusable.
C++ equivalent · DirectAccessArray
#include <vector>
#include <optional>
using namespace std;

template<typename T>
class DirectAccessArray {
    vector<optional<T>> A;
public:
    DirectAccessArray(int u) : A(u) {}  // O(u) space — the problem

    optional<T> find(int k)  { return A[k]; }     // O(1)
    void insert(int k, T x)  { A[k] = x; }        // O(1)
    void remove(int k)       { A[k] = nullopt; }   // O(1)

    // Order operations require O(u) scan — no structure to exploit
    optional<T> find_max() {
        for (int i = A.size()-1; i >= 0; --i)
            if (A[i]) return A[i];
        return nullopt;
    }
};

The space cost is $\Theta(u)$ — the size of the entire key universe, regardless of how many items you actually store. If $n \ll u$, this is catastrophically wasteful. For NSE instrument tokens (32-bit integers), $u = 2^{32} \approx 4 \times 10^9$. Allocating that array takes 16 GB for 4-byte pointers, for storing perhaps 9,000 instruments. We need a smarter approach.
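
The arithmetic behind that claim, as a quick check:

```python
u = 2**32                    # key universe: 32-bit instrument tokens
slot_bytes = 4               # one 4-byte pointer per slot
total_gib = u * slot_bytes / 2**30
print(total_gib)             # 16.0 GiB allocated up front

used_fraction = 9_000 / u    # fraction of slots actually occupied
# roughly 2 in a million slots hold anything — the rest is wasted
```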

Hashing — Compressing the Key Space

The fix: instead of an array of size $u$, use an array of size $m = \Theta(n)$ — just enough slots for the items you actually store. Map each key $k$ to a slot via a hash function $h : \{0, \ldots, u-1\} \to \{0, \ldots, m-1\}$.

Hash function. A function $h(k) : \{0, \ldots, u-1\} \to \{0, \ldots, m-1\}$ mapping item keys to array indices. The array is called a hash table, and $h(k)$ is the hash of key $k$.

Since $m \ll u$, by the pigeonhole principle, no hash function can be injective over all possible keys — there must exist keys $a \neq b$ with $h(a) = h(b)$. This is called a collision.
Think of a hash function as a filing system for a cabinet with only $m$ drawers, but $u$ possible documents. No matter how cleverly you label the drawers, some documents must share a drawer. The question is not whether collisions happen — they always do — but how to handle them efficiently when they do.
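
The pigeonhole argument is easy to observe directly: with more keys than slots, some slot must hold at least $\lceil n/m \rceil$ keys no matter which hash function is used. A small sketch using Python's built-in hash:

```python
m = 10                               # 10 drawers
keys = range(25)                     # 25 documents
slots = {}
for k in keys:
    slots.setdefault(hash(k) % m, []).append(k)

# By pigeonhole, the fullest slot holds at least ceil(25/10) = 3 keys.
fullest = max(len(chain) for chain in slots.values())
assert fullest >= 3
```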

Collision Resolution — Chaining

When two keys hash to the same slot, we need somewhere to put the second item. The cleanest solution: at each slot, store not a single item but a chain — a small data structure (typically a linked list) holding all items that hash to that slot.

To find key $k$: compute $h(k)$, go to slot $h(k)$, then search the chain there for $k$. If chains are short, all operations are $O(1)$ plus the chain search cost. The central question is: how long are the chains?

If $n$ items are distributed roughly evenly across $m = \Theta(n)$ slots, each chain has expected length $n/m = \Theta(1)$. The ratio $\alpha = n/m$ is called the load factor. As long as $\alpha = O(1)$, all operations run in $O(1)$ expected time.

What Makes a Good Hash Function?

Division method (bad in general): $h(k) = k \bmod m$. Simple and fast. Works well when keys are uniformly distributed. Terrible when keys share structure — if all instrument tokens are multiples of some number that divides $m$, they all collide. A determined adversary who knows your hash function can construct inputs that all hash to the same slot, forcing $\Theta(n)$ chains.
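
The adversarial failure mode is trivial to reproduce (a sketch, with $m = 100$ as in the text):

```python
m = 100
adversarial_keys = [100 * i for i in range(200)]   # every key a multiple of m

slots = [k % m for k in adversarial_keys]          # division method h(k) = k mod m
assert set(slots) == {0}   # every key lands in slot 0: one chain of length 200
```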

The fix: do not use a fixed hash function. Choose one randomly from a large family of functions. Then no adversary can predict which function you are using.

Universal hash family. A family of hash functions $\mathcal{H}$ is universal if for any two distinct keys $k_i \neq k_j$: $$\Pr_{h \in \mathcal{H}}[h(k_i) = h(k_j)] \leq \frac{1}{m}$$ The probability is over the random choice of $h$ from $\mathcal{H}$, not over the keys. Keys can be adversarial — the randomness is ours.

One concrete universal family: $h_{ab}(k) = ((ak + b) \bmod p) \bmod m$, where $p$ is a prime larger than $u$, and $a, b$ are chosen uniformly at random from $\{0, \ldots, p-1\}$ with $a \neq 0$.
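
The $1/m$ bound can be sanity-checked empirically by drawing many random $(a, b)$ pairs (a sketch with toy parameters $p = 1009$, $m = 11$; the specific keys are arbitrary):

```python
import random
random.seed(0)                       # seeded for reproducibility

p, m = 1009, 11                      # prime p, table size m (toy values)
k1, k2 = 42, 742                     # any two distinct keys < p

def h_ab(k, a, b):
    return ((a * k + b) % p) % m

trials = 20_000
hits = sum(
    h_ab(k1, a, b) == h_ab(k2, a, b)
    for a, b in ((random.randint(1, p - 1), random.randint(0, p - 1))
                 for _ in range(trials))
)
rate = hits / trials                 # empirical collision rate over random h
# universality guarantees Pr[collision] <= 1/m ~ 0.091
```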

Why Universality Implies Short Chains — The Probability Argument

Fix a key $k_i$ and draw $h$ at random from a universal family. How many other keys collide with $k_i$? For each other key $k_j$, define the indicator random variable:

$$X_{ij} = \begin{cases} 1 & \text{if } h(k_i) = h(k_j) \\ 0 & \text{otherwise} \end{cases}$$

The chain length at slot $h(k_i)$ is $X_i = \sum_{j} X_{ij}$. By linearity of expectation:

$$\mathbb{E}[X_i] = 1 + \sum_{j \neq i} \mathbb{E}[X_{ij}] = 1 + \sum_{j \neq i} \Pr[h(k_i) = h(k_j)] \leq 1 + (n-1) \cdot \frac{1}{m}$$

Since $m = \Omega(n)$, the load factor $\alpha = n/m = O(1)$, so:

$$\mathbb{E}[X_i] \leq 1 + \frac{n-1}{m} = 1 + O(1) = O(1)$$

Expected chain length is $O(1)$ — regardless of the input keys, as long as $h$ is chosen randomly from a universal family. All hash table operations run in $O(1)$ expected time.
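
A quick simulation of this bound (a sketch; the keys and multiplier are arbitrary, seeded for reproducibility):

```python
import random
random.seed(1)

n, p = 1_000, 2**31 - 1
m = 2 * n                                  # load factor alpha = 0.5
a = random.randint(1, p - 1)               # fresh random multiplier per run
keys = random.sample(range(10**9), n)      # any n distinct keys

chains = [0] * m
for k in keys:
    chains[((a * k) % p) % m] += 1         # tally chain lengths

mean_len = sum(chains) / m                 # exactly n/m = 0.5 by construction
# the longest chain stays short in practice, far below n
```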

Hash_Table_Set · chaining with universal hashing
from random import randint

class Hash_Table_Set:
    def __init__(self):
        self.A      = []          # array of chains (linked lists)
        self.size   = 0           # number of items stored
        self.p      = 2**31 - 1  # Mersenne prime (assumes keys < p)
        self.a      = randint(1, self.p - 1)  # random multiplier
        self._resize(0)

    def _hash(self, k, m):        # universal hash: O(1)
        return ((self.a * k) % self.p) % m

    def _resize(self, n):         # O(n) — rebuild with new size
        if n == 0 or not (len(self.A) // 4 <= n <= len(self.A)):
            m = max(n * 2, 1)
            new_A = [[] for _ in range(m)]  # new array of empty chains
            for chain in self.A:
                for x in chain:
                    new_A[self._hash(x.key, m)].append(x)
            self.A = new_A

    def find(self, k):            # O(1) expected
        h = self._hash(k, len(self.A))
        for x in self.A[h]:      # scan chain at slot h
            if x.key == k:
                return x
        return None

    def insert(self, x):          # O(1) amortised expected
        self._resize(self.size + 1)
        h = self._hash(x.key, len(self.A))
        for item in self.A[h]:   # check for existing key
            if item.key == x.key:
                item.val = x.val  # update
                return
        self.A[h].append(x)      # append to chain at slot h
        self.size += 1

    def delete(self, k):          # O(1) amortised expected
        h = self._hash(k, len(self.A))
        for i, x in enumerate(self.A[h]):
            if x.key == k:
                self.A[h].pop(i)
                self.size -= 1
                self._resize(self.size)
                return x
        raise KeyError(k)

    # Order operations — O(n), no structure to exploit
    def find_min(self):
        return min((x for chain in self.A for x in chain),
                   key=lambda x: x.key, default=None)

    def find_max(self):
        return max((x for chain in self.A for x in chain),
                   key=lambda x: x.key, default=None)
C++ equivalent · Hash_Table_Set
#include <vector>
#include <list>
#include <optional>
#include <random>
using namespace std;

// Hash table with chaining and universal hash family
// h_a(k) = (a*k mod p) mod m,  p = 2^31 - 1
template<typename K, typename V>
class Hash_Table_Set {
    using KV = pair<K, V>;
    vector<list<KV>> A;
    int       sz = 0;
    long long p  = 2147483647LL;  // Mersenne prime 2^31 - 1
    long long a;                  // random multiplier

    // long long is required: a*k can reach ~2^62, which a 32-bit long
    // (e.g. on Windows/LLP64) would overflow
    int hash(K k, int m) {
        return (int)(((a * (long long)k) % p) % m);
    }

    void resize(int n) {
        if (A.empty() || n < (int)A.size()/4 || n >= (int)A.size()) {
            int m = max(n * 2, 1);
            vector<list<KV>> B(m);
            for (auto& chain : A)
                for (auto& [k, v] : chain)
                    B[hash(k, m)].push_back({k, v});
            A = move(B);
        }
    }

public:
    Hash_Table_Set() {
        mt19937 rng(random_device{}());
        a = uniform_int_distribution<long long>(1, p - 1)(rng);
        resize(0);
    }

    optional<V> find(K k) {              // O(1) expected
        for (auto& [key, val] : A[hash(k, A.size())])
            if (key == k) return val;
        return nullopt;
    }

    void insert(K k, V v) {              // O(1) amortised expected
        resize(sz + 1);
        auto& chain = A[hash(k, A.size())];
        for (auto& [key, val] : chain)
            if (key == k) { val = v; return; }
        chain.push_back({k, v});
        sz++;
    }

    void remove(K k) {                   // O(1) amortised expected
        auto& chain = A[hash(k, A.size())];
        auto before = chain.size();
        chain.remove_if([&](auto& kv){ return kv.first == k; });
        sz -= (int)(before - chain.size());  // only count actual removals
        resize(sz);
    }
};

The Complete Performance Picture

| Data Structure | build | find(k) | insert | delete | find_min/max | find_prev/next |
|---|---|---|---|---|---|---|
| Array | $\Theta(n)$ | $\Theta(n)$ | $\Theta(n)$ | $\Theta(n)$ | $\Theta(n)$ | $\Theta(n)$ |
| Sorted Array | $\Theta(n \log n)$ | $\Theta(\log n)$ | $\Theta(n)$ | $\Theta(n)$ | $\Theta(1)$ | $\Theta(\log n)$ |
| Direct Access Array | $\Theta(u)$ | $\Theta(1)$ | $\Theta(1)$ | $\Theta(1)$ | $\Theta(u)$ | $\Theta(u)$ |
| Hash Table | $\Theta(n)^{(e)}$ | $\Theta(1)^{(e)}$ | $\Theta(1)^{(a)(e)}$ | $\Theta(1)^{(a)(e)}$ | $\Theta(n)$ | $\Theta(n)$ |

(e) expected over random hash function · (a) amortised over sequence of operations

Hash tables give $O(1)$ expected for find, insert, delete — but $\Theta(n)$ for any order operation (find_min, find_max, find_next). This is the fundamental tradeoff: hashing destroys all ordering information. If you need both fast lookup and fast order queries, you need a balanced BST (Chapter 7), which gives $O(\log n)$ for everything. In practice, Python's dict is a hash table — it has no concept of min, max, or next key.

Worked Example · NSE Instrument Master Lookup 🇮🇳 XTS API

The XTS API returns an instrument master with ~9,000 records. Each record has an ExchangeInstrumentID (integer token) and a TradingSymbol (string like "NIFTY24JAN24CE24000"). Your strategy needs to: (a) look up the token for a given symbol in $O(1)$, and (b) look up the symbol for a given token in $O(1)$. Design the data structure.

Build two hash maps at session start

One map from symbol → token, one from token → symbol. Building each takes $\Theta(n)$ time and $\Theta(n)$ space. After that, every lookup is $O(1)$ expected.

Python implementation
def build_instrument_maps(master):
    """
    master: list of instrument dicts from XTS API
    Returns two O(1) lookup dicts, built in Θ(n) total.
    """
    symbol_to_token = {}           # str  → int
    token_to_symbol = {}           # int  → str

    for inst in master:
        sym = inst['TradingSymbol']
        tok = inst['ExchangeInstrumentID']
        symbol_to_token[sym] = tok
        token_to_symbol[tok] = sym

    return symbol_to_token, token_to_symbol

# Usage — both lookups are O(1) expected
# sym_map, tok_map = build_instrument_maps(master)
# token = sym_map['NIFTY24JAN24CE24000']    # O(1)
# symbol = tok_map[26000]                    # O(1)
Answer: two Python dicts — one per direction. Build cost: $\Theta(n)$ once at session start. Lookup cost: $O(1)$ expected per query. With 50,000 lookups during a session, the alternative (linear scan of the 9,000-record master list) would cost $50{,}000 \times 9{,}000 / 2 = 225{,}000{,}000$ comparisons. The hash maps reduce this to $\approx 50{,}000$ operations. This is not theoretical — it is the exact pattern used in production option analytics scripts.

Python's dict and set — Hash Tables Under the Hood

Python's built-in dict and set are hash tables. Python uses open addressing (not chaining) with a probing strategy, but the asymptotic bounds are the same: $O(1)$ expected for get, set, delete; $O(n)$ to build. A few practical notes:

Keys must be hashable — Python computes hash(key) for each key. Integers, strings, and tuples are hashable. Lists and dicts are not (they are mutable, so their hash would change). For custom objects, define __hash__ and __eq__.
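
For a custom class, the usual pattern is to delegate __hash__ and __eq__ to a stable key (a sketch; the Instrument class here is hypothetical):

```python
class Instrument:
    """Hashable wrapper: identity is the stable integer token."""
    def __init__(self, token, symbol):
        self.token, self.symbol = token, symbol

    def __hash__(self):
        return hash(self.token)            # hash on the stable integer key

    def __eq__(self, other):
        return isinstance(other, Instrument) and self.token == other.token

a = Instrument(26000, "NIFTY 50")
b = Instrument(26000, "NIFTY 50")
assert a == b
assert len({a, b}) == 1                    # equal objects occupy one set slot
```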

As of Python 3.7+, dict preserves insertion order — a deliberate design choice, not an algorithmic property of hash tables. Do not rely on this for ordering by key value; use a sorted structure for that.

In Python, hash(k) for integers is just $k$ itself (with some platform-specific twists). For strings, Python uses a randomised hash — the seed changes with every Python process startup, a security feature called hash randomisation (PEP 456). This means you cannot rely on Python string hash values being consistent across runs or machines. For trading systems that serialise hash-based structures to disk, use explicit integer keys (instrument tokens) rather than string symbols as dict keys — tokens are stable across sessions, string hashes are not.

The Duplicates Problem — Five Algorithms, Five Complexities

A clean problem that illustrates every tool from this module and the last. Given an array $A$ of $n$ positive integers with values at most $k$, does $A$ contain any duplicate?

Algorithm 1 — Brute force: check all $\binom{n}{2}$ pairs. $\Theta(n^2)$.

Algorithm 2 — Sort first: merge sort in $\Theta(n \log n)$, then scan for adjacent duplicates in $\Theta(n)$. Total: $\Theta(n \log n)$.

Algorithm 3 — Hash table: insert each element; if it is already present, return duplicate found. $O(n)$ expected — each insert and lookup is $O(1)$ expected.

Algorithm 4 — Pigeonhole: if $k < n$, a duplicate must exist by the pigeonhole principle. Return True without looking at the data. $\Theta(1)$.

Algorithm 5 — Direct access array: if $n \leq k$, allocate a boolean array of size $k$. For each element, check and mark. $\Theta(k)$ initialisation, $\Theta(n)$ insertions. Total: $\Theta(k)$.

Same problem, five algorithms, five different complexities: $\Theta(1)$, $\Theta(k)$, $\Theta(n)$, $\Theta(n \log n)$, $\Theta(n^2)$. The right algorithm depends on what you know about the input. Knowing the domain ($k < n$ → pigeonhole) is strictly more powerful than knowing only $n$. This is a recurring theme: problem-specific knowledge always beats general-purpose algorithms. In trading, your integers are bounded instrument tokens — exploit that structure.
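
Algorithms 3 and 4 combine naturally into one function (a sketch; has_duplicate is an illustrative name):

```python
def has_duplicate(A, k):
    """A: positive integers with values <= k. O(min(n, k)) expected."""
    if k < len(A):                 # Algorithm 4: pigeonhole — no need to look
        return True
    seen = set()                   # Algorithm 3: hash set, O(1) expected per op
    for x in A:
        if x in seen:
            return True
        seen.add(x)
    return False

assert has_duplicate([5, 3, 5, 1], k=10)       # genuine duplicate
assert has_duplicate([1, 2, 3, 4, 2], k=4)     # pigeonhole: n > k
assert not has_duplicate([1, 2, 3, 4], k=10)
```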

Practice Problems
4 questions · Chapter 04
prog / dsa / ch04 / q01 ★ Conceptual

A trading system needs to support three operations on open orders: add order, cancel by order ID, and find the order with the lowest price (to match against incoming market orders). Which data structure is the best fit?

A
Hash table — O(1) add and cancel
B
Sorted array — O(1) find minimum, O(log n) find by ID
C
Hash table + min-heap — O(1) cancel by ID, O(log n) add, O(1) find minimum
D
Direct access array — O(1) for all three operations
Answer: C.

No single data structure supports all three operations efficiently. A hash table gives O(1) find/cancel by key (order ID) but $\Theta(n)$ find_min. A heap gives O(1) find_min and O(log n) insert but $\Theta(n)$ find by arbitrary ID.

The standard solution for order books is to combine both: a hash map from order ID → heap node for $O(1)$ cancel-by-ID lookup, and a min-heap for $O(1)$ find_min and $O(\log n)$ insert. This is the architecture used in real matching engines. We build the heap in Chapter 8. Option D is wrong because a direct access array indexed by price does not support find by order ID in $O(1)$ without another structure.
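
A minimal sketch of option C's architecture, using lazy deletion instead of direct heap-node handles (a common simplification; the OrderBook class and its fields are illustrative):

```python
import heapq

class OrderBook:
    def __init__(self):
        self.orders = {}       # order_id -> price: O(1) lookup and cancel
        self.heap = []         # (price, order_id) min-heap for find_min

    def add(self, order_id, price):            # O(log n)
        self.orders[order_id] = price
        heapq.heappush(self.heap, (price, order_id))

    def cancel(self, order_id):                # O(1): heap is cleaned lazily
        self.orders.pop(order_id, None)

    def find_min(self):                        # O(1) amortised
        while self.heap and self.heap[0][1] not in self.orders:
            heapq.heappop(self.heap)           # discard cancelled entries
        return self.heap[0] if self.heap else None

book = OrderBook()
book.add(1, 101.5); book.add(2, 99.0); book.add(3, 100.0)
book.cancel(2)                                 # cancel the lowest-priced order
assert book.find_min() == (100.0, 3)
```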
prog / dsa / ch04 / q02 ★★ Universal Hashing

A hash table uses the division method $h(k) = k \bmod 100$ with $m = 100$ slots. An adversary who knows this function inserts $n = 200$ items all with keys $\{0, 100, 200, \ldots, 19900\}$. What is the worst-case chain length?

A
2 — load factor is $200/100 = 2$, evenly distributed
B
200 — all keys hash to slot 0, single chain holds all items
C
$\sqrt{200} \approx 14$ — birthday paradox bound
D
100 — half the items collide by pigeonhole
Answer: B — all 200 items in one chain.

Every key is a multiple of 100, so $k \bmod 100 = 0$ for all of them. Every single item hashes to slot 0. The chain at slot 0 has length 200 — all operations on this table are now $\Theta(n)$. The hash table has degenerated into a linked list.

This is precisely why a fixed hash function is dangerous: any adversary who knows $h$ can construct inputs that destroy performance. Universal hashing prevents this by choosing $h$ randomly — the adversary cannot know which function was chosen. With a universally random hash, the expected chain length is $O(1)$ regardless of the keys. The expectation is over our random choice, not over the adversary's inputs.
prog / dsa / ch04 / q03 ★★ Complexity

A Python script checks whether any of $n = 10{,}000$ NIFTY order IDs (integers in range $[1, 10^9]$) appear in a watchlist of $w = 500$ IDs. Which approach is asymptotically fastest for a single batch check?

A
Sort the watchlist ($\Theta(w \log w)$), binary search each order ID ($\Theta(n \log w)$) — total $\Theta(n \log w)$
B
Build a set from the watchlist ($\Theta(w)$), check each order ID ($\Theta(n)$) — total $\Theta(n + w)$
C
Nested loop: for each order, scan the watchlist — $\Theta(nw)$
D
Direct access array of size $10^9$ — $\Theta(10^9)$ build, $\Theta(n)$ queries
Answer: B — $\Theta(n + w)$.

Build a Python set from the 500 watchlist IDs: $\Theta(w)$ time. Then for each of the 10,000 order IDs, check order_id in watchlist_set: $O(1)$ expected per check, $\Theta(n)$ total. Grand total: $\Theta(n + w) = \Theta(10{,}500)$.

Option A ($\Theta(n \log w) \approx 90{,}000$ ops) is worse. Option C ($\Theta(nw) = 5{,}000{,}000$) is much worse. Option D allocates a $10^9$-slot array ($\approx 8$ GB for 64-bit pointers) — technically $\Theta(n)$ queries but the $\Theta(u)$ build cost is catastrophic. In Python: watchlist_set = set(watchlist); matches = [oid for oid in orders if oid in watchlist_set]. Two lines, optimal complexity.
prog / dsa / ch04 / q04 ★★★ Proof

The universal hash family $h_{ab}(k) = ((ak + b) \bmod p) \bmod m$ guarantees $\Pr[h(k_i) = h(k_j)] \leq 1/m$ for $k_i \neq k_j$. The expected chain length at any slot is then $O(1)$ when $m = \Omega(n)$. Which step in the chain-length argument uses linearity of expectation?

A
Showing that $\Pr[h(k_i) = h(k_j)] \leq 1/m$ for the universal family
B
Computing $\mathbb{E}[X_i] = \mathbb{E}[\sum_j X_{ij}] = \sum_j \mathbb{E}[X_{ij}]$
C
Concluding that $O(1)$ expected chain length implies $O(1)$ expected find time
D
Arguing that $m = \Omega(n)$ implies load factor $\alpha = O(1)$
Answer: B.

Linearity of expectation says $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$ for any random variables $X$ and $Y$ — crucially, with no independence assumption required. We use it at the step: $$\mathbb{E}\!\left[\sum_j X_{ij}\right] = \sum_j \mathbb{E}[X_{ij}]$$ The indicator variables $X_{ij}$ may be correlated (knowing one collision increases the probability of others), but linearity of expectation holds regardless. This is the key mathematical tool — without it, we would need independence between all pairs, which we do not have.

Option A is the universality property of the hash family, proved separately. Option C uses Markov's inequality or a simpler argument. Option D is just algebra. Only B uses linearity of expectation.

Terminal Questions — Chapter 04 · 10 problems · No answers given

Work through independently. Q1–3 are direct applications. Q4–7 require combining ideas. Q8–10 demand careful argument.

1
A direct access array stores NSE instrument tokens in range $[1, 2^{32}]$. You have 9,000 instruments to store. (a) What is the space cost in bytes if each slot holds an 8-byte pointer? (b) What fraction of slots are used? (c) Why is a hash table strictly better here? Easy
2
Compute $h_{ab}(k)$ for $k = 42$, $a = 7$, $b = 3$, $p = 1009$ (prime), $m = 11$. Show all steps. Then compute $h_{ab}(142)$ with the same parameters and verify whether a collision occurs with $k = 42$. Easy
3
Python's dict and set both use hash tables. What is the expected time complexity of: (a) x in my_dict; (b) my_dict[k] = v; (c) del my_dict[k]; (d) min(my_dict.keys())? For (d), explain why hashing cannot help. Easy
4
Given a list of $n$ NIFTY option trades, each with fields (strike, type, price, qty), design a data structure that supports in $O(1)$ expected time: (a) total volume traded at a specific strike; (b) check whether any trade occurred at a given strike+type combination. Describe your hash function choice and collision strategy. Medium
5
Prove that the division hash function $h(k) = k \bmod m$ is not universal. Construct a specific set of $n$ keys and a specific $m$ where every key maps to the same slot. Medium
6
Implement the DUPLICATES problem using three different approaches in Python: (a) brute force $O(n^2)$; (b) sort-based $O(n \log n)$; (c) hash set $O(n)$ expected. Test on an array of 10,000 random NIFTY tick prices. Measure and compare actual runtimes using time.perf_counter(). Medium
7
A hash table with chaining currently has $n = 1{,}000$ items in $m = 500$ slots (load factor $\alpha = 2$). After a resize to $m' = 2{,}000$ slots with a new random hash function, what is the expected chain length? Show that this is $O(1)$ and compute the exact bound from the universality argument. Medium
8
The comparison lower bound proof shows that any comparison search algorithm takes $\Omega(\log n)$ time. Explain precisely where this argument breaks down for hash tables — which specific assumption of the comparison model does hashing violate, and what operation does it exploit instead? Hard
9
Design a data structure supporting all Set operations with the following complexities: find(k) in $O(1)$ expected, insert/delete in $O(\log n)$ amortised expected, find_min/find_max in $O(\log n)$. You may combine a hash table with any structure from previous chapters. Describe the design and justify each operation's complexity. Hard
10
Two-sum problem: given an array $A$ of $n$ integers and a target $t$, find two indices $i \neq j$ such that $A[i] + A[j] = t$, or report none exists. (a) Give an $O(n^2)$ brute force solution. (b) Give an $O(n \log n)$ sort-based solution. (c) Give an $O(n)$ expected hash-based solution. Apply (c) to find two NIFTY strike prices from a given list that sum to 48,000. Hard