Hashing
We proved in Chapter 3 that any comparison-based search must make $\Omega(\log n)$ comparisons in the worst case. This chapter breaks that barrier — not by being cleverer about comparisons, but by abandoning them entirely.
The Comparison Lower Bound — Why $\Omega(\log n)$ Is Unavoidable
Any deterministic comparison search algorithm can be modelled as a binary decision tree. Each internal node is a comparison between two items — the outcome is True or False, so the tree branches in exactly two directions. Each leaf represents a distinct output: either "found item $x$" or "not found." With $n$ items stored, there are $n + 1$ possible outputs, so the decision tree needs at least $n + 1$ leaves.
A binary tree with at least $n + 1$ leaves has height at least $\lceil \log_2(n+1) \rceil = \Omega(\log n)$. The height is the length of the longest root-to-leaf path — which is the worst-case number of comparisons. Therefore:

$$\text{worst-case comparisons} \;\geq\; \lceil \log_2(n+1) \rceil \;=\; \Omega(\log n).$$
Sorted arrays and balanced BSTs achieve this bound. But what if we are not limited to comparisons? The Word-RAM model has one operation with super-constant branching factor: random memory access. Given an address, we can read the value there in $\Theta(1)$ — branching factor equal to the size of memory itself. This is the doorway hashing walks through.
Direct Access Arrays — $\Theta(1)$ Find, Catastrophic Space
The simplest exploitation of random access: give every item an integer key $k \in \{0, \ldots, u-1\}$ and store item $x$ at index $x.\text{key}$ in an array of length $u$. Finding any item is exactly one array access — $\Theta(1)$. No comparisons at all.
```python
class DirectAccessArray:
    def __init__(self, u):              # O(u) space — the problem
        self.A = [None] * u

    def find(self, k):                  # O(1) — just index
        return self.A[k]

    def insert(self, x):                # O(1)
        self.A[x.key] = x

    def delete(self, k):                # O(1)
        self.A[k] = None

    def find_next(self, k):             # O(u) — must scan
        for i in range(k, len(self.A)):
            if self.A[i] is not None:
                return self.A[i]

    def find_max(self):                 # O(u) — must scan
        for i in range(len(self.A) - 1, -1, -1):
            if self.A[i] is not None:
                return self.A[i]

# Problem: if keys are 10-letter strings, u = 26^10 ≈ 1.4 × 10^14.
# Storing even 1 bit per slot = 17.6 TB. Completely unusable.
```
```cpp
#include <vector>
#include <optional>
using namespace std;

template<typename T>
class DirectAccessArray {
    vector<optional<T>> A;
public:
    DirectAccessArray(int u) : A(u) {}          // O(u) space — the problem
    optional<T> find(int k) { return A[k]; }    // O(1)
    void insert(int k, T x) { A[k] = x; }       // O(1)
    void remove(int k) { A[k] = nullopt; }      // O(1)
    // Order operations require O(u) scan — no structure to exploit
    optional<T> find_max() {
        for (int i = (int)A.size() - 1; i >= 0; --i)
            if (A[i]) return A[i];
        return nullopt;
    }
};
```
The space cost is $\Theta(u)$ — the size of the entire key universe, regardless of how many items you actually store. If $n \ll u$, this is catastrophically wasteful. For NSE instrument tokens (32-bit integers), $u = 2^{32} \approx 4 \times 10^9$. Allocating that array takes 16 GB at 4 bytes per slot — to store perhaps 9,000 instruments. We need a smarter approach.
Hashing — Compressing the Key Space
The fix: instead of an array of size $u$, use an array of size $m = \Theta(n)$ — just enough slots for the items you actually store. Map each key $k$ to a slot via a hash function $h : \{0, \ldots, u-1\} \to \{0, \ldots, m-1\}$.
Since $m \ll u$, by the pigeonhole principle, no hash function can be injective over all possible keys — there must exist keys $a \neq b$ with $h(a) = h(b)$. This is called a collision.
Collision Resolution — Chaining
When two keys hash to the same slot, we need somewhere to put the second item. The cleanest solution: at each slot, store not a single item but a chain — a small data structure (typically a linked list) holding all items that hash to that slot.
To find key $k$: compute $h(k)$, go to slot $h(k)$, then search the chain there for $k$. If chains are short, all operations are $O(1)$ plus the chain search cost. The central question is: how long are the chains?
If $n$ items are distributed roughly evenly across $m = \Theta(n)$ slots, each chain has expected length $n/m = \Theta(1)$. The ratio $\alpha = n/m$ is called the load factor. As long as $\alpha = O(1)$, all operations run in $O(1)$ expected time.
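A quick simulation makes the load factor concrete. This is a sketch under arbitrary choices (key range $10^9$, fixed seed): it throws $n$ random keys into $m$ slots and measures the resulting chain lengths.

```python
import random

def chain_lengths(n, m):
    # Throw n random keys into m slots with h(k) = k mod m
    # (uniformly random keys make even the division method behave well)
    slots = [0] * m
    for _ in range(n):
        slots[random.randrange(10**9) % m] += 1
    return slots

random.seed(0)
lengths = chain_lengths(n=1000, m=2000)    # load factor alpha = 0.5
print(sum(lengths) / len(lengths))         # exactly n/m = 0.5
print(max(lengths))                        # a small constant, not Theta(n)
```

The average chain length is $n/m$ by definition; the interesting observation is that the longest chain is also small when keys spread evenly.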
What Makes a Good Hash Function?
Division method (bad in general): $h(k) = k \bmod m$. Simple and fast. Works well when keys are uniformly distributed. Terrible when keys share structure — if all instrument tokens are multiples of some number that divides $m$, they all collide. A determined adversary who knows your hash function can construct inputs that all hash to the same slot, forcing $\Theta(n)$ chains.
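The adversarial scenario is easy to reproduce. A minimal sketch, using the fixed function $h(k) = k \bmod 100$ from above:

```python
m = 100
h = lambda k: k % m                            # fixed, publicly known hash
adversarial = [100 * i for i in range(200)]    # every key a multiple of m

slots = {}
for k in adversarial:
    slots.setdefault(h(k), []).append(k)

print(len(slots))       # 1: every key landed in the same slot
print(len(slots[0]))    # 200: one Theta(n) chain — the table degenerates
```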
The fix: do not use a fixed hash function. Choose one randomly from a large family of functions. Then no adversary can predict which function you are using.
One concrete universal family: $h_{ab}(k) = ((ak + b) \bmod p) \bmod m$, where $p$ is a prime larger than $u$, and $a, b$ are chosen uniformly at random from $\{0, \ldots, p-1\}$ with $a \neq 0$.
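A sketch of this family, with an empirical check of the collision guarantee; the pair of keys and the trial count are arbitrary choices for illustration.

```python
import random

p = 2**31 - 1    # prime larger than the key universe (assuming 31-bit keys)
m = 100          # table size

def make_hash():
    # Draw one member of the family h_ab(k) = ((a*k + b) mod p) mod m
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

# Estimate Pr[h(k1) == h(k2)] over the random choice of h
random.seed(42)
k1, k2 = 0, 100          # a worst-case pair for the division method
trials = 10_000
collisions = 0
for _ in range(trials):
    h = make_hash()
    if h(k1) == h(k2):
        collisions += 1
print(collisions / trials)   # close to the guaranteed bound 1/m = 0.01
```

The same pair of keys that destroyed the division method now collides only about one time in a hundred, because the adversary cannot predict which $(a, b)$ was drawn.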
Why Universality Implies Short Chains — The Probability Argument
Fix a key $k_i$ and a universally random hash function $h$. How many other keys collide with $k_i$? For each other key $k_j$, define the indicator random variable:

$$X_{ij} = \begin{cases} 1 & \text{if } h(k_i) = h(k_j), \\ 0 & \text{otherwise.} \end{cases}$$

Universality gives $\mathbb{E}[X_{ij}] = \Pr[h(k_i) = h(k_j)] \leq 1/m$ for $j \neq i$.

The chain length at slot $h(k_i)$ is $X_i = \sum_{j} X_{ij}$. By linearity of expectation:

$$\mathbb{E}[X_i] = \sum_{j} \mathbb{E}[X_{ij}] = 1 + \sum_{j \neq i} \Pr[h(k_i) = h(k_j)] \leq 1 + \frac{n-1}{m}.$$

Since $m = \Omega(n)$, the load factor $\alpha = n/m = O(1)$, so:

$$\mathbb{E}[X_i] \leq 1 + \alpha = O(1).$$
Expected chain length is $O(1)$ — regardless of the input keys, as long as $h$ is chosen randomly from a universal family. All hash table operations run in $O(1)$ expected time.
```python
from random import randint

class Hash_Table_Set:
    def __init__(self):
        self.A = []                      # array of chains (linked lists)
        self.size = 0                    # number of items stored
        self.p = 2**31 - 1               # large prime > key universe
        self.a = randint(1, self.p - 1)  # random multiplier
        self._resize(0)

    def _hash(self, k, m):               # universal hash: O(1)
        return ((self.a * k) % self.p) % m

    def _resize(self, n):                # O(n) — rebuild with new size
        if n == 0 or not (len(self.A) // 4 <= n <= len(self.A)):
            m = max(n * 2, 1)
            new_A = [[] for _ in range(m)]  # new array of empty chains
            for chain in self.A:
                for x in chain:
                    new_A[self._hash(x.key, m)].append(x)
            self.A = new_A

    def find(self, k):                   # O(1) expected
        h = self._hash(k, len(self.A))
        for x in self.A[h]:              # scan chain at slot h
            if x.key == k:
                return x
        return None

    def insert(self, x):                 # O(1) amortised expected
        self._resize(self.size + 1)
        h = self._hash(x.key, len(self.A))
        for item in self.A[h]:           # check for existing key
            if item.key == x.key:
                item.val = x.val         # update in place
                return
        self.A[h].append(x)              # new key — append to chain
        self.size += 1

    def delete(self, k):                 # O(1) amortised expected
        h = self._hash(k, len(self.A))
        for i, x in enumerate(self.A[h]):
            if x.key == k:
                self.A[h].pop(i)
                self.size -= 1
                self._resize(self.size)
                return x
        raise KeyError(k)

    # Order operations — O(n), no structure to exploit
    def find_min(self):
        return min((x for chain in self.A for x in chain),
                   key=lambda x: x.key, default=None)

    def find_max(self):
        return max((x for chain in self.A for x in chain),
                   key=lambda x: x.key, default=None)
```
```cpp
#include <vector>
#include <list>
#include <optional>
#include <random>
#include <utility>
using namespace std;

// Hash table with chaining and universal hash family
// h_a(k) = (a*k mod p) mod m, p = 2^31 - 1
template<typename K, typename V>
class Hash_Table_Set {
    using KV = pair<K, V>;
    vector<list<KV>> A;
    int sz = 0;
    long long p = 2147483647LL;  // Mersenne prime 2^31 - 1 (64-bit to avoid overflow)
    long long a;                 // random multiplier
    int hash(K k, int m) {
        return (int)(((a * (long long)k) % p) % m);
    }
    void resize(int n) {
        if (A.empty() || n < (int)A.size() / 4 || n >= (int)A.size()) {
            int m = max(n * 2, 1);
            vector<list<KV>> B(m);
            for (auto& chain : A)
                for (auto& [k, v] : chain)
                    B[hash(k, m)].push_back({k, v});
            A = move(B);
        }
    }
public:
    Hash_Table_Set() {
        mt19937 rng(random_device{}());
        a = uniform_int_distribution<long long>(1, p - 1)(rng);
        resize(0);
    }
    optional<V> find(K k) {              // O(1) expected
        for (auto& [key, val] : A[hash(k, (int)A.size())])
            if (key == k) return val;
        return nullopt;
    }
    void insert(K k, V v) {              // O(1) amortised expected
        resize(sz + 1);
        auto& chain = A[hash(k, (int)A.size())];
        for (auto& [key, val] : chain)
            if (key == k) { val = v; return; }
        chain.push_back({k, v});
        sz++;
    }
    void remove(K k) {                   // O(1) amortised expected
        auto& chain = A[hash(k, (int)A.size())];
        auto before = chain.size();
        chain.remove_if([&](auto& kv) { return kv.first == k; });
        sz -= (int)(before - chain.size());  // only decrement if key was present
        resize(sz);
    }
};
```
The Complete Performance Picture
| Data Structure | build | find(k) | insert | delete | find_min/max | find_prev/next |
|---|---|---|---|---|---|---|
| Array | $\Theta(n)$ | $\Theta(n)$ | $\Theta(n)$ | $\Theta(n)$ | $\Theta(n)$ | $\Theta(n)$ |
| Sorted Array | $\Theta(n \log n)$ | $\Theta(\log n)$ | $\Theta(n)$ | $\Theta(n)$ | $\Theta(1)$ | $\Theta(\log n)$ |
| Direct Access Array | $\Theta(u)$ | $\Theta(1)$ | $\Theta(1)$ | $\Theta(1)$ | $\Theta(u)$ | $\Theta(u)$ |
| Hash Table | $\Theta(n)^{(e)}$ | $\Theta(1)^{(e)}$ | $\Theta(1)^{(a)(e)}$ | $\Theta(1)^{(a)(e)}$ | $\Theta(n)$ | $\Theta(n)$ |
(e) expected over random hash function · (a) amortised over sequence of operations
Python's `dict` is a hash table — it has no concept of min, max, or next key.
The XTS API returns an instrument master with ~9,000 records. Each record has an ExchangeInstrumentID (an integer token) and a TradingSymbol (a string like "NIFTY24JAN24CE24000"). Your strategy needs to: (a) look up the token for a given symbol in $O(1)$, and (b) look up the symbol for a given token in $O(1)$. Design the data structure.
One map from symbol → token, one from token → symbol. Building each takes $\Theta(n)$ time and $\Theta(n)$ space. After that, every lookup is $O(1)$ expected.
```python
def build_instrument_maps(master):
    """
    master: list of instrument dicts from the XTS API.
    Returns two O(1) lookup dicts, built in Θ(n) total.
    """
    symbol_to_token = {}  # str → int
    token_to_symbol = {}  # int → str
    for inst in master:
        sym = inst['TradingSymbol']
        tok = inst['ExchangeInstrumentID']
        symbol_to_token[sym] = tok
        token_to_symbol[tok] = sym
    return symbol_to_token, token_to_symbol

# Usage — both lookups are O(1) expected
# sym_map, tok_map = build_instrument_maps(master)
# token = sym_map['NIFTY24JAN24CE24000']   # O(1)
# symbol = tok_map[26000]                  # O(1)
```
Python's dict and set — Hash Tables Under the Hood
Python's built-in dict and set are hash tables. Python uses open addressing (not chaining) with a probing strategy, but the asymptotic bounds are the same: $O(1)$ expected for get, set, delete; $O(n)$ to build. A few practical notes:
Keys must be hashable — Python computes hash(key) for each key. Integers, strings, and tuples are hashable. Lists and dicts are not (they are mutable, so their hash would change). For custom objects, define __hash__ and __eq__.
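For example, a hypothetical Instrument class (the class and field names are illustrative) that hashes by its integer token field:

```python
class Instrument:
    """Illustrative example: hashable by its integer token."""
    def __init__(self, token, symbol):
        self.token = token
        self.symbol = symbol

    def __hash__(self):
        return hash(self.token)          # delegate to an immutable field

    def __eq__(self, other):
        return isinstance(other, Instrument) and self.token == other.token

nifty = Instrument(26000, "NIFTY 50")
positions = {nifty: 75}                  # usable as a dict key
# A distinct object with the same token is treated as the same key:
print(positions[Instrument(26000, "NIFTY 50")])   # 75
```

The contract is that equal objects must have equal hashes, which holds here because both methods look only at `token`.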
As of Python 3.7+, dict preserves insertion order — a deliberate design choice, not an algorithmic property of hash tables. Do not rely on this for ordering by key value; use a sorted structure for that.
hash(k) for integers is just $k$ itself (with some platform-specific twists). For strings, Python uses a randomised hash — the seed changes with every Python process startup, a security feature called hash randomisation (PEP 456). This means you cannot rely on Python string hash values being consistent across runs or machines. For trading systems that serialise hash-based structures to disk, use explicit integer keys (instrument tokens) rather than string symbols as dict keys — tokens are stable across sessions, string hashes are not.
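A small demonstration; the exact values below are CPython behaviour, not a language guarantee.

```python
# Integer hashes are stable: hash(k) == k for k smaller than the
# platform modulus (sys.hash_info.modulus), so they survive restarts.
assert hash(26000) == 26000
# CPython quirk: -1 is reserved as an error code inside the interpreter.
assert hash(-1) == -2
# String hashes are salted per process; this value is only meaningful
# within the current run.
print(hash("NIFTY24JAN24CE24000"))
```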
The Duplicates Problem — Five Algorithms, Five Complexities
A clean problem that illustrates every tool from this module and the last. Given an array $A$ of $n$ positive integers with values at most $k$, does $A$ contain any duplicate?
Algorithm 1 — Brute force: check all $\binom{n}{2}$ pairs. $\Theta(n^2)$.
Algorithm 2 — Sort first: merge sort in $\Theta(n \log n)$, then scan for adjacent duplicates in $\Theta(n)$. Total: $\Theta(n \log n)$.
Algorithm 3 — Hash table: insert each element; if it is already present, return duplicate found. $O(n)$ expected — each insert and lookup is $O(1)$ expected.
Algorithm 4 — Pigeonhole: if $k < n$, a duplicate must exist by the pigeonhole principle. Return True without looking at the data. $\Theta(1)$.
Algorithm 5 — Direct access array: if $n \leq k$, allocate a boolean array of size $k$. For each element, check and mark. $\Theta(k)$ initialisation, $\Theta(n)$ insertions. Total: $\Theta(k)$.
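Algorithms 3 and 5 are short enough to sketch directly (function names are ours):

```python
def has_duplicate_hash(A):
    """Algorithm 3: O(n) expected time via a hash set."""
    seen = set()
    for x in A:
        if x in seen:
            return True
        seen.add(x)
    return False

def has_duplicate_direct(A, k):
    """Algorithm 5: Theta(k) time via a direct access boolean array.
    Assumes values are integers in [1, k]."""
    present = [False] * (k + 1)          # Theta(k) initialisation
    for x in A:
        if present[x]:
            return True
        present[x] = True
    return False

print(has_duplicate_hash([3, 1, 4, 1, 5]))     # True
print(has_duplicate_direct([3, 1, 4, 2], 5))   # False
```

Which wins depends on the relationship between $n$ and $k$: the hash set needs no assumption on $k$, while the direct access array trades $\Theta(k)$ space for deterministic behaviour.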
A trading system needs to support three operations on open orders: add order, cancel by order ID, and find the order with the lowest price (to match against incoming market orders). Which data structure is the best fit?
No single data structure supports all three operations efficiently. A hash table gives $O(1)$ expected find/cancel by key (order ID) but $\Theta(n)$ find_min. A heap gives $\Theta(1)$ find_min and $O(\log n)$ insert but $\Theta(n)$ find by arbitrary ID.

The standard solution for order books is to combine both: a hash map from order ID → heap node for $O(1)$ cancel-by-ID lookup, and a min-heap for $\Theta(1)$ find_min and $O(\log n)$ insert. This is the architecture used in real matching engines. We build the heap in Chapter 8. Note that a direct access array indexed by price cannot answer find-by-order-ID in $O(1)$ without an auxiliary structure.
A hash table uses the division method $h(k) = k \bmod 100$ with $m = 100$ slots. An adversary who knows this function inserts $n = 200$ items all with keys $\{0, 100, 200, \ldots, 19900\}$. What is the worst-case chain length?
Every key is a multiple of 100, so $k \bmod 100 = 0$ for all of them. Every single item hashes to slot 0. The chain at slot 0 has length 200 — all operations on this table are now $\Theta(n)$. The hash table has degenerated into a linked list.
This is precisely why a fixed hash function is dangerous: any adversary who knows $h$ can construct inputs that destroy performance. Universal hashing prevents this by choosing $h$ randomly — the adversary cannot know which function was chosen. With a universally random hash, the expected chain length is $O(1)$ regardless of the keys. The expectation is over our random choice, not over the adversary's inputs.
A Python script checks whether any of $n = 10{,}000$ NIFTY order IDs (integers in range $[1, 10^9]$) appear in a watchlist of $w = 500$ IDs. Which approach is asymptotically fastest for a single batch check?
Build a Python set from the 500 watchlist IDs: $\Theta(w)$ time. Then for each of the 10,000 order IDs, check order_id in watchlist_set: $O(1)$ expected per check, $\Theta(n)$ total. Grand total: $\Theta(n + w) = \Theta(10{,}500)$.
Binary-searching a sorted watchlist costs $\Theta(n \log w) \approx 90{,}000$ operations — worse. A nested-loop scan costs $\Theta(nw) = 5{,}000{,}000$ — much worse. A direct access array over the full ID range allocates $10^9$ slots ($\approx 8$ GB for 64-bit pointers) — technically $\Theta(n)$ queries, but the $\Theta(u)$ build cost is catastrophic. In Python:

```python
watchlist_set = set(watchlist)
matches = [oid for oid in orders if oid in watchlist_set]
```

Two lines, optimal complexity.
The universal hash family $h_{ab}(k) = ((ak + b) \bmod p) \bmod m$ guarantees $\Pr[h(k_i) = h(k_j)] \leq 1/m$ for $k_i \neq k_j$. The expected chain length at any slot is then $O(1)$ when $m = \Omega(n)$. Which step in the chain-length argument uses linearity of expectation?
Linearity of expectation says $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$ for any random variables $X$ and $Y$ — crucially, with no independence assumption required. We use it at the step: $$\mathbb{E}\!\left[\sum_j X_{ij}\right] = \sum_j \mathbb{E}[X_{ij}]$$ The indicator variables $X_{ij}$ may be correlated (knowing one collision increases the probability of others), but linearity of expectation holds regardless. This is the key mathematical tool — without it, we would need independence between all pairs, which we do not have.
The universality bound $\Pr[h(k_i) = h(k_j)] \leq 1/m$ itself is proved separately; bounding the probability of an unusually long chain would use Markov's inequality; the final $1 + \alpha$ simplification is just algebra. Only the interchange of the expectation and the sum uses linearity of expectation.
Work through independently. Q1–3 are direct applications. Q4–7 require combining ideas. Q8–10 demand careful argument.
dict and set both use hash tables. What is the expected time complexity of: (a) x in my_dict; (b) my_dict[k] = v; (c) del my_dict[k]; (d) min(my_dict.keys())? For (d), explain why hashing cannot help.
Easy
Given trades represented as tuples (strike, type, price, qty), design a data structure that supports in $O(1)$ expected time: (a) total volume traded at a specific strike; (b) checking whether any trade occurred at a given strike+type combination. Describe your hash function choice and collision strategy.
Medium
time.perf_counter().
Medium