Sorting II
We proved merge sort is optimal in the comparison model. This chapter breaks through the $\Theta(n \log n)$ floor — not with cleverer comparisons, but by abandoning them altogether, exactly as we did for search in Chapter 4.
The Sorting Lower Bound — $\Theta(n \log n)$ Is the Wall
In Chapter 3 we proved the comparison search lower bound: any comparison algorithm needs $\Omega(\log n)$ time. The same decision-tree argument gives a much stronger bound for sorting.
A sorting algorithm must produce the correct output permutation for any input. There are $n!$ possible permutations of $n$ distinct items — so the decision tree needs at least $n!$ leaves. A binary tree with $\geq n!$ leaves has height at least $\log_2(n!)$. By Stirling's approximation $n! > (n/2)^{n/2}$, so:

$$\log_2(n!) > \log_2\left((n/2)^{n/2}\right) = \frac{n}{2}\log_2\frac{n}{2} = \Omega(n \log n)$$
Therefore every comparison sort requires $\Omega(n \log n)$ comparisons in the worst case. Merge sort achieves $\Theta(n \log n)$ — it is optimal in the comparison model. You cannot do better with comparisons alone.
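As a quick numeric sanity check of the bound (not part of the proof), we can compare $\log_2(n!)$ against the $(n/2)\log_2(n/2)$ lower bound for a few values of $n$:

```python
import math

# Check that log2(n!) >= (n/2) * log2(n/2) for a few n.
for n in [8, 64, 1024]:
    exact = math.log2(math.factorial(n))   # minimum decision-tree height
    bound = (n / 2) * math.log2(n / 2)     # the Stirling-based lower bound
    assert exact >= bound
    print(n, round(exact), round(bound))
```

The exact value tracks $n \log_2 n$ closely; the Stirling bound is looser but already $\Omega(n \log n)$.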
Direct Access Array Sort — $\Theta(n)$ When Keys Are Small
The simplest non-comparison sort: if all keys are distinct non-negative integers in range $\{0, \ldots, u-1\}$, use a direct access array. Insert each item at its key index in $\Theta(n)$, then read back in order in $\Theta(u)$. Total: $\Theta(n + u) = \Theta(n)$ when $u = \Theta(n)$.
def direct_access_sort(A):
    """Sort A assuming items have distinct non-negative integer keys."""
    u = 1 + max(x.key for x in A)    # O(n) — find key range
    D = [None] * u                   # O(u) — direct access array
    for x in A:                      # O(n) — place each item
        D[x.key] = x
    i = 0
    for key in range(u):             # O(u) — read back in order
        if D[key] is not None:
            A[i] = D[key]
            i += 1
    # Total: Θ(n + u). Linear when u = Θ(n).
    # Fatal flaw: cannot handle duplicate keys (second item overwrites first).
#include <vector>
#include <algorithm>
using namespace std;

struct Item { int key; /* other fields */ };

void direct_access_sort(vector<Item>& A) {
    if (A.empty()) return;
    int u = 1 + (*max_element(A.begin(), A.end(),
        [](auto& a, auto& b){ return a.key < b.key; })).key;
    vector<Item*> D(u, nullptr);          // O(u) — indexed by key
    for (auto& x : A)                     // O(n) — place items
        D[x.key] = &x;
    vector<Item> out;
    for (int k = 0; k < u; ++k)           // O(u) — read in order
        if (D[k]) out.push_back(*D[k]);
    A = out;
    // O(n + u). Requires distinct keys — duplicate key overwrites earlier item.
}
Two problems remain: duplicate keys (overwriting), and large key ranges ($u \gg n$). We fix them in order.
Counting Sort — Handling Duplicates Stably
Replace each single slot with a chain — a list that accumulates all items sharing that key. Insert each item to the end of its chain (preserving arrival order), then read back chain by chain. This makes counting sort stable: items with equal keys appear in the output in the same order as the input.
def counting_sort(A):
    """Stable sort of A assuming non-negative integer keys."""
    u = 1 + max(x.key for x in A)    # O(n)
    D = [[] for _ in range(u)]       # O(u) — array of chains
    for x in A:                      # O(n) — append to chain
        D[x.key].append(x)           # insert_last preserves order → stable
    i = 0
    for chain in D:                  # O(u) — read chains in key order
        for x in chain:              # O(n) total across all chains
            A[i] = x
            i += 1
    # Total: Θ(n + u).
    # Stability proof: within each chain, items appear in insertion order
    # (= original array order), so equal-key items preserve relative order.
#include <vector>
#include <list>
#include <algorithm>
using namespace std;

struct Item { int key; int val; };

void counting_sort(vector<Item>& A) {
    if (A.empty()) return;
    int u = 1 + (*max_element(A.begin(), A.end(),
        [](auto& a, auto& b){ return a.key < b.key; })).key;
    vector<list<Item>> D(u);              // O(u) — array of chains
    for (auto& x : A)                     // O(n) — append to correct chain
        D[x.key].push_back(x);            // push_back = insert_last → stable
    int i = 0;
    for (auto& chain : D)                 // O(u) scan + O(n) items total
        for (auto& x : chain)
            A[i++] = x;
    // Θ(n + u). Stable by construction.
}
Let's trace counting sort on a small example. Keys are integers in $\{0,\ldots,4\}$; take the input key sequence $[2, 1, 4, 1, 3]$:

D[0] = []
D[1] = [(key=1), (key=1)] ← both 1s, in original order
D[2] = [(key=2)]
D[3] = [(key=3)]
D[4] = [(key=4)]

Reading the chains back in key order yields $[1, 1, 2, 3, 4]$, with the two key-1 items in their original relative order.
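The chain-building step can be reproduced directly. This sketch uses the input key order $[2, 1, 4, 1, 3]$ (one sequence consistent with the chains shown; the `tag` field is added here just to make arrival order visible):

```python
from types import SimpleNamespace

# Items with keys [2, 1, 4, 1, 3]; tag records arrival position.
A = [SimpleNamespace(key=k, tag=i) for i, k in enumerate([2, 1, 4, 1, 3])]
u = 1 + max(x.key for x in A)
D = [[] for _ in range(u)]
for x in A:
    D[x.key].append(x)                  # one chain per key, arrival order kept
out = [x for chain in D for x in chain]
print([(x.key, x.tag) for x in out])    # the two key-1 items keep tags 1, 3
```

The two items with key 1 come out carrying tags 1 and 3, in that order: stability by construction.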
The Key Idea — Represent Large Keys as Tuples
Suppose keys can be up to $u < n^2$. A key $k$ can be written as a two-digit base-$n$ number: $k = a \cdot n + b$ where $a = \lfloor k/n \rfloor$ and $b = k \bmod n$. Both $a$ and $b$ are in range $\{0, \ldots, n-1\}$. So every key becomes a pair $(a, b)$ with values bounded by $n$ — tractable for counting sort.
For example with $n = 5$ (so base $5$): key $17 = 3 \cdot 5 + 2 \Rightarrow (3, 2)$. Key $24 = 4 \cdot 5 + 4 \Rightarrow (4, 4)$. All pairs have digits in $\{0, 1, 2, 3, 4\}$ — counting sort handles each digit in $\Theta(n)$.
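The decomposition is one `divmod` call per key. A minimal check of the examples above:

```python
# key = a*n + b with a = key // n and b = key % n, here n = 5.
n = 5
for key in [17, 24]:
    a, b = divmod(key, n)
    assert key == a * n + b and 0 <= b < n
    print(key, (a, b))   # 17 -> (3, 2), 24 -> (4, 4)
```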
The remaining question: how do you sort by two keys simultaneously? The answer is tuple sort.
Tuple Sort — Sorting Least Significant First
1. Sort by least significant key $k_c$ using a stable sort.
2. Sort by $k_{c-1}$ using a stable sort.
$\vdots$
c. Sort by most significant key $k_1$ using a stable sort.
Stability is essential: when sorting by $k_1$, items with equal $k_1$ must preserve their $k_2$-order from the previous round.
Why least-significant-first? Consider sorting $[(3,2), (0,3), (4,4), (4,2), (2,2)]$ (our base-5 example). Round 1 sorts stably by the second digit, giving $[(3,2), (4,2), (2,2), (0,3), (4,4)]$; the three digit-2 items keep their input order. Round 2 sorts stably by the first digit, giving $[(0,3), (2,2), (3,2), (4,2), (4,4)]$, which is exactly lexicographic order. Stability kept $(4,2)$ ahead of $(4,4)$ in round 2 because round 1 had already put it there.
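Python's built-in `sorted` is stable (it is Timsort), so the two rounds can be run directly on the base-5 pairs:

```python
# Tuple sort: two stable passes, least significant key first.
pairs = [(3, 2), (0, 3), (4, 4), (4, 2), (2, 2)]
round1 = sorted(pairs, key=lambda p: p[1])   # by least significant digit
round2 = sorted(round1, key=lambda p: p[0])  # then by most significant digit
assert round2 == sorted(pairs)               # equals lexicographic order
print(round1)
print(round2)
```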
Radix Sort — Linear Time for Bounded Integer Keys
Radix sort combines tuple sort with counting sort as the stable subroutine. For keys up to $u$, represent each key as a $c$-digit base-$n$ number where $c = \lceil \log_n u \rceil$. Sort each digit with counting sort ($\Theta(n)$ per digit). Total: $\Theta(cn)$.
If $u \leq n^c$ for some constant $c$, then $\Theta(cn) = \Theta(n)$ — linear time.
def radix_sort(A):
    """Sort A (non-negative integer keys) in Θ(cn) time, c = ceil(log_n(u))."""
    n = len(A)
    if n <= 1:
        return
    u = 1 + max(x.key for x in A)
    # number of base-n digits needed: smallest c with n**c >= u
    c = 1
    while n ** c < u:
        c += 1

    # Build digit-tuple objects
    class Wrapper:
        def __init__(self, item):
            self.item = item
            self.digits = []              # digits[0] = least significant
            high = item.key
            for _ in range(c):
                high, low = divmod(high, n)
                self.digits.append(low)

        # counting_sort reads .key
        @property
        def key(self):
            return self._cur_key

    wrappers = [Wrapper(x) for x in A]
    # Sort c times, from least to most significant digit
    for i in range(c):
        for w in wrappers:
            w._cur_key = w.digits[i]      # set digit i as current key
        counting_sort(wrappers)           # stable Θ(n) sort on this digit
    # Write back
    for i, w in enumerate(wrappers):
        A[i] = w.item
#include <vector>
#include <algorithm>
using namespace std;

// Radix sort for non-negative integers — Θ(cn) time
void radix_sort(vector<int>& A) {
    int n = (int)A.size();
    if (n < 2) return;
    int u = *max_element(A.begin(), A.end()) + 1;
    // number of base-n digits needed: smallest c with n^c >= u
    int c = 1;
    for (long long p = n; p < u; p *= n) ++c;
    vector<int> out(n);
    // Sort c times from least to most significant base-n digit
    long long base = 1;                   // long long: n^c may exceed int
    for (int d = 0; d < c; ++d, base *= n) {
        // Counting sort on digit d
        vector<int> count(n, 0);
        for (int x : A) count[(x / base) % n]++;            // count each digit
        for (int k = 1; k < n; ++k) count[k] += count[k-1]; // prefix sums
        // Traverse A in reverse to maintain stability
        for (int i = n - 1; i >= 0; --i) {
            int digit = (A[i] / base) % n;
            out[--count[digit]] = A[i];
        }
        A = out;
    }
    // Θ(cn) total. Linear when c is constant.
}
The Complete Sorting Picture
| Algorithm | Worst Case | Best Case | Space | In-place | Stable | Key Constraint |
|---|---|---|---|---|---|---|
| Selection Sort | $\Theta(n^2)$ | $\Theta(n^2)$ | $O(1)$ | Yes | No | Any comparable |
| Insertion Sort | $\Theta(n^2)$ | $\Theta(n)$ | $O(1)$ | Yes | Yes | Any comparable |
| Merge Sort | $\Theta(n \log n)$ | $\Theta(n \log n)$ | $O(n)$ | No | Yes | Any comparable |
| Counting Sort | $\Theta(n + u)$ | $\Theta(n + u)$ | $O(n + u)$ | No | Yes | Integer $\in [0, u)$ |
| Radix Sort | $\Theta(cn)$ | $\Theta(cn)$ | $O(n)$ | No | Yes | Integer $\leq n^c$ |
Counting sort and radix sort are the standouts here: they beat the comparison lower bound by exploiting integer key structure.
NIFTY weekly options have strikes from 22,000 to 26,000 in steps of 50 — exactly 81 distinct strikes. You have $n = 810$ option records (10 per strike: CE + PE × 5 expiries). Keys are integers in range $[22000, 26000]$. Should you use radix sort or merge sort?
$u = 26001$, $n = 810$. The ratio $u/n \approx 32$. Counting sort would allocate 26,001 chain slots to hold 810 items — wasteful but not catastrophic for a single sort at session open.
Shift keys to $[0, 4000]$ (subtract 22,000), so $u = 4001$. Radix sort in base $n = 810$: each key $k \leq 4000 < 810^2 = 656,100$, so $c = 2$ digits suffice. Two passes of counting sort: $2 \times \Theta(810) = \Theta(n)$.
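The shifted-key counting sort is a few lines. This is a sketch; the record fields (`strike`, `symbol`) are illustrative, not any broker's API:

```python
# Counting sort of option records by strike, keys shifted to [0, 4000].
records = [{"strike": 24000, "symbol": "CE1"},
           {"strike": 22050, "symbol": "PE1"},
           {"strike": 24000, "symbol": "CE2"}]
OFFSET, U = 22000, 4001
chains = [[] for _ in range(U)]
for r in records:
    chains[r["strike"] - OFFSET].append(r)   # shifted key in [0, 4000]
by_strike = [r for chain in chains for r in chain]
print([r["symbol"] for r in by_strike])      # PE1 first; CE1 before CE2 (stable)
```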
For $n = 810$, all three algorithms (merge sort $\Theta(n \log n) \approx 8{,}100$, counting sort $\Theta(n + u) \approx 4{,}800$, radix sort $\Theta(2n) \approx 1{,}620$) finish in microseconds — the difference is not perceptible. Use Python's sorted(chain, key=lambda x: x['strike']) (TimSort) and move on.
If you ever do need more speed, reach for numpy.sort with kind='stable' (which can dispatch to a radix-sort variant for integer dtypes) or write C++ with the implementation above. On integer arrays the radix path is typically 5–10× faster than comparison-based sorting of the same data. The boundary between "use the built-in" and "care about the algorithm" is around $10^6$ records on a modern machine.
You want to sort 10,000 NSE instrument tokens (32-bit integers, values up to $2^{32} \approx 4 \times 10^9$) using counting sort directly. Why is this impractical?
Counting sort allocates one chain slot per possible key value. With $u = 2^{32}$, the chain array has $4 \times 10^9$ entries. At 8 bytes per Python list pointer, that is 32 GB — to sort 10,000 items. The runtime is also $\Theta(n + u) = \Theta(4 \times 10^9)$ — about 4 seconds at $10^9$ ops/sec, compared to merge sort's $\Theta(n \log n) \approx 130{,}000$ operations.
The fix is radix sort: represent each 32-bit key as 4 base-256 digits ($256^4 = 2^{32}$). Four passes of counting sort with $u = 256$: $4 \times \Theta(n + 256) = \Theta(4n) = \Theta(40{,}000)$ operations. Practical.
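The four base-256 passes can be sketched with bucket lists (bit shifts extract each byte, since $256 = 2^8$):

```python
import random

# LSD radix sort of 32-bit keys: four stable base-256 passes.
def radix256(A):
    for shift in (0, 8, 16, 24):               # least significant byte first
        buckets = [[] for _ in range(256)]
        for x in A:
            buckets[(x >> shift) & 0xFF].append(x)
        A = [x for b in buckets for x in b]    # stable read-back
    return A

data = [random.randrange(2**32) for _ in range(1000)]
assert radix256(data) == sorted(data)
```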
Tuple sort sorts items first by the least significant key, then by more significant keys. Why does tuple sort fail if the subroutine sort is not stable?
After sorting by the least significant digit, the array is ordered by that digit. When we then sort by the next digit, items that tie on the next digit must preserve the order established in the previous round, because that order is exactly what breaks the tie. An unstable sort may arbitrarily reorder items with equal keys, destroying the previous round's work.

Concrete example: take $(4,2)$ and $(4,4)$, i.e. keys $22$ and $24$ in base 5. After round 1 (sort by units digit), $(4,2)$ correctly precedes $(4,4)$. In round 2 (sort by tens digit) the two items tie on tens digit 4; an unstable sort may output $(4,4)$ before $(4,2)$, placing $24$ ahead of $22$. Stability is not optional in tuple sort: it is the entire reason the algorithm works.
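To see this failure concretely, we can fake an unstable subroutine: reversing a list before a stable sort flips the order of tied items. This is a deliberate sabotage for illustration, not a real sorting algorithm:

```python
# Deliberately unstable pass: reversing first flips tie order.
def unstable_sorted(lst, key):
    return sorted(reversed(lst), key=key)

pairs = [(4, 4), (4, 2)]                      # keys 24 and 22 in base 5
round1 = sorted(pairs, key=lambda p: p[1])    # stable: (4,2) before (4,4)
bad = unstable_sorted(round1, key=lambda p: p[0])
assert bad == [(4, 4), (4, 2)]                # 24 placed before 22: wrong
good = sorted(round1, key=lambda p: p[0])
assert good == [(4, 2), (4, 4)]               # stable pass preserves round 1
```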
Apply radix sort (base 10) to the sequence $[329, 457, 657, 839, 436, 720, 355]$. After sorting by the units digit (first pass), what is the array?
Sort by units digit (0–9), stably:
Units digit 0: 720
Units digit 5: 355
Units digit 6: 436
Units digit 7: 457, 657 (in original order — stable)
Units digit 9: 329, 839 (in original order — stable)
Result: $[720, 355, 436, 457, 657, 329, 839]$. After pass 2 (tens digit): $[720, 329, 436, 839, 355, 457, 657]$. After pass 3 (hundreds digit): $[329, 355, 436, 457, 657, 720, 839]$ — fully sorted.
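The three passes above can be checked mechanically with a stable bucket pass per digit:

```python
# One stable counting-sort pass on base-10 digit d (0 = units).
def digit_pass(A, d):
    buckets = [[] for _ in range(10)]
    for x in A:
        buckets[(x // 10**d) % 10].append(x)
    return [x for b in buckets for x in b]

A = [329, 457, 657, 839, 436, 720, 355]
A = digit_pass(A, 0)
assert A == [720, 355, 436, 457, 657, 329, 839]   # after units pass
A = digit_pass(A, 1)
assert A == [720, 329, 436, 839, 355, 457, 657]   # after tens pass
A = digit_pass(A, 2)
assert A == [329, 355, 436, 457, 657, 720, 839]   # fully sorted
```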
Radix sort runs in $\Theta(cn)$ time where $c = \lceil \log_n u \rceil$. If we increase the base from $n$ to $n^2$, how does the runtime change?
With base $b$, radix sort does $c = \lceil \log_b u \rceil$ passes, each costing $\Theta(n + b)$. Total: $\Theta(c(n + b)) = \Theta(\log_b(u) \cdot (n + b))$.
With base $n$: $c = \log_n u$ passes, each $\Theta(n + n) = \Theta(n)$. Total: $\Theta(n \log_n u)$.
With base $n^2$: $c = \log_{n^2} u = \frac{\log_n u}{2}$ passes (half as many), each $\Theta(n + n^2) = \Theta(n^2)$. Total: $\Theta(\frac{\log_n u}{2} \cdot n^2) = \Theta(n^2 \log_n u)$ — actually worse.
The optimal base is $\Theta(n)$ — it minimises total work. Increasing the base reduces pass count but makes each pass more expensive. The $\Theta(cn)$ bound with base $n$ is the sweet spot: linear time when $c = O(1)$.
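The trade-off can be tabulated with the simple cost model $\text{passes} \times (n + b)$ (a rough model that ignores constants, used here only to compare bases):

```python
# passes(u, b): smallest c with b**c >= u, computed with integer arithmetic.
def passes(u, b):
    c, x = 0, 1
    while x < u:
        x *= b
        c += 1
    return c

def radix_cost(n, b, u):
    return passes(u, b) * (n + b)     # pass count times per-pass bucket work

n, u = 1024, 2**32
for b in (2, 16, n, n * n):
    print(b, passes(u, b), radix_cost(n, b, u))
```

For these parameters, base 2 pays for many cheap passes, base $n^2$ pays for one enormous bucket array, and base $n$ sits at the minimum.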
Work through independently. Q1–3 are direct applications. Q4–7 require synthesis. Q8–10 demand careful argument.