Advanced Chapter 13 of 13 · Algorithm Design and Analysis

Cryptography & Cache-Oblivious Algorithms

Two seemingly unrelated topics united by one theme: algorithms designed for a specific model of the world. Cryptography builds algorithms that are easy to compute but hard to reverse without a secret key, deriving security from mathematical hardness. Cache-oblivious algorithms are designed for the memory hierarchy without knowing the cache size, achieving an optimal number of memory transfers automatically on any machine.

Part I — Cryptography

Hash Functions

A cryptographic hash function $h: \{0,1\}^* \to \{0,1\}^d$ maps arbitrary-length inputs to fixed-length outputs. It is deterministic, public, and should behave like a random function (random oracle model). Common functions and output sizes: MD5 (128 bits, broken), SHA-1 (160 bits, weakened), SHA-256 (256 bits), SHA-512 (512 bits).

Desirable properties of cryptographic hash functions:
1. One-way (preimage resistance): given $y$, hard to find $x$ with $h(x) = y$. Cost: $O(2^d)$.
2. Strong collision resistance: hard to find any $x \neq x'$ with $h(x) = h(x')$. Cost via birthday paradox: $O(2^{d/2})$.
3. Weak collision resistance (2nd preimage): given $x$, hard to find $x' \neq x$ with $h(x) = h(x')$.
4. Pseudorandomness: output indistinguishable from uniform random.
5. Non-malleability: given $h(x)$, hard to compute $h(f(x))$ for any $f$.

Implications: $2 \Rightarrow 3$ (strong $\Rightarrow$ weak collision). $1 \not\Rightarrow 2$, $2 \not\Rightarrow 1$ (independent properties).
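The $O(2^{d/2})$ collision bound is easy to observe experimentally. A minimal sketch, using SHA-256 truncated to $d = 24$ bits (the truncation is an assumption made here purely so the attack finishes instantly):

```python
import hashlib

def trunc_hash(x: bytes, bits: int = 24) -> int:
    """SHA-256 truncated to `bits` bits — a toy d-bit hash for demonstration."""
    return int.from_bytes(hashlib.sha256(x).digest(), "big") >> (256 - bits)

def birthday_collision(bits: int = 24):
    """Hash counters until any two inputs collide: ~2^(bits/2) attempts expected."""
    seen = {}
    i = 0
    while True:
        h = trunc_hash(str(i).encode(), bits)
        if h in seen:
            return seen[h], i          # two distinct inputs with the same hash
        seen[h] = i
        i += 1

x1, x2 = birthday_collision(24)
print(f"inputs {x1} and {x2} collide after ~{x2} hashes (~2^12 = 4096 expected)")
```

A preimage attack on the same 24-bit hash would need on the order of $2^{24} \approx 16$M attempts; the birthday attack finds a collision in a few thousand.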

Symmetric-Key Encryption

Alice and Bob share a secret key $k$. Encryption: $c = e_k(m)$. Decryption: $m = d_k(c)$. Examples: AES (Advanced Encryption Standard), DES, RC5. The core challenge: how do Alice and Bob agree on $k$ without ever meeting?

Diffie-Hellman Key Exchange

Let $p$ be a large prime, $g$ a generator of $\mathbb{F}_p^*$. Both public.

Alice picks secret $a$, sends $g^a \bmod p$. Bob picks secret $b$, sends $g^b \bmod p$.
Alice computes $(g^b)^a = g^{ab} \bmod p$. Bob computes $(g^a)^b = g^{ab} \bmod p$.
Shared key $k = g^{ab} \bmod p$. An eavesdropper sees $g^a$ and $g^b$ but cannot compute $g^{ab}$ (assuming the Discrete Logarithm Problem is hard).

Vulnerability: man-in-the-middle attack. Eve intercepts both $g^a$ and $g^b$, establishing separate keys with Alice and Bob. Authentication is needed (public-key certificates).

RSA Public-Key Encryption

Key generation: pick large primes $p, q$. Set $N = pq$, $\phi = (p-1)(q-1)$. Choose $e$ with $\gcd(e, \phi) = 1$ (e.g., $e = 65537$). Compute $d$ with $ed \equiv 1 \pmod{\phi}$ via extended Euclidean algorithm. Public key: $(N, e)$. Private key: $(d, p, q)$.

Encryption: $c = m^e \bmod N$. Decryption: $m = c^d \bmod N$.

Correctness: $c^d = (m^e)^d = m^{ed} = m^{1+k\phi} \equiv m \pmod N$ (by Fermat’s little theorem for each prime factor). This works because $ed \equiv 1 \pmod \phi$ means $ed = 1 + k(p-1)(q-1)$.
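The correctness claim can be checked exhaustively at toy scale — a sketch with illustrative primes $p = 61$, $q = 53$ (far too small for real security):

```python
p, q, e = 61, 53, 65537          # toy primes — never use sizes like this in practice
N, phi = p * q, (p - 1) * (q - 1)
d = pow(e, -1, phi)              # Python 3.8+: pow with exponent -1 is the modular inverse

# m^(ed) ≡ m (mod N) for EVERY residue m — including those with gcd(m, N) > 1,
# because the congruence holds separately mod p and mod q (trivially when p | m).
assert all(pow(m, e * d, N) == m for m in range(N))
print(f"N={N}, d={d}: m^(ed) ≡ m (mod N) verified for all m in [0, N)")
```

Note the check covers $\gcd(m, N) > 1$: RSA decryption is correct for all residues because $N = pq$ is square-free, not only for $m$ coprime to $N$.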

Security: relies on the hardness of factoring $N = pq$ (given $N$, find $p, q$) and the RSA problem (given $e$ and $c = m^e$, find $m$). Primality testing is in P (AKS 2004), but factoring remains believed hard.

Digital Signatures and MACs

             | Confidentiality               | Integrity
-------------|-------------------------------|-------------------------------
Symmetric    | Private-key encryption (AES)  | MAC (keyed SHA-3)
Asymmetric   | Public-key encryption (RSA)   | Digital signatures (RSA-sign)

Digital signature: $\sigma = \text{Sign}(\text{sk}, m)$; anyone can verify with $\text{Verify}(\text{pk}, m, \sigma)$. RSA signature: $\sigma = m^d \bmod N$, verify $\sigma^e \equiv m \pmod N$. Naively broken by multiplicative homomorphism: $\sigma_1 \sigma_2 = (m_1 m_2)^d$ is a valid signature of $m_1 m_2$. Fix: hash-then-sign ($\sigma = h(m)^d \bmod N$).
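The multiplicative forgery is concrete even at toy scale (illustrative primes $p = 61$, $q = 53$; real keys are 2048+ bits):

```python
def sign(m, d, N):
    return pow(m, d, N)                  # textbook (unhashed) RSA signature

def verify(m, s, e, N):
    return pow(s, e, N) == m % N

p, q, e = 61, 53, 65537
N, phi = p * q, (p - 1) * (q - 1)
d = pow(e, -1, phi)

s1, s2 = sign(12, d, N), sign(5, d, N)   # two legitimately signed messages
forged = (s1 * s2) % N                   # attacker multiplies the two signatures...
print(verify(12 * 5, forged, e, N))      # True — a valid signature for 60, never signed
```

Hash-then-sign defeats this: $\sigma_1 \sigma_2$ then verifies only against $h(m_1)\,h(m_2) \bmod N$, which is not $h(m')$ for any message $m'$ the attacker can exhibit.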

MAC (Message Authentication Code): symmetric integrity. $\sigma = \text{MAC}(k, m)$; verify by recomputing. Standard: HMAC, a nested construction $h\big((k \oplus \text{opad}) \,\|\, h((k \oplus \text{ipad}) \,\|\, m)\big)$; the nesting protects Merkle–Damgård hashes (SHA-1/SHA-2) against length-extension attacks. With SHA-3, the simple prefix $h(k \| m)$ is already a secure MAC. Unlike a plain hash, a MAC requires knowledge of $k$ to forge.

Merkle Trees

Problem: Alice stores $n$ files on an untrusted server. How to verify integrity with $O(1)$ trusted local storage?

Merkle tree: binary tree where each leaf stores $h(x_i)$, each internal node stores $h(\text{left child} \| \text{right child})$. Alice stores only the root hash $\sigma_\text{root}$ locally (1 hash = $O(1)$ trusted storage).

Verification of file $x_i$: Alice downloads $x_i$ and the $O(\log n)$ sibling hashes on the path to the root. She recomputes the path and checks against $\sigma_\text{root}$. If $x_i$ was modified, the leaf hash changes and propagates up — unless a hash collision exists.

Update: modify $x_i$, recompute the $O(\log n)$ hashes on the path to root. Update $\sigma_\text{root}$.

Used in: Bitcoin/blockchain (transaction integrity), git (commit object hashes), certificate transparency logs, and any distributed system needing integrity of large datasets.

NP-Completeness and Cryptography

NP-hardness is about worst-case complexity. Cryptography needs average-case hardness — the problem must be hard for typical instances, not just contrived worst cases. This is why most NP-complete problems are poor bases for cryptography: 3-colorability is NP-complete but easy on random graphs (average-case trivial). The Merkle-Hellman knapsack system was broken using lattice basis reduction (low-density knapsacks). Modern cryptography is based on number-theoretic assumptions (discrete log, factoring) that appear hard on average for properly chosen parameters.

RSA key generation + Merkle tree + Diffie-Hellman demo
import hashlib, math, random

# ── RSA KEY GENERATION (educational, use cryptography library in prod) ──
def extended_gcd(a, b):
    """Extended Euclidean: returns (gcd, x, y) with a*x + b*y = gcd."""
    if a == 0: return b, 0, 1
    g, x, y = extended_gcd(b % a, a)
    return g, y - (b//a)*x, x

def mod_inverse(e, phi):
    """Compute d = e^(-1) mod phi via extended Euclidean."""
    g, d, _ = extended_gcd(e % phi, phi)
    if g != 1: raise ValueError("No inverse — gcd != 1")
    return d % phi

def rsa_keygen(p, q, e=65537):
    """
    RSA key generation from two primes p, q.
    Public key: (N, e). Private key: (d, p, q).
    Security relies on difficulty of factoring N = p*q.
    """
    N   = p * q
    phi = (p - 1) * (q - 1)
    assert math.gcd(e, phi) == 1, "e must be coprime to phi"
    d   = mod_inverse(e, phi)
    return (N, e), (d, p, q)   # (public_key, private_key)

def rsa_encrypt(m, public_key):
    N, e = public_key
    return pow(m, e, N)         # c = m^e mod N

def rsa_decrypt(c, private_key):
    d, p, q = private_key
    N = p * q
    return pow(c, d, N)         # m = c^d mod N

# Example (use large primes in practice!)
pk, sk = rsa_keygen(61, 53)
m   = 42
c   = rsa_encrypt(m, pk)
m2  = rsa_decrypt(c, sk)
print(f"Original: {m}, Ciphertext: {c}, Decrypted: {m2}")  # 42


# ── MERKLE TREE ──────────────────────────────────────────────────────
class MerkleTree:
    """
    Merkle tree for integrity verification of n data blocks.
    Trusted storage: O(1) — just the root hash.
    Verification of block i: O(log n) — path to root.
    Update of block i: O(log n) — recompute path.
    """
    def __init__(self, blocks):
        """blocks: list of bytes objects."""
        self.n      = len(blocks)
        self.leaves = [hashlib.sha256(b).hexdigest() for b in blocks]
        self.tree   = self._build(self.leaves)

    def _hash_pair(self, left, right):
        return hashlib.sha256((left+right).encode()).hexdigest()

    def _build(self, layer):
        if len(layer) == 1: return layer
        if len(layer) % 2 == 1: layer = layer + [layer[-1]]  # duplicate last
        next_layer = [self._hash_pair(layer[i], layer[i+1])
                      for i in range(0, len(layer), 2)]
        return self._build(next_layer) + layer

    @property
    def root(self):
        return self.tree[0]  # root hash (store locally — O(1))

    def proof(self, i):
        """Generate Merkle proof: O(log n) sibling hashes."""
        idx   = len(self.tree) - self.n + i
        path  = []
        while idx > 0:
            sibling = idx + 1 if idx % 2 == 1 else idx - 1  # heap order: odd index = left child
            if sibling < len(self.tree):
                path.append((sibling, self.tree[sibling]))
            idx = (idx - 1) // 2
        return path

    def verify(self, block, i, proof, root):
        """Verify block i against stored root hash. O(log n)."""
        current = hashlib.sha256(block).hexdigest()
        idx     = len(self.tree) - self.n + i
        for sib_idx, sib_hash in proof:
            if sib_idx == idx - 1:  # sibling is left
                current = self._hash_pair(sib_hash, current)
            else:
                current = self._hash_pair(current, sib_hash)
            idx = (idx - 1) // 2
        return current == root

# Example: 4 files
files  = [b"file0", b"file1", b"file2", b"file3"]
tree   = MerkleTree(files)
print(f"Root hash: {tree.root[:16]}...")

proof  = tree.proof(1)
valid  = tree.verify(b"file1", 1, proof, tree.root)
print(f"Proof valid: {valid}")           # True

tampered = tree.verify(b"TAMPERED", 1, proof, tree.root)
print(f"Tampered detected: {not tampered}")  # True


# ── DIFFIE-HELLMAN KEY EXCHANGE ───────────────────────────────────────
def diffie_hellman(p, g):
    """
    Diffie-Hellman key exchange over Z_p^*.
    p: large prime, g: generator.
    Returns (alice_secret, bob_secret, shared_key).
    Security: discrete logarithm problem in Z_p^*.
    """
    a = random.randint(2, p-2)  # Alice's secret
    b = random.randint(2, p-2)  # Bob's secret
    A = pow(g, a, p)            # Alice sends g^a mod p
    B = pow(g, b, p)            # Bob sends g^b mod p
    alice_key = pow(B, a, p)    # Alice computes (g^b)^a = g^ab mod p
    bob_key   = pow(A, b, p)    # Bob computes (g^a)^b = g^ab mod p
    assert alice_key == bob_key
    return a, b, alice_key

p  = 23       # toy prime (use 2048-bit primes in practice)
g  = 5        # generator
_, _, k = diffie_hellman(p, g)
print(f"DH shared key: {k}")
// RSA + Diffie-Hellman + Merkle tree (conceptual sketch)
// Production: use OpenSSL, libsodium, or Botan
#include <string>
#include <vector>
#include <cstdint>
using namespace std;

// Modular exponentiation: base^exp mod m — O(log exp)
uint64_t mod_pow(uint64_t base, uint64_t exp, uint64_t mod) {
    uint64_t result = 1;
    base %= mod;
    while (exp > 0) {
        if (exp & 1) result = (__uint128_t)result * base % mod;
        base = (__uint128_t)base * base % mod;
        exp >>= 1;
    }
    return result;
}

// Extended GCD: returns gcd(a,b), sets x,y s.t. ax+by=gcd
int64_t ext_gcd(int64_t a, int64_t b, int64_t& x, int64_t& y) {
    if (a==0){x=0;y=1;return b;}
    int64_t x1,y1;
    int64_t g = ext_gcd(b%a, a, x1, y1);
    x = y1 - (b/a)*x1; y = x1;
    return g;
}

// Mod inverse: e^(-1) mod phi
int64_t mod_inv(int64_t e, int64_t phi) {
    int64_t x, y;
    ext_gcd(e, phi, x, y);
    return (x % phi + phi) % phi;
}

struct RSAKey {
    uint64_t N, e;  // public
    uint64_t d;     // private
};

RSAKey rsa_keygen(uint64_t p, uint64_t q, uint64_t e=65537) {
    uint64_t N   = p * q;
    uint64_t phi = (p-1)*(q-1);
    return {N, e, (uint64_t)mod_inv(e, phi)};
}

uint64_t rsa_enc(uint64_t m, uint64_t N, uint64_t e) { return mod_pow(m,e,N); }
uint64_t rsa_dec(uint64_t c, uint64_t N, uint64_t d) { return mod_pow(c,d,N); }

// Merkle tree root (simplified — real impl uses SHA-256)
string merkle_root(vector<string> hashes) {
    while (hashes.size() > 1) {
        if (hashes.size()%2) hashes.push_back(hashes.back());
        vector<string> next;
        for (size_t i=0;i<hashes.size();i+=2)
            next.push_back("H(" + hashes[i] + "|" + hashes[i+1] + ")");
        hashes = next;
    }
    return hashes[0];
}

Part II — Cache-Oblivious Algorithms

Memory Hierarchy and the I/O Model

Modern computers have a hierarchy: CPU registers (bytes, <1ns) → L1 cache (~10KB, ~1ns) → L2 (~100KB, ~10ns) → L3 (~MB, ~10–40ns) → RAM (~GB, ~100ns) → disk (~TB, ~10ms). Each level is larger and slower. Memory is transferred in blocks of $B$ words. The key cost metric: the number of memory transfers (block reads/writes), not raw operations.

External Memory (EM) model: two-level memory (cache of size $M$, disk of infinite size). Cache has $M/B$ blocks of $B$ words each. Algorithm explicitly reads/writes blocks. Count: memory transfers $MT(N)$.

Cache-oblivious (CO) model: same as EM but the algorithm does NOT know $B$ or $M$. Memory access to any word automatically fetches the containing block (LRU eviction if cache is full). Algorithm must achieve good $MT(N)$ without knowing the cache parameters. An algorithm that works optimally in CO model works optimally on any machine automatically.

LRU Is Near-Optimal

Why LRU? Sleator-Tarjan 1985: $\text{LRU}_M \leq 2 \cdot \text{OPT}_{M/2}$. Partition the block-access sequence into maximal phases of $M/B$ distinct blocks. LRU uses $\leq M/B$ transfers per phase. OPT with $M/2$ cache must use $\geq M/(2B)$ per phase (needs at least half the blocks fresh each phase). So LRU with $M$ is within factor 2 of OPT with $M/2$.
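The competitive bound can be checked empirically by simulating LRU against Belady's offline-optimal policy (evict the block needed furthest in the future). The trace and cache sizes below are arbitrary choices for the demo:

```python
import random
from collections import OrderedDict

def lru_transfers(accesses, capacity):
    """Count block fetches under LRU with `capacity` cache lines."""
    cache, misses = OrderedDict(), 0
    for b in accesses:
        if b in cache:
            cache.move_to_end(b)               # mark most-recently-used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)      # evict least-recently-used
        cache[b] = True
    return misses

def opt_transfers(accesses, capacity):
    """Belady's offline optimum: evict the block whose next use is furthest away."""
    INF = float("inf")
    nxt, last = [INF] * len(accesses), {}
    for i in range(len(accesses) - 1, -1, -1):  # precompute next-use index of each access
        nxt[i] = last.get(accesses[i], INF)
        last[accesses[i]] = i
    cache, misses = {}, 0                       # block -> index of its next use
    for i, b in enumerate(accesses):
        if b not in cache:
            misses += 1
            if len(cache) >= capacity:
                del cache[max(cache, key=cache.get)]
        cache[b] = nxt[i]
    return misses

random.seed(1)
seq = [random.randrange(64) for _ in range(10_000)]
lru_M, opt_half = lru_transfers(seq, 32), opt_transfers(seq, 16)
print(f"LRU(M=32): {lru_M}  OPT(M/2=16): {opt_half}  ratio: {lru_M / opt_half:.2f}")
```

Sleator–Tarjan guarantees $\text{LRU}_M \leq 2 \cdot \text{OPT}_{M/2}$; on this random trace the observed ratio is comfortably below 2.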

Scanning: $O(N/B + 1)$

Sequential scan of $N$ elements: $\Theta(N/B + 1)$ transfers. The $+1$ covers the case $N < B$. Cannot do better (must read each element at least once). Cache-oblivious: can’t align with block boundary, so gets $\lceil N/B \rceil + 1 = O(N/B + 1)$ — same asymptotically.
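A two-line simulation confirms both the bound and the $+1$ for misalignment ($B = 8$ is an arbitrary block size for the demo):

```python
def scan_transfers(start: int, N: int, B: int) -> int:
    """Number of distinct B-word blocks touched by scanning N consecutive words."""
    return len({(start + i) // B for i in range(N)})

B = 8
print(scan_transfers(0, 100, B))   # 13 = ceil(100/8): scan aligned to a block boundary
print(scan_transfers(5, 100, B))   # 14 = ceil(100/8) + 1: misaligned start costs one extra block
```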

Cache-Oblivious Median Finding

Apply the Median of Medians algorithm (Ch1) and analyse its memory transfers. Let $MT(N)$ be the transfers needed.

Steps: (1) partition into groups of 5 [free — already in cache], (2) sort each group [scan: $O(N/B+1)$], (3) recursive median of $N/5$ medians [$MT(N/5)$], (4) partition by median [3 parallel scans: $O(N/B+1)$], (5) recurse on $\leq 7N/10$ elements [$MT(7N/10)$].

Recurrence: $MT(N) = MT(N/5) + MT(7N/10) + O(N/B + 1)$.
Base case: $MT(O(B)) = O(1)$ (fits in $O(1)$ blocks).
Solution: $MT(N) = O(N/B + 1)$ — optimal (must read all elements).
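One can check numerically that the recurrence is linear in $N/B$ ($B = 64$ is an arbitrary block size here; since $1/5 + 7/10 < 1$, the scan term dominates and the constant tends to $\frac{1}{1 - 1/5 - 7/10} = 10$):

```python
from functools import lru_cache

B = 64

@lru_cache(maxsize=None)
def MT(N: int) -> float:
    """Numerically evaluate MT(N) = MT(N/5) + MT(7N/10) + N/B + 1."""
    if N <= B:
        return 1.0                 # base case: input fits in O(1) blocks
    return MT(N // 5) + MT(7 * N // 10) + N / B + 1

for N in (10**4, 10**5, 10**6):
    print(f"N={N:>8}  MT(N)={MT(N):>10.0f}  MT(N)/(N/B)={MT(N) / (N / B):.2f}")
# The ratio MT(N)/(N/B) approaches 10 = 1/(1 - 1/5 - 7/10) from below:
# MT(N) = Theta(N/B), as claimed.
```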

Cache-Oblivious Matrix Multiplication

Naive matrix multiplication ($N^3$ operations): the inner loop reads one row of $X$ and one column of $Y$ per output element. If $Y$ is stored in row-major order, reading column $j$ jumps $N$ words per step, so each of its $N$ entries is a separate cache miss: $O(N)$ misses per output element and $O(N^3)$ in total. Very poor locality.

Cache-oblivious blocked multiplication: recursively divide $N\times N$ matrices into $N/2 \times N/2$ submatrices. Multiply via 8 recursive calls of size $N/2$ and add results. $$MT(N) = 8 \cdot MT(N/2) + O(N^2/B + 1)$$ Base case: when all three matrices fit in cache ($3N^2 \leq M$, i.e., $N = O(\sqrt{M})$): $MT(O(\sqrt{M})) = O(M/B)$ (one scan fills the cache).

Recursion tree: cost per level is geometrically increasing $N^2/B, 8(N/2)^2/B, \ldots$ — leaves dominate. Number of leaves when $N = O(\sqrt{M})$: $(N/\sqrt{M})^3 = O(N^3/M^{3/2})$. Cost per leaf: $O(M/B)$.

Total: $MT(N) = O\!\left(\frac{N^3}{B\sqrt{M}}\right)$.

This matches the information-theoretic lower bound for cache-oblivious matrix multiply — optimal.

Cache-Oblivious Search: van Emde Boas Layout

Binary search on $N$ sorted elements: naive layout gives $O(\log N)$ transfers (each comparison likely a cache miss). B-tree gives $O(\log_B N)$ but requires knowing $B$.

van Emde Boas (vEB) tree layout [Prokop 1999]: recursively lay out a complete BST by splitting at the middle level of edges. Each half-tree occupies a contiguous memory region. Crucially: when the recursion reaches a subtree of $\leq B$ nodes, it occupies $\leq 2$ blocks of memory.

Search cost: a root-to-leaf path of length $\log N$ crosses $O(\log N / \log B) = O(\log_B N)$ of these size-$\leq B$ subtrees (each has height $\geq \frac{1}{2}\log B$). Each such subtree occupies $\leq 2$ blocks. Total: $O(\log_B N)$ memory transfers — matching B-tree performance without knowing $B$!

This extends to dynamic B-trees with $O(\log_B N)$ insert/delete (cache-obliviously).

Cache-Oblivious Sorting

Optimal sorting in the EM model: $\Theta\!\left(\frac{N}{B}\log_{M/B}\frac{N}{B}\right)$ transfers. Binary mergesort is cache-oblivious but not optimal:

Binary mergesort: merge is 3 parallel scans.
$MT(N) = 2 \cdot MT(N/2) + O(N/B + 1)$, $MT(M) = O(M/B)$.
Solution: $MT(N) = O\!\left(\frac{N}{B}\log_2\frac{N}{M}\right)$ — better than naive, but a factor $\approx \log(M/B)$ worse than the optimal $\frac{N}{B}\log_{M/B}\frac{N}{B}$.

$M/B$-way mergesort (optimal in EM): split into $M/B$ subarrays, recursively sort, merge $M/B$-way (keeping one block per run in cache).
$MT(N) = O\!\left(\frac{N}{B}\log_{M/B}\frac{N}{B}\right)$ — optimal.

Cache-oblivious sorting (Funnelsort): with the tall cache assumption $M = \Omega(B^{1+\varepsilon})$, an $N^\varepsilon$-way mergesort using recursive “funnel” merges achieves the optimal $O\!\left(\frac{N}{B}\log_{M/B}\frac{N}{B}\right)$ transfers without knowing $B$ or $M$.
Cache-oblivious matrix multiply + vEB tree layout + transfer counting
import math

# ── CACHE-OBLIVIOUS MATRIX MULTIPLICATION ────────────────────────────
# Track memory transfers (simulated with block access counting)
class CacheSimulator:
    """Simplified cache simulator to count block transfers."""
    def __init__(self, B=8, M=64):
        self.B, self.M = B, M
        self.cache = {}     # block_id -> last_access_time
        self.time  = 0
        self.transfers = 0

    def access(self, addr):
        """Access a word at address; fetch block if not in cache."""
        block_id = addr // self.B
        if block_id not in self.cache:
            self.transfers += 1
            if len(self.cache) >= self.M // self.B:
                # LRU eviction
                lru = min(self.cache, key=self.cache.get)
                del self.cache[lru]
            self.cache[block_id] = self.time
        else:
            self.cache[block_id] = self.time
        self.time += 1


def co_matmul(X, Y, B=8, M=64):
    """
    Cache-oblivious matrix multiplication via recursive subdivision.
    No explicit blocking — the recursion naturally achieves cache-efficiency.
    MT(N) = O(N^3 / (B * sqrt(M))) — optimal.

    For demonstration: pure Python recursive; NumPy is much faster in practice.
    """
    n = len(X)
    if n <= 1:
        return [[X[0][0] * Y[0][0]]]

    # Divide matrices into four n/2 x n/2 quadrants
    h = n // 2
    A11, A12 = [r[:h] for r in X[:h]], [r[h:] for r in X[:h]]
    A21, A22 = [r[:h] for r in X[h:]], [r[h:] for r in X[h:]]
    B11, B12 = [r[:h] for r in Y[:h]], [r[h:] for r in Y[:h]]
    B21, B22 = [r[:h] for r in Y[h:]], [r[h:] for r in Y[h:]]

    def add(P, Q):
        return [[P[i][j]+Q[i][j] for j in range(len(P[0]))]
                for i in range(len(P))]

    # 8 recursive multiplications (like standard divide-and-conquer)
    C11 = add(co_matmul(A11,B11), co_matmul(A12,B21))
    C12 = add(co_matmul(A11,B12), co_matmul(A12,B22))
    C21 = add(co_matmul(A21,B11), co_matmul(A22,B21))
    C22 = add(co_matmul(A21,B12), co_matmul(A22,B22))

    # Combine quadrants
    result = [[0]*n for _ in range(n)]
    for i in range(h):
        result[i][:h]   = C11[i]
        result[i][h:]   = C12[i]
        result[h+i][:h] = C21[i]
        result[h+i][h:] = C22[i]
    return result


# ── VAN EMDE BOAS TREE LAYOUT ─────────────────────────────────────────
def veb_layout(sorted_arr):
    """
    Lay out a sorted array (n = 2^h - 1 elements, viewed as a perfect BST)
    in van Emde Boas order: top subtree first, then each bottom subtree,
    all recursively. Binary search on this layout costs O(log_B N)
    memory transfers without knowing B — cache-oblivious.
    """
    n = len(sorted_arr)
    if n <= 2:
        return list(sorted_arr)

    h = (n + 1).bit_length() - 1       # height of the BST: n = 2^h - 1
    assert (1 << h) - 1 == n, "demo assumes a perfect tree: n = 2^h - 1"
    h_top = h // 2
    h_bot = h - h_top
    top_size = (1 << h_top) - 1        # nodes in the top subtree
    bot_size = (1 << h_bot) - 1        # nodes in each bottom subtree

    # In sorted order the array interleaves bottom subtrees with top-tree keys:
    #   bot_0, t_0, bot_1, t_1, ..., t_{k-2}, bot_{k-1}   with k = 2^h_top
    top_keys, bot_subtrees = [], []
    for i in range(top_size + 1):
        start = i * (bot_size + 1)
        bot_subtrees.append(sorted_arr[start:start + bot_size])
        if i < top_size:
            top_keys.append(sorted_arr[start + bot_size])

    result = veb_layout(top_keys)      # top subtree first...
    for sub in bot_subtrees:
        result += veb_layout(sub)      # ...then each bottom subtree, left to right
    return result


# ── BINARY MERGESORT TRANSFER ANALYSIS ───────────────────────────────
def mergesort_transfers(N, B, M):
    """
    Estimate memory transfers for binary mergesort.
    MT(N) = O(N/B * log2(N/M)) — a log(M/B) factor above the
    optimal N/B * log_{M/B}(N/B).
    """
    if N <= M:
        return N / B    # fits in cache: O(N/B) to load
    # log2(N/M) levels of merging above the cache size, each costs N/B
    levels = math.log2(N / M)
    return N / B * levels

import math
N, B, M = 1_000_000, 64, 8192
print(f"Binary mergesort transfers: {mergesort_transfers(N,B,M):.0f}")
opt = N/B * math.log(N/B, M/B)
print(f"Optimal: {opt:.0f}")
print(f"Ratio: {mergesort_transfers(N,B,M)/opt:.2f}")
// Cache-oblivious matrix multiply — O(N^3 / (B*sqrt(M)))
#include <vector>
#include <algorithm>
using namespace std;
using Matrix = vector<vector<double>>;

// In-place C += A * B (for submatrices specified by offsets)
void co_matmul(const Matrix& A, const Matrix& B, Matrix& C,
               int ra, int ca, int rb, int cb, int rc, int cc, int n) {
    if (n == 1) {
        C[rc][cc] += A[ra][ca] * B[rb][cb];
        return;
    }
    int h = n / 2;
    // 8 recursive calls — recursion achieves cache efficiency
    // When submatrices fit in cache, stops incurring misses
    co_matmul(A,B,C, ra,   ca,   rb,   cb,   rc,   cc,   h); // C11 += A11*B11
    co_matmul(A,B,C, ra,   ca+h, rb+h, cb,   rc,   cc,   h); // C11 += A12*B21
    co_matmul(A,B,C, ra,   ca,   rb,   cb+h, rc,   cc+h, h); // C12 += A11*B12
    co_matmul(A,B,C, ra,   ca+h, rb+h, cb+h, rc,   cc+h, h); // C12 += A12*B22
    co_matmul(A,B,C, ra+h, ca,   rb,   cb,   rc+h, cc,   h); // C21 += A21*B11
    co_matmul(A,B,C, ra+h, ca+h, rb+h, cb,   rc+h, cc,   h); // C21 += A22*B21
    co_matmul(A,B,C, ra+h, ca,   rb,   cb+h, rc+h, cc+h, h); // C22 += A21*B12
    co_matmul(A,B,C, ra+h, ca+h, rb+h, cb+h, rc+h, cc+h, h); // C22 += A22*B22
}
// MT(N) = 8*MT(N/2) + O(N^2/B)
// Base: MT(sqrt(M)) = O(M/B)
// Total: O(N^3 / (B*sqrt(M))) — optimal cache-oblivious bound
Crypto in quantitative finance: HMAC for trade message authentication
import hmac, hashlib, json, time

# ── HMAC FOR TRADE MESSAGE AUTHENTICATION ────────────────────────────
def sign_order(order_dict, secret_key: bytes) -> str:
    """
    HMAC-SHA256 authentication of a trade order message.
    Used in XTS API, FIX protocol, and most broker APIs.

    Properties:
    - Correctness: same key always produces same MAC
    - Unforgeability: without secret_key, cannot produce valid MAC
    - Unlike plain hash: adversary who knows MAC(k, m1) cannot
      compute MAC(k, m2) for any new m2 (security from secret key)
    """
    message = json.dumps(order_dict, sort_keys=True).encode()
    mac = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    return mac

def verify_order(order_dict, received_mac: str, secret_key: bytes) -> bool:
    """
    Verify HMAC. Use hmac.compare_digest to prevent timing attacks.
    Timing attack: if we return early on first mismatch, an attacker
    can deduce the correct MAC one byte at a time by measuring response time.
    """
    expected = sign_order(order_dict, secret_key)
    return hmac.compare_digest(expected, received_mac)  # constant-time compare!


# Example: XTS-style order authentication
secret_key = b'ttt-secret-2025-xts-api-key'

order = {
    'order_id':   'TTT-20250327-001',
    'instrument': 'NIFTY25MAR24000CE',
    'quantity':   50,
    'price':      245.50,
    'side':       'BUY',
    'timestamp':  int(time.time()),
}
mac = sign_order(order, secret_key)
print(f"Order MAC (first 16 chars): {mac[:16]}...")

# Verify legitimate order
print(f"Legitimate order valid: {verify_order(order, mac, secret_key)}")

# Tampered order (e.g., price changed by attacker)
tampered_order = dict(order, price=1.0)   # attacker changes price
print(f"Tampered order valid: {verify_order(tampered_order, mac, secret_key)}")
# False — HMAC detects tampering


# ── COMMITMENT SCHEME ─────────────────────────────────────────────────
import os
def commit(value: str) -> tuple:
    """
    Cryptographic commitment: hide value v until reveal.
    Used in: sealed-bid auctions, fair coin flips, zero-knowledge proofs.
    commit(v) = (commitment=h(r||v), randomness=r) — HIDING: h(r||v) leaks nothing
    reveal(v, r): check commitment = h(r||v) — BINDING: can't find v' with h(r||v')=h(r||v)
    """
    r = os.urandom(32)   # 256 bits of randomness (prevents brute force)
    commitment = hashlib.sha256(r + value.encode()).hexdigest()
    return commitment, r

def reveal(value: str, randomness: bytes, commitment: str) -> bool:
    """Verify that commitment was to value."""
    expected = hashlib.sha256(randomness + value.encode()).hexdigest()
    return hmac.compare_digest(expected, commitment)

# TTT use case: sealed-bid order submission
bid_price = "24500"
c, r = commit(bid_price)
print(f"Committed to bid. Commitment: {c[:16]}...")
# ... later, when all bids are submitted ...
print(f"Reveal valid: {reveal(bid_price, r, c)}")       # True
print(f"Cheat attempt: {reveal('23000', r, c)}")        # False
// HMAC-SHA256 for order authentication (conceptual)
// Production: use OpenSSL HMAC_SHA256 or libsodium crypto_auth
#include <string>
#include <cstdio>          // sprintf
#include <openssl/hmac.h>  // link with -lssl -lcrypto
using namespace std;

string hmac_sha256(const string& key, const string& msg) {
    unsigned char result[32];
    unsigned int len = 32;
    HMAC(EVP_sha256(),
         key.data(), key.size(),
         (unsigned char*)msg.data(), msg.size(),
         result, &len);
    string hex;
    char buf[3];
    for (int i=0;i<32;i++) { sprintf(buf,"%02x",result[i]); hex+=buf; }
    return hex;
}

// Constant-time comparison (prevent timing attacks)
bool secure_compare(const string& a, const string& b) {
    if (a.size() != b.size()) return false;
    unsigned char diff = 0;
    for (size_t i=0;i<a.size();i++) diff |= a[i]^b[i];
    return diff == 0;
}
Worked Example · Cache-Oblivious vs Naïve Matrix Multiply on Large Vol Surface 🇮🇳 Pashupati / TTT

Pashupati’s options analytics system computes implied vol surface factors daily: $V = A \cdot B$ where $A$ is a $512 \times 512$ matrix of factor loadings and $B$ is a $512 \times 512$ matrix of factor returns. Compare naïve vs cache-oblivious performance.

Memory transfer analysis for $N=512$, $B=64$ words, $M=4096$ words
import math

N, B, M = 512, 64, 4096   # matrix size, block size (words), cache (words)

# Naïve row × column: column of Y has N=512 elements, stride N apart
# A column of Y has stride N, so every column read is a miss: N^3 misses in the worst case.
# We use the generous best case (both operands perfectly streamed): N^3/B transfers.
naive_transfers = N**3 / B
print(f"Naïve (best case):    {naive_transfers:,.0f} transfers")

# Cache-oblivious recursive: MT(N) = O(N^3 / (B * sqrt(M)))
co_transfers = N**3 / (B * math.sqrt(M))
print(f"Cache-oblivious:      {co_transfers:,.0f} transfers")
print(f"Speedup factor:       {naive_transfers / co_transfers:.1f}×")

# sqrt(M) = sqrt(4096) = 64: speedup = 64x for this config
# For N=1024, M=16384, B=64: speedup = sqrt(16384) = 128x

# Memory transfer lower bound: Omega(N^3 / (B*sqrt(M)))
lb = N**3 / (B * math.sqrt(M))
print(f"Lower bound:          {lb:,.0f} transfers")
print(f"CO achieves optimal: {abs(co_transfers - lb) < 1}")  # True
Result: for $N=512$, $B=64$, $M=4096$: even the generous naïve bound is $N^3/B \approx 2.1$ million transfers, while the cache-oblivious algorithm needs $N^3/(B\sqrt{M}) \approx 33$K — a $\sqrt{M} = 64\times$ improvement, and provably optimal. For Pashupati’s daily Greeks calculation on 512-strike surfaces, this means the working set stays in cache instead of spilling to RAM, reducing wall-clock time from ~seconds to ~milliseconds.
Cryptography in Indian fintech: SEBI mandates encrypted communication for all broker API traffic. XTS API uses HMAC-SHA256 for order authentication (same structure as our demo above). The Diffie-Hellman key exchange is the basis of TLS 1.3 (used by every HTTPS connection to NSE/BSE). Merkle trees underpin the audit trails required by SEBI’s data localisation rules — each trade record is part of a Merkle tree whose root is published to prevent retroactive modification.

Cache-oblivious algorithms in finance: large covariance matrix operations (factor models, risk attribution) are memory-bandwidth-bound. NumPy/BLAS uses blocked matrix multiplication internally — essentially the cache-oblivious algorithm above with explicit block sizes tuned for the target CPU. For custom operations (e.g., Heston calibration with $500\times500$ implied vol surfaces), cache-oblivious divide-and-conquer implementations outperform naïve loops by $10\text{–}100\times$ on modern CPUs due to cache efficiency.

Practice Problems
4 questions · Chapter 13
prog / daa / ch13 / q01 ★ Hash Properties

Why can a birthday attack find a collision in a $d$-bit hash function in $O(2^{d/2})$ operations, not $O(2^d)$?

A
Because the hash function has a backdoor that allows finding collisions in $O(2^{d/2})$
B
By the birthday paradox: among $O(2^{d/2})$ random inputs, a collision is likely because pairwise collision probability sums to $O(1)$ — we’re finding any collision, not a collision with a specific target
C
Because we can enumerate all $2^d$ outputs in $O(2^{d/2})$ using a sorted lookup table
D
The $O(2^d)$ bound is for preimage; collision always takes $O(2^{d/2})$ by definition
Answer: B — the birthday paradox.

Preimage resistance (find $x$ with $h(x) = y$ for a specific $y$) requires $O(2^d)$ random tries. Collision resistance (find any $x \neq x'$ with $h(x) = h(x')$) only requires $O(2^{d/2})$ random inputs. Why? For $k$ random inputs, there are $\binom{k}{2} \approx k^2/2$ pairs. Each pair collides with probability $1/2^d$. Expected collisions $\approx k^2/(2 \cdot 2^d)$. Setting this to 1: $k \approx 2^{d/2}$. This is the birthday paradox: with $2^{d/2}$ people, there’s a ~50% chance two share a birthday (with $2^d$ possible birthdays). SHA-256 targets $2^{128}$ collision resistance (not $2^{256}$) precisely because of this.
prog / daa / ch13 / q02 ★★ RSA Correctness

RSA decryption computes $c^d \bmod N$ where $c = m^e \bmod N$ and $ed \equiv 1 \pmod{(p-1)(q-1)}$. Why does $m^{ed} \equiv m \pmod N$?

A
Because $ed = 1$, so $m^{ed} = m^1 = m$
B
$ed = 1 + k(p-1)(q-1)$, so $m^{ed} = m \cdot (m^{p-1})^{k(q-1)} \equiv m \cdot 1^{k(q-1)} \equiv m \pmod p$ by Fermat; similarly mod $q$; by CRT, $m^{ed} \equiv m \pmod{pq}$
C
Because $N$ is prime, so Fermat’s little theorem applies directly
D
It only works when $\gcd(m, N) = 1$; for other $m$, RSA fails
Answer: B — Fermat’s little theorem + Chinese Remainder Theorem.

Since $ed \equiv 1 \pmod{(p-1)(q-1)}$, write $ed = 1 + k(p-1)(q-1)$. By Fermat’s little theorem for prime $p$: $m^{p-1} \equiv 1 \pmod p$ (when $\gcd(m,p)=1$). So $m^{ed} = m^{1+k(p-1)(q-1)} = m \cdot (m^{p-1})^{k(q-1)} \equiv m \cdot 1 \equiv m \pmod p$. Similarly $m^{ed} \equiv m \pmod q$. When $\gcd(m,p)=p$ (i.e., $p \mid m$): both sides are $0 \bmod p$. By CRT (since $p,q$ coprime): $m^{ed} \equiv m \pmod{N=pq}$. This is why RSA uses two primes: to apply Fermat’s theorem to each prime factor of $N$.
prog / daa / ch13 / q03 ★★ Cache-Oblivious Matrix Multiply

The cache-oblivious matrix multiplication achieves $O(N^3 / (B\sqrt{M}))$ transfers. Why does the base case $MT(O(\sqrt{M})) = O(M/B)$ give this bound, and why is this optimal?

A
Because the recursion tree has $O(N^3/M^{3/2})$ leaves, each costing $O(1)$, giving $O(N^3/M^{3/2})$
B
When $N = O(\sqrt{M})$, three $N\times N$ matrices fit in cache ($3N^2 \leq M$): load them once in $O(M/B)$ transfers. The recursion tree has $(N/\sqrt{M})^3$ leaves, each costing $O(M/B)$: total $O(N^3/M^{3/2}) \cdot O(M/B) = O(N^3/(B\sqrt{M}))$. Optimal by the Hong–Kung lower bound: a cache of $M$ words admits at most $O(M^{3/2})$ multiply-adds per $O(M)$ words transferred
C
Because the recursion terminates when $N = B$, giving base cost $O(B^3/B)= O(B^2)$
D
The $\sqrt{M}$ factor comes from the $\sqrt{M}$ registers in the CPU
Answer: B.

The recursion bottoms out when three $N\times N$ matrices fit in cache: $3N^2 \leq M \Rightarrow N = O(\sqrt{M/3}) = O(\sqrt{M})$. At that point, load all three matrices once: $3N^2/B = O(M/B)$ transfers. Then all $N^3$ multiplications are done from cache (0 additional transfers).

Number of leaf sub-problems: the recursion tree has depth $\log_2(N/\sqrt{M})$ and branching factor 8, giving $(N/\sqrt{M})^{\log_2 8} = (N/\sqrt{M})^3$ leaves. Total: $(N/\sqrt{M})^3 \cdot O(M/B) = O(N^3 M^{-3/2} M B^{-1}) = O(N^3 / (B\sqrt{M}))$.

Optimality: by the Hong–Kung lower bound, any schedule with a cache of $M$ words can perform at most $O(M^{3/2})$ of the $N^3$ multiply-adds per $O(M)$ words moved into cache, so $\Omega(N^3/\sqrt{M})$ words must cross the cache boundary; dividing by the block size $B$ gives $\Omega(N^3/(B\sqrt{M}))$ transfers. The cache-oblivious algorithm matches this bound.
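The recursion itself is short. Below is a plain-Python sketch of the 8-way divide-and-conquer (illustrative only: it shows the recursion shape; the transfer bound additionally requires stopping at a base case of side $\Theta(\sqrt{M})$ and storing blocks with spatial locality, e.g. in a recursive Z-order layout):

```python
# Divide-and-conquer matrix multiply: C += A @ B on n x n blocks,
# n a power of two. Each call splits into the 8 half-size products
# of the 2x2 block decomposition -- the recursion tree analysed above.
def matmul_rec(A, B, C, n, ai=0, aj=0, bi=0, bj=0, ci=0, cj=0):
    if n == 1:                          # base case (real code: n ~ sqrt(M))
        C[ci][cj] += A[ai][aj] * B[bi][bj]
        return
    half = n // 2
    for di in (0, half):                # row half of C (and of A)
        for dj in (0, half):            # column half of C (and of B)
            for dk in (0, half):        # inner dimension: 2 products/quadrant
                matmul_rec(A, B, C, half,
                           ai + di, aj + dk,
                           bi + dk, bj + dj,
                           ci + di, cj + dj)

N = 4
A = [[i * N + j for j in range(N)] for i in range(N)]
B = [[1 if i == j else 0 for j in range(N)] for i in range(N)]  # identity
C = [[0] * N for _ in range(N)]
matmul_rec(A, B, C, N)
assert C == A    # multiplying by the identity returns A
```

Note the algorithm never mentions $M$ or $B$: the recursion refines the problem until some level happens to fit in whatever cache the machine has, which is the cache-oblivious idea in miniature.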
prog / daa / ch13 / q04 ★★★ NP-hardness vs Cryptographic Hardness

Why is NP-completeness an unreliable basis for cryptography, even though it guarantees hardness in some sense?

A
NP-complete problems can always be solved in polynomial time with approximation algorithms
B
NP-completeness guarantees worst-case hardness (hard for some instances), but cryptography needs average-case hardness (hard for random instances chosen as keys). An NP-complete problem may be easy on almost all instances used as keys
C
NP-complete problems are always in P for special input structures used in cryptography
D
Cryptography only uses problems from P, not NP-complete problems
Answer: B — worst-case vs average-case hardness.

NP-completeness says: there exist instances of the problem that are hard (no polynomial algorithm solves all of them, assuming P $\neq$ NP). But cryptography encrypts with a random key $k$ drawn from some distribution. The adversary tries to break a randomly generated instance. For cryptography to be secure, we need the problem to be hard for random instances, not just worst-case instances.

Example: 3-colorability is NP-complete, but a random $G(n, p)$ with $p \gg \log n/n$ is almost surely not 3-colorable at all. Typical random instances therefore look nothing like the worst case; using 3-colorability for encryption would hand the adversary instances it can dismiss immediately.

The Merkle-Hellman knapsack failed for exactly this reason: the knapsack instances it generates have low density (a special structure), and Lagarias-Odlyzko showed low-density knapsacks are easy. Modern cryptography bases security on problems believed to be average-case hard: discrete log, factoring, lattice problems.
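To make the adversary’s advantage concrete, here is a toy sketch (the graphs and the brute-force checker are illustrative): for dense random graphs with constant edge probability, a $K_4$ subgraph appears with high probability, and finding one is a polynomial-time certificate that no 3-coloring exists, even though 3-colorability is NP-complete in the worst case.

```python
from itertools import combinations, product

def is_3colorable(n, edges):
    """Brute-force check: try all 3^n colorings (fine only for toy n)."""
    return any(all(col[u] != col[v] for u, v in edges)
               for col in product(range(3), repeat=n))

def contains_k4(n, edges):
    """A K4 subgraph is a poly-time certificate of non-3-colorability."""
    E = {frozenset(e) for e in edges}
    return any(all(frozenset(pair) in E for pair in combinations(quad, 2))
               for quad in combinations(range(n), 4))

k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]  # complete graph K4
c5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]          # 5-cycle

assert contains_k4(4, k4) and not is_3colorable(4, k4)
assert not contains_k4(5, c5) and is_3colorable(5, c5)
```

The worst-case instances that make the problem NP-complete are rare and structured; an encryption scheme that samples instances at random has no way to land on them.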

Terminal Questions — Chapter 13 · 10 problems · No answers given

Work through independently. Q1–3 direct application. Q4–7 synthesis. Q8–10 careful argument.

1
Implement RSA with toy primes $p=61$, $q=53$, $e=17$. Find $N$, $\phi$, $d$ using the extended Euclidean algorithm. Encrypt and decrypt the message $m=42$. Verify $m^{ed} \equiv m \pmod N$ using Fermat’s little theorem mod $p$ and mod $q$ separately. Easy
2
Build a Merkle tree for 8 data blocks (use any hash function). Verify block 3 using only the 3 sibling hashes on its path to the root. Show all intermediate hash computations. Then simulate a tampering attack on block 5 and show it is detected. Easy
3
Compute the memory transfers for a naïve scan of $N=10^6$ elements with block size $B=64$ words and cache $M=8192$ words. Now compute transfers for binary mergesort and cache-oblivious matrix multiply of a $100\times100$ matrix. Which algorithm benefits most from cache-awareness? Easy
4
Diffie-Hellman with $p=23$, $g=5$: simulate Alice choosing $a=6$ and Bob choosing $b=15$. Compute $g^a \bmod p$, $g^b \bmod p$, and the shared key $g^{ab} \bmod p$. Then simulate a man-in-the-middle attack with Eve choosing $e=4$: what does Eve compute, and what do Alice and Bob each think the shared key is? Medium
5
Implement the van Emde Boas tree layout for $n=15$ elements (a full binary tree of height 3). Show the memory layout, then simulate a binary search for element 11 and count cache misses with $B=2$ (block size 2). Compare to binary search in sorted-array order. Medium
6
Analyse the memory transfers of naïve $N\times N$ matrix-vector multiply (dense matrix, dense vector). Show that if the matrix is stored in row-major order, transfers $= O(N^2/B + N)$. Now show that if the matrix is stored in column-major order and we access row by row, transfers $= O(N^2)$ (no spatial locality). Medium
7
The RSA signature scheme $\sigma = h(m)^d \bmod N$ is more secure than $\sigma = m^d \bmod N$. Show specifically that hashing prevents (a) the multiplicative forgery attack $(m_1 m_2, \sigma_1 \sigma_2)$, and (b) the “choose-signature-first” attack (pick $\sigma^*$, compute $m^* = (\sigma^*)^e$). Why does collision resistance of $h$ matter? Medium
8
Prove that binary mergesort achieves $MT(N) = O\!\left(\frac{N}{B}\log_{M/B}\frac{N}{M}\right)$. Start from $MT(N) = 2 \cdot MT(N/2) + O(N/B)$ with base case $MT(M) = O(M/B)$. Unroll the recursion, sum the geometric series, and identify the number of recursion levels. Hard
9
Prove the information-theoretic lower bound for comparison-based sorting in the EM model: any algorithm must perform $\Omega\!\left(\frac{N}{B}\log_{M/B}\frac{N}{B}\right)$ memory transfers. Hint: consider the number of distinct outputs (sorted permutations) and the information revealed per transfer ($B$ comparisons per transfer, $M/B$-way branching). Hard
10
The LRU bound $\text{LRU}_M \leq 2 \cdot \text{OPT}_{M/2}$ implies that cache-oblivious algorithms designed for LRU are within a factor 2 of optimal. Explain why this means the $O(N^3/(B\sqrt{M}))$ cache-oblivious matrix multiply bound also holds for the offline optimal algorithm (with $M/2$ cache). Is the factor-2 gap significant in practice? Why or why not? Hard