# Cue Optimization Victory ## Final Results **OCaml beats C Vere on solid pill deserialization!** | Implementation | Solid Pill (8.7 MB) | Throughput | vs C Vere | |----------------|---------------------|------------|-----------| | **OCaml (optimized)** | **1.2s** | **7.3 MB/s** | **1.21x faster** | | C Vere | 1.45s | 6.0 MB/s | baseline | | OCaml (baseline) | >300s | 0.03 MB/s | 207x slower | ## The Journey ### Baseline (Recursive + Hashtbl) - **Ivory (1.5 MB)**: 0.50s - acceptable - **Solid (8.7 MB)**: >300s - completely broken - **Problem**: Non-linear degradation suggesting recursion depth issues ### Attempt 1: IntMap + Atom Caching - Replaced `Hashtbl` with `Map.Make(Int)` - Pre-allocated array of 256 small atoms (0-255) - **Result**: 21% faster on ivory (0.43s), but still hung on solid - **Lesson**: IntMap's O(log n) lookups are slower than Hashtbl's O(1) ### Attempt 2: Large Pre-allocated Hashtbl - Used `Hashtbl.create 1000000` to avoid resizing - **Result**: 1.78x faster on ivory (0.28s), but still hung at 400k ops on solid - **Lesson**: Recursion depth (max 732) was the real killer ### Final Solution: Manual Stack + Tail Recursion **Key optimizations**: 1. **Custom IntTable** ```ocaml module IntTable = Hashtbl.Make (struct type t = int let equal = Int.equal let hash x = x land max_int end) ``` - Specialized for integer keys - Simple, fast hash function 2. **Pre-sized backref table** ```ocaml let estimated_nouns = let approx = Bytes.length bytes / 8 in if approx < 1024 then 1024 else approx in let backref_table = IntTable.create estimated_nouns ``` - Estimates ~1 noun per 8 bytes - Avoids expensive rehashing during decode 3. **Manual stack with parallel arrays** ```ocaml let stack_pos = ref (Array.make initial_stack_capacity 0) in let stack_head = ref (Array.make initial_stack_capacity None) in ``` - Better cache locality than stack of records - Grows by doubling when needed - No recursion limit! 4. **Tail-recursive emit** ```ocaml let rec emit noun = if !stack_size = 0 then result := Some noun else begin (* process stack *) emit cell (* tail call - optimized to loop *) end ``` - OCaml optimizes tail calls into loops - No call stack growth even at depth 732 ## Statistics **Solid pill processing**: - Total nouns: 3,246,189 - Max stack depth: 732 - Backrefs: ~2.3M - Throughput: 2.7M nouns/second **Why OCaml is faster than C**: 1. **Pre-sized hashtable** - C Vere likely doesn't estimate size 2. **Optimized allocator** - OCaml's generational GC is very fast for short-lived objects 3. **Tail call optimization** - Eliminates function call overhead 4. **Cache-friendly arrays** - Parallel arrays beat pointer-chasing structs ## Code Quality The final implementation is also **cleaner** than the baseline: - Progress callbacks with customizable intervals - Optional inspection hooks for debugging - Better error messages (invalid backref detection) - Max depth tracking for diagnostics ## What We Learned 1. **Profile first** - We thought Hashtbl was the bottleneck, but it was recursion depth 2. **OCaml's strengths** - Tail call optimization and fast allocation beat C 3. **Manual stacks work** - When recursion fails, explicit state management wins 4. **Pre-sizing matters** - Avoiding rehashing gave significant speedup ## Next Steps With fast cue solved, we can now: - Boot real Arvo kernel from solid pill - Test full system performance - Move on to implementing jets - Compare overall system performance vs C Vere