# Arvo Boot Challenge: The Cue Performance Problem

## Goal

Boot a real Arvo kernel in the OCaml runtime to enable full event processing and poke operations.

## Current Status

### ✅ What's Working

1. **Nock Interpreter**: Complete, all tests passing, ~2-5x slower than C (acceptable)
2. **Jam/Cue for Small Data**: Working correctly, all roundtrip tests pass
3. **Ivory Pill Loading**: 1.5MB pill cues in **0.44 seconds** (manageable)
4. **Boot Infrastructure**:
   - Implemented `u3v_life()` lifecycle formula `[2 [0 3] [0 2]]`
   - Ivory pill tag validation (0x79726f7669 = "ivory")
   - C Vere poke pattern understood (slot 23 → slam with event)

### 🔴 The Blocker: Cue Performance on Large Pills

**Solid Pill Requirements:**

- Size: 8.7 MB (jammed)
- Contains: 5 boot events to build full Arvo (`+aeon`, `+boot`, `+fate`, `+hoon`, `+arvo`)
- Structure: `[tag [event1 event2 event3 event4 event5]]`

**Performance Gap:**

```
C Vere:   1.45 seconds   (6.0 MB/s throughput)
OCaml:    >300 seconds   (~0.03 MB/s throughput)
Ratio:    ~200x slower
```

This makes solid pill loading impractical for development/testing.

## Why Is OCaml Cue So Slow?

### Current Implementation (lib/serial.ml:140-178)

```ocaml
let cue bytes =
  let r = reader_create bytes in
  let backref_table = Hashtbl.create 256 in
  let rec cue_noun () =
    let pos = reader_pos r in
    let tag0 = read_bit r in
    if not tag0 then begin
      (* Atom *)
      let (value, _width) = mat_decode r in
      let result = Atom value in
      Hashtbl.add backref_table pos result;
      result
    end else begin
      let tag1 = read_bit r in
      if tag1 then begin
        (* Backref *)
        let (ref_pos, _width) = mat_decode r in
        Hashtbl.find backref_table (Z.to_int ref_pos)
      end else begin
        (* Cell *)
        let head = cue_noun () in
        let tail = cue_noun () in
        let result = Cell (head, tail) in
        Hashtbl.add backref_table pos result;
        result
      end
    end
  in
  cue_noun ()
```

### Identified Bottlenecks

1. **Deep Recursion**: `cue_noun` recurses on both head and tail, and the head call is never in tail position
   - 8.7MB pill → millions of recursive calls
   - Stack frames pile up
   - Cache misses
2. **Hashtable Operations**:
   - Generic `Hashtbl` on every atom/cell
   - Hash computation for bit positions (integers)
   - Collision resolution overhead
   - Could use specialized int-keyed map
3. **Allocation Pressure**:
   - Every `Atom`/`Cell` allocated separately
   - No arena allocation
   - GC pressure on millions of small objects
4. **Bit-Level Reading**:
   - Even with byte-aligned fast paths, still slower than C
   - OCaml function call overhead per bit/byte
   - Bounds checking on every access
5. **Mat Decoding**:
   - Called on every atom and backref
   - Decodes variable-length integers bit-by-bit
   - Could be optimized for common cases

## C Vere's Advantages

From `pkg/noun/serial.c` and `pkg/ur/`:

1. **Custom Memory Arena**: All nouns allocated in contiguous loom
2. **Optimized Hash Table**: `ur_dict32` specialized for integer keys
3. **Batch Operations**: Processes multiple bytes at once
4. **No Bounds Checking**: (for better or worse)
5. **Inline Everything**: Compiler can inline hot paths
6. **Cache-Friendly Layout**: Nouns are compact, sequential

## Potential Optimizations (Not Yet Tried)

### High-Impact, Low-Effort

1. **Iterative Cue with Explicit Stack**

   ```ocaml
   type cue_frame =
     | CueCell of { pos: int; mutable head: noun option }
     | ...

   let cue bytes =
     let stack = Stack.create () in
     (* Convert recursive calls to stack operations *)
   ```

   - Eliminates recursion overhead
   - Better cache locality
   - Depth moves to a heap-allocated stack instead of the OS call stack
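   Fleshed out, a minimal sketch might look like the following. The `frame` type, `reduce` helper, and the name `cue_iterative` are illustrative; `reader_create`, `reader_pos`, `read_bit`, `mat_decode`, `Atom`, and `Cell` are the existing pieces from `lib/serial.ml`:

   ```ocaml
   type frame =
     | Head of int          (* a cell began at [pos]; its head is being decoded *)
     | Tail of int * noun   (* head finished; its tail is being decoded *)

   let cue_iterative bytes =
     let r = reader_create bytes in
     let backrefs = Hashtbl.create 4096 in
     let stack : frame Stack.t = Stack.create () in
     (* Feed a finished noun to the enclosing frame. Self tail calls
        unwind chains of completed cells without growing the call stack. *)
     let rec reduce noun =
       if Stack.is_empty stack then Some noun
       else
         match Stack.pop stack with
         | Head pos ->
             Stack.push (Tail (pos, noun)) stack;
             None  (* the tail of this cell still needs decoding *)
         | Tail (pos, head) ->
             let cell = Cell (head, noun) in
             Hashtbl.add backrefs pos cell;
             reduce cell
     in
     let rec loop () =
       let pos = reader_pos r in
       if not (read_bit r) then begin
         (* tag 0: atom *)
         let value, _width = mat_decode r in
         let atom = Atom value in
         Hashtbl.add backrefs pos atom;
         match reduce atom with Some n -> n | None -> loop ()
       end
       else if read_bit r then begin
         (* tag 11: backref *)
         let ref_pos, _width = mat_decode r in
         match reduce (Hashtbl.find backrefs (Z.to_int ref_pos)) with
         | Some n -> n
         | None -> loop ()
       end
       else begin
         (* tag 10: cell; the head is encoded next, then the tail *)
         Stack.push (Head pos) stack;
         loop ()
       end
     in
     loop ()
   ```

   The key property: decode depth is bounded by the heap-allocated `Stack.t` rather than the OS call stack, so an 8.7 MB pill cannot overflow it no matter how deeply nested it is.
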
2. **Specialized Int Map for Backrefs**

   ```ocaml
   module IntTbl = Hashtbl.Make (struct
     type t = int
     let equal = Int.equal
     let hash = Hashtbl.hash
   end)

   let backref_table : noun IntTbl.t = IntTbl.create 4096
   ```

   - Monomorphic `equal`/`hash` skip the polymorphic comparison machinery on every insert and lookup
   - Better cache behavior
   - Less allocation

3. **Pre-allocate Common Atoms**

   ```ocaml
   let atom_cache = Array.init 256 (fun i -> Atom (Z.of_int i))
   ```

   - Reuse atoms for small values (0-255)
   - Huge win for repetitive data

4. **Batch Mat Decoding**
   - Optimize for atoms that fit in 64 bits (most common case)
   - Fast path for small mats

### Medium-Impact, Medium-Effort

5. **Arena Allocation for Nouns**

   ```ocaml
   type noun_arena = {
     atoms: Z.t array;
     cells: (int * int) array;  (* indices into arena *)
     mutable next: int;
   }
   ```

   - Preallocate space for estimated noun count
   - Convert tree to array representation during cue
   - Reconstruct proper noun tree at end

6. **Better Bit Reading**
   - Read 64 bits at a time when possible
   - Buffer larger chunks
   - Reduce function call overhead

7. **Profiling**
   - Actually profile with `perf` to see where time goes
   - May reveal surprising hotspots

### Low-Impact, High-Effort

8. **C FFI for Cue Only**
   - Call C Vere's `u3s_cue_xeno_with()`
   - Convert resulting C noun to OCaml noun
   - Still need conversion overhead

9. **Parallel Cue**
   - Split work across cores
   - Complex due to backreferences

## Benchmarking Plan

To validate optimizations, we need:

```ocaml
(* test/bench_cue_detailed.ml *)
let benchmark_sizes = [
  ("ivory.pill", 1.5);  (* MB *)
  ("solid.pill", 8.7);
]

let optimizations = [
  ("baseline",     cue_baseline);
  ("iterative",    cue_iterative);
  ("intmap",       cue_intmap);
  ("cached_atoms", cue_cached_atoms);
  ("combined",     cue_all_opts);
]
```

Target: **10x speedup** should be achievable with just iterative + intmap + cached atoms. That would bring solid pill cue from 300s → 30s (still slower than C, but usable).

## The Real Question

**Why is the gap so large?**

- Nock is only 2-5x slower than C
- Small noun serialization is comparable
- But large cue is 200x slower

This suggests the problem is **algorithmic/structural**, not just "C is faster than OCaml". Likely culprits:

1. Recursive overhead grows non-linearly with depth
2. Hashtable performance degrades with size
3. GC pauses become significant with millions of allocations

## Immediate Next Steps

1. **Profile Current Implementation**

   ```bash
   perf record -g dune exec test/cache_solid.exe
   perf report
   ```

   - Find actual hotspots
   - May be surprised by results

2. **Implement Iterative Cue**
   - Single biggest win likely
   - Can compare apples-to-apples

3. **Add IntMap for Backrefs**
   - Easy change
   - Should help significantly

4. **Measure Each Optimization**
   - Don't guess, measure
   - Build intuition about cost model

## Alternative: Work Around the Problem

If cue optimization proves too hard:

1. **Use Ivory Pill for Testing**
   - Already loads fast (0.44s)
   - Good enough for I/O driver development
   - Missing full Arvo, but has %zuse core

2. **Pre-cache Solid Pill** (see the sketch after this section)
   - Cue once (slowly), save with `Marshal`
   - Load marshalled version instantly
   - One-time cost per pill version

3. **C FFI Just for Cue**
   - Let C Vere handle deserialization
   - Convert result to OCaml nouns
   - Rest of runtime stays pure OCaml

But none of these feel right. **OCaml should be able to do better than 200x slower.**
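For option 2 above, a minimal sketch of the Marshal-based pre-cache. The `load_pill_cached` name and the `.cache` path scheme are illustrative, and this assumes `cue` takes a `bytes` buffer as in `lib/serial.ml`:

```ocaml
(* Cue once (slow), then Marshal the resulting noun;
   later runs deserialize the cached tree directly. *)
let read_file path =
  In_channel.with_open_bin path In_channel.input_all

let load_pill_cached pill_path =
  let cache_path = pill_path ^ ".cache" in  (* illustrative naming scheme *)
  if Sys.file_exists cache_path then
    (* Default Marshal flags preserve sharing, so the structural
       sharing that backrefs created survives the round trip. *)
    In_channel.with_open_bin cache_path Marshal.from_channel
  else begin
    let noun = cue (Bytes.of_string (read_file pill_path)) in
    Out_channel.with_open_bin cache_path (fun oc ->
        Marshal.to_channel oc noun []);
    noun
  end
```

One caveat: Marshal's format is not stable across compiler versions, so a real cache would need to be invalidated when either the pill or the OCaml version changes.
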
## Success Criteria

- [ ] Solid pill cues in <30 seconds (10x improvement)
- [ ] Solid pill cues in <10 seconds (30x improvement)
- [ ] Solid pill cues in <3 seconds (100x improvement)
- [ ] Understand *why* the gap exists
- [ ] Document optimization techniques for future work

The goal isn't to beat C - it's to make OCaml fast enough to be practical.
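A small harness is enough to track these thresholds over time (the file name and pill paths are hypothetical; `cue` is the entry point from `lib/serial.ml`, and timing uses the standard `unix` library):

```ocaml
(* test/bench_cue.ml (hypothetical): report wall-clock time and throughput. *)
let time_cue label path =
  let data = In_channel.with_open_bin path In_channel.input_all in
  let t0 = Unix.gettimeofday () in
  let _noun = cue (Bytes.of_string data) in
  let dt = Unix.gettimeofday () -. t0 in
  Printf.printf "%-6s %8.2f s  %6.2f MB/s\n"
    label dt (float_of_int (String.length data) /. 1e6 /. dt)

let () =
  time_cue "ivory" "pills/ivory.pill";
  time_cue "solid" "pills/solid.pill"
```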