diff --git a/ocaml/ARVO_CHALLENGE.md b/ocaml/ARVO_CHALLENGE.md
new file mode 100644
index 0000000..6ce7585
--- /dev/null
+++ b/ocaml/ARVO_CHALLENGE.md
@@ -0,0 +1,273 @@
+# Arvo Boot Challenge: The Cue Performance Problem
+
+## Goal
+
+Boot a real Arvo kernel in the OCaml runtime to enable full event processing and poke operations.
+
+## Current Status
+
+### ✅ What's Working
+
+1. **Nock Interpreter**: Complete, all tests passing, ~2-5x slower than C (acceptable)
+2. **Jam/Cue for Small Data**: Working correctly, all roundtrip tests pass
+3. **Ivory Pill Loading**: 1.5MB pill cues in **0.44 seconds** (manageable)
+4. **Boot Infrastructure**:
+   - Implemented `u3v_life()` lifecycle formula `[2 [0 3] [0 2]]` (sketched after this list)
+ - Ivory pill tag validation (0x79726f7669 = "ivory")
+ - C Vere poke pattern understood (slot 23 → slam with event)
+
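+A minimal OCaml sketch of that lifecycle step, mirroring C Vere's `u3v_life()`; the `noun` constructors and a `Nock.nock subject formula` entry point are assumed names from this runtime:
+
+```ocaml
+(* [2 [0 3] [0 2]]: *[a 2 b c] = *[*[a b] *[a c]], so the head of the
+   event list becomes the formula and the tail becomes the subject. *)
+let life_formula =
+  Cell (Atom (Z.of_int 2),
+        Cell (Cell (Atom Z.zero, Atom (Z.of_int 3)),
+              Cell (Atom Z.zero, Atom (Z.of_int 2))))
+
+let life events = Nock.nock events life_formula
+```
+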
+### 🔴 The Blocker: Cue Performance on Large Pills
+
+**Solid Pill Requirements:**
+- Size: 8.7 MB (jammed)
+- Contains: 5 boot events to build full Arvo (`+aeon`, `+boot`, `+fate`, `+hoon`, `+arvo`)
+- Structure: `[tag [event1 event2 event3 event4 event5]]`
+
+**Performance Gap:**
+```
+C Vere: 1.45 seconds (6.0 MB/s throughput)
+OCaml: >300 seconds (~0.03 MB/s throughput)
+Ratio: ~200x slower
+```
+
+This makes solid pill loading impractical for development/testing.
+
+## Why Is OCaml Cue So Slow?
+
+### Current Implementation (lib/serial.ml:140-178)
+
+```ocaml
+let cue bytes =
+ let r = reader_create bytes in
+ let backref_table = Hashtbl.create 256 in
+
+ let rec cue_noun () =
+ let pos = reader_pos r in
+ let tag0 = read_bit r in
+
+ if not tag0 then begin
+ (* Atom *)
+ let (value, _width) = mat_decode r in
+ let result = Atom value in
+ Hashtbl.add backref_table pos result;
+ result
+ end else begin
+ let tag1 = read_bit r in
+ if tag1 then begin
+ (* Backref *)
+ let (ref_pos, _width) = mat_decode r in
+ Hashtbl.find backref_table (Z.to_int ref_pos)
+ end else begin
+ (* Cell *)
+ let head = cue_noun () in
+ let tail = cue_noun () in
+ let result = Cell (head, tail) in
+ Hashtbl.add backref_table pos result;
+ result
+ end
+ end
+ in
+ cue_noun ()
+```
+
+### Identified Bottlenecks
+
+1. **Deep Recursion**: neither call in the cell case is in tail position, since `Cell (head, tail)` is built after both return
+   - 8.7 MB pill → millions of nested calls; stack depth tracks tree depth
+   - Stack frames pile up, risking `Stack_overflow` on deep structures
+   - Poor cache locality as the native stack grows
+
+2. **Hashtable Operations**:
+   - Generic `Hashtbl` touched on every atom and cell
+   - Polymorphic `Hashtbl.hash` recomputed for plain-int bit positions
+   - Collision resolution plus resize/rehash overhead as the table grows
+   - Could use a specialized int-keyed table
+
+3. **Allocation Pressure**:
+ - Every `Atom`/`Cell` allocated separately
+ - No arena allocation
+ - GC pressure on millions of small objects
+
+4. **Bit-Level Reading**:
+ - Even with byte-aligned fast paths, still slower than C
+ - OCaml function call overhead per bit/byte
+ - Bounds checking on every access
+
+5. **Mat Decoding**:
+ - Called on every atom and backref
+ - Decodes variable-length integers bit-by-bit
+   - Could be optimized for common cases (see the fast-path sketch after this list)
+
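+For context, `mat` writes an atom as a unary length-of-length, then the length with its top bit implicit, then the value bits (LSB first). A hedged sketch of a fast path for the common case, assuming reader helpers `count_zeros`, `read_bits` (int-sized), and `read_bits_z` (arbitrary width) exist or are added:
+
+```ocaml
+(* Decode one mat-encoded atom. count_zeros consumes the zero run and
+   its terminating 1 bit; read_bits returns up to 62 bits as an int. *)
+let mat_decode_fast r =
+  let z = count_zeros r in
+  if z = 0 then Z.zero                       (* a lone 1 bit encodes 0 *)
+  else begin
+    (* the length's most-significant bit is implicit *)
+    let len = read_bits r (z - 1) lor (1 lsl (z - 1)) in
+    if len <= 62
+    then Z.of_int (read_bits r len)          (* fast path: no bignum *)
+    else read_bits_z r len                   (* slow path: Z.t of len bits *)
+  end
+```
+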
+## C Vere's Advantages
+
+From `pkg/noun/serial.c` and `pkg/ur/`:
+
+1. **Custom Memory Arena**: All nouns allocated in contiguous loom
+2. **Optimized Hash Table**: `ur_dict32` specialized for integer keys
+3. **Batch Operations**: Processes multiple bytes at once
+4. **No Bounds Checking**: (for better or worse)
+5. **Inline Everything**: Compiler can inline hot paths
+6. **Cache-Friendly Layout**: Nouns are compact, sequential
+
+## Potential Optimizations (Not Yet Tried)
+
+### High-Impact, Low-Effort
+
+1. **Iterative Cue with Explicit Stack**
+   ```ocaml
+   (* One frame per cell under construction; full sketch after this list. *)
+   type frame =
+     | Head of int          (* cell starting at bit [pos]; head pending *)
+     | Tail of int * noun   (* head decoded; tail pending *)
+   ```
+   - Replaces native-stack recursion with a heap-allocated stack (no `Stack_overflow`)
+   - Better cache locality
+   - Stack depth bounded by tree depth, not call count
+
+2. **Specialized Int-Keyed Table for Backrefs**
+   ```ocaml
+   (* Monomorphic int keys: skips polymorphic Hashtbl.hash entirely. *)
+   module IntTbl = Hashtbl.Make (struct
+     type t = int
+     let equal = Int.equal
+     let hash x = x land max_int
+   end)
+   ```
+   - Avoids the generic polymorphic hash on every insert and lookup
+   - Monomorphic `equal` and `hash` can be inlined
+   - A pure `Map.Make(Int)` is worth benchmarking too, but trades O(1) for O(log n)
+
+3. **Pre-allocate Common Atoms**
+   ```ocaml
+   let atom_cache = Array.init 256 (fun i -> Atom (Z.of_int i))
+   (* In cue's atom branch: reuse the boxed atom when the value is small. *)
+   let make_atom z =
+     if Z.fits_int z && Z.to_int z < 256 then atom_cache.(Z.to_int z) else Atom z
+   ```
+   - Reuses one shared `Atom` per small value (0-255) instead of reallocating
+   - Likely a big win for repetitive data such as tags and small opcodes
+
+4. **Batch Mat Decoding**
+ - Optimize for atoms that fit in 64 bits (most common case)
+ - Fast path for small mats
+
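+A fuller sketch of the iterative decoder from item 1, reusing the reader helpers already in `lib/serial.ml` (`reader_create`, `reader_pos`, `read_bit`, `mat_decode`); `finish` is tail-recursive, so the native stack stays flat no matter how deep the noun is:
+
+```ocaml
+type frame =
+  | Head of int                  (* cell starting at bit [pos]; head pending *)
+  | Tail of int * noun           (* head decoded; tail pending *)
+
+let cue_iterative bytes =
+  let r = reader_create bytes in
+  let backrefs = Hashtbl.create 4096 in
+  let stack = ref [] in
+  (* Hand a completed noun to the innermost pending frame. *)
+  let rec finish n =
+    match !stack with
+    | [] -> Some n                           (* whole tree decoded *)
+    | Head pos :: rest ->
+        stack := Tail (pos, n) :: rest;      (* head done; decode the tail *)
+        None
+    | Tail (pos, head) :: rest ->
+        stack := rest;
+        let cell = Cell (head, n) in
+        Hashtbl.add backrefs pos cell;
+        finish cell                          (* tail call: may close parents *)
+  in
+  let result = ref None in
+  while Option.is_none !result do
+    let pos = reader_pos r in
+    if not (read_bit r) then begin           (* tag 0: atom *)
+      let (value, _width) = mat_decode r in
+      let a = Atom value in
+      Hashtbl.add backrefs pos a;
+      result := finish a
+    end else if read_bit r then begin        (* tag 11: backref *)
+      let (ref_pos, _width) = mat_decode r in
+      result := finish (Hashtbl.find backrefs (Z.to_int ref_pos))
+    end else                                 (* tag 10: cell *)
+      stack := Head pos :: !stack
+  done;
+  Option.get !result
+```
+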
+### Medium-Impact, Medium-Effort
+
+5. **Arena Allocation for Nouns**
+ ```ocaml
+ type noun_arena = {
+ atoms: Z.t array;
+ cells: (int * int) array; (* indices into arena *)
+ mutable next: int;
+ }
+ ```
+ - Preallocate space for estimated noun count
+ - Convert tree to array representation during cue
+ - Reconstruct proper noun tree at end
+
+6. **Better Bit Reading**
+   - Read up to 62 bits at a time (a native OCaml `int` holds 63)
+   - Buffer larger chunks in a word (see the sketch after this list)
+   - Reduce per-bit function-call overhead
+
+7. **Profile-Guided Optimization**
+ - Actually profile with `perf` to see where time goes
+ - May reveal surprising hotspots
+
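+A hedged sketch of the word-buffered reader from item 6 (field names are illustrative); jam is LSB-first, and keeping up to 62 bits of lookahead in a native `int` means most reads touch no bytes at all:
+
+```ocaml
+type reader = {
+  buf : Bytes.t;
+  mutable word : int;    (* up to 62 buffered bits, LSB first *)
+  mutable have : int;    (* number of valid bits in [word] *)
+  mutable byte : int;    (* next byte offset in [buf] *)
+}
+
+let refill r =
+  while r.have <= 54 && r.byte < Bytes.length r.buf do
+    r.word <- r.word lor (Char.code (Bytes.get r.buf r.byte) lsl r.have);
+    r.have <- r.have + 8;
+    r.byte <- r.byte + 1
+  done
+
+(* Read [n] bits (n <= 55) as an int; refills from [buf] only when low. *)
+let read_bits r n =
+  if r.have < n then refill r;
+  let v = r.word land ((1 lsl n) - 1) in
+  r.word <- r.word lsr n;
+  r.have <- r.have - n;
+  v
+```
+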
+### Low-Impact, High-Effort
+
+8. **C FFI for Cue Only**
+ - Call C Vere's `u3s_cue_xeno_with()`
+ - Convert resulting C noun to OCaml noun
+ - Still need conversion overhead
+
+9. **Parallel Cue**
+ - Split work across cores
+ - Complex due to backreferences
+
+## Benchmarking Plan
+
+To validate optimizations, we need:
+
+```ocaml
+(* test/bench_cue_detailed.ml *)
+let benchmark_sizes = [
+ ("ivory.pill", 1.5); (* MB *)
+ ("solid.pill", 8.7);
+]
+
+let optimizations = [
+ ("baseline", cue_baseline);
+ ("iterative", cue_iterative);
+ ("intmap", cue_intmap);
+ ("cached_atoms", cue_cached_atoms);
+ ("combined", cue_all_opts);
+]
+```
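+
+A minimal harness to drive those variants (`read_file` is a hypothetical helper; `Unix.gettimeofday` needs the `unix` library):
+
+```ocaml
+let time_cue name cue_fn bytes =
+  let t0 = Unix.gettimeofday () in
+  ignore (cue_fn bytes);
+  let dt = Unix.gettimeofday () -. t0 in
+  Printf.printf "%-14s %8.2fs  (%.2f MB/s)\n" name dt
+    (float_of_int (Bytes.length bytes) /. 1e6 /. dt)
+
+let () =
+  List.iter (fun (file, _mb) ->
+      let bytes = read_file file in
+      List.iter (fun (name, fn) -> time_cue name fn bytes) optimizations)
+    benchmark_sizes
+```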
+
+Target: **10x speedup** should be achievable with just iterative + intmap + cached atoms.
+That would bring solid pill cue from 300s → 30s (still slower than C, but usable).
+
+## The Real Question
+
+**Why is the gap so large?**
+
+- Nock is only 2-5x slower than C
+- Small noun serialization is comparable
+- But large cue is 200x slower
+
+This suggests the problem is **algorithmic/structural**, not just "C is faster than OCaml".
+
+Likely culprits:
+1. Deep non-tail recursion thrashes the native stack and cache at depth
+2. Hashtable performance degrades as it grows (rehashing on resize, polymorphic hashing)
+3. GC pauses become significant with millions of live small objects
+
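+One cheap experiment to separate GC cost from algorithmic cost: enlarge the minor heap and relax major-collection pacing before a cue run, and see how much of the gap moves.
+
+```ocaml
+(* Sizes are in words: 64M words is ~512 MB on a 64-bit machine. *)
+let () =
+  Gc.set { (Gc.get ()) with
+           Gc.minor_heap_size = 64 * 1024 * 1024;
+           Gc.space_overhead = 400 }
+```
+
+The same experiment works without recompiling, via `OCAMLRUNPARAM='s=64M,o=400'`.
+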
+## Immediate Next Steps
+
+1. **Profile Current Implementation**
+ ```bash
+ perf record -g dune exec test/cache_solid.exe
+ perf report
+ ```
+ - Find actual hotspots
+ - May be surprised by results
+
+2. **Implement Iterative Cue**
+ - Single biggest win likely
+ - Can compare apples-to-apples
+
+3. **Add IntMap for Backrefs**
+ - Easy change
+ - Should help significantly
+
+4. **Measure Each Optimization**
+ - Don't guess, measure
+ - Build intuition about cost model
+
+## Alternative: Work Around the Problem
+
+If cue optimization proves too hard:
+
+1. **Use Ivory Pill for Testing**
+ - Already loads fast (0.44s)
+ - Good enough for I/O driver development
+ - Missing full Arvo, but has %zuse core
+
+2. **Pre-cache Solid Pill**
+   - Cue once (slowly), save with `Marshal` (see the sketch after this list)
+   - Reload the marshalled version in a fraction of the time
+ - One-time cost per pill version
+
+3. **C FFI Just for Cue**
+ - Let C Vere handle deserialization
+ - Convert result to OCaml nouns
+ - Rest of runtime stays pure OCaml
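+
+A sketch of that cache, assuming a `read_file : string -> bytes` helper and the existing `cue`; `Marshal` preserves sharing by default, and Zarith's `Z.t` supports marshalling through its custom serializer:
+
+```ocaml
+(* Note: the Marshal format is not stable across compiler versions,
+   so the cache should be keyed on pill hash + compiler version. *)
+let load_cached ~pill_path ~cache_path : noun =
+  if Sys.file_exists cache_path then
+    let ic = open_in_bin cache_path in
+    Fun.protect ~finally:(fun () -> close_in ic)
+      (fun () -> (Marshal.from_channel ic : noun))
+  else begin
+    let n = cue (read_file pill_path) in
+    let oc = open_out_bin cache_path in
+    Marshal.to_channel oc n [];
+    close_out oc;
+    n
+  end
+```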
+
+But none of these feel right. **OCaml should be able to do better than 200x slower.**
+
+## Success Criteria
+
+- [ ] Solid pill cues in <30 seconds (10x improvement)
+- [ ] Solid pill cues in <10 seconds (30x improvement)
+- [ ] Solid pill cues in <3 seconds (100x improvement)
+- [ ] Understand *why* the gap exists
+- [ ] Document optimization techniques for future work
+
+The goal isn't to beat C; it's to make OCaml fast enough to be practical.