diff --git a/ocaml/ARVO_CHALLENGE.md b/ocaml/ARVO_CHALLENGE.md
new file mode 100644
index 0000000..6ce7585
--- /dev/null
+++ b/ocaml/ARVO_CHALLENGE.md
@@ -0,0 +1,273 @@
+# Arvo Boot Challenge: The Cue Performance Problem
+
+## Goal
+
+Boot a real Arvo kernel in the OCaml runtime to enable full event processing and poke operations.
+
+## Current Status
+
+### ✅ What's Working
+
+1. **Nock Interpreter**: Complete, all tests passing, ~2-5x slower than C (acceptable)
+2. **Jam/Cue for Small Data**: Working correctly, all roundtrip tests pass
+3. **Ivory Pill Loading**: 1.5MB pill cues in **0.44 seconds** (manageable)
+4. **Boot Infrastructure**:
+   - Implemented `u3v_life()` lifecycle formula `[2 [0 3] [0 2]]` (sketched after this list)
+ - Ivory pill tag validation (0x79726f7669 = "ivory")
+ - C Vere poke pattern understood (slot 23 → slam with event)
+
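+A minimal OCaml sketch of that lifecycle step, mirroring C Vere's `u3v_life()`; the `noun` constructors and a `Nock.nock subject formula` entry point are assumed names from this runtime:
+
+```ocaml
+(* [2 [0 3] [0 2]]: *[a 2 b c] = *[*[a b] *[a c]], so the head of the
+   event list becomes the formula and the tail becomes the subject. *)
+let life_formula =
+  Cell (Atom (Z.of_int 2),
+        Cell (Cell (Atom Z.zero, Atom (Z.of_int 3)),
+              Cell (Atom Z.zero, Atom (Z.of_int 2))))
+
+let life events = Nock.nock events life_formula
+```
+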
+### 🔴 The Blocker: Cue Performance on Large Pills
+
+**Solid Pill Requirements:**
+- Size: 8.7 MB (jammed)
+- Contains: 5 boot events to build full Arvo (`+aeon`, `+boot`, `+fate`, `+hoon`, `+arvo`)
+- Structure: `[tag [event1 event2 event3 event4 event5]]`
+
+**Performance Gap:**
+```
+C Vere: 1.45 seconds (6.0 MB/s throughput)
+OCaml: >300 seconds (~0.03 MB/s throughput)
+Ratio: ~200x slower
+```
+
+This makes solid pill loading impractical for development/testing.
+
+## Why Is OCaml Cue So Slow?
+
+### Current Implementation (lib/serial.ml:140-178)
+
+```ocaml
+let cue bytes =
+ let r = reader_create bytes in
+ let backref_table = Hashtbl.create 256 in
+
+ let rec cue_noun () =
+ let pos = reader_pos r in
+ let tag0 = read_bit r in
+
+ if not tag0 then begin
+ (* Atom *)
+ let (value, _width) = mat_decode r in
+ let result = Atom value in
+ Hashtbl.add backref_table pos result;
+ result
+ end else begin
+ let tag1 = read_bit r in
+ if tag1 then begin
+ (* Backref *)
+ let (ref_pos, _width) = mat_decode r in
+ Hashtbl.find backref_table (Z.to_int ref_pos)
+ end else begin
+ (* Cell *)
+ let head = cue_noun () in
+ let tail = cue_noun () in
+ let result = Cell (head, tail) in
+ Hashtbl.add backref_table pos result;
+ result
+ end
+ end
+ in
+ cue_noun ()
+```
+
+### Identified Bottlenecks
+
+1. **Deep Recursion**: neither call in the cell case is in tail position, since `Cell (head, tail)` is built after both return
+   - 8.7 MB pill → millions of nested calls; stack depth tracks tree depth
+   - Stack frames pile up, risking `Stack_overflow` on deep structures
+   - Poor cache locality as the native stack grows
+
+2. **Hashtable Operations**:
+   - Generic `Hashtbl` touched on every atom and cell
+   - Polymorphic `Hashtbl.hash` recomputed for plain-int bit positions
+   - Collision resolution plus resize/rehash overhead as the table grows
+   - Could use a specialized int-keyed table
+
+3. **Allocation Pressure**:
+ - Every `Atom`/`Cell` allocated separately
+ - No arena allocation
+ - GC pressure on millions of small objects
+
+4. **Bit-Level Reading**:
+ - Even with byte-aligned fast paths, still slower than C
+ - OCaml function call overhead per bit/byte
+ - Bounds checking on every access
+
+5. **Mat Decoding**:
+ - Called on every atom and backref
+ - Decodes variable-length integers bit-by-bit
+   - Could be optimized for common cases (see the fast-path sketch after this list)
+
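+For context, `mat` writes an atom as a unary length-of-length, then the length with its top bit implicit, then the value bits (LSB first). A hedged sketch of a fast path for the common case, assuming reader helpers `count_zeros`, `read_bits` (int-sized), and `read_bits_z` (arbitrary width) exist or are added:
+
+```ocaml
+(* Decode one mat-encoded atom. count_zeros consumes the zero run and
+   its terminating 1 bit; read_bits returns up to 62 bits as an int. *)
+let mat_decode_fast r =
+  let z = count_zeros r in
+  if z = 0 then Z.zero                       (* a lone 1 bit encodes 0 *)
+  else begin
+    (* the length's most-significant bit is implicit *)
+    let len = read_bits r (z - 1) lor (1 lsl (z - 1)) in
+    if len <= 62
+    then Z.of_int (read_bits r len)          (* fast path: no bignum *)
+    else read_bits_z r len                   (* slow path: Z.t of len bits *)
+  end
+```
+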
+## C Vere's Advantages
+
+From `pkg/noun/serial.c` and `pkg/ur/`:
+
+1. **Custom Memory Arena**: All nouns allocated in contiguous loom
+2. **Optimized Hash Table**: `ur_dict32` specialized for integer keys
+3. **Batch Operations**: Processes multiple bytes at once
+4. **No Bounds Checking**: (for better or worse)
+5. **Inline Everything**: Compiler can inline hot paths
+6. **Cache-Friendly Layout**: Nouns are compact, sequential
+
+## Potential Optimizations (Not Yet Tried)
+
+### High-Impact, Low-Effort
+
+1. **Iterative Cue with Explicit Stack**
+   ```ocaml
+   (* One frame per cell under construction; full sketch after this list. *)
+   type frame =
+     | Head of int          (* cell starting at bit [pos]; head pending *)
+     | Tail of int * noun   (* head decoded; tail pending *)
+   ```
+   - Replaces native-stack recursion with a heap-allocated stack (no `Stack_overflow`)
+   - Better cache locality
+   - Stack depth bounded by tree depth, not call count
+
+2. **Specialized Int-Keyed Table for Backrefs**
+   ```ocaml
+   (* Monomorphic int keys: skips polymorphic Hashtbl.hash entirely. *)
+   module IntTbl = Hashtbl.Make (struct
+     type t = int
+     let equal = Int.equal
+     let hash x = x land max_int
+   end)
+   ```
+   - Avoids the generic polymorphic hash on every insert and lookup
+   - Monomorphic `equal` and `hash` can be inlined
+   - A pure `Map.Make(Int)` is worth benchmarking too, but trades O(1) for O(log n)
+
+3. **Pre-allocate Common Atoms**
+   ```ocaml
+   let atom_cache = Array.init 256 (fun i -> Atom (Z.of_int i))
+   (* In cue's atom branch: reuse the boxed atom when the value is small. *)
+   let make_atom z =
+     if Z.fits_int z && Z.to_int z < 256 then atom_cache.(Z.to_int z) else Atom z
+   ```
+   - Reuses one shared `Atom` per small value (0-255) instead of reallocating
+   - Likely a big win for repetitive data such as tags and small opcodes
+
+4. **Batch Mat Decoding**
+ - Optimize for atoms that fit in 64 bits (most common case)
+ - Fast path for small mats
+
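+A fuller sketch of the iterative decoder from item 1, reusing the reader helpers already in `lib/serial.ml` (`reader_create`, `reader_pos`, `read_bit`, `mat_decode`); `finish` is tail-recursive, so the native stack stays flat no matter how deep the noun is:
+
+```ocaml
+type frame =
+  | Head of int                  (* cell starting at bit [pos]; head pending *)
+  | Tail of int * noun           (* head decoded; tail pending *)
+
+let cue_iterative bytes =
+  let r = reader_create bytes in
+  let backrefs = Hashtbl.create 4096 in
+  let stack = ref [] in
+  (* Hand a completed noun to the innermost pending frame. *)
+  let rec finish n =
+    match !stack with
+    | [] -> Some n                           (* whole tree decoded *)
+    | Head pos :: rest ->
+        stack := Tail (pos, n) :: rest;      (* head done; decode the tail *)
+        None
+    | Tail (pos, head) :: rest ->
+        stack := rest;
+        let cell = Cell (head, n) in
+        Hashtbl.add backrefs pos cell;
+        finish cell                          (* tail call: may close parents *)
+  in
+  let result = ref None in
+  while Option.is_none !result do
+    let pos = reader_pos r in
+    if not (read_bit r) then begin           (* tag 0: atom *)
+      let (value, _width) = mat_decode r in
+      let a = Atom value in
+      Hashtbl.add backrefs pos a;
+      result := finish a
+    end else if read_bit r then begin        (* tag 11: backref *)
+      let (ref_pos, _width) = mat_decode r in
+      result := finish (Hashtbl.find backrefs (Z.to_int ref_pos))
+    end else                                 (* tag 10: cell *)
+      stack := Head pos :: !stack
+  done;
+  Option.get !result
+```
+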
+### Medium-Impact, Medium-Effort
+
+5. **Arena Allocation for Nouns**
+ ```ocaml
+ type noun_arena = {
+ atoms: Z.t array;
+ cells: (int * int) array; (* indices into arena *)
+ mutable next: int;
+ }
+ ```
+ - Preallocate space for estimated noun count
+ - Convert tree to array representation during cue
+ - Reconstruct proper noun tree at end
+
+6. **Better Bit Reading**
+   - Read up to 62 bits at a time (a native OCaml `int` holds 63)
+   - Buffer larger chunks in a word (see the sketch after this list)
+   - Reduce per-bit function-call overhead
+
+7. **Profile-Guided Optimization**
+ - Actually profile with `perf` to see where time goes
+ - May reveal surprising hotspots
+
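+A hedged sketch of the word-buffered reader from item 6 (field names are illustrative); jam is LSB-first, and keeping up to 62 bits of lookahead in a native `int` means most reads touch no bytes at all:
+
+```ocaml
+type reader = {
+  buf : Bytes.t;
+  mutable word : int;    (* up to 62 buffered bits, LSB first *)
+  mutable have : int;    (* number of valid bits in [word] *)
+  mutable byte : int;    (* next byte offset in [buf] *)
+}
+
+let refill r =
+  while r.have <= 54 && r.byte < Bytes.length r.buf do
+    r.word <- r.word lor (Char.code (Bytes.get r.buf r.byte) lsl r.have);
+    r.have <- r.have + 8;
+    r.byte <- r.byte + 1
+  done
+
+(* Read [n] bits (n <= 55) as an int; refills from [buf] only when low. *)
+let read_bits r n =
+  if r.have < n then refill r;
+  let v = r.word land ((1 lsl n) - 1) in
+  r.word <- r.word lsr n;
+  r.have <- r.have - n;
+  v
+```
+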
+### Low-Impact, High-Effort
+
+8. **C FFI for Cue Only**
+ - Call C Vere's `u3s_cue_xeno_with()`
+ - Convert resulting C noun to OCaml noun
+ - Still need conversion overhead
+
+9. **Parallel Cue**
+ - Split work across cores
+ - Complex due to backreferences
+
+## Benchmarking Plan
+
+To validate optimizations, we need:
+
+```ocaml
+(* test/bench_cue_detailed.ml *)
+let benchmark_sizes = [
+ ("ivory.pill", 1.5); (* MB *)
+ ("solid.pill", 8.7);
+]
+
+let optimizations = [
+ ("baseline", cue_baseline);
+ ("iterative", cue_iterative);
+ ("intmap", cue_intmap);
+ ("cached_atoms", cue_cached_atoms);
+ ("combined", cue_all_opts);
+]
+```
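+
+A minimal harness to drive those variants (`read_file` is a hypothetical helper; `Unix.gettimeofday` needs the `unix` library):
+
+```ocaml
+let time_cue name cue_fn bytes =
+  let t0 = Unix.gettimeofday () in
+  ignore (cue_fn bytes);
+  let dt = Unix.gettimeofday () -. t0 in
+  Printf.printf "%-14s %8.2fs  (%.2f MB/s)\n" name dt
+    (float_of_int (Bytes.length bytes) /. 1e6 /. dt)
+
+let () =
+  List.iter (fun (file, _mb) ->
+      let bytes = read_file file in
+      List.iter (fun (name, fn) -> time_cue name fn bytes) optimizations)
+    benchmark_sizes
+```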
+
+Target: **10x speedup** should be achievable with just iterative + intmap + cached atoms.
+That would bring solid pill cue from 300s → 30s (still slower than C, but usable).
+
+## The Real Question
+
+**Why is the gap so large?**
+
+- Nock is only 2-5x slower than C
+- Small noun serialization is comparable
+- But large cue is 200x slower
+
+This suggests the problem is **algorithmic/structural**, not just "C is faster than OCaml".
+
+Likely culprits:
+1. Deep non-tail recursion thrashes the native stack and cache at depth
+2. Hashtable performance degrades as it grows (rehashing on resize, polymorphic hashing)
+3. GC pauses become significant with millions of live small objects
+
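+One cheap experiment to separate GC cost from algorithmic cost: enlarge the minor heap and relax major-collection pacing before a cue run, and see how much of the gap moves.
+
+```ocaml
+(* Sizes are in words: 64M words is ~512 MB on a 64-bit machine. *)
+let () =
+  Gc.set { (Gc.get ()) with
+           Gc.minor_heap_size = 64 * 1024 * 1024;
+           Gc.space_overhead = 400 }
+```
+
+The same experiment works without recompiling, via `OCAMLRUNPARAM='s=64M,o=400'`.
+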
+## Immediate Next Steps
+
+1. **Profile Current Implementation**
+ ```bash
+ perf record -g dune exec test/cache_solid.exe
+ perf report
+ ```
+ - Find actual hotspots
+ - May be surprised by results
+
+2. **Implement Iterative Cue**
+ - Single biggest win likely
+ - Can compare apples-to-apples
+
+3. **Add IntMap for Backrefs**
+ - Easy change
+ - Should help significantly
+
+4. **Measure Each Optimization**
+ - Don't guess, measure
+ - Build intuition about cost model
+
+## Alternative: Work Around the Problem
+
+If cue optimization proves too hard:
+
+1. **Use Ivory Pill for Testing**
+ - Already loads fast (0.44s)
+ - Good enough for I/O driver development
+ - Missing full Arvo, but has %zuse core
+
+2. **Pre-cache Solid Pill**
+   - Cue once (slowly), save with `Marshal` (see the sketch after this list)
+   - Reload the marshalled version in a fraction of the time
+ - One-time cost per pill version
+
+3. **C FFI Just for Cue**
+ - Let C Vere handle deserialization
+ - Convert result to OCaml nouns
+ - Rest of runtime stays pure OCaml
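+
+A sketch of that cache, assuming a `read_file : string -> bytes` helper and the existing `cue`; `Marshal` preserves sharing by default, and Zarith's `Z.t` supports marshalling through its custom serializer:
+
+```ocaml
+(* Note: the Marshal format is not stable across compiler versions,
+   so the cache should be keyed on pill hash + compiler version. *)
+let load_cached ~pill_path ~cache_path : noun =
+  if Sys.file_exists cache_path then
+    let ic = open_in_bin cache_path in
+    Fun.protect ~finally:(fun () -> close_in ic)
+      (fun () -> (Marshal.from_channel ic : noun))
+  else begin
+    let n = cue (read_file pill_path) in
+    let oc = open_out_bin cache_path in
+    Marshal.to_channel oc n [];
+    close_out oc;
+    n
+  end
+```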
+
+But none of these feel right. **OCaml should be able to do better than 200x slower.**
+
+## Success Criteria
+
+- [ ] Solid pill cues in <30 seconds (10x improvement)
+- [ ] Solid pill cues in <10 seconds (30x improvement)
+- [ ] Solid pill cues in <3 seconds (100x improvement)
+- [ ] Understand *why* the gap exists
+- [ ] Document optimization techniques for future work
+
+The goal isn't to beat C; it's to make OCaml fast enough to be practical.