From 4be1d7f999ffb3eb1c12c54e863b141af21b3fbf Mon Sep 17 00:00:00 2001
From: polwex
Date: Mon, 6 Oct 2025 10:30:19 +0700
Subject: some progress but man

---
 ocaml/ARVO_CHALLENGE.md | 273 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 273 insertions(+)
 create mode 100644 ocaml/ARVO_CHALLENGE.md

diff --git a/ocaml/ARVO_CHALLENGE.md b/ocaml/ARVO_CHALLENGE.md
new file mode 100644
index 0000000..6ce7585
--- /dev/null
+++ b/ocaml/ARVO_CHALLENGE.md
@@ -0,0 +1,273 @@
# Arvo Boot Challenge: The Cue Performance Problem

## Goal

Boot a real Arvo kernel in the OCaml runtime to enable full event processing and poke operations.

## Current Status

### ✅ What's Working

1. **Nock Interpreter**: Complete, all tests passing, ~2-5x slower than C (acceptable)
2. **Jam/Cue for Small Data**: Working correctly, all roundtrip tests pass
3. **Ivory Pill Loading**: 1.5 MB pill cues in **0.44 seconds** (manageable)
4. **Boot Infrastructure**:
   - Implemented the `u3v_life()` lifecycle formula `[2 [0 3] [0 2]]`
   - Ivory pill tag validation (0x79726f7669 = "ivory")
   - C Vere poke pattern understood (slot 23 → slam with event)

### 🔴 The Blocker: Cue Performance on Large Pills

**Solid Pill Requirements:**
- Size: 8.7 MB (jammed)
- Contains: 5 boot events to build full Arvo (`+aeon`, `+boot`, `+fate`, `+hoon`, `+arvo`)
- Structure: `[tag [event1 event2 event3 event4 event5]]`

**Performance Gap:**
```
C Vere: 1.45 seconds (6.0 MB/s throughput)
OCaml:  >300 seconds (~0.03 MB/s throughput)
Ratio:  ~200x slower
```

This makes solid pill loading impractical for development and testing.

## Why Is OCaml Cue So Slow?

### Current Implementation (lib/serial.ml:140-178)

```ocaml
let cue bytes =
  let r = reader_create bytes in
  let backref_table = Hashtbl.create 256 in

  let rec cue_noun () =
    let pos = reader_pos r in
    let tag0 = read_bit r in

    if not tag0 then begin
      (* Atom *)
      let (value, _width) = mat_decode r in
      let result = Atom value in
      Hashtbl.add backref_table pos result;
      result
    end else begin
      let tag1 = read_bit r in
      if tag1 then begin
        (* Backref *)
        let (ref_pos, _width) = mat_decode r in
        Hashtbl.find backref_table (Z.to_int ref_pos)
      end else begin
        (* Cell *)
        let head = cue_noun () in
        let tail = cue_noun () in
        let result = Cell (head, tail) in
        Hashtbl.add backref_table pos result;
        result
      end
    end
  in
  cue_noun ()
```

### Identified Bottlenecks

1. **Deep Recursion**: The recursive head call in the cell case can never be a tail call
   - An 8.7 MB pill drives millions of recursive calls
   - Stack frames pile up with nesting depth
   - Cache misses on every frame

2. **Hashtable Operations**:
   - Generic `Hashtbl` insert on every atom and cell
   - Polymorphic hash computed for every bit position (an integer)
   - Collision resolution overhead
   - Could use a specialized int-keyed table

3. **Allocation Pressure**:
   - Every `Atom`/`Cell` allocated separately
   - No arena allocation
   - GC pressure from millions of small objects

4. **Bit-Level Reading**:
   - Even with byte-aligned fast paths, still slower than C
   - OCaml function call overhead per bit/byte
   - Bounds checking on every access

5. **Mat Decoding**:
   - Called on every atom and backref
   - Decodes variable-length integers bit by bit
   - Could be optimized for common cases

## C Vere's Advantages

From `pkg/noun/serial.c` and `pkg/ur/`:

1. **Custom Memory Arena**: All nouns allocated in a contiguous loom
2. **Optimized Hash Table**: `ur_dict32` specialized for integer keys
3. **Batch Operations**: Processes multiple bytes at once
4. **No Bounds Checking**: (for better or worse)
5. **Inline Everything**: The compiler can inline hot paths
6. **Cache-Friendly Layout**: Nouns are compact and sequential

## Potential Optimizations (Not Yet Tried)

### High-Impact, Low-Effort

1. **Iterative Cue with Explicit Stack** (full sketch after this list)
   ```ocaml
   type cue_frame = { pos : int; mutable head : noun option }

   let cue bytes =
     let stack : cue_frame Stack.t = Stack.create () in
     (* convert recursive calls into explicit stack operations *)
     ...
   ```
   - Eliminates recursion overhead
   - Better cache locality
   - OCaml compiles the resulting self tail calls to loops

2. **Specialized Int-Keyed Table for Backrefs**
   ```ocaml
   module IntTbl = Hashtbl.Make (Int)
   let backrefs : noun IntTbl.t = IntTbl.create 4096
   ```
   - A monomorphic key type skips the polymorphic hash/equality dispatch
   - Better cache behavior
   - Less allocation

3. **Pre-allocate Common Atoms**
   ```ocaml
   let atom_cache = Array.init 256 (fun i -> Atom (Z.of_int i))
   ```
   - Reuse atoms for small values (0-255)
   - Huge win for repetitive data

4. **Batch Mat Decoding** (second sketch after this list)
   - Optimize for atoms that fit in 64 bits (the most common case)
   - Fast path for small mats
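To make optimization 1 concrete, here is a minimal iterative sketch. It assumes the same `noun` type and the `reader_create`/`reader_pos`/`read_bit`/`mat_decode` helpers from the recursive version above; the `cue_frame` record and the tail-call behavior are the things to validate when profiling, not a finished implementation.

```ocaml
type cue_frame = { pos : int; mutable head : noun option }

let cue_iterative bytes =
  let r = reader_create bytes in
  let backrefs = Hashtbl.create 4096 in
  let stack : cue_frame Stack.t = Stack.create () in
  let result = ref None in
  (* Hand a finished noun to the innermost pending cell, folding
     completed cells upward. [deliver] is self tail-recursive, so the
     native compiler turns it into a loop: no OCaml stack growth. *)
  let rec deliver v =
    match Stack.top_opt stack with
    | None -> result := Some v
    | Some ({ head = None; _ } as f) ->
        f.head <- Some v            (* head done; the tail is decoded next *)
    | Some { pos; head = Some h } ->
        ignore (Stack.pop stack);
        let cell = Cell (h, v) in
        Hashtbl.add backrefs pos cell;
        deliver cell
  in
  (* Main loop: also self tail-recursive, hence a loop after compilation. *)
  let rec loop () =
    match !result with
    | Some v -> v
    | None ->
        let pos = reader_pos r in
        (if not (read_bit r) then begin
           (* atom: same mat decoding as the recursive version *)
           let (value, _width) = mat_decode r in
           let a = Atom value in
           Hashtbl.add backrefs pos a;
           deliver a
         end else if read_bit r then begin
           (* backref: the referenced noun was fully decoded earlier *)
           let (ref_pos, _width) = mat_decode r in
           deliver (Hashtbl.find backrefs (Z.to_int ref_pos))
         end else
           (* cell: open a frame; the next decoded noun is its head *)
           Stack.push { pos; head = None } stack);
        loop ()
  in
  loop ()
```

The explicit stack trades one OCaml call frame per nesting level for one small heap record per cell, and the two tail-recursive loops keep the native stack flat no matter how deep the pill nests.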
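And a sketch of optimization 4. It assumes the standard mat layout (a unary run of zeros giving the length-of-length, an implicit high bit on the length, then the payload bits), plus a hypothetical `read_bits : reader -> int -> int` that returns up to 62 bits in one call over the byte-aligned fast path; both the layout match and the helper are assumptions, not existing API.

```ocaml
(* Assumed helper: [read_bits r n] reads 0 <= n <= 62 bits at once,
   least-significant bit first, using the byte-aligned fast path. *)
let mat_decode_fast r =
  (* unary prefix: count the zeros before the first 1 bit *)
  let rec zeros n = if read_bit r then n else zeros (n + 1) in
  let c = zeros 0 in
  if c = 0 then (Z.zero, 1)                (* a lone 1 bit encodes atom 0 *)
  else begin
    (* length-of-length: c-1 explicit bits plus an implicit high bit *)
    let b = (1 lsl (c - 1)) lor read_bits r (c - 1) in
    (* payload: b bits, read in word-sized chunks instead of one by one *)
    let rec payload acc shift rem =
      if rem = 0 then acc
      else
        let k = min rem 62 in
        let chunk = Z.of_int (read_bits r k) in
        payload (Z.logor acc (Z.shift_left chunk shift)) (shift + k) (rem - k)
    in
    (payload Z.zero 0 b, (2 * c) + b)      (* (value, bits consumed) *)
  end
```

For the common case of an atom under 62 bits the payload loop runs once, so decoding costs roughly three helper calls instead of one call per bit.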
### Medium-Impact, Medium-Effort

5. **Arena Allocation for Nouns**
   ```ocaml
   type noun_arena = {
     atoms : Z.t array;
     cells : (int * int) array;  (* indices into the arena *)
     mutable next : int;
   }
   ```
   - Preallocate space for an estimated noun count
   - Convert the tree to an array representation during cue
   - Reconstruct the proper noun tree at the end

6. **Better Bit Reading**
   - Read 64 bits at a time when possible
   - Buffer larger chunks
   - Reduce function call overhead

7. **Profile-Guided Optimization**
   - Actually profile with `perf` to see where the time goes
   - May reveal surprising hotspots

### Low-Impact, High-Effort

8. **C FFI for Cue Only**
   - Call C Vere's `u3s_cue_xeno_with()`
   - Convert the resulting C noun to an OCaml noun
   - Still pays the conversion overhead

9. **Parallel Cue**
   - Split work across cores
   - Complex due to backreferences

## Benchmarking Plan

To validate optimizations, we need:

```ocaml
(* test/bench_cue_detailed.ml *)
let benchmark_sizes = [
  ("ivory.pill", 1.5);  (* MB *)
  ("solid.pill", 8.7);
]

let optimizations = [
  ("baseline",     cue_baseline);
  ("iterative",    cue_iterative);
  ("intmap",       cue_intmap);
  ("cached_atoms", cue_cached_atoms);
  ("combined",     cue_all_opts);
]
```

Target: a **10x speedup** should be achievable with just iterative + int-keyed table + cached atoms.
That would bring solid pill cue from 300s → 30s (still slower than C, but usable).

## The Real Question

**Why is the gap so large?**

- Nock is only 2-5x slower than C
- Small noun serialization is comparable
- But large cue is 200x slower

This suggests the problem is **algorithmic/structural**, not just "C is faster than OCaml".

Likely culprits:
1. Recursion costs grow worse than linearly once stack depth outruns the cache
2. Hashtable performance degrades as the table grows
3. GC pauses become significant with millions of live allocations

## Immediate Next Steps

1. **Profile Current Implementation**
   ```bash
   perf record -g dune exec test/cache_solid.exe
   perf report
   ```
   - Find the actual hotspots
   - The results may be surprising

2. **Implement Iterative Cue**
   - Likely the single biggest win
   - Enables an apples-to-apples comparison

3. **Switch Backrefs to an Int-Keyed Table**
   - Easy change
   - Should help significantly

4. **Measure Each Optimization**
   - Don't guess, measure
   - Build intuition about the cost model

## Alternative: Work Around the Problem

If cue optimization proves too hard:

1. **Use Ivory Pill for Testing**
   - Already loads fast (0.44s)
   - Good enough for I/O driver development
   - Missing full Arvo, but has the %zuse core

2. **Pre-cache Solid Pill** (sketched below)
   - Cue once (slowly), save with `Marshal`
   - Load the marshalled version instantly
   - One-time cost per pill version

3. **C FFI Just for Cue**
   - Let C Vere handle deserialization
   - Convert the result to OCaml nouns
   - The rest of the runtime stays pure OCaml

But none of these feel right. **OCaml should be able to do better than 200x slower.**
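A minimal sketch of workaround 2, assuming `cue : bytes -> noun` as above and OCaml ≥ 4.14 for `In_channel`/`Out_channel`; the `.cache` suffix and the `load_pill_cached` name are made up for illustration. Zarith registers custom serializers for `Z.t`, so `Marshal` can persist atom payloads, and `Marshal` preserves sharing by default, which matters for the heavily shared nouns cue produces.

```ocaml
let load_pill_cached path =
  let cache = path ^ ".cache" in
  if Sys.file_exists cache then
    (* fast path: reload the previously marshalled noun *)
    In_channel.with_open_bin cache (fun ic ->
        (Marshal.from_channel ic : noun))
  else begin
    (* slow path, paid once per pill version: cue, then persist *)
    let bytes =
      In_channel.with_open_bin path (fun ic ->
          Bytes.of_string (In_channel.input_all ic))
    in
    let n = cue bytes in
    Out_channel.with_open_bin cache (fun oc ->
        Marshal.to_channel oc n []);
    n
  end
```

Invalidation is the obvious gap: the cache file must be deleted (or keyed by the pill's hash) whenever the pill changes.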
## Success Criteria

- [ ] Solid pill cues in <30 seconds (10x improvement)
- [ ] Solid pill cues in <10 seconds (30x improvement)
- [ ] Solid pill cues in <3 seconds (100x improvement)
- [ ] Understand *why* the gap exists
- [ ] Document optimization techniques for future work

The goal isn't to beat C; it's to make OCaml fast enough to be practical.