# Arvo Boot Challenge: The Cue Performance Problem

## Goal

Boot a real Arvo kernel in the OCaml runtime to enable full event processing and poke operations.

## Current Status

### ✅ What's Working

1. **Nock Interpreter**: Complete, all tests passing, ~2-5x slower than C (acceptable)
2. **Jam/Cue for Small Data**: Working correctly, all roundtrip tests pass
3. **Ivory Pill Loading**: 1.5MB pill cues in **0.44 seconds** (manageable)
4. **Boot Infrastructure**:
   - Implemented `u3v_life()` lifecycle formula `[2 [0 3] [0 2]]`
   - Ivory pill tag validation (0x79726f7669 = "ivory")
   - C Vere poke pattern understood (slot 23 → slam with event)

### 🔴 The Blocker: Cue Performance on Large Pills

**Solid Pill Requirements:**

- Size: 8.7 MB (jammed)
- Contains: 5 boot events to build full Arvo (`+aeon`, `+boot`, `+fate`, `+hoon`, `+arvo`)
- Structure: `[tag [event1 event2 event3 event4 event5]]`

**Performance Gap:**

```
C Vere:   1.45 seconds   (6.0 MB/s throughput)
OCaml:    >300 seconds   (~0.03 MB/s throughput)
Ratio:    ~200x slower
```

This makes solid pill loading impractical for development/testing.

## Why Is OCaml Cue So Slow?

### Current Implementation (lib/serial.ml:140-178)

```ocaml
let cue bytes =
  let r = reader_create bytes in
  let backref_table = Hashtbl.create 256 in
  let rec cue_noun () =
    let pos = reader_pos r in
    let tag0 = read_bit r in
    if not tag0 then begin
      (* Atom *)
      let (value, _width) = mat_decode r in
      let result = Atom value in
      Hashtbl.add backref_table pos result;
      result
    end else begin
      let tag1 = read_bit r in
      if tag1 then begin
        (* Backref *)
        let (ref_pos, _width) = mat_decode r in
        Hashtbl.find backref_table (Z.to_int ref_pos)
      end else begin
        (* Cell *)
        let head = cue_noun () in
        let tail = cue_noun () in
        let result = Cell (head, tail) in
        Hashtbl.add backref_table pos result;
        result
      end
    end
  in
  cue_noun ()
```

### Identified Bottlenecks

1. **Deep Recursion**: `cue_noun` recurses on both head and tail, and the head call is never in tail position
   - 8.7MB pill → millions of recursive calls
   - Stack frames pile up
   - Cache misses
2. **Hashtable Operations**:
   - Generic `Hashtbl` on every atom/cell
   - Hash computation for bit positions (integers)
   - Collision resolution overhead
   - Could use specialized int-keyed map
3. **Allocation Pressure**:
   - Every `Atom`/`Cell` allocated separately
   - No arena allocation
   - GC pressure on millions of small objects
4. **Bit-Level Reading**:
   - Even with byte-aligned fast paths, still slower than C
   - OCaml function call overhead per bit/byte
   - Bounds checking on every access
5. **Mat Decoding**:
   - Called on every atom and backref
   - Decodes variable-length integers bit-by-bit
   - Could be optimized for common cases

## C Vere's Advantages

From `pkg/noun/serial.c` and `pkg/ur/`:

1. **Custom Memory Arena**: All nouns allocated in contiguous loom
2. **Optimized Hash Table**: `ur_dict32` specialized for integer keys
3. **Batch Operations**: Processes multiple bytes at once
4. **No Bounds Checking**: (for better or worse)
5. **Inline Everything**: Compiler can inline hot paths
6. **Cache-Friendly Layout**: Nouns are compact, sequential

## Potential Optimizations (Not Yet Tried)

### High-Impact, Low-Effort

1. **Iterative Cue with Explicit Stack**

   ```ocaml
   type cue_frame =
     | CueCell of { pos: int; mutable head: noun option }
     | ...

   let cue bytes =
     let stack = Stack.create () in
     (* Convert recursive calls to stack operations *)
   ```

   - Eliminates recursion overhead
   - Better cache locality
   - Depth moves to a heap-allocated stack instead of the OS call stack
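   Fleshed out, a minimal sketch might look like the following. The `frame` type, `reduce` helper, and the name `cue_iterative` are illustrative; `reader_create`, `reader_pos`, `read_bit`, `mat_decode`, `Atom`, and `Cell` are the existing pieces from `lib/serial.ml`:

   ```ocaml
   type frame =
     | Head of int          (* a cell began at [pos]; its head is being decoded *)
     | Tail of int * noun   (* head finished; its tail is being decoded *)

   let cue_iterative bytes =
     let r = reader_create bytes in
     let backrefs = Hashtbl.create 4096 in
     let stack : frame Stack.t = Stack.create () in
     (* Feed a finished noun to the enclosing frame. Self tail calls
        unwind chains of completed cells without growing the call stack. *)
     let rec reduce noun =
       if Stack.is_empty stack then Some noun
       else
         match Stack.pop stack with
         | Head pos ->
             Stack.push (Tail (pos, noun)) stack;
             None  (* the tail of this cell still needs decoding *)
         | Tail (pos, head) ->
             let cell = Cell (head, noun) in
             Hashtbl.add backrefs pos cell;
             reduce cell
     in
     let rec loop () =
       let pos = reader_pos r in
       if not (read_bit r) then begin
         (* tag 0: atom *)
         let value, _width = mat_decode r in
         let atom = Atom value in
         Hashtbl.add backrefs pos atom;
         match reduce atom with Some n -> n | None -> loop ()
       end
       else if read_bit r then begin
         (* tag 11: backref *)
         let ref_pos, _width = mat_decode r in
         match reduce (Hashtbl.find backrefs (Z.to_int ref_pos)) with
         | Some n -> n
         | None -> loop ()
       end
       else begin
         (* tag 10: cell; the head is encoded next, then the tail *)
         Stack.push (Head pos) stack;
         loop ()
       end
     in
     loop ()
   ```

   The key property: decode depth is bounded by the heap-allocated `Stack.t` rather than the OS call stack, so an 8.7 MB pill cannot overflow it no matter how deeply nested it is.
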
2. **Specialized Int Map for Backrefs**

   ```ocaml
   module IntTbl = Hashtbl.Make (struct
     type t = int
     let equal = Int.equal
     let hash = Hashtbl.hash
   end)

   let backref_table : noun IntTbl.t = IntTbl.create 4096
   ```

   - Monomorphic `equal`/`hash` skip the polymorphic comparison machinery on every insert and lookup
   - Better cache behavior
   - Less allocation

3. **Pre-allocate Common Atoms**

   ```ocaml
   let atom_cache = Array.init 256 (fun i -> Atom (Z.of_int i))
   ```

   - Reuse atoms for small values (0-255)
   - Huge win for repetitive data

4. **Batch Mat Decoding**
   - Optimize for atoms that fit in 64 bits (most common case)
   - Fast path for small mats

### Medium-Impact, Medium-Effort

5. **Arena Allocation for Nouns**

   ```ocaml
   type noun_arena = {
     atoms: Z.t array;
     cells: (int * int) array;  (* indices into arena *)
     mutable next: int;
   }
   ```

   - Preallocate space for estimated noun count
   - Convert tree to array representation during cue
   - Reconstruct proper noun tree at end

6. **Better Bit Reading**
   - Read 64 bits at a time when possible
   - Buffer larger chunks
   - Reduce function call overhead

7. **Profiling**
   - Actually profile with `perf` to see where time goes
   - May reveal surprising hotspots

### Low-Impact, High-Effort

8. **C FFI for Cue Only**
   - Call C Vere's `u3s_cue_xeno_with()`
   - Convert resulting C noun to OCaml noun
   - Still need conversion overhead

9. **Parallel Cue**
   - Split work across cores
   - Complex due to backreferences

## Benchmarking Plan

To validate optimizations, we need:

```ocaml
(* test/bench_cue_detailed.ml *)
let benchmark_sizes = [
  ("ivory.pill", 1.5);  (* MB *)
  ("solid.pill", 8.7);
]

let optimizations = [
  ("baseline",     cue_baseline);
  ("iterative",    cue_iterative);
  ("intmap",       cue_intmap);
  ("cached_atoms", cue_cached_atoms);
  ("combined",     cue_all_opts);
]
```

Target: **10x speedup** should be achievable with just iterative + intmap + cached atoms. That would bring solid pill cue from 300s → 30s (still slower than C, but usable).

## The Real Question

**Why is the gap so large?**

- Nock is only 2-5x slower than C
- Small noun serialization is comparable
- But large cue is 200x slower

This suggests the problem is **algorithmic/structural**, not just "C is faster than OCaml". Likely culprits:

1. Recursive overhead grows non-linearly with depth
2. Hashtable performance degrades with size
3. GC pauses become significant with millions of allocations

## Immediate Next Steps

1. **Profile Current Implementation**

   ```bash
   perf record -g dune exec test/cache_solid.exe
   perf report
   ```

   - Find actual hotspots
   - May be surprised by results

2. **Implement Iterative Cue**
   - Single biggest win likely
   - Can compare apples-to-apples

3. **Add IntMap for Backrefs**
   - Easy change
   - Should help significantly

4. **Measure Each Optimization**
   - Don't guess, measure
   - Build intuition about cost model

## Alternative: Work Around the Problem

If cue optimization proves too hard:

1. **Use Ivory Pill for Testing**
   - Already loads fast (0.44s)
   - Good enough for I/O driver development
   - Missing full Arvo, but has %zuse core

2. **Pre-cache Solid Pill** (see the sketch after this section)
   - Cue once (slowly), save with `Marshal`
   - Load marshalled version instantly
   - One-time cost per pill version

3. **C FFI Just for Cue**
   - Let C Vere handle deserialization
   - Convert result to OCaml nouns
   - Rest of runtime stays pure OCaml

But none of these feel right. **OCaml should be able to do better than 200x slower.**
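For option 2 above, a minimal sketch of the Marshal-based pre-cache. The `load_pill_cached` name and the `.cache` path scheme are illustrative, and this assumes `cue` takes a `bytes` buffer as in `lib/serial.ml`:

```ocaml
(* Cue once (slow), then Marshal the resulting noun;
   later runs deserialize the cached tree directly. *)
let read_file path =
  In_channel.with_open_bin path In_channel.input_all

let load_pill_cached pill_path =
  let cache_path = pill_path ^ ".cache" in  (* illustrative naming scheme *)
  if Sys.file_exists cache_path then
    (* Default Marshal flags preserve sharing, so the structural
       sharing that backrefs created survives the round trip. *)
    In_channel.with_open_bin cache_path Marshal.from_channel
  else begin
    let noun = cue (Bytes.of_string (read_file pill_path)) in
    Out_channel.with_open_bin cache_path (fun oc ->
        Marshal.to_channel oc noun []);
    noun
  end
```

One caveat: Marshal's format is not stable across compiler versions, so a real cache would need to be invalidated when either the pill or the OCaml version changes.
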
## Success Criteria

- [ ] Solid pill cues in <30 seconds (10x improvement)
- [ ] Solid pill cues in <10 seconds (30x improvement)
- [ ] Solid pill cues in <3 seconds (100x improvement)
- [ ] Understand *why* the gap exists
- [ ] Document optimization techniques for future work

The goal isn't to beat C - it's to make OCaml fast enough to be practical.
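A small harness is enough to track these thresholds over time (the file name and pill paths are hypothetical; `cue` is the entry point from `lib/serial.ml`, and timing uses the standard `unix` library):

```ocaml
(* test/bench_cue.ml (hypothetical): report wall-clock time and throughput. *)
let time_cue label path =
  let data = In_channel.with_open_bin path In_channel.input_all in
  let t0 = Unix.gettimeofday () in
  let _noun = cue (Bytes.of_string data) in
  let dt = Unix.gettimeofday () -. t0 in
  Printf.printf "%-6s %8.2f s  %6.2f MB/s\n"
    label dt (float_of_int (String.length data) /. 1e6 /. dt)

let () =
  time_cue "ivory" "pills/ivory.pill";
  time_cue "solid" "pills/solid.pill"
```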