# Arvo Boot Challenge: The Cue Performance Problem
## Goal
Boot a real Arvo kernel in the OCaml runtime to enable full event processing and poke operations.
## Current Status
### ✅ What's Working
1. **Nock Interpreter**: Complete, all tests passing, ~2-5x slower than C (acceptable)
2. **Jam/Cue for Small Data**: Working correctly, all roundtrip tests pass
3. **Ivory Pill Loading**: 1.5 MB pill cues in **0.44 seconds** (manageable)
4. **Boot Infrastructure**:
- Implemented `u3v_life()` lifecycle formula `[2 [0 3] [0 2]]`
- Ivory pill tag validation (0x79726f7669 = "ivory")
- C Vere poke pattern understood (slot 23 → slam with event)
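For reference, the lifecycle step is tiny once a Nock evaluator exists. A minimal sketch, assuming a `nock : noun -> noun -> noun` evaluator (subject first, formula second) and the `noun` constructors from lib/serial.ml:
```ocaml
(* u3v_life: [2 [0 3] [0 2]] runs the head of the boot-event list as a
   formula, with the tail (the remaining events) as the subject. *)
let life eve =
  let atom n = Atom (Z.of_int n) in
  let lifecycle =
    Cell (atom 2, Cell (Cell (atom 0, atom 3), Cell (atom 0, atom 2)))
  in
  nock eve lifecycle
```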
### 🔴 The Blocker: Cue Performance on Large Pills
**Solid Pill Requirements:**
- Size: 8.7 MB (jammed)
- Contains: 5 boot events to build full Arvo (`+aeon`, `+boot`, `+fate`, `+hoon`, `+arvo`)
- Structure: `[tag [event1 event2 event3 event4 event5]]`
**Performance Gap:**
```
C Vere: 1.45 seconds (6.0 MB/s throughput)
OCaml: >300 seconds (~0.03 MB/s throughput)
Ratio: ~200x slower
```
This makes solid pill loading impractical for development/testing.
## Why Is OCaml Cue So Slow?
### Current Implementation (lib/serial.ml:140-178)
```ocaml
let cue bytes =
  let r = reader_create bytes in
  let backref_table = Hashtbl.create 256 in
  let rec cue_noun () =
    let pos = reader_pos r in
    let tag0 = read_bit r in
    if not tag0 then begin
      (* Atom *)
      let (value, _width) = mat_decode r in
      let result = Atom value in
      Hashtbl.add backref_table pos result;
      result
    end else begin
      let tag1 = read_bit r in
      if tag1 then begin
        (* Backref *)
        let (ref_pos, _width) = mat_decode r in
        Hashtbl.find backref_table (Z.to_int ref_pos)
      end else begin
        (* Cell *)
        let head = cue_noun () in
        let tail = cue_noun () in
        let result = Cell (head, tail) in
        Hashtbl.add backref_table pos result;
        result
      end
    end
  in
  cue_noun ()
```
### Identified Bottlenecks
1. **Deep Recursion**: `cue_noun` cannot be tail-call optimized (the head call is never in tail position)
- 8.7 MB pill → millions of recursive calls
- Stack frames pile up with noun depth
- Poor cache locality
2. **Hashtable Operations**:
- Generic `Hashtbl` insert on every atom and cell
- Polymorphic hashing of bit positions (plain integers)
- Created with only 256 buckets, so millions of entries force repeated resize-and-rehash passes
- Could use a table specialized for int keys
3. **Allocation Pressure**:
- Every `Atom`/`Cell` allocated separately
- No arena allocation
- GC pressure on millions of small objects
4. **Bit-Level Reading**:
- Even with byte-aligned fast paths, still slower than C
- OCaml function call overhead per bit/byte
- Bounds checking on every access
5. **Mat Decoding**:
- Called on every atom and backref
- Decodes variable-length integers bit by bit
- Could be optimized for common cases (a fast-path sketch follows this list)
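To make item 5 concrete: jam's length prefix is a unary run of zeros giving the bit-width of the bit-width, followed by that width with its top bit implicit, followed by the value itself. A hedged fast-path sketch under that assumption, using a hypothetical batched `read_bits : reader -> int -> int` (sketched under "Better Bit Reading" below); the existing `mat_decode` stays as the reference implementation:
```ocaml
(* Generic fallback: assemble a Z.t bit by bit (large atoms are rare). *)
let read_z_bits r width =
  let v = ref Z.zero in
  for i = 0 to width - 1 do
    if read_bit r then v := Z.logor !v (Z.shift_left Z.one i)
  done;
  !v

(* Fast path for small atoms. Returns (value, total encoded bit-width),
   matching [mat_decode]'s result shape. *)
let mat_decode_fast r =
  let rec zeros n = if read_bit r then n else zeros (n + 1) in
  let z = zeros 0 in
  if z = 0 then (Z.zero, 1)  (* mat(0) is the single bit "1" *)
  else begin
    (* the width has [z - 1] explicit low bits plus an implicit top 1 bit;
       assumes z - 1 <= 56 so one batched read suffices *)
    let width = read_bits r (z - 1) lor (1 lsl (z - 1)) in
    let value =
      if width <= 56 then Z.of_int (read_bits r width)  (* one batched read *)
      else read_z_bits r width                          (* slow path *)
    in
    (value, (2 * z) + width)
  end
```
If most atoms in a pill fit in a machine word, the slow path's cost stops mattering.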
## C Vere's Advantages
From `pkg/noun/serial.c` and `pkg/ur/`:
1. **Custom Memory Arena**: All nouns allocated in contiguous loom
2. **Optimized Hash Table**: `ur_dict32` specialized for integer keys
3. **Batch Operations**: Processes multiple bytes at once
4. **No Bounds Checking**: (for better or worse)
5. **Inline Everything**: Compiler can inline hot paths
6. **Cache-Friendly Layout**: Nouns are compact, sequential
## Potential Optimizations (Not Yet Tried)
### High-Impact, Low-Effort
1. **Iterative Cue with Explicit Stack**
```ocaml
type cue_frame =
  | CueCell of { pos : int; mutable head : noun option }
  (* ... frames for pending tails, etc. ... *)

let cue bytes =
  let stack : cue_frame Stack.t = Stack.create () in
  (* convert recursive calls into pushes/pops on [stack] *)
  ...
```
- Eliminates native-stack recursion (no depth limit, fewer frames)
- Better cache locality
- A fuller sketch combining this with items 2 and 3 follows this list
2. **Specialized Int-Keyed Table for Backrefs**
```ocaml
module IntTbl = Hashtbl.Make (struct
  type t = int
  let equal = Int.equal
  let hash = Hashtbl.hash
end)

let backref_table : noun IntTbl.t = IntTbl.create 65536
```
- Monomorphic `equal`/`hash` avoid the generic `Hashtbl`'s polymorphic primitives
- Sizing the table up front avoids repeated resize-and-rehash passes
- (Note: a persistent `Map.Make (Int)` would *lose* here: O(log n) lookups versus amortized O(1) hashing)
3. **Pre-allocate Common Atoms**
```ocaml
let atom_cache = Array.init 256 (fun i -> Atom (Z.of_int i))
```
- Reuse atoms for small values (0-255)
- Huge win for repetitive data
4. **Batch Mat Decoding**
- Optimize for atoms that fit in 64 bits (most common case)
- Fast path for small mats
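Putting items 1-3 together, as referenced above: a minimal end-to-end sketch, reusing the `IntTbl` module and `atom_cache` array from items 2 and 3, plus the `reader_create`/`reader_pos`/`read_bit`/`mat_decode` helpers from the recursive version. Untested; a starting point rather than a drop-in replacement:
```ocaml
(* Optimization 3 in use: intern small atoms instead of reallocating them. *)
let atom_of_z z =
  if Z.fits_int z && Z.to_int z >= 0 && Z.to_int z < 256
  then atom_cache.(Z.to_int z)
  else Atom z

let cue_iterative bytes =
  let r = reader_create bytes in
  let backrefs : noun IntTbl.t = IntTbl.create 65536 in
  (* Optimization 1: explicit stack of pending cells instead of recursion.
     [(pos, None)] = head still pending; [(pos, Some h)] = tail pending. *)
  let stack : (int * noun option) Stack.t = Stack.create () in
  (* Feed a completed noun upward. Tail-recursive, so the native stack
     stays flat no matter how deep the noun is. *)
  let rec finish noun =
    if Stack.is_empty stack then Some noun
    else
      match Stack.pop stack with
      | pos, None ->
        Stack.push (pos, Some noun) stack;  (* head done; tail decodes next *)
        None
      | pos, Some head ->
        let cell = Cell (head, noun) in
        IntTbl.add backrefs pos cell;
        finish cell
  in
  let result = ref None in
  while Option.is_none !result do
    let pos = reader_pos r in
    if not (read_bit r) then begin
      (* Atom: tag bit 0 *)
      let value, _width = mat_decode r in
      let a = atom_of_z value in
      IntTbl.add backrefs pos a;
      result := finish a
    end
    else if read_bit r then begin
      (* Backref: tag bits 1, 1 *)
      let ref_pos, _width = mat_decode r in
      result := finish (IntTbl.find backrefs (Z.to_int ref_pos))
    end
    else
      (* Cell: tag bits 1, 0; decode head next, then tail *)
      Stack.push (pos, None) stack
  done;
  Option.get !result
```
The key property: `finish` is tail-recursive and the while-loop never recurses, so decode depth no longer touches the native stack.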
### Medium-Impact, Medium-Effort
5. **Arena Allocation for Nouns**
```ocaml
type noun_arena = {
  atoms : Z.t array;
  cells : (int * int) array;  (* indices into the arena *)
  mutable next : int;
}
```
- Preallocate space for estimated noun count
- Convert tree to array representation during cue
- Reconstruct proper noun tree at end
6. **Better Bit Reading** (sketch after this list)
- Read up to 64 bits at a time when possible
- Buffer larger chunks
- Reduce per-bit function-call overhead
7. **Profile-Guided Optimization**
- Actually profile with `perf` to see where time goes
- May reveal surprising hotspots
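For item 6, a sketch of a buffered reader under stated assumptions: it keeps up to 64 bits in an `int64` so callers pay one byte load per 8 bits instead of one function call per bit. All names here are illustrative, not the existing lib/serial.ml API:
```ocaml
type bit_reader = {
  data : bytes;
  mutable word : int64;    (* buffered bits, least-significant first *)
  mutable avail : int;     (* number of valid bits in [word] *)
  mutable byte_pos : int;  (* next byte to pull from [data] *)
}

(* Top the buffer up to at least 57 bits (or EOF); the shift never
   exceeds 56, so no buffered bits are lost. *)
let refill r =
  while r.avail <= 56 && r.byte_pos < Bytes.length r.data do
    let b = Int64.of_int (Char.code (Bytes.get r.data r.byte_pos)) in
    r.word <- Int64.logor r.word (Int64.shift_left b r.avail);
    r.avail <- r.avail + 8;
    r.byte_pos <- r.byte_pos + 1
  done

(* Read [n] bits at once (n <= 56); past EOF the result is zero-padded. *)
let read_bits r n =
  if r.avail < n then refill r;
  let v = Int64.logand r.word (Int64.sub (Int64.shift_left 1L n) 1L) in
  r.word <- Int64.shift_right_logical r.word n;
  r.avail <- max 0 (r.avail - n);
  Int64.to_int v

let read_bit r = read_bits r 1 = 1
```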
### Low-Impact, High-Effort
8. **C FFI for Cue Only**
- Call C Vere's `u3s_cue_xeno_with()`
- Convert resulting C noun to OCaml noun
- Still need conversion overhead
9. **Parallel Cue**
- Split work across cores
- Complex due to backreferences
## Benchmarking Plan
To validate optimizations, we need:
```ocaml
(* test/bench_cue_detailed.ml *)
let benchmark_sizes = [
  ("ivory.pill", 1.5); (* MB *)
  ("solid.pill", 8.7);
]

let optimizations = [
  ("baseline", cue_baseline);
  ("iterative", cue_iterative);
  ("intmap", cue_intmap);
  ("cached_atoms", cue_cached_atoms);
  ("combined", cue_all_opts);
]
```
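A minimal driver for that table might look like this (assumes the `unix` library for wall-clock timing and that every `cue_*` variant takes `bytes`):
```ocaml
let read_file path =
  In_channel.with_open_bin path In_channel.input_all |> Bytes.of_string

let () =
  List.iter
    (fun (file, mb) ->
      let bytes = read_file file in
      List.iter
        (fun (name, cue_fn) ->
          let t0 = Unix.gettimeofday () in
          ignore (cue_fn bytes);
          let dt = Unix.gettimeofday () -. t0 in
          Printf.printf "%-12s %-14s %7.2f s  %6.2f MB/s\n"
            file name dt (mb /. dt))
        optimizations)
    benchmark_sizes
```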
Target: a **10x speedup** should be achievable from iterative traversal + the int-keyed table + cached atoms alone.
That would bring the solid-pill cue from >300 s down to ~30 s (still slower than C, but usable).
## The Real Question
**Why is the gap so large?**
- Nock is only 2-5x slower than C
- Small noun serialization is comparable
- But large cue is 200x slower
This suggests the problem is **algorithmic/structural**, not just "C is faster than OCaml".
Likely culprits:
1. Deep non-tail recursion: native-stack growth and cache misses scale with noun depth
2. The backref table starts at 256 buckets, so growing to millions of entries forces many full resize-and-rehash passes
3. Major-GC work becomes significant with millions of small, long-lived allocations
## Immediate Next Steps
1. **Profile Current Implementation**
```bash
# build first, then profile the binary directly so dune's own work
# doesn't pollute the profile
dune build --profile release test/cache_solid.exe
perf record -g ./_build/default/test/cache_solid.exe
perf report
```
- Find actual hotspots
- May be surprised by results
2. **Implement Iterative Cue**
- Single biggest win likely
- Can compare apples-to-apples
3. **Add an Int-Keyed Table for Backrefs**
- Easy change
- Should help significantly
4. **Measure Each Optimization**
- Don't guess, measure
- Build intuition about cost model
## Alternative: Work Around the Problem
If cue optimization proves too hard:
1. **Use Ivory Pill for Testing**
- Already loads fast (0.44s)
- Good enough for I/O driver development
- Missing full Arvo, but has %zuse core
2. **Pre-cache Solid Pill** (sketch after this list)
- Cue once (slowly), save the result with `Marshal`
- Load the marshalled version near-instantly on later runs
- One-time cost per pill version
3. **C FFI Just for Cue**
- Let C Vere handle deserialization
- Convert result to OCaml nouns
- Rest of runtime stays pure OCaml
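For option 2, a sketch assuming `cue : bytes -> noun` from lib/serial.ml. `Marshal` preserves sharing within a value by default, so the DAG structure the backrefs encode survives the round trip, and Zarith registers custom serializers for `Z.t`. File names are hypothetical:
```ocaml
(* One-time: cue the pill slowly, then snapshot the resulting noun. *)
let cache_pill ~pill_path ~cache_path =
  let jammed =
    In_channel.with_open_bin pill_path In_channel.input_all |> Bytes.of_string
  in
  let noun = cue jammed in
  Out_channel.with_open_bin cache_path (fun oc ->
    (* Default flags keep sharing. The snapshot is tied to the compiler
       version, so key the cache on pill hash + compiler version. *)
    Marshal.to_channel oc noun [])

(* Every later run: load the snapshot instead of re-cueing. *)
let load_cached cache_path : noun =
  In_channel.with_open_bin cache_path Marshal.from_channel
```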
But none of these feel right. **OCaml should be able to do better than 200x slower.**
## Success Criteria
- [ ] Solid pill cues in <30 seconds (10x improvement)
- [ ] Solid pill cues in <10 seconds (30x improvement)
- [ ] Solid pill cues in <3 seconds (100x improvement)
- [ ] Understand *why* the gap exists
- [ ] Document optimization techniques for future work
The goal isn't to beat C - it's to make OCaml fast enough to be practical.