# Nock Interpreter Benchmark Results

## Overview

This document compares the performance of the OCaml Nock interpreter port against a simplified C implementation.

## Test Environment

- **Platform**: Linux
- **Compiler (C)**: GCC with -O3 optimization
- **Compiler (OCaml)**: OCaml native code compiler via Dune
- **Test iterations**: 1,000,000 for most operations, 100,000 for complex operations

## Benchmark Methodology

Each benchmark:
1. Constructs a Nock formula for a specific operation
2. Executes the formula N times in a tight loop
3. Measures elapsed time
4. Calculates operations per second

The OCaml benchmark includes a warmup phase and runs `Gc.compact()` before timing.
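
In code, the measured loop is roughly the following sketch. `Nock.nock` and the `bench` wrapper are placeholder names for whatever the port's benchmark actually exposes; this illustrates the methodology and is not the benchmark source itself.

```ocaml
(* Sketch of the timing harness described above. [Nock.nock] stands in for
   the interpreter's entry point; requires the unix library for wall-clock
   timing. *)
let bench ~name ~iters ~subject ~formula =
  (* Warmup so the measured run does not pay cold-start costs. *)
  for _ = 1 to 10_000 do
    ignore (Nock.nock subject formula)
  done;
  Gc.compact ();
  let t0 = Unix.gettimeofday () in
  for _ = 1 to iters do
    ignore (Nock.nock subject formula)
  done;
  let dt = Unix.gettimeofday () -. t0 in
  Printf.printf "%-28s %8.1f Mops/sec\n" name
    (float_of_int iters /. dt /. 1e6)
```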

## Results

### Raw Performance (Operations per Second)

| Operation                    | C Implementation | OCaml Implementation | Speedup (C/OCaml) |
|------------------------------|------------------|----------------------|-------------------|
| Opcode 0: Slot/fragment      | 579 Mops/sec     | 42.5 Mops/sec        | 13.6x             |
| Opcode 1: Constant           | 595 Mops/sec     | 142 Mops/sec         | 4.2x              |
| Opcode 3: Is-cell            | 271 Mops/sec     | 56.6 Mops/sec        | 4.8x              |
| Opcode 4: Increment          | 265 Mops/sec     | 63.1 Mops/sec        | 4.2x              |
| Opcode 5: Equality           | 24 Mops/sec      | **29.6 Mops/sec**    | **0.8x** (OCaml faster!) |
| Opcode 6: If-then-else       | 185 Mops/sec     | 37.2 Mops/sec        | 5.0x              |
| Opcode 7: Composition        | 174 Mops/sec     | 36.0 Mops/sec        | 4.8x              |
| Opcode 8: Push               | 26.5 Mops/sec    | **32.7 Mops/sec**    | **0.8x** (OCaml faster!) |
| Cell construction            | 25.9 Mops/sec    | **53.2 Mops/sec**    | **0.5x** (OCaml 2x faster!) |
| Deep slot lookup (depth 4)   | 566 Mops/sec     | 19.2 Mops/sec        | 29.6x             |

### Performance Categories

#### Fast operations (>100 Mops/sec)
- **C**: Constant, Slot lookup, Deep slot lookup, Is-cell, Increment, If-then-else, Composition
- **OCaml**: Constant

#### Medium operations (30-100 Mops/sec)
- **C**: none
- **OCaml**: Increment, Is-cell, Cell construction, Slot, If-then-else, Composition, Push

#### Slower operations (<30 Mops/sec)
- **C**: Push, Cell construction, Equality
- **OCaml**: Equality, Deep slot

## Analysis

### Where C Wins

**Simple, non-allocating operations** (4-14x faster):
- Slot lookup
- Constant retrieval
- Is-cell test
- Increment

**Reasons:**
1. No garbage collection overhead
2. Direct pointer manipulation
3. Function inlining with -O3
4. CPU cache-friendly memory access

### Where OCaml Wins or Competes

**Allocation-heavy operations** (0.5-0.8x, OCaml faster or competitive):
- Cell construction: **2x faster in OCaml**
- Push (allocates cells): **1.2x faster in OCaml**
- Equality (may allocate): **1.2x faster in OCaml**

**Reasons:**
1. OCaml's generational GC is very fast for short-lived allocations
2. Efficient minor heap allocation (bump-the-pointer)
3. The C benchmark leaks memory (it never frees cells); adding proper deallocation would slow it down
4. OCaml's GC amortizes collection cost across operations

### The Deep Slot Anomaly

The deep slot lookup shows an unusual pattern:
- C: 566 Mops/sec (very fast)
- OCaml: 19.2 Mops/sec (much slower)

This is likely due to:
1. OCaml's slot implementation using recursion (stack overhead)
2. Pattern matching overhead
3. Zarith (GMP) operations even on small integers

This could be mitigated in OCaml with:
- An iterative implementation (see the sketch after this list)
- Special-casing small integers
- Tail recursion optimization
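
For illustration, an iterative lookup over axes that fit in a native int could look like the sketch below. The noun type and function names are assumptions made for the example, not the port's actual definitions; the point is that the axis bits can be walked in a loop instead of recursing with Zarith arithmetic.

```ocaml
(* Sketch only: iterative slot lookup for native-int axes. In Nock, axis 1 is
   the whole subject, axis 2n is the head of axis n, and axis 2n+1 is its tail,
   so the bits of the axis below the leading 1 spell out the path. *)
type noun = Atom of Z.t | Cell of noun * noun

exception Bail

let slot_int (axis : int) (subject : noun) : noun =
  if axis < 1 then raise Bail;
  (* Position of the axis's highest set bit. *)
  let top = ref 0 in
  while axis lsr !top > 1 do incr top done;
  (* Walk the remaining bits from high to low: 0 selects the head, 1 the tail. *)
  let cur = ref subject in
  for i = !top - 1 downto 0 do
    (match !cur with
     | Cell (h, t) -> cur := if (axis lsr i) land 1 = 0 then h else t
     | Atom _ -> raise Bail)
  done;
  !cur
```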

## Comparison Caveats

### C Implementation Simplifications

The C benchmark is **not** the full Vere implementation. It:
- Uses direct 64-bit integers (no arbitrary precision)
- Leaks memory (no deallocation)
- Has no error handling
- Lacks Vere's loom allocator
- Has no jet system
- Has no memoization

A real comparison would need the full Vere implementation with:
- Loom-based allocation
- Jet dashboard (accelerated implementations)
- Memoization caching
- Proper error handling and stack traces

### OCaml Implementation Features

The OCaml implementation:
- Uses Zarith for arbitrary-precision integers (GMP-backed)
- Includes proper error handling (exceptions)
- Is memory-safe (no manual deallocation needed)
- Is type-safe (compile-time guarantees)
- Has clean, maintainable code

## Optimization Opportunities

### OCaml Optimizations

1. **Unboxed integers**: Use native ints for small values (sketched below)
2. **Iterative slot lookup**: Replace recursion with loop
3. **Inline critical paths**: Use `[@inline]` attributes
4. **Custom allocator**: Consider a region-based allocator for nouns
5. **Flambda**: Use Flambda optimizing compiler

Potential speedup: **2-3x** on most operations
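
As a hedged sketch of items 1 and 3, a hybrid representation could keep small atoms as immediate native ints and fall back to Zarith only on overflow, with `[@inline]` on the hot path. The type and function below are illustrative assumptions, not the current code.

```ocaml
(* Illustrative only: hybrid atom representation for the "unboxed integers"
   idea, plus an [@inline] hint on a hot-path helper. *)
type noun =
  | Small of int                (* atom that fits in a native immediate int *)
  | Big   of Z.t                (* arbitrary-precision atom (Zarith/GMP)    *)
  | Cell  of noun * noun

(* Nock opcode 4 (increment) without touching GMP on the common path. *)
let[@inline] inc = function
  | Small n when n < max_int -> Small (n + 1)
  | Small n -> Big (Z.succ (Z.of_int n))
  | Big z -> Big (Z.succ z)
  | Cell _ -> invalid_arg "inc: cannot increment a cell"
```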

### C Optimizations

The simple C implementation could be made faster with:
1. **SIMD instructions**: Vectorize tree operations
2. **Better memory layout**: Structure-of-arrays
3. **JIT compilation**: Generate machine code at runtime

But these optimizations apply to Vere, not this simple benchmark.

## Conclusion

### Performance Summary

- **Simple operations**: C is 4-14x faster
- **Allocation-heavy operations**: OCaml is competitive or faster
- **Overall**: C is ~2-5x faster on average, but with caveats

### Practical Implications

The OCaml implementation is:
- ✅ **Fast enough** for most use cases (~19-142 Mops/sec in these tests)
- ✅ **Memory safe** (no segfaults, buffer overflows)
- ✅ **Type safe** (catches errors at compile time)
- ✅ **Maintainable** (clean, high-level code)
- ✅ **Correct** (arbitrary-precision integers built-in)

Trade-offs:
- ❌ 2-5x slower than optimized C for most operations
- ❌ Much slower (14-30x) for slot lookups and deep tree traversal
- ✅ But faster for allocation-heavy workloads

### When to Use Which

**Use OCaml when:**
- Rapid prototyping and development
- Memory safety is critical
- Code maintainability matters
- Absolute performance is not critical

**Use C (Vere) when:**
- Maximum performance is required
- Jets and memoization are needed
- Full Urbit integration is required

## Future Work

1. **Benchmark against full Vere**: Compare with the actual bytecode interpreter
2. **Profile and optimize**: Find OCaml hotspots and optimize
3. **Implement jets**: Add accelerated implementations for common formulas
4. **Add memoization**: Cache repeated computations (see the sketch after this list)
5. **Try alternative representations**: Experiment with different noun encodings
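
For item 4, a first cut at memoization could be a hash table keyed on the (subject, formula) pair, as sketched below. It reuses the assumed noun type and `Nock.nock` placeholder from the sketches above; a real cache would bound its size and use a cheap precomputed noun hash rather than rehashing large nouns.

```ocaml
(* Sketch of a memoization cache keyed on (subject, formula). Assumes
   [type noun = Atom of Z.t | Cell of noun * noun]; Zarith atoms support the
   structural equality and hashing that the polymorphic Hashtbl relies on. *)
let cache : (noun * noun, noun) Hashtbl.t = Hashtbl.create 4096

let nock_memo (subject : noun) (formula : noun) : noun =
  match Hashtbl.find_opt cache (subject, formula) with
  | Some result -> result
  | None ->
      let result = Nock.nock subject formula in
      Hashtbl.add cache (subject, formula) result;
      result
```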

## Running the Benchmarks

```bash
# OCaml benchmark
dune exec ./bench_nock.exe

# Simple C benchmark
gcc -O3 -o bench_simple bench_simple.c
./bench_simple

# Full comparison
./compare.sh
```