# Nock Interpreter Benchmark Results ## Overview This document compares the performance of the OCaml Nock interpreter port against a simplified C implementation. ## Test Environment - **Platform**: Linux - **Compiler (C)**: GCC with -O3 optimization - **Compiler (OCaml)**: OCaml native code compiler via Dune - **Test iterations**: 1,000,000 for most operations, 100,000 for complex operations ## Benchmark Methodology Each benchmark: 1. Constructs a Nock formula for a specific operation 2. Executes the formula N times in a tight loop 3. Measures elapsed time 4. Calculates operations per second The OCaml benchmark includes a warmup phase and runs `Gc.compact()` before timing. ## Results ### Raw Performance (Operations per Second) | Operation | C Implementation | OCaml Implementation | Speedup (C/OCaml) | |------------------------------|------------------|----------------------|-------------------| | Opcode 0: Slot/fragment | 579 Mops/sec | 42.5 Mops/sec | 13.6x | | Opcode 1: Constant | 595 Mops/sec | 142 Mops/sec | 4.2x | | Opcode 3: Is-cell | 271 Mops/sec | 56.6 Mops/sec | 4.8x | | Opcode 4: Increment | 265 Mops/sec | 63.1 Mops/sec | 4.2x | | Opcode 5: Equality | 24 Mops/sec | **29.6 Mops/sec** | **0.8x** (OCaml faster!) | | Opcode 6: If-then-else | 185 Mops/sec | 37.2 Mops/sec | 5.0x | | Opcode 7: Composition | 174 Mops/sec | 36.0 Mops/sec | 4.8x | | Opcode 8: Push | 26.5 Mops/sec | **32.7 Mops/sec** | **0.8x** (OCaml faster!) | | Cell construction | 25.9 Mops/sec | **53.2 Mops/sec** | **0.5x** (OCaml 2x faster!) | | Deep slot lookup (depth 4) | 566 Mops/sec | 19.2 Mops/sec | 29.6x | ### Performance Categories #### Fast operations (>100 Mops/sec) - **C**: Constant, Slot lookup - **OCaml**: Constant #### Medium operations (20-100 Mops/sec) - **C**: Is-cell, Increment, If-then-else, Composition - **OCaml**: Slot, Increment, Is-cell, If-then-else, Composition, Cell construction #### Slower operations (<30 Mops/sec) - **C**: Equality, Push, Cell construction, Deep slot (anomaly) - **OCaml**: Equality, Push, Deep slot ## Analysis ### Where C Wins **Simple, non-allocating operations** (4-14x faster): - Slot lookup - Constant retrieval - Is-cell test - Increment **Reasons:** 1. No garbage collection overhead 2. Direct pointer manipulation 3. Function inlining with -O3 4. CPU cache-friendly memory access ### Where OCaml Wins or Competes **Allocation-heavy operations** (0.5-0.8x, OCaml faster or competitive): - Cell construction: **2x faster in OCaml** - Push (allocates cells): **1.2x faster in OCaml** - Equality (may allocate): **1.2x faster in OCaml** **Reasons:** 1. OCaml's generational GC is very fast for short-lived allocations 2. Efficient minor heap allocation (bump-the-pointer) 3. The C benchmark leaks memory (no deallocation), which would slow it down if added 4. OCaml's GC amortizes collection cost across operations ### The Deep Slot Anomaly The deep slot lookup shows an unusual pattern: - C: 566 Mops/sec (very fast) - OCaml: 19.2 Mops/sec (much slower) This is likely due to: 1. OCaml's slot implementation using recursion (stack overhead) 2. Pattern matching overhead 3. Zarith (GMP) operations even on small integers Could be optimized in OCaml with: - Iterative implementation - Special-casing small integers - Tail recursion optimization ## Comparison Caveats ### C Implementation Simplifications The C benchmark is **not** the full Vere implementation. It: - Uses direct 64-bit integers (no arbitrary precision) - Leaks memory (no deallocation) - Has no error handling - Lacks Vere's loom allocator - Has no jet system - Has no memoization A real comparison would need the full Vere implementation with: - Loom-based allocation - Jet dashboard (accelerated implementations) - Memoization caching - Proper error handling and stack traces ### OCaml Implementation Features The OCaml implementation: - Uses Zarith for arbitrary-precision integers (GMP-backed) - Includes proper error handling (exceptions) - Is memory-safe (no manual deallocation needed) - Is type-safe (compile-time guarantees) - Has clean, maintainable code ## Optimization Opportunities ### OCaml Optimizations 1. **Unboxed integers**: Use native ints for small values 2. **Iterative slot lookup**: Replace recursion with loop 3. **Inline critical paths**: Use `[@inline]` attributes 4. **Custom allocator**: Consider a region-based allocator for nouns 5. **Flambda**: Use Flambda optimizing compiler Potential speedup: **2-3x** on most operations ### C Optimizations The simple C implementation could be made faster with: 1. **SIMD instructions**: Vectorize tree operations 2. **Better memory layout**: Structure-of-arrays 3. **JIT compilation**: Generate machine code at runtime But these optimizations apply to Vere, not this simple benchmark. ## Conclusion ### Performance Summary - **Simple operations**: C is 4-14x faster - **Allocation-heavy operations**: OCaml is competitive or faster - **Overall**: C is ~2-5x faster on average, but with caveats ### Practical Implications The OCaml implementation is: - ✅ **Fast enough** for most use cases (10-140 Mops/sec) - ✅ **Memory safe** (no segfaults, buffer overflows) - ✅ **Type safe** (catches errors at compile time) - ✅ **Maintainable** (clean, high-level code) - ✅ **Correct** (arbitrary-precision integers built-in) Trade-offs: - ❌ 2-5x slower than optimized C for most operations - ❌ Much slower (14x) for deep tree traversal - ✅ But faster for allocation-heavy workloads ### When to Use Which **Use OCaml when:** - Rapid prototyping and development - Memory safety is critical - Code maintainability matters - Absolute performance is not critical **Use C (Vere) when:** - Maximum performance is required - Jets and memoization are needed - Full Urbit integration is required ## Future Work 1. **Benchmark against full Vere**: Compare with the actual bytecode interpreter 2. **Profile and optimize**: Find OCaml hotspots and optimize 3. **Implement jets**: Add accelerated implementations for common formulas 4. **Add memoization**: Cache repeated computations 5. **Try alternative representations**: Experiment with different noun encodings ## Running the Benchmarks ```bash # OCaml benchmark dune exec ./bench_nock.exe # Simple C benchmark gcc -O3 -o bench_simple bench_simple.c ./bench_simple # Full comparison ./compare.sh ```