# Nock Interpreter Benchmark Results

## Overview

This document compares the performance of the OCaml Nock interpreter port against a simplified C implementation.

## Test Environment

- **Platform**: Linux
- **Compiler (C)**: GCC with -O3 optimization
- **Compiler (OCaml)**: OCaml native code compiler via Dune
- **Test iterations**: 1,000,000 for most operations, 100,000 for complex operations

## Benchmark Methodology

Each benchmark:
1. Constructs a Nock formula for a specific operation
2. Executes the formula N times in a tight loop
3. Measures elapsed time
4. Calculates operations per second

The OCaml benchmark includes a warmup phase and runs `Gc.compact()` before timing.

## Results

### Raw Performance (Operations per Second)

| Operation                    | C Implementation | OCaml Implementation | Speedup (C/OCaml) |
|------------------------------|------------------|----------------------|-------------------|
| Opcode 0: Slot/fragment      | 579 Mops/sec     | 42.5 Mops/sec        | 13.6x             |
| Opcode 1: Constant           | 595 Mops/sec     | 142 Mops/sec         | 4.2x              |
| Opcode 3: Is-cell            | 271 Mops/sec     | 56.6 Mops/sec        | 4.8x              |
| Opcode 4: Increment          | 265 Mops/sec     | 63.1 Mops/sec        | 4.2x              |
| Opcode 5: Equality           | 24 Mops/sec      | **29.6 Mops/sec**    | **0.8x** (OCaml faster!) |
| Opcode 6: If-then-else       | 185 Mops/sec     | 37.2 Mops/sec        | 5.0x              |
| Opcode 7: Composition        | 174 Mops/sec     | 36.0 Mops/sec        | 4.8x              |
| Opcode 8: Push               | 26.5 Mops/sec    | **32.7 Mops/sec**    | **0.8x** (OCaml faster!) |
| Cell construction            | 25.9 Mops/sec    | **53.2 Mops/sec**    | **0.5x** (OCaml 2x faster!) |
| Deep slot lookup (depth 4)   | 566 Mops/sec     | 19.2 Mops/sec        | 29.6x             |

### Performance Categories

#### Fast operations (>100 Mops/sec)
- **C**: Constant, Slot lookup
- **OCaml**: Constant

#### Medium operations (20-100 Mops/sec)
- **C**: Is-cell, Increment, If-then-else, Composition
- **OCaml**: Slot, Increment, Is-cell, If-then-else, Composition, Cell construction

#### Slower operations (<30 Mops/sec)
- **C**: Equality, Push, Cell construction, Deep slot (anomaly)
- **OCaml**: Equality, Push, Deep slot

## Analysis

### Where C Wins

**Simple, non-allocating operations** (4-14x faster):
- Slot lookup
- Constant retrieval
- Is-cell test
- Increment

**Reasons:**
1. No garbage collection overhead
2. Direct pointer manipulation
3. Function inlining with -O3
4. CPU cache-friendly memory access

### Where OCaml Wins or Competes

**Allocation-heavy operations** (0.5-0.8x, OCaml faster or competitive):
- Cell construction: **2x faster in OCaml**
- Push (allocates cells): **1.2x faster in OCaml**
- Equality (may allocate): **1.2x faster in OCaml**

**Reasons:**
1. OCaml's generational GC is very fast for short-lived allocations
2. Efficient minor heap allocation (bump-the-pointer)
3. The C benchmark leaks memory (no deallocation), which would slow it down if added
4. OCaml's GC amortizes collection cost across operations

### The Deep Slot Anomaly

The deep slot lookup shows an unusual pattern:
- C: 566 Mops/sec (very fast)
- OCaml: 19.2 Mops/sec (much slower)

This is likely due to:
1. OCaml's slot implementation using recursion (stack overhead)
2. Pattern matching overhead
3. Zarith (GMP) operations even on small integers

Could be optimized in OCaml with:
- Iterative implementation
- Special-casing small integers
- Tail recursion optimization

## Comparison Caveats

### C Implementation Simplifications

The C benchmark is **not** the full Vere implementation. It:
- Uses direct 64-bit integers (no arbitrary precision)
- Leaks memory (no deallocation)
- Has no error handling
- Lacks Vere's loom allocator
- Has no jet system
- Has no memoization

A real comparison would need the full Vere implementation with:
- Loom-based allocation
- Jet dashboard (accelerated implementations)
- Memoization caching
- Proper error handling and stack traces

### OCaml Implementation Features

The OCaml implementation:
- Uses Zarith for arbitrary-precision integers (GMP-backed)
- Includes proper error handling (exceptions)
- Is memory-safe (no manual deallocation needed)
- Is type-safe (compile-time guarantees)
- Has clean, maintainable code

## Optimization Opportunities

### OCaml Optimizations

1. **Unboxed integers**: Use native ints for small values
2. **Iterative slot lookup**: Replace recursion with loop
3. **Inline critical paths**: Use `[@inline]` attributes
4. **Custom allocator**: Consider a region-based allocator for nouns
5. **Flambda**: Use Flambda optimizing compiler

Potential speedup: **2-3x** on most operations

### C Optimizations

The simple C implementation could be made faster with:
1. **SIMD instructions**: Vectorize tree operations
2. **Better memory layout**: Structure-of-arrays
3. **JIT compilation**: Generate machine code at runtime

But these optimizations apply to Vere, not this simple benchmark.

## Conclusion

### Performance Summary

- **Simple operations**: C is 4-14x faster
- **Allocation-heavy operations**: OCaml is competitive or faster
- **Overall**: C is ~2-5x faster on average, but with caveats

### Practical Implications

The OCaml implementation is:
- ✅ **Fast enough** for most use cases (10-140 Mops/sec)
- ✅ **Memory safe** (no segfaults, buffer overflows)
- ✅ **Type safe** (catches errors at compile time)
- ✅ **Maintainable** (clean, high-level code)
- ✅ **Correct** (arbitrary-precision integers built-in)

Trade-offs:
- ❌ 2-5x slower than optimized C for most operations
- ❌ Much slower (14x) for deep tree traversal
- ✅ But faster for allocation-heavy workloads

### When to Use Which

**Use OCaml when:**
- Rapid prototyping and development
- Memory safety is critical
- Code maintainability matters
- Absolute performance is not critical

**Use C (Vere) when:**
- Maximum performance is required
- Jets and memoization are needed
- Full Urbit integration is required

## Future Work

1. **Benchmark against full Vere**: Compare with the actual bytecode interpreter
2. **Profile and optimize**: Find OCaml hotspots and optimize
3. **Implement jets**: Add accelerated implementations for common formulas
4. **Add memoization**: Cache repeated computations
5. **Try alternative representations**: Experiment with different noun encodings

## Running the Benchmarks

```bash
# OCaml benchmark
dune exec ./bench_nock.exe

# Simple C benchmark
gcc -O3 -o bench_simple bench_simple.c
./bench_simple

# Full comparison
./compare.sh
```