From fcedfddf00b3f994e4f4e40332ac7fc192c63244 Mon Sep 17 00:00:00 2001
From: polwex
Date: Sun, 5 Oct 2025 21:56:51 +0700
Subject: claude is gud

---
 ocaml/BENCHMARKS.md | 203 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 203 insertions(+)
 create mode 100644 ocaml/BENCHMARKS.md

diff --git a/ocaml/BENCHMARKS.md b/ocaml/BENCHMARKS.md
new file mode 100644
index 0000000..be0933f
--- /dev/null
+++ b/ocaml/BENCHMARKS.md
@@ -0,0 +1,203 @@
+# Nock Interpreter Benchmark Results
+
+## Overview
+
+This document compares the performance of the OCaml Nock interpreter port against a simplified C implementation.
+
+## Test Environment
+
+- **Platform**: Linux
+- **Compiler (C)**: GCC with -O3 optimization
+- **Compiler (OCaml)**: OCaml native-code compiler via Dune
+- **Test iterations**: 1,000,000 for most operations, 100,000 for complex operations
+
+## Benchmark Methodology
+
+Each benchmark:
+1. Constructs a Nock formula for a specific operation
+2. Executes the formula N times in a tight loop
+3. Measures elapsed time
+4. Calculates operations per second
+
+The OCaml benchmark includes a warmup phase and runs `Gc.compact ()` before timing.
+
+## Results
+
+### Raw Performance (Operations per Second)
+
+| Operation                  | C Implementation | OCaml Implementation | Speedup (C/OCaml)        |
+|----------------------------|------------------|----------------------|--------------------------|
+| Opcode 0: Slot/fragment    | 579 Mops/sec     | 42.5 Mops/sec        | 13.6x                    |
+| Opcode 1: Constant         | 595 Mops/sec     | 142 Mops/sec         | 4.2x                     |
+| Opcode 3: Is-cell          | 271 Mops/sec     | 56.6 Mops/sec        | 4.8x                     |
+| Opcode 4: Increment        | 265 Mops/sec     | 63.1 Mops/sec        | 4.2x                     |
+| Opcode 5: Equality         | 24 Mops/sec      | **29.6 Mops/sec**    | **0.8x** (OCaml faster!) |
+| Opcode 6: If-then-else     | 185 Mops/sec     | 37.2 Mops/sec        | 5.0x                     |
+| Opcode 7: Composition      | 174 Mops/sec     | 36.0 Mops/sec        | 4.8x                     |
+| Opcode 8: Push             | 26.5 Mops/sec    | **32.7 Mops/sec**    | **0.8x** (OCaml faster!) |
+| Cell construction          | 25.9 Mops/sec    | **53.2 Mops/sec**    | **0.5x** (OCaml 2x faster!) |
+| Deep slot lookup (depth 4) | 566 Mops/sec     | 19.2 Mops/sec        | 29.6x                    |
+
+### Performance Categories
+
+#### Fast operations (>100 Mops/sec)
+- **C**: Constant, Slot lookup, Is-cell, Increment, If-then-else, Composition, Deep slot lookup
+- **OCaml**: Constant
+
+#### Medium operations (30-100 Mops/sec)
+- **C**: (none)
+- **OCaml**: Slot, Increment, Is-cell, If-then-else, Composition, Push, Cell construction
+
+#### Slower operations (<30 Mops/sec)
+- **C**: Equality, Push, Cell construction
+- **OCaml**: Equality (borderline at 29.6), Deep slot lookup
+
+## Analysis
+
+### Where C Wins
+
+**Simple, non-allocating operations** (4-14x faster):
+- Slot lookup
+- Constant retrieval
+- Is-cell test
+- Increment
+
+**Reasons:**
+1. No garbage-collection overhead
+2. Direct pointer manipulation
+3. Function inlining with -O3
+4. CPU cache-friendly memory access
+
+### Where OCaml Wins or Competes
+
+**Allocation-heavy operations** (0.5-0.8x, OCaml faster or competitive):
+- Cell construction: **2x faster in OCaml**
+- Push (allocates cells): **1.2x faster in OCaml**
+- Equality (may allocate): **1.2x faster in OCaml**
+
+**Reasons:**
+1. OCaml's generational GC is very fast for short-lived allocations
+2. Efficient minor-heap allocation (bump-the-pointer)
+3. The C benchmark leaks memory (no deallocation); adding proper deallocation would slow it down
+4. OCaml's GC amortizes collection cost across operations
+
+### The Deep Slot Anomaly
+
+Deep slot lookup shows an unusual pattern:
+- C: 566 Mops/sec (very fast)
+- OCaml: 19.2 Mops/sec (much slower)
+
+This is likely due to:
+1. OCaml's slot implementation using recursion (stack overhead)
+2. Pattern-matching overhead
+3. Zarith (GMP) operations even on small integers
+
+It could be optimized in OCaml with:
+- An iterative implementation
+- Special-casing small integers
+- Tail-recursion optimization
+
+## Comparison Caveats
+
+### C Implementation Simplifications
+
+The C benchmark is **not** the full Vere implementation. It:
+- Uses direct 64-bit integers (no arbitrary precision)
+- Leaks memory (no deallocation)
+- Has no error handling
+- Lacks Vere's loom allocator
+- Has no jet system
+- Has no memoization
+
+A real comparison would need the full Vere implementation, with:
+- Loom-based allocation
+- Jet dashboard (accelerated implementations)
+- Memoization caching
+- Proper error handling and stack traces
+
+### OCaml Implementation Features
+
+The OCaml implementation:
+- Uses Zarith for arbitrary-precision integers (GMP-backed)
+- Includes proper error handling (exceptions)
+- Is memory-safe (no manual deallocation needed)
+- Is type-safe (compile-time guarantees)
+- Has clean, maintainable code
+
+## Optimization Opportunities
+
+### OCaml Optimizations
+
+1. **Unboxed integers**: Use native ints for small values
+2. **Iterative slot lookup**: Replace recursion with a loop
+3. **Inline critical paths**: Use `[@inline]` attributes
+4. **Custom allocator**: Consider a region-based allocator for nouns
+5. **Flambda**: Use the Flambda optimizing compiler
+
+Potential speedup: **2-3x** on most operations
+
+### C Optimizations
+
+The simple C implementation could be made faster with:
+1. **SIMD instructions**: Vectorize tree operations
+2. **Better memory layout**: Structure-of-arrays
+3. **JIT compilation**: Generate machine code at runtime
+
+But these optimizations apply to Vere, not this simple benchmark.
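+The recursive-versus-iterative slot distinction behind the deep-slot anomaly is easy to see in C. The sketch below is illustrative only: the `noun` struct and function names are hypothetical, not taken from the simplified benchmark or from Vere, and atoms are plain 64-bit values as in the simplified C benchmark. It shows how the bits of a Nock axis drive the tree descent, and how the recursion can be replaced by a loop:
+
+```c
+#include <stddef.h>
+#include <stdint.h>
+
+/* Hypothetical noun representation: an atom holds a 64-bit value;
+   a cell holds two sub-nouns (mirroring the simplified C benchmark). */
+typedef struct noun {
+    int is_cell;
+    uint64_t atom;            /* valid when is_cell == 0 */
+    struct noun *head, *tail; /* valid when is_cell == 1 */
+} noun;
+
+/* Recursive slot lookup (Nock opcode 0), axis >= 1:
+   axis 1 is the whole noun, axis 2n is the head of slot n,
+   axis 2n+1 is the tail of slot n. */
+static noun *slot_rec(noun *subject, uint64_t axis) {
+    if (axis == 1) return subject;
+    noun *parent = slot_rec(subject, axis >> 1);
+    return (axis & 1) ? parent->tail : parent->head;
+}
+
+/* Iterative version: after the leading 1 bit, each remaining axis bit
+   (from most to least significant) selects head (0) or tail (1),
+   so the loop descends one cell per bit with no call-stack growth. */
+static noun *slot_iter(noun *subject, uint64_t axis) {
+    int msb = 63;
+    while (msb > 0 && !((axis >> msb) & 1)) msb--; /* find leading 1 */
+    for (int i = msb - 1; i >= 0; i--)
+        subject = ((axis >> i) & 1) ? subject->tail : subject->head;
+    return subject;
+}
+```
+
+The iterative form performs exactly one branch per tree level, which is the loop-based rewrite suggested under "OCaml Optimizations" above.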
+
+## Conclusion
+
+### Performance Summary
+
+- **Simple operations**: C is 4-14x faster
+- **Allocation-heavy operations**: OCaml is competitive or faster
+- **Overall**: C is roughly 2-5x faster on average, with caveats
+
+### Practical Implications
+
+The OCaml implementation is:
+- ✅ **Fast enough** for most use cases (19-142 Mops/sec)
+- ✅ **Memory safe** (no segfaults or buffer overflows)
+- ✅ **Type safe** (catches errors at compile time)
+- ✅ **Maintainable** (clean, high-level code)
+- ✅ **Correct** (arbitrary-precision integers built in)
+
+Trade-offs:
+- ❌ 2-5x slower than optimized C for most operations
+- ❌ Much slower (~30x) for deep tree traversal
+- ✅ But faster for allocation-heavy workloads
+
+### When to Use Which
+
+**Use OCaml when:**
+- Rapid prototyping and development matter
+- Memory safety is critical
+- Code maintainability matters
+- Absolute performance is not critical
+
+**Use C (Vere) when:**
+- Maximum performance is required
+- Jets and memoization are needed
+- Full Urbit integration is required
+
+## Future Work
+
+1. **Benchmark against full Vere**: Compare with the actual bytecode interpreter
+2. **Profile and optimize**: Find OCaml hotspots and optimize them
+3. **Implement jets**: Add accelerated implementations for common formulas
+4. **Add memoization**: Cache repeated computations
+5. **Try alternative representations**: Experiment with different noun encodings
+
+## Running the Benchmarks
+
+```bash
+# OCaml benchmark
+dune exec ./bench_nock.exe
+
+# Simple C benchmark
+gcc -O3 -o bench_simple bench_simple.c
+./bench_simple
+
+# Full comparison
+./compare.sh
+```
-- 
cgit v1.2.3