summaryrefslogtreecommitdiff
path: root/ocaml/JAM_BACKREFS.md
diff options
context:
space:
mode:
authorpolwex <polwex@sortug.com>2025-10-20 13:13:39 +0700
committerpolwex <polwex@sortug.com>2025-10-20 13:13:39 +0700
commitd21900836f89b2bf9cd55ff1708a4619c8b89656 (patch)
treebb3a5842ae408ffa465814c6bbf27a5002866252 /ocaml/JAM_BACKREFS.md
neoinityes
Diffstat (limited to 'ocaml/JAM_BACKREFS.md')
-rw-r--r--ocaml/JAM_BACKREFS.md192
1 files changed, 192 insertions, 0 deletions
diff --git a/ocaml/JAM_BACKREFS.md b/ocaml/JAM_BACKREFS.md
new file mode 100644
index 0000000..fbb8269
--- /dev/null
+++ b/ocaml/JAM_BACKREFS.md
@@ -0,0 +1,192 @@
+# Jam Serialization and Backreferences
+
+## What is Jam?
+
+Jam is Urbit's serialization format for nouns. It encodes a noun into a bitstream that can be later decoded with `cue`. The format uses length-prefixing and supports **backreferences** to handle shared structure efficiently.
+
+## Basic Jam Encoding
+
+For each noun, jam writes:
+- **Atoms**: `0` bit followed by mat-encoded value
+- **Cells**: `01` bits followed by jam-encoded head and tail
+- **Backreferences**: `11` bits followed by mat-encoded bit position
+
+## What are Backreferences?
+
+A backreference is a pointer to a previously-encoded noun at a specific bit position in the output stream. Instead of encoding the same noun multiple times, jam can say "this noun is the same as the one at position X".
+
+### Example
+
+Consider jamming `[[1 2] [1 2]]` where both cells are the **same object** in memory:
+
+```ocaml
+let cell = pair (atom_int 1) (atom_int 2) in
+pair cell cell
+```
+
+**Without backreferences**, you'd encode:
+```
+01 (* outer cell *)
+ 01 0<1> 0<2> (* first [1 2] *)
+ 01 0<1> 0<2> (* second [1 2] - duplicate! *)
+```
+
+**With backreferences**, jam encodes:
+```
+01 (* outer cell *)
+ 01 0<1> 0<2> (* first [1 2] at position 2 *)
+ 11 <2> (* backref to position 2 *)
+```
+
+The backref is smaller!
+
+## Physical vs Structural Equality
+
+This is where things get interesting. Jam needs to decide: "Have I seen this noun before?"
+
+### Vere's Approach
+
+Vere uses `u3r_sing()` which does:
+1. **Fast path**: Check physical equality `(a == b)` - same pointer?
+2. **Slow path**: If not same pointer, do deep structural comparison
+
+This means Vere detects BOTH:
+- Shared structure (same object used twice)
+- Duplicate values (different objects with same value)
+
+### Initial OCaml Bug
+
+Our first implementation used **physical equality only**:
+```ocaml
+module NounTbl = Hashtbl.Make (struct
+ type t = noun
+ let equal = (==) (* BUG: physical equality only! *)
+ let hash = Hashtbl.hash
+end)
+```
+
+This caused problems because:
+- In C: `pair (atom_int 0) (atom_int 0)` might reuse the SAME `0` object (if it's cached)
+- Our hashtable would see the same `0` twice and create a backref
+- But the C version creates backrefs based on structural equality too
+
+### The Fix
+
+Change to structural equality:
+```ocaml
+let equal = Noun.equal (* structural equality *)
+```
+
+Now we match Vere's behavior for detecting duplicates.
+
+## The Size Optimization
+
+But wait! If we always use backrefs for duplicate atoms, why doesn't `[0 0]` use a backref?
+
+The answer: **Vere compares the size** of encoding the atom vs the backref!
+
+From `vere/pkg/noun/serial.c:157-169`:
+```c
+if ( a_w <= b_w ) {
+ // if atom is smaller than backref, encode atom directly
+ _cs_jam_fib_chop(fib_u, 1, 0);
+ _cs_jam_fib_mat(fib_u, a);
+}
+else {
+ // otherwise, encode backref
+ _cs_jam_fib_chop(fib_u, 2, 3);
+ _cs_jam_fib_mat(fib_u, b);
+}
+```
+
+### Example: `[0 0]`
+
+Encoding atom `0`:
+- `0` bit (atom tag)
+- `1` bit (mat encoding of 0)
+- **Total: 2 bits**
+
+Encoding backref to position 2:
+- `11` bits (backref tag)
+- mat encoding of `2` (which is `01` `10` = 4 bits)
+- **Total: 6 bits**
+
+**Atom is smaller!** So jam encodes the atom directly even though it's a duplicate.
+
+## Why Variable Reuse Matters in OCaml
+
+Consider this OCaml code:
+```ocaml
+let base = pair (atom_int 0) (atom_int 0) in
+pair base (pair base (atom_int 1))
+```
+
+vs this:
+```ocaml
+pair
+ (pair (atom_int 0) (atom_int 0))
+ (pair
+ (pair (atom_int 0) (atom_int 0))
+ (atom_int 1))
+```
+
+### First version (reusing `base`):
+- Both occurrences of `base` point to the **same object** in memory
+- Jam's hashtable (using structural equality) will find the duplicate
+- It will create a backref if the backref is smaller than re-encoding
+
+### Second version (no reuse):
+- Each `pair (atom_int 0) (atom_int 0)` creates a **fresh object**
+- They're different objects, even though structurally equal
+- Jam's hashtable compares them with `Noun.equal`
+- They're equal, so jam still considers them duplicates!
+
+Wait, so both should produce backrefs?
+
+### The Subtlety
+
+Actually, the issue is more nuanced. When you write:
+```ocaml
+let base = pair (atom_int 0) (atom_int 0) in
+```
+
+The OCaml runtime creates ONE object. Using `base` twice means the SAME object appears in two places in the structure. This is **true sharing**.
+
+But in C:
+```c
+u3nc(u3nc(0, 0), u3nc(u3nc(0, 0), ...))
+```
+
+Each `u3nc(0, 0)` call allocates a FRESH cell, even if the values are the same. There's no sharing unless you explicitly save and reuse the pointer.
+
+## The Test Case Fix
+
+The Vere tests are written to create fresh allocations with no sharing. To match this in OCaml, we needed to inline everything without variable reuse:
+
+**Before** (has sharing):
+```ocaml
+let base = pair (atom_int 0) (atom_int 0) in
+pair base (pair base (atom_int 1))
+```
+
+**After** (no sharing):
+```ocaml
+pair
+ (pair (atom_int 0) (atom_int 0))
+ (pair
+ (pair (atom_int 0) (atom_int 0))
+ (atom_int 1))
+```
+
+Both create structurally equal `[0 0]` cells, but the second version creates them as separate objects.
+
+## Summary
+
+1. **Jam uses backrefs** to avoid encoding the same noun multiple times
+2. **Vere uses structural equality** (`u3r_sing`) to detect duplicates
+3. **Size optimization**: Vere only uses backrefs if they're smaller than re-encoding
+4. **OCaml variable reuse** creates true shared structure (same object)
+5. **C function calls** create fresh allocations (different objects with same value)
+6. **Our fix**: Use structural equality + size optimization + avoid variable reuse in tests
+
+The key insight: Jam is **correct** when it detects shared structure. Our tests just needed to match Vere's allocation patterns (fresh allocations, no sharing).