diff options
Diffstat (limited to 'ocaml/JAM_BACKREFS.md')
| -rw-r--r-- | ocaml/JAM_BACKREFS.md | 192 |
1 files changed, 192 insertions, 0 deletions
diff --git a/ocaml/JAM_BACKREFS.md b/ocaml/JAM_BACKREFS.md new file mode 100644 index 0000000..fbb8269 --- /dev/null +++ b/ocaml/JAM_BACKREFS.md @@ -0,0 +1,192 @@ +# Jam Serialization and Backreferences + +## What is Jam? + +Jam is Urbit's serialization format for nouns. It encodes a noun into a bitstream that can be later decoded with `cue`. The format uses length-prefixing and supports **backreferences** to handle shared structure efficiently. + +## Basic Jam Encoding + +For each noun, jam writes: +- **Atoms**: `0` bit followed by mat-encoded value +- **Cells**: `01` bits followed by jam-encoded head and tail +- **Backreferences**: `11` bits followed by mat-encoded bit position + +## What are Backreferences? + +A backreference is a pointer to a previously-encoded noun at a specific bit position in the output stream. Instead of encoding the same noun multiple times, jam can say "this noun is the same as the one at position X". + +### Example + +Consider jamming `[[1 2] [1 2]]` where both cells are the **same object** in memory: + +```ocaml +let cell = pair (atom_int 1) (atom_int 2) in +pair cell cell +``` + +**Without backreferences**, you'd encode: +``` +01 (* outer cell *) + 01 0<1> 0<2> (* first [1 2] *) + 01 0<1> 0<2> (* second [1 2] - duplicate! *) +``` + +**With backreferences**, jam encodes: +``` +01 (* outer cell *) + 01 0<1> 0<2> (* first [1 2] at position 2 *) + 11 <2> (* backref to position 2 *) +``` + +The backref is smaller! + +## Physical vs Structural Equality + +This is where things get interesting. Jam needs to decide: "Have I seen this noun before?" + +### Vere's Approach + +Vere uses `u3r_sing()` which does: +1. **Fast path**: Check physical equality `(a == b)` - same pointer? +2. **Slow path**: If not same pointer, do deep structural comparison + +This means Vere detects BOTH: +- Shared structure (same object used twice) +- Duplicate values (different objects with same value) + +### Initial OCaml Bug + +Our first implementation used **physical equality only**: +```ocaml +module NounTbl = Hashtbl.Make (struct + type t = noun + let equal = (==) (* BUG: physical equality only! *) + let hash = Hashtbl.hash +end) +``` + +This caused problems because: +- In C: `pair (atom_int 0) (atom_int 0)` might reuse the SAME `0` object (if it's cached) +- Our hashtable would see the same `0` twice and create a backref +- But the C version creates backrefs based on structural equality too + +### The Fix + +Change to structural equality: +```ocaml +let equal = Noun.equal (* structural equality *) +``` + +Now we match Vere's behavior for detecting duplicates. + +## The Size Optimization + +But wait! If we always use backrefs for duplicate atoms, why doesn't `[0 0]` use a backref? + +The answer: **Vere compares the size** of encoding the atom vs the backref! + +From `vere/pkg/noun/serial.c:157-169`: +```c +if ( a_w <= b_w ) { + // if atom is smaller than backref, encode atom directly + _cs_jam_fib_chop(fib_u, 1, 0); + _cs_jam_fib_mat(fib_u, a); +} +else { + // otherwise, encode backref + _cs_jam_fib_chop(fib_u, 2, 3); + _cs_jam_fib_mat(fib_u, b); +} +``` + +### Example: `[0 0]` + +Encoding atom `0`: +- `0` bit (atom tag) +- `1` bit (mat encoding of 0) +- **Total: 2 bits** + +Encoding backref to position 2: +- `11` bits (backref tag) +- mat encoding of `2` (which is `01` `10` = 4 bits) +- **Total: 6 bits** + +**Atom is smaller!** So jam encodes the atom directly even though it's a duplicate. + +## Why Variable Reuse Matters in OCaml + +Consider this OCaml code: +```ocaml +let base = pair (atom_int 0) (atom_int 0) in +pair base (pair base (atom_int 1)) +``` + +vs this: +```ocaml +pair + (pair (atom_int 0) (atom_int 0)) + (pair + (pair (atom_int 0) (atom_int 0)) + (atom_int 1)) +``` + +### First version (reusing `base`): +- Both occurrences of `base` point to the **same object** in memory +- Jam's hashtable (using structural equality) will find the duplicate +- It will create a backref if the backref is smaller than re-encoding + +### Second version (no reuse): +- Each `pair (atom_int 0) (atom_int 0)` creates a **fresh object** +- They're different objects, even though structurally equal +- Jam's hashtable compares them with `Noun.equal` +- They're equal, so jam still considers them duplicates! + +Wait, so both should produce backrefs? + +### The Subtlety + +Actually, the issue is more nuanced. When you write: +```ocaml +let base = pair (atom_int 0) (atom_int 0) in +``` + +The OCaml runtime creates ONE object. Using `base` twice means the SAME object appears in two places in the structure. This is **true sharing**. + +But in C: +```c +u3nc(u3nc(0, 0), u3nc(u3nc(0, 0), ...)) +``` + +Each `u3nc(0, 0)` call allocates a FRESH cell, even if the values are the same. There's no sharing unless you explicitly save and reuse the pointer. + +## The Test Case Fix + +The Vere tests are written to create fresh allocations with no sharing. To match this in OCaml, we needed to inline everything without variable reuse: + +**Before** (has sharing): +```ocaml +let base = pair (atom_int 0) (atom_int 0) in +pair base (pair base (atom_int 1)) +``` + +**After** (no sharing): +```ocaml +pair + (pair (atom_int 0) (atom_int 0)) + (pair + (pair (atom_int 0) (atom_int 0)) + (atom_int 1)) +``` + +Both create structurally equal `[0 0]` cells, but the second version creates them as separate objects. + +## Summary + +1. **Jam uses backrefs** to avoid encoding the same noun multiple times +2. **Vere uses structural equality** (`u3r_sing`) to detect duplicates +3. **Size optimization**: Vere only uses backrefs if they're smaller than re-encoding +4. **OCaml variable reuse** creates true shared structure (same object) +5. **C function calls** create fresh allocations (different objects with same value) +6. **Our fix**: Use structural equality + size optimization + avoid variable reuse in tests + +The key insight: Jam is **correct** when it detects shared structure. Our tests just needed to match Vere's allocation patterns (fresh allocations, no sharing). |
