# Jam Serialization and Backreferences ## What is Jam? Jam is Urbit's serialization format for nouns. It encodes a noun into a bitstream that can be later decoded with `cue`. The format uses length-prefixing and supports **backreferences** to handle shared structure efficiently. ## Basic Jam Encoding For each noun, jam writes: - **Atoms**: `0` bit followed by mat-encoded value - **Cells**: `01` bits followed by jam-encoded head and tail - **Backreferences**: `11` bits followed by mat-encoded bit position ## What are Backreferences? A backreference is a pointer to a previously-encoded noun at a specific bit position in the output stream. Instead of encoding the same noun multiple times, jam can say "this noun is the same as the one at position X". ### Example Consider jamming `[[1 2] [1 2]]` where both cells are the **same object** in memory: ```ocaml let cell = pair (atom_int 1) (atom_int 2) in pair cell cell ``` **Without backreferences**, you'd encode: ``` 01 (* outer cell *) 01 0<1> 0<2> (* first [1 2] *) 01 0<1> 0<2> (* second [1 2] - duplicate! *) ``` **With backreferences**, jam encodes: ``` 01 (* outer cell *) 01 0<1> 0<2> (* first [1 2] at position 2 *) 11 <2> (* backref to position 2 *) ``` The backref is smaller! ## Physical vs Structural Equality This is where things get interesting. Jam needs to decide: "Have I seen this noun before?" ### Vere's Approach Vere uses `u3r_sing()` which does: 1. **Fast path**: Check physical equality `(a == b)` - same pointer? 2. **Slow path**: If not same pointer, do deep structural comparison This means Vere detects BOTH: - Shared structure (same object used twice) - Duplicate values (different objects with same value) ### Initial OCaml Bug Our first implementation used **physical equality only**: ```ocaml module NounTbl = Hashtbl.Make (struct type t = noun let equal = (==) (* BUG: physical equality only! *) let hash = Hashtbl.hash end) ``` This caused problems because: - In C: `pair (atom_int 0) (atom_int 0)` might reuse the SAME `0` object (if it's cached) - Our hashtable would see the same `0` twice and create a backref - But the C version creates backrefs based on structural equality too ### The Fix Change to structural equality: ```ocaml let equal = Noun.equal (* structural equality *) ``` Now we match Vere's behavior for detecting duplicates. ## The Size Optimization But wait! If we always use backrefs for duplicate atoms, why doesn't `[0 0]` use a backref? The answer: **Vere compares the size** of encoding the atom vs the backref! From `vere/pkg/noun/serial.c:157-169`: ```c if ( a_w <= b_w ) { // if atom is smaller than backref, encode atom directly _cs_jam_fib_chop(fib_u, 1, 0); _cs_jam_fib_mat(fib_u, a); } else { // otherwise, encode backref _cs_jam_fib_chop(fib_u, 2, 3); _cs_jam_fib_mat(fib_u, b); } ``` ### Example: `[0 0]` Encoding atom `0`: - `0` bit (atom tag) - `1` bit (mat encoding of 0) - **Total: 2 bits** Encoding backref to position 2: - `11` bits (backref tag) - mat encoding of `2` (which is `01` `10` = 4 bits) - **Total: 6 bits** **Atom is smaller!** So jam encodes the atom directly even though it's a duplicate. ## Why Variable Reuse Matters in OCaml Consider this OCaml code: ```ocaml let base = pair (atom_int 0) (atom_int 0) in pair base (pair base (atom_int 1)) ``` vs this: ```ocaml pair (pair (atom_int 0) (atom_int 0)) (pair (pair (atom_int 0) (atom_int 0)) (atom_int 1)) ``` ### First version (reusing `base`): - Both occurrences of `base` point to the **same object** in memory - Jam's hashtable (using structural equality) will find the duplicate - It will create a backref if the backref is smaller than re-encoding ### Second version (no reuse): - Each `pair (atom_int 0) (atom_int 0)` creates a **fresh object** - They're different objects, even though structurally equal - Jam's hashtable compares them with `Noun.equal` - They're equal, so jam still considers them duplicates! Wait, so both should produce backrefs? ### The Subtlety Actually, the issue is more nuanced. When you write: ```ocaml let base = pair (atom_int 0) (atom_int 0) in ``` The OCaml runtime creates ONE object. Using `base` twice means the SAME object appears in two places in the structure. This is **true sharing**. But in C: ```c u3nc(u3nc(0, 0), u3nc(u3nc(0, 0), ...)) ``` Each `u3nc(0, 0)` call allocates a FRESH cell, even if the values are the same. There's no sharing unless you explicitly save and reuse the pointer. ## The Test Case Fix The Vere tests are written to create fresh allocations with no sharing. To match this in OCaml, we needed to inline everything without variable reuse: **Before** (has sharing): ```ocaml let base = pair (atom_int 0) (atom_int 0) in pair base (pair base (atom_int 1)) ``` **After** (no sharing): ```ocaml pair (pair (atom_int 0) (atom_int 0)) (pair (pair (atom_int 0) (atom_int 0)) (atom_int 1)) ``` Both create structurally equal `[0 0]` cells, but the second version creates them as separate objects. ## Summary 1. **Jam uses backrefs** to avoid encoding the same noun multiple times 2. **Vere uses structural equality** (`u3r_sing`) to detect duplicates 3. **Size optimization**: Vere only uses backrefs if they're smaller than re-encoding 4. **OCaml variable reuse** creates true shared structure (same object) 5. **C function calls** create fresh allocations (different objects with same value) 6. **Our fix**: Use structural equality + size optimization + avoid variable reuse in tests The key insight: Jam is **correct** when it detects shared structure. Our tests just needed to match Vere's allocation patterns (fresh allocations, no sharing).