# Jam Serialization and Backreferences

## What is Jam?

Jam is Urbit's serialization format for nouns. It encodes a noun into a bitstream that can be later decoded with `cue`. The format uses length-prefixing and supports **backreferences** to handle shared structure efficiently.

## Basic Jam Encoding

For each noun, jam writes:
- **Atoms**: `0` bit followed by mat-encoded value
- **Cells**: `01` bits followed by jam-encoded head and tail
- **Backreferences**: `11` bits followed by mat-encoded bit position

## What are Backreferences?

A backreference is a pointer to a previously-encoded noun at a specific bit position in the output stream. Instead of encoding the same noun multiple times, jam can say "this noun is the same as the one at position X".

### Example

Consider jamming `[[1 2] [1 2]]` where both cells are the **same object** in memory:

```ocaml
let cell = pair (atom_int 1) (atom_int 2) in
pair cell cell
```

**Without backreferences**, you'd encode:
```
01                    (* outer cell *)
  01 0<1> 0<2>       (* first [1 2] *)
  01 0<1> 0<2>       (* second [1 2] - duplicate! *)
```

**With backreferences**, jam encodes:
```
01                    (* outer cell *)
  01 0<1> 0<2>       (* first [1 2] at position 2 *)
  11 <2>              (* backref to position 2 *)
```

The backref is smaller!

## Physical vs Structural Equality

This is where things get interesting. Jam needs to decide: "Have I seen this noun before?"

### Vere's Approach

Vere uses `u3r_sing()` which does:
1. **Fast path**: Check physical equality `(a == b)` - same pointer?
2. **Slow path**: If not same pointer, do deep structural comparison

This means Vere detects BOTH:
- Shared structure (same object used twice)
- Duplicate values (different objects with same value)

### Initial OCaml Bug

Our first implementation used **physical equality only**:
```ocaml
module NounTbl = Hashtbl.Make (struct
  type t = noun
  let equal = (==)    (* BUG: physical equality only! *)
  let hash = Hashtbl.hash
end)
```

This caused problems because:
- In C: `pair (atom_int 0) (atom_int 0)` might reuse the SAME `0` object (if it's cached)
- Our hashtable would see the same `0` twice and create a backref
- But the C version creates backrefs based on structural equality too

### The Fix

Change to structural equality:
```ocaml
let equal = Noun.equal    (* structural equality *)
```

Now we match Vere's behavior for detecting duplicates.

## The Size Optimization

But wait! If we always use backrefs for duplicate atoms, why doesn't `[0 0]` use a backref?

The answer: **Vere compares the size** of encoding the atom vs the backref!

From `vere/pkg/noun/serial.c:157-169`:
```c
if ( a_w <= b_w ) {
  // if atom is smaller than backref, encode atom directly
  _cs_jam_fib_chop(fib_u, 1, 0);
  _cs_jam_fib_mat(fib_u, a);
}
else {
  // otherwise, encode backref
  _cs_jam_fib_chop(fib_u, 2, 3);
  _cs_jam_fib_mat(fib_u, b);
}
```

### Example: `[0 0]`

Encoding atom `0`:
- `0` bit (atom tag)
- `1` bit (mat encoding of 0)
- **Total: 2 bits**

Encoding backref to position 2:
- `11` bits (backref tag)
- mat encoding of `2` (which is `01` `10` = 4 bits)
- **Total: 6 bits**

**Atom is smaller!** So jam encodes the atom directly even though it's a duplicate.

## Why Variable Reuse Matters in OCaml

Consider this OCaml code:
```ocaml
let base = pair (atom_int 0) (atom_int 0) in
pair base (pair base (atom_int 1))
```

vs this:
```ocaml
pair
  (pair (atom_int 0) (atom_int 0))
  (pair
    (pair (atom_int 0) (atom_int 0))
    (atom_int 1))
```

### First version (reusing `base`):
- Both occurrences of `base` point to the **same object** in memory
- Jam's hashtable (using structural equality) will find the duplicate
- It will create a backref if the backref is smaller than re-encoding

### Second version (no reuse):
- Each `pair (atom_int 0) (atom_int 0)` creates a **fresh object**
- They're different objects, even though structurally equal
- Jam's hashtable compares them with `Noun.equal`
- They're equal, so jam still considers them duplicates!

Wait, so both should produce backrefs?

### The Subtlety

Actually, the issue is more nuanced. When you write:
```ocaml
let base = pair (atom_int 0) (atom_int 0) in
```

The OCaml runtime creates ONE object. Using `base` twice means the SAME object appears in two places in the structure. This is **true sharing**.

But in C:
```c
u3nc(u3nc(0, 0), u3nc(u3nc(0, 0), ...))
```

Each `u3nc(0, 0)` call allocates a FRESH cell, even if the values are the same. There's no sharing unless you explicitly save and reuse the pointer.

## The Test Case Fix

The Vere tests are written to create fresh allocations with no sharing. To match this in OCaml, we needed to inline everything without variable reuse:

**Before** (has sharing):
```ocaml
let base = pair (atom_int 0) (atom_int 0) in
pair base (pair base (atom_int 1))
```

**After** (no sharing):
```ocaml
pair
  (pair (atom_int 0) (atom_int 0))
  (pair
    (pair (atom_int 0) (atom_int 0))
    (atom_int 1))
```

Both create structurally equal `[0 0]` cells, but the second version creates them as separate objects.

## Summary

1. **Jam uses backrefs** to avoid encoding the same noun multiple times
2. **Vere uses structural equality** (`u3r_sing`) to detect duplicates
3. **Size optimization**: Vere only uses backrefs if they're smaller than re-encoding
4. **OCaml variable reuse** creates true shared structure (same object)
5. **C function calls** create fresh allocations (different objects with same value)
6. **Our fix**: Use structural equality + size optimization + avoid variable reuse in tests

The key insight: Jam is **correct** when it detects shared structure. Our tests just needed to match Vere's allocation patterns (fresh allocations, no sharing).