summaryrefslogtreecommitdiff
path: root/ocaml/JAM_BACKREFS.md
blob: fbb82697f18f7647df7e36017b2a10b4bc41edb7 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
# Jam Serialization and Backreferences

## What is Jam?

Jam is Urbit's serialization format for nouns. It encodes a noun into a bitstream that can be later decoded with `cue`. The format uses length-prefixing and supports **backreferences** to handle shared structure efficiently.

## Basic Jam Encoding

For each noun, jam writes:
- **Atoms**: `0` bit followed by mat-encoded value
- **Cells**: `01` bits followed by jam-encoded head and tail
- **Backreferences**: `11` bits followed by mat-encoded bit position

## What are Backreferences?

A backreference is a pointer to a previously-encoded noun at a specific bit position in the output stream. Instead of encoding the same noun multiple times, jam can say "this noun is the same as the one at position X".

### Example

Consider jamming `[[1 2] [1 2]]` where both cells are the **same object** in memory:

```ocaml
let cell = pair (atom_int 1) (atom_int 2) in
pair cell cell
```

**Without backreferences**, you'd encode:
```
01                    (* outer cell *)
  01 0<1> 0<2>       (* first [1 2] *)
  01 0<1> 0<2>       (* second [1 2] - duplicate! *)
```

**With backreferences**, jam encodes:
```
01                    (* outer cell *)
  01 0<1> 0<2>       (* first [1 2] at position 2 *)
  11 <2>              (* backref to position 2 *)
```

The backref is smaller!

## Physical vs Structural Equality

This is where things get interesting. Jam needs to decide: "Have I seen this noun before?"

### Vere's Approach

Vere uses `u3r_sing()` which does:
1. **Fast path**: Check physical equality `(a == b)` - same pointer?
2. **Slow path**: If not same pointer, do deep structural comparison

This means Vere detects BOTH:
- Shared structure (same object used twice)
- Duplicate values (different objects with same value)

### Initial OCaml Bug

Our first implementation used **physical equality only**:
```ocaml
module NounTbl = Hashtbl.Make (struct
  type t = noun
  let equal = (==)    (* BUG: physical equality only! *)
  let hash = Hashtbl.hash
end)
```

This caused problems because:
- In C: `pair (atom_int 0) (atom_int 0)` might reuse the SAME `0` object (if it's cached)
- Our hashtable would see the same `0` twice and create a backref
- But the C version creates backrefs based on structural equality too

### The Fix

Change to structural equality:
```ocaml
let equal = Noun.equal    (* structural equality *)
```

Now we match Vere's behavior for detecting duplicates.

## The Size Optimization

But wait! If we always use backrefs for duplicate atoms, why doesn't `[0 0]` use a backref?

The answer: **Vere compares the size** of encoding the atom vs the backref!

From `vere/pkg/noun/serial.c:157-169`:
```c
if ( a_w <= b_w ) {
  // if atom is smaller than backref, encode atom directly
  _cs_jam_fib_chop(fib_u, 1, 0);
  _cs_jam_fib_mat(fib_u, a);
}
else {
  // otherwise, encode backref
  _cs_jam_fib_chop(fib_u, 2, 3);
  _cs_jam_fib_mat(fib_u, b);
}
```

### Example: `[0 0]`

Encoding atom `0`:
- `0` bit (atom tag)
- `1` bit (mat encoding of 0)
- **Total: 2 bits**

Encoding backref to position 2:
- `11` bits (backref tag)
- mat encoding of `2` (which is `01` `10` = 4 bits)
- **Total: 6 bits**

**Atom is smaller!** So jam encodes the atom directly even though it's a duplicate.

## Why Variable Reuse Matters in OCaml

Consider this OCaml code:
```ocaml
let base = pair (atom_int 0) (atom_int 0) in
pair base (pair base (atom_int 1))
```

vs this:
```ocaml
pair
  (pair (atom_int 0) (atom_int 0))
  (pair
    (pair (atom_int 0) (atom_int 0))
    (atom_int 1))
```

### First version (reusing `base`):
- Both occurrences of `base` point to the **same object** in memory
- Jam's hashtable (using structural equality) will find the duplicate
- It will create a backref if the backref is smaller than re-encoding

### Second version (no reuse):
- Each `pair (atom_int 0) (atom_int 0)` creates a **fresh object**
- They're different objects, even though structurally equal
- Jam's hashtable compares them with `Noun.equal`
- They're equal, so jam still considers them duplicates!

Wait, so both should produce backrefs?

### The Subtlety

Actually, the issue is more nuanced. When you write:
```ocaml
let base = pair (atom_int 0) (atom_int 0) in
```

The OCaml runtime creates ONE object. Using `base` twice means the SAME object appears in two places in the structure. This is **true sharing**.

But in C:
```c
u3nc(u3nc(0, 0), u3nc(u3nc(0, 0), ...))
```

Each `u3nc(0, 0)` call allocates a FRESH cell, even if the values are the same. There's no sharing unless you explicitly save and reuse the pointer.

## The Test Case Fix

The Vere tests are written to create fresh allocations with no sharing. To match this in OCaml, we needed to inline everything without variable reuse:

**Before** (has sharing):
```ocaml
let base = pair (atom_int 0) (atom_int 0) in
pair base (pair base (atom_int 1))
```

**After** (no sharing):
```ocaml
pair
  (pair (atom_int 0) (atom_int 0))
  (pair
    (pair (atom_int 0) (atom_int 0))
    (atom_int 1))
```

Both create structurally equal `[0 0]` cells, but the second version creates them as separate objects.

## Summary

1. **Jam uses backrefs** to avoid encoding the same noun multiple times
2. **Vere uses structural equality** (`u3r_sing`) to detect duplicates
3. **Size optimization**: Vere only uses backrefs if they're smaller than re-encoding
4. **OCaml variable reuse** creates true shared structure (same object)
5. **C function calls** create fresh allocations (different objects with same value)
6. **Our fix**: Use structural equality + size optimization + avoid variable reuse in tests

The key insight: Jam is **correct** when it detects shared structure. Our tests just needed to match Vere's allocation patterns (fresh allocations, no sharing).