Got it. Let’s break down all the knobs in this hybrid + rerank (+ optional MMR) pipeline and what they control. Think of them as dials on your retrieval engine; how you set them shifts the tradeoff between recall, precision, speed, and diversity.

1. k_vec — how many nearest-neighbor (embedding) hits to pull

What it does: number of top candidates retrieved by cosine similarity from the vector index (vec0).

Low values (10–20): very fast but risky, since any relevant chunk the embedding search misses never even reaches the reranker.

High values (50–100+): better recall, since the CE reranker has a wider pool to choose from. Slower, but vector search is cheap.

Typical sweet spot: 40–80. Vector ANN is blazing fast, so it’s safe to overshoot.
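
For concreteness, here is a minimal sketch of the vector leg, assuming a sqlite-vec vec0 virtual table named chunks_vec whose rowids line up with your chunks table; the table and column names are placeholders, and the exact KNN syntax (ORDER BY distance LIMIT vs. a k = ? constraint) depends on your sqlite-vec version.

```python
import sqlite3

def vector_hits(db: sqlite3.Connection, query_vec, k_vec: int = 60):
    """Top-k_vec chunks by embedding distance from the vec0 index.

    query_vec: the embedded query, serialized however your vec0 table expects
    (e.g. a JSON array string, or the packed-float blob from the sqlite-vec helpers).
    """
    rows = db.execute(
        """
        SELECT rowid, distance
        FROM chunks_vec
        WHERE embedding MATCH ?   -- KNN search against the query vector
        ORDER BY distance
        LIMIT ?
        """,
        (query_vec, k_vec),
    ).fetchall()
    return rows  # [(rowid, distance), ...], best (smallest distance) first
```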

2. k_bm25 — how many lexical hits to pull (FTS5)

What it does: number of chunks fetched with BM25 scoring (keyword-based search).

Why needed: embeddings sometimes fuzz things too much; BM25 catches exact matches, rare names, technical jargon.

Low values (10–20): cheap, but may miss keyword-rich relevant hits.

High values (50–100+): good for “needle in haystack” terms, but can pull lots of noise.

Typical sweet spot: 30–60. Balances recall with noise.
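
The lexical leg looks similar, assuming an FTS5 table named chunks_fts over the chunk text; rank is FTS5's built-in BM25 ordering (more negative = better match), and the query string is interpreted with FTS5 match syntax, so raw user input may need escaping.

```python
import sqlite3

def bm25_hits(db: sqlite3.Connection, query: str, k_bm25: int = 40):
    """Top-k_bm25 chunks by BM25 score from the FTS5 index."""
    rows = db.execute(
        """
        SELECT rowid, rank            -- rank = bm25() score; lower (more negative) is better
        FROM chunks_fts
        WHERE chunks_fts MATCH ?      -- FTS5 query syntax; escape raw user input as needed
        ORDER BY rank
        LIMIT ?
        """,
        (query, k_bm25),
    ).fetchall()
    return rows  # [(rowid, rank), ...], best first
```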

3. Merging strategy (vec+bm25)

Current code: concatenates vector hits then BM25 hits, deduplicates, passes to CE.

Effect: vector has slight priority, but BM25 ensures coverage.

Alternative: interleave or weighted merge (future upgrade if you want).
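
As a sketch, the merge described here is just a dedup over the concatenated hit lists (rowid-keyed rows, as in the sketches above, are assumed):

```python
def merge_candidates(vec_rows, bm25_rows, k_ce: int = 40):
    """Concatenate vector hits then BM25 hits, dedupe by rowid, cap at k_ce."""
    seen, merged = set(), []
    for rowid, _score in list(vec_rows) + list(bm25_rows):
        if rowid not in seen:
            seen.add(rowid)
            merged.append(rowid)
    # Vector hits keep their slight priority; k_ce (next knob) caps what the CE sees.
    return merged[:k_ce]
```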

4. k_ce — how many merged candidates to rerank

What it does: size of candidate pool fed into the CrossEncoder.

Why important: CE is expensive — each (query,doc) is a transformer forward pass.

Low (10–20): very fast, but can miss gems that were just outside the cutoff.

High (50–100): CE sees more context, better chance to surface true top chunks, but slower (linear in k_ce).

Ballpark costs:

bge-reranker-base on GPU: ~2ms per pair.

k_ce=30 → ~60ms.

k_ce=100 → ~200ms.

Typical sweet spot: 20–50. Enough diversity without killing latency.
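
A sketch of the rerank step with sentence-transformers' CrossEncoder (the model name matches the ballpark timings above; chunk_texts is assumed to be the candidate texts in merge order):

```python
from sentence_transformers import CrossEncoder

ce = CrossEncoder("BAAI/bge-reranker-base")   # one transformer forward pass per (query, doc) pair

def rerank(query: str, chunk_texts: list[str], k_final: int = 10):
    """Score every candidate against the query and keep the k_final best."""
    scores = ce.predict([(query, text) for text in chunk_texts])
    ranked = sorted(zip(chunk_texts, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k_final]   # [(text, ce_score), ...], best first
```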

5. k_final — how many chunks you actually keep

What it does: final number of chunks to return for context injection or answer.

Low (3–5): compact context, but maybe too narrow for complex queries.

High (15–20): more coverage, but can bloat your prompt and confuse the LLM.

Typical sweet spot: 8–12. Enough context richness, still fits in a 4k–8k token window easily.

6. use_mmr — toggle for Maximal Marginal Relevance

What it does: apply MMR to the CE top-N (e.g. 30) before picking the final k_final chunks.

Why: rerankers often cluster — you’ll get 5 almost-identical chunks from one section. MMR diversifies.

Cost: you need vectors for those CE top candidates (either re-embed on the fly or store in DB). Cheap compared to CE.

When to turn on: long documents where redundancy is high (e.g., laws, academic papers, transcripts).

When to skip: short docs, or if you want maximum precision and don’t care about duplicates.

7. mmr_lambda — relevance vs. diversity balance

Range: 0 → pure diversity, 1 → pure relevance.

Typical settings:

0.6 → favors relevance but still kicks out duplicates.

0.7–0.8 → more focused, just enough diversity.

0.4–0.5 → exploratory search, less focused but broad coverage.

Use case: If CE is already precise, set 0.7+. If your doc is redundant, drop closer to 0.5.
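
Here is a self-contained MMR sketch over the CE top-N, assuming you can fetch (or re-embed) a unit-normalized vector for the query and each candidate; mmr_lambda is the dial described above.

```python
import numpy as np

def mmr_select(query_vec, cand_vecs, k_final: int = 10, mmr_lambda: float = 0.6):
    """Greedy MMR: each step picks the candidate maximizing
    mmr_lambda * sim(query, d) - (1 - mmr_lambda) * max sim(d, already selected)."""
    cand_vecs = np.asarray(cand_vecs, dtype=np.float32)   # (n, dim), unit-normalized
    query_vec = np.asarray(query_vec, dtype=np.float32)   # (dim,), unit-normalized
    relevance = cand_vecs @ query_vec                      # cosine similarity to the query
    selected, remaining = [], list(range(len(cand_vecs)))
    while remaining and len(selected) < k_final:
        if not selected:
            best = max(remaining, key=lambda i: relevance[i])
        else:
            chosen = cand_vecs[selected]                   # vectors already picked
            def mmr_score(i):
                redundancy = float(np.max(chosen @ cand_vecs[i]))
                return mmr_lambda * relevance[i] - (1 - mmr_lambda) * redundancy
            best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected   # indices into the CE top-N, in MMR pick order
```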

8. Secondary knobs (not in your code yet but worth considering)

BM25 cutoff / minimum match: require a keyword overlap for lexical candidates.

Chunk length / overlap: directly affects retriever performance. Shorter chunks = finer retrieval, but noisier. Longer = richer context, but less precise (a toy chunker is sketched after this list).

Normalization choice: your pipeline uses cosine similarity (a good default). Alternative: dot product, which is equivalent to cosine (and slightly cheaper) when embeddings are already L2-normalized.
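
To make the chunk length / overlap knob concrete, a toy word-window chunker (a token-based splitter tied to your embedder's tokenizer would be more faithful, but the tradeoff is identical):

```python
def chunk_words(text: str, chunk_size: int = 250, overlap: int = 50):
    """Split text into overlapping word windows.
    Smaller chunk_size -> finer but noisier retrieval; larger -> richer but less precise."""
    words = text.split()
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break
    return chunks
```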

Practical example

Let’s say you ask: “How did Japanese scholars engage with Shuihu zhuan?”

If k_vec=20, k_bm25=20, k_ce=20: the merged pool is at most 40 candidates, but the CE only reranks 20 of them and may miss the one chapter that actually describes Bakin’s commentary.

If k_vec=80, k_bm25=50, k_ce=50: the merged pool grows to as many as 130 candidates, the CE reranks the first 50 of them, and the right passage reliably bubbles up. Latency maybe 150ms, but precision ↑.

If use_mmr=True, mmr_lambda=0.6: instead of 10 chunks all from the same chapter, you get 10 chunks spread across reception, transmission, and commentary — much better for LLM summarization.
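
The second configuration from this walkthrough, written out as one settings object (the class and field names are illustrative, mirroring the knobs above):

```python
from dataclasses import dataclass

@dataclass
class RetrievalKnobs:
    k_vec: int = 80          # recall reservoir: vector leg
    k_bm25: int = 50         # recall reservoir: lexical leg
    k_ce: int = 50           # candidates the CrossEncoder actually scores
    k_final: int = 10        # chunks handed to the LLM
    use_mmr: bool = True     # diversify the CE top-N before the final cut
    mmr_lambda: float = 0.6  # 1.0 = pure relevance, 0.0 = pure diversity

deep_recall = RetrievalKnobs()  # the "wide" setting walked through above
```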

👉 So the way to think about it:

k_vec + k_bm25 = recall reservoir (make this generously high).

k_ce = how much of that reservoir the expensive reranker drinks.

k_final = how many glasses of water you hand to the LLM.

use_mmr + mmr_lambda = whether you want those glasses from one pitcher or spread across the table.

Do you want me to also suggest default knob profiles (like “fast mode”, “balanced mode”, “deep recall mode”) so you can flip between them depending on your use-case?