author    polwex <polwex@sortug.com>    2025-09-23 03:50:53 +0700
committer polwex <polwex@sortug.com>    2025-09-23 03:50:53 +0700
commit    57aaafdb137fe49930711f6ed5ccc83b3a119cd2 (patch)
tree      1a7556927bed94377630d33dd29c3bf07d159619 /knobs.md
init
Diffstat (limited to 'knobs.md')
-rw-r--r--    knobs.md    118
1 file changed, 118 insertions, 0 deletions
diff --git a/knobs.md b/knobs.md
new file mode 100644
index 0000000..0199714
--- /dev/null
+++ b/knobs.md
@@ -0,0 +1,118 @@
+
+Got it. Let’s break down all the knobs in this hybrid + rerank (+ optional MMR) pipeline and what they control. Think of them as dials on your retrieval engine; how you set them shifts the tradeoff between recall, precision, speed, and diversity.
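+
+As a quick reference, here is a minimal sketch of these knobs gathered into one config object (Python is assumed; the names simply mirror the parameters discussed below, and the defaults are the "sweet spot" values from each section):
+
+```python
+# Hypothetical config object; names mirror the knobs discussed below.
+from dataclasses import dataclass
+
+@dataclass
+class RetrievalKnobs:
+    k_vec: int = 60          # nearest-neighbor hits pulled from the vector index
+    k_bm25: int = 40         # lexical hits pulled via FTS5/BM25
+    k_ce: int = 40           # merged candidates fed to the CrossEncoder
+    k_final: int = 10        # chunks actually handed to the LLM
+    use_mmr: bool = False    # diversify the CE top-N with MMR
+    mmr_lambda: float = 0.6  # 1.0 = pure relevance, 0.0 = pure diversity
+```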
+
+1. k_vec — how many nearest-neighbor (embedding) hits to pull
+
+What it does: number of top candidates retrieved by cosine similarity from the vector index (vec0).
+
+Low values (10–20): very fast, but risky — if the embedding missed some relevant chunks, they’ll never make it into rerank.
+
+High values (50–100+): better recall, since the CE reranker has a wider pool to choose from. Slower, but vector search is cheap.
+
+Typical sweet spot: 40–80. Vector ANN is blazing fast, so it’s safe to overshoot.
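+
+A minimal sketch of this stage, assuming a sqlite-vec vec0 virtual table named chunks_vec with an embedding column and cosine distance, plus an embed_query() helper standing in for whatever embedding model the pipeline uses (table, column, and helper names are all assumptions):
+
+```python
+import sqlite_vec  # sqlite-vec Python bindings; schema names below are assumptions
+
+def vector_candidates(db, query: str, k_vec: int = 60) -> list[tuple[int, float]]:
+    # embed_query(): assumed helper returning the query embedding as list[float].
+    qvec = sqlite_vec.serialize_float32(embed_query(query))
+    rows = db.execute(
+        """
+        SELECT chunk_id, distance
+        FROM chunks_vec
+        WHERE embedding MATCH ? AND k = ?
+        ORDER BY distance
+        """,
+        (qvec, k_vec),
+    ).fetchall()
+    return rows  # (chunk_id, distance); smaller distance = more similar
+```
+
+Overshooting k_vec mostly costs a few extra rows in this query, which is why 40–80 is a safe default.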
+
+2. k_bm25 — how many lexical hits to pull (FTS5)
+
+What it does: number of chunks fetched with BM25 scoring (keyword-based search).
+
+Why needed: embeddings sometimes fuzz things too much; BM25 catches exact matches, rare names, technical jargon.
+
+Low values (10–20): cheap, but may miss keyword-rich relevant hits.
+
+High values (50–100+): good for “needle in haystack” terms, but can pull lots of noise.
+
+Typical sweet spot: 30–60. Balances recall with noise.
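+
+The matching sketch for the lexical side, assuming an FTS5 table chunks_fts whose rowid lines up with chunk_id (again, names are assumptions). Note that FTS5's bm25() returns lower-is-better scores, so ascending order puts the best keyword matches first:
+
+```python
+def bm25_candidates(db, query: str, k_bm25: int = 40) -> list[tuple[int, float]]:
+    rows = db.execute(
+        """
+        SELECT rowid, bm25(chunks_fts) AS score
+        FROM chunks_fts
+        WHERE chunks_fts MATCH ?
+        ORDER BY score
+        LIMIT ?
+        """,
+        (query, k_bm25),
+    ).fetchall()
+    return rows  # (chunk_id, bm25_score); raw user queries may need FTS5 quoting
+```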
+
+3. Merging strategy (vec+bm25)
+
+Current code: concatenates vector hits then BM25 hits, deduplicates, passes to CE.
+
+Effect: vector has slight priority, but BM25 ensures coverage.
+
+Alternative: interleave or weighted merge (future upgrade if you want).
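+
+A sketch of the merge as currently described, so the priority order is visible: vector hits go in first, BM25 hits fill out the rest, duplicates are dropped, and the pool is capped at k_ce:
+
+```python
+def merge_candidates(vec_hits, bm25_hits, k_ce: int = 40) -> list[int]:
+    seen: set[int] = set()
+    merged: list[int] = []
+    # Vector hits first (slight priority), then BM25 for lexical coverage.
+    for chunk_id, _score in list(vec_hits) + list(bm25_hits):
+        if chunk_id not in seen:
+            seen.add(chunk_id)
+            merged.append(chunk_id)
+    return merged[:k_ce]  # cap what the expensive reranker will see
+```
+
+A weighted merge (e.g. reciprocal rank fusion) would replace the simple concatenation, but the dedup-and-cap structure stays the same.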
+
+4. k_ce — how many merged candidates to rerank
+
+What it does: size of candidate pool fed into the CrossEncoder.
+
+Why important: CE is expensive — each (query, doc) pair is a separate transformer forward pass.
+
+Low (10–20): very fast, but can miss gems that were just outside the cutoff.
+
+High (50–100): CE sees more context, better chance to surface true top chunks, but slower (linear in k_ce).
+
+Ballpark costs:
+
+bge-reranker-base on GPU: ~2ms per pair.
+
+k_ce=30 → ~60ms.
+
+k_ce=100 → ~200ms.
+
+Typical sweet spot: 20–50. Enough diversity without killing latency.
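+
+A sketch of the rerank call itself, using sentence-transformers' CrossEncoder with bge-reranker-base (the model named above); fetch_texts() is an assumed helper that loads chunk text by id:
+
+```python
+from sentence_transformers import CrossEncoder
+
+ce = CrossEncoder("BAAI/bge-reranker-base")  # ~2 ms per pair on GPU, much slower on CPU
+
+def rerank(query: str, chunk_ids: list[int], db) -> list[tuple[int, float]]:
+    texts = fetch_texts(db, chunk_ids)                  # {chunk_id: text}, assumed helper
+    pairs = [(query, texts[cid]) for cid in chunk_ids]  # one forward pass per pair
+    scores = ce.predict(pairs)
+    return sorted(zip(chunk_ids, scores), key=lambda x: x[1], reverse=True)
+```
+
+Latency scales linearly with k_ce, which is why the 20–50 range is the usual compromise.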
+
+5. k_final — how many chunks you actually keep
+
+What it does: final number of chunks to return for context injection or answer.
+
+Low (3–5): compact context, but maybe too narrow for complex queries.
+
+High (15–20): more coverage, but can bloat your prompt and confuse the LLM.
+
+Typical sweet spot: 8–12. Enough context richness, still fits in a 4k–8k token window easily.
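+
+Downstream, k_final is just a slice over the reranked (or MMR-selected) list before the chunks are stitched into the prompt; the separator and formatting here are assumptions:
+
+```python
+def build_context(ranked_chunks: list[str], k_final: int = 10) -> str:
+    kept = ranked_chunks[:k_final]   # the only place k_final is applied
+    return "\n\n---\n\n".join(kept)  # roughly k_final * avg chunk tokens of prompt budget
+```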
+
+6. use_mmr — toggle for Maximal Marginal Relevance
+
+What it does: apply MMR on the CE top-N (e.g. 30) before picking the final k_final chunks (a sketch follows the mmr_lambda knob below).
+
+Why: rerankers often cluster — you’ll get 5 almost-identical chunks from one section. MMR diversifies.
+
+Cost: you need vectors for those CE top candidates (either re-embed on the fly or store in DB). Cheap compared to CE.
+
+When to turn on: long documents where redundancy is high (e.g., laws, academic papers, transcripts).
+
+When to skip: short docs, or if you want maximum precision and don’t care about duplicates.
+
+7. mmr_lambda — relevance vs. diversity balance
+
+Range: 0 → pure diversity, 1 → pure relevance.
+
+Typical settings:
+
+0.6 → favors relevance but still kicks out duplicates.
+
+0.7–0.8 → more focused, just enough diversity.
+
+0.4–0.5 → exploratory search, less focused but broad coverage.
+
+Use case: If CE is already precise, set 0.7+. If your doc is redundant, drop closer to 0.5.
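+
+A minimal MMR sketch covering both knobs: CE scores act as the relevance term, candidate vectors (re-embedded or loaded from the DB, assumed L2-normalized) provide the redundancy term, and mmr_lambda sets the balance. Because CE scores and cosine similarities live on different scales, min-max normalizing the CE scores first is a reasonable tweak:
+
+```python
+import numpy as np
+
+def mmr_select(cand_vecs, ce_scores, k_final: int = 10, mmr_lambda: float = 0.6) -> list[int]:
+    cand_vecs = np.asarray(cand_vecs, dtype=float)  # (n, d), unit-length rows assumed
+    ce_scores = np.asarray(ce_scores, dtype=float)
+    selected: list[int] = []
+    remaining = list(range(len(ce_scores)))
+    while remaining and len(selected) < k_final:
+        best_i, best_val = remaining[0], -np.inf
+        for i in remaining:
+            # Highest similarity to anything already picked (0.0 for the first pick).
+            redundancy = max((float(cand_vecs[i] @ cand_vecs[j]) for j in selected), default=0.0)
+            val = mmr_lambda * ce_scores[i] - (1.0 - mmr_lambda) * redundancy
+            if val > best_val:
+                best_i, best_val = i, val
+        selected.append(best_i)
+        remaining.remove(best_i)
+    return selected  # indices into the CE top-N, in pick order
+```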
+
+8. Secondary knobs (not in your code yet but worth considering)
+
+BM25 cutoff / minimum match: require a keyword overlap for lexical candidates.
+
+Chunk length / overlap: directly affects retriever performance. Shorter chunks = finer retrieval, but noisier. Longer = richer context, but less precise. (A sliding-window sketch follows this list.)
+
+Normalization choice: your pipeline uses cosine (good default). Alternatives: dot-product (works if embeddings are already normalized).
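+
+For the chunk length / overlap knob, the usual implementation is a sliding window; sizes here are in characters purely for illustration (token-based splitting behaves the same way):
+
+```python
+def chunk_text(text: str, chunk_size: int = 1200, overlap: int = 200) -> list[str]:
+    step = chunk_size - overlap  # must stay positive: overlap < chunk_size
+    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
+```
+
+Halving chunk_size roughly doubles the number of chunks the retriever can distinguish between, at the cost of each hit carrying less context.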
+
+Practical example
+
+Let’s say you ask: “How did Japanese scholars engage with Shuihu zhuan?”
+
+If k_vec=20, k_bm25=20, k_ce=20: the merged pool holds at most 40 candidates and the CE only scores 20 of them, so it may miss the one chapter that actually describes Bakin’s commentary.
+
+If k_vec=80, k_bm25=50, k_ce=50: the merged pool holds up to 130 candidates, the CE reranks the top 50 after dedup, and the right passage reliably bubbles up. Latency is maybe 150ms, but precision ↑.
+
+If use_mmr=True, mmr_lambda=0.6: instead of 10 chunks all from the same chapter, you get 10 chunks spread across reception, transmission, and commentary — much better for LLM summarization.
+
+👉 So the way to think about it:
+
+k_vec + k_bm25 = recall reservoir (make this generously high).
+
+k_ce = how much of that reservoir the expensive reranker drinks.
+
+k_final = how many glasses of water you hand to the LLM.
+
+use_mmr + mmr_lambda = whether you want those glasses from one pitcher or spread across the table.
+
+Do you want me to also suggest default knob profiles (like “fast mode”, “balanced mode”, “deep recall mode”) so you can flip between them depending on your use-case?