author | polwex <polwex@sortug.com> | 2025-09-23 03:50:53 +0700
committer | polwex <polwex@sortug.com> | 2025-09-23 03:50:53 +0700
commit | 57aaafdb137fe49930711f6ed5ccc83b3a119cd2 (patch)
tree | 1a7556927bed94377630d33dd29c3bf07d159619 /knobs.md
init
Diffstat (limited to 'knobs.md')
-rw-r--r-- | knobs.md | 118
1 file changed, 118 insertions, 0 deletions
diff --git a/knobs.md b/knobs.md
new file mode 100644
index 0000000..0199714
--- /dev/null
+++ b/knobs.md
@@ -0,0 +1,118 @@

Got it. Let's break down all the knobs in this hybrid + rerank (+ optional MMR) pipeline and what they control. Think of them as dials on your retrieval engine; how you set them shifts the tradeoff between recall, precision, speed, and diversity.

## 1. k_vec — how many nearest-neighbor (embedding) hits to pull

- What it does: the number of top candidates retrieved by cosine similarity from the vector index (vec0).
- Low values (10–20): very fast, but risky — if the embedding missed some relevant chunks, they'll never make it into the rerank.
- High values (50–100+): better recall, since the CE reranker has a wider pool to choose from. Slower, but vector search is cheap.
- Typical sweet spot: 40–80. Vector ANN is blazing fast, so it's safe to overshoot.

## 2. k_bm25 — how many lexical hits to pull (FTS5)

- What it does: the number of chunks fetched with BM25 scoring (keyword-based search).
- Why it's needed: embeddings sometimes fuzz things too much; BM25 catches exact matches, rare names, and technical jargon.
- Low values (10–20): cheap, but may miss keyword-rich relevant hits.
- High values (50–100+): good for "needle in a haystack" terms, but can pull in lots of noise.
- Typical sweet spot: 30–60. Balances recall against noise.

(A sketch of both retrieval queries appears after knob 7.)

## 3. Merging strategy (vec + bm25)

- Current code: concatenates vector hits, then BM25 hits, deduplicates, and passes the result to the CE.
- Effect: vector has slight priority, but BM25 ensures coverage.
- Alternative: interleave or do a weighted merge (a future upgrade if you want it); see the merge sketch after knob 7.

## 4. k_ce — how many merged candidates to rerank

- What it does: the size of the candidate pool fed into the CrossEncoder.
- Why it matters: the CE is expensive — each (query, doc) pair is a transformer forward pass.
- Low (10–20): very fast, but can miss gems that were just outside the cutoff.
- High (50–100): the CE sees more context and has a better chance of surfacing the true top chunks, but is slower (linear in k_ce).

Ballpark costs:

- bge-reranker-base on GPU: ~2 ms per pair.
- k_ce=30 → ~60 ms.
- k_ce=100 → ~200 ms.

Typical sweet spot: 20–50. Enough diversity without killing latency. (See the rerank sketch after knob 7.)

## 5. k_final — how many chunks you actually keep

- What it does: the final number of chunks returned for context injection or answering.
- Low (3–5): compact context, but maybe too narrow for complex queries.
- High (15–20): more coverage, but it can bloat your prompt and confuse the LLM.
- Typical sweet spot: 8–12. Enough context richness, and it still fits easily in a 4k–8k token window.

## 6. use_mmr — toggle for Maximal Marginal Relevance

- What it does: applies MMR to the CE top N (e.g. 30) before picking the final K.
- Why: rerankers often cluster — you'll get five almost-identical chunks from one section. MMR diversifies.
- Cost: you need vectors for those CE top candidates (either re-embed on the fly or store them in the DB). Cheap compared to the CE.
- When to turn it on: long documents where redundancy is high (e.g. laws, academic papers, transcripts).
- When to skip it: short docs, or when you want maximum precision and don't care about duplicates.

## 7. mmr_lambda — relevance vs. diversity balance

- Range: 0 → pure diversity, 1 → pure relevance.
- Typical settings:
  - 0.6 → favors relevance but still kicks out duplicates.
  - 0.7–0.8 → more focused, with just enough diversity.
  - 0.4–0.5 → exploratory search; less focused but broad coverage.
- Use case: if the CE is already precise, set it to 0.7+. If your doc is redundant, drop closer to 0.5. (See the MMR sketch below.)
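To make knobs 1 and 2 concrete, here is a minimal sketch of the two retrieval arms. It assumes a sqlite-vec vec0 table named `vec_chunks` (created with `distance_metric=cosine`) and an FTS5 table named `chunks_fts` sharing the same rowids; the table names and the `embed()` stub are illustrative, not taken from the actual codebase.

```python
import sqlite3
import struct

def embed(text: str) -> list[float]:
    ...  # stand-in: call your embedding model here

def serialize_f32(vec: list[float]) -> bytes:
    # Pack the query vector as float32 bytes, the format vec0 expects.
    return struct.pack(f"{len(vec)}f", *vec)

def retrieve(db: sqlite3.Connection, query: str, k_vec: int = 60, k_bm25: int = 40):
    # Assumes the sqlite-vec extension is already loaded on this connection.
    qvec = serialize_f32(embed(query))
    # Vector arm: KNN over the vec0 index; `k = ?` caps the hits at k_vec.
    vec_hits = db.execute(
        "SELECT rowid, distance FROM vec_chunks "
        "WHERE embedding MATCH ? AND k = ? ORDER BY distance",
        (qvec, k_vec),
    ).fetchall()
    # Lexical arm: FTS5 BM25; bm25() is lower-is-better, so ascending order.
    bm25_hits = db.execute(
        "SELECT rowid, bm25(chunks_fts) AS score FROM chunks_fts "
        "WHERE chunks_fts MATCH ? ORDER BY score LIMIT ?",
        (query, k_bm25),
    ).fetchall()
    return vec_hits, bm25_hits
```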
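The merge from knob 3 is small enough to sketch in full. This is a hedged reconstruction of the "concatenate, then deduplicate" behavior, not the actual code; the RRF variant is one way to do the weighted merge mentioned as a future upgrade.

```python
def merge_candidates(vec_ids: list[int], bm25_ids: list[int]) -> list[int]:
    # Vector hits first, then BM25 hits, deduplicated by chunk id.
    # Putting the vector arm first is what gives it the slight priority.
    seen: set[int] = set()
    merged: list[int] = []
    for chunk_id in [*vec_ids, *bm25_ids]:
        if chunk_id not in seen:
            seen.add(chunk_id)
            merged.append(chunk_id)
    return merged

def rrf_merge(vec_ids: list[int], bm25_ids: list[int], k: int = 60) -> list[int]:
    # One possible weighted merge: reciprocal rank fusion. Each list votes
    # 1 / (k + rank); chunks ranked high by either arm float to the top.
    scores: dict[int, float] = {}
    for ids in (vec_ids, bm25_ids):
        for rank, chunk_id in enumerate(ids, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```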
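Knobs 4 and 5 in code, as a sketch built on sentence-transformers' CrossEncoder with the bge-reranker-base model the latency numbers refer to. The `chunk_text` lookup is a hypothetical stand-in for however chunk bodies get loaded.

```python
from sentence_transformers import CrossEncoder

ce = CrossEncoder("BAAI/bge-reranker-base")  # ~2 ms per pair on GPU

def rerank(query: str, candidate_ids: list[int], chunk_text: dict[int, str],
           k_ce: int = 40, k_final: int = 10) -> list[int]:
    pool = candidate_ids[:k_ce]      # cap how much the expensive stage sees
    pairs = [(query, chunk_text[cid]) for cid in pool]
    scores = ce.predict(pairs)       # one transformer forward pass per pair
    ranked = sorted(zip(pool, scores), key=lambda pair: pair[1], reverse=True)
    return [cid for cid, _ in ranked[:k_final]]
```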
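And a minimal MMR sketch for knobs 6 and 7, assuming unit-normalized numpy embeddings for the CE top N so that a dot product equals cosine similarity. Note how mmr_lambda=1 reduces to pure relevance and mmr_lambda=0 to pure diversity, matching the range above.

```python
import numpy as np

def mmr(query_vec: np.ndarray, cand_vecs: np.ndarray,
        mmr_lambda: float = 0.6, k_final: int = 10) -> list[int]:
    relevance = cand_vecs @ query_vec        # cosine similarity to the query
    pairwise = cand_vecs @ cand_vecs.T       # cosine similarity between chunks
    selected = [int(np.argmax(relevance))]   # seed with the most relevant chunk
    while len(selected) < min(k_final, len(cand_vecs)):
        remaining = [i for i in range(len(cand_vecs)) if i not in selected]
        # Score = lambda * relevance - (1 - lambda) * similarity to picks so far.
        scores = [
            mmr_lambda * relevance[i]
            - (1.0 - mmr_lambda) * max(pairwise[i, j] for j in selected)
            for i in remaining
        ]
        selected.append(remaining[int(np.argmax(scores))])
    return selected  # indices into the CE top N, in selection order
```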
## 8. Secondary knobs (not in your code yet, but worth considering)

- BM25 cutoff / minimum match: require a keyword overlap for lexical candidates.
- Chunk length / overlap: directly affects retriever performance. Shorter chunks = finer retrieval, but noisier. Longer chunks = richer context, but less precise.
- Normalization choice: your pipeline uses cosine (a good default). Alternative: dot product (works if the embeddings are already normalized).

## Practical example

Let's say you ask: "How did Japanese scholars engage with Shuihu zhuan?"

- If k_vec=20, k_bm25=20, k_ce=20: the merged pool holds at most 40 candidates and the CE scores only 20 of them, so it may miss the one chapter that actually describes Bakin's commentary.
- If k_vec=80, k_bm25=50, k_ce=50: the merged pool holds up to 130 candidates and the CE reranks the top 50 of them, reliably bubbling up the right passage. Latency is maybe 150 ms, but precision goes up.
- If use_mmr=True, mmr_lambda=0.6: instead of 10 chunks all from the same chapter, you get 10 chunks spread across reception, transmission, and commentary — much better for LLM summarization.

👉 So the way to think about it:

- k_vec + k_bm25 = the recall reservoir (make it generously high).
- k_ce = how much of that reservoir the expensive reranker drinks.
- k_final = how many glasses of water you hand to the LLM.
- use_mmr + mmr_lambda = whether those glasses come from one pitcher or are spread across the table.

Do you want me to also suggest default knob profiles (like "fast mode", "balanced mode", "deep recall mode") so you can flip between them depending on your use case? One possible shape is sketched below.
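If such profiles sound useful, here is one possible shape, with values drawn from the sweet spots above; the names and numbers are suggestions to tune against your own corpus, not tested defaults.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Knobs:
    k_vec: int
    k_bm25: int
    k_ce: int
    k_final: int
    use_mmr: bool
    mmr_lambda: float

PROFILES = {
    # Minimal latency: small pools, skip MMR.
    "fast": Knobs(k_vec=20, k_bm25=20, k_ce=20, k_final=5,
                  use_mmr=False, mmr_lambda=0.7),
    # The sweet spots from the sections above.
    "balanced": Knobs(k_vec=60, k_bm25=40, k_ce=40, k_final=10,
                      use_mmr=True, mmr_lambda=0.6),
    # Generous recall reservoir for needle-in-haystack queries.
    "deep_recall": Knobs(k_vec=100, k_bm25=60, k_ce=80, k_final=12,
                         use_mmr=True, mmr_lambda=0.5),
}
```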