fix(search): use global stats for shard-independent text scoring #7250
vyavdoshenko wants to merge 1 commit into main
Pull request overview
This PR fixes shard-count-dependent ranking for text scorers (BM25STD/TFIDF/TFIDF.DOCNORM) by switching multi-shard queries to use merged, cluster-wide scoring statistics (global IDF + avgdl), ensuring consistent top-K ordering and scores regardless of --proactor_threads.
Changes:
- Add a DFS-style two-phase flow for multi-shard scoring queries: collect per-shard term/field stats, merge into GlobalScoringStats, then score using the merged stats.
- Add deterministic tie-breaking for score-based ordering across shards (by (score, key)), including a hidden __key tie-breaker for FT.AGGREGATE sorting.
- Add Python integration tests that assert identical top-K ordering/scores across 1-shard vs 4-shard setups for FT.SEARCH and FT.AGGREGATE.
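The stats merge in the first bullet can be illustrated with a minimal Python sketch. It is not the PR's C++ code: the `ShardStats` class, the `collect`/`merge` helpers, the whitespace tokenizer, and the IDF formula (a common BM25 variant) are all assumptions made for illustration. It only shows that summing per-shard counts reproduces the single-shard statistics.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ShardStats:
    """Hypothetical per-shard stats: doc count, total length, per-term doc frequency."""
    num_docs: int = 0
    total_docs_len: int = 0
    term_docs: Counter = field(default_factory=Counter)


def collect(docs):
    """Phase 1: gather stats for the documents held by one shard."""
    stats = ShardStats()
    for text in docs:
        tokens = text.split()  # naive whitespace tokenizer (assumption)
        stats.num_docs += 1
        stats.total_docs_len += len(tokens)
        stats.term_docs.update(set(tokens))  # count each term once per document
    return stats


def merge(shards):
    """Sum per-shard stats into one global view."""
    merged = ShardStats()
    for s in shards:
        merged.num_docs += s.num_docs
        merged.total_docs_len += s.total_docs_len
        merged.term_docs.update(s.term_docs)  # Counter.update adds counts
    return merged


def avgdl(stats):
    """Average document length; 0.0 for an empty index."""
    return stats.total_docs_len / stats.num_docs if stats.num_docs else 0.0


def idf(stats, term):
    """A common BM25 IDF variant; the formula used in the PR may differ."""
    n = stats.term_docs[term]
    return math.log(1.0 + (stats.num_docs - n + 0.5) / (n + 0.5))


docs = ["red apple", "red red car", "green apple pie", "blue car"]
single = collect(docs)                                   # one shard holds everything
sharded = merge([collect(docs[:1]), collect(docs[1:])])  # two shards, then merged
```

Here `single == sharded` holds, so any score derived from the merged stats matches the single-shard score, while `idf(collect(docs[:1]), "red")` differs from the global value, which is the kind of drift the PR describes.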
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/dragonfly/search_test.py | Adds shard-independence regressions for FT.SEARCH/FT.AGGREGATE scorers and a SORTBY/WITHSCORES alignment test. |
| src/server/search/search_family.cc | Implements the phase-1 stats collection + merge, passes global stats into per-shard searches, and re-sorts merged results by (score,key) for correct global LIMIT behavior. |
| src/server/search/doc_index.h | Extends per-shard search API to accept optional GlobalScoringStats and adds CollectScoringStats. |
| src/server/search/doc_index.cc | Wires global stats into search execution; adds per-shard re-rank by (score,key) and injects hidden __key for aggregate determinism. |
| src/server/search/aggregator.cc | Adds hidden __key tie-breaker in sort comparator to make cross-shard ordering deterministic when explicit sort fields tie. |
| src/core/search/search.h | Extends SearchAlgorithm::Search to accept optional global stats and adds CollectScoringStats. |
| src/core/search/search.cc | Uses global stats for term_docs/avgdl in scoring; adds AST-based StatsCollector to compute per-shard scoring stats; caches posting lists for matched terms. |
| src/core/search/scoring.h / src/core/search/scoring.cc | Introduces ShardScoringStats/GlobalScoringStats and merge/query helpers for global IDF/avgdl inputs. |
| src/core/search/indices.h | Adds per-field total-doc-length accessor and stores schema field identifier on text indices for canonical stat keys. |
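The (score, key) re-sort described for search_family.cc can be sketched in Python as follows. The function name `merge_top_k`, the `(key, score)` tuple layout, and the ascending direction of the key tie-break are assumptions for illustration; the source only states that ordering is by (score, key).

```python
def merge_top_k(shard_results, limit):
    """Merge per-shard (key, score) hits and apply a global LIMIT.

    Orders by score descending, then key ascending, so ties do not
    depend on the order in which shards reply.
    """
    merged = [hit for hits in shard_results for hit in hits]
    merged.sort(key=lambda hit: (-hit[1], hit[0]))
    return merged[:limit]


shard_a = [("doc:3", 1.5), ("doc:1", 0.9)]
shard_b = [("doc:2", 1.5), ("doc:4", 0.2)]
top = merge_top_k([shard_a, shard_b], limit=3)
```

Swapping the shard order in the call yields the same `top`, which is the property the tie-breaker is meant to guarantee.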
🤖 Augment PR Summary
Summary: Makes BM25STD/TFIDF/TFIDF.DOCNORM rankings independent of shard count by scoring with cluster-wide (merged) IDF and avgdl instead of per-shard local stats.
Technical Notes: Global-stats collection is skipped for single-shard and vector-only (HNSW) queries; hybrid KNN prefilter queries still use the global text stats for scoring.
BM25STD/TFIDF/TFIDF.DOCNORM rankings shifted with --proactor_threads because each shard scored docs using its local IDF and avgdl. With the same dataset, top-K and scores diverged across shard counts. This PR makes scoring use stats merged across all shards so ranking is shard-count-independent.

Approach:
DFS-style two-phase query:
- Phase 1: each shard collects its term/field scoring stats, including total_docs_len/num_docs.
- Phase 2: the per-shard stats are merged into GlobalScoringStats (single instance, all shards), and every shard scores with the merged stats.

Single-shard skips phase 1 (local == global). KNN/HNSW skips it (vector distance, not text scoring).
jaccard@10 identical across shard counts confirms shard-independence.
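For reference, jaccard@10 over two ranked key lists can be computed as in this Python sketch. The function name and the treatment of two empty lists as identical are assumptions, not taken from the PR's test code.

```python
def jaccard_at_k(ranked_a, ranked_b, k=10):
    """Jaccard similarity of the top-k key sets of two ranked result lists."""
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    union = top_a | top_b
    # Two empty result sets are treated as identical (assumption).
    return len(top_a & top_b) / len(union) if union else 1.0


one_shard = ["doc:2", "doc:3", "doc:1"]
four_shards = ["doc:2", "doc:3", "doc:1"]
similarity = jaccard_at_k(one_shard, four_shards)
```

A value of 1.0 means both setups return the same top-10 set. Jaccard ignores order within the top-k, so it complements rather than replaces a direct comparison of ordering and scores.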