Skip to content

fix(search): use global stats for shard-independent text scoring#7250

Open
vyavdoshenko wants to merge 1 commit intomainfrom
bobik/scroting_update
Open

fix(search): use global stats for shard-independent text scoring#7250
vyavdoshenko wants to merge 1 commit intomainfrom
bobik/scroting_update

Conversation

@vyavdoshenko
Copy link
Copy Markdown
Contributor

BM25STD/TFIDF/TFIDF.DOCNORM rankings shifted with --proactor_threads because each shard scored docs using its local IDF and avgdl. With the same dataset, top-K and scores diverged across shard counts. This PR makes scoring use stats merged across all shards so ranking is shard-count-independent.

Approach:
DFS-style two-phase query:

  1. Phase 1: each shard walks the AST and reports per-(field, term) doc counts and per-field total_docs_len / num_docs.
  2. Coordinator merges into GlobalScoringStats (single instance, all shards).
  3. Phase 2: each shard runs the actual search, scoring with the merged counts so IDF/avgdl are identical regardless of shard count.

Single-shard skips phase 1 (local == global). KNN/HNSW skips it (vector distance, not text scoring).

jaccard@10 identical across shard counts confirms shard-independence.

@vyavdoshenko vyavdoshenko self-assigned this May 1, 2026
Copilot AI review requested due to automatic review settings May 1, 2026 14:43
@vyavdoshenko
Copy link
Copy Markdown
Contributor Author

augment review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes shard-count-dependent ranking for text scorers (BM25STD/TFIDF/TFIDF.DOCNORM) by switching multi-shard queries to use merged, cluster-wide scoring statistics (global IDF + avgdl), ensuring consistent top-K ordering and scores regardless of --proactor_threads.

Changes:

  • Add a DFS-style two-phase flow for multi-shard scoring queries: collect per-shard term/field stats, merge into GlobalScoringStats, then score using the merged stats.
  • Add deterministic tie-breaking for score-based ordering across shards (by (score, key)), including a hidden __key tie-breaker for FT.AGGREGATE sorting.
  • Add Python integration tests that assert identical top-K ordering/scores across 1-shard vs 4-shard setups for FT.SEARCH and FT.AGGREGATE.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/dragonfly/search_test.py Adds shard-independence regressions for FT.SEARCH/FT.AGGREGATE scorers and a SORTBY/WITHSCORES alignment test.
src/server/search/search_family.cc Implements the phase-1 stats collection + merge, passes global stats into per-shard searches, and re-sorts merged results by (score,key) for correct global LIMIT behavior.
src/server/search/doc_index.h Extends per-shard search API to accept optional GlobalScoringStats and adds CollectScoringStats.
src/server/search/doc_index.cc Wires global stats into search execution; adds per-shard re-rank by (score,key) and injects hidden __key for aggregate determinism.
src/server/search/aggregator.cc Adds hidden __key tie-breaker in sort comparator to make cross-shard ordering deterministic when explicit sort fields tie.
src/core/search/search.h Extends SearchAlgorithm::Search to accept optional global stats and adds CollectScoringStats.
src/core/search/search.cc Uses global stats for term_docs/avgdl in scoring; adds AST-based StatsCollector to compute per-shard scoring stats; caches posting lists for matched terms.
src/core/search/scoring.h / src/core/search/scoring.cc Introduces ShardScoringStats/GlobalScoringStats and merge/query helpers for global IDF/avgdl inputs.
src/core/search/indices.h Adds per-field total-doc-length accessor and stores schema field identifier on text indices for canonical stat keys.

Comment thread src/core/search/search.cc Outdated
Comment thread src/core/search/search.cc Outdated
Comment thread src/server/search/doc_index.cc
@augmentcode
Copy link
Copy Markdown

augmentcode Bot commented May 1, 2026

🤖 Augment PR Summary

Summary: Makes BM25STD/TFIDF/TFIDF.DOCNORM rankings independent of shard count by scoring with cluster-wide (merged) IDF and avgdl instead of per-shard local stats.

Changes:

  • Adds ShardScoringStats and GlobalScoringStats plus merge/query helpers for scorer inputs.
  • Extends text indices to expose total doc-length sums and a schema-canonical field identifier for cross-shard aggregation.
  • Introduces a phase-1 AST walk (StatsCollector) to collect per-(field,term) document counts and per-field length totals on each shard.
  • Runs phase-2 searches with the merged global stats so all shards compute identical IDF/avgdl and produce shard-count-stable scores.
  • Caches posting-list containers during AST traversal to avoid repeated trie lookups during scoring.
  • Improves cross-shard determinism by re-ranking/merging on (text score, key) and adding a hidden __key tie-breaker for aggregation sorts.
  • Adds regression tests to verify top-K ordering and near-identical scores across --proactor_threads for the affected scorers, plus a SORTBY/WITHSCORES alignment check.

Technical Notes: Global-stats collection is skipped for single-shard and vector-only (HNSW) queries; hybrid KNN prefilter queries still use the global text stats for scoring.

🤖 Was this summary useful? React with 👍 or 👎

Copy link
Copy Markdown

@augmentcode augmentcode Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 1 suggestion posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

Comment thread src/core/search/search.cc Outdated
Copy link
Copy Markdown

@augmentcode augmentcode Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 2 suggestions posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

Comment thread src/core/search/search.cc Outdated
Comment thread src/server/search/search_family.cc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants