fix(search): use global stats for shard-independent text scoring by vyavdoshenko · Pull Request #7250 · dragonflydb/dragonfly

vyavdoshenko · 2026-05-01T14:43:38Z

BM25STD/TFIDF/TFIDF.DOCNORM rankings shifted with --proactor_threads because each shard scored docs using its local IDF and avgdl. With the same dataset, top-K and scores diverged across shard counts. This PR makes scoring use stats merged across all shards so ranking is shard-count-independent.

Approach:
DFS-style two-phase query:

Phase 1: each shard walks the AST and reports per-(field, term) doc counts and per-field total_docs_len / num_docs.
Coordinator merges into GlobalScoringStats (single instance, all shards).
Phase 2: each shard runs the actual search, scoring with the merged counts so IDF/avgdl are identical regardless of shard count.

Single-shard skips phase 1 (local == global). KNN/HNSW skips it (vector distance, not text scoring).

jaccard@10 identical across shard counts confirms shard-independence.

vyavdoshenko · 2026-05-01T14:46:22Z

augment review

Copilot

Pull request overview

This PR fixes shard-count-dependent ranking for text scorers (BM25STD/TFIDF/TFIDF.DOCNORM) by switching multi-shard queries to use merged, cluster-wide scoring statistics (global IDF + avgdl), ensuring consistent top-K ordering and scores regardless of --proactor_threads.

Changes:

Add a DFS-style two-phase flow for multi-shard scoring queries: collect per-shard term/field stats, merge into GlobalScoringStats, then score using the merged stats.
Add deterministic tie-breaking for score-based ordering across shards (by (score, key)), including a hidden __key tie-breaker for FT.AGGREGATE sorting.
Add Python integration tests that assert identical top-K ordering/scores across 1-shard vs 4-shard setups for FT.SEARCH and FT.AGGREGATE.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
tests/dragonfly/search_test.py	Adds shard-independence regressions for FT.SEARCH/FT.AGGREGATE scorers and a SORTBY/WITHSCORES alignment test.
src/server/search/search_family.cc	Implements the phase-1 stats collection + merge, passes global stats into per-shard searches, and re-sorts merged results by `(score,key)` for correct global LIMIT behavior.
src/server/search/doc_index.h	Extends per-shard search API to accept optional `GlobalScoringStats` and adds `CollectScoringStats`.
src/server/search/doc_index.cc	Wires global stats into search execution; adds per-shard re-rank by `(score,key)` and injects hidden `__key` for aggregate determinism.
src/server/search/aggregator.cc	Adds hidden `__key` tie-breaker in sort comparator to make cross-shard ordering deterministic when explicit sort fields tie.
src/core/search/search.h	Extends `SearchAlgorithm::Search` to accept optional global stats and adds `CollectScoringStats`.
src/core/search/search.cc	Uses global stats for `term_docs`/avgdl in scoring; adds AST-based `StatsCollector` to compute per-shard scoring stats; caches posting lists for matched terms.
src/core/search/scoring.h / src/core/search/scoring.cc	Introduces `ShardScoringStats`/`GlobalScoringStats` and merge/query helpers for global IDF/avgdl inputs.
src/core/search/indices.h	Adds per-field total-doc-length accessor and stores schema field identifier on text indices for canonical stat keys.

augmentcode · 2026-05-01T14:55:08Z

🤖 Augment PR Summary

Summary: Makes BM25STD/TFIDF/TFIDF.DOCNORM rankings independent of shard count by scoring with cluster-wide (merged) IDF and avgdl instead of per-shard local stats.

Changes:

Adds ShardScoringStats and GlobalScoringStats plus merge/query helpers for scorer inputs.
Extends text indices to expose total doc-length sums and a schema-canonical field identifier for cross-shard aggregation.
Introduces a phase-1 AST walk (StatsCollector) to collect per-(field,term) document counts and per-field length totals on each shard.
Runs phase-2 searches with the merged global stats so all shards compute identical IDF/avgdl and produce shard-count-stable scores.
Caches posting-list containers during AST traversal to avoid repeated trie lookups during scoring.
Improves cross-shard determinism by re-ranking/merging on (text score, key) and adding a hidden __key tie-breaker for aggregation sorts.
Adds regression tests to verify top-K ordering and near-identical scores across --proactor_threads for the affected scorers, plus a SORTBY/WITHSCORES alignment check.

Technical Notes: Global-stats collection is skipped for single-shard and vector-only (HNSW) queries; hybrid KNN prefilter queries still use the global text stats for scoring.

_{🤖 Was this summary useful? React with 👍 or 👎}

augmentcode

Review completed. 1 suggestion posted.

Comment augment review to trigger a new review at any time.

augmentcode

Review completed. 2 suggestions posted.

Comment augment review to trigger a new review at any time.

vyavdoshenko self-assigned this May 1, 2026

Copilot AI review requested due to automatic review settings May 1, 2026 14:43

Copilot started reviewing on behalf of vyavdoshenko May 1, 2026 14:44 View session

Copilot AI reviewed May 1, 2026

View reviewed changes

Comment thread src/core/search/search.cc Outdated

Comment thread src/core/search/search.cc Outdated

Comment thread src/server/search/doc_index.cc

augmentcode Bot reviewed May 1, 2026

View reviewed changes

Comment thread src/core/search/search.cc Outdated

augmentcode Bot reviewed May 1, 2026

View reviewed changes

Comment thread src/core/search/search.cc Outdated

Comment thread src/server/search/search_family.cc

fix: scoring

c606034

vyavdoshenko force-pushed the bobik/scroting_update branch from bab93ca to c606034 Compare May 1, 2026 15:30

vyavdoshenko requested review from BorysTheDev, dranikpg, mkaruza and romange May 1, 2026 15:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(search): use global stats for shard-independent text scoring#7250

fix(search): use global stats for shard-independent text scoring#7250
vyavdoshenko wants to merge 1 commit intomainfrom
bobik/scroting_update

vyavdoshenko commented May 1, 2026

Uh oh!

vyavdoshenko commented May 1, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

augmentcode Bot commented May 1, 2026 •

edited

Loading

Uh oh!

augmentcode Bot left a comment

Uh oh!

Uh oh!

augmentcode Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

vyavdoshenko commented May 1, 2026

Uh oh!

vyavdoshenko commented May 1, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

augmentcode Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

augmentcode Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

augmentcode Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

augmentcode Bot commented May 1, 2026 •

edited

Loading