feat(search): add configurable file-level deduplication#188

Open
Don-Yin wants to merge 1 commit into yoanbernabeu:main from Don-Yin:feat/search-dedup-by-file

Conversation

@Don-Yin Don-Yin commented Mar 16, 2026

problem

when a search query matches several chunks from the same file, the result list fills up with duplicate file entries. this crowds out results from other relevant files and reduces the effective diversity of the top-N list. in practice, a user who asks for 10 results may receive 4 or 5 entries that all point to the same file, which leaves fewer slots for other files that would otherwise rank highly.

solution

this PR adds a search.dedup.enabled option (default: true) that retains only the highest-scoring chunk per file path. deduplication runs after boost scoring, so the best chunk is selected with structural penalties and bonuses already applied.

when dedup is enabled, the internal fetch limit increases from limit * 2 to limit * 4 so that enough unique files survive the filter. users can disable it in config.yaml if they prefer the original behaviour:

```yaml
search:
  dedup:
    enabled: false
```
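A minimal sketch of the dedup pass described above. Only the helper name `DeduplicateByFile` comes from this PR; the `Result` struct and its field names are illustrative assumptions, not the project's actual types.

```go
package main

import "fmt"

// Result is a hypothetical stand-in for a search hit; the real
// project type and field names may differ.
type Result struct {
	FilePath string
	Score    float64
}

// DeduplicateByFile keeps the first occurrence per file path from a
// slice already sorted by descending score. Because boost scoring has
// already run, the first occurrence is the best chunk for that file.
func DeduplicateByFile(results []Result) []Result {
	seen := make(map[string]bool, len(results))
	out := results[:0:0] // fresh backing array, same element type
	for _, r := range results {
		if seen[r.FilePath] {
			continue
		}
		seen[r.FilePath] = true
		out = append(out, r)
	}
	return out
}

func main() {
	in := []Result{
		{"a.go", 0.9}, {"a.go", 0.8}, {"b.go", 0.7}, {"a.go", 0.6}, {"c.go", 0.5},
	}
	for _, r := range DeduplicateByFile(in) {
		fmt.Println(r.FilePath, r.Score)
	}
}
```

Note that the helper relies entirely on the input being pre-sorted; it never compares scores itself.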

changes

| file | what |
| --- | --- |
| `config/config.go` | add `DedupConfig` struct, wire it into `SearchConfig`, set the default to `true` |
| `search/search.go` | read the dedup config, scale `fetchLimit` conditionally, call `DeduplicateByFile` when enabled |
| `search/dedup.go` | implement `DeduplicateByFile` (keeps the first occurrence per file path from a pre-sorted slice) |
| `search/dedup_test.go` | unit tests: standard case, empty input, all-unique input |
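The conditional fetch-limit scaling can be sketched as below; the function name `fetchLimit` and its shape are assumptions for illustration, not the PR's actual code.

```go
package main

import "fmt"

// fetchLimit returns how many candidates to pull from the index
// before post-processing: limit*2 normally, limit*4 when dedup may
// discard several chunks per file and we still need `limit` unique
// files to survive the filter.
func fetchLimit(limit int, dedupEnabled bool) int {
	if dedupEnabled {
		return limit * 4
	}
	return limit * 2
}

func main() {
	fmt.Println(fetchLimit(10, false)) // 20 candidates fetched
	fmt.Println(fetchLimit(10, true))  // 40 candidates fetched
}
```

Over-fetching by 4x is a heuristic: if one file dominates the candidate pool even at 4x, fewer than `limit` unique files can still come back.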

test plan

  • go test ./search/ passes (all 20 tests, including 3 new dedup tests)
  • go test ./config/ passes
  • go build ./... succeeds with no warnings
  • manual verification: run grepai search <query> --json and confirm that each file path appears at most once in the output
  • manual verification: set search.dedup.enabled: false in config.yaml and confirm that duplicate file entries reappear
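The manual uniqueness check above can also be scripted. This sketch assumes the `--json` output is an array of objects with a `"file"` field — that field name is a guess about the output shape, not confirmed by the PR:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// duplicatePaths parses a JSON array of search results and returns
// every file path that appears more than once; an empty result means
// dedup worked.
func duplicatePaths(jsonData []byte) ([]string, error) {
	var results []struct {
		File string `json:"file"` // assumed field name
	}
	if err := json.Unmarshal(jsonData, &results); err != nil {
		return nil, err
	}
	seen := map[string]int{}
	var dups []string
	for _, r := range results {
		seen[r.File]++
		if seen[r.File] == 2 {
			dups = append(dups, r.File)
		}
	}
	return dups, nil
}

func main() {
	// stand-in for the real `grepai search <query> --json` output
	sample := []byte(`[{"file":"a.go"},{"file":"b.go"},{"file":"a.go"}]`)
	dups, err := duplicatePaths(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(dups) // [a.go]
}
```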

note

this branch will remain open for additional work (e.g. wiring dedup into the workspace search path). further commits may follow.

commit message

when multiple chunks from the same file match a query, the
result list can contain several entries that all point to the
same file. this crowds out results from other relevant files
and reduces the effective diversity of the top-N list.

this commit introduces a `search.dedup.enabled` config option
(default: true) that retains only the highest-scoring chunk
per file path after boost scoring. when enabled, the internal
fetch limit is doubled (from 2x to 4x the requested limit)
so that enough unique files survive the filter.

changes:
- add DedupConfig struct and wire it into SearchConfig
- condition dedup and fetch-limit scaling on the config flag
- add DeduplicateByFile helper with unit tests
@Don-Yin force-pushed the feat/search-dedup-by-file branch from 110dff6 to 8c14033 on March 16, 2026 17:38