feat(search): add configurable file-level deduplication by Don-Yin · Pull Request #188 · yoanbernabeu/grepai

Don-Yin · 2026-03-16T16:10:27Z

problem

when a search query matches several chunks from the same file, the result list fills up with duplicate file entries. this crowds out results from other relevant files and reduces the effective diversity of the top-N list. in practice, a user who asks for 10 results may receive 4 or 5 entries that all point to the same file, which leaves fewer slots for other files that would otherwise rank highly.

solution

this PR adds a search.dedup.enabled option (default: true) that retains only the highest-scoring chunk per file path. deduplication runs after boost scoring, so the best chunk is selected with structural penalties and bonuses already applied.

when dedup is enabled, the internal fetch limit increases from limit * 2 to limit * 4 so that enough unique files survive the filter. users can disable it in config.yaml if they prefer the original behaviour:

search:
  dedup:
    enabled: false
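
The fetch-limit scaling described above could be sketched as follows. This is a minimal illustration, not the code from the PR: the `Enabled` field, its YAML tag, and the `computeFetchLimit` helper are assumptions, since the description names only the `DedupConfig` struct and the `limit * 2` / `limit * 4` multipliers.

```go
package main

import "fmt"

// DedupConfig mirrors the search.dedup block in config.yaml.
// The field name and tag are assumptions; the PR only names the struct.
type DedupConfig struct {
	Enabled bool `yaml:"enabled"`
}

// computeFetchLimit is a hypothetical helper. It over-fetches by 4x when
// dedup is enabled, so that enough unique files survive the filter, and
// by 2x otherwise (the original behaviour).
func computeFetchLimit(limit int, dedup DedupConfig) int {
	if dedup.Enabled {
		return limit * 4
	}
	return limit * 2
}

func main() {
	fmt.Println(computeFetchLimit(10, DedupConfig{Enabled: true}))  // 40
	fmt.Println(computeFetchLimit(10, DedupConfig{Enabled: false})) // 20
}
```

With a requested limit of 10, the search would fetch 40 candidates when dedup is on and 20 when it is off.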

changes

| file | what |
| --- | --- |
| `config/config.go` | add `DedupConfig` struct, wire it into `SearchConfig`, set the default to `true` |
| `search/search.go` | read the dedup config, scale `fetchLimit` conditionally, call `DeduplicateByFile` when enabled |
| `search/dedup.go` | implement `DeduplicateByFile` (keeps the first occurrence per file path from a pre-sorted slice) |
| `search/dedup_test.go` | unit tests: standard case, empty input, all-unique input |
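
A first-occurrence filter of the kind described for `search/dedup.go` might look like the sketch below. The `Result` type and its fields are hypothetical stand-ins for the project's real search result type, and the body is only one plausible way to implement the stated behaviour.

```go
package main

import "fmt"

// Result is a hypothetical stand-in for the search result type.
type Result struct {
	FilePath string
	Score    float64
}

// DeduplicateByFile keeps the first result seen for each file path.
// It assumes the input is already sorted by descending score, so the
// first occurrence is also the highest-scoring chunk for that file.
func DeduplicateByFile(results []Result) []Result {
	seen := make(map[string]struct{}, len(results))
	out := make([]Result, 0, len(results))
	for _, r := range results {
		if _, dup := seen[r.FilePath]; dup {
			continue // a better-ranked chunk from this file was already kept
		}
		seen[r.FilePath] = struct{}{}
		out = append(out, r)
	}
	return out
}

func main() {
	// Sample data, already sorted by score after boosting.
	sorted := []Result{
		{"search/search.go", 0.92},
		{"search/search.go", 0.88},
		{"config/config.go", 0.80},
		{"search/search.go", 0.75},
	}
	for _, r := range DeduplicateByFile(sorted) {
		fmt.Printf("%s %.2f\n", r.FilePath, r.Score)
	}
}
```

Because the function relies on the input order, it has to run after boost scoring and sorting; calling it on an unsorted slice would keep an arbitrary chunk per file rather than the best one.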

test plan

- `go test ./search/` passes (all 20 tests, including 3 new dedup tests)
- `go test ./config/` passes
- `go build ./...` succeeds with no warnings
- manual verification: run `grepai search <query> --json` and confirm that each file path appears at most once in the output
- manual verification: set `search.dedup.enabled: false` in `config.yaml` and confirm that duplicate file entries reappear
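
The manual "each file path appears at most once" check could be automated along these lines. This is a rough sketch that assumes the `--json` output is an array of objects with a `file_path` key; that key name is an assumption, as is the `duplicatePaths` helper, so adjust both to the real output schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hit captures only the field this check needs. The JSON key "file_path"
// is an assumption about the --json output, not a confirmed schema.
type hit struct {
	FilePath string `json:"file_path"`
}

// duplicatePaths returns every file path that appears more than once,
// in the order in which the first repeat was encountered.
func duplicatePaths(data []byte) ([]string, error) {
	var hits []hit
	if err := json.Unmarshal(data, &hits); err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	var dups []string
	for _, h := range hits {
		counts[h.FilePath]++
		if counts[h.FilePath] == 2 {
			dups = append(dups, h.FilePath)
		}
	}
	return dups, nil
}

func main() {
	// Inline sample standing in for captured search output.
	sample := []byte(`[{"file_path":"search/search.go"},{"file_path":"config/config.go"}]`)
	dups, err := duplicatePaths(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("duplicates found:", len(dups))
}
```

With dedup enabled the helper should report no duplicates; with `search.dedup.enabled: false` it should list the files that occupy more than one slot.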

note

this branch will remain open for additional work (e.g. wiring dedup into the workspace search path). further commits may follow.

when multiple chunks from the same file match a query, the result list can contain several entries that all point to the same file. this crowds out results from other relevant files and reduces the effective diversity of the top-N list.

this commit introduces a `search.dedup.enabled` config option (default: true) that retains only the highest-scoring chunk per file path after boost scoring. when enabled, the internal fetch limit is doubled (from 2x to 4x the requested limit) so that enough unique files survive the filter.

changes:
- add DedupConfig struct and wire it into SearchConfig
- condition dedup and fetch-limit scaling on the config flag
- add DeduplicateByFile helper with unit tests

Don-Yin force-pushed the feat/search-dedup-by-file branch from 110dff6 to 8c14033 on March 16, 2026 17:38

Don-Yin wants to merge 1 commit into yoanbernabeu:main from Don-Yin:feat/search-dedup-by-file
