add `chunk_expansion_boundary_fields` to solr vector_io provider by mwcz · Pull Request #54 · lightspeed-core/lightspeed-providers

mwcz · 2026-02-17T04:25:47Z

Description

This is a follow-up to the Solr provider PR #51. The current Solr provider has a "chunk window expansion" algorithm that, when enabled, starts with the matched chunk and adds adjacent chunks from the same document in both directions until a token budget is reached. The idea is to give the LLM more context, but in many documents the expansion is too aggressive and pulls in unrelated context. To improve this, this PR adds an optional configuration option, chunk_expansion_boundary_fields, which lists field names whose values must match those of the originally matched chunk for an adjacent chunk to be included in the expanding window.

Restricting chunk expansion to specific headings (h1/h2/etc in HTML) is the motivating idea. If we have chunks from a single document like this:

| chunk_index | parent_id | heading | chunk |
| --- | --- | --- | --- |
| 0 | doc_1 | Install guide | Run the installer with `./install.sh`. |
| 1 | doc_1 | Install guide | Reboot after installation completes. |
| 2 | doc_1 | Legal notice | This software is provided as-is. |
| 3 | doc_1 | Legal notice | Redistribution requires written consent. |

If we run a query and the "Reboot after installation completes." chunk is the first match, chunk window expansion starts including adjacent chunks. Before this change, "This software is provided as-is." would be included despite being unrelated.
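To make the problem concrete, here is a minimal sketch of an unbounded bidirectional expansion over the sample chunks. It is not the provider's actual code: `expand_window`, `count_tokens`, the dict layout, and the whitespace-based token count are all assumptions made for illustration, and the list is assumed to already hold only chunks from one parent document.

```python
from typing import Dict, List

# Sample chunks from the table above (the dict layout is an assumption).
CHUNKS: List[Dict] = [
    {"chunk_index": 0, "parent_id": "doc_1", "heading": "Install guide",
     "chunk": "Run the installer with `./install.sh`."},
    {"chunk_index": 1, "parent_id": "doc_1", "heading": "Install guide",
     "chunk": "Reboot after installation completes."},
    {"chunk_index": 2, "parent_id": "doc_1", "heading": "Legal notice",
     "chunk": "This software is provided as-is."},
    {"chunk_index": 3, "parent_id": "doc_1", "heading": "Legal notice",
     "chunk": "Redistribution requires written consent."},
]


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def expand_window(chunks: List[Dict], anchor_index: int, token_budget: int) -> List[Dict]:
    """Grow a window around the anchor in both directions until the budget is spent."""
    lo = hi = anchor_index
    used = count_tokens(chunks[anchor_index]["chunk"])
    while True:
        grew = False
        # Try the chunk just before the window.
        if lo > 0:
            cost = count_tokens(chunks[lo - 1]["chunk"])
            if used + cost <= token_budget:
                lo, used, grew = lo - 1, used + cost, True
        # Try the chunk just after the window.
        if hi < len(chunks) - 1:
            cost = count_tokens(chunks[hi + 1]["chunk"])
            if used + cost <= token_budget:
                hi, used, grew = hi + 1, used + cost, True
        if not grew:
            break
    return chunks[lo:hi + 1]


window = expand_window(CHUNKS, anchor_index=1, token_budget=14)
print([c["chunk_index"] for c in window])  # [0, 1, 2]
```

With a budget of 14 of these stand-in tokens, the window around chunk 1 grows to include chunk 2, which belongs to the "Legal notice" heading.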

With the new configuration option, we can ensure only chunks with the same heading are eligible for inclusion.

chunk_expansion_boundary_fields: ["parent_id", "heading"]

With this config, when "Reboot after installation completes." is matched with parent_id=doc_1 and heading="Install guide", only chunks with those same values are eligible for the chunk expansion algorithm. The context sent to the LLM will be "Run the installer with ./install.sh. Reboot after installation completes.", and the legal notice text is not included.

Note: chunk_parent_id_field is always used as a boundary regardless of this setting. chunk_expansion_boundary_fields specifies additional boundaries.
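The boundary check can be sketched as a simple filter over candidate context chunks. This is an illustration under stated assumptions, not the code in this PR: `same_boundary` and `filter_context` are hypothetical helper names, chunks are assumed to be plain dicts, and the parent-ID field is assumed to be named `parent_id` as in the sample table.

```python
from typing import Dict, List, Sequence


def same_boundary(anchor: Dict, candidate: Dict, fields: Sequence[str]) -> bool:
    """True when the candidate matches the anchor on every boundary field."""
    return all(candidate.get(f) == anchor.get(f) for f in fields)


def filter_context(
    anchor: Dict,
    candidates: List[Dict],
    boundary_fields: Sequence[str],
    parent_id_field: str = "parent_id",  # assumed field name
) -> List[Dict]:
    """Keep only candidates that share the anchor's boundary field values."""
    # The parent-ID field is always a boundary; configured fields are additional.
    fields = [parent_id_field] + [f for f in boundary_fields if f != parent_id_field]
    return [c for c in candidates if same_boundary(anchor, c, fields)]


# Sample chunks from the description (the dict layout is an assumption).
chunks = [
    {"chunk_index": 0, "parent_id": "doc_1", "heading": "Install guide"},
    {"chunk_index": 1, "parent_id": "doc_1", "heading": "Install guide"},
    {"chunk_index": 2, "parent_id": "doc_1", "heading": "Legal notice"},
    {"chunk_index": 3, "parent_id": "doc_1", "heading": "Legal notice"},
]
anchor = chunks[1]  # the matched "Reboot after installation completes." chunk

bounded = filter_context(anchor, chunks, ["parent_id", "heading"])
print([c["chunk_index"] for c in bounded])  # [0, 1]

unbounded = filter_context(anchor, chunks, [])
print([c["chunk_index"] for c in unbounded])  # [0, 1, 2, 3]
```

With an empty list, only the parent document acts as a boundary, so all four chunks stay eligible; adding `heading` narrows the candidates to the "Install guide" section.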

Logging cleanup

Also removes print() debug statements and downgrades several high-frequency log.info calls to log.debug (per-document score filtering, per-chunk expansion steps, cache hits).
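As a rough illustration of the kind of change described, the sketch below logs per-document filtering detail at DEBUG using Python's standard `logging` module. The logger name, the `filter_by_score` function, and the single INFO summary line are assumptions for this example, not code taken from the provider.

```python
import logging
from typing import Dict, List

log = logging.getLogger("solr_vector_io_sketch")  # hypothetical logger name


def filter_by_score(docs: List[Dict], min_score: float) -> List[Dict]:
    """Drop documents scoring below min_score, logging per-document detail at DEBUG."""
    kept = []
    for doc in docs:
        if doc["score"] >= min_score:
            kept.append(doc)
        else:
            # High-frequency, per-document message: DEBUG keeps it out of INFO logs.
            log.debug("dropping %s: score %.2f < %.2f", doc["id"], doc["score"], min_score)
    # One summary line per call at INFO (an assumption for this sketch).
    log.info("kept %d of %d documents", len(kept), len(docs))
    return kept
```

At the default INFO level a call emits a single summary line; setting the logger to DEBUG brings back the per-document messages when they are needed for troubleshooting.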

Type of change

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

Assisted-by: Opus 4.6
Generated by: Claude code 2.1.44

Related Tickets & Documents

Related PR add Solr vector_io provider #51
Related Issue #
Closes #

Checklist before requesting a review

I have performed a self-review of my code.
PR has passed all pre-merge test jobs.
If it is a core feature, I have added thorough tests.

Testing

Please provide detailed steps to perform tests related to this code change.
How were the fix/results from this change verified? Please provide relevant screenshots or results.

Adds an optional list of chunk field names that constrain window expansion. Context chunks are filtered to only those sharing the same boundary field values as the anchor chunk. This allows expansion to be bounded by heading/section in addition to parent document.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mwcz force-pushed the mwcz/solr-provider-chunk-expansion-boundary-enhancement branch from 3ff95be to 5244d43 Compare February 21, 2026 19:33

mwcz force-pushed the mwcz/solr-provider-chunk-expansion-boundary-enhancement branch from 5244d43 to f20b55a Compare February 21, 2026 23:16

mwcz wants to merge 1 commit into lightspeed-core:main from mwcz:mwcz/solr-provider-chunk-expansion-boundary-enhancement