add chunk_expansion_boundary_fields to solr vector_io provider #54
Draft
mwcz wants to merge 1 commit into lightspeed-core:main
Adds an optional list of chunk field names that constrain window expansion. Context chunks are filtered to only those sharing the same boundary field values as the anchor chunk. This allows expansion to be bounded by heading/section in addition to parent document.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Description
This is a follow-up to the Solr provider PR #51. The current Solr provider has a "chunk window expansion" algorithm that, when enabled, starts from the matched chunk and adds neighboring chunks from the same document in both directions until a token budget is reached. The idea is to give the LLM more context, but in many documents the expansion is too aggressive and pulls in unrelated content. To improve this, this PR adds an optional configuration option, `chunk_expansion_boundary_fields`, which lists field names whose values must match those of the originally matched chunk for an adjacent chunk to be included in the expanding window.

Restricting chunk expansion to specific headings (h1/h2/etc. in HTML) is the motivating idea. Suppose a single document yields three chunks: "Run the installer with ./install.sh." and "Reboot after installation completes.", both under the heading "Install guide", and a legal notice, "This software is provided as-is.", under a different heading.

If we run a query and the "Reboot after installation completes." chunk is the first match, chunk window expansion starts including adjacent chunks. Before this change, "This software is provided as-is." would be included even though it is utterly unrelated.
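The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the provider's actual code: the `Chunk` dataclass, the `expand_window` and `count_tokens` helpers, the word-count token estimate, and the "License" heading value are all assumptions made for the example. The sketch also assumes expansion stops at the first non-matching neighbor; the real provider may filter differently.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    # Hypothetical chunk shape: text plus arbitrary metadata fields.
    text: str
    fields: dict = field(default_factory=dict)


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def expand_window(chunks, anchor_idx, token_budget, boundary_fields=()):
    """Grow a contiguous window around chunks[anchor_idx] in both directions.

    A neighbor is eligible only if it matches the anchor on every field in
    boundary_fields. A direction stops growing at the first ineligible
    neighbor or when adding it would exceed the token budget.
    """
    anchor = chunks[anchor_idx]

    def eligible(candidate):
        return all(
            candidate.fields.get(name) == anchor.fields.get(name)
            for name in boundary_fields
        )

    lo = hi = anchor_idx
    used = count_tokens(anchor.text)
    grew = True
    while grew:
        grew = False
        for step in (-1, 1):
            idx = lo - 1 if step < 0 else hi + 1
            if not 0 <= idx < len(chunks):
                continue
            candidate = chunks[idx]
            cost = count_tokens(candidate.text)
            if not eligible(candidate) or used + cost > token_budget:
                continue
            used += cost
            if step < 0:
                lo = idx
            else:
                hi = idx
            grew = True
    return chunks[lo:hi + 1]


# Example chunks from the description; the "License" heading is an assumed value.
chunks = [
    Chunk("Run the installer with ./install.sh.",
          {"parent_id": "doc_1", "heading": "Install guide"}),
    Chunk("Reboot after installation completes.",
          {"parent_id": "doc_1", "heading": "Install guide"}),
    Chunk("This software is provided as-is.",
          {"parent_id": "doc_1", "heading": "License"}),
]

# Parent document as the only boundary: the legal notice is pulled in.
before = expand_window(chunks, 1, token_budget=50, boundary_fields=("parent_id",))

# Parent document plus heading: expansion stays inside "Install guide".
after = expand_window(chunks, 1, token_budget=50,
                      boundary_fields=("parent_id", "heading"))

print(" ".join(c.text for c in after))
```

Here `boundary_fields` plays the role of the parent ID field combined with `chunk_expansion_boundary_fields`. Passing only `"parent_id"` reproduces the old, document-wide expansion, while adding `"heading"` keeps the window inside the matched section.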
With the new configuration option, we can ensure only chunks with the same `heading` are eligible for inclusion. With `heading` listed in `chunk_expansion_boundary_fields`, when "Reboot after installation completes." is matched with its `parent_id=doc_1` and `heading="Install guide"`, only chunks with those same values are considered by the chunk expansion algorithm. The context sent to the LLM is "Run the installer with ./install.sh. Reboot after installation completes.", and the legal notice text is not included.

Note: `chunk_parent_id_field` is always used as a boundary regardless of this setting. `chunk_expansion_boundary_fields` specifies additional boundaries.

Logging cleanup
Also removes `print()` debug statements and downgrades several high-frequency `log.info` calls to `log.debug` (per-document score filtering, per-chunk expansion steps, cache hits).

Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing