add chunk_expansion_boundary_fields to solr vector_io provider#54

Draft
mwcz wants to merge 1 commit into lightspeed-core:main from mwcz:mwcz/solr-provider-chunk-expansion-boundary-enhancement

@mwcz mwcz commented Feb 17, 2026

Description

This is a follow-up to the Solr provider PR #51. The current Solr provider has a "chunk window expansion" algorithm that, when enabled, starts with the matched chunk and adds adjacent chunks from the same document in both directions until a token budget is reached. The idea is to give the LLM more context, but in many documents the expansion is too aggressive and pulls in unrelated content. To improve this, this PR adds an optional configuration option, chunk_expansion_boundary_fields, which takes a list of field names. An adjacent chunk is included in the expanding window only if its values for those fields match those of the originally matched chunk.
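
For reviewers unfamiliar with the existing behavior, here is a minimal sketch of bidirectional window expansion under a token budget. It is not the provider's actual code: the function name expand_window, the count_tokens parameter, and the whitespace-based token count are all assumptions made for illustration.

```python
def expand_window(chunks, anchor_index, token_budget,
                  count_tokens=lambda text: len(text.split())):
    """Grow a window around chunks[anchor_index] until the budget is used up.

    Hypothetical helper: tokens are approximated by whitespace-separated
    words, which is an assumption, not the provider's tokenizer.
    """
    lo = hi = anchor_index
    used = count_tokens(chunks[anchor_index])
    while True:
        grew = False
        # Try to add the chunk just before the current window.
        if lo > 0:
            cost = count_tokens(chunks[lo - 1])
            if used + cost <= token_budget:
                lo -= 1
                used += cost
                grew = True
        # Try to add the chunk just after the current window.
        if hi < len(chunks) - 1:
            cost = count_tokens(chunks[hi + 1])
            if used + cost <= token_budget:
                hi += 1
                used += cost
                grew = True
        if not grew:
            break
    return chunks[lo:hi + 1]


doc_chunks = [
    "Run the installer with ./install.sh.",
    "Reboot after installation completes.",
    "This software is provided as-is.",
    "Redistribution requires written consent.",
]
window = expand_window(doc_chunks, anchor_index=1, token_budget=14)
```

In this sketch, a budget of 14 tokens with the second chunk as the anchor yields the first three chunks, so unrelated text from the following section ends up in the window.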

The motivating use case is restricting chunk expansion to the section under a given heading (h1, h2, etc. in HTML). Suppose a single document is split into chunks like this:

| chunk_index | parent_id | heading | chunk |
| --- | --- | --- | --- |
| 0 | doc_1 | Install guide | Run the installer with ./install.sh. |
| 1 | doc_1 | Install guide | Reboot after installation completes. |
| 2 | doc_1 | Legal notice | This software is provided as-is. |
| 3 | doc_1 | Legal notice | Redistribution requires written consent. |

Suppose a query's first match is the "Reboot after installation completes." chunk. Chunk window expansion then starts including adjacent chunks. Before this change, "This software is provided as-is." would be included even though it is unrelated to the match.

With the new configuration option, we can ensure only chunks with the same heading are eligible for inclusion.

chunk_expansion_boundary_fields: ["parent_id", "heading"]

With this config, when "Reboot after installation completes." is matched, with parent_id=doc_1 and heading="Install guide", only chunks with those same values are eligible for chunk expansion. The context sent to the LLM is "Run the installer with ./install.sh. Reboot after installation completes.", and the legal notice text is excluded.

Note: chunk_parent_id_field is always used as a boundary regardless of this setting. chunk_expansion_boundary_fields specifies additional boundaries.
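
The boundary check described above can be sketched as follows. This is an illustrative sketch rather than the code in this PR: the helper names within_boundary and eligible_chunks are hypothetical, chunks are modeled as plain dicts, and the parent ID field is assumed to be named parent_id.

```python
def within_boundary(candidate, anchor, boundary_fields,
                    parent_id_field="parent_id"):
    """Return True if candidate matches anchor on every boundary field.

    The parent ID field is always checked, mirroring the note that it
    acts as a boundary regardless of the configured list.
    """
    fields = [parent_id_field, *boundary_fields]
    return all(candidate.get(f) == anchor.get(f) for f in fields)


def eligible_chunks(chunks, anchor, boundary_fields):
    """Keep only the chunks that may take part in window expansion."""
    return [c for c in chunks if within_boundary(c, anchor, boundary_fields)]


chunks = [
    {"chunk_index": 0, "parent_id": "doc_1", "heading": "Install guide",
     "chunk": "Run the installer with ./install.sh."},
    {"chunk_index": 1, "parent_id": "doc_1", "heading": "Install guide",
     "chunk": "Reboot after installation completes."},
    {"chunk_index": 2, "parent_id": "doc_1", "heading": "Legal notice",
     "chunk": "This software is provided as-is."},
    {"chunk_index": 3, "parent_id": "doc_1", "heading": "Legal notice",
     "chunk": "Redistribution requires written consent."},
]
anchor = chunks[1]  # the matched "Reboot after installation completes." chunk

bounded = eligible_chunks(chunks, anchor, ["parent_id", "heading"])
context = " ".join(c["chunk"] for c in bounded)
```

With the example table, only the two "Install guide" chunks remain eligible. Passing an empty list of boundary fields falls back to the parent-document boundary alone, so all four chunks stay eligible.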

Logging cleanup

This PR also removes print() debug statements and downgrades several high-frequency log.info calls to log.debug (per-document score filtering, per-chunk expansion steps, cache hits).

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Opus 4.6
  • Generated by: Claude code 2.1.44

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

@mwcz mwcz force-pushed the mwcz/solr-provider-chunk-expansion-boundary-enhancement branch from 3ff95be to 5244d43 Compare February 21, 2026 19:33
Adds an optional list of chunk field names that constrain window
expansion. Context chunks are filtered to only those sharing the
same boundary field values as the anchor chunk. This allows
expansion to be bounded by heading/section in addition to parent
document.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mwcz mwcz force-pushed the mwcz/solr-provider-chunk-expansion-boundary-enhancement branch from 5244d43 to f20b55a Compare February 21, 2026 23:16