
add chunk_family_fields to okp enrichment config#1304

Merged
tisnik merged 2 commits into lightspeed-core:main from
mwcz:mwcz/solr-config-updates
Mar 11, 2026

Conversation

Contributor

@mwcz mwcz commented Mar 10, 2026

Description

In the OKP enrichment config, this sets chunk_family_fields to ["headings"], which causes the Solr provider's chunk expansion algorithm to expand only within a shared heading, lowering the likelihood of irrelevant chunks being included. I'm also marking it as a bug fix because, currently, if chunk expansion is enabled, omitting chunk_family_fields causes an error even though the field is meant to be optional.
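
To illustrate the idea of expanding only within a shared heading, here is a minimal sketch. It is not the Solr provider's actual algorithm: the function name `expand_within_family`, the `window` parameter, and the in-memory chunk dictionaries are all hypothetical and exist only for illustration. An empty `family_fields` list is treated as "no restriction", which is one way an optional field could behave.

```python
from typing import Any


def expand_within_family(
    chunks: list[dict[str, Any]],
    hit: int,
    window: int,
    family_fields: list[str],
) -> list[dict[str, Any]]:
    """Return the hit chunk plus up to `window` neighbours on each side
    that share the hit's values for every field in `family_fields`.

    Hypothetical helper: the real provider queries Solr instead of a list.
    """

    def same_family(other: dict[str, Any]) -> bool:
        # all() over an empty list is True, so no fields means no restriction.
        return all(other.get(f) == chunks[hit].get(f) for f in family_fields)

    start = hit
    while start > 0 and hit - (start - 1) <= window and same_family(chunks[start - 1]):
        start -= 1

    end = hit
    while (
        end < len(chunks) - 1
        and (end + 1) - hit <= window
        and same_family(chunks[end + 1])
    ):
        end += 1

    return chunks[start : end + 1]


# Made-up sample data: five consecutive chunks under three headings.
sample_chunks = [
    {"id": 0, "headings": "Install"},
    {"id": 1, "headings": "Install"},
    {"id": 2, "headings": "Configure"},
    {"id": 3, "headings": "Configure"},
    {"id": 4, "headings": "Upgrade"},
]

scoped = expand_within_family(sample_chunks, hit=2, window=2, family_fields=["headings"])
unscoped = expand_within_family(sample_chunks, hit=2, window=2, family_fields=[])

print([c["id"] for c in scoped])    # [2, 3]
print([c["id"] for c in unscoped])  # [0, 1, 2, 3, 4]
```

In this toy data, restricting by "headings" keeps the expansion inside the "Configure" section, while the unrestricted call also pulls in chunks from the neighbouring sections.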

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement
  • Benchmarks improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: none
  • Generated by: none

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • Improvements
    • Enhanced Solr-based search filtering by including heading metadata in chunk grouping, improving organization and relevance of search results and enabling more comprehensive content filtering.

Contributor

coderabbitai bot commented Mar 10, 2026

Walkthrough

Added a single configuration entry to the Solr vector_io provider's chunk_window_config in src/llama_stack_configuration.py, explicitly including "headings" in chunk_family_fields.

Changes

Cohort / File(s) | Summary
Configuration Update: src/llama_stack_configuration.py | Added "chunk_family_fields": ["headings"] to the Solr vector_io provider's chunk_window_config, so chunk filtering metadata now includes the headings field.
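
For orientation, the resulting entry might look roughly like the fragment below. This is a hedged excerpt, not the file's actual contents: only the two keys named in this thread are shown, and the "is_chunk:true" value is the default mentioned later in the review. Any other keys in the real chunk_window_config are omitted.

```python
# Hypothetical excerpt of the Solr provider's chunk_window_config.
# Only keys mentioned in this PR thread appear here; the real dict has more.
chunk_window_config = {
    "chunk_filter_query": "is_chunk:true",  # default referenced in the review
    "chunk_family_fields": ["headings"],  # the entry added by this PR
}
```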

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title clearly and concisely describes the main change: adding chunk_family_fields to the OKP enrichment configuration, which directly matches the single-line addition in the changeset.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/llama_stack_configuration.py (1)

402-454: ⚠️ Potential issue | 🟠 Major

Backfill chunk_family_fields on existing Solr providers too.

This only sets chunk_family_fields when the Solr provider is newly appended. If constants.SOLR_PROVIDER_ID is already present in the input config, this branch is skipped and upgraded configs keep the missing field that this PR is trying to fix.
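
A backfill along these lines could be sketched as follows. This is an illustration under stated assumptions, not the project's code: the helper name `backfill_chunk_family_fields` is hypothetical, and `SOLR_PROVIDER_ID` is assumed to equal "okp_solr", the provider id used in the test suggestion elsewhere in this thread.

```python
from typing import Any

# Assumed value of constants.SOLR_PROVIDER_ID (taken from the test snippet
# in this thread, not verified against the repository).
SOLR_PROVIDER_ID = "okp_solr"


def backfill_chunk_family_fields(ls_config: dict[str, Any]) -> None:
    """Add chunk_family_fields to existing Solr providers that lack it.

    Hypothetical helper; mutates ls_config in place.
    """
    providers = ls_config.get("providers", {}).get("vector_io", [])
    for provider in providers:
        if provider.get("provider_id") != SOLR_PROVIDER_ID:
            continue
        window = provider.setdefault("config", {}).setdefault(
            "chunk_window_config", {}
        )
        # setdefault leaves any value the user already supplied untouched.
        window.setdefault("chunk_family_fields", ["headings"])
```

Using `setdefault` rather than plain assignment means an upgraded config that already sets its own chunk_family_fields keeps that value, which seems the safer choice for a backfill.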

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llama_stack_configuration.py` around lines 402 - 454, The code only sets
chunk_family_fields when appending a new Solr provider, so existing Solr entries
keep missing chunk_family_fields; update the branch that detects
existing_providers (using existing_providers, ls_config, and
constants.SOLR_PROVIDER_ID) to iterate over ls_config["providers"]["vector_io"]
and, for any provider with provider_id == constants.SOLR_PROVIDER_ID, ensure its
config.chunk_window_config includes chunk_family_fields (e.g., add ["headings"]
if absent) and the other chunk_window_config defaults (use
solr_config.get("chunk_filter_query", "is_chunk:true") as used when appending)
so upgrades backfill the missing field for existing providers as well.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/llama_stack_configuration.py`:
- Line 445: The multiline dict entry for the key "chunk_family_fields" is
missing a trailing comma which causes Black to reformat the file; update the
dict literal containing the "chunk_family_fields": ["headings"] entry by adding
a trailing comma after the closing bracket so the line becomes
"chunk_family_fields": ["headings"], ensuring the surrounding dict (where this
key is defined) is syntactically correct and Black will pass.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6b046e6c-fabd-47bb-9b6c-73ef9877e9c3

📥 Commits

Reviewing files that changed from the base of the PR and between de8a85a and ac3205f.

📒 Files selected for processing (1)
  • src/llama_stack_configuration.py

@Anxhela21
Contributor

/lgtm

@mwcz mwcz force-pushed the mwcz/solr-config-updates branch from ac3205f to 040ff3a on March 10, 2026 18:54
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
src/llama_stack_configuration.py (1)

445-445: Add a regression check for chunk_family_fields.

The current Solr enrichment tests only verify provider/store/model IDs, so this new key can be removed or misspelled without any test failing. Please extend tests/unit/test_llama_stack_configuration.py to assert chunk_window_config["chunk_family_fields"] == ["headings"].

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llama_stack_configuration.py` at line 445, The tests don't assert the new
chunk_family_fields key so regressions can remove or misspell it; add an
assertion in tests/unit/test_llama_stack_configuration.py that verifies
chunk_window_config["chunk_family_fields"] == ["headings"]. Locate where
chunk_window_config is built or loaded in that test (search for
chunk_window_config, chunk_window, or llama_stack_configuration usage) and add a
single assertion checking the exact list value, ensuring the test fails if the
key is missing or altered.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 750b1853-0051-4634-ae9c-6af0c46632f6

📥 Commits

Reviewing files that changed from the base of the PR and between ac3205f and 040ff3a.

📒 Files selected for processing (1)
  • src/llama_stack_configuration.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
src/llama_stack_configuration.py (1)

437-446: LGTM! The trailing comma issue from the previous review has been addressed.

The addition of chunk_family_fields to chunk_window_config is correct and aligns with the PR objective to fix the error when chunk expansion is enabled without this field.

One consideration: the existing tests for enrich_solr (see tests/unit/test_llama_stack_configuration.py:406-450) validate provider/store registration but do not assert on the internal structure of chunk_window_config. Adding a test that verifies chunk_family_fields is set to ["headings"] would help prevent regressions.


Optional: Add test coverage for chunk_window_config
from typing import Any  # needed for the dict[str, Any] annotation

# enrich_solr is assumed to be imported already by the test module.


def test_enrich_solr_sets_chunk_family_fields() -> None:
    """Test enrich_solr configures chunk_family_fields in chunk_window_config."""
    ls_config: dict[str, Any] = {}
    enrich_solr(ls_config, {"enabled": True})

    # Locate the Solr provider that enrich_solr appended.
    solr_provider = next(
        p for p in ls_config["providers"]["vector_io"]
        if p["provider_id"] == "okp_solr"
    )
    chunk_window_config = solr_provider["config"]["chunk_window_config"]
    assert chunk_window_config["chunk_family_fields"] == ["headings"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llama_stack_configuration.py` around lines 437 - 446, Add a unit test
that asserts enrich_solr sets chunk_family_fields to ["headings"] in the Solr
provider's chunk_window_config: in tests/unit/test_llama_stack_configuration.py
create a test (e.g., test_enrich_solr_sets_chunk_family_fields) that calls
enrich_solr(ls_config, {"enabled": True}), locates the Solr provider via
ls_config["providers"]["vector_io"] where provider_id == "okp_solr", extracts
provider["config"]["chunk_window_config"], and asserts
chunk_window_config["chunk_family_fields"] == ["headings"] to prevent
regressions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3ff31f06-9189-4eb1-80d2-df61f981149e

📥 Commits

Reviewing files that changed from the base of the PR and between 040ff3a and ffb769a.

📒 Files selected for processing (1)
  • src/llama_stack_configuration.py

@tisnik tisnik merged commit b4daa8e into lightspeed-core:main Mar 11, 2026
20 of 23 checks passed

3 participants