LCORE-1377 Embed index name as source attribute in chunk metadata#99

Merged
tisnik merged 1 commit into lightspeed-core:main from
max-svistunov:lcore-1377-embed-index-name-in-chunk-metadata
Mar 11, 2026

@max-svistunov
Contributor

@max-svistunov max-svistunov commented Mar 10, 2026

Description

Embed the index name (e.g. "ocp-documentation") as a "source" attribute in chunk metadata during indexing, so that lightspeed-stack can determine which vector store a chunk originated from when more than one vector store is queried.
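
To illustrate why the attribute helps on the consuming side, here is a minimal sketch. It is not code from lightspeed-stack: `group_by_source` is a hypothetical helper, the plain-dict result shape is an assumption, and "ansible-documentation" is a made-up second index name.

```python
from collections import defaultdict


def group_by_source(results):
    """Group search hits by their "source" attribute.

    Hits that carry no "source" (e.g. chunks indexed before this change)
    are collected under the key None.
    """
    grouped = defaultdict(list)
    for hit in results:
        source = hit.get("attributes", {}).get("source")
        grouped[source].append(hit)
    return dict(grouped)


# Hypothetical hits merged from two vector stores plus one legacy chunk.
hits = [
    {"text": "Install the operator", "attributes": {"source": "ocp-documentation"}},
    {"text": "Run the playbook", "attributes": {"source": "ansible-documentation"}},
    {"text": "Legacy chunk", "attributes": {}},
]
by_source = group_by_source(hits)
```

Without the attribute, every hit in a multi-store query would land in the `None` bucket and the originating store could not be told apart.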

The lightspeed-stack counterpart PR is lightspeed-core/lightspeed-stack#1300

(Continuation of lightspeed-core/lightspeed-stack#1135, lightspeed-core/lightspeed-stack#1208, lightspeed-core/lightspeed-stack#1248)

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement
  • Benchmarks improvement

Tools used to create PR

  • Assisted-by: Claude Opus 4.6
  • Generated by: Claude Opus 4.6

Related Tickets & Documents

  • Related Issue # LCORE-1377
  • Closes # LCORE-1377

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  1. Start llama-stack with the faiss + sentence-transformers providers.
  2. Create a vector store via the REST API.
  3. Insert a chunk with metadata {"source": "test-index", ...} through vector_io.insert.
  4. Search the vector store using vector_stores/{id}/search.
  5. Verify that result.attributes["source"] == "test-index".
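
The check in step 5 can be expressed as a small helper. This is a sketch under assumptions: `SimpleNamespace` objects stand in for the search results, only the `attributes` field named in step 5 is modeled, and `results_with_wrong_source` is a hypothetical name, not part of llama-stack.

```python
from types import SimpleNamespace


def results_with_wrong_source(results, expected_source):
    """Return the results whose attributes lack the expected "source" value."""
    return [
        r
        for r in results
        # Treat a missing or None attributes field as an empty mapping.
        if (getattr(r, "attributes", None) or {}).get("source") != expected_source
    ]


# Stand-ins for search results; real results would come from the search call.
results = [
    SimpleNamespace(attributes={"source": "test-index"}),
    SimpleNamespace(attributes={"source": "other-index"}),
    SimpleNamespace(attributes=None),
]
offenders = results_with_wrong_source(results, "test-index")
```

An empty `offenders` list would correspond to step 5 passing for every returned result.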

Summary by CodeRabbit

  • Improvements

    • Document processor now embeds a "source" identifier for each document chunk in both manual and automatic chunking, while preserving existing metadata and attributes for improved attribution and traceability.
  • Tests

    • Added tests to confirm the "source" identifier is set on chunks/files and that existing metadata/attributes remain intact during chunking operations.

@coderabbitai

coderabbitai bot commented Mar 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e4215782-5c10-48d0-8a65-f8ce60d5241f

📥 Commits

Reviewing files that changed from the base of the PR and between 5317141 and e78b826.

📒 Files selected for processing (2)
  • src/lightspeed_rag_content/document_processor.py
  • tests/test_document_processor_llama_stack.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/test_document_processor_llama_stack.py
  • src/lightspeed_rag_content/document_processor.py

Walkthrough

Adds a "source" field containing the name of the current index to document metadata: it is merged into each chunk's metadata on the manual chunking path and added to the file attributes before vector store file creation on the automatic chunking path. Tests assert that "source" is present and that existing metadata is preserved.
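
The two paths can be sketched roughly as follows. This is an illustration, not the code in document_processor.py: both helper names are hypothetical, chunks are assumed to be dicts with a "metadata" key, and letting the index name overwrite any pre-existing "source" is an assumption of the sketch.

```python
def add_source_to_chunks(chunks, index):
    """Manual chunking path: merge "source" into each chunk's metadata.

    Returns new dicts, so the input chunks and their metadata are not mutated.
    Assumption: the index name wins over any pre-existing "source" key.
    """
    return [
        {**chunk, "metadata": {**chunk.get("metadata", {}), "source": index}}
        for chunk in chunks
    ]


def add_source_to_attributes(attributes, index):
    """Automatic chunking path: copy the file attributes and add "source"."""
    return {**(attributes or {}), "source": index}


chunks = [
    {"content": "x", "metadata": {"title": "T", "docs_url": "https://example.com"}}
]
tagged = add_source_to_chunks(chunks, "ocp-documentation")
file_attributes = add_source_to_attributes({"title": "T"}, "ocp-documentation")
```

Building new dicts with `{**old, "source": index}` keeps existing keys such as title and docs_url, which is the preservation property the tests assert.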

Changes

| Cohort / File(s) | Summary |
|---|---|
| Document Processing Enhancement: src/lightspeed_rag_content/document_processor.py | Merge a "source" field (current index) into chunk metadata for pre-chunked/manual path; add "source" to file attributes for automatic chunking before vector store file creation. |
| Test Coverage: tests/test_document_processor_llama_stack.py | Adds assertions that chunk metadata and file attributes include "source" equal to the index and that existing metadata keys (e.g., title, docs_url, document_id) are preserved. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and specifically describes the main change: embedding an index name as a 'source' attribute in chunk metadata, which is the core objective of the PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
tests/test_document_processor_llama_stack.py (1)

560-562: Add one regression check for merge behavior, not just key presence.

These assertions prove source is added, but they do not catch a future refactor that drops existing metadata while doing the merge. A small extra assertion that title/docs_url still survive, or a fixture with a pre-existing source, would lock the new contract down.

Proposed test hardening
         for chunk in call_kwargs["chunks"]:
             assert chunk["metadata"]["source"] == mock.sentinel.index
+            assert "title" in chunk["metadata"]
+            assert "docs_url" in chunk["metadata"]
+            assert chunk["chunk_metadata"]["source"].startswith("https://")

         for call in client.vector_stores.files.create.await_args_list:
             assert call.kwargs["attributes"]["source"] == mock.sentinel.index
+            assert "title" in call.kwargs["attributes"]
+            assert "docs_url" in call.kwargs["attributes"]

Also applies to: 572-574

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_document_processor_llama_stack.py` around lines 560 - 562, The
test currently only asserts that chunk["metadata"]["source"] ==
mock.sentinel.index, but doesn't verify existing metadata is preserved during
the merge; update the assertions in the loop over call_kwargs["chunks"] (and the
similar block at 572-574) to also check that pre-existing metadata keys like
"title" and "docs_url" still exist and retain their original values (e.g.,
compare chunk["metadata"]["title"] and chunk["metadata"]["docs_url"] to the
expected originals or fixture values) to ensure merging doesn't drop prior
metadata.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2ed86e71-81c2-4a36-8c34-3520c45039e3

📥 Commits

Reviewing files that changed from the base of the PR and between a3862ee and 2cb67a7.

📒 Files selected for processing (2)
  • src/lightspeed_rag_content/document_processor.py
  • tests/test_document_processor_llama_stack.py

@max-svistunov
Contributor Author

@are-ces Could you PTAL?

@max-svistunov force-pushed the lcore-1377-embed-index-name-in-chunk-metadata branch from 2cb67a7 to 5317141 (March 10, 2026 17:55)
@max-svistunov force-pushed the lcore-1377-embed-index-name-in-chunk-metadata branch from 5317141 to e78b826 (March 10, 2026 18:07)
@syedriko
Contributor

LGTM

Contributor

@are-ces are-ces left a comment

LGTM

Collaborator

@tisnik tisnik left a comment

LGTM

@tisnik tisnik merged commit c750c6f into lightspeed-core:main Mar 11, 2026
16 checks passed