LCORE-1377 Embed index name as source attribute in chunk metadata#99
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
WalkthroughAdds a "source" field containing the current document index to document metadata: merged into each chunk's metadata on manual chunking and added to file attributes before vector store creation on automatic chunking. Tests assert presence and preservation of existing metadata. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/test_document_processor_llama_stack.py (1)
560-562: Add one regression check for merge behavior, not just key presence.These assertions prove
sourceis added, but they do not catch a future refactor that drops existing metadata while doing the merge. A small extra assertion thattitle/docs_urlstill survive, or a fixture with a pre-existingsource, would lock the new contract down.Proposed test hardening
for chunk in call_kwargs["chunks"]: assert chunk["metadata"]["source"] == mock.sentinel.index + assert "title" in chunk["metadata"] + assert "docs_url" in chunk["metadata"] + assert chunk["chunk_metadata"]["source"].startswith("https://") for call in client.vector_stores.files.create.await_args_list: assert call.kwargs["attributes"]["source"] == mock.sentinel.index + assert "title" in call.kwargs["attributes"] + assert "docs_url" in call.kwargs["attributes"]Also applies to: 572-574
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_document_processor_llama_stack.py` around lines 560 - 562, The test currently only asserts that chunk["metadata"]["source"] == mock.sentinel.index, but doesn't verify existing metadata is preserved during the merge; update the assertions in the loop over call_kwargs["chunks"] (and the similar block at 572-574) to also check that pre-existing metadata keys like "title" and "docs_url" still exist and retain their original values (e.g., compare chunk["metadata"]["title"] and chunk["metadata"]["docs_url"] to the expected originals or fixture values) to ensure merging doesn't drop prior metadata.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tests/test_document_processor_llama_stack.py`:
- Around line 560-562: The test currently only asserts that
chunk["metadata"]["source"] == mock.sentinel.index, but doesn't verify existing
metadata is preserved during the merge; update the assertions in the loop over
call_kwargs["chunks"] (and the similar block at 572-574) to also check that
pre-existing metadata keys like "title" and "docs_url" still exist and retain
their original values (e.g., compare chunk["metadata"]["title"] and
chunk["metadata"]["docs_url"] to the expected originals or fixture values) to
ensure merging doesn't drop prior metadata.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2ed86e71-81c2-4a36-8c34-3520c45039e3
📒 Files selected for processing (2)
src/lightspeed_rag_content/document_processor.pytests/test_document_processor_llama_stack.py
|
@are-ces Could you PTAL? |
2cb67a7 to
5317141
Compare
5317141 to
e78b826
Compare
|
LGTM |
Description
Embed the index name as a "source" attribute in chunk metadata during indexing (e.g. "ocp-documentation"), so that lightspeed-stack is able to figure out which store a chunk originated from in case more than one vector store is queried.
The lightspeed-stack counterpart PR is lightspeed-core/lightspeed-stack#1300
(Continuation of lightspeed-core/lightspeed-stack#1135, lightspeed-core/lightspeed-stack#1208, lightspeed-core/lightspeed-stack#1248)
Type of change
Tools used to create PR
Related Tickets & Documents
Checklist before requesting a review
Testing
metadata: {"source": "test-index", ...}throughvector_io.insertvector_stores/{id}/search.result.attributes["source"]=="test-index".Summary by CodeRabbit
Improvements
Tests