Fix slack and duplication errors #2352
Conversation
Summary by CodeRabbit
Walkthrough
Deduplicate chunk texts (via a set, then converted to a list) before creating chunks and embeddings; update context command messages to “Created/updated …” / “Failed to create/update …”; remove the custom add_arguments override from the Slack message context command.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ✅ 5 checks passed
📜 Recent review details
Configuration used: Path: .coderabbit.yaml | Review profile: CHILL | Plan: Pro
📒 Files selected for processing (1)
🧰 Additional context used
🧬 Code graph analysis (1)
- backend/apps/ai/common/base/chunk_command.py (2)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
🔇 Additional comments (1)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/tests/apps/ai/common/base/chunk_command_test.py (1)

Lines 243-284: Potential over-aggressive deduplication or incorrect test setup.

In `test_process_chunks_batch_multiple_entities`, three entities are processed and each should create 2 chunks (from `mock_chunks[:2]`), totaling 6 chunks. However, the assertion now expects only 2 chunks to be saved (line 283). This could indicate:

- Bug in deduplication logic: the dict keyed by `(context_id, text)` might cause all chunks to have the same context_id (since `mock_context` is reused), and if chunk texts also collide, later chunks would overwrite earlier ones in the dict.
- Incorrect mock setup: all entities share the same `mock_context` (line 265), meaning all chunks have `context_id=1`. If the chunk texts are also identical across entities, the dict would only keep one chunk per unique text.

Root cause analysis needed. Looking at the test setup:

- All entities use the same `mock_context` (id=1).
- `mock_create_chunks.return_value = mock_chunks[:2]` returns the same chunk objects each time.
- The mock chunks have fixed `context_id=1` and fixed texts "Chunk text 1", "Chunk text 2".

Since the dict is keyed by `(context_id, text)`, and all three entities produce chunks with the same keys `(1, "Chunk text 1")` and `(1, "Chunk text 2")`, the dict only retains 2 unique entries total.

This is actually testing the deduplication behavior correctly -- the production code correctly deduplicates chunks with identical `(context_id, text)` pairs across different entities. However, this test setup obscures what's being tested.

Improve test clarity by either:

- Using distinct `mock_context` instances for each entity to test that chunks from different contexts are all saved, or
- Adding a comment explaining that this test verifies deduplication across entities with the same context.

```python
# Option 1: Test with distinct contexts
contexts = []
for i in range(3):
    ctx = Mock()
    ctx.id = i + 1
    contexts.append(ctx)

def context_side_effect(entity_type, entity_id):
    return Mock(first=Mock(return_value=contexts[entity_id - 1]))

mock_context_filter.side_effect = context_side_effect

# Then expect 6 chunks to be saved (2 per context)
assert len(bulk_save_args) == 6
```
🧹 Nitpick comments (1)

backend/apps/ai/common/base/chunk_command.py (1)

Lines 81-96: Deduplication logic correctly prevents database integrity errors.

The added deduplication logic:

- Extracts unique context_ids and texts from the batch
- Queries existing (context_id, text) pairs
- Filters out chunks that already exist
- Only bulk-saves new chunks

This correctly addresses the PR objective to fix duplication/integrity errors during chunk syncing.
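Pieced together from the points above, the flow looks roughly like the following sketch. The `batch_chunks_to_create` dict shape, the import path, and the `bulk_save` helper are assumptions drawn from this review, not the project's exact code:

```python
from apps.ai.models.chunk import Chunk  # assumed import path


def save_new_chunks(batch_chunks_to_create: dict) -> None:
    """batch_chunks_to_create maps (context_id, text) -> unsaved Chunk."""
    # Unique context ids and texts present in this batch.
    context_ids = {context_id for context_id, _ in batch_chunks_to_create}
    texts = {text for _, text in batch_chunks_to_create}

    # One query for the (context_id, text) pairs already stored.
    existing_pairs = set(
        Chunk.objects.filter(
            context_id__in=context_ids,
            text__in=texts,
        ).values_list("context_id", "text")
    )

    # Keep only chunks whose (context_id, text) pair is not in the DB yet.
    new_chunks = [
        chunk
        for key, chunk in batch_chunks_to_create.items()
        if key not in existing_pairs
    ]
    if new_chunks:
        Chunk.bulk_save(new_chunks)
```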
Performance consideration for large batches: for batches with many contexts and chunks, the query on lines 86-88 performs an IN lookup on two fields. While this is generally efficient with proper indexing, consider monitoring query performance in production if batch sizes are large.

The `unique_together = ("context", "text")` constraint in the Chunk model (from the provided snippets) ensures database-level integrity. Verify that appropriate indexes exist:

```bash
#!/bin/bash
# Check for indexes on the ai_chunks table
# This helps ensure the deduplication query is efficient

# Using Django's sqlmigrate or inspectdb to check indexes
python manage.py sqlmigrate ai <migration_number> | grep -i "create index\|unique"

# Or check current database indexes
python manage.py dbshell <<EOF
\d ai_chunks
EOF
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- backend/apps/ai/common/base/chunk_command.py (3 hunks)
- backend/apps/ai/common/base/context_command.py (1 hunks)
- backend/apps/ai/management/commands/ai_update_slack_message_context.py (0 hunks)
- backend/tests/apps/ai/common/base/chunk_command_test.py (4 hunks)
- backend/tests/apps/ai/common/base/context_command_test.py (4 hunks)
- backend/tests/apps/ai/management/commands/ai_update_committee_context_test.py (2 hunks)
- backend/tests/apps/ai/management/commands/ai_update_slack_message_context_test.py (2 hunks)
💤 Files with no reviewable changes (1)
- backend/apps/ai/management/commands/ai_update_slack_message_context.py
🧰 Additional context used
🧬 Code graph analysis (3)
- backend/apps/ai/common/base/context_command.py (1)
  - backend/apps/ai/common/base/ai_command.py (1): get_entity_key (71-73)
- backend/tests/apps/ai/common/base/chunk_command_test.py (2)
  - backend/tests/apps/ai/common/base/ai_command_test.py (2): command (19-24), mock_entity (28-33)
  - backend/apps/ai/common/base/chunk_command.py (1): process_chunks_batch (20-98)
- backend/apps/ai/common/base/chunk_command.py (1)
  - backend/apps/ai/models/chunk.py (2): Chunk (12-72), bulk_save (30-32)
🔇 Additional comments (5)
backend/apps/ai/common/base/context_command.py (1)

Lines 38-43: LGTM! Message updates accurately reflect create-or-update semantics. The updated messages correctly indicate that `Context.update_data` can either create or update a context. This improves clarity for users running the command.

backend/tests/apps/ai/management/commands/ai_update_committee_context_test.py (1)

Lines 172-174: LGTM! Test assertions correctly updated. The test assertions now match the updated messaging in the command implementation. Also applies to: 211.

backend/tests/apps/ai/common/base/context_command_test.py (1)

Line 119: LGTM! Comprehensive test assertion updates. All test assertions have been consistently updated across success, failure, and mixed scenarios to match the new context command messaging. Also applies to: 133, 187-189, 264.

backend/tests/apps/ai/management/commands/ai_update_slack_message_context_test.py (1)

Lines 75-89: Verify that removing the add_arguments override is intentional. The test now expects only 3 arguments with a batch-size default of 50 (changed from 100). This suggests the command's `add_arguments` override was removed and it now inherits the parent class's argument definitions. If the previous command had a custom batch-size default of 100 for Slack messages (potentially due to different performance characteristics), removing this override could impact existing users or scripts that relied on that default. Please confirm:

- Is this batch-size default change intentional?
- Are there any existing scripts, documentation, or user workflows that depend on the previous batch-size default of 100?

backend/apps/ai/common/base/chunk_command.py (1)

Line 23: LGTM! Dict-based accumulation prevents intra-batch duplicates. Changing from a list to a dict keyed by `(context_id, text)` ensures that within a single batch, if multiple chunks have the same context and text (even from different entities), only one is retained. This is correct behavior given the `unique_together = ("context", "text")` constraint in the Chunk model. Also applies to: 71-73.
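As a minimal, self-contained sketch of that accumulation (the `PendingChunk` stand-in is hypothetical, not the project's Chunk model):

```python
from dataclasses import dataclass


@dataclass
class PendingChunk:  # stand-in for an unsaved Chunk instance
    context_id: int
    text: str


chunks_for_batch = [
    PendingChunk(1, "alpha"),
    PendingChunk(1, "alpha"),  # duplicate within the batch
    PendingChunk(1, "beta"),
]

# Keyed by (context_id, text): a later duplicate overwrites the earlier
# entry, so each pair survives at most once within the batch.
batch_chunks_to_create = {(c.context_id, c.text): c for c in chunks_for_batch}

assert len(batch_chunks_to_create) == 2
```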
```diff
 if batch_chunks_to_create:
     Chunk.bulk_save(batch_chunks_to_create)
+    context_ids = {context_id for context_id, _ in batch_chunks_to_create}
```
Any reason to not use (again) our backend-wide approach for handling model data -- model.update_data + model.bulk_save instead?
I'm using bulk save; the integrity error was due to duplication. I had two options: one was to check for duplicates this way, and the other was to set ignore_conflicts=True in the bulk save.
You already have a check in the Chunk::update_data method. Why isn't it enough, and why can't it be handled there?
The update_data method does check for duplicates, but the integrity error is caused by a race condition: while updating, we delete the chunks that need to be updated.

So to avoid the race condition we are doing this.

Setting ignore_conflicts=True in the bulk_save was an option too -- I read about it on Stack Overflow -- but I don't think that is viable.

What do you think?
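For comparison, the ignore_conflicts alternative discussed here would look roughly like this (a hedged sketch; whether `Chunk.bulk_save` delegates to `bulk_create` this way, and the `new_chunks` name, are assumptions):

```python
# Let the database silently drop rows that violate the
# unique_together ("context", "text") constraint. Caveat: conflicting
# rows are skipped without any signal, and on most backends the
# returned objects do not get primary keys set.
Chunk.objects.bulk_create(new_chunks, ignore_conflicts=True)
```

Pre-filtering with an explicit existence query keeps the skipped duplicates observable, which is one plausible reason to prefer it over this one-liner.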
self.stdout.write(f"Created context for {entity_key}") | ||
self.stdout.write(f"Created/updated context for {entity_key}") | ||
else: | ||
entity_key = self.get_entity_key(entity) |
This line is repetitive.
fixed
```diff
 chunk_texts = Chunk.split_text(full_content)
 if not chunk_texts:
+unique_chunk_texts = list(dict.fromkeys(chunk_texts))
```
Could you explain why you need this fancy stuff instead of using a set? Does the order matter here?
No, the order does not matter -- I am using a set now, it is fixed.
Would just `set(chunk_texts)` work here without converting it to a list? It doesn't look right at all from a performance perspective.
The OpenAI embeddings API only accepts `list[str]` or `str` as input, so we need to convert it to a list.

Reference doc: https://platform.openai.com/docs/api-reference/embeddings/create?utm_source=chatgpt.com
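For illustration, a minimal sketch of the call being discussed, using the standard openai-python client (the model name and sample texts are examples, not the project's configuration):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

chunk_texts = ["alpha", "beta", "alpha"]  # example input with a duplicate

# Deduplicate, then convert back to a list: the request body is JSON,
# and a Python set is not JSON-serializable.
unique_chunk_texts = list(set(chunk_texts))

response = client.embeddings.create(
    model="text-embedding-3-small",  # example model name
    input=unique_chunk_texts,
)
embeddings = [item.embedding for item in response.data]
```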
I wouldn't be so sure about that -- https://github.com/openai/openai-python/blob/d69edeb39e6a9cc8d9822a2838b10ab4102b4cc6/src/openai/resources/embeddings.py#L50
Actionable comments posted: 0
🧹 Nitpick comments (2)
backend/tests/apps/ai/common/base/chunk_command_test.py (2)
Lines 224-228: Consider adding a length assertion for robustness. While the set comparison correctly verifies that all unique chunks are present, it doesn't confirm the exact count. Adding a length check would strengthen the assertion and catch edge cases where duplicates might slip through.

```diff
 assert result == 1
 _, kwargs = mock_create_chunks.call_args
 assert set(kwargs["chunk_texts"]) == {"chunk1", "chunk2", "chunk3"}
+assert len(kwargs["chunk_texts"]) == 3
 assert kwargs["context"] == mock_context
 assert kwargs["openai_client"] == command.openai_client
 assert kwargs["save"] is False
```
Lines 463-464: Consider adding a length assertion here too. Similar to the suggestion for line 225, adding a length check here would make the assertion more robust and explicit about the expected deduplication result.

```diff
 assert result == 1
 mock_split_text.assert_called_once()
 _, kwargs = mock_create_chunks.call_args
 assert set(kwargs["chunk_texts"]) == {"chunk1", "chunk2", "chunk3"}
+assert len(kwargs["chunk_texts"]) == 3
 assert kwargs["context"] == mock_context
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- backend/apps/ai/common/base/chunk_command.py (1 hunks)
- backend/tests/apps/ai/common/base/chunk_command_test.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/apps/ai/common/base/chunk_command.py
🧰 Additional context used
🧬 Code graph analysis (1)
- backend/tests/apps/ai/common/base/chunk_command_test.py (3)
  - backend/tests/apps/ai/common/base/context_command_test.py (3): command (26-34), mock_entity (38-44), mock_context (48-55)
  - backend/tests/apps/ai/common/base/ai_command_test.py (2): command (19-24), mock_entity (28-33)
  - backend/apps/ai/common/base/chunk_command.py (1): process_chunks_batch (20-84)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Run frontend e2e tests
🔇 Additional comments (2)
backend/tests/apps/ai/common/base/chunk_command_test.py (2)
Lines 433-469: Well-implemented test for deduplication logic! This test effectively verifies that duplicate chunk texts from `split_text` are filtered before processing. The test setup is clear, follows existing patterns, and properly validates that only unique chunks are passed to `create_chunks_and_embeddings`.
Lines 214-469: Inconsistency between summary and implementation regarding order preservation. The AI summary states "Deduplicate chunk texts (order-preserving) before creating chunks", but the implementation uses `list(set(chunk_texts))`, which does not preserve the original order of elements. In Python, `set()` has no guaranteed ordering, and `list(set(...))` will produce an arbitrary order rather than preserving the first-occurrence order from `chunk_texts`.

For text chunks representing parts of a document, preserving order might be semantically important to maintain the document's flow. If order preservation is intended, consider using an order-preserving deduplication approach in the implementation:

```python
# Option 1: Using dict.fromkeys (Python 3.7+)
unique_chunk_texts = list(dict.fromkeys(chunk_texts))

# Option 2: Manual filtering
seen = set()
unique_chunk_texts = [x for x in chunk_texts if not (x in seen or seen.add(x))]
```

If order doesn't matter, the current implementation is fine, but the summary description should be clarified.

Based on learnings from the codebase context, please verify whether chunk ordering is semantically important for the embedding and retrieval process.
Proposed change
Resolves #2342
Fixed the Slack errors and the chunk duplication/integrity errors.
Checklist

- Ran `make check-test` locally; all checks and tests passed.