
Conversation

Dishant1804 (Collaborator)

Proposed change

Resolves #2342

Fixed the Slack errors and the integrity errors of the chunks.

Checklist

  • I've read and followed the contributing guidelines.
  • I've run make check-test locally; all checks and tests passed.

coderabbitai bot (Contributor) commented Oct 2, 2025

Summary by CodeRabbit

  • Refactor
    • Deduplicates chunk content before processing, reducing redundant chunks and speeding up runs.
    • Updates log messages to say “Created/updated context” and “Failed to create/update context” for clearer status reporting.
  • Chores
    • Simplifies the Slack message context update command by removing custom CLI flags; only default options are supported now.
  • Tests
    • Expanded tests to cover duplicate filtering in chunk processing.
    • Updated expectations to match the new “Created/updated” messaging and streamlined CLI argument set.

Walkthrough

Deduplicate chunk texts (via a set, then converted to list) before creating chunks and embeddings; update context command messages to “Created/updated …” / “Failed to create/update …”; remove the custom add_arguments override from the Slack message context command; update tests accordingly.
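
In code terms, the deduplication amounts to a small change in the batch-processing loop. A minimal sketch, using names taken from the diff hunks quoted later in this conversation (illustrative, not the exact implementation):

# Sketch of the deduplication step in process_chunks_batch (illustrative).
chunk_texts = Chunk.split_text(full_content)
unique_chunk_texts = set(chunk_texts)  # drop within-batch duplicates before embedding
if not unique_chunk_texts:
    continue  # nothing to embed for this entity
chunks = create_chunks_and_embeddings(
    chunk_texts=list(unique_chunk_texts),  # the function contract expects a list
    context=context,
    openai_client=self.openai_client,
    save=False,
)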

Changes

Chunk deduplication & tests
Files: backend/apps/ai/common/base/chunk_command.py, backend/tests/apps/ai/common/base/chunk_command_test.py
Summary: Replace direct chunk_texts usage with a deduplicated unique_chunk_texts (set → list) passed to create_chunks_and_embeddings; skip processing when the deduplicated set is empty; add test_process_chunks_batch_with_duplicates; tests updated to patch Chunk.objects.filter and assert deduplication.

Context command messaging & tests
Files: backend/apps/ai/common/base/context_command.py, backend/tests/apps/ai/common/base/context_command_test.py, backend/tests/apps/ai/management/commands/ai_update_committee_context_test.py
Summary: Update user-facing log strings from “Created context …” / “Failed to create context …” to “Created/updated context …” / “Failed to create/update context …”; adjust test assertions to match the new messages.

Slack message context CLI args & tests
Files: backend/apps/ai/management/commands/ai_update_slack_message_context.py, backend/tests/apps/ai/management/commands/ai_update_slack_message_context_test.py
Summary: Remove the command’s add_arguments override (custom CLI flags removed); tests updated to expect only the parent-class arguments (3 additions) and the same default batch size.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

nestbot

Suggested reviewers

  • kasya

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)

  • Title Check: ✅ Passed. The title clearly highlights the two main areas addressed by the changeset—Slack-related errors and duplication errors—using concise and specific language that aligns with the PR’s primary focus.
  • Linked Issues Check: ✅ Passed. The pull request implements deduplication of chunk texts to prevent duplicate entries and adds corresponding test coverage, directly addressing the duplication errors outlined in issue #2342.
  • Out of Scope Changes Check: ✅ Passed. All modifications, including deduplication logic in chunk processing, updated context messaging, Slack sync command adjustments, and aligned tests, relate to preventing duplication and fixing context synchronization issues without introducing unrelated functionality.
  • Description Check: ✅ Passed. The description directly references resolving issue #2342 and summarizes the fixes for Slack errors and chunk integrity problems, demonstrating clear relevance to the changeset without diverging into unrelated topics.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b589d9f and 0c87471.

📒 Files selected for processing (1)
  • backend/apps/ai/common/base/chunk_command.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
backend/apps/ai/common/base/chunk_command.py (2)
backend/apps/ai/models/chunk.py (2)
  • Chunk (12-72)
  • split_text (35-42)
backend/apps/ai/common/utils.py (1)
  • create_chunks_and_embeddings (19-70)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Run frontend unit tests
  • GitHub Check: Run frontend e2e tests
  • GitHub Check: Run backend tests
🔇 Additional comments (1)
backend/apps/ai/common/base/chunk_command.py (1)

60-65: LGTM! Deduplication prevents integrity errors from duplicate chunk texts.

The set-based deduplication correctly addresses the integrity error issue described in #2342. Without this, if Chunk.split_text returns duplicate texts within the same batch, the bulk save operation would fail due to the unique_together = ("context", "text") constraint on the Chunk model. Since chunks are deleted before creation (line 44), the database check in Chunk.update_data cannot catch within-batch duplicates.

This approach also optimizes by avoiding unnecessary OpenAI API calls for duplicate chunks.

The conversion to list() on line 65 maintains compatibility with the create_chunks_and_embeddings function signature. While the underlying OpenAI API accepts iterables (as noted in past review comments), keeping the list conversion ensures type safety and matches the current function contract.
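
For reference, this is roughly the model constraint the comment relies on; the field definitions below are assumptions inferred from this review (the real model lives in backend/apps/ai/models/chunk.py):

# Illustrative sketch of the Chunk uniqueness constraint (assumed field definitions).
from django.db import models

class Chunk(models.Model):
    context = models.ForeignKey("ai.Context", on_delete=models.CASCADE)
    text = models.TextField()

    class Meta:
        unique_together = ("context", "text")  # duplicate (context, text) rows raise IntegrityError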



coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/tests/apps/ai/common/base/chunk_command_test.py (1)

243-284: Potentially over-aggressive deduplication or incorrect test setup.

In test_process_chunks_batch_multiple_entities, three entities are processed, and each should create 2 chunks (from mock_chunks[:2]), totaling 6 chunks. However, the assertion now expects only 2 chunks to be saved (Line 283).

This could indicate:

  1. Bug in deduplication logic: The dict keying by (context_id, text) might cause all chunks to have the same context_id (since mock_context is reused), and if chunk texts also collide, later chunks would overwrite earlier ones in the dict.
  2. Incorrect mock setup: All entities share the same mock_context (Line 265), meaning all chunks have context_id=1. If the chunk texts are also identical across entities, the dict would only keep one chunk per unique text.

Root cause analysis needed:

Looking at the test setup:

  • All entities use the same mock_context (id=1)
  • mock_create_chunks.return_value = mock_chunks[:2] returns the same chunk objects each time
  • The mock chunks have fixed context_id=1 and fixed texts "Chunk text 1", "Chunk text 2"

Since the dict is keyed by (context_id, text), and all three entities produce chunks with the same keys (1, "Chunk text 1") and (1, "Chunk text 2"), the dict only retains 2 unique entries total.

This actually tests the deduplication behavior correctly: the production code deduplicates chunks with identical (context_id, text) pairs across different entities. However, the test setup obscures what is being tested.

Improve test clarity by either:

  1. Using distinct mock_context instances for each entity to test that chunks from different contexts are all saved, or
  2. Adding a comment explaining that this test verifies deduplication across entities with the same context
# Option 1: Test with distinct contexts
contexts = []
for i in range(3):
    ctx = Mock()
    ctx.id = i + 1
    contexts.append(ctx)

def context_side_effect(entity_type, entity_id):
    return Mock(first=Mock(return_value=contexts[entity_id - 1]))

mock_context_filter.side_effect = context_side_effect

# Then expect 6 chunks to be saved (2 per context)
assert len(bulk_save_args) == 6
🧹 Nitpick comments (1)
backend/apps/ai/common/base/chunk_command.py (1)

81-96: Deduplication logic correctly prevents database integrity errors.

The added deduplication logic:

  1. Extracts unique context_ids and texts from the batch
  2. Queries existing (context_id, text) pairs
  3. Filters out chunks that already exist
  4. Only bulk-saves new chunks

This correctly addresses the PR objective to fix duplication/integrity errors during chunk syncing.

Performance consideration for large batches:

For batches with many contexts and chunks, the query on lines 86-88 performs an IN lookup on two fields. While this is generally efficient with proper indexing, consider monitoring query performance in production if batch sizes are large.

The unique_together = ("context", "text") constraint in the Chunk model (from the provided snippets) ensures database-level integrity. Verify that appropriate indexes exist:

#!/bin/bash
# Check for indexes on the ai_chunks table
# This helps ensure the deduplication query is efficient

# Using Django's sqlmigrate or inspectdb to check indexes
python manage.py sqlmigrate ai <migration_number> | grep -i "create index\|unique"

# Or check current database indexes
python manage.py dbshell <<EOF
\d ai_chunks
EOF
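
Spelled out, the four steps above could look roughly like this; variable names are assumptions rather than the actual code:

# Illustrative sketch of the batch-level duplicate filtering (assumed names).
existing_pairs = set(
    Chunk.objects.filter(
        context_id__in={chunk.context_id for chunk in batch_chunks},
        text__in={chunk.text for chunk in batch_chunks},
    ).values_list("context_id", "text")
)
new_chunks = [
    chunk
    for chunk in batch_chunks
    if (chunk.context_id, chunk.text) not in existing_pairs
]
if new_chunks:
    Chunk.bulk_save(new_chunks)  # persist only rows not already in the database

Note that the two IN lookups can match unrelated cross-combinations of context_id and text; the in-memory pair check afterwards is what makes the filter exact.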
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0f839b9 and 833afb5.

📒 Files selected for processing (7)
  • backend/apps/ai/common/base/chunk_command.py (3 hunks)
  • backend/apps/ai/common/base/context_command.py (1 hunks)
  • backend/apps/ai/management/commands/ai_update_slack_message_context.py (0 hunks)
  • backend/tests/apps/ai/common/base/chunk_command_test.py (4 hunks)
  • backend/tests/apps/ai/common/base/context_command_test.py (4 hunks)
  • backend/tests/apps/ai/management/commands/ai_update_committee_context_test.py (2 hunks)
  • backend/tests/apps/ai/management/commands/ai_update_slack_message_context_test.py (2 hunks)
💤 Files with no reviewable changes (1)
  • backend/apps/ai/management/commands/ai_update_slack_message_context.py
🧰 Additional context used
🧬 Code graph analysis (3)
backend/apps/ai/common/base/context_command.py (1)
backend/apps/ai/common/base/ai_command.py (1)
  • get_entity_key (71-73)
backend/tests/apps/ai/common/base/chunk_command_test.py (2)
backend/tests/apps/ai/common/base/ai_command_test.py (2)
  • command (19-24)
  • mock_entity (28-33)
backend/apps/ai/common/base/chunk_command.py (1)
  • process_chunks_batch (20-98)
backend/apps/ai/common/base/chunk_command.py (1)
backend/apps/ai/models/chunk.py (2)
  • Chunk (12-72)
  • bulk_save (30-32)
🔇 Additional comments (5)
backend/apps/ai/common/base/context_command.py (1)

38-43: LGTM! Message updates accurately reflect create-or-update semantics.

The updated messages correctly indicate that Context.update_data can either create or update a context. This improves clarity for users running the command.

backend/tests/apps/ai/management/commands/ai_update_committee_context_test.py (1)

172-174: LGTM! Test assertions correctly updated.

The test assertions now match the updated messaging in the command implementation.

Also applies to: 211-211

backend/tests/apps/ai/common/base/context_command_test.py (1)

119-119: LGTM! Comprehensive test assertion updates.

All test assertions have been consistently updated across success, failure, and mixed scenarios to match the new context command messaging.

Also applies to: 133-133, 187-189, 264-264

backend/tests/apps/ai/management/commands/ai_update_slack_message_context_test.py (1)

75-89: Verify that removing the add_arguments override is intentional.

The test now expects only 3 arguments with a batch-size default of 50 (changed from 100). This suggests the command's add_arguments override was removed, and it now inherits the parent class's argument definitions.

If the previous command had a custom batch-size default of 100 for Slack messages (potentially due to different performance characteristics), removing this override could impact existing users or scripts that relied on this default behavior.

Please confirm:

  1. Is this batch-size default change intentional?
  2. Are there any existing scripts, documentation, or user workflows that depend on the previous batch-size default of 100?
backend/apps/ai/common/base/chunk_command.py (1)

23-23: LGTM! Dict-based accumulation prevents intra-batch duplicates.

Changing from a list to a dict keyed by (context_id, text) ensures that within a single batch, if multiple chunks have the same context and text (even from different entities), only one is retained. This is correct behavior given the unique_together = ("context", "text") constraint in the Chunk model.

Also applies to: 71-73
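
A minimal sketch of that dict-based accumulation, with assumed variable names:

# Keying by (context_id, text) retains at most one chunk per unique pair in the batch.
batch_chunks_to_create: dict[tuple[int, str], Chunk] = {}
for chunk in new_chunks:
    batch_chunks_to_create[(chunk.context_id, chunk.text)] = chunk  # later duplicates overwrite earlier ones
Chunk.bulk_save(list(batch_chunks_to_create.values()))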


if batch_chunks_to_create:
    Chunk.bulk_save(batch_chunks_to_create)
    context_ids = {context_id for context_id, _ in batch_chunks_to_create}
Collaborator

Any reason not to use (again) our backend-wide approach for handling model data -- model.update_data + model.bulk_save instead?

Collaborator Author

I'm using bulk save; the integrity error was due to duplication. I had two options: one was to check for duplicates this way, and the other was to set ignore_conflicts=True in the bulk save.

Collaborator

You already have a check in the Chunk::update_data method. Why isn't it enough, and why can't this be handled there?

Dishant1804 (Collaborator Author) commented Oct 3, 2025

The update_data method checks for duplicates, but the integrity error is caused by a race condition: while updating, we delete the chunks that are to be updated.

So we are doing this to avoid the race condition. Setting ignore_conflicts=True in bulk_save was an option too -- I read about it on Stack Overflow -- but I don't think that is viable. What do you think?

self.stdout.write(f"Created context for {entity_key}")
self.stdout.write(f"Created/updated context for {entity_key}")
else:
entity_key = self.get_entity_key(entity)
Collaborator

This line is repetitive.

Dishant1804 (Collaborator Author) commented Oct 3, 2025

fixed


chunk_texts = Chunk.split_text(full_content)
if not chunk_texts:
unique_chunk_texts = list(dict.fromkeys(chunk_texts))
Collaborator

Could you explain why you need this fancy stuff instead of using a set? Does the order matter here?

Collaborator Author

No, the order does not matter -- I am using a set now, so it is fixed.

Collaborator

Would just set(chunk_texts) work here without converting it to a list? It doesn't look right at all from a performance perspective.

Collaborator Author

The OpenAI embeddings API only accepts list[str] or str as input, so we need to convert it to a list.

Reference doc: https://platform.openai.com/docs/api-reference/embeddings/create?utm_source=chatgpt.com
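
That constraint is why the deduplicated set has to be converted back to a list before the call. A minimal sketch against the OpenAI Python SDK (the model name is an assumption):

from openai import OpenAI

client = OpenAI()
unique_chunk_texts = {"chunk1", "chunk2", "chunk3"}  # deduplicated set
response = client.embeddings.create(
    model="text-embedding-3-small",  # assumed model name
    input=list(unique_chunk_texts),  # the endpoint accepts str or list[str], not a set
)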

@Dishant1804 Dishant1804 requested a review from arkid15r October 7, 2025 05:47
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
backend/tests/apps/ai/common/base/chunk_command_test.py (2)

224-228: Consider adding length assertion for robustness.

While the set comparison correctly verifies that all unique chunks are present, it doesn't confirm the exact count. Adding a length check would strengthen the assertion and catch edge cases where duplicates might slip through.

         assert result == 1
         _, kwargs = mock_create_chunks.call_args
         assert set(kwargs["chunk_texts"]) == {"chunk1", "chunk2", "chunk3"}
+        assert len(kwargs["chunk_texts"]) == 3
         assert kwargs["context"] == mock_context
         assert kwargs["openai_client"] == command.openai_client
         assert kwargs["save"] is False

463-464: Consider adding length assertion here too.

Similar to the suggestion for line 225, adding a length check here would make the assertion more robust and explicit about the expected deduplication result.

         assert result == 1
         mock_split_text.assert_called_once()
         _, kwargs = mock_create_chunks.call_args
         assert set(kwargs["chunk_texts"]) == {"chunk1", "chunk2", "chunk3"}
+        assert len(kwargs["chunk_texts"]) == 3
         assert kwargs["context"] == mock_context
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b2c6eb8 and b589d9f.

📒 Files selected for processing (2)
  • backend/apps/ai/common/base/chunk_command.py (1 hunks)
  • backend/tests/apps/ai/common/base/chunk_command_test.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/apps/ai/common/base/chunk_command.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/tests/apps/ai/common/base/chunk_command_test.py (3)
backend/tests/apps/ai/common/base/context_command_test.py (3)
  • command (26-34)
  • mock_entity (38-44)
  • mock_context (48-55)
backend/tests/apps/ai/common/base/ai_command_test.py (2)
  • command (19-24)
  • mock_entity (28-33)
backend/apps/ai/common/base/chunk_command.py (1)
  • process_chunks_batch (20-84)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Run frontend e2e tests
🔇 Additional comments (2)
backend/tests/apps/ai/common/base/chunk_command_test.py (2)

433-469: Well-implemented test for deduplication logic!

This test effectively verifies that duplicate chunk texts from split_text are filtered before processing. The test setup is clear, follows existing patterns, and properly validates that only unique chunks are passed to create_chunks_and_embeddings.


214-469: Inconsistency between summary and implementation regarding order preservation.

The AI summary states "Deduplicate chunk texts (order-preserving) before creating chunks", but the implementation uses list(set(chunk_texts)), which does not preserve the original order of elements. In Python, set() has no guaranteed ordering, and list(set(...)) will produce an arbitrary order rather than preserving the first-occurrence order from chunk_texts.

For text chunks representing parts of a document, preserving order might be semantically important to maintain the document's flow. If order preservation is intended, consider using an order-preserving deduplication approach in the implementation:

# Option 1: Using dict.fromkeys (Python 3.7+)
unique_chunk_texts = list(dict.fromkeys(chunk_texts))

# Option 2: Manual filtering
seen = set()
unique_chunk_texts = [x for x in chunk_texts if not (x in seen or seen.add(x))]

If order doesn't matter, the current implementation is fine, but the summary description should be clarified.

Based on learnings from the codebase context, please verify whether chunk ordering is semantically important for the embedding and retrieval process.

sonarqubecloud bot commented Oct 8, 2025

@arkid15r arkid15r enabled auto-merge (squash) October 8, 2025 21:01
@arkid15r arkid15r merged commit 385ca51 into OWASP:feature/nestbot-ai-assistant Oct 8, 2025
25 checks passed
