Fix slack and duplication errors #2352
Conversation
Summary by CodeRabbit
Walkthrough
Deduplicate chunk texts (via a set, then converted to a list) before creating chunks and embeddings; update context command messages to “Created/updated …” / “Failed to create/update …”; remove the custom add_arguments override from the Slack message context command.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ✅ 5 checks passed
📜 Recent review details
Configuration used: Path: .coderabbit.yaml | Review profile: CHILL | Plan: Pro
📒 Files selected for processing (1)
🧰 Additional context used
🧬 Code graph analysis (1)
- backend/apps/ai/common/base/chunk_command.py (2)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
🔇 Additional comments (1)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/tests/apps/ai/common/base/chunk_command_test.py (1)

Lines 243-284: Potential over-aggressive deduplication or incorrect test setup.

In `test_process_chunks_batch_multiple_entities`, three entities are processed and each should create 2 chunks (from `mock_chunks[:2]`), totaling 6 chunks. However, the assertion now expects only 2 chunks to be saved (line 283). This could indicate:

- Bug in deduplication logic: the dict keyed by `(context_id, text)` might cause all chunks to have the same context_id (since `mock_context` is reused), and if chunk texts also collide, later chunks would overwrite earlier ones in the dict.
- Incorrect mock setup: all entities share the same `mock_context` (line 265), meaning all chunks have `context_id=1`. If the chunk texts are also identical across entities, the dict would only keep one chunk per unique text.

Root cause analysis needed. Looking at the test setup:

- All entities use the same `mock_context` (id=1).
- `mock_create_chunks.return_value = mock_chunks[:2]` returns the same chunk objects each time.
- The mock chunks have fixed `context_id=1` and fixed texts "Chunk text 1", "Chunk text 2".

Since the dict is keyed by `(context_id, text)`, and all three entities produce chunks with the same keys `(1, "Chunk text 1")` and `(1, "Chunk text 2")`, the dict only retains 2 unique entries total.

This is actually testing the deduplication behavior correctly -- the production code correctly deduplicates chunks with identical `(context_id, text)` pairs across different entities. However, this test setup obscures what's being tested.

Improve test clarity by either:

- Using distinct `mock_context` instances for each entity to test that chunks from different contexts are all saved, or
- Adding a comment explaining that this test verifies deduplication across entities with the same context.

```python
# Option 1: Test with distinct contexts
contexts = []
for i in range(3):
    ctx = Mock()
    ctx.id = i + 1
    contexts.append(ctx)

def context_side_effect(entity_type, entity_id):
    return Mock(first=Mock(return_value=contexts[entity_id - 1]))

mock_context_filter.side_effect = context_side_effect

# Then expect 6 chunks to be saved (2 per context)
assert len(bulk_save_args) == 6
```
🧹 Nitpick comments (1)

backend/apps/ai/common/base/chunk_command.py (1)

Lines 81-96: Deduplication logic correctly prevents database integrity errors.

The added deduplication logic:

- Extracts unique context_ids and texts from the batch
- Queries existing (context_id, text) pairs
- Filters out chunks that already exist
- Only bulk-saves new chunks

This correctly addresses the PR objective to fix duplication/integrity errors during chunk syncing.
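Pieced together from the points above, the flow looks roughly like the following sketch. The `batch_chunks_to_create` dict shape, the import path, and the `bulk_save` helper are assumptions drawn from this review, not the project's exact code:

```python
from apps.ai.models.chunk import Chunk  # assumed import path


def save_new_chunks(batch_chunks_to_create: dict) -> None:
    """batch_chunks_to_create maps (context_id, text) -> unsaved Chunk."""
    # Unique context ids and texts present in this batch.
    context_ids = {context_id for context_id, _ in batch_chunks_to_create}
    texts = {text for _, text in batch_chunks_to_create}

    # One query for the (context_id, text) pairs already stored.
    existing_pairs = set(
        Chunk.objects.filter(
            context_id__in=context_ids,
            text__in=texts,
        ).values_list("context_id", "text")
    )

    # Keep only chunks whose (context_id, text) pair is not in the DB yet.
    new_chunks = [
        chunk
        for key, chunk in batch_chunks_to_create.items()
        if key not in existing_pairs
    ]
    if new_chunks:
        Chunk.bulk_save(new_chunks)
```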
Performance consideration for large batches: for batches with many contexts and chunks, the query on lines 86-88 performs an IN lookup on two fields. While this is generally efficient with proper indexing, consider monitoring query performance in production if batch sizes are large.

The `unique_together = ("context", "text")` constraint in the Chunk model (from the provided snippets) ensures database-level integrity. Verify that appropriate indexes exist:

```bash
#!/bin/bash
# Check for indexes on the ai_chunks table
# This helps ensure the deduplication query is efficient

# Using Django's sqlmigrate or inspectdb to check indexes
python manage.py sqlmigrate ai <migration_number> | grep -i "create index\|unique"

# Or check current database indexes
python manage.py dbshell <<EOF
\d ai_chunks
EOF
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- backend/apps/ai/common/base/chunk_command.py (3 hunks)
- backend/apps/ai/common/base/context_command.py (1 hunks)
- backend/apps/ai/management/commands/ai_update_slack_message_context.py (0 hunks)
- backend/tests/apps/ai/common/base/chunk_command_test.py (4 hunks)
- backend/tests/apps/ai/common/base/context_command_test.py (4 hunks)
- backend/tests/apps/ai/management/commands/ai_update_committee_context_test.py (2 hunks)
- backend/tests/apps/ai/management/commands/ai_update_slack_message_context_test.py (2 hunks)
💤 Files with no reviewable changes (1)
- backend/apps/ai/management/commands/ai_update_slack_message_context.py
🧰 Additional context used
🧬 Code graph analysis (3)
- backend/apps/ai/common/base/context_command.py (1)
  - backend/apps/ai/common/base/ai_command.py (1): get_entity_key (71-73)
- backend/tests/apps/ai/common/base/chunk_command_test.py (2)
  - backend/tests/apps/ai/common/base/ai_command_test.py (2): command (19-24), mock_entity (28-33)
  - backend/apps/ai/common/base/chunk_command.py (1): process_chunks_batch (20-98)
- backend/apps/ai/common/base/chunk_command.py (1)
  - backend/apps/ai/models/chunk.py (2): Chunk (12-72), bulk_save (30-32)
🔇 Additional comments (5)
backend/apps/ai/common/base/context_command.py (1)

Lines 38-43: LGTM! Message updates accurately reflect create-or-update semantics. The updated messages correctly indicate that `Context.update_data` can either create or update a context. This improves clarity for users running the command.

backend/tests/apps/ai/management/commands/ai_update_committee_context_test.py (1)

Lines 172-174: LGTM! Test assertions correctly updated. The test assertions now match the updated messaging in the command implementation. Also applies to: 211.

backend/tests/apps/ai/common/base/context_command_test.py (1)

Line 119: LGTM! Comprehensive test assertion updates. All test assertions have been consistently updated across success, failure, and mixed scenarios to match the new context command messaging. Also applies to: 133, 187-189, 264.

backend/tests/apps/ai/management/commands/ai_update_slack_message_context_test.py (1)

Lines 75-89: Verify that removing the add_arguments override is intentional. The test now expects only 3 arguments with a batch-size default of 50 (changed from 100). This suggests the command's `add_arguments` override was removed and it now inherits the parent class's argument definitions. If the previous command had a custom batch-size default of 100 for Slack messages (potentially due to different performance characteristics), removing this override could impact existing users or scripts that relied on that default. Please confirm:

- Is this batch-size default change intentional?
- Are there any existing scripts, documentation, or user workflows that depend on the previous batch-size default of 100?

backend/apps/ai/common/base/chunk_command.py (1)

Line 23: LGTM! Dict-based accumulation prevents intra-batch duplicates. Changing from a list to a dict keyed by `(context_id, text)` ensures that within a single batch, if multiple chunks have the same context and text (even from different entities), only one is retained. This is correct behavior given the `unique_together = ("context", "text")` constraint in the Chunk model. Also applies to: 71-73.
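As a minimal, self-contained sketch of that accumulation (the `PendingChunk` stand-in is hypothetical, not the project's Chunk model):

```python
from dataclasses import dataclass


@dataclass
class PendingChunk:  # stand-in for an unsaved Chunk instance
    context_id: int
    text: str


chunks_for_batch = [
    PendingChunk(1, "alpha"),
    PendingChunk(1, "alpha"),  # duplicate within the batch
    PendingChunk(1, "beta"),
]

# Keyed by (context_id, text): a later duplicate overwrites the earlier
# entry, so each pair survives at most once within the batch.
batch_chunks_to_create = {(c.context_id, c.text): c for c in chunks_for_batch}

assert len(batch_chunks_to_create) == 2
```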
```diff
 if batch_chunks_to_create:
     Chunk.bulk_save(batch_chunks_to_create)
+    context_ids = {context_id for context_id, _ in batch_chunks_to_create}
```
Any reason to not use (again) our backend-wide approach for handling model data -- model.update_data + model.bulk_save instead?
I'm using bulk save; the integrity error was due to duplication. I had two options: one was to check for duplicates this way, and the other was to set ignore_conflicts=True in the bulk save.
You already have a check in the Chunk::update_data method. Why isn't it enough, and why can't it be handled there?
The update_data method does check for duplicates, but the integrity error is caused by a race condition: while updating, we delete the chunks that need to be updated.

So to avoid the race condition we are doing this.

Setting ignore_conflicts=True in the bulk_save was an option too -- I read about it on Stack Overflow -- but I don't think that is viable.

What do you think?
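For comparison, the ignore_conflicts alternative discussed here would look roughly like this (a hedged sketch; whether `Chunk.bulk_save` delegates to `bulk_create` this way, and the `new_chunks` name, are assumptions):

```python
# Let the database silently drop rows that violate the
# unique_together ("context", "text") constraint. Caveat: conflicting
# rows are skipped without any signal, and on most backends the
# returned objects do not get primary keys set.
Chunk.objects.bulk_create(new_chunks, ignore_conflicts=True)
```

Pre-filtering with an explicit existence query keeps the skipped duplicates observable, which is one plausible reason to prefer it over this one-liner.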
self.stdout.write(f"Created context for {entity_key}") | ||
self.stdout.write(f"Created/updated context for {entity_key}") | ||
else: | ||
entity_key = self.get_entity_key(entity) |
This line is repetitive.
fixed
```diff
 chunk_texts = Chunk.split_text(full_content)
 if not chunk_texts:
+unique_chunk_texts = list(dict.fromkeys(chunk_texts))
```
Could you explain why you need this fancy stuff instead of using a set? Does the order matter here?
No, the order does not matter -- I am using a set now, it is fixed.
Would just `set(chunk_texts)` work here without converting it to a list? It doesn't look right at all from a performance perspective.
The OpenAI embeddings API only accepts `list[str]` or `str` as input, so we need to convert it to a list.

Reference doc: https://platform.openai.com/docs/api-reference/embeddings/create?utm_source=chatgpt.com
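For illustration, a minimal sketch of the call being discussed, using the standard openai-python client (the model name and sample texts are examples, not the project's configuration):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

chunk_texts = ["alpha", "beta", "alpha"]  # example input with a duplicate

# Deduplicate, then convert back to a list: the request body is JSON,
# and a Python set is not JSON-serializable.
unique_chunk_texts = list(set(chunk_texts))

response = client.embeddings.create(
    model="text-embedding-3-small",  # example model name
    input=unique_chunk_texts,
)
embeddings = [item.embedding for item in response.data]
```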
I wouldn't be so sure about that -- https://github.com/openai/openai-python/blob/d69edeb39e6a9cc8d9822a2838b10ab4102b4cc6/src/openai/resources/embeddings.py#L50
Actionable comments posted: 0
🧹 Nitpick comments (2)
backend/tests/apps/ai/common/base/chunk_command_test.py (2)
Lines 224-228: Consider adding a length assertion for robustness. While the set comparison correctly verifies that all unique chunks are present, it doesn't confirm the exact count. Adding a length check would strengthen the assertion and catch edge cases where duplicates might slip through.

```diff
 assert result == 1
 _, kwargs = mock_create_chunks.call_args
 assert set(kwargs["chunk_texts"]) == {"chunk1", "chunk2", "chunk3"}
+assert len(kwargs["chunk_texts"]) == 3
 assert kwargs["context"] == mock_context
 assert kwargs["openai_client"] == command.openai_client
 assert kwargs["save"] is False
```
Lines 463-464: Consider adding a length assertion here too. Similar to the suggestion for line 225, adding a length check here would make the assertion more robust and explicit about the expected deduplication result.

```diff
 assert result == 1
 mock_split_text.assert_called_once()
 _, kwargs = mock_create_chunks.call_args
 assert set(kwargs["chunk_texts"]) == {"chunk1", "chunk2", "chunk3"}
+assert len(kwargs["chunk_texts"]) == 3
 assert kwargs["context"] == mock_context
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- backend/apps/ai/common/base/chunk_command.py (1 hunks)
- backend/tests/apps/ai/common/base/chunk_command_test.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/apps/ai/common/base/chunk_command.py
🧰 Additional context used
🧬 Code graph analysis (1)
- backend/tests/apps/ai/common/base/chunk_command_test.py (3)
  - backend/tests/apps/ai/common/base/context_command_test.py (3): command (26-34), mock_entity (38-44), mock_context (48-55)
  - backend/tests/apps/ai/common/base/ai_command_test.py (2): command (19-24), mock_entity (28-33)
  - backend/apps/ai/common/base/chunk_command.py (1): process_chunks_batch (20-84)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Run frontend e2e tests
🔇 Additional comments (2)
backend/tests/apps/ai/common/base/chunk_command_test.py (2)
Lines 433-469: Well-implemented test for deduplication logic! This test effectively verifies that duplicate chunk texts from `split_text` are filtered before processing. The test setup is clear, follows existing patterns, and properly validates that only unique chunks are passed to `create_chunks_and_embeddings`.
Lines 214-469: Inconsistency between summary and implementation regarding order preservation. The AI summary states "Deduplicate chunk texts (order-preserving) before creating chunks", but the implementation uses `list(set(chunk_texts))`, which does not preserve the original order of elements. In Python, `set()` has no guaranteed ordering, and `list(set(...))` will produce an arbitrary order rather than preserving the first-occurrence order from `chunk_texts`.

For text chunks representing parts of a document, preserving order might be semantically important to maintain the document's flow. If order preservation is intended, consider using an order-preserving deduplication approach in the implementation:

```python
# Option 1: Using dict.fromkeys (Python 3.7+)
unique_chunk_texts = list(dict.fromkeys(chunk_texts))

# Option 2: Manual filtering
seen = set()
unique_chunk_texts = [x for x in chunk_texts if not (x in seen or seen.add(x))]
```

If order doesn't matter, the current implementation is fine, but the summary description should be clarified.

Based on learnings from the codebase context, please verify whether chunk ordering is semantically important for the embedding and retrieval process.
Proposed change
Resolves #2342
Fixed the Slack errors and the chunk duplication/integrity errors.
Checklist

- Ran `make check-test` locally; all checks and tests passed.