Skip to content

Conversation

@lfnovo
Copy link
Owner

@lfnovo lfnovo commented Jan 5, 2026

Fixes #362

Problem

Users were seeing hundreds of ERROR/WARNING logs when processing large documents due to SurrealDB v2 transaction conflicts during concurrent chunk embedding operations. This made it appear that something was seriously wrong, even though the retry mechanism was working correctly.

Related upstream issue: surrealdb/surrealdb#6681

Solution

Three-layer fix to eliminate retry log noise:

  1. Repository layer (repository.py): Changed RuntimeError logging from ERROR → DEBUG
  2. Retry mechanism: Upgraded to surreal-commands v1.3.0 and set retry_log_level: "debug"
  3. Exception handlers: Changed retry warnings to logger.debug()

Additionally increased retry resilience to handle deep queues:

  • Retry attempts: 5 → 15
  • Max wait time: 30s → 120s

Testing

✅ Verified with document upload containing 200+ chunks
✅ Clean logs at INFO level - no scary ERROR messages
✅ Transaction conflicts only visible with --log-level DEBUG

Files Changed

  • commands/embedding_commands.py - embed_chunk retry config
  • commands/source_commands.py - process_source retry config
  • open_notebook/database/repository.py - RuntimeError logging level
  • pyproject.toml - surreal-commands >= 1.3.0
  • Documentation updates

Notes

This is a temporary workaround for SurrealDB v2.x transaction conflict issues with SEARCH indexes. Retry settings can be reduced after migrating to SurrealDB v3, which fixes the underlying concurrency issue (see upstream issue link above).

Addresses issue #362 - users were seeing hundreds of ERROR/WARNING logs
when processing large documents due to SurrealDB v2 transaction conflicts
during concurrent chunk embedding operations.

Changes:
- Upgraded to surreal-commands v1.3.0 which includes retry_log_level feature
- Increased retry attempts from 5 to 15 with max wait time 120s (from 30s)
  to handle deep queues during concurrent processing
- Set retry_log_level to "debug" in embed_chunk and process_source commands
- Changed repository.py RuntimeError logging from ERROR to DEBUG level
- Updated command exception handlers to log retries at DEBUG level
- Updated documentation to reflect retry strategy

This is a temporary workaround for SurrealDB v2.x transaction conflict
issues with SEARCH indexes. Settings can be reduced after migrating to
SurrealDB v3 which fixes the underlying concurrency issue.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@claude
Copy link

claude bot commented Jan 5, 2026

Claude finished @lfnovo's task —— View job


PR Review: Reduce Retry Log Noise During Concurrent Chunk Processing

This PR addresses a legitimate user experience issue where hundreds of ERROR/WARNING logs were appearing during large document processing, making it seem like something was seriously wrong when the retry mechanism was actually working correctly.

Strengths

Well-Targeted Solution

The three-layer fix is well thought out:

  • Repository layer: RuntimeError logging moved from ERROR → DEBUG (repository.py:77-78)
  • Retry configuration: Upgraded to surreal-commands>=1.3.0 with retry_log_level: "debug"
  • Exception handlers: Changed retry warnings to logger.debug() in commands

Appropriate Retry Adjustments

The increased retry resilience makes sense for the identified problem:

  • Attempts: 5 → 15 (reasonable for deep queues with 200+ chunks)
  • Max wait: 30s → 120s (allows sufficient queue drain time)
  • Strategy: Exponential jitter (good for reducing thundering herd effects)

Good Documentation

Excellent inline documentation explaining:

  • Why aggressive retry settings are needed (SurrealDB v2 transaction conflicts)
  • That these are temporary workarounds until SurrealDB v3 migration
  • Clear retry strategy explanations in embed_chunk_command:210-224

Proper Exception Handling

Well-structured exception handling that distinguishes:

  • Retriable errors: RuntimeError, ConnectionError, TimeoutError → re-raised for retry
  • Permanent errors: ValueError, auth errors → caught and returned as failures

⚠️ Areas of Concern

Performance Considerations

  1. Aggressive Retry Settings: 15 attempts with up to 120s wait could lead to very long processing times in worst-case scenarios. Consider adding circuit breaker patterns for systematic failures.

  2. Resource Usage: Deep retry queues (200+ chunks × 15 retries each) could consume significant memory/worker resources if many chunks fail simultaneously.

  3. No Timeout Bounds: Individual chunk processing has no overall timeout - a stuck chunk could retry for hours (15 × 120s = 30+ minutes max).

Logging Strategy

While the DEBUG level reduces noise, it might make legitimate issues harder to detect:

  • Consider structured logging with metrics/counters for retry patterns
  • Maybe add a summary log at INFO level after processing completes (e.g., "Processed 200 chunks, 45 retries needed")

🔒 Security Assessment

Low Risk: Changes are primarily logging and retry configuration. No security concerns identified.

🧪 Test Coverage Gaps

The PR mentions verification with 200+ chunk documents but lacks:

  1. Automated tests for retry behavior under transaction conflicts
  2. Unit tests for the new logging levels
  3. Integration tests for the surreal-commands v1.3.0 upgrade
  4. Performance tests to validate 15-attempt retry behavior doesn't cause resource exhaustion

Code Quality

Good Practices

  • ✅ Consistent error handling patterns
  • ✅ Clear parameter documentation
  • ✅ Proper type hints throughout
  • ✅ Good separation of concerns (retry config vs. business logic)

Minor Issues

  • embedding_commands.py:269-274: The RuntimeError re-raise could include more context about which specific chunk failed for easier debugging
  • Dependency management: Consider pinning surreal-commands to ==1.3.0 rather than >=1.3.0 to prevent unexpected behavior from future versions

📋 Recommendations

Short-term (for this PR)

  1. Add monitoring: Include retry count metrics in command outputs for observability
  2. Consider timeout bounds: Add overall processing timeout per source to prevent indefinite retries
  3. Documentation: Update the CLAUDE.md files to mention the temporary nature of these settings

Medium-term (follow-up work)

  1. Create automated tests for retry behavior
  2. Add dashboards/metrics for monitoring retry patterns in production
  3. Plan SurrealDB v3 migration to remove need for aggressive retry settings

Overall Assessment

This is a solid fix for a real user experience problem. The approach is sound, documentation is excellent, and the changes are appropriately scoped. While the retry settings are aggressive, they're well-justified as a temporary workaround for a known upstream issue.

Recommendation: APPROVE with minor suggestions for monitoring improvements.

Review Checklist

  • Read and analyze all changed files
  • Review retry mechanism changes in commands
  • Evaluate logging level modifications
  • Check dependency updates
  • Assess performance implications
  • Review security considerations
  • Evaluate test coverage needs
  • Provide comprehensive feedback

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 7 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="commands/CLAUDE.md">

<violation number="1" location="commands/CLAUDE.md:18">
P2: Documentation inconsistency: This line documents &#39;15 attempts&#39; but the Key Components section (line 7) still says &#39;max 5×&#39; for `process_source_command`. Consider updating line 7 to reflect the new retry count for consistency.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


- **Pydantic I/O**: All commands use `CommandInput`/`CommandOutput` subclasses for type safety and serialization.
- **Error handling**: Permanent errors return failure output; `RuntimeError` exceptions auto-retry via surreal-commands.
- **Retry configuration**: Aggressive retry settings (15 attempts, 1-120s backoff, DEBUG log level) are a temporary workaround for SurrealDB v2.x transaction conflicts with SEARCH indexes. These can be reduced after migrating to SurrealDB v3.
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Documentation inconsistency: This line documents '15 attempts' but the Key Components section (line 7) still says 'max 5×' for process_source_command. Consider updating line 7 to reflect the new retry count for consistency.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At commands/CLAUDE.md, line 18:

<comment>Documentation inconsistency: This line documents &#39;15 attempts&#39; but the Key Components section (line 7) still says &#39;max 5×&#39; for `process_source_command`. Consider updating line 7 to reflect the new retry count for consistency.</comment>

<file context>
@@ -15,8 +15,9 @@
 
 - **Pydantic I/O**: All commands use `CommandInput`/`CommandOutput` subclasses for type safety and serialization.
 - **Error handling**: Permanent errors return failure output; `RuntimeError` exceptions auto-retry via surreal-commands.
+- **Retry configuration**: Aggressive retry settings (15 attempts, 1-120s backoff, DEBUG log level) are a temporary workaround for SurrealDB v2.x transaction conflicts with SEARCH indexes. These can be reduced after migrating to SurrealDB v3.
 - **Model dumping**: Recursive `full_model_dump()` utility converts Pydantic models → dicts for DB/API responses.
-- **Logging**: Uses `loguru.logger` throughout; logs execution start/end and key metrics (processing time, counts).
</file context>
Fix with Cubic

@lfnovo lfnovo merged commit 3695d6d into main Jan 5, 2026
8 checks passed
@lfnovo lfnovo deleted the fix/reduce-retry-log-noise branch January 5, 2026 20:14
kt-mattie-langenberg pushed a commit to kt-mattie-langenberg/open-notebook that referenced this pull request Jan 10, 2026
fix: reduce retry log noise during concurrent chunk processing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: SurrealDB cannot commit transactions (read or write conflict)

2 participants