-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
fix: agent retry loop for tool concurrency errors (#1546) [v3] #1606
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: agent retry loop for tool concurrency errors (#1546) [v3] #1606
Conversation
…0#1546) - Add exponential backoff retry logic (2s, 4s, 8s, 16s, 32s) for 400 tool concurrency errors - Add is_tool_concurrency_error() detection in session.py - Update run_agent_session to return 3-tuple (status, response, error_info) - Track consecutive concurrency errors (max 5 retries before marking subtask as stuck) - Add error context to agent prompt instructing it to use one tool at a time - Fix: Reset first_run=True on planning concurrency errors to retry planning - Update test mocks for run_agent_session 3-tuple return type - Update test mocks for get_token_from_keychain config_dir parameter Co-Authored-By: Claude Opus 4.5 <[email protected]>
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Avoids JSON serialization issues and potential internal leaks when error_info is logged or sent via IPC. Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Extract duplicated concurrency error reset logic into _reset_concurrency_state() helper - Fix stale docstring referencing "exception" key (now "exception_type") - Move `import os` to module level in simple_client.py Co-Authored-By: Claude Opus 4.5 <[email protected]>
|
Hey @AndyMik90 — this PR supersedes #1565. Why a new PR: PR #1565 had diverged so far from What we did:
Review comment status:
The #1565 can be closed once this is merged. |
📝 WalkthroughWalkthroughAdds per-profile OAuth token lookup, propagates error context from sessions, and implements exponential-backoff retry handling for tool-concurrency (HTTP 400) errors; also integrates Claude profile loading into the frontend and updates related tests and translations. Changes
Sequence Diagram(s)sequenceDiagram
participant Coder as Coder Agent
participant Session as run_agent_session
participant Detector as is_tool_concurrency_error
participant Backoff as Backoff Manager
Coder->>Session: run_agent_session(task)
Session->>Coder: (status, response, error_info)
alt error_info.type == "tool_concurrency"
Coder->>Detector: check error_info
Detector-->>Coder: true
Coder->>Backoff: check/increment retry count
Backoff-->>Coder: delay (2s→4s→8s...)
Coder->>Coder: inject error_info into next prompt
Coder->>Session: run_agent_session(task) [retry after delay]
else success
Coder->>Coder: _reset_concurrency_state()
end
sequenceDiagram
participant App as Frontend App
participant Profiles as Profile Store
participant ClaudeStore as Claude Profile Store
participant Backend as Backend/SDK
App->>Profiles: loadProfiles()
Profiles-->>App: profiles loaded
App->>ClaudeStore: loadClaudeProfiles()
ClaudeStore->>Backend: fetch profiles
Backend-->>ClaudeStore: profile list + active
ClaudeStore-->>App: profile updates (subscribed)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @MikeeBuilds, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the agent's resilience and reliability when interacting with the Claude API, particularly concerning tool concurrency limits. By introducing an intelligent retry mechanism with exponential backoff and providing contextual guidance to the agent after encountering concurrency errors, it ensures smoother operation and prevents the agent from stalling. Additionally, it refines error reporting and preserves the agent's operational phase during retries, contributing to a more stable and predictable agent workflow. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request effectively addresses an infinite retry loop caused by tool concurrency errors from the Claude API. The implementation of exponential backoff, a maximum retry limit, and contextual error prompts for the agent are well-executed. The code is clean and the refactoring to avoid duplication is a good improvement. I have one suggestion to further improve maintainability by extracting a large prompt template from the main agent logic.
| error_context_message = ( | ||
| "## CRITICAL: TOOL CONCURRENCY ERROR\n\n" | ||
| f"Your previous session hit Claude API's tool concurrency limit (HTTP 400).\n" | ||
| f"This is retry {consecutive_concurrency_errors}/{MAX_CONCURRENCY_RETRIES}.\n\n" | ||
| "**IMPORTANT: You MUST adjust your approach:**\n" | ||
| "1. Use ONE tool at a time - do NOT call multiple tools in parallel\n" | ||
| "2. Wait for each tool result before calling the next tool\n" | ||
| "3. Avoid starting with `pwd` or multiple Read calls at once\n" | ||
| "4. If you need to read multiple files, read them one by one\n" | ||
| "5. Take a more incremental, step-by-step approach\n\n" | ||
| "Start by focusing on ONE specific action for this subtask." | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This large, multi-line string for the agent's error prompt is defined directly within the main agent loop. For better maintainability and separation of concerns, it's a good practice to extract prompt templates from the core logic.
Consider moving this template to a dedicated prompts module (e.g., prompts.py) and generating it with a helper function. This would make the core agent logic cleaner and centralize prompt management.
For example, in a prompts module:
def generate_concurrency_error_prompt(retry_count: int, max_retries: int) -> str:
return f'''## CRITICAL: TOOL CONCURRENCY ERROR
Your previous session hit Claude API's tool concurrency limit (HTTP 400).
This is retry {retry_count}/{max_retries}.
**IMPORTANT: You MUST adjust your approach:**
1. Use ONE tool at a time - do NOT call multiple tools in parallel
2. Wait for each tool result before calling the next tool
3. Avoid starting with `pwd` or multiple Read calls at once
4. If you need to read multiple files, read them one by one
5. Take a more incremental, step-by-step approach
Start by focusing on ONE specific action for this subtask.'''Co-Authored-By: Claude Opus 4.5 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@apps/backend/agents/coder.py`:
- Around line 224-235: The helper _reset_concurrency_state and the surrounding
variable declarations (consecutive_concurrency_errors, current_retry_delay,
concurrency_error_context, and INITIAL_RETRY_DELAY_SECONDS) are mis-formatted
causing ruff to fail; reformat the block so the inline comment and the
optional-typed assignment for concurrency_error_context use conventional
single-line or properly wrapped multi-line syntax (remove awkward
parentheses/newlines around the comment), then run ruff format (e.g., ruff
format apps/backend/agents/coder.py) to apply the correct formatting.
| def _reset_concurrency_state() -> None: | ||
| """Reset concurrency error tracking state after a successful session or non-concurrency error.""" | ||
| nonlocal \ | ||
| consecutive_concurrency_errors, \ | ||
| current_retry_delay, \ | ||
| concurrency_error_context | ||
| consecutive_concurrency_errors = 0 | ||
| current_retry_delay = INITIAL_RETRY_DELAY_SECONDS | ||
| concurrency_error_context = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: The _reset_concurrency_state function fails to reset planning_retry_context, which can cause stale error context to be used in subsequent planning attempts after a specific error sequence.
Severity: MEDIUM
Suggested Fix
Add planning_retry_context to the nonlocal statement within the _reset_concurrency_state function and set it to None along with the other state variables being reset.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: apps/backend/agents/coder.py#L230-L238
Potential issue: The function `_reset_concurrency_state` is designed to clear
error-related state but fails to reset the `planning_retry_context` variable. This can
lead to a bug under a specific sequence of events: if a concurrency error occurs during
the planning phase, `planning_retry_context` is set. If a subsequent planning attempt
then fails with a different, non-concurrency error, `_reset_concurrency_state` is called
but does not clear `planning_retry_context`. As a result, the next planning attempt will
incorrectly use the stale error context from the initial concurrency error, potentially
misleading the agent with outdated guidance.
Did we get this right? 👍 / 👎 to inform future reviews.
AndyMik90
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅ Auto Claude Review - APPROVED
Status: Ready to Merge
Summary: ### Merge Verdict: ✅ READY TO MERGE
✅ Ready to merge - All checks passing, no blocking issues found.
No blocking issues found
Risk Assessment
| Factor | Level | Notes |
|---|---|---|
| Complexity | Medium | Based on lines changed |
| Security Impact | None | Based on security findings |
| Scope Coherence | Good | Based on structural review |
Generated by Auto Claude PR Review
This automated review found no blocking issues. The PR can be safely merged.
Generated by Auto Claude
Summary
Fixes #1546
developdevelopwith all review feedback addressedChanges
From original PR #1565
agents/session.py: Addis_tool_concurrency_error()detection, return 3-tuple(status, response, error_info)with structured error categorizationagents/coder.py: Add concurrency-aware retry loop with exponential backoff, error context injection into agent prompts, planning-phase retry preservationagents/base.py: Add retry configuration constants (MAX_CONCURRENCY_RETRIES,INITIAL_RETRY_DELAY_SECONDS,MAX_RETRY_DELAY_SECONDS)Review feedback addressed (this PR)
_reset_concurrency_state()helper (was duplicated in 3 places)"exception"key → now correctly documents"exception_type"import osfrom function-level to module-level insimple_client.pyexception_type: type(e).__name__instead of raw exception object (JSON-serializable, no internal leaks)>not>=, allows all 5 retries)planning_retry_contextItems from #1565 not carried over (already in develop)
auth.pymulti-account keychain changes — already merged intodevelopvia other PRsAuthStatusIndicator.tsx— debug console.logs already removed in developApp.tsxonboarding race condition — separate concern, not related to retry loop fixTest plan
pytest tests/test_issue_884_plan_schema.py -v🤖 Generated with Claude Code