fix: agent retry loop for tool concurrency errors (#1546) [v3] #1606

MikeeBuilds · 2026-01-29T18:09:55Z

Summary

Fixes #1546

Supersedes #1565 (fix: agent retry loop and multi-account authentication (#1546)), which had 152 merge conflicts with develop
Rebased cleanly onto latest develop with all review feedback addressed
Fixes agent getting stuck in infinite retry loop when hitting Claude API's tool concurrency limit (400 errors)
Implements exponential backoff retry logic (2s, 4s, 8s, 16s, 32s) with max 5 retries
Adds error context to agent prompt instructing it to use one tool at a time after concurrency errors
Preserves planning mode on planner-side concurrency errors (prevents skipping planner)
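
The backoff schedule in the summary can be sketched as below. This is an illustration under stated assumptions, not the PR's code: the constant names match those listed for agents/base.py, but their values (5, 2, 32) are inferred from the "2s, 4s, 8s, 16s, 32s" schedule, and `backoff_delays` is a hypothetical helper.

```python
# Values inferred from the PR description; the real constants live in agents/base.py.
MAX_CONCURRENCY_RETRIES = 5
INITIAL_RETRY_DELAY_SECONDS = 2
MAX_RETRY_DELAY_SECONDS = 32


def backoff_delays(
    max_retries: int = MAX_CONCURRENCY_RETRIES,
    initial: int = INITIAL_RETRY_DELAY_SECONDS,
    cap: int = MAX_RETRY_DELAY_SECONDS,
) -> list[int]:
    """Return the delay in seconds before each retry (hypothetical helper)."""
    delays = []
    delay = initial
    for _ in range(max_retries):
        delays.append(delay)
        delay = min(delay * 2, cap)  # double each time, never exceeding the cap
    return delays


print(backoff_delays())  # [2, 4, 8, 16, 32]
```

The cap only matters if the retry limit is raised: with more than five retries, every later delay would stay at 32 seconds.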

Changes

From original PR #1565

agents/session.py: Add is_tool_concurrency_error() detection, return 3-tuple (status, response, error_info) with structured error categorization
agents/coder.py: Add concurrency-aware retry loop with exponential backoff, error context injection into agent prompts, planning-phase retry preservation
agents/base.py: Add retry configuration constants (MAX_CONCURRENCY_RETRIES, INITIAL_RETRY_DELAY_SECONDS, MAX_RETRY_DELAY_SECONDS)
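
A rough sketch of the error categorization and 3-tuple return described above. The substring heuristic, the status strings, and the helpers `build_error_info` and `run_session_sketch` are assumptions made for illustration; only the function name `is_tool_concurrency_error`, the tuple shape, and the `exception_type` key come from this PR.

```python
from typing import Optional


def is_tool_concurrency_error(error: Exception) -> bool:
    """Heuristic check; the substrings matched here are assumptions, not the PR's logic."""
    text = str(error).lower()
    return "400" in text and "concurrency" in text


def build_error_info(error: Exception) -> dict:
    """Hypothetical helper: categorize an exception into a JSON-serializable dict."""
    return {
        "type": "tool_concurrency" if is_tool_concurrency_error(error) else "other",
        "message": str(error),
        "exception_type": type(error).__name__,  # a string, not the raw exception
    }


def run_session_sketch(fail_with: Optional[Exception] = None) -> tuple[str, str, dict]:
    """Stand-in for run_agent_session returning (status, response, error_info)."""
    if fail_with is not None:
        return "error", str(fail_with), build_error_info(fail_with)
    return "complete", "ok", {}
```

Callers unpack all three values, for example `status, response, error_info = run_session_sketch()`, which is why the test mocks had to be updated to return 3-tuples.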

Review feedback addressed (this PR)

Extract duplicated concurrency error state reset into _reset_concurrency_state() helper (was duplicated in 3 places)
Fix stale docstring referencing "exception" key → now correctly documents "exception_type"
Move import os from function-level to module-level in simple_client.py
Use exception_type: type(e).__name__ instead of raw exception object (JSON-serializable, no internal leaks)
Off-by-one backoff already fixed (> not >=, allows all 5 retries)
Empty f-strings already fixed (plain strings, no Ruff F541)
Test mocks already return 3-tuples matching new signature
Planning concurrency guidance already implemented via planning_retry_context
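
To see why `>` rather than `>=` matters for the retry limit, here is a small counting sketch. It assumes the error counter is incremented before the limit check, which is a guess at the loop's shape rather than a copy of coder.py; `retries_allowed` is a hypothetical helper.

```python
MAX_CONCURRENCY_RETRIES = 5  # value inferred from the PR description


def retries_allowed(give_up) -> int:
    """Count how many retries run before give_up(error_count) returns True."""
    errors = 0
    retries = 0
    while True:
        errors += 1  # another consecutive concurrency error
        if give_up(errors):
            return retries
        retries += 1  # schedule one more attempt


strict = retries_allowed(lambda n: n > MAX_CONCURRENCY_RETRIES)
off_by_one = retries_allowed(lambda n: n >= MAX_CONCURRENCY_RETRIES)
print(strict, off_by_one)  # 5 4
```

Under this assumption, `>=` gives up on the fifth error and so permits only four retries, while `>` permits all five.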

Items from #1565 not carried over (already in develop)

auth.py multi-account keychain changes — already merged into develop via other PRs
AuthStatusIndicator.tsx — debug console.logs already removed in develop
App.tsx onboarding race condition — separate concern, not related to retry loop fix

Test plan

Verify agent handles 400 tool concurrency errors with exponential backoff
Verify agent prompt includes concurrency guidance after retry
Verify planning phase retries stay in planning mode on concurrency error
Verify agent gives up after 5 consecutive concurrency errors
Run existing test suite: pytest tests/test_issue_884_plan_schema.py -v

🤖 Generated with Claude Code

…0#1546)

- Add exponential backoff retry logic (2s, 4s, 8s, 16s, 32s) for 400 tool concurrency errors
- Add is_tool_concurrency_error() detection in session.py
- Update run_agent_session to return 3-tuple (status, response, error_info)
- Track consecutive concurrency errors (max 5 retries before marking subtask as stuck)
- Add error context to agent prompt instructing it to use one tool at a time
- Fix: Reset first_run=True on planning concurrency errors to retry planning
- Update test mocks for run_agent_session 3-tuple return type
- Update test mocks for get_token_from_keychain config_dir parameter

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Avoids JSON serialization issues and potential internal leaks when error_info is logged or sent via IPC.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Extract duplicated concurrency error reset logic into _reset_concurrency_state() helper
- Fix stale docstring referencing "exception" key (now "exception_type")
- Move `import os` to module level in simple_client.py

Co-Authored-By: Claude Opus 4.5 <[email protected]>

MikeeBuilds · 2026-01-29T18:10:14Z

Hey @AndyMik90 — this PR supersedes #1565.

Why a new PR: PR #1565 had diverged so far from develop that it had 152 merge conflicts across virtually the entire codebase (frontend, backend, CI, i18n, tests — everything). The original branch fix/1546-agent-retry-loop was based on a much older develop and the auth.py changes from that branch had already been merged separately.

What we did:

Reviewed all comments on #1565 (fix: agent retry loop and multi-account authentication (#1546)) from CodeRabbit, Gemini, Sentry, and GitHub Advanced Security
Verified which issues were already fixed in the existing commits vs still needed work
Applied the remaining fixes (duplicated reset logic → helper function, stale docstring, module-level import)
Rebased cleanly onto latest develop — resolved the single actual conflict in auth.py by keeping develop's version (which already had the keychain improvements)
Opened this fresh PR with a clean 4-commit history

Review comment status:

✅ Raw exception object → exception_type string (already in commit 9dca7845)
✅ Planning retry skipping planner → already fixed (lines 660-667 reset first_run = True)
✅ Off-by-one backoff → already fixed (uses > not >=)
✅ Empty f-strings → already plain strings
✅ Test mock 3-tuple mismatch → already returning 3 values
✅ Planner concurrency guidance → already wired through planning_retry_context
✅ Duplicated reset logic → new fix: extracted _reset_concurrency_state() helper
✅ Stale docstring → new fix: updated to reference exception_type
✅ Function-level import os → new fix: moved to module level

The auth.py multi-account keychain changes from #1565 were not carried over because they're already in develop from other merged PRs. Same for the AuthStatusIndicator.tsx debug logs — already cleaned up in develop.

#1565 can be closed once this is merged.

coderabbitai · 2026-01-29T18:10:21Z

📝 Walkthrough

Adds per-profile OAuth token lookup, propagates error context from sessions, and implements exponential-backoff retry handling for tool-concurrency (HTTP 400) errors; also integrates Claude profile loading into the frontend and updates related tests and translations.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Retry/Concurrency Constants: `apps/backend/agents/base.py` | Added constants: `MAX_CONCURRENCY_RETRIES`, `INITIAL_RETRY_DELAY_SECONDS`, `MAX_RETRY_DELAY_SECONDS`. |
| Concurrency/Error Flow: `apps/backend/agents/coder.py`, `apps/backend/agents/planner.py` | Introduced per-run concurrency state and exponential-backoff retry logic in coder; planner updated to unpack new `error_info` return from sessions. |
| Session Error Reporting: `apps/backend/agents/session.py` | Added `is_tool_concurrency_error()` and changed `run_agent_session` to return `(status, response, error_info)` with richer error categorization and logging. |
| Per-Profile Auth Token: `apps/backend/core/client.py`, `apps/backend/core/simple_client.py` | Token retrieval now uses `config_dir`/`sdk_env` and passes `config_dir` to `require_auth_token()`, enabling per-profile Keychain lookup, and sets `CLAUDE_CONFIG_DIR` for the SDK. |
| Frontend: Claude Profiles & Auth UI: `apps/frontend/src/renderer/App.tsx`, `apps/frontend/src/renderer/components/AuthStatusIndicator.tsx` | Loaded Claude profiles on startup and integrated Claude profile store + usage data into auth status calculation and OAuth UI rendering. |
| Locales: `apps/frontend/src/shared/i18n/locales/en/common.json`, `apps/frontend/src/shared/i18n/locales/fr/common.json` | Added `usage.account` translation key (`"Account"` / `"Compte"`). |
| Tests: `tests/test_auth.py`, `tests/test_issue_884_plan_schema.py` | Updated mocks and helpers to accept the new optional `config_dir` parameter and to return the new `(status, response, error_info)` tuple, respectively. |

Sequence Diagram(s)

sequenceDiagram
    participant Coder as Coder Agent
    participant Session as run_agent_session
    participant Detector as is_tool_concurrency_error
    participant Backoff as Backoff Manager

    Coder->>Session: run_agent_session(task)
    Session->>Coder: (status, response, error_info)
    alt error_info.type == "tool_concurrency"
        Coder->>Detector: check error_info
        Detector-->>Coder: true
        Coder->>Backoff: check/increment retry count
        Backoff-->>Coder: delay (2s→4s→8s...)
        Coder->>Coder: inject error_info into next prompt
        Coder->>Session: run_agent_session(task) [retry after delay]
    else success
        Coder->>Coder: _reset_concurrency_state()
    end
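
The first diagram's flow can be approximated by the loop below. It is a sketch under assumptions, not the code in coder.py: the session callable, the `"stuck"` status string, the wording of the injected context, and the injectable `sleep` parameter are all hypothetical, and the constant values are inferred from the PR description.

```python
import time
from typing import Callable, Optional

# Values inferred from the PR description; the real constants live in agents/base.py.
MAX_CONCURRENCY_RETRIES = 5
INITIAL_RETRY_DELAY_SECONDS = 2
MAX_RETRY_DELAY_SECONDS = 32

# A session takes optional extra prompt context and returns (status, response, error_info).
Session = Callable[[Optional[str]], tuple[str, str, dict]]


def run_with_retries(session: Session, sleep: Callable[[float], None] = time.sleep) -> str:
    """Call session(context) until it stops hitting concurrency errors or retries run out."""
    errors = 0
    delay = INITIAL_RETRY_DELAY_SECONDS
    context: Optional[str] = None
    while True:
        status, response, error_info = session(context)
        if error_info.get("type") != "tool_concurrency":
            return status  # success or an unrelated error: nothing to back off for
        errors += 1
        if errors > MAX_CONCURRENCY_RETRIES:
            return "stuck"
        # Guidance injected into the next prompt (wording is illustrative).
        context = f"Concurrency error, retry {errors}/{MAX_CONCURRENCY_RETRIES}: use ONE tool at a time."
        sleep(delay)
        delay = min(delay * 2, MAX_RETRY_DELAY_SECONDS)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting, for example `run_with_retries(fake_session, sleep=lambda s: None)`.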

sequenceDiagram
    participant App as Frontend App
    participant Profiles as Profile Store
    participant ClaudeStore as Claude Profile Store
    participant Backend as Backend/SDK

    App->>Profiles: loadProfiles()
    Profiles-->>App: profiles loaded
    App->>ClaudeStore: loadClaudeProfiles()
    ClaudeStore->>Backend: fetch profiles
    Backend-->>ClaudeStore: profile list + active
    ClaudeStore-->>App: profile updates (subscribed)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

fix(auth): read tokens from profile configDir to fix 401 errors #1385 — Overlaps per-profile CLAUDE_CONFIG_DIR and config-dir-aware token resolution; strong overlap with auth refactor.
feat: Add multi-auth token support and ANTHROPIC_BASE_URL passthrough (CLI only) #7 — Related to centralized auth token resolution and SDK env handling used in core.client/simple_client.

Suggested labels

bug, area/backend, size/M

Suggested reviewers

AndyMik90

Poem

🐰 Hopping through logs with a retrying heart,
When tools clash and HTTPs start,
I nudge prompts with context and wait,
Backoff doubles—patient, never late.
Profiles bloom; tokens find their part.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: implementing a retry loop with exponential backoff for tool concurrency errors, which is the core fix across agents/base.py, agents/coder.py, and agents/session.py. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 82.61%, which is sufficient. The required threshold is 80.00%. |

gemini-code-assist · 2026-01-29T18:10:29Z

Summary of Changes

Hello @MikeeBuilds, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the agent's resilience and reliability when interacting with the Claude API, particularly concerning tool concurrency limits. By introducing an intelligent retry mechanism with exponential backoff and providing contextual guidance to the agent after encountering concurrency errors, it ensures smoother operation and prevents the agent from stalling. Additionally, it refines error reporting and preserves the agent's operational phase during retries, contributing to a more stable and predictable agent workflow.

Highlights

Tool Concurrency Error Handling: The agent now robustly handles Claude API's 400 tool concurrency errors, preventing it from getting stuck in an infinite retry loop.
Exponential Backoff Retry Logic: An exponential backoff strategy has been implemented for tool concurrency errors, with delays of 2s, 4s, 8s, 16s, and 32s, allowing for a maximum of 5 retries.
Agent Prompt Context Injection: After a tool concurrency error, the agent's prompt is augmented with specific guidance, instructing it to use one tool at a time to avoid further concurrency issues.
Planning Mode Preservation: If a concurrency error occurs during the planning phase, the system now ensures that the agent remains in planning mode for retry, preventing unintended skips.
Refactored Error Reporting: The run_agent_session function now returns structured error information, including the error type ('tool_concurrency' or 'other'), message, and exception type, facilitating more precise error handling.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

sentry · 2026-01-29T18:11:55Z

Codecov Report

❌ Patch coverage is 37.34940% with 52 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| apps/backend/agents/coder.py | 33.92% | 37 Missing ⚠️ |
| apps/backend/agents/session.py | 7.14% | 13 Missing ⚠️ |
| apps/backend/agents/planner.py | 0.00% | 1 Missing ⚠️ |
| apps/backend/core/client.py | 80.00% | 1 Missing ⚠️ |

gemini-code-assist

Code Review

This pull request effectively addresses an infinite retry loop caused by tool concurrency errors from the Claude API. The implementation of exponential backoff, a maximum retry limit, and contextual error prompts for the agent are well-executed. The code is clean and the refactoring to avoid duplication is a good improvement. I have one suggestion to further improve maintainability by extracting a large prompt template from the main agent logic.

gemini-code-assist · 2026-01-29T18:12:24Z

apps/backend/agents/coder.py

+                error_context_message = (
+                    "## CRITICAL: TOOL CONCURRENCY ERROR\n\n"
+                    f"Your previous session hit Claude API's tool concurrency limit (HTTP 400).\n"
+                    f"This is retry {consecutive_concurrency_errors}/{MAX_CONCURRENCY_RETRIES}.\n\n"
+                    "**IMPORTANT: You MUST adjust your approach:**\n"
+                    "1. Use ONE tool at a time - do NOT call multiple tools in parallel\n"
+                    "2. Wait for each tool result before calling the next tool\n"
+                    "3. Avoid starting with `pwd` or multiple Read calls at once\n"
+                    "4. If you need to read multiple files, read them one by one\n"
+                    "5. Take a more incremental, step-by-step approach\n\n"
+                    "Start by focusing on ONE specific action for this subtask."
+                )


This large, multi-line string for the agent's error prompt is defined directly within the main agent loop. For better maintainability and separation of concerns, it's a good practice to extract prompt templates from the core logic.

Consider moving this template to a dedicated prompts module (e.g., prompts.py) and generating it with a helper function. This would make the core agent logic cleaner and centralize prompt management.

For example, in a prompts module:

def generate_concurrency_error_prompt(retry_count: int, max_retries: int) -> str:
    return f'''## CRITICAL: TOOL CONCURRENCY ERROR

Your previous session hit Claude API's tool concurrency limit (HTTP 400).
This is retry {retry_count}/{max_retries}.

**IMPORTANT: You MUST adjust your approach:**
1. Use ONE tool at a time - do NOT call multiple tools in parallel
2. Wait for each tool result before calling the next tool
3. Avoid starting with `pwd` or multiple Read calls at once
4. If you need to read multiple files, read them one by one
5. Take a more incremental, step-by-step approach

Start by focusing on ONE specific action for this subtask.'''

Co-Authored-By: Claude Opus 4.5 <[email protected]>

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@apps/backend/agents/coder.py`:
- Around line 224-235: The helper _reset_concurrency_state and the surrounding
variable declarations (consecutive_concurrency_errors, current_retry_delay,
concurrency_error_context, and INITIAL_RETRY_DELAY_SECONDS) are mis-formatted
causing ruff to fail; reformat the block so the inline comment and the
optional-typed assignment for concurrency_error_context use conventional
single-line or properly wrapped multi-line syntax (remove awkward
parentheses/newlines around the comment), then run ruff format (e.g., ruff
format apps/backend/agents/coder.py) to apply the correct formatting.

apps/backend/agents/coder.py

sentry · 2026-01-29T18:17:21Z

apps/backend/agents/coder.py

+    def _reset_concurrency_state() -> None:
+        """Reset concurrency error tracking state after a successful session or non-concurrency error."""
+        nonlocal \
+            consecutive_concurrency_errors, \
+            current_retry_delay, \
+            concurrency_error_context
+        consecutive_concurrency_errors = 0
+        current_retry_delay = INITIAL_RETRY_DELAY_SECONDS
+        concurrency_error_context = None


Bug: The _reset_concurrency_state function fails to reset planning_retry_context, which can cause stale error context to be used in subsequent planning attempts after a specific error sequence.
Severity: MEDIUM

Suggested Fix

Add planning_retry_context to the nonlocal statement within the _reset_concurrency_state function and set it to None along with the other state variables being reset.
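
Sentry's suggested fix could look roughly like the following. The enclosing `make_state` function and the `snapshot` helper are hypothetical scaffolding added so the closure can run on its own, and the initial values are placeholders; only the four variable names and `_reset_concurrency_state` come from the review.

```python
INITIAL_RETRY_DELAY_SECONDS = 2  # value inferred from the PR description


def make_state():
    """Hypothetical stand-in for the enclosing agent loop that owns the retry state."""
    consecutive_concurrency_errors = 3
    current_retry_delay = 16
    concurrency_error_context = "stale coder guidance"
    planning_retry_context = "stale planner guidance"

    def _reset_concurrency_state() -> None:
        nonlocal consecutive_concurrency_errors, current_retry_delay
        nonlocal concurrency_error_context, planning_retry_context
        consecutive_concurrency_errors = 0
        current_retry_delay = INITIAL_RETRY_DELAY_SECONDS
        concurrency_error_context = None
        planning_retry_context = None  # the extra reset Sentry suggests

    def snapshot() -> tuple:
        """Expose the current state so the effect of the reset can be inspected."""
        return (
            consecutive_concurrency_errors,
            current_retry_delay,
            concurrency_error_context,
            planning_retry_context,
        )

    return _reset_concurrency_state, snapshot
```

Whether this reset is actually needed depends on where `planning_retry_context` is consumed in coder.py, which the review asks to be verified first.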

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.

Location: apps/backend/agents/coder.py#L230-L238

Potential issue: The function `_reset_concurrency_state` is designed to clear error-related state but fails to reset the `planning_retry_context` variable. This can lead to a bug under a specific sequence of events: if a concurrency error occurs during the planning phase, `planning_retry_context` is set. If a subsequent planning attempt then fails with a different, non-concurrency error, `_reset_concurrency_state` is called but does not clear `planning_retry_context`. As a result, the next planning attempt will incorrectly use the stale error context from the initial concurrency error, potentially misleading the agent with outdated guidance.

AndyMik90

✅ Auto Claude Review - APPROVED

Status: Ready to Merge

Merge Verdict: ✅ READY TO MERGE

✅ Ready to merge - All checks passing, no blocking issues found.

No blocking issues found

Risk Assessment

| Factor | Level | Notes |
| --- | --- | --- |
| Complexity | Medium | Based on lines changed |
| Security Impact | None | Based on security findings |
| Scope Coherence | Good | Based on structural review |

Generated by Auto Claude PR Review

This automated review found no blocking issues. The PR can be safely merged.

Generated by Auto Claude

MikeeBuilds and others added 4 commits January 29, 2026 13:08

style: fix ruff format (line length)

64c6541

Co-Authored-By: Claude Opus 4.5 <[email protected]>

fix: use exception_type string instead of raw exception object

5d296ee

Avoids JSON serialization issues and potential internal leaks when error_info is logged or sent via IPC.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

gemini-code-assist bot reviewed Jan 29, 2026

style: fix ruff format (nonlocal line length)

0ffa458

Co-Authored-By: Claude Opus 4.5 <[email protected]>

coderabbitai bot reviewed Jan 29, 2026

sentry bot reviewed Jan 29, 2026

MikeeBuilds added labels Jan 29, 2026: bug (Something isn't working), priority/high (Important, fix this week), area/fullstack (This is Frontend + Backend), stable-roadmap v2.7.6

AndyMik90 self-assigned this Jan 29, 2026

AndyMik90 approved these changes Jan 29, 2026

AndyMik90 merged commit 0aea4fb into AndyMik90:develop Jan 29, 2026
21 checks passed
