Switching Google regions to avoid 429 Resource Exhausted errors #3404

Mst96 · 2025-11-05T12:30:41Z

I'm facing a lot of 429 Resource Exhausted issues when using Gemini as my model.

Context (Vertex, Gemini 2.5 Pro, needs to be US-only):
We’re on pay-as-you-go / DSQ and want to stay there. Provisioned Throughput isn’t viable.
This is not high-scale traffic.
The problem is short spikes (tokens per second) during agent runs.

We’re getting intermittent 429 Resource Exhausted errors that cause many runs to fail.

As a result, my solution was to:

Implement token bucket rate limiting to smooth usage toward a more stable average rate. Before each model call, we'd reserve a token 'budget' for that call. If we hit our internal quota or receive a 429, we'd route the request to another region.
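For illustration, the budget reservation described above could be sketched roughly as follows. This is a minimal, single-threaded sketch and not the actual implementation: the class name `TokenBucket`, its methods, and the `rate` and `capacity` parameters are all hypothetical, and the per-call cost is assumed to be an estimate of the tokens the model call will consume.

```python
import time


class TokenBucket:
    """Hypothetical token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self._clock = clock              # injectable so the bucket can be tested
        self._tokens = float(capacity)   # start with a full bucket
        self._last = clock()

    def _refill(self):
        # Credit the tokens earned since the last check, capped at capacity.
        now = self._clock()
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self, cost):
        """Reserve `cost` tokens if the budget allows it; return True on success."""
        if cost > self.capacity:
            raise ValueError("cost exceeds bucket capacity and can never be granted")
        self._refill()
        if cost <= self._tokens:
            self._tokens -= cost
            return True
        return False

    def seconds_until(self, cost):
        """Estimate how long to wait before `cost` tokens would be available."""
        self._refill()
        return max(0.0, (cost - self._tokens) / self.rate)
```

When `try_acquire` returns False, the caller can either wait for `seconds_until(cost)` or hand the request to another region, which matches the fallback described in the post.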

I spoke with GCP and they agreed to the plan.

My question: when ADK instantiates google.genai.Client, the Vertex location is baked into that client, isn't it? I believe we're stuck with one client for the life of the agent. Is there any way to switch the agent's location mid-run?

Otherwise, I'll have to either queue invocations, which is not ideal, or spin up agents in different regions and reroute invocations between them.

Answered by adk-bot

adk-bot (Maintainer) · 2025-11-05T12:31:57Z

Response from ADK Answering Agent (experimental, answer may be inaccurate)

TLDR: You are correct; the Vertex AI client's location is configured at initialization and cannot be changed mid-run for a single agent instance. Your proposed solution of deploying separate ADK agents to different regions and routing requests between them is an excellent and recommended pattern for building resilient, high-availability systems.

Hello! Thanks for your detailed question. Dealing with rate limits is a common challenge when building robust applications.

Client Initialization in ADK

Your understanding is spot on. When an ADK agent initializes its connection to Vertex AI, the configuration, including the project and location, is established for the lifecycle of that agent instance. [2, 5] This configuration is typically set via environment variables (GOOGLE_CLOUD_LOCATION) or programmatically when the client is built, but it is not designed to be dynamically altered during a run. [1]
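Because the location is fixed per client, one way to work with that constraint is to keep one client (or one agent) per region and look it up by location. The sketch below is an assumption-laden illustration: `RegionalClients` and `make_client` are hypothetical names, and the factory is injected so the example does not depend on any SDK. With the google-genai SDK the factory might be something like `lambda loc: genai.Client(vertexai=True, project="my-project", location=loc)`. Whether ADK accepts such a client per agent is not confirmed in this thread.

```python
class RegionalClients:
    """Hypothetical cache that lazily builds one client per location."""

    def __init__(self, make_client):
        # make_client(location) -> client object bound to that location.
        self._make_client = make_client
        self._clients = {}

    def get(self, location):
        # Build the client on first use, then reuse it for that location.
        if location not in self._clients:
            self._clients[location] = self._make_client(location)
        return self._clients[location]
```

Each cached client keeps its own fixed location, so switching regions means picking a different client rather than mutating an existing one.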

Multi-Region Deployment: A Recommended Pattern

Your idea to "spin up different agents in different regions and reroute invocations" is a solid architectural choice. This pattern is widely used to achieve:

High Availability: If one region experiences an outage or performance degradation, you can fail over to a healthy region.
Rate Limit Mitigation: By distributing traffic across multiple regional endpoints (and their corresponding quotas), you can handle higher throughput.
Latency Optimization: You can route users to the geographically closest agent deployment.

You can implement a lightweight load balancer or a "router" service in front of your regional ADK agent deployments to manage the traffic distribution and failover logic.
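As a rough sketch of such a router, assuming each regional deployment is reachable through a plain callable: the names `RegionRouter`, `RateLimited`, and `send` are hypothetical, and the region names used with it are only illustrative. A real deployment would map the backend's actual 429 error onto `RateLimited`.

```python
class RateLimited(Exception):
    """Stand-in for a 429 Resource Exhausted error from a regional backend."""


class RegionRouter:
    """Hypothetical router that fails over between regional backends."""

    def __init__(self, backends):
        # backends: dict mapping region name -> callable(request) -> response
        self._backends = dict(backends)
        self._order = list(self._backends)  # current preference order

    def send(self, request):
        last_error = None
        # Iterate over a copy because the preference order is updated below.
        for region in list(self._order):
            try:
                response = self._backends[region](request)
            except RateLimited as err:
                last_error = err
                # Demote the throttled region so later requests try others first.
                self._order.remove(region)
                self._order.append(region)
                continue
            return region, response
        raise RateLimited("all regions are rate limited") from last_error
```

Demoting a throttled region is one simple policy; a production router would more likely track a cooldown period per region before sending traffic back to it.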

Additional Best Practice: Exponential Backoff

Before failing over to another region, it's a best practice to implement a retry mechanism for transient errors like 429 Resource Exhausted. The standard approach is exponential backoff with jitter.

This involves:

Catching the 429 error.
Waiting for a short, exponentially increasing duration (e.g., 1s, 2s, 4s, ...).
Adding a small, random delay ("jitter") to the wait time to prevent multiple clients from retrying simultaneously (the "thundering herd" problem).
Retrying the request a few times before giving up and failing over to another region.
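The steps above can be sketched as a small helper. This is a minimal illustration under stated assumptions: `call_with_backoff` and `TransientError` are hypothetical names, the delay values are arbitrary defaults, and the jitter here adds up to 25% on top of each delay. Real code would catch the SDK's own 429 exception instead of `TransientError`.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a 429 Resource Exhausted error."""


def call_with_backoff(fn, max_attempts=4, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on TransientError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                # Out of retries: re-raise so the caller can fail over.
                raise
            # Exponential delay: base, 2x base, 4x base, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter: add a random extra of up to 25% so clients do not retry in sync.
            sleep(delay + rng() * delay * 0.25)
```

The `sleep` and `rng` parameters are injectable only to make the helper easy to test; callers would normally leave them at their defaults.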

Many Google client libraries have retry logic built in, but it is worth confirming that your application handles 429 errors explicitly around each model call. Retrying can often absorb temporary spikes without needing to switch regions.

Good luck with your implementation!
