6 changes: 6 additions & 0 deletions docs/getting-started.md
@@ -205,6 +205,12 @@ pip uninstall -y framework tools
./quickstart.sh
```

#### Gemini 503 (High Demand)

If you see repeated `503 UNAVAILABLE` or `429 RESOURCE_EXHAUSTED` errors while using Gemini models, see:

- [Gemini 503 Troubleshooting Guide](./troubleshooting/gemini-503.md)

## Getting Help

- **Documentation**: Check the `/docs` folder
50 changes: 50 additions & 0 deletions docs/troubleshooting/gemini-503.md
@@ -0,0 +1,50 @@
# Troubleshooting Gemini 503 (UNAVAILABLE / High Demand)

## Symptoms

When running an agent with Gemini (Vertex), the run may retry repeatedly and then fail with an error similar to:

- `503 UNAVAILABLE`
- Message includes: "This model is currently experiencing high demand"

You may also see LiteLLM retry logs and messages like `MidStreamFallbackError`.

## Why this happens

This is a provider-side overload condition (temporary capacity or demand spike). Your environment can be correctly configured and still hit this error.

## Quick fixes

Try these in order:

1. **Retry later**
- Spikes are often temporary.

2. **Switch models**
- If using `gemini-3-flash-preview`, try `gemini-3.1-pro-preview` (often more stable during spikes).

3. **Reduce workload**
- Shorten the request scope (example: “5 items from last 7 days”).
- Ask for concise output.

4. **Avoid long streaming outputs**
- If a setting exists to disable streaming, try turning it off.
- Mid-stream failures can be more common under provider instability.

5. **Switch providers**
- If you have keys available, try another provider temporarily (OpenAI, Anthropic, Groq, Cerebras).
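If you script the retries yourself, exponential backoff with jitter spreads requests out instead of hammering an already overloaded endpoint. A minimal Python sketch, where `ProviderOverloadedError` and the `call` parameter are hypothetical stand-ins for whatever your client raises on a 503/429:

```python
import random
import time


class ProviderOverloadedError(Exception):
    """Hypothetical stand-in for a 503 UNAVAILABLE / 429 RESOURCE_EXHAUSTED error."""


def call_with_backoff(call, prompt, max_attempts=5, base_delay=1.0):
    """Retry `call(prompt)` with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call(prompt)
        except ProviderOverloadedError:
            if attempt == max_attempts - 1:
                raise  # exhausted all attempts; let the error surface
            # Delay doubles each attempt; random jitter avoids retry storms
            # where many clients hit the provider at the same instant.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)
```

This is a sketch, not the project's built-in behavior; frameworks such as LiteLLM already retry internally, so an outer loop like this mainly helps when the built-in retry budget is exhausted.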

## How to confirm it is not a local misconfiguration

If you can:
- launch the Hive UI successfully,
- run other lightweight prompts successfully at least some of the time,
- and the logs specifically show `503 UNAVAILABLE` with “high demand”,

then this is almost certainly provider-side overload, not a local setup issue.

## Suggested resilience behavior (future improvement)

When transient 503 errors exceed retry thresholds, consider:
- configurable fallback model routing, or
- returning partial results with a clear “degraded” status.
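As a rough illustration of that fallback idea, a routing helper could walk an ordered model chain and return a clearly labeled degraded status when every model is overloaded. Everything here is an assumption for illustration: the chain contents, the `call` signature, and the result shape are not the project's actual API.

```python
class ProviderOverloadedError(Exception):
    """Hypothetical stand-in for a provider 503/429 overload error."""


# Illustrative fallback chain; the model names are examples only.
FALLBACK_CHAIN = ["gemini-3-flash-preview", "gemini-3.1-pro-preview", "gpt-4o-mini"]


def call_with_fallback(call, prompt, chain=FALLBACK_CHAIN):
    """Try each model in order; report a degraded status if all are overloaded."""
    errors = {}
    for model in chain:
        try:
            return {"status": "ok", "model": model, "result": call(model, prompt)}
        except ProviderOverloadedError as exc:
            errors[model] = str(exc)  # record the failure, move to the next model
    # Every model in the chain failed transiently: surface that clearly
    # instead of crashing, so the caller can show a "degraded" state.
    return {"status": "degraded", "errors": errors}
```

The key design point is that the caller always gets a structured result: either `status: ok` with the model that answered, or `status: degraded` with per-model errors it can log or display.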