thushan
diff --git a/.gitignore b/.gitignore
Lines changed: 1 addition & 1 deletion
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 2 additions & 2 deletions
diff --git a/config/config.yaml b/config/config.yaml
Lines changed: 41 additions & 3 deletions
diff --git a/docs/content/api-reference/overview.md b/docs/content/api-reference/overview.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/content/concepts/health-checking.md b/docs/content/concepts/health-checking.md
Lines changed: 37 additions & 6 deletions
@@ -6,7 +6,7 @@ debug/
 logs/
 tmp/
 *.exe
-olla
+/olla
 # Local config files (never ship these)
 config.local.yaml
 config/*.local.yaml
 
@@ -87,9 +87,9 @@ olla/
 ## Response Headers
 - `X-Olla-Endpoint`: Backend name
 - `X-Olla-Model`: Model used
-- `X-Olla-Backend-Type`: ollama/openai/lmstudio
+- `X-Olla-Backend-Type`: ollama/openai/lmstudio/vllm
 - `X-Olla-Request-ID`: Request ID
-- `X-Olla-Response-Time`: Total time (trailer)
+- `X-Olla-Response-Time`: Total processing time
 
 ## Testing
 - Unit tests: Components in isolation
 
@@ -37,12 +37,37 @@ proxy:
   connection_timeout: 60s
   response_timeout: 900s
   read_timeout: 600s
-  max_retries: 3
-  retry_backoff: 500ms
+  
+  # DEPRECATED as of v0.0.16 - These fields are no longer used
+  # max_retries: 3        # Replaced by retry.max_attempts
+  # retry_backoff: 500ms  # Now uses intelligent exponential backoff
+  
+  # Connection failure retry settings (applies to both Sherpa and Olla engines)
+  # When enabled, the proxy will automatically retry failed requests on other healthy endpoints
+  retry:
+    enabled: true # Enable automatic retry on connection failures
+    on_connection_failure: true # Retry when connection to backend fails (connection refused, reset, timeout)
+    max_attempts: 0 # Maximum retry attempts (0 = try all available endpoints once)
+    # Connection errors that trigger retry:
+    # - Connection refused (backend is down)
+    # - Connection reset (backend crashed)
+    # - Connection timeout (backend is overloaded)
+    # - Network unreachable (network issues)
+    # Failed endpoints are immediately marked as unhealthy and removed from the retry pool
 
 discovery:
   type: "static"
   refresh_interval: 30s
+  
+  # Health check and recovery settings
+  health_check:
+    initial_delay: 1s # Delay before first health check
+    # When an endpoint fails during request processing:
+    # - It's immediately marked as unhealthy
+    # - Consecutive failures increment, causing exponential backoff
+    # - Next check time = now + (consecutive_failures * 2) seconds (max 60s)
+    # - Health checker will automatically recover endpoints when they're back online
+  
   static:
     endpoints:
       - url: "http://localhost:11434"
@@ -51,7 +76,7 @@ discovery:
         priority: 100
         model_url: "/api/tags"
         health_check_url: "/"
-        check_interval: 2s
+        check_interval: 2s # How often to check when healthy
         check_timeout: 1s
       - url: "http://localhost:11234"
         name: "local-lm-studio"
@@ -80,10 +105,23 @@ discovery:
 model_registry:
   type: "memory"
   enable_unifier: true
+
   unification:
     enabled: true
     stale_threshold: 24h  # How long to keep models in memory after last seen
     cleanup_interval: 10m  # How often to check for stale models
+  
+  # Model routing strategy (v0.0.16+)
+  # Controls how requests are routed when models aren't available on all endpoints
+  routing_strategy:
+    type: "strict"  # Options: strict, optimistic, discovery
+    options:
+      # Fallback behavior when model not found (optimistic mode)
+      fallback_behavior: "compatible_only"  # Options: compatible_only, all, none
+      
+      # Discovery mode settings
+      discovery_timeout: 2s  # Timeout for discovery refresh
+      discovery_refresh_on_miss: false  # Refresh discovery when model not found
 
 logging:
   level: "info"  # debug, info, warn, error
 
@@ -87,7 +87,7 @@ All responses include:
 | `X-Olla-Endpoint` | Backend endpoint name |
 | `X-Olla-Model` | Model used (if applicable) |
 | `X-Olla-Backend-Type` | Provider type (ollama/lmstudio/openai/vllm) |
-| `X-Olla-Response-Time` | Total processing time (trailer) |
+| `X-Olla-Response-Time` | Total processing time |
 
 ## Error Responses
 
 
@@ -128,12 +128,13 @@ endpoints:
 
 When an endpoint fails, Olla implements exponential backoff:
 
-1. **First failure**: Check again after `check_interval`
+1. **First failure**: Check again after `check_interval` (no backoff)
 2. **Second failure**: Wait `check_interval * 2`
 3. **Third failure**: Wait `check_interval * 4`
-4. **Max backoff**: Capped at 5 minutes
+4. **Fourth failure**: Wait `check_interval * 8`
+5. **Max backoff**: Capped at `check_interval * 12` or 60 seconds (whichever is lower)
 
-This reduces load on failing endpoints while still detecting recovery.
+This reduces load on failing endpoints while still detecting recovery quickly on the first failure.
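
The backoff schedule described in this hunk can be sketched in Go as follows. This is only an illustration of the documented rule (doubling per consecutive failure, capped at the lower of `check_interval * 12` and 60 seconds); `backoffDelay` and `maxBackoff` are hypothetical names, not Olla's actual implementation.

```go
package main

import (
	"fmt"
	"time"
)

// maxBackoff is the absolute ceiling quoted in the docs (60 seconds).
const maxBackoff = 60 * time.Second

// backoffDelay returns the wait before the next health check after
// `failures` consecutive failures. Hypothetical helper for illustration.
func backoffDelay(checkInterval time.Duration, failures int) time.Duration {
	// First failure (or none): re-check after the plain interval, no backoff.
	if failures <= 1 {
		return checkInterval
	}
	// The cap is the lower of checkInterval*12 and 60 seconds.
	limit := checkInterval * 12
	if limit > maxBackoff {
		limit = maxBackoff
	}
	// Double the delay once per additional failure, stopping at the cap.
	delay := checkInterval
	for i := 1; i < failures; i++ {
		delay *= 2
		if delay >= limit {
			return limit
		}
	}
	return delay
}

func main() {
	interval := 2 * time.Second
	for failures := 1; failures <= 6; failures++ {
		fmt.Printf("failure %d: wait %v\n", failures, backoffDelay(interval, failures))
	}
}
```

With a 2s `check_interval`, the sketch yields waits of 2s, 4s, 8s and 16s for the first four failures, then stays at 24s (`check_interval * 12`).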
 
 ### Fast Recovery Detection
 
@@ -143,6 +144,17 @@ When an unhealthy endpoint might be recovering:
 2. **Success Threshold**: After 2 successful checks, mark healthy
 3. **Full Traffic**: Resume normal routing
 
+### Automatic Model Discovery on Recovery
+
+When an endpoint recovers from an unhealthy state, Olla automatically:
+
+1. **Detects Recovery**: Health check transitions from unhealthy to healthy
+2. **Triggers Discovery**: Automatically initiates model discovery
+3. **Updates Catalog**: Refreshes the unified model catalog with latest models
+4. **Resumes Routing**: Endpoint is immediately available for request routing
+
+This ensures the model catalog stays up-to-date even if models were added/removed while the endpoint was down.
+
 ## Health Check Types
 
 ### HTTP GET Health Checks
@@ -169,6 +181,25 @@ endpoints:
     # Health check also validates model availability
 ```
 
+## Connection Failure Handling
+
+### Automatic Retry on Connection Failures
+
+When a request fails due to connection issues, Olla automatically:
+
+1. **Detects Failure**: Identifies connection refused, reset, or timeout errors
+2. **Marks Unhealthy**: Immediately updates endpoint status to unhealthy
+3. **Retries Request**: Automatically tries the next available healthy endpoint
+4. **Updates Health**: Triggers exponential backoff for failed endpoint
+
+This happens transparently without dropping the user request. The retry behaviour is automatic and built-in as of v0.0.16.
+
+Connection errors that trigger automatic retry:
+- **Connection Refused**: Backend service is down
+- **Connection Reset**: Backend crashed or restarted
+- **Connection Timeout**: Backend is overloaded
+- **Network Unreachable**: Network connectivity issues
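
The retry flow added in this hunk might look roughly like the Go sketch below. It assumes that `max_attempts` counts total attempts (with 0 meaning "try every healthy endpoint once") and that any error returned by `send` is a connection failure; `endpoint`, `tryEndpoints`, `send` and the error values are hypothetical names, not Olla's API.

```go
package main

import (
	"errors"
	"fmt"
)

// endpoint is a hypothetical, minimal view of a backend.
type endpoint struct {
	name    string
	healthy bool
}

var (
	errConnRefused        = errors.New("connection refused")
	errNoHealthyEndpoints = errors.New("no healthy endpoints left")
)

// tryEndpoints sends the request to each healthy endpoint in turn.
// A failed endpoint is marked unhealthy immediately, so it drops out
// of the retry pool. maxAttempts == 0 means "try all endpoints once".
func tryEndpoints(eps []*endpoint, maxAttempts int, send func(*endpoint) error) (string, error) {
	attempts := 0
	for _, ep := range eps {
		if !ep.healthy {
			continue
		}
		if maxAttempts > 0 && attempts >= maxAttempts {
			break
		}
		attempts++
		if err := send(ep); err != nil {
			// Assumption: every error from send is a connection failure.
			ep.healthy = false
			continue
		}
		return ep.name, nil
	}
	return "", errNoHealthyEndpoints
}

func main() {
	eps := []*endpoint{
		{name: "backend-a", healthy: true},
		{name: "backend-b", healthy: true},
	}
	// Simulate backend-a being down; the request falls through to backend-b.
	name, err := tryEndpoints(eps, 0, func(ep *endpoint) error {
		if ep.name == "backend-a" {
			return errConnRefused
		}
		return nil
	})
	fmt.Println(name, err)
}
```

The sketch omits the later health-check recovery, which would bring a failed endpoint back into the pool once it passes checks again.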
+
 ## Circuit Breaker Integration
 
 Health checks work with the circuit breaker to prevent cascade failures:
@@ -193,10 +224,10 @@ Health checks work with the circuit breaker to prevent cascade failures:
 
 The circuit breaker activates after consecutive failures:
 
-1. **Failure Threshold**: 3 consecutive failures trigger opening
+1. **Failure Threshold**: 3 failures (health checker) or 5 failures (Olla proxy engine)
 2. **Open Duration**: Circuit stays open for 30 seconds
-3. **Half-Open Test**: Send 3 test requests
-4. **Recovery**: 2 successful tests close the circuit
+3. **Half-Open Test**: Allows one test request through
+4. **Recovery**: First successful request closes the circuit
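
The revised circuit breaker rules (open after N consecutive failures, stay open for 30 seconds, admit one half-open probe, close on its first success) can be sketched as a small state machine. This is an illustrative Go sketch under those stated rules; `breaker`, `newBreaker`, `allow`, `record` and `at` are hypothetical names, and the real engine may differ.

```go
package main

import (
	"fmt"
	"time"
)

type state int

const (
	closed state = iota
	open
	halfOpen
)

// breaker is a hypothetical single-endpoint circuit breaker.
type breaker struct {
	threshold int           // consecutive failures before opening (3 or 5 in the docs)
	openFor   time.Duration // how long the circuit stays open
	failures  int
	st        state
	openedAt  time.Time
}

func newBreaker(threshold int) *breaker {
	return &breaker{threshold: threshold, openFor: 30 * time.Second}
}

// at builds a timestamp from a second offset, to keep the demo deterministic.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// allow reports whether a request may pass at time now.
func (b *breaker) allow(now time.Time) bool {
	switch b.st {
	case closed:
		return true
	case open:
		if now.Sub(b.openedAt) >= b.openFor {
			b.st = halfOpen // admit exactly one test request
			return true
		}
		return false
	default: // halfOpen: the single probe is already in flight
		return false
	}
}

// record feeds the outcome of a request back into the breaker.
func (b *breaker) record(success bool, now time.Time) {
	if success {
		b.failures = 0
		b.st = closed // first successful request closes the circuit
		return
	}
	b.failures++
	if b.st == halfOpen || b.failures >= b.threshold {
		b.st = open
		b.openedAt = now
	}
}

func main() {
	b := newBreaker(3)
	for i := int64(0); i < 3; i++ {
		b.record(false, at(i))
	}
	fmt.Println("allowed while open:", b.allow(at(10)))
	fmt.Println("probe after 30s:", b.allow(at(40)))
	b.record(true, at(41))
	fmt.Println("allowed after recovery:", b.allow(at(41)))
}
```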
 
 ## Monitoring Health Status