|
---
title: Olla vs GPUStack - GPU Orchestration vs Load Balancing
description: Understand the differences between Olla and GPUStack. Learn how these complementary tools work together for GPU cluster management and LLM routing.
keywords: olla vs gpustack, gpu orchestration, llm deployment, kubernetes ai, gpu cluster management, model deployment
---

# Olla vs GPUStack

## Overview

[Olla](https://github.com/thushan/olla) and [GPUStack](https://github.com/gpustack/gpustack) operate at different layers of the LLM infrastructure stack. GPUStack orchestrates and deploys models across GPU clusters, while Olla provides intelligent routing and failover for existing endpoints.

## Core Differences

### Primary Purpose

**Olla**: Application-layer proxy for routing and resilience

- Routes requests to existing LLM services
- Provides failover and load balancing
- No model deployment or GPU management
- Works with whatever's already running

**GPUStack**: Infrastructure orchestration platform

- Deploys models across GPU clusters
- Manages GPU allocation and scheduling
- Handles model downloading and storage
- Creates and manages inference endpoints

### Stack Position

```
Application Layer:   Your Apps
                         ↓
Routing Layer:       Olla
                         ↓
Service Layer:       LLM Endpoints (Ollama, vLLM, etc.)
                         ↑
Orchestration:       GPUStack (creates these)
                         ↓
Hardware Layer:      GPU Servers
```

## Feature Comparison

| Feature | Olla | GPUStack |
|---------|------|----------|
| **Infrastructure Management** | | |
| Model deployment | ❌ | ✅ |
| GPU resource management | ❌ | ✅ |
| Model downloading | ❌ | ✅ |
| Storage management | ❌ | ✅ |
| Node management | ❌ | ✅ |
| **Request Handling** | | |
| Request routing | ✅ Advanced | ✅ Basic |
| Load balancing strategies | ✅ Multiple | ⚠️ Limited |
| Circuit breakers | ✅ | ❌ |
| Retry mechanisms | ✅ Sophisticated | ⚠️ Basic |
| Health monitoring | ✅ Continuous | ✅ Instance-level |
| **Model Management** | | |
| Model discovery | ✅ From endpoints | N/A (deploys them) |
| Model name unification | ✅ | ❌ |
| Multi-provider support | ✅ | ❌ (GGUF focus) |
| **Deployment** | | |
| Complexity | Simple (binary + YAML) | Platform installation |
| Resource overhead | ~40MB | Platform overhead |
| Prerequisites | None | Kubernetes knowledge helpful |

## When to Use Each

### Use Olla When:

- You have existing LLM services running
- Need intelligent routing between endpoints
- Want automatic failover without re-deployment
- Require advanced load balancing
- Working with multiple LLM providers
- Need minimal resource overhead (see the config sketch below)

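These scenarios come down to a small endpoint list. Here is a minimal sketch using the same endpoint schema as the examples further down this page; the names and URLs are illustrative placeholders, not a recommended topology:

```yaml
# Minimal Olla sketch: failover across two existing services from
# different providers. Names and URLs are placeholders.
endpoints:
  - name: primary-vllm
    url: http://vllm-server:8000
    priority: 1   # Preferred endpoint
    type: openai
  - name: ollama-fallback
    url: http://ollama-server:11434
    priority: 2   # Used when the primary is unhealthy
    type: ollama
```
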
### Use GPUStack When:

- Starting from raw GPU hardware
- Need to dynamically deploy models
- Want Kubernetes-like orchestration
- Managing a cluster of GPUs
- Require automatic model distribution
- Need GPU-aware scheduling

## Better Together: Complementary Architecture

Olla and GPUStack work well together:

```yaml
# Olla configuration
endpoints:
  # GPUStack-managed endpoints
  - name: gpustack-pool-1
    url: http://gpustack-1.internal:8080
    priority: 1
    type: openai

  - name: gpustack-pool-2
    url: http://gpustack-2.internal:8080
    priority: 1
    type: openai

  # Other endpoints
  - name: ollama-backup
    url: http://backup-server:11434
    priority: 2
    type: ollama

  - name: cloud-overflow
    url: http://litellm:8000
    priority: 10
    type: openai
```

### Benefits of Combined Deployment:

1. **GPUStack manages the GPU infrastructure**
   - Deploys models based on demand
   - Handles GPU allocation
   - Manages model lifecycle

2. **Olla provides the reliability layer**
   - Routes between GPUStack instances
   - Fails over to backup endpoints
   - Provides circuit breakers
   - Unifies access to all endpoints

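One way to run the combined stack is containerised, with Olla fronting the GPUStack endpoints from the config above. A hypothetical docker-compose sketch; the image tag, listen port, and config mount path are assumptions to verify against the Olla documentation:

```yaml
# Hypothetical compose sketch. Image tag, port and config path are
# assumptions; the GPUStack nodes themselves are managed separately.
services:
  olla:
    image: ghcr.io/thushan/olla:latest   # Assumed published image name
    ports:
      - "40114:40114"                    # Assumed default Olla port
    volumes:
      - ./olla.yaml:/config/olla.yaml    # The endpoint config shown above
```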
|
## Real-World Scenarios

### Scenario 1: GPU Cluster with Fallbacks
```
                Olla
                 ↓
   ┌─────────────┼─────────────┐
   ↓             ↓             ↓
GPUStack      Ollama       Cloud API
(Primary)     (Backup)     (Overflow)
```

**How it works**:

- GPUStack manages your main GPU cluster
- Olla routes requests, preferring GPUStack
- Falls back to Ollama if the cluster is busy
- Overflows to cloud if everything is saturated

### Scenario 2: Multi-Site Deployment
```
     Global Olla Instance
               ↓
   ┌───────────┼───────────┐
   ↓           ↓           ↓
GPUStack    GPUStack     Direct
(Sydney)  (Melbourne)  Endpoints
```

### Scenario 3: Development to Production
```
Development:  Laptop → Olla → Local Ollama
                        ↓
                  Cloud (fallback)

Production:   Apps → Olla → GPUStack Cluster
                       ↓
                 Cloud (overflow)
```

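The development profile above maps onto a config like this sketch. It reuses the litellm overflow endpoint from the combined example earlier; `localhost` assumes Ollama runs on the same machine:

```yaml
# Development profile sketch: local Ollama first, LiteLLM cloud fallback.
endpoints:
  - name: local-ollama
    url: http://localhost:11434
    priority: 1
    type: ollama
  - name: cloud-fallback
    url: http://litellm:8000
    priority: 10
    type: openai
```
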
## Integration Patterns

### Pattern 1: GPUStack Primary, Others Secondary
```yaml
# Olla prioritises GPUStack but maintains alternatives
endpoints:
  - name: gpustack-primary
    url: http://gpustack:8080
    priority: 1
  - name: manual-backup
    url: http://ollama:11434
    priority: 5
```

### Pattern 2: Geographic Distribution
```yaml
# Each regional Olla instance prefers its local GPUStack cluster.
# Sydney instance shown; a Melbourne instance would invert the priorities.
endpoints:
  - name: gpustack-syd
    url: http://syd.gpustack:8080
    priority: 1   # Local region, preferred
  - name: gpustack-mel
    url: http://mel.gpustack:8080
    priority: 2   # Cross-region fallback
```

Running one Olla instance per region keeps routing local while still allowing cross-region failover.

## Performance Considerations

### Resource Usage

- **Olla**: ~40MB RAM, negligible CPU
- **GPUStack**: Platform overhead + model memory
- **Combined**: Minimal additional overhead from Olla

### Latency

- **Olla routing**: <2ms overhead
- **GPUStack**: Model loading time (first request)
- **Combined**: Olla can route around cold-start delays

## Common Questions

**Q: Does Olla duplicate GPUStack's routing?**
A: No. GPUStack does basic request distribution. Olla adds sophisticated load balancing, circuit breakers, and multi-provider support.

**Q: Can Olla deploy models like GPUStack?**
A: No. Olla only routes to existing endpoints. Use GPUStack for model deployment.

**Q: Should I use both in production?**
A: Yes, if you need both GPU orchestration and reliable routing. They're designed for different layers.

**Q: Can Olla route to non-GPUStack endpoints?**
A: Absolutely! Olla works with any HTTP-based LLM endpoint.

## Migration Patterns

### Adding Olla to GPUStack

1. Deploy Olla in front of GPUStack endpoints
2. Configure health checks and priorities (see the sketch below)
3. Add backup endpoints (Ollama, cloud)
4. Point applications to Olla

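A sketch of step 2. The per-endpoint health-check keys here are assumptions about Olla's config schema rather than confirmed option names; check the Olla documentation for the exact spelling:

```yaml
# The health-check keys below are assumed names, for illustration only.
endpoints:
  - name: gpustack-pool-1
    url: http://gpustack-1.internal:8080
    priority: 1
    type: openai
    health_check_url: /health   # Assumed per-endpoint probe path
    check_interval: 5s          # Assumed polling interval
```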
|
### Adding GPUStack to Olla Setup

1. Deploy GPUStack cluster
2. Add GPUStack endpoints to Olla config
3. Set appropriate priorities
4. Monitor and adjust load balancing (see the sketch below)

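For step 4, a sketch of switching strategies. The top-level `proxy` block and the strategy names are assumptions about Olla's config schema; verify both against your Olla version:

```yaml
# Assumed top-level setting; key and strategy names may differ by version.
proxy:
  load_balancer: priority   # Assumed alternatives: round-robin, least-connections
```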
|
## Conclusion

GPUStack and Olla are complementary tools that excel at different layers:

- **GPUStack**: Infrastructure orchestration and model deployment
- **Olla**: Intelligent routing and reliability

Together, they provide a complete solution: GPUStack manages your GPU infrastructure while Olla ensures reliable, intelligent access to all your LLM resources.