Commit 5d2915b - Merge pull request #53 from thushan/docs/compare: docs: comparisons
2 parents: 1908f7b + bdb66d7 · 10 files changed, +1284 −6 lines

docs/content/compare/gpustack.md (250 additions, 0 deletions)

---
title: Olla vs GPUStack - GPU Orchestration vs Load Balancing
description: Understand the differences between Olla and GPUStack. Learn how these complementary tools work together for GPU cluster management and LLM routing.
keywords: olla vs gpustack, gpu orchestration, llm deployment, kubernetes ai, gpu cluster management, model deployment
---

# Olla vs GPUStack

## Overview

[Olla](https://github.com/thushan/olla) and [GPUStack](https://github.com/gpustack/gpustack) operate at different layers of the LLM infrastructure stack. GPUStack orchestrates and deploys models across GPU clusters, while Olla provides intelligent routing and failover for existing endpoints.

## Core Differences

### Primary Purpose

**Olla**: Application-layer proxy for routing and resilience

- Routes requests to existing LLM services
- Provides failover and load balancing
- No model deployment or GPU management
- Works with whatever's already running

**GPUStack**: Infrastructure orchestration platform

- Deploys models across GPU clusters
- Manages GPU allocation and scheduling
- Handles model downloading and storage
- Creates and manages inference endpoints

### Stack Position

```
Application Layer:   Your Apps
                        ↓
Routing Layer:       Olla
                        ↓
Service Layer:       LLM Endpoints (Ollama, vLLM, etc)
                        ↑
Orchestration:       GPUStack (creates these)
                        ↓
Hardware Layer:      GPU Servers
```

## Feature Comparison

| Feature | Olla | GPUStack |
|---------|------|----------|
| **Infrastructure Management** | | |
| Model deployment | ❌ | ✅ |
| GPU resource management | ❌ | ✅ |
| Model downloading | ❌ | ✅ |
| Storage management | ❌ | ✅ |
| Node management | ❌ | ✅ |
| **Request Handling** | | |
| Request routing | ✅ Advanced | ✅ Basic |
| Load balancing strategies | ✅ Multiple | ⚠️ Limited |
| Circuit breakers | ✅ | ❌ |
| Retry mechanisms | ✅ Sophisticated | ⚠️ Basic |
| Health monitoring | ✅ Continuous | ✅ Instance-level |
| **Model Management** | | |
| Model discovery | ✅ From endpoints | N/A (deploys them) |
| Model name unification | ✅ | ❌ |
| Multi-provider support | ✅ | ❌ (GGUF focus) |
| **Deployment** | | |
| Complexity | Simple (binary + YAML) | Platform installation |
| Resource overhead | ~40MB | Platform overhead |
| Prerequisites | None | Kubernetes knowledge helpful |

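The "binary + YAML" row is literal: one small file is a complete deployment. A minimal sketch, assuming a `server` block for the listen address (key names and the default port follow Olla's published examples; verify against your version's configuration reference):

```yaml
# config.yaml - a complete minimal deployment (key names assumed).
server:
  host: 0.0.0.0
  port: 40114          # Olla's documented default port (verify)

endpoints:
  - name: local-ollama
    url: http://localhost:11434
    type: ollama
    priority: 1
```
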
## When to Use Each

### Use Olla When:

- You have existing LLM services running
- Need intelligent routing between endpoints
- Want automatic failover without re-deployment
- Require advanced load balancing (see the sketch after this list)
- Working with multiple LLM providers
- Need minimal resource overhead

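On the load-balancing point, strategy selection is a one-line change. A minimal sketch, assuming a `proxy.load_balancer` key accepting `priority`, `round-robin` and `least-connections` (names taken from Olla's published examples; check your version's reference):

```yaml
# Sketch only - key names assumed from Olla's published examples.
proxy:
  load_balancer: priority      # or: round-robin, least-connections

endpoints:
  - name: local-ollama
    url: http://localhost:11434
    type: ollama
    priority: 1                # lower value = preferred tier
```

With the `priority` strategy, traffic goes to the healthiest endpoint in the lowest-numbered tier, which is the idea behind the failover tiers in the examples below.
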
### Use GPUStack When:

- Starting from raw GPU hardware
- Need to dynamically deploy models
- Want Kubernetes-like orchestration
- Managing a cluster of GPUs
- Require automatic model distribution
- Need GPU-aware scheduling

## Better Together: Complementary Architecture

Olla and GPUStack work well together:

```yaml
# Olla configuration
endpoints:
  # GPUStack-managed endpoints
  - name: gpustack-pool-1
    url: http://gpustack-1.internal:8080
    priority: 1
    type: openai

  - name: gpustack-pool-2
    url: http://gpustack-2.internal:8080
    priority: 1
    type: openai

  # Other endpoints
  - name: ollama-backup
    url: http://backup-server:11434
    priority: 2
    type: ollama

  - name: cloud-overflow
    url: http://litellm:8000
    priority: 10
    type: openai
```

### Benefits of Combined Deployment:

1. **GPUStack manages the GPU infrastructure**
    - Deploys models based on demand
    - Handles GPU allocation
    - Manages model lifecycle

2. **Olla provides the reliability layer** (see the health-check sketch after this list)
    - Routes between GPUStack instances
    - Fails over to backup endpoints
    - Provides circuit breakers
    - Unifies access to all endpoints

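That reliability layer rests on continuous health checking of each endpoint. A sketch of per-endpoint health settings, assuming `health_check_url`, `check_interval` and `check_timeout` fields (these names follow Olla's published examples but may differ between versions):

```yaml
# Assumed field names - confirm against your Olla version's docs.
endpoints:
  - name: gpustack-pool-1
    url: http://gpustack-1.internal:8080
    type: openai
    priority: 1
    health_check_url: /health   # path Olla polls for liveness (assumed)
    check_interval: 5s          # how often to probe (assumed)
    check_timeout: 2s           # mark unhealthy if no reply in time (assumed)
```
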
## Real-World Scenarios

### Scenario 1: GPU Cluster with Fallbacks

```
             Olla
               │
   ┌───────────┼────────────┐
   ↓           ↓            ↓
GPUStack     Ollama     Cloud API
(Primary)   (Backup)   (Overflow)
```

**How it works**:

- GPUStack manages your main GPU cluster
- Olla routes requests, preferring GPUStack
- Falls back to Ollama if the cluster is busy
- Overflows to the cloud if everything is saturated

### Scenario 2: Multi-Site Deployment

```
     Global Olla Instance
               │
   ┌───────────┼───────────┐
   ↓           ↓           ↓
GPUStack    GPUStack     Direct
(Sydney)   (Melbourne)  Endpoints
```

### Scenario 3: Development to Production

```
Development:  Laptop → Olla → Local Ollama
                         ↓
                  Cloud (fallback)

Production:   Apps → Olla → GPUStack Cluster
                       ↓
                Cloud (overflow)
```

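The same configuration shape serves both stages; only the endpoint list changes. A development profile might look like this (hostnames reuse the examples above and are placeholders):

```yaml
# Development profile - endpoint names and URLs are illustrative.
endpoints:
  - name: local-ollama
    url: http://localhost:11434
    priority: 1                # laptop-local Ollama is preferred
    type: ollama

  - name: cloud-fallback
    url: http://litellm:8000   # any OpenAI-compatible gateway
    priority: 10
    type: openai
```

For production, replace `local-ollama` with the GPUStack endpoints shown earlier; applications point at Olla in both environments, so promoting a workload doesn't change client code.
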
## Integration Patterns

### Pattern 1: GPUStack Primary, Others Secondary

```yaml
# Olla prioritises GPUStack but maintains alternatives
endpoints:
  - name: gpustack-primary
    url: http://gpustack:8080
    priority: 1
  - name: manual-backup
    url: http://ollama:11434
    priority: 5
```

### Pattern 2: Geographic Distribution

```yaml
# Olla routes to nearest GPUStack region
endpoints:
  - name: gpustack-syd
    url: http://syd.gpustack:8080
    priority: 1  # For Sydney users
  - name: gpustack-mel
    url: http://mel.gpustack:8080
    priority: 1  # For Melbourne users
```

Note that a single Olla instance balances across equal-priority endpoints; for strict geographic affinity, run an Olla instance per region with the local endpoint given the lowest priority number.

## Performance Considerations

### Resource Usage

- **Olla**: ~40MB RAM, negligible CPU
- **GPUStack**: Platform overhead + model memory
- **Combined**: Minimal additional overhead from Olla

### Latency

- **Olla routing**: <2ms overhead
- **GPUStack**: Model loading time (first request)
- **Combined**: Olla can route around cold-start delays

## Common Questions

**Q: Does Olla duplicate GPUStack's routing?**
A: No. GPUStack does basic request distribution. Olla adds sophisticated load balancing, circuit breakers, and multi-provider support.

**Q: Can Olla deploy models like GPUStack?**
A: No. Olla only routes to existing endpoints. Use GPUStack for model deployment.

**Q: Should I use both in production?**
A: Yes, if you need both GPU orchestration and reliable routing. They're designed for different layers.

**Q: Can Olla route to non-GPUStack endpoints?**
A: Absolutely! Olla works with any HTTP-based LLM endpoint.

## Migration Patterns

### Adding Olla to GPUStack

1. Deploy Olla in front of GPUStack endpoints
2. Configure health checks and priorities (see the sketch after this list)
3. Add backup endpoints (Ollama, cloud)
4. Point applications to Olla

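A sketch covering steps 1-3 in a single config, reusing the assumed health-check keys from the earlier sketch:

```yaml
# Steps 1-3 in one file. Health-check keys are assumed names;
# verify against your Olla version's configuration reference.
endpoints:
  - name: gpustack-primary        # step 1: front the GPUStack endpoint
    url: http://gpustack:8080
    type: openai
    priority: 1                   # step 2: preferred tier
    health_check_url: /health     # step 2: liveness probe path (assumed)
    check_interval: 5s            # (assumed)

  - name: ollama-backup           # step 3: backup endpoint
    url: http://ollama:11434
    type: ollama
    priority: 5
```

Step 4 is an application-side change: clients switch their base URL from the GPUStack endpoint to Olla's listen address.
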
### Adding GPUStack to Olla Setup

1. Deploy GPUStack cluster
2. Add GPUStack endpoints to Olla config
3. Set appropriate priorities
4. Monitor and adjust load balancing

## Conclusion

GPUStack and Olla are complementary tools that excel at different layers:

- **GPUStack**: Infrastructure orchestration and model deployment
- **Olla**: Intelligent routing and reliability

Together, they provide a complete solution: GPUStack manages your GPU infrastructure while Olla ensures reliable, intelligent access to all your LLM resources.
