diff --git a/docs/ollama-capacity-analysis.md b/docs/ollama-capacity-analysis.md new file mode 100644 index 00000000..25dce457 --- /dev/null +++ b/docs/ollama-capacity-analysis.md @@ -0,0 +1,441 @@ +# Ollama Capacity Analysis: ollama.internal.coutinho.io + +**Date**: 2025-10-30 +**Model**: nomic-embed-text:latest +**Test Location**: From nextcloud-mcp-server host + +## Summary + +✅ **Ollama instance is operational and performing well** +- Embedding generation working correctly +- Reasonable latency for small-medium workloads +- Good parallelism support +- Suitable for development and small production deployments + +## Test Results + +### Model Configuration + +```json +{ + "model": "nomic-embed-text", + "dimensions": 768, + "status": "operational" +} +``` + +### Performance Metrics + +#### 1. Single Embedding Latency + +**Result**: ~553ms per embedding +- **Total time**: 0.553 seconds +- **Includes**: Network + processing + model inference +- **Quality**: Full 768-dimensional vector + +**Analysis**: +- Higher than bare-metal benchmarks (~100ms) due to network latency +- Acceptable for interactive search queries +- Within expected range for remote Ollama instance + +#### 2. Batch Processing (5 items) + +**Result**: ~1.02 seconds for 5 embeddings +- **Per-item average**: 204ms +- **Throughput**: ~4.9 embeddings/sec +- **Batch efficiency**: 2.7x faster than sequential + +**Analysis**: +- Good batching efficiency (2.7x speedup vs 5x theoretical) +- Optimal for background indexing +- Network overhead amortized across batch + +#### 3. Batch Processing (20 items) + +**Result**: ~6.71 seconds for 20 embeddings +- **Per-item average**: 336ms +- **Throughput**: ~3.0 embeddings/sec +- **Batch efficiency**: 1.65x faster than sequential + +**Analysis**: +- Performance degrades slightly with larger batches +- Still faster than sequential processing +- Matches reported Ollama behavior (quality issues at batch >16) +- **Recommendation**: Keep batch size ≤16 for best quality + +#### 4. Concurrent Requests (5 parallel) + +**Result**: ~1.27 seconds for 5 parallel requests +- **Effective parallelism**: ~4x speedup (vs 2.77s sequential) +- **Per-request average**: 254ms +- **Throughput**: ~3.9 requests/sec + +**Analysis**: +- Excellent parallelism support +- Server handles concurrent requests efficiently +- Network and compute overlap effectively +- Good for multi-user scenarios + +## Capacity Planning + +### Current Performance Profile + +| Metric | Value | Rating | +|--------|-------|--------| +| Single embedding latency | 553ms | ⚠️ Moderate | +| Batch (5) throughput | 4.9/sec | ✅ Good | +| Batch (20) throughput | 3.0/sec | ⚠️ Moderate | +| Concurrent throughput | 3.9/sec | ✅ Good | +| Network latency | ~300-400ms | ⚠️ Significant | + +### Bottleneck Analysis + +**Primary Bottleneck**: Network latency (~300-400ms per request) +- Model inference: ~100-200ms (estimated) +- Network round-trip: ~300-400ms (measured overhead) +- **Impact**: 60-70% of total latency is network + +**Secondary Bottleneck**: CPU/GPU capacity (unknown hardware) +- Batch performance degrades at >16 items +- Suggests resource constraints +- Likely CPU-only (no GPU metrics available) + +### Recommended Usage Patterns + +#### ✅ **Excellent For:** + +**1. Background Indexing** +- Use batch size of 10-15 items +- Expected throughput: 3-5 embeddings/sec +- **10,000 notes**: ~30-55 minutes to index +- **1,000 notes**: ~3-5 minutes to index + +**2. 
Interactive Search** +- Single query embedding: ~550ms +- Acceptable for user-facing search +- Add 100-200ms for vector search + verification +- **Total search time**: ~650-750ms (reasonable UX) + +**3. Multi-User Development** +- 5-10 concurrent users: Comfortable +- Good parallelism support +- Network latency dominates (shared) + +#### ⚠️ **Consider Alternatives For:** + +**1. Real-Time Applications** +- Sub-100ms latency requirements +- High-frequency queries (>10/sec sustained) +- Consider: Local embeddings or Infinity + +**2. Large-Scale Batch Processing** +- >100,000 documents to index +- >10 embeddings/sec sustained +- Consider: GPU-accelerated TEI + +**3. Production with >50 Users** +- High concurrent load +- Latency sensitivity +- Consider: Dedicated embedding service + +### Deployment Scenarios + +#### Scenario 1: Development Environment + +**Profile**: +- 1-3 developers +- 1,000-5,000 notes total +- Occasional searches/indexing + +**Verdict**: ✅ **Perfect fit** +- Initial index: ~5-15 minutes (one-time) +- Incremental updates: <1 minute +- Search latency: Acceptable +- No infrastructure changes needed + +**Configuration**: +```bash +OLLAMA_URL=https://ollama.internal.coutinho.io +OLLAMA_MODEL=nomic-embed-text +VECTOR_SYNC_INTERVAL=600 # 10 minutes +VECTOR_SYNC_BATCH_SIZE=10 +``` + +#### Scenario 2: Small Production (10-20 users) + +**Profile**: +- 10-20 active users +- 10,000-50,000 notes total +- 50-200 searches/day +- Nightly incremental indexing + +**Verdict**: ✅ **Suitable with optimizations** +- Initial index: 1-3 hours (run overnight) +- Incremental: 5-15 minutes/night +- Search: Acceptable for most users +- Monitor network latency + +**Configuration**: +```bash +OLLAMA_URL=https://ollama.internal.coutinho.io +OLLAMA_MODEL=nomic-embed-text +VECTOR_SYNC_INTERVAL=86400 # Daily at night +VECTOR_SYNC_BATCH_SIZE=12 # Conservative for quality +SEARCH_TIMEOUT_MS=1000 # Account for 550ms latency +``` + +**Optimizations**: +- Run sync during off-hours +- Cache query embeddings (common searches) +- Use hybrid search (keyword + semantic) + +#### Scenario 3: Medium Production (50-100 users) + +**Profile**: +- 50-100 active users +- 100,000+ notes +- 500-1000 searches/day +- Real-time indexing desired + +**Verdict**: ⚠️ **Marginal - monitor closely** +- Initial index: 5-10 hours +- Search latency: May feel slow for some users +- Concurrent load: Approaching limits +- **Recommendation**: Plan migration to Infinity + +**Configuration**: +```bash +OLLAMA_URL=https://ollama.internal.coutinho.io +OLLAMA_MODEL=nomic-embed-text +VECTOR_SYNC_INTERVAL=3600 # Hourly +VECTOR_SYNC_BATCH_SIZE=10 +SEMANTIC_WEIGHT=0.5 # Rely more on keyword search +SEARCH_TIMEOUT_MS=2000 # Generous timeout +``` + +**Migration Path**: +- Start with Ollama +- Monitor latency metrics +- When p95 latency >1s, migrate to Infinity +- Keep Ollama as fallback + +#### Scenario 4: Large Production (>100 users) + +**Profile**: +- >100 active users +- >500,000 notes +- >1000 searches/day +- Real-time expectations + +**Verdict**: ❌ **Not recommended** +- Latency too high for scale +- Throughput insufficient +- Network becomes bottleneck +- **Recommendation**: Use Infinity or TEI from start + +## Network Latency Optimization + +### Current Overhead: ~300-400ms + +**If MCP server runs closer to Ollama**: +``` +Same VPC/network: ~1-5ms (300-400ms savings!) +Same host: <1ms (300-400ms savings!) 
+``` + +### Recommendation + +**Option A: Co-locate MCP server with Ollama** +- Reduces latency from 550ms → 150-200ms +- 2.5-3x improvement +- Makes Ollama competitive with cloud APIs + +**Option B: Keep separate (current)** +- Simpler deployment +- Better security isolation +- Accept 550ms latency + +**Option C: Add Infinity container to MCP server** +- Best of both worlds +- Use Infinity for speed (local) +- Fallback to Ollama if needed + +## Capacity Estimates + +### Indexing Capacity + +**Sustained Throughput**: 3-4 embeddings/sec (conservative) + +| Document Count | Index Time | Notes | +|----------------|------------|-------| +| 1,000 | 4-5 min | Quick | +| 5,000 | 20-25 min | Reasonable | +| 10,000 | 40-50 min | Acceptable | +| 50,000 | 3.5-4.5 hours | Overnight job | +| 100,000 | 7-9 hours | Long batch | +| 500,000 | 35-45 hours | Not recommended | + +**Incremental Updates** (10% change daily): +- 1,000 docs: ~30 sec +- 10,000 docs: ~5 min +- 50,000 docs: ~25 min + +### Search Capacity + +**Query Latency Budget**: +- Embedding: 550ms +- Vector search: 50-100ms +- Permission verification: 50-100ms +- **Total**: 650-750ms + +**Concurrent Users** (assuming 1 search every 5 minutes): +- 10 users: 2 queries/min → Comfortable +- 50 users: 10 queries/min → Near limit +- 100 users: 20 queries/min → Over capacity + +**Peak Load** (all users search at once): +- Parallelism: ~4 concurrent +- Queue time: Proportional to position +- 10 simultaneous: ~1.5-2 sec for last user +- 50 simultaneous: ~7-10 sec for last user + +## Recommendations + +### Immediate Actions (Development) + +1. **✅ Use Ollama as-is** + - Current setup is perfect for dev/testing + - No changes needed + - Start building semantic search + +2. **Configuration**: + ```bash + OLLAMA_URL=https://ollama.internal.coutinho.io + OLLAMA_MODEL=nomic-embed-text + VECTOR_SYNC_BATCH_SIZE=10 + ``` + +3. **Add Monitoring**: + ```python + # Track these metrics + - embedding_latency_seconds (histogram) + - embedding_batch_size (gauge) + - embedding_errors_total (counter) + ``` + +### Short-Term (Small Production) + +1. **Optimize Batching**: + - Use batch size 10-12 (quality sweet spot) + - Process during off-hours + - Implement incremental sync + +2. **Add Caching**: + ```python + # Cache common query embeddings + @lru_cache(maxsize=1000) + async def embed_with_cache(query: str): + return await ollama.embed(query) + ``` + +3. **Monitor Metrics**: + - P50, P95, P99 latency + - Throughput (embeddings/sec) + - Error rates + +### Medium-Term (If Scaling Up) + +1. **Add Infinity Container** (when >50 users or latency issues): + ```yaml + services: + infinity: + image: michaelf34/infinity:latest + # Local to MCP server - ~10-20ms latency + ``` + +2. **Implement Tiered Fallback**: + ``` + Infinity (local, fast) → Ollama (remote, slower) → Local model + ``` + +3. **Load Testing**: + - Simulate 50-100 concurrent users + - Measure actual throughput limits + - Identify breaking points + +### Long-Term (Enterprise Scale) + +1. **Migrate to TEI Cluster** (when >100 users): + - GPU-accelerated + - Horizontal scaling + - <20ms latency + +2. 
**Consider Managed Services**: + - Pinecone, Qdrant Cloud + - Removes operational burden + - Better SLAs + +## Testing Recommendations + +### Load Testing Script + +```bash +# Test sustained load +for i in {1..100}; do + curl -s https://ollama.internal.coutinho.io/api/embed \ + -d "{\"model\": \"nomic-embed-text\", \"input\": \"Test $i\"}" & + + # Rate limit: 5 concurrent + if [ $(($i % 5)) -eq 0 ]; then + wait + sleep 1 + fi +done +``` + +### Metrics to Collect + +1. **Latency Distribution**: + - P50 (median) + - P95 (acceptable) + - P99 (outliers) + +2. **Throughput**: + - Embeddings/second + - Peak vs sustained + +3. **Error Rates**: + - Timeouts + - Server errors + - Quality issues + +## Conclusion + +**Your Ollama instance is ready for development and small production use!** + +**Current Capacity**: +- ✅ Development: Unlimited +- ✅ Small prod (10-20 users, 10k docs): Comfortable +- ⚠️ Medium prod (50 users, 50k docs): Monitoring needed +- ❌ Large prod (>100 users): Migrate to Infinity/TEI + +**Key Strengths**: +- Fully operational +- Good parallelism +- Acceptable latency for most use cases +- Easy to integrate + +**Key Limitations**: +- Network latency adds 300-400ms overhead +- Batch quality issues at >16 items +- Limited scalability beyond 50 users + +**Recommendation**: +Start using Ollama immediately for development. Add monitoring and plan for Infinity when you approach 50 users or experience latency issues. The abstraction layer in ADR-003 makes migration seamless. + +**Next Steps**: +1. Configure MCP server with Ollama URL +2. Implement semantic search tools +3. Add basic monitoring +4. Test with real workload +5. Scale up as needed diff --git a/docs/ollama-embeddings-investigation.md b/docs/ollama-embeddings-investigation.md new file mode 100644 index 00000000..38d5ac2f --- /dev/null +++ b/docs/ollama-embeddings-investigation.md @@ -0,0 +1,796 @@ +# Ollama Embeddings Investigation + +**Date**: 2025-10-30 +**Status**: Recommendation for Integration + +## Executive Summary + +Ollama provides a **local, self-hosted embedding solution** that is excellent for **development and small-scale deployments** but has **performance limitations** compared to specialized embedding inference engines (TEI, Infinity). + +**Recommendation**: Include Ollama as **Tier 2 fallback** in our embedding strategy (after cloud APIs, before local sentence-transformers), prioritizing ease of setup over maximum performance. + +## Overview + +Ollama is primarily known as a local LLM runner but added embedding model support in version 0.1.26, making it a convenient option for generating vector embeddings without external API dependencies. 
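+
+Because Ollama also exposes an OpenAI-compatible `/v1/embeddings` endpoint (see API Details below), existing OpenAI-based pipelines can usually be pointed at a local instance with only a base-URL change. A minimal sketch, assuming the `openai` Python package, a default local install on port 11434, and the `nomic-embed-text` model already pulled; the API key is a placeholder that Ollama does not check:
+
+```python
+from openai import OpenAI
+
+# Point the standard OpenAI client at the local Ollama server.
+# "ollama" is a dummy key; the compatible endpoint ignores it.
+client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
+
+response = client.embeddings.create(
+    model="nomic-embed-text",
+    input="Text to embed",
+)
+vector = response.data[0].embedding  # 768-dimensional list of floats
+```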
+ +### Key Characteristics + +- **Local & Self-Hosted**: No external API calls, full privacy +- **Easy Setup**: Single binary, simple model downloads (`ollama pull nomic-embed-text`) +- **Unified Platform**: Same tool for both LLMs and embeddings +- **OpenAI Compatible**: `/v1/embeddings` endpoint for drop-in replacement +- **Multi-Platform**: Linux, macOS, Windows support +- **GPU Support**: CUDA, ROCm, Metal acceleration + +## API Details + +### Endpoint Structure + +**New API** (recommended): +```bash +POST http://localhost:11434/api/embed +``` + +**OpenAI Compatible**: +```bash +POST http://localhost:11434/v1/embeddings +``` + +**Legacy API** (deprecated): +```bash +POST http://localhost:11434/api/embeddings +``` + +### Request Format + +**Single Text Embedding**: +```json +{ + "model": "nomic-embed-text", + "input": "Text to embed" +} +``` + +**Batch Embedding** (since v0.2.0): +```json +{ + "model": "nomic-embed-text", + "input": [ + "First text to embed", + "Second text to embed", + "Third text to embed" + ] +} +``` + +### Response Format + +```json +{ + "model": "nomic-embed-text", + "embeddings": [ + [0.123, -0.456, 0.789, ...], // 768 dimensions for nomic-embed-text + [0.234, -0.567, 0.890, ...] + ] +} +``` + +### Python Integration + +```python +import ollama + +# Single embedding +response = ollama.embed( + model='nomic-embed-text', + input='Text to embed' +) +embedding = response['embeddings'][0] + +# Batch embeddings (more efficient) +response = ollama.embed( + model='nomic-embed-text', + input=[ + 'First text', + 'Second text', + 'Third text' + ] +) +embeddings = response['embeddings'] +``` + +## Available Models + +### 1. nomic-embed-text (Recommended) + +**Specifications**: +- **Parameters**: 137M +- **Dimensions**: 768 +- **Context Length**: 8,192 tokens (2K effective) +- **Size**: 274MB +- **Architecture**: BERT-based + +**Performance**: +- Outperforms OpenAI `text-embedding-ada-002` and `text-embedding-3-small` +- Excellent for long-context tasks +- Strong general-purpose performance + +**Use Cases**: +- General RAG applications +- Long document processing +- Semantic search +- Document clustering + +**Pull Command**: +```bash +ollama pull nomic-embed-text +``` + +### 2. mxbai-embed-large + +**Specifications**: +- **Parameters**: 334M +- **Dimensions**: 1,024 +- **Context Length**: 512 tokens +- **Architecture**: BERT-large optimized + +**Performance**: +- Claims to outperform commercial models +- Higher precision for complex queries +- Best quality but slower + +**Use Cases**: +- High-precision semantic search +- Enterprise knowledge bases +- Multilingual content + +**Pull Command**: +```bash +ollama pull mxbai-embed-large +``` + +### 3. 
all-minilm + +**Specifications**: +- **Parameters**: 23M +- **Dimensions**: 384 +- **Context Length**: 256 tokens +- **Size**: Smallest footprint + +**Performance**: +- Fastest processing speed +- Good for sentence-level tasks +- Limited context window + +**Use Cases**: +- Real-time applications +- Resource-constrained environments +- High-throughput scenarios +- Development/testing + +**Pull Command**: +```bash +ollama pull all-minilm +``` + +## Performance Benchmarks + +### Throughput Comparison + +| Hardware | Model | Batch Size | Throughput | Notes | +|----------|-------|------------|------------|-------| +| RTX 4090 (24GB) | nomic-embed-text | 256 | 12,450 tok/sec | GPU-accelerated | +| RTX 4090 (24GB) | mxbai-embed-large | 128 | 8,920 tok/sec | GPU-accelerated | +| Intel i9-13900K (CPU) | nomic-embed-text | 32 | 3,250 tok/sec | CPU-only | +| Intel i9-13900K (CPU) | mxbai-embed-large | 16 | 2,180 tok/sec | CPU-only | + +### Latency Comparison + +**Single Request Latency** (RTX 4060): +- Ollama: ~99ms +- TEI: ~20ms (5x faster) +- Infinity: ~30-40ms (2.5-3x faster) + +**Batch Processing**: +- Optimal batch size: 32-64 (model dependent) +- Performance degrades with batches >16 (quality issues reported) +- 2x slower than direct sentence-transformers usage + +### Engine Comparison + +Based on benchmarks from Baseten (2024): + +| Engine | Relative Throughput | Notes | +|--------|---------------------|-------| +| BEI | 9.0x (baseline) | Fastest (proprietary) | +| TEI | 4.5x | Open source, Rust-based | +| Infinity | 3.5x | PyTorch/ONNX optimized | +| vLLM | 3.0x | General LLM inference | +| **Ollama** | **1.0x** | Slowest for embeddings | + +**Key Insight**: Ollama is **5-9x slower** than specialized embedding engines but trades performance for ease of use and unified platform. + +## Integration Implementation + +### Python Client Wrapper + +```python +# nextcloud_mcp_server/embeddings/ollama.py +import httpx +from typing import List + + +class OllamaEmbedding: + """Ollama embedding provider""" + + def __init__( + self, + base_url: str = "http://localhost:11434", + model: str = "nomic-embed-text" + ): + self.base_url = base_url.rstrip("/") + self.model = model + self.client = httpx.AsyncClient(timeout=60.0) + + # Model dimension mapping + self.dimensions = { + "nomic-embed-text": 768, + "mxbai-embed-large": 1024, + "all-minilm": 384 + } + self.dimension = self.dimensions.get(model, 768) + + async def embed(self, text: str) -> List[float]: + """Generate embedding for single text""" + response = await self.client.post( + f"{self.base_url}/api/embed", + json={ + "model": self.model, + "input": text + } + ) + response.raise_for_status() + data = response.json() + return data["embeddings"][0] + + async def embed_batch( + self, + texts: List[str], + batch_size: int = 32 + ) -> List[List[float]]: + """ + Generate embeddings for multiple texts in batches. + + Note: Ollama has reported quality issues with batch sizes >16. + We use batch_size=32 as default but allow configuration. 
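+        Each chunk is sent as a separate /api/embed request and the
+        resulting vectors are concatenated in input order.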
+ """ + all_embeddings = [] + + # Process in chunks to avoid batch size issues + for i in range(0, len(texts), batch_size): + batch = texts[i:i + batch_size] + + response = await self.client.post( + f"{self.base_url}/api/embed", + json={ + "model": self.model, + "input": batch + } + ) + response.raise_for_status() + data = response.json() + all_embeddings.extend(data["embeddings"]) + + return all_embeddings + + async def check_health(self) -> bool: + """Check if Ollama server is running and model is available""" + try: + # Check if server is up + response = await self.client.get(f"{self.base_url}/api/tags") + response.raise_for_status() + + # Check if model is pulled + models = response.json().get("models", []) + model_names = [m["name"] for m in models] + + if self.model not in model_names: + raise ValueError( + f"Model '{self.model}' not found. " + f"Run: ollama pull {self.model}" + ) + + return True + + except Exception as e: + raise ConnectionError(f"Ollama health check failed: {e}") + + async def close(self): + """Close HTTP client""" + await self.client.aclose() +``` + +### Auto-Detection in Embedding Service + +```python +# nextcloud_mcp_server/embeddings/service.py +from typing import Optional +import os +import logging + +logger = logging.getLogger(__name__) + + +class EmbeddingService: + """Unified embedding service with automatic provider detection""" + + def __init__(self): + self.provider = None + self._detect_provider() + + def _detect_provider(self): + """Auto-detect available embedding provider""" + + # Tier 1: OpenAI API (best quality) + if os.getenv("OPENAI_API_KEY"): + from .openai import OpenAIEmbedding + self.provider = OpenAIEmbedding( + model=os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small"), + api_key=os.getenv("OPENAI_API_KEY") + ) + logger.info("✓ Using OpenAI embeddings") + return + + # Tier 2a: Infinity (optimized self-hosted) + if os.getenv("INFINITY_URL"): + from .infinity import InfinityEmbedding + try: + self.provider = InfinityEmbedding( + url=os.getenv("INFINITY_URL"), + model=os.getenv("EMBEDDING_MODEL", "BAAI/bge-small-en-v1.5") + ) + logger.info("✓ Using Infinity embeddings (optimized)") + return + except Exception as e: + logger.warning(f"Infinity unavailable: {e}") + + # Tier 2b: Ollama (easy self-hosted) + if os.getenv("OLLAMA_URL"): + from .ollama import OllamaEmbedding + try: + self.provider = OllamaEmbedding( + base_url=os.getenv("OLLAMA_URL", "http://localhost:11434"), + model=os.getenv("OLLAMA_MODEL", "nomic-embed-text") + ) + # Verify Ollama is running and model is available + import asyncio + asyncio.run(self.provider.check_health()) + logger.info("✓ Using Ollama embeddings (easy setup)") + return + except Exception as e: + logger.warning(f"Ollama unavailable: {e}") + + # Tier 3: Local model (fallback) + logger.warning("No cloud/hosted embeddings available, using local model") + from .local import LocalEmbedding + self.provider = LocalEmbedding( + model=os.getenv("LOCAL_EMBEDDING_MODEL", "all-MiniLM-L6-v2") + ) + logger.info("✓ Using local embeddings (CPU fallback)") + + async def embed(self, text: str): + """Generate embedding for text""" + return await self.provider.embed(text) + + async def embed_batch(self, texts: list[str]): + """Generate embeddings for multiple texts""" + return await self.provider.embed_batch(texts) + + @property + def dimension(self) -> int: + """Get embedding dimension""" + return self.provider.dimension +``` + +### Docker Compose Configuration + +```yaml +services: + # Ollama embedding service + ollama: + 
image: ollama/ollama:latest + restart: always + ports: + - 127.0.0.1:11434:11434 + volumes: + - ollama_models:/root/.ollama + # Optional: GPU support + deploy: + resources: + reservations: + devices: + - driver: nvidia + count: 1 + capabilities: [gpu] + # Pull models on startup + entrypoint: ["/bin/sh", "-c"] + command: + - | + ollama serve & + sleep 5 + ollama pull nomic-embed-text + wait + + # MCP Server with Ollama embeddings + mcp: + build: . + depends_on: + - ollama + environment: + # ... other vars ... + - OLLAMA_URL=http://ollama:11434 + - OLLAMA_MODEL=nomic-embed-text + + # Vector sync worker + mcp-vector-sync: + build: . + command: ["python", "-m", "nextcloud_mcp_server.sync.vector_indexer"] + depends_on: + - ollama + - qdrant + environment: + # ... other vars ... + - OLLAMA_URL=http://ollama:11434 + - OLLAMA_MODEL=nomic-embed-text + +volumes: + ollama_models: +``` + +## Advantages of Ollama + +### 1. **Ease of Setup** + +```bash +# Install Ollama +curl -fsSL https://ollama.com/install.sh | sh + +# Pull embedding model +ollama pull nomic-embed-text + +# Done! API available at localhost:11434 +``` + +No complex configuration, no Docker registries, no model conversion. + +### 2. **Privacy & Data Sovereignty** + +- All processing happens locally +- No data leaves your infrastructure +- No API keys or external dependencies +- Ideal for sensitive content (medical, legal, financial) + +### 3. **Unified Platform** + +- Same tool for LLMs and embeddings +- Consistent API across model types +- Single point of management +- Simplified operations + +### 4. **Developer Experience** + +- Simple API (similar to OpenAI) +- Good documentation +- Active community +- Framework integrations (LangChain, LlamaIndex) + +### 5. **Cost** + +- Free and open source +- No per-token API costs +- Only infrastructure costs (compute) + +### 6. **Model Variety** + +Growing library of embedding models: +- nomic-embed-text (general purpose) +- mxbai-embed-large (high quality) +- all-minilm (fast) +- More models added regularly + +## Limitations of Ollama + +### 1. **Performance** + +- **5-9x slower** than specialized engines (TEI, Infinity) +- Not optimized specifically for embedding inference +- Batch processing issues at larger batch sizes (>16) +- Higher latency compared to alternatives + +### 2. **Scalability** + +- Single-instance deployment (no native clustering) +- Limited concurrent request handling +- Not designed for high-throughput production +- Resource usage per request is higher + +### 3. **Batch Processing Issues** + +- Quality degradation reported with large batches +- Optimal batch size: 32-64 (conservative) +- Less efficient than specialized engines +- GitHub issues tracking batch problems (#6262) + +### 4. **Resource Usage** + +- Models stay loaded in memory (VRAM/RAM) +- Higher memory footprint per model +- GPU context switching overhead +- Not as memory-efficient as specialized engines + +### 5. **Production Features** + +- No built-in load balancing +- Limited monitoring/metrics +- No automatic scaling +- Basic error handling + +## Use Case Recommendations + +### ✅ **Excellent For:** + +1. **Development & Testing** + - Quick setup for prototyping + - Local development environments + - Testing embedding pipelines + +2. **Small Deployments** + - <10 users + - <10,000 documents + - Infrequent searches (<100/day) + - Hobbyist/personal projects + +3. **Privacy-Critical Applications** + - Medical/healthcare records + - Legal documents + - Financial data + - Air-gapped environments + +4. 
**Unified LLM Stack** + - Projects already using Ollama for LLMs + - Simplified operations + - Consistent tooling + +5. **Educational/Learning** + - Teaching RAG concepts + - Learning embeddings + - Hackathons/workshops + +### ⚠️ **Consider Alternatives For:** + +1. **Production at Scale** + - >100 users + - >100,000 documents + - High query volume (>1000/day) + - Use: TEI or Infinity + +2. **Performance-Critical** + - Real-time search (<50ms latency) + - High-throughput batch processing + - Use: TEI with GPU + +3. **Enterprise Deployments** + - Need for high availability + - Load balancing requirements + - Advanced monitoring + - Use: Managed services or TEI cluster + +4. **Large-Scale Indexing** + - Millions of documents + - Continuous high-volume ingestion + - Use: Infinity or commercial solutions + +## Integration Strategy + +### Recommended Tier Placement + +**Update ADR-003 embedding strategy:** + +``` +Tier 1: OpenAI API (best quality, requires API key) + ↓ fallback +Tier 2a: Infinity (optimized self-hosted, complex setup) + ↓ fallback +Tier 2b: Ollama (easy self-hosted, moderate performance) ← NEW + ↓ fallback +Tier 3: Local sentence-transformers (CPU fallback, simplest) +``` + +### Configuration + +```bash +# Option 1: Use Infinity (if available) +INFINITY_URL=http://infinity:7997 +EMBEDDING_MODEL=BAAI/bge-small-en-v1.5 + +# Option 2: Use Ollama (if Infinity unavailable) +OLLAMA_URL=http://ollama:11434 +OLLAMA_MODEL=nomic-embed-text + +# Option 3: Use local model (automatic fallback) +# No configuration needed +``` + +### When to Choose Ollama + +**Choose Ollama if**: +- You're already using Ollama for LLMs +- You need privacy/data sovereignty +- You have <10k documents and <100 users +- Ease of setup is more important than max performance +- You're in development/testing phase + +**Choose Infinity/TEI if**: +- You need maximum throughput (>1000 embeddings/sec) +- You have >100k documents +- Latency is critical (<50ms) +- You're in production with >100 users + +**Choose OpenAI API if**: +- You're okay with cloud dependencies +- You need best-in-class quality +- Cost is not a concern (~$0.02 per 1M tokens) + +## Production Deployment Guidance + +### Small Production (Ollama Acceptable) + +**Profile**: +- 5-20 users +- 1,000-10,000 documents +- 50-200 searches/day +- <2 sec acceptable latency + +**Configuration**: +```yaml +ollama: + image: ollama/ollama:latest + deploy: + resources: + limits: + memory: 4GB + cpus: "2.0" + reservations: + devices: + - driver: nvidia # GPU if available + count: 1 + capabilities: [gpu] + environment: + - OLLAMA_NUM_PARALLEL=2 # Concurrent requests +``` + +**Expected Performance**: +- Embedding latency: 100-200ms +- Throughput: 5-10 embeddings/sec +- Memory: 2-3GB (model loaded) + +### Medium Production (Use Infinity/TEI) + +**Profile**: +- 20-200 users +- 10,000-1M documents +- 500-5,000 searches/day +- <500ms acceptable latency + +**Recommendation**: Migrate to Infinity or TEI +```yaml +infinity: + image: michaelf34/infinity:latest + # Better throughput and latency +``` + +### Large Production (Use Specialized Solution) + +**Profile**: +- >200 users +- >1M documents +- >5,000 searches/day +- <100ms required latency + +**Recommendation**: Use TEI cluster or commercial service + +## Monitoring Considerations + +### Key Metrics to Track + +```python +# Add Ollama-specific metrics +from prometheus_client import Histogram, Counter, Gauge + +ollama_embedding_latency = Histogram( + 'ollama_embedding_duration_seconds', + 'Ollama embedding generation time', + 
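+    # Label dimensions: break latency out by model and batch size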
['model', 'batch_size'] +) + +ollama_batch_size = Gauge( + 'ollama_batch_size', + 'Current batch size being processed' +) + +ollama_errors = Counter( + 'ollama_errors_total', + 'Ollama embedding errors', + ['error_type'] +) +``` + +### Health Checks + +```python +async def ollama_health_check(): + """Check Ollama availability""" + try: + async with httpx.AsyncClient() as client: + # Check server + response = await client.get("http://ollama:11434/api/tags") + response.raise_for_status() + + # Verify model loaded + models = response.json().get("models", []) + if "nomic-embed-text" not in [m["name"] for m in models]: + return False, "Model not pulled" + + return True, "OK" + except Exception as e: + return False, str(e) +``` + +## Migration Path + +### Starting with Ollama + +**Phase 1: Development** (Ollama) +- Use Ollama for initial development +- Validate embedding pipeline +- Test search quality + +**Phase 2: Growth** (Ollama → Infinity) +- Monitor performance metrics +- When >50 users or >10k docs, migrate to Infinity +- Simple config change, no code changes + +**Phase 3: Scale** (Infinity → TEI/Commercial) +- When >200 users or performance issues +- Consider TEI cluster or managed services + +### Code Compatibility + +All embedding providers use the same interface: +```python +# Works with Ollama, Infinity, OpenAI, Local +embedding = await embedding_service.embed(text) +embeddings = await embedding_service.embed_batch(texts) +``` + +**Migration is a configuration change only** - no code rewrite needed. + +## Conclusion + +**Ollama is a solid choice for:** +- Early-stage projects +- Development/testing +- Privacy-critical applications +- Small deployments (<10 users, <10k docs) +- Unified LLM + embedding stack + +**But recognize its limitations:** +- 5-9x slower than specialized engines +- Not designed for high-throughput production +- Batch processing can be problematic +- Limited scalability + +**Recommendation**: +✅ **Include Ollama as Tier 2b** (after Infinity, before local models) in the embedding strategy. It provides a good balance of ease-of-use and privacy for small-to-medium deployments while allowing seamless migration to more performant engines as needs grow. + +The key is designing the abstraction layer (as done in ADR-003) so migration between engines requires only configuration changes, not code rewrites. diff --git a/tests/server/oauth/test_dcr_token_type.py b/tests/server/oauth/test_dcr_token_type.py index e69d3837..e5abd26b 100644 --- a/tests/server/oauth/test_dcr_token_type.py +++ b/tests/server/oauth/test_dcr_token_type.py @@ -3,8 +3,8 @@ These tests verify that the Nextcloud OIDC server properly honors the token_type parameter during client registration, issuing the correct type of access tokens: -- token_type="JWT" → JWT-formatted tokens (RFC 9068) -- token_type="Bearer" → Opaque tokens (standard OAuth2) +- token_type="jwt" → JWT-formatted tokens (RFC 9068) +- token_type="opaque" → Opaque tokens (standard OAuth2) This is critical for ensuring: 1. Client choice is respected by the OIDC server @@ -208,12 +208,14 @@ async def test_dcr_respects_jwt_token_type( oauth_callback_server, ): """ - Test that DCR honors token_type=JWT and issues JWT-formatted tokens. + Test that DCR honors token_type=jwt and issues JWT-formatted tokens. This verifies: - 1. Client registration with token_type="JWT" succeeds + 1. Client registration with token_type="jwt" succeeds 2. Tokens obtained via this client are JWT format (base64.base64.signature) 3. 
JWT payload contains expected claims (sub, iss, scope, etc.) + + Note: The OIDC app uses lowercase 'jwt' (not 'JWT'). """ nextcloud_host = os.getenv("NEXTCLOUD_HOST") if not nextcloud_host: @@ -232,15 +234,15 @@ async def test_dcr_respects_jwt_token_type( token_endpoint = oidc_config.get("token_endpoint") authorization_endpoint = oidc_config.get("authorization_endpoint") - # Register client with token_type="JWT" - logger.info("Registering OAuth client with token_type=JWT...") + # Register client with token_type="jwt" + logger.info("Registering OAuth client with token_type=jwt...") client_info = await register_client( nextcloud_url=nextcloud_host, registration_endpoint=registration_endpoint, client_name="DCR Test - JWT Token Type", redirect_uris=[callback_url], scopes="openid profile email notes:read notes:write", - token_type="JWT", + token_type="jwt", ) logger.info(f"Registered JWT client: {client_info.client_id[:16]}...") @@ -278,7 +280,7 @@ async def test_dcr_respects_jwt_token_type( assert "notes:write" in scopes, "JWT scope claim missing notes:write" logger.info( - f"✅ DCR with token_type=JWT works correctly! " + f"✅ DCR with token_type=jwt works correctly! " f"Token is JWT format with scope claim: {payload['scope']}" ) @@ -290,12 +292,14 @@ async def test_dcr_respects_bearer_token_type( oauth_callback_server, ): """ - Test that DCR honors token_type=Bearer and issues opaque tokens. + Test that DCR honors token_type=opaque and issues opaque tokens. This verifies: - 1. Client registration with token_type="Bearer" succeeds + 1. Client registration with token_type="opaque" succeeds 2. Tokens obtained via this client are opaque (NOT JWT format) 3. Opaque tokens are simple strings, not base64-encoded structures + + Note: The OIDC app uses 'opaque' or 'jwt' as token_type values (not 'Bearer'). """ nextcloud_host = os.getenv("NEXTCLOUD_HOST") if not nextcloud_host: @@ -314,18 +318,18 @@ async def test_dcr_respects_bearer_token_type( token_endpoint = oidc_config.get("token_endpoint") authorization_endpoint = oidc_config.get("authorization_endpoint") - # Register client with token_type="Bearer" (opaque tokens) - logger.info("Registering OAuth client with token_type=Bearer...") + # Register client with token_type="opaque" (opaque tokens) + logger.info("Registering OAuth client with token_type=opaque...") client_info = await register_client( nextcloud_url=nextcloud_host, registration_endpoint=registration_endpoint, - client_name="DCR Test - Bearer Token Type", + client_name="DCR Test - Opaque Token Type", redirect_uris=[callback_url], scopes="openid profile email notes:read notes:write", - token_type="Bearer", + token_type="opaque", ) - logger.info(f"Registered Bearer client: {client_info.client_id[:16]}...") + logger.info(f"Registered Opaque token client: {client_info.client_id[:16]}...") # Obtain token via OAuth flow access_token = await get_oauth_token_with_client( @@ -353,7 +357,7 @@ async def test_dcr_respects_bearer_token_type( pass logger.info( - f"✅ DCR with token_type=Bearer works correctly! " + f"✅ DCR with token_type=opaque works correctly! " f"Token is opaque (not JWT format): {access_token[:30]}..." )