feat: add remote LLM backend with averaged embedding optimization#325

Open
olyashok wants to merge 2 commits into tobi:main from cellect-ai:feat/remote-llm-backend

Conversation

olyashok commented Mar 8, 2026

Summary

Adds an optional HTTP-based remote LLM backend (targeting llama.cpp servers) as an alternative to local node-llama-cpp, plus an embedding optimization that reduces query time significantly on large indexes.

New features:

  • src/llm-remote.ts: RemoteLLM class — embed/rerank/generate via HTTP (llama.cpp compatible)
  • hybridQuery, vectorSearchQuery, structuredSearch accept an optional llm override so callers can use a remote backend without changing global config
  • Average all expanded query embeddings into a single vector → one sqlite-vec scan instead of N, cutting query time from ~47s to ~12s on a 25 GB index
  • New CLI commands for remote mode; --local flag to force local node-llama-cpp
  • .qmd directory resolution for DB and collection config paths
  • server/docker-compose.yml for spinning up llama.cpp embed/rerank/generate services
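The averaged-embedding optimization in the list above can be sketched as follows. This is a minimal illustration with hypothetical helper names, not the PR's actual store.ts code: rather than running one sqlite-vec scan per expanded query, all expansion embeddings are averaged into a single vector and scanned once.

```typescript
// Average N query-expansion embeddings into a single vector so the vector
// store is scanned once instead of N times. Hypothetical helpers; the PR's
// real implementation lives in store.ts.
function averageEmbeddings(embeddings: number[][]): number[] {
  if (embeddings.length === 0) throw new Error("no embeddings to average");
  const dim = embeddings[0].length;
  const avg = new Array<number>(dim).fill(0);
  for (const vec of embeddings) {
    for (let i = 0; i < dim; i++) avg[i] += vec[i];
  }
  for (let i = 0; i < dim; i++) avg[i] /= embeddings.length;
  return avg;
}

// Cosine-similarity search generally expects unit-length vectors, so
// re-normalizing after averaging keeps distances comparable.
function normalize(vec: number[]): number[] {
  const norm = Math.sqrt(vec.reduce((s, x) => s + x * x, 0));
  return norm === 0 ? vec : vec.map((x) => x / norm);
}
```

The trade-off, reflected in the test plan below, is that a single averaged scan must retrieve results of comparable quality to N separate scans merged together.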

Why: On large corpora (200k+ files) local embedding becomes a bottleneck. This lets QMD scale to server-side models while keeping local mode fully intact.

Test plan

  • Local mode (--local) still works unchanged
  • Remote mode connects to llama.cpp server and returns correct results
  • Averaged embeddings produce equivalent quality results to N-scan approach

🤖 AI-assisted (Claude) | Tested on local instance with ~25 GB index

Claude and others added 2 commits March 11, 2026 12:36
Wire remote HTTP-based LLM (embed/rerank/generate via llama.cpp servers)
as an alternative to local node-llama-cpp. Add `llm?: LLM` option to
hybridQuery, vectorSearchQuery, and structuredSearch so callers can
override the LLM backend. Average all expanded query embeddings into a
single vector for one sqlite-vec scan instead of N separate scans,
reducing query time from ~47s to ~12s on a 25 GB index.

Key changes:
- llm-remote.ts: RemoteLLM class with HTTP embed/rerank/generate
- store.ts: LLM passthrough for expandQuery, embedBatch, rerank;
  averaged embedding scan in hybridQuery and vectorSearchQuery;
  .qmd directory resolution for DB path
- qmd.ts: withLLMSessionAuto, remote CLI commands, --local flag
- collections.ts: config path resolution via .qmd directory
- llm.ts: embedBatch added to LLM interface

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Updated the embed method in RemoteLLM to retry transient errors from the remote server, allowing up to three attempts per request. Improved error handling and logging for both individual and batch embedding, and clarified comments on the batch-embedding fallback behavior.

Key changes:
- Added retry logic in embed method for transient errors
- Enhanced error handling and logging
- Updated comments for clarity on batch embedding behavior
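The retry behavior described above can be sketched as a small generic helper. This is an assumption-laden illustration, not the PR's code (the PR implements retries inside RemoteLLM.embed): transient failures are retried up to three attempts before the last error is rethrown.

```typescript
// Retry an async operation up to maxAttempts times, rethrowing the final
// error if all attempts fail. Hypothetical helper; the PR inlines this
// logic in RemoteLLM.embed.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // A fuller implementation would classify the error (timeout, 5xx)
      // as transient vs. permanent and back off before the next attempt.
    }
  }
  throw lastError;
}
```

A production version would typically add exponential backoff and only retry errors known to be transient, so that 4xx responses fail fast.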
olyashok force-pushed the feat/remote-llm-backend branch from 7f4e1e4 to 575b9ea on March 11, 2026 12:37
