Releases: mostlydev/cllama
v0.3.3 — native Google Gemini provider support
Highlights
- Native `google` provider — adds first-class Google Gemini provider support to the cllama provider registry. Operators can now route models through `google/<model>` refs (e.g. `google/gemini-2.5-flash`) using a native `GEMINI_API_KEY` (with `GOOGLE_API_KEY` accepted as a lower-priority alias) instead of going through OpenRouter. The default API format is OpenAI-compatible against the Gemini OpenAI endpoint.
- Gemini cost tracking — pricing entries for Gemini 2.5 Flash and Gemini 2.5 Pro are now wired into the cost telemetry path, so direct-Google routing carries accurate per-token accounting alongside the existing OpenRouter route.
Companion to mostlydev/clawdapus#119.
v0.3.2 — Anthropic prompt cache fix
Fixes
- Cache-friendly prefix ordering for feed injection — feeds and timestamps are now appended after the system prompt instead of prepended before it. Anthropic prompt caching is prefix-matched with a 5-minute TTL, so prepending dynamic content invalidated the cache on every request. A 3-agent pod was paying ~$60/week in unnecessary `cache_creation` costs; this fix eliminates that. Fixes mostlydev/clawdapus#122.
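The ordering fix can be illustrated with a minimal sketch (`buildSystem` is a hypothetical helper, not the proxy's real function): the stable system prompt stays first so the prefix-matched cache keeps hitting, and volatile content trails it.

```go
package main

import "fmt"

// buildSystem keeps the stable system prompt as the cacheable prefix and
// appends dynamic feed/timestamp content after it, instead of prepending.
func buildSystem(systemPrompt string, dynamic []string) []string {
	out := []string{systemPrompt} // unchanged across requests: cache-friendly
	out = append(out, dynamic...) // volatile suffix: feeds, timestamps
	return out
}

func main() {
	blocks := buildSystem("You are agent A.", []string{"[feed] headlines", "[now] 2025-01-01T00:00:00Z"})
	fmt.Println(blocks[0]) // the stable prompt is still the prefix
}
```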
v0.3.1 — managed tool manifest state observability
Highlights
- Log managed tool manifest state on every request — proxy telemetry now emits `manifest_present` (bool) and `tools_count` (int) so operators can verify at runtime whether a per-agent tool manifest was loaded and how many tools it contained.
This closes an observability gap that made it hard to diagnose cases where a compiled `tools.json` existed on disk but tools were not being injected into upstream requests — there was no runtime signal telling operators which agents had live tool manifests.
Artifacts
- container image: `ghcr.io/mostlydev/cllama:v0.3.1`
- rolling tag: `ghcr.io/mostlydev/cllama:latest`
Validation
- `go test ./...`
v0.3.0 — managed tool mediation + memory plane + scoped history API
Highlights
Managed tool mediation
- load and inject compiled `tools.json` manifests into upstream LLM requests
- execute managed tools via HTTP against declared services (OpenAI-compatible format)
- Anthropic-format tool mediation (parallel path to OpenAI)
- cross-turn continuity: replay hidden tool rounds into subsequent upstream requests so the LLM sees the transcript that produced each runner-visible reply
- re-stream final text as synthetic SSE after mediated loops complete; keepalive comments prevent runner timeouts during long loops
- budget limits: max rounds, per-tool timeout, total timeout, result size truncation with explicit `truncated: true` flag
- body_key execution: wrap tool arguments as `{body_key: args}` when declared in the tool descriptor
- sanitize managed tool names for provider compatibility
Memory plane
- pre-turn recall and post-turn best-effort retain hooks
- `memory_op` telemetry events with recall/retain outcome, latency, block count, injected bytes, policy-removal counts
- secret-shaped value scrubbing on both retain payloads and recalled blocks
- tightened memory recall auth and history auth handling
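Secret-shaped scrubbing might look like the sketch below; the pattern is purely illustrative (common key prefixes plus long hex runs), not the proxy's actual rule set, and `scrub` is a hypothetical name:

```go
package main

import (
	"fmt"
	"regexp"
)

// secretShaped is an illustrative pattern: sk-/ghp_-prefixed keys and long
// hex runs. The real proxy's detection rules may differ.
var secretShaped = regexp.MustCompile(`\b(?:sk-[A-Za-z0-9]{16,}|ghp_[A-Za-z0-9]{20,}|[A-Fa-f0-9]{32,})\b`)

// scrub redacts secret-shaped substrings before a block is retained and
// after one is recalled, per the memory-plane behavior described above.
func scrub(s string) string {
	return secretShaped.ReplaceAllString(s, "[REDACTED]")
}

func main() {
	fmt.Println(scrub("token sk-abcdefghijklmnop1234 in memory"))
}
```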
Session history API
- scoped history read API for agents querying their own transcripts
- dedicated replay auth tokens separate from agent bearer tokens
- stable per-entry IDs
- index replay for `after` queries (no full rescan)
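Indexed `after` replay can be sketched as follows (the type and method names are hypothetical): stable per-entry IDs map to positions once, so an `after=<id>` query seeks directly instead of rescanning the transcript.

```go
package main

import "fmt"

// historyIndex maps stable entry IDs to positions, built once per transcript.
type historyIndex struct {
	entries []string       // entry IDs in transcript order
	pos     map[string]int // ID -> index
}

func newHistoryIndex(ids []string) *historyIndex {
	idx := &historyIndex{entries: ids, pos: make(map[string]int, len(ids))}
	for i, id := range ids {
		idx.pos[id] = i
	}
	return idx
}

// After returns all entry IDs strictly after the given ID: an O(1) seek
// instead of a full rescan.
func (h *historyIndex) After(id string) []string {
	i, ok := h.pos[id]
	if !ok {
		return nil
	}
	return h.entries[i+1:]
}

func main() {
	idx := newHistoryIndex([]string{"e1", "e2", "e3"})
	fmt.Println(idx.After("e1")) // [e2 e3]
}
```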
Provider fixes
- xAI env seeding regression coverage
Artifacts
- container image: `ghcr.io/mostlydev/cllama:v0.3.0`
- rolling tag: `ghcr.io/mostlydev/cllama:latest`
Validation
- `go test ./...`
v0.2.5
Highlights
- enforce declared per-agent model policy in cllama
- normalize runner model requests against the compiled allowlist
- restrict provider failover to the pod-declared fallback chain
- add xAI routing/policy fixes needed for direct `xai/...` model refs
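Allowlist normalization might look like this sketch (function name and the coerce-to-default behavior are assumptions; the model refs are examples only):

```go
package main

import "fmt"

// normalizeModel checks a runner's requested model against the pod's
// compiled allowlist and pins off-policy requests to a declared default.
func normalizeModel(requested string, allow []string, fallback string) string {
	for _, m := range allow {
		if m == requested {
			return requested
		}
	}
	return fallback // off-policy request: coerce to the declared default
}

func main() {
	allow := []string{"xai/grok-4", "anthropic/claude-sonnet-4"}
	fmt.Println(normalizeModel("gpt-4o", allow, allow[0]))
}
```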
Artifacts
- container image: `ghcr.io/mostlydev/cllama:v0.2.5`
- rolling tag: `ghcr.io/mostlydev/cllama:latest`
Validation
- `go test ./...`
v0.2.3
Changes
- Unpriced request tracking: requests where the upstream provider returns no cost data are now counted separately as `unpriced_requests` in the cost API response and surfaced in the dashboard UI
- Reported cost passthrough: `CostInfo.CostUSD` is now `*float64` (nil = unpriced, not zero); provider-reported cost fields are propagated through the proxy
- Timezone context: `time_context.go` injects timezone-aware current time for agents that declare a `TZ` environment variable
- Dashboard: `total_requests` and `unpriced_requests` exposed in the costs API endpoint
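The pointer change matters because it distinguishes "provider reported $0.00" from "provider reported nothing". A minimal sketch (`classify` is a hypothetical helper):

```go
package main

import "fmt"

// CostInfo mirrors the change above: a *float64 makes "unpriced" (nil)
// distinguishable from a genuine zero cost.
type CostInfo struct {
	CostUSD *float64
}

func classify(c CostInfo) string {
	if c.CostUSD == nil {
		return "unpriced" // counted in unpriced_requests
	}
	return fmt.Sprintf("$%.6f", *c.CostUSD)
}

func main() {
	zero := 0.0
	fmt.Println(classify(CostInfo{CostUSD: nil}))   // unpriced
	fmt.Println(classify(CostInfo{CostUSD: &zero})) // priced, and free
}
```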
v0.2.2 — provider token pool + runtime provider add
What's new
- Provider token pool: multi-key pool per provider with states `ready`/`cooldown`/`dead`/`disabled`. The proxy retries across keys on 401/429/5xx with failure classification and `Retry-After` support.
- Runtime provider add: `POST /providers/add` UI route — add a new provider (name, base URL, auth type, API key) at runtime with no restart. Persists to `.claw-auth/providers.json` with `source: runtime`. `ProviderState.Source`: new field (seed/runtime) that survives JSON round-trips.
- UI bearer auth: all routes gated by `CLLAMA_UI_TOKEN` when configured.
- Key management routes: `POST /keys/add` and `POST /keys/delete`.
- Webhook alerts: `CLLAMA_ALERT_WEBHOOKS` and `CLLAMA_ALERT_MENTIONS` for pool events.
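The failure classification can be sketched as below; the state names come from this release, but the exact status-to-state mapping is an assumption:

```go
package main

import "fmt"

// classifyFailure maps an upstream HTTP status to a pool key state.
func classifyFailure(status int) string {
	switch {
	case status == 401:
		return "dead" // bad credentials: stop using this key
	case status == 429:
		return "cooldown" // rate-limited: honor Retry-After, then retry
	case status >= 500:
		return "cooldown" // transient upstream failure
	default:
		return "ready" // not a pool-level failure
	}
}

func main() {
	for _, s := range []int{401, 429, 503, 200} {
		fmt.Println(s, classifyFailure(s))
	}
}
```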
v0.2.1 — Feed Auth
What's new
- Feed authentication: `FeedEntry` now supports an `auth` field. When present, the feed fetcher sets an `Authorization: Bearer` header on the fetch request. This enables authenticated feeds from services like `claw-api` that require bearer token auth.
Backward compatible — existing `feeds.json` files without `auth` fields work unchanged.
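An authenticated entry might look like the fragment below. Only the `auth` field is documented by this release; the other field names (`feeds`, `name`, `url`, `ttl`) and the URL are illustrative assumptions:

```json
{
  "feeds": [
    {
      "name": "claw-api-status",
      "url": "https://claw-api.internal/status",
      "ttl": 60,
      "auth": "feed-bearer-token"
    }
  ]
}
```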
v0.2.0 — Feed Injection (ADR-013 Milestone 2)
What's Changed
Features
- Feed injection — the proxy now supports runtime feed injection into LLM requests. Feeds defined in agent context manifests are fetched with TTL-based caching and injected as system message content before forwarding to the upstream provider. Both OpenAI (`messages[]`) and Anthropic (top-level `system`) formats are supported.
  - `internal/feeds/manifest.go` — feed manifest parsing from agent context
  - `internal/feeds/fetcher.go` — HTTP fetcher with TTL-based caching
  - `internal/feeds/inject.go` — system message injection for OpenAI and Anthropic formats
- Agent context extensions — `AgentContext` now exposes `ContextDir` for feed manifest discovery and service auth loading.
- Proxy handler — new `WithFeeds` option wires feed injection into the proxy pipeline, gated by pod name.
- Cost logging improvements — better tracking in the logging layer.
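The two injection shapes can be sketched as below; the types and function names are hypothetical, not the `internal/feeds` API — OpenAI carries system content inside `messages[]`, while Anthropic puts it in a top-level `system` field:

```go
package main

import "fmt"

type openAIMessage struct {
	Role, Content string
}

// injectOpenAI adds feed content as a system entry in messages[].
func injectOpenAI(msgs []openAIMessage, feed string) []openAIMessage {
	sys := openAIMessage{Role: "system", Content: feed}
	return append([]openAIMessage{sys}, msgs...)
}

// injectAnthropic merges feed content into the top-level system string.
func injectAnthropic(system, feed string) string {
	if system == "" {
		return feed
	}
	return system + "\n\n" + feed
}

func main() {
	msgs := injectOpenAI([]openAIMessage{{Role: "user", Content: "hi"}}, "[feed] headlines")
	fmt.Println(len(msgs), msgs[0].Role)
	fmt.Println(injectAnthropic("base prompt", "[feed] headlines"))
}
```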
Docker Image
- `ghcr.io/mostlydev/cllama:v0.2.0` — multi-arch (linux/amd64 + linux/arm64)
- `ghcr.io/mostlydev/cllama:latest`
Test Coverage
- Feed fetcher, injection, and manifest parsing
- Proxy handler tests for both OpenAI and Anthropic feed injection paths
- Agent context service auth loading
v0.1.0
cllama v0.1.0 — First Release
OpenAI-compatible governance proxy for AI agent pods. Zero external dependencies, ~15 MB distroless image.
Features
- OpenAI-compatible proxy on `:8080` — `POST /v1/chat/completions` with streaming
- Anthropic Messages bridge — `POST /v1/messages` with native format translation
- Multi-provider registry — OpenAI, Anthropic, OpenRouter, Ollama with automatic routing
- Per-agent bearer token auth — agents never see real provider API keys
- Real-time operator dashboard on `:8081` — SSE-powered live view of all LLM calls with agent ID, model, tokens, cost, latency
- Cost tracking — per-agent, per-model, per-provider usage extraction from upstream responses
- Vendor-prefixed model fallback — routes `anthropic/claude-*` etc. through OpenRouter when a direct provider key is unavailable
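Vendor-prefixed fallback can be sketched as follows (`routeModel` is a hypothetical function; the default-provider choice for bare names is an assumption):

```go
package main

import (
	"fmt"
	"strings"
)

// routeModel sends a vendor-prefixed ref direct when that provider's key is
// available, otherwise falls back through OpenRouter with the prefix intact.
func routeModel(ref string, keys map[string]bool) (provider, model string) {
	vendor, rest, ok := strings.Cut(ref, "/")
	if !ok {
		return "openai", ref // bare model name: assume the default provider
	}
	if keys[vendor] {
		return vendor, rest // direct provider key available
	}
	return "openrouter", ref // no direct key: route via OpenRouter
}

func main() {
	p, m := routeModel("anthropic/claude-sonnet-4", map[string]bool{"openai": true})
	fmt.Println(p, m)
}
```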
Container Image
`docker pull ghcr.io/mostlydev/cllama:latest`

Published publicly at ghcr.io/mostlydev/cllama:latest.