Proxy-first LLM gateway in Go for routing, limits, and quota enforcement.
Endpoints:

- `GET /v1/models`
- `GET /v1/limits`
- `PUT /v1/limits`
- `POST /v1/chat/completions`
- `POST /v1/responses`
- `POST /v1/completions`
- `POST /v1/embeddings`
- `POST /v1/messages`
- `POST /v1beta/models/{model}:generateContent`
- `POST /v1beta/models/{model}:embedContent`
- `POST /v1/models/{model}:generateContent`
- `POST /v1/models/{model}:embedContent`
- `GET /openapi.json`
- `GET /openapi.yaml`
- `GET /docs`
- `GET /healthz`
- `GET /readyz`
- `GET /metrics`

Features:

- Provider-native request validation with raw upstream proxying
- SSE passthrough for streaming providers
- Model-based routing, capability filtering, fallback, circuit breaking, rpm/tpm/concurrency limits, quota reservation, rate limiting via `x/time/rate` and `redis_rate`, metrics, and tracer hook points
Ingress is provider-native: existing OpenAI, Anthropic, and Gemini clients can keep their request contracts and only change the base URL.
Set provider keys:

```bash
export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GEMINI_API_KEY=...
```

Gateway auth can use either static bearer tokens from `config/config.yaml` or JWTs. For JWT mode, configure `auth.jwt` with an HMAC secret or public key and map the `key_id` claim to the quota subject. Quota enforcement is key-scoped, and `/v1/limits` reads or writes the limits for the authenticated `key_id`.
Start the gateway:

```bash
go run ./cmd/llmgw -config config/config.example.yaml
```

Run with Docker Compose:
```bash
export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GEMINI_API_KEY=...
docker compose up --build
```

Build and run the container directly:
```bash
docker build -t llmgw:local .
docker run --rm -p 8080:8080 \
  -e OPENAI_API_KEY \
  -e ANTHROPIC_API_KEY \
  -e GEMINI_API_KEY \
  -v "$PWD/config/config.yaml:/etc/llmgw/config.yaml:ro" \
  llmgw:local
```

This repo includes `.github/workflows/deploy-nebius.yml`.
Required GitHub repository secrets:
- `NB_PROJECT_ID`
- `NB_SUBNET_ID`
- `NB_SERVICE_ACCOUNT_ID`
- `NB_PUBLIC_KEY_ID`
- `NB_PRIVATE_KEY_PEM`
Optional secrets:
- `NB_ENDPOINT_AUTH_TOKEN`
- `NB_REGISTRY_USERNAME`
- `NB_REGISTRY_PASSWORD`
- `OPENAI_API_KEY`
- `ANTHROPIC_API_KEY`
- `GEMINI_API_KEY`
- `LLMGW_JWT_SECRET`
- `LLMGW_POSTGRES_DSN`
- `LLMGW_REDIS_ADDR`
- `LLMGW_REDIS_PASSWORD`
- `LLMGW_BEARER_TOKEN`
Optional repository variables:
- `NB_ENDPOINT_NAME` (default: `llmgw`)
- `NB_PLATFORM` (default: `cpu-d3`)
- `NB_PRESET` (default: `4vcpu-16gb`)
- `NB_CONTAINER_PORT` (default: `8080`)
The workflow builds and pushes a Docker image to GHCR, then recreates the Nebius endpoint with the new image.
Generate the OpenAPI YAML without starting the server:

```bash
go run ./cmd/llmgw -config config/config.example.yaml -print-openapi > openapi.yaml
```

The checked-in `openapi.yaml` is a convenience snapshot. The live source of truth is generated from the running config and served at `/openapi.yaml`.
Reload config in place:

```bash
pkill -HUP -f llmgw
```

List models:

```bash
curl -s http://localhost:8080/v1/models
```

OpenAI chat completions:
```bash
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Authorization: Bearer local-dev-token' \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"gpt-4o-mini",
    "messages":[{"role":"user","content":"Say hello in one sentence."}]
  }'
```

OpenAI chat streaming:
```bash
curl -N http://localhost:8080/v1/chat/completions \
  -H 'Authorization: Bearer local-dev-token' \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"gpt-4o-mini",
    "stream":true,
    "messages":[{"role":"user","content":"Count to 3."}]
  }'
```

OpenAI responses:
```bash
curl -s http://localhost:8080/v1/responses \
  -H 'Authorization: Bearer local-dev-token' \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"gpt-4o-mini",
    "input":"Summarize the purpose of this gateway in one line."
  }'
```

OpenAI legacy completions:
```bash
curl -s http://localhost:8080/v1/completions \
  -H 'Authorization: Bearer local-dev-token' \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"gpt-4o-mini",
    "prompt":"Write one short sentence about Go."
  }'
```

OpenAI embeddings:
```bash
curl -s http://localhost:8080/v1/embeddings \
  -H 'Authorization: Bearer local-dev-token' \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"gpt-4o-mini",
    "input":"gateway"
  }'
```

JWT-backed quota limits:
```bash
export LLMGW_JWT_SECRET=test-secret
TOKEN="$(python3 - <<'PY'
import jwt, time
print(jwt.encode({
    'iss': 'llmgw',
    'aud': 'gateway',
    'sub': 'session-1',
    'key_id': 'partner-key-123',
    'iat': int(time.time()),
    'exp': int(time.time()) + 3600,
}, 'test-secret', algorithm='HS256'))
PY
)"
curl -s http://localhost:8080/v1/limits \
  -H "Authorization: Bearer $TOKEN"
curl -s -X PUT http://localhost:8080/v1/limits \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
    "rpm": 120,
    "tpm": 500000,
    "max_parallel": 8,
    "daily_tokens": 2000000
  }'
```

Metrics:
```bash
curl -s http://localhost:8080/metrics
```

OpenAPI YAML:

```bash
curl -s http://localhost:8080/openapi.yaml
```

OpenAPI JSON:

```bash
curl -s http://localhost:8080/openapi.json
```

Interactive docs:

```bash
open http://localhost:8080/docs
```

Provider-native Anthropic request:
```bash
curl -s http://localhost:8080/v1/messages \
  -H 'Authorization: Bearer local-dev-token' \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"claude-3-7-sonnet-latest",
    "messages":[{"role":"user","content":"Say hello in one sentence."}]
  }'
```

Provider-native Gemini request:
```bash
curl -s http://localhost:8080/v1beta/models/gemini-2.5-flash:generateContent \
  -H 'Authorization: Bearer local-dev-token' \
  -H 'Content-Type: application/json' \
  -d '{
    "contents":[{"role":"user","parts":[{"text":"Say hello in one sentence."}]}]
  }'
```

The gateway uses provider-native ingress paths for each family:

- OpenAI: `/v1/chat/completions`, `/v1/responses`, `/v1/completions`, `/v1/embeddings`
- Anthropic: `/v1/messages`
- Gemini: `/v1beta/models/{model}:generateContent`, `/v1beta/models/{model}:embedContent` (also `/v1/models/{model}:...`)
Requests keep their provider wire format and route only to matching providers.
The core still uses `Request` for routing, quota, and policy decisions, but upstream execution is now proxy-first: validate the request shape, patch controlled fields like `model`, then forward the raw body to the selected upstream.
```mermaid
flowchart LR
    Client["Client"] --> API["Public API<br/>Provider-native paths"]
    API --> Decode["Validate + Decode Metadata"]
    Decode --> Req["Request Interceptors<br/>auth, validation, ACL, scope resolution, quota reserve"]
    Req --> Router["Model Router<br/>capability filtering"]
    Router --> Attempt["Attempt Interceptors<br/>headers, concurrency, rpm/tpm, timeout, breaker"]
    Attempt --> Proxy["Proxy Runtime<br/>patch model + auth + headers"]
    Proxy --> Upstream["Upstream LLM API"]
    Upstream --> Stream["Raw JSON / SSE passthrough<br/>usage extraction for quota"]
    Stream --> Encode["Provider-Native JSON / SSE"]
    Encode --> Client
    Config["Immutable Config Snapshot"] --> Req
    Config --> Router
    Limits["Dynamic Limit Store<br/>key_id -> limits"] --> Req
    Quota["Quota State<br/>memory / Redis"] --> Req
    Quota --> Stream
```
The request flow is:
1. HTTP request lands on [`server.go`](/api/server.go).
2. The request is validated and decoded into [`Request`](/gateway/types.go) metadata plus the original raw body.
3. Request-scope interceptors run once for auth, request validation, token estimation, ACL checks, scope resolution, and quota reservation.
4. The router resolves the requested model name into candidate routes and filters by capabilities.
5. Attempt-scope interceptors run for each upstream attempt, handling provider headers, route/provider concurrency, rpm/tpm, timeout, retry, and circuit breaking.
6. The proxy runtime patches controlled fields like `model`, applies upstream auth, and forwards the raw request body to the selected provider.
7. The gateway passes provider-native output back to the client and extracts usage for quota settlement when it can.
8. Quota is committed or refunded when the call settles.
This keeps the hot path smaller while still supporting routing, fallbacks, streaming, and provider-specific request families.
The public surface also exposes generated OpenAPI documents at [`/openapi.yaml`](/api/spec.go) and [`/openapi.json`](/api/spec.go), plus Swagger UI at [`/docs`](/api/server.go). The spec is generated from code and the current config snapshot, so model enums and auth requirements stay aligned with the running service.
Config is an immutable snapshot stored behind an atomic pointer. Reload swaps the whole snapshot, so reads stay lock-free on the hot path.
Token validation currently uses effective-request inspection plus local estimation. Unary responses settle quotas from provider usage fields when present; streaming responses use passthrough plus fallback settlement.
## How Limits Work
The gateway has more than one kind of limit because different layers solve different problems.
### 1. Token key quotas
These are the limits attached to the authenticated `key_id`. They are enforced once per logical request and represent what a client token is allowed to consume.
Supported key-scoped quota fields:
- `rpm`
- `tpm`
- `max_parallel`
- `max_spend_micros`
- `daily_tokens`
- `monthly_tokens`
- `max_input_tokens`
- `max_output_tokens`
- `model_allowlist`
- `provider_allowlist`
Static defaults can still come from `quota.profiles` and `quota.keys` in config, but runtime overrides now take precedence and can be managed through `GET /v1/limits` and `PUT /v1/limits` for the authenticated `key_id`.
Quota enforcement is reservation-based:
1. estimate usage
2. reserve quota
3. call upstream
4. commit actual usage or refund unused reservation
### 2. Route and provider guards
These are operational safeguards, not client entitlement policy. They protect the gateway and upstream providers on each attempt.
Supported route guard fields:
- `rpm`
- `tpm`
- `concurrency`
- `provider_concurrency`
- `max_body_bytes`
In-memory mode uses local `golang.org/x/time/rate` plus in-process concurrency/breaker state. Redis mode adds shared cross-instance controls for route/provider concurrency and circuit-breaker state, and applies quota reservations atomically in Redis.
### 3. Capability caps
Capabilities describe what a route can actually do. They are used during routing and validation, not for spend accounting.
Currently enforced in the hot path:
- supported operations
- streaming
- route `max_output_tokens`
Other capability fields are still present in config and documentation, but they are not currently used as hard routing gates in the proxy-first runtime.
### 4. Request-level auth guards
These are simple request-shape requirements:
- `auth.max_body_bytes`
- `auth.require_user`
- `auth.require_project`
Short version:
- use key quotas to control client consumption
- use route/provider guards to protect infrastructure
- use capabilities to keep routing correct
## Tests
Run:

```bash
go test ./...
```

Hot-path benchmark:

```bash
go test -bench=. ./test/unit/gateway
```

The observer keeps a small tracer hook interface in `observer.go`. By default it is a no-op tracer. If you want full OpenTelemetry integration, inject an adapter that satisfies `observer.Tracer` and forwards spans into your OTel SDK setup.
- Extend request validation and metadata decoding in `providers`.
- Add proxy path, auth-header, and usage-extraction rules in `proxy.go`.
- Keep the gateway core on `gateway.Request`, route resolution, and quota settlement. Do not add a new SDK-heavy adapter tree unless proxying cannot express the provider family.
- Register the provider in `main.go` and add route entries in config.