Feature/client disconnect cost #234
base: main
Conversation
…ts disconnects, we want to stop the vllm inferencing, record usage based on number of output tokens and input tokens for that inference such that we are still charging for usage. In order to calculate usage, we will call tokenize vllm api to count number of tokens based on the model tokenizer
…t for tokenization
…oud-api into feature/client-disconnect-cost
…oud-api into feature/client-disconnect-cost
@claude review
Code Review - PR #234

I've reviewed this PR focusing on critical issues. Found 4 CRITICAL issues that need fixing before merge:

CRITICAL ISSUE 1: Race Condition - Double Billing Risk (crates/services/src/completions/mod.rs:248-308)
The Drop handler has a race condition that can cause double billing. When the stream completes normally, self.accumulated_text.clear() is called, but Drop always executes afterward. Between clearing and Drop execution, if another event comes in, accumulated_text might be repopulated. This means customers could be charged twice for the same tokens.
Fix: Use an atomic flag to track whether normal completion occurred.

CRITICAL ISSUE 2: Missing Timeout on Tokenize API Calls (crates/inference_providers/src/vllm/mod.rs:342-379)
The tokenize API calls in the Drop handler have no timeout. The Drop handler spawns an async task that calls the vLLM tokenize API without a timeout. If vLLM is slow or hung, background tasks accumulate indefinitely, leading to resource leaks, memory exhaustion, and cascading failures in production.
Fix: Add a timeout to tokenize requests (5-10 seconds).

CRITICAL ISSUE 3: Privacy Violation - Logging Customer Data (crates/services/src/completions/mod.rs:254-257)
The debug log exposes the length of customer conversation content via accumulated_text.len() and input_text.len(). This violates CLAUDE.md privacy requirements. Text length is customer metadata that could reveal conversation patterns.
Fix: Remove length logging or log only a boolean has_accumulated flag.

CRITICAL ISSUE 4: Potential Panic in Production (crates/services/src/completions/mod.rs:266-308)
The Drop handler clones potentially large strings without error handling. If accumulated_text is huge (malicious client, long conversation), cloning could fail or OOM. Drop handlers should NEVER panic, as this crashes the process and disrupts service.
Fix: Add a size check and truncation (e.g., 1MB limit).

Status: NEEDS CHANGES - The double-billing race condition and missing timeouts are particularly critical for production safety in a multi-tenant billing system.
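For readers following along, a minimal sketch of the atomic-flag fix suggested in issue 1, assuming tokio; `InterceptStream`, `record_usage`, and the field names are placeholders for illustration, not code from this PR:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stream wrapper; names and fields are illustrative only.
struct InterceptStream {
    accumulated_text: String,
    /// Flipped exactly once, so whichever path runs second becomes a no-op.
    usage_recorded: AtomicBool,
}

impl InterceptStream {
    /// Normal path: the final chunk has been forwarded to the client.
    fn on_stream_finished(&mut self) {
        // swap() returns the previous value, so only the first caller records.
        if !self.usage_recorded.swap(true, Ordering::SeqCst) {
            record_usage(&self.accumulated_text);
        }
    }
}

impl Drop for InterceptStream {
    fn drop(&mut self) {
        // Disconnect path: record partial usage only if the normal path never ran.
        if !self.usage_recorded.swap(true, Ordering::SeqCst) {
            let partial = std::mem::take(&mut self.accumulated_text);
            // Drop cannot be async; hand the work to the runtime if one exists.
            // (Issue 2 above asks for a timeout around that spawned work.)
            if let Ok(handle) = tokio::runtime::Handle::try_current() {
                handle.spawn(async move { record_usage_async(partial).await });
            }
        }
    }
}

fn record_usage(_text: &str) { /* placeholder for the billing write */ }
async fn record_usage_async(_text: String) { /* placeholder for the billing write */ }
```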
…nstead of relying on id from vllm to create a hash deterministically and added a lock for guardrails for double writing to the database
@claude review
Code Review - PR #234

I've reviewed the implementation for tracking usage when clients disconnect mid-stream. CRITICAL issues found that must be addressed before merge.

CRITICAL ISSUES

1. Race Condition: Double Usage Recording
Location: crates/services/src/completions/mod.rs:145-174
The atomic flag is set to […]
Better approach: Rely ONLY on the database UNIQUE constraint for deduplication.

2. Missing Database UNIQUE Constraint
Location: crates/database/src/migrations/sql/V0029__add_inference_id_type.sql
The migration adds […]
Impact: Double billing is possible if both the normal path AND the Drop handler execute.
Required Fix: Add migration:

    CREATE UNIQUE INDEX IF NOT EXISTS idx_org_usage_inference_id_unique
        ON organization_usage_log(inference_id)
        WHERE inference_id IS NOT NULL;

3. Drop Handler Fire-and-Forget Async Work
Location: crates/services/src/completions/mod.rs:252-324
The Drop handler spawns async work ([…]).
Impact: During rolling deployments/restarts, usage tracking for disconnected clients may be silently lost.

4. Fallback to 0 Tokens = Free Tier Abuse
When token counting fails, the code returns 0 tokens. This creates an abuse vector where users get free inference.
Recommendation: Track failures in metrics; consider an estimated fallback instead of 0.

PRODUCTION SAFETY CONCERNS
[…]

POSITIVE ASPECTS
✅ Good test coverage for ChatDelta accumulation

REQUIRED ACTIONS
[…]
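As an illustration of relying on the database constraint for deduplication (item 2 above), a sketch of an idempotent insert; the `sqlx` usage is an assumption about the project's database layer, and the table/column names are taken from the suggested migration rather than from the PR's actual schema:

```rust
use sqlx::PgPool;
use uuid::Uuid;

/// Idempotent usage insert: with the partial unique index on inference_id in
/// place, a second attempt (normal path and Drop fallback racing) becomes a
/// no-op instead of a duplicate charge.
async fn record_usage_once(
    pool: &PgPool,
    org_id: Uuid,
    inference_id: Uuid,
    input_tokens: i64,
    output_tokens: i64,
) -> Result<bool, sqlx::Error> {
    let result = sqlx::query(
        "INSERT INTO organization_usage_log \
             (organization_id, inference_id, input_tokens, output_tokens) \
         VALUES ($1, $2, $3, $4) \
         -- the WHERE clause matches the partial index predicate so Postgres
         -- can use it as the conflict arbiter
         ON CONFLICT (inference_id) WHERE inference_id IS NOT NULL DO NOTHING",
    )
    .bind(org_id)
    .bind(inference_id)
    .bind(input_tokens)
    .bind(output_tokens)
    .execute(pool)
    .await?;

    // rows_affected() == 0 means another writer already billed this inference.
    Ok(result.rows_affected() > 0)
}
```

With that index in place, the return value tells the caller whether this invocation actually recorded usage or lost the race harmlessly.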
PierreLeGuen left a comment
Discussed offline: we should use "stream_options": {"continuous_usage_stats": true} to track ongoing token usage.
Example:
curl https://cloud-api.near.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $token" \
  -d '{
    "model": "zai-org/GLM-4.6",
    "messages": [
      {"role": "user", "content": "What is the weather in New York and Paris today?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a city",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {"type": "string", "description": "City name"}
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": "auto",
    "stream": true,
    "stream_options": {"continuous_usage_stats": true}
  }'
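A sketch of how the proxy side could consume those per-chunk usage snapshots, assuming serde/serde_json; the exact chunk shape vLLM emits with continuous_usage_stats is an assumption modeled on the OpenAI usage object:

```rust
use serde::Deserialize;

/// Usage block assumed to be attached to each streamed chunk when
/// "stream_options": {"continuous_usage_stats": true} is set.
#[derive(Debug, Clone, Deserialize)]
struct Usage {
    prompt_tokens: u64,
    completion_tokens: u64,
    total_tokens: u64,
}

#[derive(Debug, Deserialize)]
struct StreamChunk {
    #[serde(default)]
    usage: Option<Usage>,
}

/// Remembers the most recent usage snapshot while relaying chunks.
/// If the client disconnects, `last_usage` already holds the billable
/// token counts, so no tokenize round-trip is needed.
#[derive(Default)]
struct UsageTracker {
    last_usage: Option<Usage>,
}

impl UsageTracker {
    /// `data` is the JSON payload of one SSE "data:" line (not "[DONE]").
    fn observe(&mut self, data: &str) {
        if let Ok(chunk) = serde_json::from_str::<StreamChunk>(data) {
            if let Some(usage) = chunk.usage {
                self.last_usage = Some(usage);
            }
        }
    }

    fn billable(&self) -> Option<&Usage> {
        self.last_usage.as_ref()
    }
}
```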
Add graceful handling of client disconnects for streaming. When a client disconnects, we want to stop the vLLM inference and record usage based on the number of input and output tokens for that inference, so that we still charge for the usage. To calculate usage, we call the vLLM tokenize API to count tokens with the model's tokenizer.
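A rough sketch of that tokenize call, including the timeout the review asks for; it assumes reqwest, serde, and anyhow, the /v1/tokenize path comes from this PR's description, and the response field name `count` is an assumption:

```rust
use std::time::Duration;

use serde::{Deserialize, Serialize};

#[derive(Serialize)]
struct TokenizeRequest<'a> {
    model: &'a str,
    prompt: &'a str,
}

/// Assumed response shape from the provider's tokenize endpoint;
/// only `count` is needed for billing.
#[derive(Deserialize)]
struct TokenizeResponse {
    count: usize,
}

/// Count tokens for `text` via the provider's tokenize endpoint, with a hard
/// timeout so a hung vLLM instance cannot pile up background tasks.
/// Returns Err (rather than silently 0) so callers can choose a fallback.
async fn count_tokens(
    client: &reqwest::Client,
    base_url: &str,
    model: &str,
    text: &str,
) -> anyhow::Result<usize> {
    let url = format!("{base_url}/v1/tokenize"); // path as described in this PR
    let response = client
        .post(&url)
        .timeout(Duration::from_secs(5)) // bound the Drop-handler work
        .json(&TokenizeRequest { model, prompt: text })
        .send()
        .await?
        .error_for_status()?;
    let body: TokenizeResponse = response.json().await?;
    Ok(body.count)
}
```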
Note
Records usage when clients disconnect mid-stream by counting tokens via provider tokenizers; services now return inference_id, used by the API header.
- InterceptStream accumulation and Drop-based fallback to record usage if clients disconnect early, counting both input and partial output tokens via the model tokenizer.
- inference_id (UUID v4) per request for deduplication; prevent double-recording with an atomic flag.
- InferenceProvider with count_tokens(text, model); implement in vLLM via /v1/tokenize and in MockProvider (approximate).
- InferenceProviderPool::count_tokens_for_model with timeout and safe fallbacks.
- ChatDelta::accumulate_into to collect all textual content (content, tool calls, reasoning, name) for accurate token counting.
- CompletionServiceTrait now returns wrappers (StreamingCompletionResult, ChatCompletionResult) containing inference_id.
- inference_id from service and always set Inference-Id header.
- ChatDelta::accumulate_into; remove obsolete SSE inference ID extraction tests; adapt existing tests to new fields.
Written by Cursor Bugbot for commit d6e87e7.
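For context on the ChatDelta::accumulate_into item, a hypothetical sketch of accumulating every textual fragment for token counting; the real struct's fields are not visible on this page, so the names below are illustrative only:

```rust
/// Hypothetical shape of a streaming delta; field names are illustrative.
#[derive(Default)]
struct ChatDelta {
    content: Option<String>,
    reasoning_content: Option<String>,
    tool_call_arguments: Option<String>,
    name: Option<String>,
}

impl ChatDelta {
    /// Append every textual fragment of this delta into one buffer so the
    /// tokenizer later sees all billable output (visible text, reasoning,
    /// tool-call arguments, function name), not just `content`.
    fn accumulate_into(&self, buffer: &mut String) {
        for piece in [
            self.content.as_deref(),
            self.reasoning_content.as_deref(),
            self.tool_call_arguments.as_deref(),
            self.name.as_deref(),
        ]
        .into_iter()
        .flatten()
        {
            buffer.push_str(piece);
        }
    }
}
```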