fix(vlm): add max_tokens parameter to VLM completion calls to prevent vLLM rejection (#689)
* fix(vlm): add max_tokens parameter to VLM completion calls to prevent vLLM rejection
Without max_tokens, vLLM allocates all context space to input tokens and
assigns 0 output tokens, rejecting requests with "You passed N input
tokens and requested 0 output tokens." Even when prompts fit, the model
has no guaranteed output space, leading to truncated or empty responses.
This adds max_tokens support across all VLM backends:
- VLMConfig: new max_tokens field (default 4096)
- VLMBase: reads max_tokens from config dict
- OpenAI, VolcEngine, LiteLLM backends: pass max_tokens in API calls
- Conditional inclusion (if self.max_tokens) so None disables the limit
Fixes #674
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix(vlm): default max_tokens to None to preserve provider behavior
Change default from 4096 to None so max_tokens is only sent when
explicitly configured. Prevents silently truncating outputs on
OpenAI/VolcEngine where omitting max_tokens lets the server choose.
Also use `is not None` instead of truthiness for max_tokens guards.
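The resulting behavior can be sketched with a hypothetical helper (the real backend method names differ; this only illustrates the conditional inclusion and the `is not None` guard):

```python
def build_completion_kwargs(model, messages, max_tokens=None):
    """Build keyword arguments for a chat-completion call.

    max_tokens is included only when explicitly configured, so the
    default of None preserves the provider's own output-budget choice.
    """
    kwargs = {"model": model, "messages": messages}
    # Guard on `is not None` rather than truthiness, so an explicit 0
    # is still forwarded while None omits the field entirely.
    if max_tokens is not None:
        kwargs["max_tokens"] = max_tokens
    return kwargs
```

With a truthiness check (`if max_tokens:`), configuring `max_tokens=0` would silently drop the field; the `is not None` guard keeps "not configured" and "configured to a falsy value" distinct.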
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>