Conversation

@iboiko-habana
Collaborator

…inear warmup

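logger().warning("*******************************************************")
for flag in user_flags:
    # One-element tuple: without the trailing comma this is a plain string,
    # and `in` would perform a substring match instead of a membership test.
    if flag in ("VLLM_PROMPT_BS_BUCKET_MAX",):
        continue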
Collaborator

If I set only this flag, wouldn't I get the row of stars (`****`) with no warning after it?

Collaborator Author

Thanks. Updated
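
The concern raised in this thread can be sketched as follows: if the banner is logged before the loop and every entry in `user_flags` is skipped, the log contains only a row of stars. One possible fix, assuming `user_flags` is an iterable of flag names, is to filter first and log only when something remains. The names `SILENCED_FLAGS`, `flags_to_warn_about` and `warn_about_flags`, the banner length, and the warning text are all hypothetical; the hunk above does not show the real message or the change that was actually made.

```python
import logging

logger = logging.getLogger("bucketing_sketch")

# Hypothetical set of flags that should not trigger the warning.
# A set avoids the substring match that `flag in ("NAME")` would
# perform, because ("NAME") is a plain string, not a tuple.
SILENCED_FLAGS = {"VLLM_PROMPT_BS_BUCKET_MAX"}

BANNER = "*" * 40  # arbitrary length, for illustration only


def flags_to_warn_about(user_flags):
    """Return the user-set flags that still deserve a warning."""
    return [flag for flag in user_flags if flag not in SILENCED_FLAGS]


def warn_about_flags(user_flags):
    """Log the banner only when at least one flag needs a warning."""
    remaining = flags_to_warn_about(user_flags)
    if not remaining:
        return remaining  # nothing to report, so no orphaned banner
    logger.warning(BANNER)
    for flag in remaining:
        # Placeholder message; the real wording is not shown in the thread.
        logger.warning("%s is set by the user", flag)
    logger.warning(BANNER)
    return remaining
```

With this shape, setting only `VLLM_PROMPT_BS_BUCKET_MAX` produces no log output at all, while any other flag is still reported between two banner lines.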

- `VLLM_GRAPH_DECODE_STRATEGY`: strategy determining order of decode graph capture, `min_tokens` or `max_bs`. The default is `max_bs`.
- `VLLM_EXPONENTIAL_BUCKETING`: if `true`, enables exponential bucket spacing instead of linear. The default is `true`.
- `VLLM_PROMPT_BS_BUCKET_MAX`: prefill batch size max. The default is `1`.
- `VLLM_{phase}_{dim}_BUCKET_{param}`: collection of 12 environment variables configuring ranges of bucketing mechanism (linear bucketing only).
Contributor

We no longer have 12 environment variables here (even before your change there were more). Let's either not mention a number or correct it.

Collaborator Author

Thanks. Updated

- `VLLM_PROMPT_BS_BUCKET_MAX`: prefill batch size max. The default is `1`.
- `VLLM_{phase}_{dim}_BUCKET_{param}`: collection of 12 environment variables configuring ranges of bucketing mechanism (linear bucketing only).
  - `{phase}` is either `PROMPT` or `DECODE`
  - `{dim}` is either `BS`, `SEQ` or `BLOCK`
Contributor

According to the env variables we list here, `{dim}` can also be `CTX`.

Collaborator Author

Thanks. Updated
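
To make the naming pattern discussed in this thread concrete, here is a small sketch that expands `VLLM_{phase}_{dim}_BUCKET_{param}` into candidate names. The `{phase}` and `{dim}` values come from the README lines and the review comment above. The `{param}` values `MIN`, `STEP` and `MAX` are an assumption, since the thread does not list them, and not every combination is necessarily a variable the code reads.

```python
from itertools import product

PHASES = ("PROMPT", "DECODE")
DIMS = ("BS", "SEQ", "BLOCK", "CTX")  # CTX added per the review comment
PARAMS = ("MIN", "STEP", "MAX")       # assumption: not listed in this thread


def candidate_bucket_vars(phases=PHASES, dims=DIMS, params=PARAMS):
    """Expand the naming pattern into every candidate variable name."""
    return [
        f"VLLM_{phase}_{dim}_BUCKET_{param}"
        for phase, dim, param in product(phases, dims, params)
    ]


names = candidate_bucket_vars()
print(len(names))  # 2 phases * 4 dims * 3 params = 24 candidates
print("VLLM_PROMPT_BS_BUCKET_MAX" in names)  # True
```

The expansion is an upper bound on possible names, not a list of supported variables, which is one reason a hard-coded count in the README is easy to get wrong.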

- `VLLM_GRAPH_PROMPT_STRATEGY`: strategy determining order of prompt graph capture, `min_tokens` or `max_bs`. The default is `min_tokens`.
- `VLLM_GRAPH_DECODE_STRATEGY`: strategy determining order of decode graph capture, `min_tokens` or `max_bs`. The default is `max_bs`.
- `VLLM_EXPONENTIAL_BUCKETING`: if `true`, enables exponential bucket spacing instead of linear. The default is `true`.
- `VLLM_PROMPT_BS_BUCKET_MAX`: prefill batch size max. The default is `1`.
Contributor

Should we add a recommendation that `VLLM_PROMPT_BS_BUCKET_MAX` be no larger than `max_num_batched_tokens`, as per the issue seen in #331?

Collaborator

Collaborator Author

Thanks. Updated

Signed-off-by: Iryna Boiko <[email protected]>
- `VLLM_GRAPH_DECODE_STRATEGY`: strategy determining order of decode graph capture, `min_tokens` or `max_bs`. The default is `max_bs`.
- `VLLM_EXPONENTIAL_BUCKETING`: if `true`, enables exponential bucket spacing instead of linear. The default is `true`.
- `VLLM_{phase}_{dim}_BUCKET_{param}`: collection of 12 environment variables configuring ranges of bucketing mechanism (linear bucketing only).
- `VLLM_PROMPT_BS_BUCKET_MAX`: `(VLLM_PROMPT_BS_BUCKET_MAX * query) <= max_num_batched_tokens` - prefill batch size max. The default is `1`.
Collaborator

Please resolve the conflicts and we can merge.

- `VLLM_GRAPH_DECODE_STRATEGY`: strategy determining order of decode graph capture, `min_tokens` or `max_bs`. The default is `max_bs`.
- `VLLM_EXPONENTIAL_BUCKETING`: if `true`, enables exponential bucket spacing instead of linear. The default is `true`.
- `VLLM_{phase}_{dim}_BUCKET_{param}`: collection of 12 environment variables configuring ranges of bucketing mechanism (linear bucketing only).
- `VLLM_PROMPT_BS_BUCKET_MAX`: `(VLLM_PROMPT_BS_BUCKET_MAX * query) <= max_num_batched_tokens` - prefill batch size max. The default is `1`.
Contributor

What is `query` here? Is it the input sequence length?
With the default of `1`, `max_num_batched_tokens < query` (chunked prefill) is an allowed and frequently used mode.
Does this break chunked prefill (with the chunk smaller than the query) and lose all of its advantages?
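
The question above can be made concrete with a small check of the inequality from the README line, `(VLLM_PROMPT_BS_BUCKET_MAX * query) <= max_num_batched_tokens`. This sketch assumes `query` means the prompt length in tokens, which is exactly what the comment asks and is not confirmed in the thread. `fits_token_budget` is a hypothetical helper, not part of vLLM, and the numbers are illustrative only.

```python
def fits_token_budget(prompt_bs_max, query_len, max_num_batched_tokens):
    """Evaluate the README inequality for one configuration.

    Assumption: `query_len` is the prompt length in tokens.
    """
    return prompt_bs_max * query_len <= max_num_batched_tokens


# Chunked-prefill case: the token budget is smaller than the prompt,
# with the documented default batch size max of 1.
print(fits_token_budget(1, 4096, 2048))  # False

# Prompt fits within the budget at the default of 1.
print(fits_token_budget(1, 1024, 2048))  # True

# Raising the batch size max to 4 exceeds the same budget.
print(fits_token_budget(4, 1024, 2048))  # False
```

Under this reading, the inequality is violated in the chunked-prefill case even at the default of `1`. Whether the README intends a hard constraint or a recommendation is not settled in the thread.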

6 participants