Update of VLLM_PROMPT_BS_BUCKET_MAX logic, real bs change, not only linear warmup #348


Open

iboiko-habana wants to merge 5 commits into vllm-project:main from iboiko-habana:iboiko/prefill_real_bs_upd

+5 −4

Collaborator

iboiko-habana commented Oct 8, 2025

…inear warmup


          Update of VLLM_PROMPT_BS_BUCKET_MAX logic, real bs change, not only linear warmup

7a56062

Signed-off-by: Iryna Boiko <[email protected]>

iboiko-habana requested review from adobrzyn, afierka-intel, kzawora-intel, mgawarkiewicz-intel, michalkuligowski, mswiniarsk, vivekgoe and xuechendi as code owners

October 8, 2025 08:59

adobrzyn reviewed


vllm_gaudi/extension/bucketing/exponential.py Outdated

    
                          logger().warning("*******************************************************")

                          for flag in user_flags:

                              if flag in ("VLLM_PROMPT_BS_BUCKET_MAX"):

                                  continue

Collaborator

adobrzyn Oct 8, 2025

If I set only this flag, wouldn't I get those stars (`****`) with no warning between them?

Collaborator Author

iboiko-habana Oct 8, 2025

Thanks. Updated
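One way to address the comment above is to filter the flags first and print the star banner only when something is left to report. This is a minimal sketch, not the code from the PR: `IGNORED_FLAGS`, `flags_to_warn_about`, `warn_about_flags`, the logger name, and the warning text are all hypothetical. It also uses a one-element tuple with a trailing comma, because `flag in ("VLLM_PROMPT_BS_BUCKET_MAX")` tests against a plain string and therefore performs a substring match.

```python
import logging

logger = logging.getLogger("bucketing_sketch")  # hypothetical logger name

# Hypothetical: flags that should not trigger the warning banner.
# The trailing comma matters: ("X") is just the string "X", so
# `flag in ("X")` would be a substring test, not tuple membership.
IGNORED_FLAGS = ("VLLM_PROMPT_BS_BUCKET_MAX",)


def flags_to_warn_about(user_flags):
    """Return the user-set flags that still deserve a warning."""
    return [flag for flag in user_flags if flag not in IGNORED_FLAGS]


def warn_about_flags(user_flags):
    """Log the banner only when at least one flag needs a warning.

    Returns the number of flags that were reported.
    """
    remaining = flags_to_warn_about(user_flags)
    if not remaining:
        return 0  # nothing to report, so no stars either
    banner = "*" * 55
    logger.warning(banner)
    for flag in remaining:
        # Placeholder message; the real wording lives in the PR.
        logger.warning("User-set flag %s needs attention", flag)
    logger.warning(banner)
    return len(remaining)
```

With only `VLLM_PROMPT_BS_BUCKET_MAX` set, `warn_about_flags` returns before logging anything, so the banner never appears without a message.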

ksmusz suggested changes


docs/configuration/env_vars.md Outdated

    
              - `VLLM_GRAPH_DECODE_STRATEGY`: strategy determining order of decode graph capture, `min_tokens` or `max_bs`. The default is `max_bs`.

              - `VLLM_EXPONENTIAL_BUCKETING`: if `true`, enables exponential bucket spacing instead of linear. The default is `true`.

              - `VLLM_PROMPT_BS_BUCKET_MAX`: prefill batch size max. The default is `1`.

              - `VLLM_{phase}_{dim}_BUCKET_{param}`: collection of 12 environment variables configuring ranges of bucketing mechanism (linear bucketing only).

Contributor

ksmusz Oct 8, 2025

We no longer have 12 environment variables here (even before your change there were more). Let's either not mention a number or fix it to the proper value.

Collaborator Author

iboiko-habana Oct 8, 2025

Thanks. Updated

docs/configuration/env_vars.md Outdated

    
              - `VLLM_PROMPT_BS_BUCKET_MAX`: prefill batch size max. The default is `1`.

              - `VLLM_{phase}_{dim}_BUCKET_{param}`: collection of 12 environment variables configuring ranges of bucketing mechanism (linear bucketing only).

                - `{phase}` is either `PROMPT` or `DECODE`

                - `{dim}` is either `BS`, `SEQ` or `BLOCK`

Contributor

ksmusz Oct 8, 2025

According to the env variables we list here, `{dim}` can also be `CTX`.

Collaborator Author

iboiko-habana Oct 8, 2025

Thanks. Updated
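To see why a fixed count of 12 is hard to keep accurate, the `VLLM_{phase}_{dim}_BUCKET_{param}` pattern can be expanded mechanically. The sketch below takes `{phase}` and `{dim}` from the hunk and the comment above; the `{param}` values `MIN`, `STEP` and `MAX` are an assumption, since this thread does not list them. The result is only the nominal set of names: not every combination is necessarily meaningful to the bucketing code.

```python
from itertools import product

PHASES = ("PROMPT", "DECODE")          # from the docs hunk above
DIMS = ("BS", "SEQ", "BLOCK", "CTX")   # CTX added per the review comment
PARAMS = ("MIN", "STEP", "MAX")        # assumption: not listed in this thread


def bucket_var_names(phases=PHASES, dims=DIMS, params=PARAMS):
    """Expand the naming pattern into every nominal variable name."""
    return [
        f"VLLM_{phase}_{dim}_BUCKET_{param}"
        for phase, dim, param in product(phases, dims, params)
    ]


names = bucket_var_names()
# The nominal count is len(PHASES) * len(DIMS) * len(PARAMS), which
# changes whenever a dimension such as CTX is added to the pattern.
```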

docs/configuration/env_vars.md Outdated

    
              - `VLLM_GRAPH_PROMPT_STRATEGY`: strategy determining order of prompt graph capture, `min_tokens` or `max_bs`. The default is `min_tokens`.

              - `VLLM_GRAPH_DECODE_STRATEGY`: strategy determining order of decode graph capture, `min_tokens` or `max_bs`. The default is `max_bs`.

              - `VLLM_EXPONENTIAL_BUCKETING`: if `true`, enables exponential bucket spacing instead of linear. The default is `true`.

              - `VLLM_PROMPT_BS_BUCKET_MAX`: prefill batch size max. The default is `1`.

Contributor

ksmusz Oct 8, 2025

Should we add a recommendation that `VLLM_PROMPT_BS_BUCKET_MAX` should be no bigger than `max_num_batched_tokens`, as per the issue seen in #331?

Collaborator

PatrykWo Oct 8, 2025

Yes > @iboiko-habana

Collaborator Author

iboiko-habana Oct 8, 2025

Thanks. Updated

iboiko-habana added 3 commits

October 8, 2025 13:27


          After review

6e44955

Signed-off-by: Iryna Boiko <[email protected]>


          After review, CTX update

1b7a399

Signed-off-by: Iryna Boiko <[email protected]>


          After review, CTX update 1

63fb078

Signed-off-by: Iryna Boiko <[email protected]>

ksmusz suggested changes


docs/configuration/env_vars.md Outdated


          After review2

d46a7a7

Signed-off-by: Iryna Boiko <[email protected]>

ksmusz approved these changes


michalkuligowski approved these changes


docs/configuration/env_vars.md

    
              - `VLLM_GRAPH_DECODE_STRATEGY`: strategy determining order of decode graph capture, `min_tokens` or `max_bs`. The default is `max_bs`.

              - `VLLM_EXPONENTIAL_BUCKETING`: if `true`, enables exponential bucket spacing instead of linear. The default is `true`.

              - `VLLM_{phase}_{dim}_BUCKET_{param}`: collection of 12 environment variables configuring ranges of bucketing mechanism (linear bucketing only).

              - `VLLM_PROMPT_BS_BUCKET_MAX`: `(VLLM_PROMPT_BS_BUCKET_MAX * query) <=max_num_batched_tokens`- prefill batch size max. The default is`1`.

Collaborator

michalkuligowski Oct 14, 2025

please resolve conflicts and we can merge

nngokhale reviewed


docs/configuration/env_vars.md

    
              - `VLLM_GRAPH_DECODE_STRATEGY`: strategy determining order of decode graph capture, `min_tokens` or `max_bs`. The default is `max_bs`.

              - `VLLM_EXPONENTIAL_BUCKETING`: if `true`, enables exponential bucket spacing instead of linear. The default is `true`.

              - `VLLM_{phase}_{dim}_BUCKET_{param}`: collection of 12 environment variables configuring ranges of bucketing mechanism (linear bucketing only).

              - `VLLM_PROMPT_BS_BUCKET_MAX`: `(VLLM_PROMPT_BS_BUCKET_MAX * query) <=max_num_batched_tokens`- prefill batch size max. The default is`1`.

Contributor

nngokhale Oct 28, 2025

What is `query` here? Is it the input sequence length?
When the default is 1, `max_num_batched_tokens < query` (chunked prefill) is an allowed and often-used mode.
Does this break chunked prefill (the chunk being smaller than the query) and all of its advantages?
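The arithmetic behind this question can be sketched as follows. It reads `query` as the per-request prompt length in tokens, which is an assumption: the thread leaves the meaning open. The helper names `satisfies_constraint` and `largest_prefill_bs` are hypothetical and are not part of vLLM or vllm-gaudi.

```python
def satisfies_constraint(bs_bucket_max, query_len, max_num_batched_tokens):
    """Check (VLLM_PROMPT_BS_BUCKET_MAX * query) <= max_num_batched_tokens."""
    return bs_bucket_max * query_len <= max_num_batched_tokens


def largest_prefill_bs(query_len, max_num_batched_tokens):
    """Largest batch size that still satisfies the documented constraint."""
    if query_len <= 0 or max_num_batched_tokens <= 0:
        raise ValueError("query_len and max_num_batched_tokens must be positive")
    return max_num_batched_tokens // query_len


# Default batch size of 1 with a prompt that fits in the token budget.
fits = satisfies_constraint(1, 1024, 2048)

# Default batch size of 1 with a prompt longer than the token budget,
# the chunked-prefill case raised in the comment above.
chunked = satisfies_constraint(1, 4096, 2048)
```

Under this reading, the documented inequality does not hold for the default value `1` whenever the prompt is longer than `max_num_batched_tokens`. Whether that affects chunked prefill in practice is not settled in this thread.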


Reviewers

PatrykWo left review comments

adobrzyn left review comments

michalkuligowski approved these changes

kzawora-intel (code owner): awaiting requested review

xuechendi (code owner): awaiting requested review

mswiniarsk (code owner): awaiting requested review

mgawarkiewicz-intel (code owner): awaiting requested review

vivekgoe (code owner): awaiting requested review

afierka-intel (code owner): awaiting requested review

nngokhale left review comments

ksmusz approved these changes

Labels

None yet