
[Performance]: V1 higher memory usage #12529

Open
wedobetter opened this issue Jan 28, 2025 · 8 comments

Labels
performance Performance-related issues v1

Comments

@wedobetter

Proposal to improve performance

No response

Report of performance regression

Hardware: 4x RTX 3070 = 32GB VRAM

Issue: I was able to run Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4 with a 12K context length on 0.6.x; now with 0.7.0 + VLLM_USE_V1=1 I cannot push the context length higher than 3K without encountering a CUDA OOM error.
Of course, I can reconfigure it to avoid the OOM; my question is: is V1 expected to consume more memory?

Some of the libraries:

flashinfer==0.1.6+cu124torch2.4
torch==2.5.1
transformers==4.48.1
vllm==0.7.0

VLLM command

        - vllm
        - serve
        - Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4
        - --gpu-memory-utilization=1
        - --tensor-parallel-size=4
        - --load-format=auto
        - --enforce-eager
        - --swap-space=0
        - --max-model-len=12K
        - --max-num-batched-tokens=12K
        - --disable-fastapi-docs
        - --trust-remote-code
        - --enable-auto-tool-choice
        - --tool-call-parser=hermes

Thanks

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@wedobetter wedobetter added the performance Performance-related issues label Jan 28, 2025
@WoosukKwon
Collaborator

Hi @wedobetter, thanks for reporting the issue. We will take a look at the memory profiling part of V1. Meanwhile, could you please try a lower gpu-memory-utilization like 0.95 or 0.9? Generally, we don't recommend 1.0 because our memory profiling is not 100% accurate (since we do dynamic memory allocation during run time).
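
As a sketch of that suggestion applied to the args from the original report (0.9 is just an illustrative value within the 0.9–0.95 range mentioned above; only this one flag changes):

        - --gpu-memory-utilization=0.9   # was =1; leaves headroom for runtime allocations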

@mgoin
Member

mgoin commented Jan 29, 2025

I think you could also try reducing --max-num-batched-tokens=12K; it could be much smaller, like 2k, which might reduce peak activation memory and the memory used by torch.compile compilation.
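
Concretely, a hedged sketch of that change against the report's args (2048 matches the "2k" suggestion above; as far as I understand, V1 uses chunked prefill, so this value can be smaller than the model length):

        - --max-model-len=12K            # unchanged requested context length
        - --max-num-batched-tokens=2048  # was 12K; smaller batches lower peak activation memory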

@simon-mo simon-mo added the v1 label Jan 29, 2025
@DarkLight1337
Member

I have noticed similar issues for VLMs as well. See Slack thread. cc @ywang96

@wedobetter
Author

I think you could also try reducing --max-num-batched-tokens=12K; it could be much smaller, like 2k, which might reduce peak activation memory and the memory used by torch.compile compilation.

As I stated, I know how to reconfigure the execution parameters. My observation was that the OOM occurred just by upgrading from the latest 0.6.x to 0.7.0 and enabling V1, while keeping all the runtime parameters the same.

@wedobetter
Author

wedobetter commented Jan 29, 2025

Hi @wedobetter, thanks for reporting the issue. We will take a look at the memory profiling part of V1. Meanwhile, could you please try a lower gpu-memory-utilization like 0.95 or 0.9? Generally, we don't recommend 1.0 because our memory profiling is not 100% accurate (since we do dynamic memory allocation during run time).

Thanks for your input. I generally set that to 1, fearing that CPU offloading could significantly affect performance, but I have probably confused the gpu-memory-utilization option with cpu_offload_gb.
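
For clarity (a sketch based on my understanding of the flags, not on anything else in this thread): --gpu-memory-utilization only caps the fraction of VRAM vLLM reserves per GPU, while CPU offloading is controlled separately by --cpu-offload-gb, which defaults to 0:

        - --gpu-memory-utilization=0.95  # fraction of each GPU's VRAM vLLM may reserve
        - --cpu-offload-gb=0             # GiB of weights offloaded to CPU RAM; 0 means no offloading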

@robertgshaw2-redhat
Collaborator

Thanks for the feedback. Note for contributors: another place to look is torch.compile and the number of CUDA graphs we use.
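
For anyone reproducing this: the args list in the report looks like a container spec, so (assuming that format) one way to isolate the V1 overhead is to run the same serve command twice with only the V1 toggle changed and compare the logged memory / KV-cache numbers:

      env:
        - name: VLLM_USE_V1
          value: "1"   # compare against "0" with otherwise identical args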

@Leon-Sander

(since we do dynamic memory allocation during run time).

Can you elaborate a bit more on what exactly is dynamic?

@focuzz8

focuzz8 commented Jan 31, 2025

I can confirm increased memory consumption with Qwen/Qwen2.5-32B-Instruct-AWQ.
Hardware: 2x RTX 4070 Ti = 32GB VRAM.
I suspect that @robertgshaw2-redhat is correct and the problem is related to torch.compile.
