[Performance]: V1 higher memory usage #12529
Comments
Hi @wedobetter, thanks for reporting the issue. We will take a look at the memory profiling part of V1. Meanwhile, could you please try lowering …
I think you could also try reducing …
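The specific parameter names are cut off in the two suggestions above. As a hedged sketch only, these are the engine arguments most commonly lowered to reduce memory pressure in vLLM; whether they are the ones the commenters meant is an assumption, and the values are purely illustrative:

```python
from vllm import LLM

# Hedged sketch: values are illustrative, not recommendations from the thread.
llm = LLM(
    model="Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4",
    max_model_len=4096,   # shorter context -> smaller KV cache allocation
    max_num_seqs=64,      # fewer concurrent sequences -> lower peak activation memory
)
```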
I have noticed similar issues for VLMs as well. See Slack thread. cc @ywang96
As I stated, I know how to reconfigure the execution parameters. My observation was that an OOM was encountered just by upgrading from the latest 0.6.x to 0.7.0 vLLM and enabling V1, while keeping all the runtime parameters the same.
Thanks for your input. I generally set that to 1, fearing that CPU offloading can significantly affect performance, but I have probably confused the gpu-memory-utilization and cpu_offload_gb options.
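For reference, a minimal sketch (not from the original thread) of how the two options differ when constructing a vLLM engine; the values shown are illustrative assumptions:

```python
from vllm import LLM

# gpu_memory_utilization: fraction of each GPU's VRAM that vLLM may use for
#   weights, activations, and KV cache (default 0.9). Setting it to 1 means
#   "use all of the VRAM", not "offload to CPU".
# cpu_offload_gb: amount of model weights (in GiB) offloaded to CPU RAM per GPU
#   (default 0). This is the option that actually enables CPU offloading.
llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",   # model mentioned in the thread, used as an example
    gpu_memory_utilization=0.85,             # illustrative value
    cpu_offload_gb=0,                        # keep CPU offloading disabled
)
```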
Thanks for the feedback. Note for contributors: another place to look is torch.compile and the number of CUDA graphs we use.
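A related user-level knob (distinct from the contributor-side investigation this comment refers to): vLLM's enforce_eager option skips CUDA graph capture, trading some latency for lower memory overhead. A minimal sketch, assuming the model and GPU count from the report below:

```python
from vllm import LLM

# enforce_eager=True runs the model in eager mode and skips CUDA graph capture,
# avoiding the extra memory that captured graphs reserve (at some cost in latency).
llm = LLM(
    model="Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4",
    tensor_parallel_size=4,   # matches the 4-GPU setup described in the report
    enforce_eager=True,
)
```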
Can you elaborate a bit more on what exactly is dynamic?
I can confirm increased memory consumption with Qwen/Qwen2.5-32B-Instruct-AWQ.
Proposal to improve performance
No response
Report of performance regression
Hardware: 4x RTX 3070 = 32GB VRAM
Issue: I was able to run Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4 with a 12K context length on 0.6.x. Now, with 0.7.0 + VLLM_USE_V1=1, I cannot push the context length higher than 3K without encountering a CUDA OOM error. Of course, I can reconfigure it to avoid the OOM; my question is: is V1 expected to consume more memory?
Some of the libraries:
VLLM command
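The reporter's exact command is not included above. The following is a hedged sketch of what such a launch might look like, assuming the 4-GPU, 12K-context setup described in the report; the parameter values are illustrative, not the reporter's actual configuration:

```python
import os

# Assumption: toggling the V1 engine via the environment variable named in the report.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

# Illustrative configuration matching the hardware and context length described above.
llm = LLM(
    model="Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4",
    tensor_parallel_size=4,      # 4x RTX 3070
    max_model_len=12288,         # the 12K context that worked on 0.6.x
    gpu_memory_utilization=0.9,  # vLLM default
)
```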
Thanks
Misc discussion on performance
No response
Your current environment (if you think it is necessary)