[Bugfix] fix GPU VRAM calculation problem #328
Conversation
Signed-off-by: Rein Yang <[email protected]>
💡 Codex Review
Here are some automated review suggestions for this pull request.
vllm_omni/entrypoints/omni_stage.py
Outdated
# Create our lock file
try:
    with open(lock_file, "w") as f:
        f.write(f"{_os.getpid()}\n")
The new sequential init uses while _os.path.exists(lock_file) and then creates the lock with open(lock_file, "w") only after that check. If two stage processes on the same CUDA device start at the same moment while no lock file exists, both will fall through the wait loop and each will create/truncate the file, so they proceed to initialize concurrently and the non_torch_memory miscalculation you’re trying to prevent can still happen. The lock needs an atomic creation (e.g., os.O_EXCL) or the existence check must be combined with the creation step; the same pattern is repeated in the async worker block.
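For illustration, a minimal sketch of an atomic acquire along the lines Codex suggests (the function name and structure here are assumptions, not the PR's actual code):

```python
import os

def try_acquire_device_lock(lock_file: str) -> bool:
    """Atomically create the lock file; return False if another process already holds it."""
    try:
        # O_CREAT | O_EXCL makes "check and create" a single atomic step,
        # so two stages racing on the same device cannot both succeed.
        fd = os.open(lock_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(f"{os.getpid()}\n")
    return True
```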
vllm_omni/entrypoints/omni_stage.py
Outdated
lock_file = f"/tmp/vllm_omni_device_{device_id}_init.lock"

# Wait for other instances to finish initialization
max_wait_time = 300  # 5 minutes max wait
This part seems duplicated with the input arg --init-sleep-time.
No problem. init_sleep_time sets a fixed delay after which the instance will start init, whereas here I've configured it to retry continuously within max_wait_time: as soon as an instance that is already initializing finishes, this instance immediately begins its own initialization.
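To make the difference concrete, a rough sketch of that retry behavior (a hypothetical helper, not the actual code):

```python
import os
import time

def wait_for_device(lock_file: str, max_wait_time: float = 300.0, poll_interval: float = 1.0) -> None:
    """Unlike a fixed init sleep, return as soon as the previous instance releases its lock."""
    deadline = time.time() + max_wait_time
    while os.path.exists(lock_file) and time.time() < deadline:
        time.sleep(poll_interval)
```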
vllm_omni/entrypoints/omni_stage.py
Outdated
if torch.cuda.is_available():
    # Get the current device ID (logical device 0 after set_stage_devices)
    device_id = torch.cuda.current_device()
    lock_file = f"/tmp/vllm_omni_device_{device_id}_init.lock"
- Is there any way to delete the temp lock files in case a stage init fails? Otherwise, the GPU will stay locked forever.
- What if Stage A is on GPU 1 and Stage B runs TP with size 2 on GPUs 0 and 1?
- Or could we do it another way (see the sketch after this list)?
  - Determine the initialization order in the OmniLLM before initializing the stages.
  - Read the required GPU IDs from the stage configs and apply an initialization-sorting algorithm: stages without GPU overlap are initialized in parallel, and stages with GPU overlap are initialized one by one.
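A rough sketch of that sorting idea, purely as an illustration (all names here are hypothetical):

```python
def group_stages_by_gpu_overlap(stage_gpu_ids: dict[str, set[int]]) -> list[list[str]]:
    """Group stages whose GPU sets overlap (directly or transitively).

    Groups can be initialized in parallel; stages inside a group go one by one.
    """
    groups: list[tuple[set[int], list[str]]] = []
    for stage, gpus in stage_gpu_ids.items():
        # Pull out every existing group that shares a GPU with this stage ...
        overlapping = [g for g in groups if g[0] & gpus]
        for g in overlapping:
            groups.remove(g)
        # ... and merge them with the new stage into a single sequential group.
        merged_gpus = set(gpus).union(*(g[0] for g in overlapping))
        merged_stages = [s for g in overlapping for s in g[1]] + [stage]
        groups.append((merged_gpus, merged_stages))
    return [stages for _, stages in groups]

# Example: thinker on {0, 1}, talker on {1}, vision on {2}
# -> [["thinker", "talker"], ["vision"]]
print(group_stages_by_gpu_overlap({"thinker": {0, 1}, "talker": {1}, "vision": {2}}))
```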
- I set max_wait_time to remove those stale tmp lock files:
  # Check if the lock file is stale (older than 5 minutes)
  if _time.time() - _os.path.getmtime(lock_file) > max_wait_time:
      _os.remove(lock_file)
      break
- Because the device ID I get is the logical ID, even if Stage A is on GPU 1 and Stage B runs TP on GPUs 0 and 1, both of their lock files use device ID 0. This is a bug; I fixed it in my next commit by using the physical device ID and will push it soon.
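For reference, one way that logical-to-physical mapping could look (a sketch under the assumption that CUDA_VISIBLE_DEVICES contains integer indices; not necessarily the committed fix):

```python
import os
from typing import Optional

import torch

def physical_device_id(logical_id: Optional[int] = None) -> int:
    """Map a logical CUDA device index back to the physical GPU id."""
    if logical_id is None:
        logical_id = torch.cuda.current_device()
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if not visible:
        return logical_id  # no remapping in effect
    return int(visible.split(",")[logical_id])

# Stages sharing a physical GPU now agree on one lock file,
# even when their logical device indices differ.
lock_file = f"/tmp/vllm_omni_device_{physical_device_id()}_init.lock"
```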
- In the current design, I ensure that they are initialized in parallel on different devices, but sequentially on the same device.
Signed-off-by: Rein Yang <[email protected]>
Could you please add an entry to the FAQ docs about memory usage when placing different stages on the same device?
Purpose
Placing multiple instances of the same model on the same device can cause memory calculation errors, because non_torch_memory mistakenly includes memory consumed by other Omni instances. As a result, non_kv_cache_memory ends up larger than its real value, leaving less memory for the KV cache and triggering the assert.
When we calculate non_torch_memory:
- cuda_memory = total_memory - free_memory → reflects memory used by all processes on the GPU
- torch_memory = torch.cuda.memory_reserved() → memory reserved by PyTorch's CUDA allocator for the current process only
- non_torch_memory = cuda_memory - torch_memory
Hence, any process that does not belong to the current instance will disrupt the current instance's memory calculation.
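For reference, a minimal sketch of the calculation described above (illustrative only, not vLLM's exact profiling code):

```python
import torch

free_memory, total_memory = torch.cuda.mem_get_info()  # device-wide view (all processes)
cuda_memory = total_memory - free_memory                # memory used by every process on the GPU
torch_memory = torch.cuda.memory_reserved()             # this process's CUDA allocator only
non_torch_memory = cuda_memory - torch_memory           # inflated by other Omni instances' usage
```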
Using qwen3-omni as an example: we init thinker tp:0 on device 0, thinker tp:1 on device 1, and the talker is initialized on device 1 at the same time. Because the two processes on device 1 interfere with each other during initialization, their calculated non_torch_memory values are disordered even when memory is sufficient, leading to an error in available_kvcache_mem.
I added a file lock during the initialization phase so that multiple instances on the same device are initialized in sequence.
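Putting the pieces from the review together, a consolidated sketch of how such a lock could be acquired and released (assumed structure combining atomic creation, stale-lock cleanup, and release in finally so a failed init does not leave the device locked; this is not the PR's verbatim code):

```python
import os
import time
from contextlib import contextmanager

@contextmanager
def device_init_lock(device_id: int, max_wait_time: float = 300.0, poll_interval: float = 1.0):
    lock_file = f"/tmp/vllm_omni_device_{device_id}_init.lock"
    deadline = time.time() + max_wait_time
    while True:
        try:
            # Atomic "create if absent": only one stage per device gets through.
            fd = os.open(lock_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            try:
                # Drop stale locks left behind by a crashed stage.
                if time.time() - os.path.getmtime(lock_file) > max_wait_time:
                    os.remove(lock_file)
                    continue
            except FileNotFoundError:
                continue  # holder released it between the two calls; retry immediately
            if time.time() > deadline:
                raise TimeoutError(f"Timed out waiting for {lock_file}")
            time.sleep(poll_interval)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(f"{os.getpid()}\n")
        yield  # run the memory-sensitive initialization here
    finally:
        try:
            os.remove(lock_file)
        except FileNotFoundError:
            pass
```

Usage would be something like wrapping the stage's engine initialization in `with device_init_lock(device_id): ...`, so the lock is dropped even if init raises.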
Test Plan
Verify that the memory calculation is correct.
Test Result
Thinker:


Talker: