-
Notifications
You must be signed in to change notification settings - Fork 316
fix: prevent divide-by-zero in calculate_ray_np when Ray cluster not ready #864
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
xyuzh
wants to merge
1
commit into
datajuicer:main
Choose a base branch
from
xyuzh:fix/prevent-divide-by-zero-in-calculate-ray-np
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -242,6 +242,21 @@ def calculate_ray_np(operators): | |||||||||||||||||||||||||||||||||||||||||||||||
| total_gpu = ray_gpu_count() | ||||||||||||||||||||||||||||||||||||||||||||||||
| available_mem = sum(ray_available_memories()) * _OPS_MEMORY_LIMIT_FRACTION / 1024 # Convert MB to GB | ||||||||||||||||||||||||||||||||||||||||||||||||
| available_gpu_mem = sum(ray_available_gpu_memories()) * _OPS_MEMORY_LIMIT_FRACTION / 1024 # Convert MB to GB | ||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||
| # Validate cluster resources to prevent divide-by-zero | ||||||||||||||||||||||||||||||||||||||||||||||||
| if total_cpu == 0: | ||||||||||||||||||||||||||||||||||||||||||||||||
| raise RuntimeError( | ||||||||||||||||||||||||||||||||||||||||||||||||
| "Ray cluster has no CPU resources available (ray_cpu_count() returned 0). " | ||||||||||||||||||||||||||||||||||||||||||||||||
| "This typically indicates the Ray cluster is not properly initialized. " | ||||||||||||||||||||||||||||||||||||||||||||||||
| "Please ensure the Ray cluster has active worker nodes." | ||||||||||||||||||||||||||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||
| if available_mem == 0: | ||||||||||||||||||||||||||||||||||||||||||||||||
| raise RuntimeError( | ||||||||||||||||||||||||||||||||||||||||||||||||
| "Ray cluster has no memory resources available. " | ||||||||||||||||||||||||||||||||||||||||||||||||
| "Please verify the Ray cluster status with ray.cluster_resources()." | ||||||||||||||||||||||||||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||
| resource_configs = {} | ||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||
| for op_idx, op in enumerate(operators): | ||||||||||||||||||||||||||||||||||||||||||||||||
|
|
@@ -268,6 +283,18 @@ def calculate_ray_np(operators): | |||||||||||||||||||||||||||||||||||||||||||||||
| cpu_required_frac, gpu_required_frac = 0, 0 | ||||||||||||||||||||||||||||||||||||||||||||||||
| # GPU operator calculations | ||||||||||||||||||||||||||||||||||||||||||||||||
| if op.use_cuda(): | ||||||||||||||||||||||||||||||||||||||||||||||||
| if total_gpu == 0: | ||||||||||||||||||||||||||||||||||||||||||||||||
| raise RuntimeError( | ||||||||||||||||||||||||||||||||||||||||||||||||
| f"Op[{op._name}] requires GPU but no GPUs are available in Ray cluster " | ||||||||||||||||||||||||||||||||||||||||||||||||
| "(ray_gpu_count() returned 0). " | ||||||||||||||||||||||||||||||||||||||||||||||||
| "Please ensure GPU nodes are configured in the Ray cluster." | ||||||||||||||||||||||||||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||||||||||||||||||||||||||
| if available_gpu_mem == 0: | ||||||||||||||||||||||||||||||||||||||||||||||||
| raise RuntimeError( | ||||||||||||||||||||||||||||||||||||||||||||||||
| f"Op[{op._name}] requires GPU but no GPU memory is available. " | ||||||||||||||||||||||||||||||||||||||||||||||||
| "Please verify GPU nodes are properly configured." | ||||||||||||||||||||||||||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Comment on lines
+286
to
+296
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Similar to the previous comment, it would be more user-friendly to check for all GPU-related resource issues at once and report them together. This provides a more comprehensive error message if both GPUs and GPU memory are unavailable.
Suggested change
|
||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||
| gpu_req = op.num_gpus | ||||||||||||||||||||||||||||||||||||||||||||||||
| gpu_mem_req = op.memory | ||||||||||||||||||||||||||||||||||||||||||||||||
| if not gpu_req and not gpu_mem_req: | ||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a better user experience, it's good practice to collect all validation errors and report them together. This allows the user to see all resource issues at once, rather than fixing them one by one. You can accumulate error messages in a list and raise a single
RuntimeErrorif any issues are found.