[DOC-127] MVP for OSS Ray labels #54254
Conversation
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This pull request has been automatically closed because there has been no further activity in the last 14 days. Please feel free to reopen or open a new pull request if you'd still like this to be addressed. Again, you can always ask for help on our discussion forum or Ray's public Slack channel. Thanks again for your contribution!
Not stale
- NodeAffinitySchedulingStrategy when `soft=false`. Use the default `ray.io/node-id` label instead.
- The `accelerator_type` option for tasks and actors. Use the default `ray.io/accelerator-type` label instead.
- Custom resources such as the `special_hardware` pattern. Use custom labels instead.
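For context, a minimal sketch of what the replacements in this list look like in code. It assumes the `label_selector` option for tasks and actors that these docs introduce; the node ID and accelerator values are placeholders:

```python
import ray

# Instead of NodeAffinitySchedulingStrategy(node_id, soft=False),
# select the node through its default ray.io/node-id label.
# (The node ID value below is a placeholder.)
@ray.remote(label_selector={"ray.io/node-id": "<node-id>"})
def pinned_task():
    return "runs only on the labeled node"

# Instead of accelerator_type="H100", select on the default
# ray.io/accelerator-type label.
@ray.remote(num_gpus=1, label_selector={"ray.io/accelerator-type": "H100"})
def gpu_task():
    return "runs on a node with H100 accelerators"

# Instead of a marker custom resource such as special_hardware=1,
# use a custom label attached at `ray start` time.
@ray.remote(label_selector={"special_hardware": "true"})
def special_task():
    return "runs on specially labeled nodes"
```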
Nit: Just to be more specific, the pattern that we want to replace is mainly using custom resources as labels, mentioned in the section here.
Pointing this out mainly because custom resources can still be useful when they're used as actual resources with a quantity attached.
Updating this to call out this behavior AND change the existing resources docs to point to labels in this instance.
Added one more update regarding the autoscaler support.
You can add custom labels to your nodes using the `--labels` or `--labels-file` parameter when running `ray start`. See the following examples:
<!-- INSERT EXAMPLES -->
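As a placeholder for those examples, something along these lines might work (a sketch based on the discussion in this thread; the label names, the comma-separated multi-label syntax, and the `--labels-file` path and file format are assumptions):

```bash
# Start a head node with two custom labels set on the command line.
ray start --head --labels="cpu-family=amd,market-type=spot"

# Start a worker node that reads its labels from a file instead.
# (File path and format are placeholders; see the --labels-file docs.)
ray start --address=<head-node-ip>:6379 --labels-file=/path/to/labels.yaml
```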
@janetli19 @MengjinYan Still haven't really seen much here beyond what you can accomplish with defaults.
Maybe an example where we mark something like CPU size to allow users to assign tasks to specific VM types in a heterogeneous CPU environment?
Could use help coding something up as I'm pretty novice at Ray.
Just noticed that's exactly the example we give below. So here we just need to show how to define the config for that value?
I think one example we can give is to set the CPU family to indicate that the node has an AMD CPU:
```bash
# Start a head node with label name "cpu-family" & value "amd"
ray start --head --labels="cpu-family=amd"
```
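To round that out, a task could then target the label at scheduling time. A rough sketch, assuming the `label_selector` option for tasks and actors that these docs describe:

```python
import ray

ray.init()

# Only schedulable on nodes started with --labels="cpu-family=amd".
@ray.remote(label_selector={"cpu-family": "amd"})
def amd_only_task():
    return "running on an AMD node"

print(ray.get(amd_only_task.remote()))
```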
Just one last nit comment and a missed code example.
(defaults)=
## Default node labels
Per the offline discussion, we might want to add a disclaimer similar to the following:
Ray reserves all labels in the `ray.io` and `anyscale.io` namespaces.
anyscale.com?
Should be anyscale.com, not .io.
Why would Ray reserve anyscale.com labels though?
Actually, that's a good point. From Ray's perspective, we should only reserve the `ray.io` namespace.
This reminds me we should add validation that users can't set `ray.io/` labels, if we haven't already. @ryanaoleary
@edoakes We previously had that for `NodeLabelSchedulingPolicy` here, but we loosened this restriction for labels passed in with the new API (`--labels` and `--labels-file`), since I think we discussed passing labels from KubeRay with the `ray.io/` prefix in the `rayStartParams` for autoscaling. For ray-project/kuberay#4106, should we require that labels passed to the top-level `Labels` field can't begin with `ray.io/`?
@MengjinYan LMK if this needs to happen for this release or you want to add later.
@dstrodtman I think for this release, we should still say that `ray.io` is reserved, but we probably don't need to tell users they shouldn't set `ray.io` labels if we haven't added that validation already.
Autoscaler V2 supports label-based scheduling. To enable the autoscaler to scale up nodes to fulfill label requirements, you need to create multiple worker groups for the different label requirement combinations and specify all the corresponding labels in the `rayStartParams` field of the Ray cluster configuration. For example:
```yaml
rayStartParams: {
```
When ray-project/kuberay#4106 is merged, we can direct users to set their desired labels in the top-level `Labels` field under the worker or head group with KubeRay v1.5+, but for now `rayStartParams` is the only option.
@MengjinYan LMK if this needs to happen for this release or you want to add later.
@dstrodtman I think we can add it later.
Read through and everything LGTM!
This is passing in my local build, and I think it's ready to merge. Note there's an open PR to resolve some docs issues that might cause some tests to fail here: #57163
One last suggestion on the namespace reservation. Thanks for fixing the build error!
Co-authored-by: Mengjin Yan <[email protected]>
Signed-off-by: Douglas Strodtman <[email protected]>
@jjyao @edoakes @angelinalg ping to merge
(cherry picked from commit 44a9732)
Co-authored-by: Douglas Strodtman <[email protected]>
Signed-off-by: Josh Kodi <[email protected]>
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.