@noemotiovon
Contributor

Description

  • _get_current_node_accelerators now detects and returns all accelerator types per node, not just a single (AcceleratorManager, count) tuple.

  • Ray node labels now support multiple accelerator managers.

  • Default labels from all accelerators are merged.

  • Conflicts between accelerator default labels are logged.

  • User-specified and autoscaler labels are merged with default accelerator labels, with warnings on overrides.

This improves Ray’s handling of heterogeneous accelerator nodes and enables more flexible scheduling.
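The merge behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names `merge_default_labels` and `merge_node_labels`, the label keys, and the manager names are all hypothetical. The "first manager wins" rule on a default-label conflict is an assumption, since the description only says that conflicts are logged.

```python
import logging
from typing import Dict, List, Tuple

logger = logging.getLogger(__name__)


def merge_default_labels(
    per_manager_labels: List[Tuple[str, Dict[str, str]]],
) -> Dict[str, str]:
    """Merge default labels from several accelerator managers (hypothetical helper).

    Each entry is (manager_name, default_labels). On a conflicting key the
    value seen first is kept and the conflict is logged. Keeping the first
    value is an assumption made for this sketch.
    """
    merged: Dict[str, str] = {}
    for manager_name, labels in per_manager_labels:
        for key, value in labels.items():
            if key in merged and merged[key] != value:
                logger.warning(
                    "Label conflict on %r: keeping %r, ignoring %r from %s",
                    key, merged[key], value, manager_name,
                )
                continue
            merged[key] = value
    return merged


def merge_node_labels(
    default_labels: Dict[str, str],
    user_labels: Dict[str, str],
) -> Dict[str, str]:
    """Overlay user-specified or autoscaler labels on the accelerator defaults.

    User-provided values win; a warning is logged whenever one replaces a
    different default value.
    """
    result = dict(default_labels)
    for key, value in user_labels.items():
        if key in result and result[key] != value:
            logger.warning(
                "Label %r: user value %r overrides default %r",
                key, value, result[key],
            )
        result[key] = value
    return result


# Illustrative usage with made-up label keys and values.
defaults = merge_default_labels([
    ("GPU", {"accelerator-type": "A100"}),
    ("TPU", {"accelerator-type": "TPU-V4", "tpu-topology": "2x2"}),
])
final_labels = merge_node_labels(defaults, {"accelerator-type": "custom"})
```

Splitting the merge into two steps keeps the two kinds of conflict separate: disagreements between accelerator defaults are resolved first, and only then do user or autoscaler labels override the result.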

Related issues

Related to #58206

Signed-off-by: noemotiovon <[email protected]>
@noemotiovon
Contributor Author

@ryanaoleary, could you please help me check whether this PR is reasonable? It’s an enhancement to PR #53360.

@ryanaoleary
Contributor

> @ryanaoleary, could you please help me check whether this PR is reasonable? It’s an enhancement to PR #53360.

Is it possible for a Ray node to have multiple accelerator managers? My understanding was that we would only ever detect one accelerator type (and therefore one AcceleratorManager) per node. This PR seems reasonable to me if there's a use case for it; I'm just not sure I understand when it would occur.

@noemotiovon
Contributor Author

Yeah, as far as I know, a Ray node typically has only one type of accelerator, so there should be only a single AcceleratorManager per node. However, I’ve seen some projects use Ray in a way where they describe other resources using the GPU resource count, which is technically incorrect usage. That said, if we allow a node to register multiple accelerator tags, it would incidentally support their use case as well.
