ayushdg (Contributor) commented Oct 3, 2025

Description

Adds initial example slurm scripts for single and multi-node runs.

Usage

N/A

Checklist

  • I am familiar with the Contributing Guide.
  • [N/A] New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Ayush Dattagupta <[email protected]>
ayushdg (Contributor Author) commented Oct 3, 2025

@lbliii Could you point me to where I should also update this in the docs?

echo "RAY_ADDRESS: $RAY_ADDRESS"



Contributor


Since we start the head node differently from RayClient, we should make sure we carry over the same environment variables, e.g. XENNA_RAY_METRICS_PORT or XENNA_RESPECT_CUDA_VISIBLE_DEVICES. I'm not sure about the other two here:

os.environ["DASHBOARD_METRIC_PORT"] = str(get_free_port(DEFAULT_RAY_DASHBOARD_METRIC_PORT))
os.environ["AUTOSCALER_METRIC_PORT"] = str(get_free_port(DEFAULT_RAY_AUTOSCALER_METRIC_PORT))
# We set some env vars for Xenna here. This is only used for Xenna clusters.
os.environ["XENNA_RAY_METRICS_PORT"] = str(ray_metrics_port)

Contributor Author


To clarify, is the suggestion here to also enable Prometheus/Grafana metrics, or to ensure that the env variables are exported on both the head and the client?

The current setup doesn't export the Ray metrics port or CUDA_VISIBLE_DEVICES anywhere.
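
For reference, here is a minimal sketch of the second reading (making sure the head node process sees the same variables RayClient would set). It is not taken from the PR or from RayClient: `find_free_port`, `head_node_env`, and the port value 8080 are hypothetical names and values chosen for illustration. Only the variable names XENNA_RAY_METRICS_PORT and XENNA_RESPECT_CUDA_VISIBLE_DEVICES come from the comment above. Because the accepted values of XENNA_RESPECT_CUDA_VISIBLE_DEVICES are not stated in this thread, the sketch only forwards it when the caller has already set it.

```python
import socket


def find_free_port(preferred: int) -> int:
    """Return `preferred` if it can be bound locally, else an OS-assigned free port.

    Hypothetical stand-in for the project's own port helper.
    """
    for candidate in (preferred, 0):  # port 0 asks the OS to pick any free port
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("", candidate))
            except OSError:
                continue  # preferred port is taken; fall back to OS choice
            return sock.getsockname()[1]
    raise RuntimeError("no free port available")


def head_node_env(base_env: dict, metrics_port: int) -> dict:
    """Build the environment for a manually started head node.

    Copies `base_env` (so the caller's mapping is not mutated) and adds the
    metrics port under the name mentioned in the review comment.
    """
    env = dict(base_env)
    env["XENNA_RAY_METRICS_PORT"] = str(metrics_port)
    # Assumption: valid values for XENNA_RESPECT_CUDA_VISIBLE_DEVICES are not
    # known here, so it is passed through unchanged if present and otherwise
    # left unset rather than guessed.
    return env


if __name__ == "__main__":
    import os

    port = find_free_port(8080)  # 8080 is an arbitrary illustrative default
    env = head_node_env(dict(os.environ), port)
    print(f"XENNA_RAY_METRICS_PORT={env['XENNA_RAY_METRICS_PORT']}")
```

The resulting mapping could be passed as `env=` to `subprocess.run` when launching the head node, or the same variables could be exported in the Slurm script before the head node is started.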
