Initial slurm deployment scripts #1168
base: main
Conversation
Signed-off-by: Ayush Dattagupta <[email protected]>
@lbliii Could you point me to where I should also update this in the docs?
echo "RAY_ADDRESS: $RAY_ADDRESS"
Since we are starting the head node differently than RayClient, we should make sure we carry over the same env variables, which could be XENNA_RAY_METRICS_PORT or XENNA_RESPECT_CUDA_VISIBLE_DEVICES. I'm not sure about the other two here.
Curator/nemo_curator/core/utils.py
Lines 117 to 121 in e4f9571
os.environ["DASHBOARD_METRIC_PORT"] = str(get_free_port(DEFAULT_RAY_DASHBOARD_METRIC_PORT))
os.environ["AUTOSCALER_METRIC_PORT"] = str(get_free_port(DEFAULT_RAY_AUTOSCALER_METRIC_PORT))
# We set some env vars for Xenna here. This is only used for Xenna clusters.
os.environ["XENNA_RAY_METRICS_PORT"] = str(ray_metrics_port)
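For context, the snippet above relies on a `get_free_port` helper. A minimal sketch of the behavior it suggests (prefer the given port, otherwise fall back to an OS-assigned free one) might look like this; the actual Curator implementation may differ.

```python
import socket


def get_free_port(preferred: int) -> int:
    """Return `preferred` if it can be bound, else any OS-assigned free port."""
    for candidate in (preferred, 0):  # port 0 asks the OS for any free port
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                s.bind(("", candidate))
                return s.getsockname()[1]
        except OSError:
            continue  # preferred port was taken; try the OS-assigned fallback
    raise RuntimeError("no free port available")
```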
To clarify: is the suggestion here to also enable Prometheus/Grafana metrics, etc., or to ensure that env variables are exported across the head and the client? The current setup doesn't export the XENNA_RAY_METRICS_PORT or CUDA_VISIBLE_DEVICES anywhere.
Description
Adds initial example slurm scripts for single and multi-node runs.
Usage
N/A
Checklist