Initial slurm deployment scripts #1168
base: main
Conversation
Signed-off-by: Ayush Dattagupta <[email protected]>
@lbliii Could you point me to where I should also update this in the docs?
echo "RAY_ADDRESS: $RAY_ADDRESS"
Since we are starting the head node differently than RayClient, we should make sure we carry over the same env variables, which could be XENNA_RAY_METRICS_PORT or XENNA_RESPECT_CUDA_VISIBLE_DEVICES. I'm not sure about the other two here.
Curator/nemo_curator/core/utils.py
Lines 117 to 121 in e4f9571
os.environ["DASHBOARD_METRIC_PORT"] = str(get_free_port(DEFAULT_RAY_DASHBOARD_METRIC_PORT))
os.environ["AUTOSCALER_METRIC_PORT"] = str(get_free_port(DEFAULT_RAY_AUTOSCALER_METRIC_PORT))
# We set some env vars for Xenna here. This is only used for Xenna clusters.
os.environ["XENNA_RAY_METRICS_PORT"] = str(ray_metrics_port)
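For context, the snippet above relies on a `get_free_port` helper. A minimal sketch of the behavior it suggests (prefer the given port, otherwise fall back to an OS-assigned free one) might look like this; the actual Curator implementation may differ.

```python
import socket


def get_free_port(preferred: int) -> int:
    """Return `preferred` if it can be bound, else any OS-assigned free port."""
    for candidate in (preferred, 0):  # port 0 asks the OS for any free port
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                s.bind(("", candidate))
                return s.getsockname()[1]
        except OSError:
            continue  # preferred port was taken; try the OS-assigned fallback
    raise RuntimeError("no free port available")
```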
To clarify: is the suggestion here to also enable Prometheus/Grafana metrics, etc., or to ensure that env variables are exported across the head and the client? The current setup doesn't export the XENNA_RAY_METRICS_PORT or CUDA_VISIBLE_DEVICES anywhere.
Description
Adds initial example slurm scripts for single and multi-node runs.
Usage
N/A
Checklist