-
Notifications
You must be signed in to change notification settings - Fork 74
Update SkyPilot to use volume mounts and added support for multi-node #301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: ansjindal <[email protected]>
if cloud == "k8s": | ||
# VolumeConfig region and zone required even though they are marked as optional | ||
# validation fails otherwise | ||
config["cloud"] = "kubernetes" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this value needs to be kubernetes - based on the providers supported list in skypilot
Can you rebase on latest main? Skypilot was just updated to 0.10.0 there. |
Signed-off-by: ansjindal <[email protected]>
Signed-off-by: ansjindal <[email protected]>
d6727d8
to
26bec77
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add some basic unit tests for the codecov CI check to pass? Thanks a lot for your contribution.
nemo_run/core/execution/skypilot.py
Outdated
launcher = self.launcher | ||
# Dynamic rendezvous has an error in Skypilot Kubernetes currently | ||
if ( | ||
launcher | ||
and isinstance(launcher, (Torchrun, FaultTolerance)) | ||
and self.cloud == "kubernetes" | ||
): | ||
launcher.rdzv_backend = "static" | ||
launcher.rdzv_port = 49500 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need this part anymore, its fixed in the latest version of Skypilot.
Signed-off-by: ansjindal <[email protected]>
Signed-off-by: ansjindal <[email protected]>
is there something needed more on this PR? |
volume_mounts
to use pre-created 'nemo-workspace' PVC mapping to K8s Volume Mounts.volumes
to use volumes in the pod.SkypilotExecutor
did not support the launcher API (supports_launcher_transform() = False
), preventing it from using torchrun for multi-node distributed training.This PR enables launcher API support in
SkypilotExecutor
Changes Made
supports_launcher_transform() -> True
launcher="torchrun"
whennum_nodes > 1
.Older PR got closed by mistake: #296
Example:
Some logs: