Optimize the time for creating pods and services when creating new Job. #2361

Open
lishangyuzi opened this issue Dec 23, 2024 · 1 comment

Comments

@lishangyuzi

What you would like to be added?

When a large-scale job such as a PyTorch job is created, use multiple goroutines to run the createPod and createSvc calls. This can reduce the time the job takes to create its Kubernetes (k8s) resources.

Why is this needed?

When a large-scale job has a large number of pods, creating the pods and services (svc) takes a long time, which seriously delays the startup of a large-scale training job.

Love this feature?

Give it a 👍 We prioritize the features with most 👍

@lishangyuzi lishangyuzi changed the title Optimize the time for creating pods and services when creating new Jobs. Optimize the time for creating pods and services when creating new Job. Dec 23, 2024
@lishangyuzi
Author

Pods and services are created at the following call sites. I have tried modifying them to use multiple goroutines.

err = jc.createNewPod(job, rt, index, spec, masterRole, replicas)

err = jc.CreateNewService(job, rtype, spec, strconv.Itoa(index))
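
For discussion, here is a minimal sketch of what bounded concurrent creation could look like. It is not the operator's actual code: `runParallel`, `errSimulated`, and the `createPodAndService` closure are hypothetical names, and the closure only stands in for calls like `jc.createNewPod` and `jc.CreateNewService`. It assumes Go 1.20+ (for `errors.Join`) and assumes the real create functions are safe to call from several goroutines, which would need to be checked.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errSimulated stands in for a failed API call in the demo below.
var errSimulated = errors.New("simulated create failure")

// runParallel calls create(index) for every index in [0, n), keeping at most
// limit calls in flight at once, and returns all failures joined together.
func runParallel(n, limit int, create func(index int) error) error {
	if limit < 1 {
		limit = 1
	}
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex // guards errs
		errs []error
	)
	sem := make(chan struct{}, limit) // counting semaphore
	for i := 0; i < n; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit calls are already running
		go func(index int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			if err := create(index); err != nil {
				mu.Lock()
				errs = append(errs, fmt.Errorf("replica %d: %w", index, err))
				mu.Unlock()
			}
		}(i)
	}
	wg.Wait()
	return errors.Join(errs...) // nil when no call failed
}

func main() {
	// Hypothetical stand-in for creating one replica's pod and service.
	createPodAndService := func(index int) error {
		if index == 3 {
			return errSimulated
		}
		return nil
	}
	if err := runParallel(8, 4, createPodAndService); err != nil {
		fmt.Println("create failed:", err)
	}
}
```

The concurrency limit is there so that a job with thousands of replicas does not start thousands of goroutines at once. Errors are collected rather than aborting on the first failure, so one failed replica does not hide the others. Whether that matches the controller's existing error handling is an open question.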
