
[Core feature] Allow setting separate resource configs for different replica types in Kubeflow Jobs #3308

Closed · yubofredwang opened this issue Feb 3, 2023 · 8 comments · Fixed by flyteorg/flyteidl#386 and flyteorg/flyteplugins#345
Labels: enhancement (New feature or request)

@yubofredwang (Contributor)
Motivation: Why do you think this is important?

It is a common scenario for users to set different resources for different types of replicas in Kubeflow Jobs such as TFJob and MPIJob.

For example, a user might want to run a TFJob with the following resources:

PS:
  replicas: 1
  template:
    spec:
      containers:
        - resources:
            limits:
              cpu: "1"
Worker:
  replicas: 2
  template:
    spec:
      containers:
        - resources:
            limits:
              nvidia.com/gpu: 1
              memory: "10G"

However, this is not currently supported in Flyte.

Goal: What should the final outcome look like, ideally?

Users should be able to override the resources specified in the task definition by providing extra resource configs in the TfJob task config, for example:

@task(
    task_config=TfJob(
        worker=dict(num=2, requests=Resources(cpu="1", mem="2Gi"), limits=Resources(cpu="1", mem="4Gi")),
        ps=dict(num=1, requests=Resources(cpu="1", mem="2Gi"), limits=Resources(cpu="1", mem="4Gi")),
    ),
    cache=True,
    cache_version="1.0",
)
def mnist_tensorflow_job(hyperparameters: Hyperparameters) -> training_outputs:
    ...
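
For illustration, here is a rough sketch (not actual plugin code) of how such per-replica config could be rendered into TFJob replica specs by the backend plugin; the to_replica_spec helper and its structure are assumptions for this example.

from typing import Optional

from flytekit import Resources


def to_replica_spec(replicas: int, requests: Optional[Resources], limits: Optional[Resources]) -> dict:
    # Build the resources stanza of a single replica spec from flytekit Resources objects.
    resources: dict = {}
    if requests is not None:
        resources["requests"] = {"cpu": requests.cpu, "memory": requests.mem}
    if limits is not None:
        resources["limits"] = {"cpu": limits.cpu, "memory": limits.mem}
    # "tensorflow" is the container name the TFJob operator expects by default.
    return {
        "replicas": replicas,
        "template": {"spec": {"containers": [{"name": "tensorflow", "resources": resources}]}},
    }


# The per-replica settings from the proposed task config above would map to something like:
tf_replica_specs = {
    "Worker": to_replica_spec(2, Resources(cpu="1", mem="2Gi"), Resources(cpu="1", mem="4Gi")),
    "PS": to_replica_spec(1, Resources(cpu="1", mem="2Gi"), Resources(cpu="1", mem="4Gi")),
}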

Describe alternatives you've considered

We could make the resources field in the task function accept a new type, TFJobResources, and implement different handling in the related backend plugins. However, this requires lots of code changes and undermines the consistency of task definitions.

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
yubofredwang added the enhancement (New feature or request) and untriaged (This issue has not yet been looked at by the Maintainers) labels on Feb 3, 2023
welcome bot commented Feb 3, 2023

Thank you for opening your first issue here! 🛠

pingsutw (Member) commented Feb 3, 2023

Thank you for reporting this. We also need separate configs for the Spark task (driver, worker) and the Ray task (head node, worker node). We can add those in separate PRs.

hamersaw (Contributor) commented Feb 3, 2023

This could be implemented similarly to the current Dask integration. Essentially, the DaskScheduler and DaskWorkerGroup definitions allow users to pass in resources, and the DaskJob inherits the requests and limits fields from the Flyte @task decorator (these are also used as defaults when specific resources are not passed to DaskScheduler and DaskWorkerGroup).
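
For reference, a minimal sketch of the Dask pattern described above, assuming the flytekitplugins-dask interface (Dask, Scheduler, WorkerGroup); the fallback to the task-level requests/limits is the behavior this comment describes:

from flytekit import Resources, task
from flytekitplugins.dask import Dask, Scheduler, WorkerGroup


@task(
    task_config=Dask(
        scheduler=Scheduler(requests=Resources(cpu="1"), limits=Resources(cpu="1")),
        workers=WorkerGroup(
            number_of_workers=4,
            requests=Resources(cpu="2", mem="8Gi"),
            limits=Resources(cpu="2", mem="16Gi"),
        ),
    ),
    # Used as the default for any Dask component that does not set its own resources.
    requests=Resources(cpu="1", mem="2Gi"),
    limits=Resources(cpu="1", mem="4Gi"),
)
def dask_workflow_task() -> int:
    return 42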

eapolinario removed the untriaged (This issue has not yet been looked at by the Maintainers) label on Feb 3, 2023
@yubofredwang (Contributor, Author)

@hamersaw Since we now support passing a PodTemplate as task metadata (#3123), I think we can follow the same approach for the Kubeflow operators. I have draft changes here. Please let me know if you have any suggestions on the general approach. I will create a PR once we settle on it.
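
For context, a rough sketch of the PodTemplate-as-task-metadata pattern from #3123 that the draft builds on, assuming flytekit's PodTemplate together with the kubernetes Python client models:

from flytekit import PodTemplate, task
from kubernetes.client import V1Container, V1PodSpec, V1ResourceRequirements

# A pod template that only tweaks the primary container's resources; per this issue,
# the Kubeflow plugins could accept something similar per replica type.
template = PodTemplate(
    primary_container_name="primary",
    pod_spec=V1PodSpec(
        containers=[
            V1Container(
                name="primary",
                resources=V1ResourceRequirements(limits={"cpu": "1", "memory": "4Gi"}),
            )
        ],
    ),
)


@task(pod_template=template)
def train() -> None:
    ...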

@hamersaw (Contributor)

@yubofredwang this is awesome, I think it looks great so far! A few thoughts:

  1. I like having the separate resource requests and limits configuration because the main goal of Flyte is to abstract the complexities of k8s from users. The addition of PodTemplate support is mostly for power users who need to configure every aspect of the Pod. But general users would much rather use a simple requests=Resources(cpu=2), for example, to set resources rather than specifying an entire PodTemplate.
  2. Should we support both pod_template and pod_template_name? There was plenty of discussion of this on the issue around the original integration. I think it makes sense, but I'm open to discussion.
  3. This won't be backwards compatible, right? I don't think we can just delete the existing KF plugin definitions, but there is a task version in the metadata, so maybe for v2 of these plugins the configuration could look like this. I'm not sure if we can support both versions in flytekit or if all existing users will need to update with this change. Backwards compatibility is unfortunately necessary and always complicates things.

@yubofredwang (Contributor, Author)

@hamersaw thanks for the feedback!

  1. We can keep the resources field as it is in the current config; if set, it will override the resources specified in the pod template.
  2. The reason I am adding support for templates is that we often have to introduce different command, env, and image values for different worker groups, especially in the case of an MPI Job, where the Launcher's command and the Worker's command are always different. I don't know which is more aligned with Flyte's principles: providing individual fields for users to set, which is more straightforward, or allowing users to set a PodTemplate, which is more flexible.
  3. I can make it backward compatible: the existing fields override the newly added fields (see the sketch after this list).
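
A tiny sketch of the precedence described in points 1 and 3; the helper names are hypothetical and only illustrate the override order, not the plugin implementation:

from typing import Optional, TypeVar

T = TypeVar("T")


def first_set(*values: Optional[T]) -> Optional[T]:
    # Return the first value that is explicitly set.
    for value in values:
        if value is not None:
            return value
    return None


def resolve_replicas(existing_num_workers: Optional[int], new_worker_replicas: Optional[int]) -> int:
    # Backward compatibility: the existing field (e.g. a plain replica count) wins over
    # the newly added per-replica config when both are set; default to 1 otherwise.
    resolved = first_set(existing_num_workers, new_worker_replicas)
    return resolved if resolved is not None else 1


def resolve_resources(
    replica_resources: Optional[T],
    pod_template_resources: Optional[T],
    task_level_resources: Optional[T],
) -> Optional[T]:
    # Per-replica resources override the pod template, which overrides the
    # task-level requests/limits defaults.
    return first_set(replica_resources, pod_template_resources, task_level_resources)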

@hamersaw (Contributor)

I think this is ready for a PR, and we can continue to iterate there. Does that sound good? Quick replies:

1 / 2. Let's put the resource requests/limits and image as their own configuration values; this then aligns with the @task decorator in Flyte (a rough sketch follows after this list). We can then use pod_template or pod_template_name to set command and env if necessary. I admit I'm not fully sure how this works because the KF operators set some of these things automatically, so we just need to make sure our settings are carried over.
3. This is perfect! I think the existing fields are just the numbers of the different replica types, right? It should be relatively easy and not too messy to keep these around.
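
As a concrete illustration of that direction, here is a sketch of a per-replica configuration with its own replicas, image, and requests/limits; Worker, PS, and TfJob below are hypothetical dataclasses and may differ from what the plugin eventually ships:

from dataclasses import dataclass, field
from typing import Optional

from flytekit import Resources


@dataclass
class Worker:
    replicas: int = 1
    image: Optional[str] = None
    requests: Optional[Resources] = None
    limits: Optional[Resources] = None


@dataclass
class PS:
    replicas: int = 0
    image: Optional[str] = None
    requests: Optional[Resources] = None
    limits: Optional[Resources] = None


@dataclass
class TfJob:
    worker: Worker = field(default_factory=Worker)
    ps: PS = field(default_factory=PS)
    # Existing fields such as plain replica counts could be kept alongside for backward compatibility.
    num_workers: Optional[int] = None
    num_ps_replicas: Optional[int] = None


# Would be passed to @task(task_config=...) once a backend plugin understands this shape.
config = TfJob(
    worker=Worker(replicas=2, requests=Resources(cpu="1", mem="2Gi"), limits=Resources(cpu="1", mem="4Gi")),
    ps=PS(replicas=1, requests=Resources(cpu="1", mem="2Gi")),
)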

yubofredwang (Contributor, Author) commented Mar 28, 2023

@hamersaw thanks for the feedback.
I gave it a second thought. Maybe let's not introduce pod_template in this PR and wait until someone else requests it. However, I noticed that @cosmicBboy planned to create issues to support pod_template in Spark, Ray, and Dask in the pod_template issue #3123. I'm not sure what the plan is for pod_template support in those plugins.

I am working on formalizing the other related changes and will create a PR once done.
