
[Core feature] Allow setting separate resource configs for different replica types in Kubeflow Jobs #3308

Closed · yubofredwang opened this issue Feb 3, 2023 · 8 comments · Fixed by flyteorg/flyteidl#386 and flyteorg/flyteplugins#345
Labels: enhancement (New feature or request)

@yubofredwang (Contributor)
Motivation: Why do you think this is important?

It is a common scenario for users to set different resources for different types of replicas in Kubeflow Jobs such as TFJob and MPIJob.

For example, a user might want to run a TFJob with the following resources:

PS:
  replicas: 1
  template:
    spec:
      containers:
        - resources:
            limits:
              cpu: "1"
Worker:
  replicas: 2
  template:
    spec:
      containers:
        - resources:
            limits:
              nvidia.com/gpu: 1
              memory: "10G"

However, this is not currently supported in Flyte.

Goal: What should the final outcome look like, ideally?

Users should be able to override the resources specified in the task definition by providing extra resource configs in the TfJob task config, for example:

@task(
    task_config=TfJob(
        worker=dict(num=2, requests=Resources(cpu="1", mem="2Gi"), limits=Resources(cpu="1", mem="4Gi")),
        ps=dict(num=1, requests=Resources(cpu="1", mem="2Gi"), limits=Resources(cpu="1", mem="4Gi")),
    ),
    cache=True,
    cache_version="1.0",
)
def mnist_tensorflow_job(hyperparameters: Hyperparameters) -> training_outputs:
    ...
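
For illustration, here is a rough sketch (not actual plugin code) of how such per-replica config could be rendered into TFJob replica specs by the backend plugin; the to_replica_spec helper and its structure are assumptions for this example.

from typing import Optional

from flytekit import Resources


def to_replica_spec(replicas: int, requests: Optional[Resources], limits: Optional[Resources]) -> dict:
    # Build the resources stanza of a single replica spec from flytekit Resources objects.
    resources: dict = {}
    if requests is not None:
        resources["requests"] = {"cpu": requests.cpu, "memory": requests.mem}
    if limits is not None:
        resources["limits"] = {"cpu": limits.cpu, "memory": limits.mem}
    # "tensorflow" is the container name the TFJob operator expects by default.
    return {
        "replicas": replicas,
        "template": {"spec": {"containers": [{"name": "tensorflow", "resources": resources}]}},
    }


# The per-replica settings from the proposed task config above would map to something like:
tf_replica_specs = {
    "Worker": to_replica_spec(2, Resources(cpu="1", mem="2Gi"), Resources(cpu="1", mem="4Gi")),
    "PS": to_replica_spec(1, Resources(cpu="1", mem="2Gi"), Resources(cpu="1", mem="4Gi")),
}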

Describe alternatives you've considered

We could make the resources field in the task function accept a new type, TFJobResources, and implement different handling in the related backend plugins. However, this requires lots of code changes and undermines the consistency of task definitions.

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
yubofredwang added the enhancement (New feature or request) and untriaged (This issue has not yet been looked at by the Maintainers) labels on Feb 3, 2023
welcome bot commented Feb 3, 2023

Thank you for opening your first issue here! 🛠

pingsutw (Member) commented Feb 3, 2023

Thank you for reporting this. We also need separate configs for the Spark task (driver, worker) and the Ray task (head node, worker node). We can add those in separate PRs.

hamersaw (Contributor) commented Feb 3, 2023

This could be implemented similarly to the current Dask integration. Essentially, the DaskScheduler and DaskWorkerGroup definitions allow users to pass in resources, and the DaskJob inherits the requests and limits fields from the Flyte @task decorator (these are also used as defaults when specific resources are not passed to DaskScheduler and DaskWorkerGroup).
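
For reference, a minimal sketch of the Dask pattern described above, assuming the flytekitplugins-dask interface (Dask, Scheduler, WorkerGroup); the fallback to the task-level requests/limits is the behavior this comment describes:

from flytekit import Resources, task
from flytekitplugins.dask import Dask, Scheduler, WorkerGroup


@task(
    task_config=Dask(
        scheduler=Scheduler(requests=Resources(cpu="1"), limits=Resources(cpu="1")),
        workers=WorkerGroup(
            number_of_workers=4,
            requests=Resources(cpu="2", mem="8Gi"),
            limits=Resources(cpu="2", mem="16Gi"),
        ),
    ),
    # Used as the default for any Dask component that does not set its own resources.
    requests=Resources(cpu="1", mem="2Gi"),
    limits=Resources(cpu="1", mem="4Gi"),
)
def dask_workflow_task() -> int:
    return 42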

eapolinario removed the untriaged (This issue has not yet been looked at by the Maintainers) label on Feb 3, 2023
@yubofredwang (Contributor, Author)

@hamersaw Since we now support passing a PodTemplate as task metadata (#3123), I think we can follow the same approach for the Kubeflow operators. I have draft changes here. Please let me know if you have any suggestions on the general approach. I will create a PR once we settle on it.
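
For context, a rough sketch of the PodTemplate-as-task-metadata pattern from #3123 that the draft builds on, assuming flytekit's PodTemplate together with the kubernetes Python client models:

from flytekit import PodTemplate, task
from kubernetes.client import V1Container, V1PodSpec, V1ResourceRequirements

# A pod template that only tweaks the primary container's resources; per this issue,
# the Kubeflow plugins could accept something similar per replica type.
template = PodTemplate(
    primary_container_name="primary",
    pod_spec=V1PodSpec(
        containers=[
            V1Container(
                name="primary",
                resources=V1ResourceRequirements(limits={"cpu": "1", "memory": "4Gi"}),
            )
        ],
    ),
)


@task(pod_template=template)
def train() -> None:
    ...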

@hamersaw (Contributor)

@yubofredwang this is awesome, I think it looks great so far! A few thoughts:

  1. I like having the separate resource requests and limits configuration because the main goal of Flyte is to abstract the complexities of k8s from users. The addition of PodTemplate support is mostly for power users who need to configure every aspect of the Pod. But general users would much rather use a simple requests=Resources(cpu=2), for example, to set resources rather than specifying an entire PodTemplate.
  2. Should we support both pod_template and pod_template_name? There was plenty of discussion of this on the issue around the original integration. I think it makes sense, but I'm open to discussion.
  3. This won't be backwards compatible, right? I don't think we can just delete the existing KF plugin definitions, but there is a task version in the metadata, so maybe for v2 of these plugins the configuration could look like this. I'm not sure if we can support both versions in flytekit or if all existing users will need to update with this change. Backwards compatibility is unfortunately necessary and always complicates things.

@yubofredwang (Contributor, Author)

@hamersaw thanks for the feedback!

  1. We can keep the resources field as it is in the current config; if set, it will override the resources specified in the pod template.
  2. The reason I am adding support for templates is that we often have to introduce different command, env, and image values for different worker groups, especially in the case of an MPI Job, where the Launcher's command and the Worker's command are always different. I don't know which is more aligned with Flyte's principles: providing individual fields for users to set, which is more straightforward, or allowing users to set a PodTemplate, which is more flexible.
  3. I can make it backward compatible: the existing fields override the newly added fields (see the sketch after this list).
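
A tiny sketch of the precedence described in points 1 and 3; the helper names are hypothetical and only illustrate the override order, not the plugin implementation:

from typing import Optional, TypeVar

T = TypeVar("T")


def first_set(*values: Optional[T]) -> Optional[T]:
    # Return the first value that is explicitly set.
    for value in values:
        if value is not None:
            return value
    return None


def resolve_replicas(existing_num_workers: Optional[int], new_worker_replicas: Optional[int]) -> int:
    # Backward compatibility: the existing field (e.g. a plain replica count) wins over
    # the newly added per-replica config when both are set; default to 1 otherwise.
    resolved = first_set(existing_num_workers, new_worker_replicas)
    return resolved if resolved is not None else 1


def resolve_resources(
    replica_resources: Optional[T],
    pod_template_resources: Optional[T],
    task_level_resources: Optional[T],
) -> Optional[T]:
    # Per-replica resources override the pod template, which overrides the
    # task-level requests/limits defaults.
    return first_set(replica_resources, pod_template_resources, task_level_resources)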

@hamersaw (Contributor)

I think this is ready for a PR, and we can continue to iterate there. Does that sound good? Quick replies:

1 / 2. Let's put the resource requests/limits and image as their own configuration values; this then aligns with the @task decorator in Flyte (a rough sketch follows after this list). We can then use pod_template or pod_template_name to set command and env if necessary. I admit I'm not fully sure how this works because the KF operators set some of these things automatically, so we just need to make sure our settings are carried over.
3. This is perfect! I think the existing fields are just the numbers of the different replica types, right? It should be relatively easy and not too messy to keep these around.
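
As a concrete illustration of that direction, here is a sketch of a per-replica configuration with its own replicas, image, and requests/limits; Worker, PS, and TfJob below are hypothetical dataclasses and may differ from what the plugin eventually ships:

from dataclasses import dataclass, field
from typing import Optional

from flytekit import Resources


@dataclass
class Worker:
    replicas: int = 1
    image: Optional[str] = None
    requests: Optional[Resources] = None
    limits: Optional[Resources] = None


@dataclass
class PS:
    replicas: int = 0
    image: Optional[str] = None
    requests: Optional[Resources] = None
    limits: Optional[Resources] = None


@dataclass
class TfJob:
    worker: Worker = field(default_factory=Worker)
    ps: PS = field(default_factory=PS)
    # Existing fields such as plain replica counts could be kept alongside for backward compatibility.
    num_workers: Optional[int] = None
    num_ps_replicas: Optional[int] = None


# Would be passed to @task(task_config=...) once a backend plugin understands this shape.
config = TfJob(
    worker=Worker(replicas=2, requests=Resources(cpu="1", mem="2Gi"), limits=Resources(cpu="1", mem="4Gi")),
    ps=PS(replicas=1, requests=Resources(cpu="1", mem="2Gi")),
)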

yubofredwang (Contributor, Author) commented Mar 28, 2023

@hamersaw thanks for the feedback.
I gave it a second thought. Maybe let's not introduce pod_template in this PR and wait until someone else requests it. However, I noticed that @cosmicBboy planned to create issues to support pod_template in Spark, Ray, and Dask in the pod_template issue #3123. I'm not sure what the plan is for pod_template support in those plugins.

I am working on formalizing the other related changes and will create a PR once done.
