Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2170: Make API specification more restricting #2198

Merged
merged 6 commits into from
Aug 9, 2024

Conversation

tenzen-y
Copy link
Member

@tenzen-y tenzen-y commented Aug 7, 2024

What this PR does / why we need it:
I updated the below API definitions for the Training v2 API:

  1. trainingRuntimeRef
    1. I removed the namespace field since we should use the TrainingRuntime deployed in the same namespace with the TrainJob. Otherwise, users should define the ClusterTrainingRuntime instead of the namespace-scoped TrainingRuntime resource.
    2. I renamed the version field with apiVersion since we are sometimes confused if the version indicates the runtime revision or API version.
  2. managedBy
    1. I added limitations and restrictions based on discussions in Add support for the managedBy field #2193. The managedBy field is acceptable only for either an empty value, "training-operator.kubeflow.org/trainjob-controller" or "kueue.x-k8s.io/multikueue".

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Related: #2170

Checklist:

  • Docs included if any changes are user facing

@tenzen-y
Copy link
Member Author

tenzen-y commented Aug 7, 2024

/assign @kubeflow/wg-training-leads

@coveralls
Copy link

coveralls commented Aug 7, 2024

Pull Request Test Coverage Report for Build 10307520952

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.008%) to 33.494%

Totals Coverage Status
Change from base Build 10284610088: 0.008%
Covered Lines: 3951
Relevant Lines: 11796

💛 - Coveralls

@tenzen-y
Copy link
Member Author

tenzen-y commented Aug 7, 2024

cc: @mimowo

// Namespace for the runtime. In that case, user should use TrainingRuntime
Namespace *string `json:"namespace,omitempty"`
// APIVersion is the apiVersion for the runtime.
// Defaults to the v2alpha1.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Defaults to the v2alpha1.
// Defaults to the kubeflow.org/v2alpha1.

I'm wondering if we should align with the YAML apiVersion like this.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

APIVersion would indeed look more consistent with how objects are declared in yamls example. OTOH consider if you need a version at all - will you support other versions than the current? Maybe just APIGroup is enough.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y Do we have plans to support different runtime version APIs in TrainJob ? e.g. v2alpha1 and v2beta1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming that we need to support multiple versions of TrainingRuntimes when we graduate it from Beta to Stable so that we align with the Kubernetes deprecation policy.

But, I don't think that we need to support multiple versions when we graduate the API from alpha to beta.
So, I'm wondering if we should drop this version field in the alpha and beta stages, and then if we should revisit this discussion to mitigate the GA graduation risks.

@andreyvelich @mimowo WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, let's do it in the future versions.
What are the benefits you see to add the APIGroup in the runtime reference if that will be always kubeflow.org ?
To give users possibility to build custom APIs for the training runtimes ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the benefits you see to add the APIGroup in the runtime reference if that will be always kubeflow.org ?
To give users possibility to build custom APIs for the training runtimes ?

Yes, once we provide the possibility to specify the apiGroup, vendors and users can implement the CRD instead of TrainingRuntime/ClusterTrainingRuntime and custom controller to build Jobs.

So, I'm thinking that adding the apiGroup in the first iteration would be worth it. @andreyvelich WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich Oh, I actually still have not added the apiGroup field in the KEP.
So, based on their reactions, I opened the PR to add the apiGroup in the API design.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, you are right. Let's add it in the followup PR.

Copy link

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update!

// Namespace for the runtime. In that case, user should use TrainingRuntime
Namespace *string `json:"namespace,omitempty"`
// APIVersion is the apiVersion for the runtime.
// Defaults to the v2alpha1.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

APIVersion would indeed look more consistent with how objects are declared in yamls example. OTOH consider if you need a version at all - will you support other versions than the current? Maybe just APIGroup is enough.

// 'kueue.x-k8s.io/multikueue'.
// The built-in TrainJob controller reconciles TrainJob which don't have this
// field at all or the field value is the reserved string
// 'training-operator.kubeflow.org/trainjob-controller', but delegates reconciling TrainJobs
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

training-operator.kubeflow.org/trainjob-controller is a bit long. What about the alternative: kubeflow.org/training-operator-v2?

So, to match the naming of v1 we tentatively go with for v1: kubeflow.org/training-operator (#2193)?

Copy link
Member

@andreyvelich andreyvelich Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should just do kubeflow.org/training-operator or kubeflow.org/trainjob-controller for V2?
Since we are planning to support only managedBy: kueue.x-k8s.io/multikueue in V1 version: #2193 (comment)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we could probably use kubeflow.org/training-operator both for v1 and v2. It is non-ambigues since API v1 can only be reconciled by the controller in the same version.

Copy link
Member Author

@tenzen-y tenzen-y Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, I'm ok with either name kubeflow.org/training-operator or training-operator.kubeflow.org/trainjob-controller because I just wanted to align with the managedBy specifications in the batch/job and jobset.

In the batch/job and jobset, we use the ${OPERATOR_NAME}.${API_GROUP}/${CONTROLLER_NAME} like jobset.sigs.k8s.io/jobset-controller and kubernetes.io/job-controller.

@mimowo Do you know the reason why we selected the above managedBy format in the batch/job and jobset?

Copy link
Member Author

@tenzen-y tenzen-y Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I found the jobset and batch/job actually use the ${API_GROUP}/${CONTROLLER_NAME}.
So, never mind the above my comment.

Copy link
Member Author

@tenzen-y tenzen-y Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, let's use the kubeflow.org/training-operator as a reserved name so that we can align with batch/job and jobset.

EDIT: Actually, we reconcile the TrainJob by trainjob-controller instead of the training-opertor, and then the training-operator v2 contains multiple controllers like trainjob-controller and trainingruntime-controller.
So, I updated this section with kubeflow.org/trainjob-controller. If you have any suggestions, please let me know.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe we should use the trainjob-controller to make it consistent with the CRD name, JobSet is doing the same: https://github.com/kubernetes-sigs/jobset/blob/main/api/jobset/v1alpha2/jobset_types.go#L46

Name string `json:"name"`

// Namespace for the runtime. In that case, user should use TrainingRuntime
Namespace *string `json:"namespace,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with that!

// Namespace for the runtime. In that case, user should use TrainingRuntime
Namespace *string `json:"namespace,omitempty"`
// APIVersion is the apiVersion for the runtime.
// Defaults to the v2alpha1.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y Do we have plans to support different runtime version APIs in TrainJob ? e.g. v2alpha1 and v2beta1


// Kind for the runtime. TrainingRuntime or ClusterTrainingRuntime
// Kind for the runtime, which must be TrainingRuntime or ClusterTrainingRuntime.
// Defaults to the TrainingRuntime.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make Defaults to ClusterTrainingRuntime since in OSS we will implement them ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree with you.
I try to refine this comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Kind *string `json:"kind,omitempty"`

// Version for the runtime. For example: v2alpha1
Version *string `json:"version,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I still think that the runtime revision control would be useful.

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @tenzen-y!
/lgtm
/approve

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 94140ed into kubeflow:master Aug 9, 2024
38 checks passed
@tenzen-y tenzen-y deleted the update-kep-2170 branch August 9, 2024 12:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants