KEP: 2170: Adding cel validations on TrainingRuntime/ClusterTrainingRuntime CRDs #2313

Open
wants to merge 1 commit into
base: master

akshaychitneni
Contributor

What this PR does / why we need it:

Add CEL validations to the runtime CRDs for the v2 APIs.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls

coveralls commented Oct 28, 2024

Pull Request Test Coverage Report for Build 11560677304

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 11542660312: 0.0%
Covered Lines: 77
Relevant Lines: 77

💛 - Coveralls

@andreyvelich
Member

Fixes: #2219

Member

@andreyvelich andreyvelich left a comment

Thanks for this, @akshaychitneni!
I left a few comments.

@@ -173,6 +176,8 @@ type TorchMLPolicySource struct {
// Supported values: `auto`, `cpu`, `gpu`, or int value.
// TODO (andreyvelich): Add kubebuilder validation.
// Defaults to `auto`.
// +kubebuilder:default="auto"
// +kubebuilder:validation:XValidation:rule="self in ['auto', 'cpu', 'gpu'] || type(self) == int", message="NumProcPerNode must be auto,cpu,gpu strings or int value"
Member

What do you think about this message, which is similar to the torch distributed message: https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py#L438?

Suggested change
- // +kubebuilder:validation:XValidation:rule="self in ['auto', 'cpu', 'gpu'] || type(self) == int", message="NumProcPerNode must be auto,cpu,gpu strings or int value"
+ // +kubebuilder:validation:XValidation:rule="self in ['auto', 'cpu', 'gpu'] || type(self) == int", message="NumProcPerNode must be equal to auto, cpu, gpu, or int value"
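
For reference, the intent of the rule above (accept `auto`, `cpu`, `gpu`, or an int value) can be sketched in plain Go. Since the field is declared as `*string`, this sketch assumes an integer arrives as a numeric string such as `"4"` and parses it rather than checking its type. `validNumProcPerNode` is a hypothetical helper name used only for illustration; it is not part of this PR.

```go
package main

import (
	"fmt"
	"strconv"
)

// validNumProcPerNode is a hypothetical helper, not code from this PR.
// It reports whether s is one of the keywords auto, cpu, gpu,
// or a string that parses as a base-10 integer.
func validNumProcPerNode(s string) bool {
	switch s {
	case "auto", "cpu", "gpu":
		return true
	}
	// Assumption: integer values are written as numeric strings.
	_, err := strconv.Atoi(s)
	return err == nil
}

func main() {
	for _, v := range []string{"auto", "gpu", "4", "tpu", "4.5"} {
		fmt.Printf("%q valid: %t\n", v, validNumProcPerNode(v))
	}
}
```

A check like this only mirrors the stated set of supported values; whether the CEL rule should parse the string or the field type should change is the open question in this thread.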

@@ -173,6 +176,8 @@ type TorchMLPolicySource struct {
// Supported values: `auto`, `cpu`, `gpu`, or int value.
// TODO (andreyvelich): Add kubebuilder validation.
// Defaults to `auto`.
// +kubebuilder:default="auto"
// +kubebuilder:validation:XValidation:rule="self in ['auto', 'cpu', 'gpu'] || type(self) == int", message="NumProcPerNode must be auto,cpu,gpu strings or int value"
NumProcPerNode *string `json:"numProcPerNode,omitempty"`
Member

@tenzen-y @akshaychitneni Should we use intstr for NumProcPerNode here?
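
To make the int-or-string idea concrete, here is a minimal standard-library sketch of a value that accepts either a JSON integer or a JSON string. The type `numProcValue` and the helper `parseNumProc` are hypothetical names for illustration only; they stand in for an int-or-string type such as the one in the Kubernetes `intstr` package and are not its actual implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// numProcValue is a hypothetical stand-in for an int-or-string type.
// It holds either an integer or a string, never both.
type numProcValue struct {
	isInt  bool
	intVal int
	strVal string
}

// UnmarshalJSON accepts a JSON integer (4) or a JSON string ("auto").
func (v *numProcValue) UnmarshalJSON(data []byte) error {
	var n int
	if err := json.Unmarshal(data, &n); err == nil {
		*v = numProcValue{isInt: true, intVal: n}
		return nil
	}
	var s string
	if err := json.Unmarshal(data, &s); err != nil {
		return fmt.Errorf("numProcPerNode must be an integer or a string: %w", err)
	}
	*v = numProcValue{strVal: s}
	return nil
}

// valid reports whether the value is an int or one of auto, cpu, gpu.
func (v numProcValue) valid() bool {
	if v.isInt {
		return true
	}
	switch v.strVal {
	case "auto", "cpu", "gpu":
		return true
	}
	return false
}

// parseNumProc decodes a raw JSON value into a numProcValue.
func parseNumProc(raw string) (numProcValue, error) {
	var v numProcValue
	err := json.Unmarshal([]byte(raw), &v)
	return v, err
}

func main() {
	for _, raw := range []string{`"auto"`, `8`, `"tpu"`, `4.5`} {
		v, err := parseNumProc(raw)
		fmt.Printf("%s -> valid=%t err=%v\n", raw, err == nil && v.valid(), err)
	}
}
```

With a typed value like this, the integer case is distinguished at decode time, so the validation only has to restrict the string case to the three keywords.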

@@ -209,13 +214,15 @@ type MPIMLPolicySource struct {

// Implementation name for the MPI to create the appropriate hostfile.
// Defaults to OpenMPI.
// +kubebuilder:default="OpenMPI"
MPIImplementation *MPIImplementation `json:"mpiImplementation,omitempty"`

// Directory where SSH keys are mounted.
SSHAuthMountPath *string `json:"SSHAuthMountPath,omitempty"`
Member

gomega.Expect(k8sClient.Create(ctx, created)).Should(gomega.Succeed())
gomega.Expect(created).Should(gomega.BeComparableTo(wantTrainingRuntime(), util.IgnoreObjectMetadata))
},
ginkgo.Entry("Should succeed to default TorchMLPolicySource.NumProcPerNode=auto",
Member

Could you please add a test case for the MPI default values as well?
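
As a rough illustration of what such a case would check, the sketch below models the one MPI default visible in the diff above (`mpiImplementation` defaults to `OpenMPI`) with a table of inputs and expected results. It uses only the standard library rather than the Ginkgo/envtest setup used in this PR. The `mpiPolicy` type, the `applyMPIDefaults` function, and the `CustomMPI` value are hypothetical placeholders; in the real test the API server applies the default from the `+kubebuilder:default` marker.

```go
package main

import "fmt"

// mpiPolicy is a hypothetical, trimmed-down stand-in for MPIMLPolicySource.
type mpiPolicy struct {
	MPIImplementation *string
}

// applyMPIDefaults mimics the declared default: an unset
// MPIImplementation becomes "OpenMPI"; a set value is left alone.
func applyMPIDefaults(p *mpiPolicy) {
	if p.MPIImplementation == nil {
		impl := "OpenMPI"
		p.MPIImplementation = &impl
	}
}

func main() {
	custom := "CustomMPI" // placeholder value, not a known implementation name
	cases := []struct {
		name string
		in   mpiPolicy
		want string
	}{
		{name: "unset implementation defaults to OpenMPI", in: mpiPolicy{}, want: "OpenMPI"},
		{name: "explicit implementation is preserved", in: mpiPolicy{MPIImplementation: &custom}, want: "CustomMPI"},
	}
	for _, tc := range cases {
		got := tc.in
		applyMPIDefaults(&got)
		ok := got.MPIImplementation != nil && *got.MPIImplementation == tc.want
		fmt.Printf("%s: pass=%t\n", tc.name, ok)
	}
}
```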

@@ -0,0 +1,128 @@
/*
Member

@andreyvelich andreyvelich Jan 2, 2025

@tenzen-y @akshaychitneni What do you think about moving the CEL validation and defaults integration tests to the webhook.v2 directory?

I noticed that we do it for:

I think it makes more sense, given that we currently don't have a controller for TrainingRuntime and ClusterTrainingRuntime. Additionally, CEL is part of the Kubernetes admission control process.

3 participants