KEP-2170: Implement JobSet, PlainML, and Torch Plugins #2308

Merged: 21 commits merged into kubeflow:master from issue-2290-torch-plugin on Oct 31, 2024

andreyvelich
Member

@andreyvelich andreyvelich commented Oct 24, 2024

Part of: #2290.

I implemented JobSet and PlainML plugins, and I was able to submit our first TrainJob and ClusterTrainingRuntime 🎉

Working ClusterTrainingRuntime
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
  name: first-runtime
spec:
  mlPolicy:
    numNodes: 5
  template:
    spec:
      replicatedJobs:
        - name: trainer-node
          template:
            spec:
              template:
                spec:
                  containers:
                    - name: trainer
                      image: python:alpine3.20
                      command:
                        - "/bin/sh"
                        - -c
                        - "echo 'Hello From Training Runtime'"
---
apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
  name: first-job
spec:
  runtimeRef:
    name: first-runtime
    kind: ClusterTrainingRuntime
    apiGroup: kubeflow.org
  trainer:
    numNodes: 2

A few details:

  • We discussed with @tenzen-y that the Info object doesn't need to have the Obj field, since we can pass the JobTemplateSpec directly into the RunComponentBuilderPlugins() API.
  • I added the Trainer struct to the Info object; it represents the values combined from the MLPolicy and the TrainJob. Those values will be inserted into the JobSet.
  • I need to update the unit tests in this PR.
  • I need to implement the Torch plugin.
  • I need to implement enforcement of environment variables.
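
As a rough illustration of how values from the MLPolicy and the TrainJob might be combined into one Trainer value, here is a minimal Go sketch. The types MLPolicy and TrainerSpec and the helpers deref, ptrTo, and effectiveNumNodes are hypothetical stand-ins, not the PR's code. The sketch assumes that a value set on the TrainJob takes precedence over the runtime's value, and that 1 is used when neither is set.

```go
package main

import "fmt"

// MLPolicy and TrainerSpec are hypothetical, trimmed-down stand-ins for the
// runtime's mlPolicy section and the TrainJob's trainer section.
type MLPolicy struct {
	NumNodes *int32
}

type TrainerSpec struct {
	NumNodes *int32
}

// deref returns *p, or def when p is nil. It mirrors the idea behind
// ptr.Deref in k8s.io/utils/ptr without importing it.
func deref[T any](p *T, def T) T {
	if p == nil {
		return def
	}
	return *p
}

// ptrTo is a small helper that returns a pointer to v.
func ptrTo[T any](v T) *T { return &v }

// effectiveNumNodes assumes the TrainJob value overrides the runtime value,
// and that 1 is the fallback when neither is set.
func effectiveNumNodes(policy MLPolicy, trainer *TrainerSpec) int32 {
	n := deref(policy.NumNodes, 1)
	if trainer != nil {
		n = deref(trainer.NumNodes, n)
	}
	return n
}

func main() {
	runtimePolicy := MLPolicy{NumNodes: ptrTo[int32](5)}
	trainJob := &TrainerSpec{NumNodes: ptrTo[int32](2)}

	fmt.Println(effectiveNumNodes(runtimePolicy, trainJob)) // 2
	fmt.Println(effectiveNumNodes(runtimePolicy, nil))      // 5
}
```

With the manifests above (numNodes: 5 in the runtime, numNodes: 2 in the TrainJob), this sketch would yield 2 nodes under the stated assumption.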

/assign @kubeflow/wg-training-leads @varshaprasad96 @akshaychitneni @deepanker13 @helenxie-bit @Electronic-Waste @saileshd1402 @kannon92 @kuizhiqing @shravan-achar

@andreyvelich
Member Author

/hold for review

@andreyvelich andreyvelich changed the title KEP-2170: Implement JobSet and PlainML Plugins KEP-2170: Implement JobSet, PlainML, and Torch Plugins Oct 24, 2024
@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Oct 25, 2024
@coveralls

coveralls commented Oct 25, 2024

Pull Request Test Coverage Report for Build 11617265788

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 11542660312: 0.0%
Covered Lines: 77
Relevant Lines: 77

💛 - Coveralls

@andreyvelich
Member Author

Torch Runtime is working 🎉

Simple Torch runtime
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-runtime
spec:
  mlPolicy:
    numNodes: 2
    torch:
      numProcPerNode: "1"
  template:
    spec:
      replicatedJobs:
        - name: trainer-node
          template:
            spec:
              template:
                spec:
                  containers:
                    - name: trainer
                      image: docker.io/andreyvelichkevich/test-torch
---
apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
  name: first-job
spec:
  runtimeRef:
    name: torch-runtime
    kind: ClusterTrainingRuntime
    apiGroup: kubeflow.org
  trainer:
    numNodes: 3
    numProcPerNode: "5"
    env:
      - name: TRAIN_JOB
        value: env
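
To make the effect of the Torch settings more concrete, the following Go sketch shows one way the resolved values could be turned into container environment variables. It is an assumption-laden illustration, not the plugin's code: the EnvVar type and torchEnv function are hypothetical, the PET_NNODES and PET_NPROC_PER_NODE names rely on torchrun reading PET_-prefixed variables, and appending the user-provided variables after the Torch ones is an arbitrary choice.

```go
package main

import (
	"fmt"
	"strconv"
)

// EnvVar is a hypothetical stand-in for corev1.EnvVar.
type EnvVar struct {
	Name  string
	Value string
}

// torchEnv builds the environment for a trainer container.
// The PET_* names are an assumption about what the plugin would set.
func torchEnv(numNodes int32, numProcPerNode string, userEnv []EnvVar) []EnvVar {
	env := []EnvVar{
		{Name: "PET_NNODES", Value: strconv.Itoa(int(numNodes))},
		{Name: "PET_NPROC_PER_NODE", Value: numProcPerNode},
	}
	// User-provided variables from the TrainJob are appended unchanged.
	return append(env, userEnv...)
}

func main() {
	// Values taken from the TrainJob above: numNodes: 3, numProcPerNode: "5".
	env := torchEnv(3, "5", []EnvVar{{Name: "TRAIN_JOB", Value: "env"}})
	for _, e := range env {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```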

Member

@tenzen-y tenzen-y left a comment

Sorry for the delayed review.
I have not reviewed any of the tests yet.

pkg/constants/constants.go (outdated, resolved)
pkg/constants/constants.go (resolved)
pkg/runtime.v2/framework/core/framework.go (outdated, resolved)
pkg/runtime.v2/framework/plugins/torch/torch.go (outdated, resolved)
pkg/runtime.v2/framework/plugins/torch/torch.go (outdated, resolved)
pkg/runtime.v2/framework/interface.go (outdated, resolved)
TotalRequests map[string]TotalResourceRequest
}

type Trainer struct {
Member

Do we still need the Policy?

Member Author

Actually, I am still getting some data from Policy in the EnforceMLPolicy:

Do you think we should just remove the Policy API from the Info object and pass the TrainingRuntime directly to the enforcement APIs?

Member Author

I refactored the Info object to remove Policies:

type Info struct {
    Labels      map[string]string
    Annotations map[string]string
    Trainer
    *Scheduler
}

Right now, we pass the MLPolicy and PodGroupPolicy specs from the Runtime directly to the Enforce() APIs.

Member

I think the Trainer needs to have the scheduler, since we need to create the PodGroup and apply other scheduling logic.
If we create the PodGroup against the whole TrainJob, it will never be scheduled to any Node, because we need to start the initializer before the Trainer.

Member Author

That makes sense.
Does it mean that we should apply the coscheduling PodLabel only to the Trainer Job, not to the Initializer Job?

Member

Does it mean that we should apply the coscheduling PodLabel only to the Trainer Job, not to the Initializer Job ?

Yes. When users want to schedule the initializers separately from the trainer, they will want to specify a different PodGroup name for the initializers.
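
The idea of labeling only the Trainer Job's Pods can be sketched in Go as follows. This is an illustrative sketch under assumptions: ReplicatedJob is a hypothetical simplified type, "trainer-node" is assumed to be the value of constants.JobTrainerNode, "initializer" is a made-up job name, and the label key is assumed to match schedulerpluginsv1alpha1.PodGroupLabel.

```go
package main

import "fmt"

const (
	// Assumed value of schedulerpluginsv1alpha1.PodGroupLabel.
	podGroupLabel = "scheduling.x-k8s.io/pod-group"
	// Assumed value of constants.JobTrainerNode.
	jobTrainerNode = "trainer-node"
)

// ReplicatedJob is a hypothetical, simplified view of a JobSet replicated job.
type ReplicatedJob struct {
	Name      string
	PodLabels map[string]string
}

// labelTrainerOnly adds the PodGroup label to the trainer job's Pods and
// leaves every other job (for example an initializer) untouched, so that
// initializer Pods are not gated on the trainer's PodGroup.
func labelTrainerOnly(jobs []ReplicatedJob, trainJobName string) {
	for i := range jobs {
		if jobs[i].Name != jobTrainerNode {
			continue
		}
		if jobs[i].PodLabels == nil {
			jobs[i].PodLabels = make(map[string]string, 1)
		}
		jobs[i].PodLabels[podGroupLabel] = trainJobName
	}
}

func main() {
	jobs := []ReplicatedJob{
		{Name: "initializer"}, // hypothetical initializer job name
		{Name: jobTrainerNode},
	}
	labelTrainerOnly(jobs, "first-job")
	for _, j := range jobs {
		fmt.Printf("%s: %v\n", j.Name, j.PodLabels)
	}
}
```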

Member Author

I see. Do you want to make this change in a follow-up PR once I add the builders for Dataset and Model?

Member

I see, do you want to update this change in the followup PR once I add the builder for Dataset and Model ?

This can be a follow-up.
I will take it while you implement the Dataset initializer, since this is clearly a regression, as I mentioned above.
With the current implementation, the whole Job will be stuck forever.

pkg/util.v2/testing/wrapper.go (outdated, resolved)
Signed-off-by: Andrey Velichkevich <[email protected]>
Update error return
Use ptr.Deref() for nil values

Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich andreyvelich force-pushed the issue-2290-torch-plugin branch from 545a6c0 to cf694c5 Compare October 29, 2024 15:50
Signed-off-by: Andrey Velichkevich <[email protected]>
  MinResources(corev1.ResourceList{
- corev1.ResourceCPU: resource.MustParse("60"),
+ corev1.ResourceCPU: resource.MustParse("31"), // Every replica has 1 CPU = 31 CPUs in total.
Contributor

I think we should have a utility function that calculates the quantity from the object passed in.

Member Author

We already have it here to calculate PodRequests based on the PodSpec: https://github.com/kubeflow/training-operator/blob/master/pkg/runtime.v2/runtime.go#L121-L124.

Do you mean having another utility just for the unit tests?

Contributor

Yes, a utility to compute the value in tests.

Member

I am not sure why @akshaychitneni is requesting this utility.

Member Author

@akshaychitneni Could you please give an example of the code snippet that you would like to use instead of:

corev1.ResourceCPU: resource.MustParse("31")

Contributor

I was referring to calculating 31. It is probably simpler, as in this case it is (numNodes + 1) * (number of CPUs configured in the runtime).

Member Author

@andreyvelich andreyvelich Oct 31, 2024

I see, I think it depends on the unit test. For example, we have unit tests where we add CPU + Memory for the Pod requests:

corev1.ResourceCPU: resource.MustParse("1"),
corev1.ResourceMemory: resource.MustParse("4Gi"),

Additionally, in the future we might want to define different resources for initializer containers and trainer to verify that we correctly calculate MinResource for the PodGroup.

Let me add the TODO statement to see if we can create helper function for our tests to calculate these values based on number of replicas and resources per replica.
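
For reference, the arithmetic behind the value 31 can be written as a tiny Go helper. This is only a sketch of the formula mentioned in the thread, (numNodes + 1) * CPUs per replica. The function name totalMilliCPU is hypothetical, numNodes = 30 is inferred from the comment "Every replica has 1 CPU = 31 CPUs in total", and plain integers in milli-CPU are used instead of resource.Quantity to keep the example dependency-free.

```go
package main

import "fmt"

// totalMilliCPU returns the total CPU request in milli-CPU for the given
// number of replicas, assuming every replica requests the same amount.
func totalMilliCPU(replicas, milliCPUPerReplica int64) int64 {
	return replicas * milliCPUPerReplica
}

func main() {
	// Assumed test setup: 30 trainer nodes plus 1 extra replica,
	// each requesting 1 CPU (1000 milli-CPU).
	numNodes := int64(30)
	replicas := numNodes + 1

	total := totalMilliCPU(replicas, 1000)
	fmt.Println(total / 1000) // 31
}
```

Such a helper covers only the uniform case. It does not account for different requests per job, init containers, or defaults injected by the cluster.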

Member

Actually, computing container resources is not easy: we need to consider the runtimeClass (the core API RuntimeClass, not the training runtime), LimitRange, initContainers, and so on. So, I would like to keep creating the resources manually in the tests.

If we provide such utilities with complex computation, the library itself could have bugs, and we would miss those bugs in the main test cases.

Member

@tenzen-y tenzen-y left a comment

I have submitted the review comments for the production code first.

pkg/constants/constants.go (resolved)
hack/python-sdk-v2/openapi-generator-cli.jar (outdated, resolved)
pkg/constants/constants.go (outdated, resolved)
pkg/runtime.v2/core/trainingruntime.go (resolved)

info := runtime.NewInfo(opts...)

if err := r.framework.RunEnforceMLPolicyPlugins(info, trainJob, mlPolicy); err != nil {
Member

We do not want to surface the mlPolicy to the JobPipeline framework directly, since the framework is not only for the TrainingRuntime.
So, could you move the mlPolicy propagation into the runtime initialization, as we discussed before?

- func (c *CoScheduling) Build(ctx context.Context, info *runtime.Info, trainJob *kubeflowv2.TrainJob) (client.Object, error) {
- if info == nil || info.PodGroupPolicy == nil || info.PodGroupPolicy.Coscheduling == nil || trainJob == nil {
+ func (c *CoScheduling) Build(ctx context.Context, runtimeJobTemplate client.Object, info *runtime.Info, trainJob *kubeflowv2.TrainJob) (client.Object, error) {
+ if info == nil || info.Scheduler == nil || info.Scheduler.ScheduleTimeoutSeconds == nil || trainJob == nil {
Member

Suggested change
- if info == nil || info.Scheduler == nil || info.Scheduler.ScheduleTimeoutSeconds == nil || trainJob == nil {
+ if info == nil || info.Scheduler == nil || trainJob == nil {

As we discussed before, they need to specify empty coscheduling parameters when they do not want to modify the scheduleTimeoutSeconds, right?

Member Author

@andreyvelich andreyvelich Oct 30, 2024

Do we really want to allow this? Otherwise, the runtime might look as follows:

spec:
  mlPolicy:
    numNodes: 10
    torch:
  podGroupPolicy:
    coscheduling:

Is it a valid object from the API perspective?

Member

As we discussed before, they can specify the empty object like this:

spec:
  mlPolicy:
    numNodes: 10
    torch:
  podGroupPolicy:
    coscheduling: {}

Member Author

I reverted it back to your original condition check:

if info == nil || info.RuntimePolicy.PodGroupPolicy == nil || trainJob == nil {

PodRequests: info.TotalRequests[rName].PodRequests,
// For other Jobs like the Initializer, replica is always equal to 1.
// TODO (andreyvelich): Add support for total requests from the TrainJob's ResourcesPerNode.
if rName == constants.JobTrainerNode {
Member

I meant: what is the reason why we do not store the totalRequests for the initializers?

Member

@tenzen-y tenzen-y left a comment

Comments on the unit tests.

pkg/runtime.v2/core/trainingruntime_test.go (resolved)
pkg/runtime.v2/core/trainingruntime_test.go (outdated, resolved)
pkg/runtime.v2/core/trainingruntime_test.go (outdated, resolved)
Member

@tenzen-y tenzen-y left a comment

Comments on the integration tests.

- g.Expect(rJob.Template.Spec.Template.Spec.Containers[0].Image).Should(gomega.Equal(originImageName))
+ if rJob.Name == constants.JobTrainerNode {
+ g.Expect(rJob.Template.Spec.Template.Spec.Containers[0].Image).Should(gomega.Equal(originImageName))
+ }
}
}, util.Timeout, util.Interval).Should(gomega.Succeed())
})
Member

Could you add a case that verifies we can build and deploy a Torch JobSet, like ginkgo.It("Should succeed to create TrainJob with Torch TrainingRuntime", func() {

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
var objs []client.Object
for _, plugin := range f.componentBuilderPlugins {
- obj, err := plugin.Build(ctx, info, trainJob)
+ obj, err := plugin.Build(ctx, runtimeJobTemplate, info, trainJob)
Contributor

The ctx context is not used anywhere in the code. Is it necessary to pass it?

Member Author

Comment on lines 28 to 38
type Info struct {
Obj client.Object
// Labels and Annotations to add to the RuntimeJobTemplate.
Labels map[string]string
PodLabels map[string]string
Annotations map[string]string
Policy
TotalRequests map[string]TotalResourceRequest
// Original policy values from the runtime.
RuntimePolicy RuntimePolicy
// Trainer parameters to add to the RuntimeJobTemplate.
Trainer
// Scheduler parameters to add to the RuntimeJobTemplate.
*Scheduler
}
Member Author

I am keeping the original runtime policy values in the Info object for now, since we can't pass the specific policies to the EnforceMLPolicy() and EnforcePodGroupPolicy() APIs. This is similar to what we do today:

In the future, we will discuss how we can optimize the Info object structure.

cc @tenzen-y

Member

Yeah, that is at least a requirement. Let's revisit this later. I can take that next week.

}
info.PodLabels[schedulerpluginsv1alpha1.PodGroupLabel] = trainJob.Name

info.Scheduler.PodLabels = make(map[string]string, 1)
Member

What is the reason that we forcefully override the Pod labels?

Member Author

We only use PodLabels in the Scheduler right now, which is why we are just creating a new map here.
In the future, if we need to merge Pod labels for the Info object, we should revisit it.

Member

Basically, this is a pluggable mechanism, so we should consider other plugins.
So, as in the previous initialization implementation, we should initialize the map only when the Pod labels are empty.

Member Author

Alright, let me change it.
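
The change agreed on here, creating the map only when it does not exist yet, can be illustrated with a short Go sketch. The helper setPodLabel and the label keys are hypothetical and only demonstrate why unconditionally calling make would drop labels set by another plugin.

```go
package main

import "fmt"

// setPodLabel adds key=value to labels, allocating the map only when it is
// nil so that entries added earlier by other plugins are preserved.
func setPodLabel(labels map[string]string, key, value string) map[string]string {
	if labels == nil {
		labels = make(map[string]string, 1)
	}
	labels[key] = value
	return labels
}

func main() {
	// Labels set earlier by some other (hypothetical) plugin.
	existing := map[string]string{"example.com/other-plugin": "enabled"}

	merged := setPodLabel(existing, "example.com/pod-group", "first-job")
	fmt.Println(len(merged)) // 2: the earlier label is kept

	fresh := setPodLabel(nil, "example.com/pod-group", "first-job")
	fmt.Println(len(fresh)) // 1
}
```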

Signed-off-by: Andrey Velichkevich <[email protected]>
Member

@tenzen-y tenzen-y left a comment

Thank you for this effort!
Let's open some follow-up issues!
/lgtm
/apprve

@tenzen-y
Member

/approve

@tenzen-y
Member

/hold cancel

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y


@google-oss-prow google-oss-prow bot merged commit 7c5ea70 into kubeflow:master Oct 31, 2024
43 checks passed
@andreyvelich andreyvelich deleted the issue-2290-torch-plugin branch October 31, 2024 19:03
saileshd1402 pushed a commit to saileshd1402/training-operator that referenced this pull request Dec 2, 2024
* KEP-2170: Implement JobSet and PlainML Plugins

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix nil pointer exception for Trainer

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix unit tests in runtime package

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix unit tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix integration tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix lint

Signed-off-by: Andrey Velichkevich <[email protected]>

* Implement Torch Plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use list for the Info envs

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix golang ci

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix Torch plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use K8s sets
Update error return
Use ptr.Deref() for nil values

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use client.Object for Build() call

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove DeepCopy

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove MLPolicy and PodGroupPolicy from the Info object

Signed-off-by: Andrey Velichkevich <[email protected]>

* Inline error

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove SDK jar file

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add integration test for Torch plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add TODO to calculate PodGroup values in unit tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Revert the change to add original Runtime Policies to Info

Signed-off-by: Andrey Velichkevich <[email protected]>

* Create const for the DefaultJobReplicas

Signed-off-by: Andrey Velichkevich <[email protected]>

* Check if PodLabels is empty

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
google-oss-prow bot pushed a commit that referenced this pull request Dec 9, 2024
* Added test for create-pytorchjob.ipynb

Signed-off-by: sailesh duddupudi <[email protected]>

* fix yaml syntax

Signed-off-by: sailesh duddupudi <[email protected]>

* Fix uses path

Signed-off-by: sailesh duddupudi <[email protected]>

* Add actions/checkout

Signed-off-by: sailesh duddupudi <[email protected]>

* Add bash to action.yaml

Signed-off-by: sailesh duddupudi <[email protected]>

* Install pip dependencies step

Signed-off-by: sailesh duddupudi <[email protected]>

* Add quotes for args

Signed-off-by: sailesh duddupudi <[email protected]>

* Add jupyter

Signed-off-by: sailesh duddupudi <[email protected]>

* Add nbformat_minor: 5 to fix invalid format error

Signed-off-by: sailesh duddupudi <[email protected]>

* Fix job name

Signed-off-by: sailesh duddupudi <[email protected]>

* test papermill-args-yaml

Signed-off-by: sailesh duddupudi <[email protected]>

* testing multi line args

Signed-off-by: sailesh duddupudi <[email protected]>

* testing multi line args1

Signed-off-by: sailesh duddupudi <[email protected]>

* testing multi line args2

Signed-off-by: sailesh duddupudi <[email protected]>

* testing multi line args3

Signed-off-by: sailesh duddupudi <[email protected]>

* Parameterize sdk install

Signed-off-by: sailesh duddupudi <[email protected]>

* Remove unnecessary output

Signed-off-by: sailesh duddupudi <[email protected]>

* nbformat normailze

Signed-off-by: sailesh duddupudi <[email protected]>

* [SDK] Training Client Conditions related unit tests (#2253)

* test: add unit test for get_job_conditions function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: add unit test for is_job_created function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: add unit test for is_job_running function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: add unit test for is_job_restarting function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: add unit test for is_job_failed function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: add unit test for is_job_succeded function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: improve job condition unit tests efficiency

Signed-off-by: Bobbins228 <[email protected]>

---------

Signed-off-by: Bobbins228 <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* [SDK] test: add unit test for list_jobs method of the training_client (#2267)

Signed-off-by: wei-chenglai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273)

Generate clientset, informers, listers and open api spec
for v2alpha1 APIs.

Signed-off-by: Varsha Prasad Narsing <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* [SDK] Use torchrun to create PyTorchJob from function (#2276)

* [SDK] Use torchrun to create PyTorchJob from function

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update PyTorchJob SDK example

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add consts for entrypoint

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add check for num procs per worker

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* [SDK] test: add unit test for get_job_logs method of the training_client (#2275)

Signed-off-by: wei-chenglai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* [v2alpha] Move GV related codebase (#2281)

Move GV related codebase in v2alpha

Signed-off-by: Varsha Prasad Narsing <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Implement runtime framework (#2248)

* KEP-2170: Implement runtime framework interfaces

Signed-off-by: Yuki Iwai <[email protected]>

* Remove grep dependency

Signed-off-by: Yuki Iwai <[email protected]>

* KEP-2170: Implement ValidateObjects interface to the runtime framework

Signed-off-by: Yuki Iwai <[email protected]>

* KEP-2170: Expose the TrainingRuntime and ClusterTrainingRuntime Kind

Signed-off-by: Yuki Iwai <[email protected]>

* KEP-2170: Remove unneeded scheme field from the internal TrainingRuntime

Signed-off-by: Yuki Iwai <[email protected]>

* Rephrase the error message

Signed-off-by: Yuki Iwai <[email protected]>

* Distinguish TrainingRuntime and ClusterTrainingRuntime when creating indexes for the TrainJobs

Signed-off-by: Yuki Iwai <[email protected]>

* Propagate the TrainJob labels and annotations to the JobSet

Signed-off-by: Yuki Iwai <[email protected]>

* Remove PodAnnotations from the runtime info

Signed-off-by: Yuki Iwai <[email protected]>

* Implement TrainingRuntime ReplicatedJob validation

Signed-off-by: Yuki Iwai <[email protected]>

* Add TODO comments

Signed-off-by: Yuki Iwai <[email protected]>

* Replace queueSuspendedTrainJob with queueSuspendedTrainJobs

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Add DeepSpeed Example with Pytorch Operator (#2235)

Signed-off-by: Syulin7 <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283)

* KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API

Signed-off-by: Andrey Velichkevich <[email protected]>

* Rename RuntimeRef in runtime framework

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260)

Signed-off-by: Akshay Chitneni <[email protected]>
Co-authored-by: Akshay Chitneni <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Upgrade Deepspeed demo dependencies (#2294)

Signed-off-by: Syulin7 <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Add manifests for Kubeflow Training V2 (#2289)

* KEP-2170: Add manifests for Kubeflow Training V2

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix invalid name for webhook config in cert

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix integration tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Move kubebuilder markers to runtime framework

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use Kubernetes recommended labels

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286)

* FSDP Example with PyTorchJob and T5 Fine-Tuning

Signed-off-by: Andrey Velichkevich <[email protected]>

* Modify text

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Implement TrainJob Reconciler to manage objects (#2295)

* KEP-2170: Implement TrainJob Reconciler to manage objects

Signed-off-by: Yuki Iwai <[email protected]>

* Mode dep-crds to manifests/external-crds

Signed-off-by: Yuki Iwai <[email protected]>

* Rename run with runtime

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Remove Prometheus Monitoring doc (#2301)

Signed-off-by: Sophie <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Decouple JobSet from TrainJob (#2296)

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304)

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Initialize runtimes before the manager starts (#2306)

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310)

* Generate SDK models for the Training V2 APIs

Signed-off-by: Andrey Velichkevich <[email protected]>

* Create pyproject.toml config

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove comments

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix pre-commit

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Create model and dataset initializers (#2303)

* KEP-2170: Create model and dataset initializers

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add abstract classes

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add storage URI to config

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update .gitignore

Co-authored-by: Kevin Hannon <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix the misspelling for initializer

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add .pt and .pth to ignore_patterns

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Kevin Hannon <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308)

* KEP-2170: Implement JobSet and PlainML Plugins

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix nil pointer exception for Trainer

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix unit tests in runtime package

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix unit tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix integration tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix lint

Signed-off-by: Andrey Velichkevich <[email protected]>

* Implement Torch Plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use list for the Info envs

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix golang ci

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix Torch plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use K8s sets
Update error return
Use ptr.Deref() for nil values

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use client.Object for Build() call

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove DeepCopy

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove MLPolicy and PodGroupPolicy from the Info object

Signed-off-by: Andrey Velichkevich <[email protected]>

* Inline error

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove SDK jar file

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add integration test for Torch plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add TODO to calculate PodGroup values in unit tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Revert the change to add original Runtime Policies to Info

Signed-off-by: Andrey Velichkevich <[email protected]>

* Create const for the DefaultJobReplicas

Signed-off-by: Andrey Velichkevich <[email protected]>

* Check if PodLabels is empty

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Implement Initializer builders in the JobSet plugin (#2316)

* KEP-2170: Implement Initializer builder in the JobSet plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update the SDK models

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove Info from Initializer builder

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update manifests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update pkg/constants/constants.go

Co-authored-by: Yuki Iwai <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>

* Use var for envs

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove check manifests from GitHub actions

Signed-off-by: Andrey Velichkevich <[email protected]>

* Move consts to JobSet plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Add the TrainJob state transition design (#2298)

* KEP-2170: Add the TrainJob state transition design

Signed-off-by: Yuki Iwai <[email protected]>

* Replace actual jobs with TrainJob

Signed-off-by: Yuki Iwai <[email protected]>

* Remove the JobSet conditions propagation and Add expanding runtime framework interfaces for each plugin

Signed-off-by: Yuki Iwai <[email protected]>

* Expand the Creation Failed reasons

Signed-off-by: Yuki Iwai <[email protected]>

* Rename Completed to Complete

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Update tf job examples to tf v2 (#2270)

* mnist with summaries updated to TF v2

Signed-off-by: yelias <[email protected]>

* tf_sample updated to TF v2

Signed-off-by: yelias <[email protected]>

* Add mnist_utils and update dist-mnist

Signed-off-by: yelias <[email protected]>

* Add mnist_utils and update dist-mnist

Signed-off-by: yelias <[email protected]>

* Remove old example - estimator-API, this example has been replaced by distribution_strategy

Signed-off-by: yelias <[email protected]>

* Small fix

Signed-off-by: yelias <[email protected]>

* Remove unsupported powerPC dockerfiles

Signed-off-by: yelias <[email protected]>

* Fix typo in copyright

Signed-off-by: yelias <[email protected]>

---------

Signed-off-by: yelias <[email protected]>
Co-authored-by: yelias <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Add TrainJob conditions (#2322)

* KEP-2170: Implement TrainJob conditions

Signed-off-by: Yuki Iwai <[email protected]>

* Fix API comments

Signed-off-by: Yuki Iwai <[email protected]>

* Make condition message constants

Signed-off-by: Yuki Iwai <[email protected]>

* Stop connecting condition type and reason in JobSet plugin

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Pin Gloo repository in JAX Dockerfile to a specific commit (#2329)

This commit pins the Gloo repository to a specific commit (43b7acbf) in
the JAX Dockerfile to prevent build failures caused by a recent bug
introduced in the Gloo codebase. By locking the version of Gloo to
a known working commit, we ensure that the JAX build remains stable and
functional until the issue is resolved upstream.

The build failure occurs when compiling the gloo/transport/tcp/buffer.cc
file due to an undefined __NR_gettid constant, which was introduced
after the pinned commit. By using this commit, we bypass the issue and
allow the build to complete successfully.

Signed-off-by: Sandipan Panda <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* [fix] Resolve v2alpha API exceptions (#2317)

Resolve v2alpha API exceptions by adding necessary listType validations.

Signed-off-by: Varsha Prasad Narsing <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Upgrade Kubernetes to v1.30.7 (#2332)

* Upgrade Kubernetes to v1.30.7

Signed-off-by: Antonin Stefanutti <[email protected]>

* Use typed event handlers and predicates in job controllers

Signed-off-by: Antonin Stefanutti <[email protected]>

* Re-organize pkg/common/util/reconciler.go

Signed-off-by: Antonin Stefanutti <[email protected]>

* Update installation instructions in README

Signed-off-by: Antonin Stefanutti <[email protected]>

---------

Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Ignore cache exporting errors in the image building workflows (#2336)

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Add Torch Distributed Runtime (#2328)

* KEP-2170: Add Torch Distributed Runtime

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add pip list

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Refine the server-side apply installation args (#2337)

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Add openapi-generator CLI option to skip SDK v2 test generation (#2338)

Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Upgrade kustomization files to Kustomize v5 (#2326)

Signed-off-by: oksanabaza <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Pin accelerate package version in trainer (#2340)

* Pin accelerate package version in trainer

Signed-off-by: Gavrish Prabhu <[email protected]>

* include new line to pass pre-commit hook

Signed-off-by: Gavrish Prabhu <[email protected]>

---------

Signed-off-by: Gavrish Prabhu <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Replace papermill command with bash script

Signed-off-by: sailesh duddupudi <[email protected]>

* Typo fix

Signed-off-by: sailesh duddupudi <[email protected]>

* Move Checkout step outside action.yaml file

Signed-off-by: sailesh duddupudi <[email protected]>

* Add newline EOF in script

Signed-off-by: sailesh duddupudi <[email protected]>

* Pass python dependencies as args and pin versions

Signed-off-by: sailesh duddupudi <[email protected]>

* Update Usage

Signed-off-by: sailesh duddupudi <[email protected]>

* Install dependencies in yaml

Signed-off-by: sailesh duddupudi <[email protected]>

* fix ipynb

Signed-off-by: sailesh duddupudi <[email protected]>

* set bash flags

Signed-off-by: sailesh duddupudi <[email protected]>

* Update script args and add more kubernetes versions for tests

Signed-off-by: sailesh duddupudi <[email protected]>

* add gang-scheduler-name to template

Signed-off-by: sailesh duddupudi <[email protected]>

* move go setup to template

Signed-off-by: sailesh duddupudi <[email protected]>

* remove -p parameter from script

Signed-off-by: sailesh duddupudi <[email protected]>

---------

Signed-off-by: sailesh duddupudi <[email protected]>
Signed-off-by: Bobbins228 <[email protected]>
Signed-off-by: wei-chenglai <[email protected]>
Signed-off-by: Varsha Prasad Narsing <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: Syulin7 <[email protected]>
Signed-off-by: Akshay Chitneni <[email protected]>
Signed-off-by: Sophie <[email protected]>
Signed-off-by: yelias <[email protected]>
Signed-off-by: Sandipan Panda <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: oksanabaza <[email protected]>
Signed-off-by: Gavrish Prabhu <[email protected]>
Co-authored-by: Mark Campbell <[email protected]>
Co-authored-by: Wei-Cheng Lai <[email protected]>
Co-authored-by: Varsha <[email protected]>
Co-authored-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>
Co-authored-by: yu lin <[email protected]>
Co-authored-by: Akshay Chitneni <[email protected]>
Co-authored-by: Akshay Chitneni <[email protected]>
Co-authored-by: Sophie Hsu <[email protected]>
Co-authored-by: Kevin Hannon <[email protected]>
Co-authored-by: YosiElias <[email protected]>
Co-authored-by: yelias <[email protected]>
Co-authored-by: Sandipan Panda <[email protected]>
Co-authored-by: Antonin Stefanutti <[email protected]>
Co-authored-by: Oksana Bazylieva <[email protected]>
Co-authored-by: Gavrish Prabhu <[email protected]>