
Conversation

@emerbe
Contributor

@emerbe emerbe commented Oct 13, 2025

…types of drivers

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This PR updates DRA testing to use the v1 stable API, since DRA is available in 1.34.
It also parametrizes the test logic a bit to simplify running the test with different drivers.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 13, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 13, 2025
@k8s-ci-robot
Contributor

Hi @emerbe. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 13, 2025
@emerbe
Contributor Author

emerbe commented Oct 13, 2025

/assign @mortent

@emerbe emerbe force-pushed the dra-update-to-stable-api branch 2 times, most recently from 71bf9f6 to 785b4f7 on October 13, 2025 09:23
@alaypatel07
Contributor

/cc @alaypatel07

I can help with reviews here

@emerbe
Contributor Author

emerbe commented Oct 14, 2025

/cc @alaypatel07

I can help with reviews here

Great, @alaypatel07! I was planning to ask you for a review after we finish the initial round with Morten.

Feel free to review it.

@emerbe emerbe force-pushed the dra-update-to-stable-api branch from 785b4f7 to dbe248b on October 14, 2025 07:37
@emerbe emerbe requested a review from mortent October 15, 2025 08:11
@emerbe
Contributor Author

emerbe commented Oct 17, 2025

Hello @alaypatel07, could you please take a look?
I'd appreciate it.

return true, nil
}

func getReadyNodesCount(config *dependency.Config) (int, error) {
Contributor

Is there a reason why we need this? I think this check will be erroneous.

Just because a node is not ready does not necessarily imply the driver pod on that node is not running.

Contributor Author

I've changed it based on my tests.

The previous check often failed, especially at large scale, because ResourceSlices are not created for NotReady nodes.

As a result, the count returned by GetClientSets().GetClient().ResourceV1().ResourceSlices() never matched workerCount and the test didn't start.

Contributor

The previous check often failed, especially at large scale, because ResourceSlices are not created for NotReady nodes.

I understand the issue, but is this a reliable check for it?

When the node was NotReady, was there a DRA driver plugin pod running there?

Instead of the node count, can we check whether the ResourceSlice count == driver plugin pod count?

Contributor Author

I believe both ways would work.

Changed as you suggested, PTAL
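
For reference, a minimal sketch of that check using plain client-go (not the actual test code). The dra-example-driver namespace matches the kubectl output later in this thread; the label selector is only an illustrative assumption:

package dra

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// resourceSlicesMatchPluginPods compares the number of published ResourceSlices
// (published per node by the driver's kubelet plugin) with the number of
// running driver plugin pods.
func resourceSlicesMatchPluginPods(ctx context.Context, c kubernetes.Interface) (bool, error) {
	slices, err := c.ResourceV1().ResourceSlices().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}

	// Illustrative selector; the real check would use whatever labels the
	// driver DaemonSet puts on its kubelet plugin pods.
	pods, err := c.CoreV1().Pods("dra-example-driver").List(ctx, metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/name=dra-example-driver",
	})
	if err != nil {
		return false, err
	}

	running := 0
	for _, p := range pods.Items {
		if p.Status.Phase == corev1.PodRunning {
			running++
		}
	}
	return len(slices.Items) == running, nil
}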

@emerbe emerbe force-pushed the dra-update-to-stable-api branch from dbe248b to 49e3932 on October 23, 2025 11:47
ttlSecondsAfterFinished: 300
# In tests involving a large number of sequentially created, short-lived jobs, the spin-up time may be significant.
# A TTL of 1 hour should be sufficient to retain the jobs long enough for measurement checks.
ttlSecondsAfterFinished: 3600 # 1 hour
Contributor

Which measurement depends on this config?

Contributor Author

The measurement that failed was WaitForFinishedJobs with job-type = short-lived.
It failed when I was running these tests at 5k-node scale.

My understanding of the problem is:

  • job.yaml sets ttlSecondsAfterFinished to 300 seconds, so Kubernetes automatically deletes a job 300s after it completes.
  • There are 10 jobs created sequentially.
  • I've checked, and it takes around 10 minutes for all of them to complete.
  • As the first jobs complete, their ttlSecondsAfterFinished timers start. Since this timer is shorter than the total time it takes for all jobs to be created and finished, the initial jobs are deleted before the final jobs are even created.

I suspect this creates a scenario where the WaitForFinishedJobs measurement can never meet its condition of "all jobs finished", because some jobs are being deleted while others are still pending or running. The measurement eventually times out and fails because it can't find the jobs it's looking for (see the toy timeline sketch below).

I've modified the test to increase ttlSecondsAfterFinished to 3600 seconds, which made the test pass.

I can also parametrize this and default it to 300, but I didn't see any harm in increasing it.
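
To make the race concrete, here is a toy timeline, not part of the test itself, assuming roughly one minute between job creations and a ~30s job run time (both assumptions):

package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		jobs     = 10                // jobs created sequentially
		interval = time.Minute       // assumed spacing between job creations
		runTime  = 30 * time.Second  // assumed run time of a short-lived job
		ttl      = 300 * time.Second // original ttlSecondsAfterFinished
	)
	lastCreated := time.Duration(jobs-1) * interval
	for i := 0; i < jobs; i++ {
		// A job created at i*interval finishes after runTime and is
		// garbage-collected ttl seconds later.
		deletedAt := time.Duration(i)*interval + runTime + ttl
		if deletedAt < lastCreated {
			fmt.Printf("job %d is garbage-collected at %v, before job %d is created at %v\n",
				i, deletedAt, jobs-1, lastCreated)
		}
	}
}

With these numbers, jobs 0-3 are already gone before job 9 is created, so a wait-for-all-jobs style check can never see all 10 jobs at once.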

Contributor

I suspect this creates a scenario where the WaitForFinishedJobs measurement can never meet its condition of "all jobs finished", because some jobs are being deleted while others are still pending or running.

I think we should remove this change here and take it to the PR that modifies the cl2/testing/* files. I would be curious to see whether WaitForFinishedJobs can account for deleted jobs that have completed. All it should care about is that the jobs which are present are in the Finished state.

Contributor Author

OK, moved to the second PR.

@alaypatel07
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 23, 2025
Contributor

@alaypatel07 alaypatel07 left a comment

Added a couple of comments: one about the assertion on the ResourceSlice count and a second about the TTL config change.

Other things look good to me, thanks @emerbe

@emerbe emerbe force-pushed the dra-update-to-stable-api branch 2 times, most recently from 5e6b282 to 7673d03 on October 24, 2025 17:35
@alaypatel07
Contributor

Thanks @emerbe, changes look good here.

/lgtm
/hold for #3629 since this PR has an urgent deadline

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 24, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 24, 2025
@alaypatel07
Contributor

/assign @mborsz

PTAL when you get a chance

@mborsz
Member

mborsz commented Oct 30, 2025

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 30, 2025
@alaypatel07
Contributor

@emerbe can you please rebase this? Once rebased, we can remove the hold and let this go through.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 30, 2025
@emerbe emerbe force-pushed the dra-update-to-stable-api branch from 7673d03 to 05f01bf on October 30, 2025 14:54
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 30, 2025
@emerbe
Contributor Author

emerbe commented Oct 30, 2025

/hold cancel
Rebased on master now that #3629 is merged.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 30, 2025
@alaypatel07
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 30, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alaypatel07, emerbe, mborsz, mortent

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit d57e351 into kubernetes:master Oct 30, 2025
7 checks passed
@alaypatel07
Contributor

@emerbe after merging this, the DRA jobs are failing; see here:
https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kops-gce-100-node-dra-with-workload-ipalias-using-cl2/1983938951760580608/build-log.txt

I1030 17:02:03.285546   21869 clusterloader.go:245] --------------------------------------------------------------------------------
I1030 17:02:03.830950   21869 framework.go:276] Applying templates for "manifests/dra-example-driver/*.yaml"
I1030 17:02:34.316595   21869 shared_informer.go:349] "Waiting for caches to sync" controller="PodsIndexer"
I1030 17:02:34.417120   21869 shared_informer.go:356] "Caches are synced" controller="PodsIndexer"
E1030 17:04:34.555246   21869 wait_for_controlled_pods.go:627] WaitForControlledPodsRunning: error for test-m4h5sx-1/long-running-0: got context deadline exceeded while waiting for 1 pods to be running in namespace(test-m4h5sx-1), controlledBy(long-running-0) - summary of pods : Pods: 1 out of 1 created, 0 running (0 updated), 0 pending scheduled, 1 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 

It looks like the driver is not getting installed?

When I compare this to a successful run before the PR:

I1030 17:26:52.555317   17727 simple_test_executor.go:58] AutomanagedNamespacePrefix: test-wowpao
I1030 17:26:52.591242   17727 simple_test_executor.go:77] Setting up 1 dependencies
I1030 17:26:52.598594   17727 dra.go:59] DRATestDriver: Installing DRA example driver
I1030 17:26:52.601728   17727 framework.go:276] Applying templates for "manifests/*.yaml"
I1030 17:26:52.601742   17727 framework.go:287] Applying manifests/clusterrole.yaml
I1030 17:26:52.605891   17727 framework.go:287] Applying manifests/clusterrolebinding.yaml
I1030 17:26:52.609437   17727 framework.go:287] Applying manifests/deviceclass.yaml
I1030 17:26:52.613990   17727 framework.go:287] Applying manifests/kubeletplugin.yaml
I1030 17:26:52.620517   17727 framework.go:287] Applying manifests/resourceQuota.yaml
I1030 17:26:52.624298   17727 framework.go:287] Applying manifests/serviceaccount.yaml
I1030 17:26:52.627495   17727 framework.go:287] Applying manifests/validatingadmissionpolicy.yaml
I1030 17:26:52.638692   17727 framework.go:287] Applying manifests/validatingadmissionpolicybinding.yaml
I1030 17:26:52.642957   17727 dra.go:89] DRATestDriver: checking if DRA driver dra-example-driver-kubeletplugin is healthy

https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-gce-100-node-dra-extended-resources-with-workload/1983945747292229632/build-log.txt

@emerbe
Contributor Author

emerbe commented Oct 31, 2025

@emerbe after merging this, the DRA jobs are failing; see here: https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kops-gce-100-node-dra-with-workload-ipalias-using-cl2/1983938951760580608/build-log.txt
[…]
It looks like the driver is not getting installed?

@alaypatel07 I think the culprit here is your fork being out of sync with master.
The test uses https://github.com/alaypatel07/perf-tests/tree/dra-extended-resources
[Screenshot 2025-10-31 at 12:02:00]

The dra-extended-resources branch in your fork has the old manifest directory structure, which does not contain the dra-example-driver subdirectory.
[Screenshot 2025-10-31 at 12:02:35]

@alaypatel07
Contributor

@emerbe
Contributor Author

emerbe commented Oct 31, 2025

@alaypatel07 Thanks for the hint.

I see that kube-scheduler says:
I1031 13:04:08.733031 12 schedule_one.go:1077] "Unable to schedule pod; no fit; waiting" pod="test-rdudq7-2/long-running-39-0-ww6jj" err="0/101 nodes are available: 1 node(s) had untolerated taint(s), 100 Insufficient example.com/gpu. no new claims to deallocate, preemption: 0/101 nodes are available: 101 Preemption is not helpful for scheduling."

https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kops-gce-100-node-dra-with-workload-ipalias-using-cl2/1984241196414799872/artifacts/cluster-info/kube-system/kube-scheduler-control-plane-us-east1-b-155g/kube-scheduler.log

Will be investigating further

@emerbe
Contributor Author

emerbe commented Oct 31, 2025

@alaypatel07 The problem is connected with CL2_ENABLE_EXTENDED_RESOURCES.

When CL2_ENABLE_EXTENDED_RESOURCES is set to its default value, which is false, the tests pass.
I've double-checked that locally.

The state of the cluster I created with CL2_ENABLE_EXTENDED_RESOURCES=true is below:

❯ kubectl get resourceclaims --all-namespaces
No resources found

❯ kubectl get resourceslices
NAME                                         NODE                   DRIVER            POOL                   AGE
test-cluster-worker-gpu.example.com-vdc7p    test-cluster-worker    gpu.example.com   test-cluster-worker    22s
test-cluster-worker2-gpu.example.com-fvg44   test-cluster-worker2   gpu.example.com   test-cluster-worker2   22s
test-cluster-worker3-gpu.example.com-kl744   test-cluster-worker3   gpu.example.com   test-cluster-worker3   22s
test-cluster-worker4-gpu.example.com-v98nj   test-cluster-worker4   gpu.example.com   test-cluster-worker4   22s
test-cluster-worker5-gpu.example.com-fpdph   test-cluster-worker5   gpu.example.com   test-cluster-worker5   22s

❯ kubectl get pods -n dra-example-driver
NAME                                     READY   STATUS    RESTARTS   AGE
dra-example-driver-kubeletplugin-bfw4l   1/1     Running   0          52s
dra-example-driver-kubeletplugin-m79z4   1/1     Running   0          52s
dra-example-driver-kubeletplugin-nlcfk   1/1     Running   0          52s
dra-example-driver-kubeletplugin-wdz8f   1/1     Running   0          52s
dra-example-driver-kubeletplugin-whcz4   1/1     Running   0          52s

but pods still cannot be scheduled:

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  28s   default-scheduler  0/6 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 5 Insufficient example.com/gpu. no new claims to deallocate, preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.

I guess there is an incompatibility between using extended resources and version 0.2.0 of the dra-example-driver, which maybe expects ResourceClaimTemplates to be created (and they are not created when CL2_ENABLE_EXTENDED_RESOURCES=true)?

I'm not an expert in drivers (including dra-example-driver), so I'd be grateful if you could take a look. Are you sure that using

          resources:
          {{ if .ExtendedResource }}
            limits:
              example.com/gpu: "1"

will work well with the new driver version?

@alaypatel07
Contributor

@alaypatel07 The problem is connected with CL2_ENABLE_EXTENDED_RESOURCES.

When CL2_ENABLE_EXTENDED_RESOURCES is set to its default value, which is false, the tests pass.
I've double-checked that locally.

@emerbe good catch, I think the env var should not be enabled for this test; I will remove the following:

      - name: CL2_ENABLE_EXTENDED_RESOURCES
        value: "true"
