
Conversation

@emerbe
Contributor

@emerbe emerbe commented Oct 13, 2025

…types of drivers

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This PR updates DRA testing to use the v1 stable API, since DRA is available in 1.34.
It also parametrizes the test logic a bit to simplify running the test with different drivers.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 13, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 13, 2025
@k8s-ci-robot
Contributor

Hi @emerbe. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 13, 2025
@emerbe
Contributor Author

emerbe commented Oct 13, 2025

/assign @mortent

@emerbe emerbe force-pushed the dra-update-to-stable-api branch 2 times, most recently from 71bf9f6 to 785b4f7 on October 13, 2025 09:23
@alaypatel07
Contributor

/cc @alaypatel07

I can help with reviews here

@emerbe
Contributor Author

emerbe commented Oct 14, 2025

/cc @alaypatel07

I can help with reviews here

Great, @alaypatel07! I was planning to ask you for a review after we finish the initial round with Morten.

Feel free to review it.

@emerbe emerbe force-pushed the dra-update-to-stable-api branch from 785b4f7 to dbe248b on October 14, 2025 07:37
@emerbe emerbe requested a review from mortent October 15, 2025 08:11
@emerbe
Contributor Author

emerbe commented Oct 17, 2025

Hello @alaypatel07, could you please take a look?
I'd appreciate it.

return true, nil
}

func getReadyNodesCount(config *dependency.Config) (int, error) {
Contributor

Is there a reason why we need this? I think this check will be erroneous.

Just because a node is not ready does not necessarily imply the driver pod on that node is not running.

Contributor Author

I've changed it based on my tests.

The previous check often failed, especially at large scale, because ResourceSlices are not created for NotReady nodes.

As a result, the count returned by GetClientSets().GetClient().ResourceV1().ResourceSlices() never matched workerCount and the test didn't start.

Contributor

The previous check often failed, especially at large scale, because ResourceSlices are not created for NotReady nodes.

I understand the issue, but is this a reliable check for it?

When the node was NotReady, was there a DRA driver plugin pod running there?

Instead of the node count, can we check whether the ResourceSlice count == driver plugin pod count?

Contributor Author

I believe both ways would work.

Changed as you suggested, PTAL
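
For reference, a minimal sketch of that check using plain client-go (not the actual test code). The dra-example-driver namespace matches the kubectl output later in this thread; the label selector is only an illustrative assumption:

package dra

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// resourceSlicesMatchPluginPods compares the number of published ResourceSlices
// (published per node by the driver's kubelet plugin) with the number of
// running driver plugin pods.
func resourceSlicesMatchPluginPods(ctx context.Context, c kubernetes.Interface) (bool, error) {
	slices, err := c.ResourceV1().ResourceSlices().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}

	// Illustrative selector; the real check would use whatever labels the
	// driver DaemonSet puts on its kubelet plugin pods.
	pods, err := c.CoreV1().Pods("dra-example-driver").List(ctx, metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/name=dra-example-driver",
	})
	if err != nil {
		return false, err
	}

	running := 0
	for _, p := range pods.Items {
		if p.Status.Phase == corev1.PodRunning {
			running++
		}
	}
	return len(slices.Items) == running, nil
}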

@emerbe emerbe force-pushed the dra-update-to-stable-api branch from dbe248b to 49e3932 on October 23, 2025 11:47
ttlSecondsAfterFinished: 300
# In tests involving a large number of sequentially created, short-lived jobs, the spin-up time may be significant.
# A TTL of 1 hour should be sufficient to retain the jobs long enough for measurement checks.
ttlSecondsAfterFinished: 3600 # 1 hour
Contributor

Which measurement depends on this config?

Contributor Author

The measurement that failed was WaitForFinishedJobs with job-type = short-lived.
It failed when I was running these tests at 5k-node scale.

My understanding of the problem is:

  • job.yaml sets ttlSecondsAfterFinished to 300 seconds, so Kubernetes automatically deletes a job 300s after it completes.
  • There are 10 jobs created sequentially.
  • I've checked, and it takes around 10 minutes for all of them to complete.
  • As the first jobs complete, their ttlSecondsAfterFinished timers start. Since this timer is shorter than the total time it takes for all jobs to be created and finished, the initial jobs are deleted before the final jobs are even created.

I suspect this creates a scenario where the WaitForFinishedJobs measurement can never meet its condition of "all jobs finished", because some jobs are being deleted while others are still pending or running. The measurement eventually times out and fails because it can't find the jobs it's looking for (see the toy timeline sketch below).

I've modified the test to increase ttlSecondsAfterFinished to 3600 seconds, which made the test pass.

I can also parametrize this and default it to 300, but I didn't see any harm in increasing it.
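
To make the race concrete, here is a toy timeline, not part of the test itself, assuming roughly one minute between job creations and a ~30s job run time (both assumptions):

package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		jobs     = 10                // jobs created sequentially
		interval = time.Minute       // assumed spacing between job creations
		runTime  = 30 * time.Second  // assumed run time of a short-lived job
		ttl      = 300 * time.Second // original ttlSecondsAfterFinished
	)
	lastCreated := time.Duration(jobs-1) * interval
	for i := 0; i < jobs; i++ {
		// A job created at i*interval finishes after runTime and is
		// garbage-collected ttl seconds later.
		deletedAt := time.Duration(i)*interval + runTime + ttl
		if deletedAt < lastCreated {
			fmt.Printf("job %d is garbage-collected at %v, before job %d is created at %v\n",
				i, deletedAt, jobs-1, lastCreated)
		}
	}
}

With these numbers, jobs 0-3 are already gone before job 9 is created, so a wait-for-all-jobs style check can never see all 10 jobs at once.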

Contributor

I suspect this creates a scenario where the WaitForFinishedJobs measurement can never meet its condition of "all jobs finished", because some jobs are being deleted while others are still pending or running.

I think we should remove this change here and take it to the PR that modifies the cl2/testing/* files. I would be curious to see whether WaitForFinishedJobs can account for deleted jobs that have completed. All it should care about is that the jobs which are present are in the Finished state.

Contributor Author

OK, moved to the second PR.

@alaypatel07
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 23, 2025
Contributor

@alaypatel07 alaypatel07 left a comment

Added a couple of comments: one about the assertion on the ResourceSlice count and a second about the TTL config change.

Other things look good to me, thanks @emerbe

@emerbe emerbe force-pushed the dra-update-to-stable-api branch 2 times, most recently from 5e6b282 to 7673d03 on October 24, 2025 17:35
@alaypatel07
Contributor

Thanks @emerbe, changes look good here.

/lgtm
/hold for #3629 since this PR has an urgent deadline

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 24, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 24, 2025
@alaypatel07
Contributor

/assign @mborsz

PTAL when you get a chance

@mborsz
Member

mborsz commented Oct 30, 2025

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 30, 2025
@alaypatel07
Contributor

@emerbe can you please rebase this? Once rebased, we can remove the hold and let this go through.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 30, 2025
@emerbe emerbe force-pushed the dra-update-to-stable-api branch from 7673d03 to 05f01bf on October 30, 2025 14:54
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 30, 2025
@emerbe
Contributor Author

emerbe commented Oct 30, 2025

/hold cancel
Rebased on master now that #3629 is merged.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 30, 2025
@alaypatel07
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 30, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alaypatel07, emerbe, mborsz, mortent

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit d57e351 into kubernetes:master Oct 30, 2025
7 checks passed
@alaypatel07
Contributor

@emerbe after merging this, the DRA jobs are failing; see here:
https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kops-gce-100-node-dra-with-workload-ipalias-using-cl2/1983938951760580608/build-log.txt

I1030 17:02:03.285546   21869 clusterloader.go:245] --------------------------------------------------------------------------------
I1030 17:02:03.830950   21869 framework.go:276] Applying templates for "manifests/dra-example-driver/*.yaml"
I1030 17:02:34.316595   21869 shared_informer.go:349] "Waiting for caches to sync" controller="PodsIndexer"
I1030 17:02:34.417120   21869 shared_informer.go:356] "Caches are synced" controller="PodsIndexer"
E1030 17:04:34.555246   21869 wait_for_controlled_pods.go:627] WaitForControlledPodsRunning: error for test-m4h5sx-1/long-running-0: got context deadline exceeded while waiting for 1 pods to be running in namespace(test-m4h5sx-1), controlledBy(long-running-0) - summary of pods : Pods: 1 out of 1 created, 0 running (0 updated), 0 pending scheduled, 1 not scheduled, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 

It looks like the driver is not getting installed?

When I compare this to a successful run before the PR:

I1030 17:26:52.555317   17727 simple_test_executor.go:58] AutomanagedNamespacePrefix: test-wowpao
I1030 17:26:52.591242   17727 simple_test_executor.go:77] Setting up 1 dependencies
I1030 17:26:52.598594   17727 dra.go:59] DRATestDriver: Installing DRA example driver
I1030 17:26:52.601728   17727 framework.go:276] Applying templates for "manifests/*.yaml"
I1030 17:26:52.601742   17727 framework.go:287] Applying manifests/clusterrole.yaml
I1030 17:26:52.605891   17727 framework.go:287] Applying manifests/clusterrolebinding.yaml
I1030 17:26:52.609437   17727 framework.go:287] Applying manifests/deviceclass.yaml
I1030 17:26:52.613990   17727 framework.go:287] Applying manifests/kubeletplugin.yaml
I1030 17:26:52.620517   17727 framework.go:287] Applying manifests/resourceQuota.yaml
I1030 17:26:52.624298   17727 framework.go:287] Applying manifests/serviceaccount.yaml
I1030 17:26:52.627495   17727 framework.go:287] Applying manifests/validatingadmissionpolicy.yaml
I1030 17:26:52.638692   17727 framework.go:287] Applying manifests/validatingadmissionpolicybinding.yaml
I1030 17:26:52.642957   17727 dra.go:89] DRATestDriver: checking if DRA driver dra-example-driver-kubeletplugin is healthy

https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-gce-100-node-dra-extended-resources-with-workload/1983945747292229632/build-log.txt

@emerbe
Contributor Author

emerbe commented Oct 31, 2025

@emerbe after merging this, the DRA jobs are failing; see here: https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kops-gce-100-node-dra-with-workload-ipalias-using-cl2/1983938951760580608/build-log.txt
[…]
It looks like the driver is not getting installed?

@alaypatel07 I think the culprit here is your fork being out of sync with master.
The test uses https://github.com/alaypatel07/perf-tests/tree/dra-extended-resources
[Screenshot 2025-10-31 at 12:02:00]

The dra-extended-resources branch in your fork has the old manifest directory structure, which does not contain the dra-example-driver subdirectory.
[Screenshot 2025-10-31 at 12:02:35]

@alaypatel07
Contributor

@emerbe
Contributor Author

emerbe commented Oct 31, 2025

@alaypatel07 Thanks for the hint.

I see that kube-scheduler says:
I1031 13:04:08.733031 12 schedule_one.go:1077] "Unable to schedule pod; no fit; waiting" pod="test-rdudq7-2/long-running-39-0-ww6jj" err="0/101 nodes are available: 1 node(s) had untolerated taint(s), 100 Insufficient example.com/gpu. no new claims to deallocate, preemption: 0/101 nodes are available: 101 Preemption is not helpful for scheduling."

https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kops-gce-100-node-dra-with-workload-ipalias-using-cl2/1984241196414799872/artifacts/cluster-info/kube-system/kube-scheduler-control-plane-us-east1-b-155g/kube-scheduler.log

Will be investigating further

@emerbe
Contributor Author

emerbe commented Oct 31, 2025

@alaypatel07 The problem is connected with CL2_ENABLE_EXTENDED_RESOURCES.

When CL2_ENABLE_EXTENDED_RESOURCES is set to its default value, which is false, the tests pass.
I've double-checked that locally.

The state of the cluster I created with CL2_ENABLE_EXTENDED_RESOURCES=true is below:

❯ kubectl get resourceclaims --all-namespaces
No resources found

❯ kubectl get resourceslices
NAME                                         NODE                   DRIVER            POOL                   AGE
test-cluster-worker-gpu.example.com-vdc7p    test-cluster-worker    gpu.example.com   test-cluster-worker    22s
test-cluster-worker2-gpu.example.com-fvg44   test-cluster-worker2   gpu.example.com   test-cluster-worker2   22s
test-cluster-worker3-gpu.example.com-kl744   test-cluster-worker3   gpu.example.com   test-cluster-worker3   22s
test-cluster-worker4-gpu.example.com-v98nj   test-cluster-worker4   gpu.example.com   test-cluster-worker4   22s
test-cluster-worker5-gpu.example.com-fpdph   test-cluster-worker5   gpu.example.com   test-cluster-worker5   22s

❯ kubectl get pods -n dra-example-driver
NAME                                     READY   STATUS    RESTARTS   AGE
dra-example-driver-kubeletplugin-bfw4l   1/1     Running   0          52s
dra-example-driver-kubeletplugin-m79z4   1/1     Running   0          52s
dra-example-driver-kubeletplugin-nlcfk   1/1     Running   0          52s
dra-example-driver-kubeletplugin-wdz8f   1/1     Running   0          52s
dra-example-driver-kubeletplugin-whcz4   1/1     Running   0          52s

but pods still cannot be scheduled:

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  28s   default-scheduler  0/6 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 5 Insufficient example.com/gpu. no new claims to deallocate, preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.

I guess there is an incompatibility between using extended resources and version 0.2.0 of the dra-example-driver, which maybe expects ResourceClaimTemplates to be created (and they are not created when CL2_ENABLE_EXTENDED_RESOURCES=true)?

I'm not an expert in drivers (including dra-example-driver), so I'd be grateful if you could take a look. Are you sure that using

          resources:
          {{ if .ExtendedResource }}
            limits:
              example.com/gpu: "1"

will work well with the new driver version?

@alaypatel07
Contributor

@alaypatel07 The problem is connected with CL2_ENABLE_EXTENDED_RESOURCES.

When CL2_ENABLE_EXTENDED_RESOURCES is set to its default value, which is false, the tests pass.
I've double-checked that locally.

@emerbe good catch, I think the env var should not be enabled for this test; I will remove the following:

      - name: CL2_ENABLE_EXTENDED_RESOURCES
        value: "true"
