Update DRA testing to stable API version and prepare it to test more … #3641
Conversation
Hi @emerbe. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with  Once the patch is verified, the new status will be reflected by the  I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/assign @mortent
Force-pushed 71bf9f6 to 785b4f7
/cc @alaypatel07 I can help with reviews here
Great @alaypatel07! I was planning to ask you for the review after we finish the initial round with Morten. Feel free to review it.
Force-pushed 785b4f7 to dbe248b
Hello @alaypatel07, could you please take a look?
return true, nil
}

func getReadyNodesCount(config *dependency.Config) (int, error) {
Is there a reason why we need this? I think this check will be erroneous.
Just because a node is not ready does not necessarily imply the driver pod on that node is not running.
I've changed it based on my tests.
The previous check often failed, especially at large scale, because ResourceSlices are not created for nodes that are not ready,
so the count returned by GetClientSets().GetClient().ResourceV1().ResourceSlices() was not equal to workerCount and the test never started.
Previous check often failed, especially on the big scale as resourceSlices have not been created for not ready nodes.
I understand the issue, but is this a reliable check for it?
When the Node was NotReady, was there a DRA driver plugin pod running there?
Instead of node count, can we check whether the ResourceSlice count == driver plugin pod count?
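The comparison suggested here can be sketched as pure logic. This is a hedged sketch: the function and map names are hypothetical, and in the real check both maps would be populated from client-go List calls (ResourceSlices via `ResourceV1()`, driver plugin pods via `CoreV1().Pods(...)`), which are elided to keep the example self-contained.

```go
package main

import "fmt"

// resourceSlicesMatchDrivers reports whether every node running a DRA
// driver plugin pod has published a ResourceSlice, and vice versa.
// Keys are node names; the maps stand in for the results of the two
// List calls mentioned above (hypothetical shape, not the PR's code).
func resourceSlicesMatchDrivers(sliceNodes, driverPodNodes map[string]bool) bool {
	if len(sliceNodes) != len(driverPodNodes) {
		return false
	}
	for node := range driverPodNodes {
		if !sliceNodes[node] {
			// A driver pod is running but its slice has not appeared yet.
			return false
		}
	}
	return true
}

func main() {
	slices := map[string]bool{"node-a": true, "node-b": true}
	pods := map[string]bool{"node-a": true, "node-b": true}
	fmt.Println(resourceSlicesMatchDrivers(slices, pods)) // true: counts and nodes line up

	delete(slices, "node-b") // driver pod running, slice not published yet
	fmt.Println(resourceSlicesMatchDrivers(slices, pods)) // false: keep polling
}
```

Comparing per-node rather than just comparing counts also guards against the case where the numbers match but the slices belong to different nodes than the running driver pods.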
Both ways would work, I believe.
Changed as you suggested, PTAL.
        
          
clusterloader2/pkg/dependency/dra/manifests/dra-example-driver/deviceclass.yaml (resolved)
Force-pushed dbe248b to 49e3932
            
          
clusterloader2/testing/dra/job.yaml (outdated)
- ttlSecondsAfterFinished: 300
+ # In tests involving a large number of sequentially created, short-lived jobs, the spin-up time may be significant.
+ # A TTL of 1 hour should be sufficient to retain the jobs long enough for measurement checks.
+ ttlSecondsAfterFinished: 3600 # 1 hour
Which measurement depends on this config?
The measurement that failed was WaitForFinishedJobs with job-type = short-lived.
It failed when I was running these tests at 5k-node scale.
My understanding of the problem is:
- job.yaml sets the TTL to 300 seconds, so 300s after a job completes, Kubernetes automatically deletes it.
- There are 10 jobs created sequentially.
- I've checked, and it takes around 10 minutes for all of them to complete.
- As the first jobs complete, the ttlSecondsAfterFinished timer starts. Since this timer is shorter than the total time it takes for all jobs to be created and finished, the initial jobs are deleted before the final jobs are even created.
I suspect this creates a scenario where the WaitForFinishedJobs measurement can never meet its condition of "all jobs finished", because some jobs are being deleted while others are still pending or running. The measurement eventually times out and fails because it can't find the jobs it's looking for.
I've modified the test to increase ttlSecondsAfterFinished to 3600 seconds, which made the test pass.
I can also parametrize this and set the default to 300, but I didn't see any harm in increasing it.
I suspect that creates a scenario where the WaitForFinishedJobs measurement can never meet its condition of "all jobs finished" because some jobs are being deleted while others are still in a pending or running state.
I think we should remove this change here and take it to the PR that modifies the cl2/testing/* files. I would be curious to see whether WaitForFinishedJobs can account for completed jobs that have been deleted; all it should care about is that the jobs that are present are in the Finished state.
OK, moved to the second PR.
/ok-to-test
        
          
clusterloader2/pkg/dependency/dra/manifests/dra-example-driver/deviceclass.yaml (outdated, resolved)
        
Added a couple of comments: one about the assertion on the ResourceSlice count, and a second about the TTL config change.
Other things look good to me, thanks @emerbe
Force-pushed 5e6b282 to 7673d03
/assign @mborsz PTAL when you get a chance
/approve
@emerbe can you please rebase this? Once rebased, we can remove the hold and let this go through.
Force-pushed 7673d03 to 05f01bf
/hold cancel
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: alaypatel07, emerbe, mborsz, mortent. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@emerbe after merging this, the DRA jobs are failing, see here: It looks like the driver is not getting installed? When I compare this to a successful run before the PR:
@alaypatel07 I think that the culprit here is your fork being out of sync with master. Branch
@emerbe the tests with my branch are green: https://testgrid.k8s.io/sig-scalability-dra#gce-dra-extended-resources-with-workload-master-scalability-100. The failures are happening on https://testgrid.k8s.io/sig-scalability-dra#gce-dra-with-workload-master-scalability-100-canary, which is configured with the master branch.
@alaypatel07 Thanks for the hint. I see that kube-scheduler says: Will be investigating further.
@alaypatel07 The problem is connected with  When  The cluster which I've created with  but still pods cannot be scheduled: I guess there is an incompatibility between the usage of  I'm not an expert in drivers (including  will work well with the new driver version?
@emerbe good catch, I think the env var should not be enabled for the test. I will remove the following:


…types of drivers
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
This PR updates DRA testing to use the V1 stable API, since DRA is stable as of Kubernetes 1.34.
It also parametrizes the test logic a bit to simplify running the tests with different drivers.
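As a rough illustration of what "parametrized" means here, the driver-specific knobs could be grouped into a struct with defaults. This is a hedged sketch with hypothetical type, field, and value names; it is not the PR's actual code.

```go
package main

import "fmt"

// DriverParams groups the knobs a DRA scale test needs instead of
// hard-coding a single driver. All names below are illustrative
// placeholders, not identifiers from the actual PR.
type DriverParams struct {
	Name            string // driver identifier, e.g. for logging
	DeviceClassName string // DeviceClass the test workload requests
	ManifestsPath   string // where the driver's install manifests live
}

// defaultDriver returns the parameters the test falls back to when no
// overrides are supplied (values mirror the dra-example-driver layout
// seen in this PR's file paths, but are still assumptions).
func defaultDriver() DriverParams {
	return DriverParams{
		Name:            "dra-example-driver",
		DeviceClassName: "example-device-class",
		ManifestsPath:   "manifests/dra-example-driver",
	}
}

func main() {
	p := defaultDriver()
	fmt.Printf("testing against %s using class %s\n", p.Name, p.DeviceClassName)
}
```

Running a different driver would then mean supplying a different `DriverParams` rather than editing the test logic itself.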
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer: