Merged
@@ -630,7 +630,7 @@ Pods that are processed by Permit or PreBind plugins get NominatedNodeName durin
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes.
-The feature can be disabled in Alpha version by restarting the kube-scheduler and kube-apiserver with the feature-gate off.
+The feature can be disabled in Beta version by restarting the kube-scheduler and kube-apiserver with the feature-gate off.

###### What happens if we reenable the feature if it was previously rolled back?

Member:

In ###### Are there any tests for feature enablement/disablement?

This feature is only changing when a NominatedNodeName field will be set - it doesn't introduce a new API.

Is that correct? NominatedNodeName sounds like a new field in the Pod API.

Member:

In ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

We will do the following manual test after implementing the feature:

What was the result of the test?

Member:

NominatedNodeName field wasn't added by this KEP (it was added a long time ago). The KEP's purpose is to extend usage of this field.

Contributor:

NominatedNodeName field wasn't added by this KEP (it was added a long time ago). The KEP's purpose is to extend usage of this field.

The idea behind enablement/disablement tests is that depending on the FG the functionality is or is not working. So the question is more around ensuring that when you turn off the appropriate FG the functionality doesn't set NNN, or in case of kube-apiserver doesn't clear it, and vice versa when it's on. Especially at beta stage, where we need to ensure that users can safely turn off this on-by-default (beta) functionality.

Contributor:

In ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

We will do the following manual test after implementing the feature:

As @stlaz pointed out, this description is required for beta promotion.

Contributor:

Below in ##### What are the reasonable SLOs (Service Level Objectives) for the enhancement? I assume the functionality has already been tested; can you update that section with the throughput findings?

Contributor Author:

@soltysh
Addressing only the comments about upgrade->downgrade->upgrade testing (I will address the rest later today):

The idea behind enablement/disablement tests is that depending on the FG the functionality is or is not working. So the question is more around ensuring that when you turn off the appropriate FG the functionality doesn't set NNN, or in case of kube-apiserver doesn't clear it, and vice versa when it's on. Especially at beta stage, where we need to ensure that users can safely turn off this on-by-default (beta) functionality.

Thank you for bringing my attention to this point, I missed this earlier when editing the KEP.
And now I'm not really sure what this test should look like.

So the trouble is, this is not a typical promotion from alpha to beta.

The scope of this KEP in alpha allowed NNN to be set by components other than kube-scheduler and established the semantics for this field. The designed behavior was that the NNN field would be cleared in some situations, but not after a failed scheduling attempt.

But after the v1.34 release there was a change of plans in sig-scheduling, and with the idea of Gang Scheduling coming up (and with that - ideas for new approaches to resource reservation), it seems that NNN might not be the mechanism we want to invest in right now, as a means for other components to suggest pod placement to kube-scheduler.

At the same time using NNN as "set in kube-scheduler, read-only in CA" seems like a good and worthwhile approach to solve the buggy scenario "If pod P is scheduled to bind on node N, but binding P takes a long time, and N is otherwise empty, CA might turn down N before P gets bound".
So the decision was to narrow down the scope of this KEP significantly, and get it to beta.

Please note that before the alpha KEP the scheduler's code would clear NNN after a failed scheduling attempt. So what this hoping-to-be-beta KEP does vs pre-alpha is:

  • introduces setting NNN in PreBinding phase, i.e. when scheduler expects that the entire prebinding + binding process may take a significant amount of time (gated by nominatedNodeNameForExpectationEnabled)
  • makes kube-apiserver clear NNN when the pod gets bound (gated by ClearingNominatedNodeNameAfterBinding)

And what this beta-KEP does vs alpha-KEP is:

  • reverts the logic that does not clear NNN upon failed scheduling attempt (the logic was gated by nominatedNodeNameForExpectationEnabled)

With all that, in beta-KEP the NNN should be set when a pod is either waiting for preemption to complete (which had been the case before alpha-KEP), or during prebinding/binding phases. And it should be cleared after binding in api-server.

Can you please help me with the following questions?

  1. Since the implementation has yet to change, is it ok if I run the test after implementing the new version?

  2. IIUC the upgrade->downgrade->upgrade test scenario should be as follows, can you verify that?

  3. upgrade

  4. request scheduling of a pod that will need a long preBinding phase

  5. check that NNN gets set for that pod

  6. before binding completes, restart the scheduler with nominatedNodeNameForExpectationEnabled = false

  7. check that the pod gets scheduled and bound successfully to the same node

  8. and upgrade again?
    Or:
    6a. request scheduling another pod with expected long preBind
    7a. check that NNN does not get set in PreBind
    8a. restart the scheduler with nominatedNodeNameForExpectationEnabled = true
    9a. check that the pod gets scheduled and bound somewhere

Thank you!

Contributor Author:

@soltysh

Below in ##### What are the reasonable SLOs (Service Level Objectives) for the enhancement? I assume the functionality has been already tested, can you update that section with findings for the throughput?

As it turns out, there are no tests running with this feature enabled (probably because the original plan was to launch it as beta in 1.34, with the FG on by default, so all tests would have exercised it).
I can run perf tests now and update the numbers here, but since the implementation is going to change after the KEP update, perhaps I can run the test on the actual implementation, when it's ready? WDYT?

Contributor:

1. Since the implementation has yet to change, is it ok if I run the test after implementing the new version?

Yes, you'll want to update the doc afterwards.

2. IIUC the upgrade->downgrade->upgrade test scenario should be as follows, can you verify that?

3. upgrade

4. request scheduling of a pod that will need a long preBinding phase

5. check that NNN gets set for that pod

6. before binding completes, restart the scheduler with nominatedNodeNameForExpectationEnabled = false

7. check that the pod gets scheduled and bound successfully to the same node

8. and upgrade again?
   Or:
   6a. request scheduling another pod with expected long preBind
   7a. check that NNN does not get set in PreBind
   8a. restart the scheduler with nominatedNodeNameForExpectationEnabled = true
   9a. check that the pod gets scheduled and bound somewhere

Both options are fine by me. But it seems the versions with (a) will be easier to perform.

As it turns out, there are no tests running with this feature enabled (probably because the original plan was to launch it as beta in 1.34, with the FG on by default, so all tests would have exercised it).
I can run perf tests now and update the numbers here, but since the implementation is going to change after the KEP update, perhaps I can run the test on the actual implementation, when it's ready? WDYT?

Yes, it can be updated in a follow-up.

Contributor Author:

Thank you! I will make sure to update the doc with all the results.

@@ -752,8 +752,8 @@ No.
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

Yes - but it should be negligible impact.
-The memory usage in kube-scheduler is supposed to increase by external components starting to use this
-because when `NominatedNodeName` is added on the pods, the scheduler's internal component called `nominator` has to record them so that scheduling cycles can refer to them as necessary.
+The memory usage in kube-scheduler is supposed to increase because when `NominatedNodeName` is added on the pods, the scheduler's
+internal component called `nominator` has to record them so that scheduling cycles can refer to them as necessary.

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

@@ -17,7 +17,7 @@ approvers:

stage: alpha

-latest-milestone: "v1.34"
+latest-milestone: "v1.35"

milestone:
alpha: "v1.34"