Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add Controller Revision (Implementation of KEP #238) #277

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

Edwinhr716
Copy link
Contributor

@Edwinhr716 Edwinhr716 commented Dec 6, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it

This PR adds controller revision. This allows the controller to store previous iterations of the LWS object, which then makes it possible to select the correct pod template spec to use if a replica is restarted during rolling update.

Which issue(s) this PR fixes

Fixes #238
Fixes #239
Fixes #240
Fixes #281

Special notes for your reviewer

Does this PR introduce a user-facing change?


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 6, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Edwinhr716
Once this PR has been reviewed and has the lgtm label, please assign ahg-g for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from ahg-g December 6, 2024 19:11
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Dec 6, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 16, 2024
Copy link
Contributor

@ahg-g ahg-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First round, I didn't look at tests.

api/leaderworkerset/v1/leaderworkerset_types.go Outdated Show resolved Hide resolved
api/leaderworkerset/v1/leaderworkerset_types.go Outdated Show resolved Hide resolved
pkg/history/controller_history.go Show resolved Hide resolved
log := ctrl.LoggerFrom(ctx).WithValues("leaderworkerset", klog.KObj(lws))
ctx = ctrl.LoggerInto(ctx, log)
controllerHistory := history.NewHistory(k8sClient, ctx)
selector, err := metav1.LabelSelectorAsSelector(&metav1.LabelSelector{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be selecting the revisions of this specific lws? An empty selector selects everything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add the lws name as a label to the revision, and only select the revisions of that specific lws

pkg/utils/controller/controller_utils.go Outdated Show resolved Hide resolved
pkg/controllers/leaderworkerset_controller.go Outdated Show resolved Hide resolved
pkg/controllers/leaderworkerset_controller.go Outdated Show resolved Hide resolved
pkg/utils/controller/controller_utils.go Outdated Show resolved Hide resolved
pkg/utils/controller/controller_utils.go Outdated Show resolved Hide resolved
pkg/controllers/leaderworkerset_controller.go Outdated Show resolved Hide resolved
Copy link
Contributor

@ahg-g ahg-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more comments on simplifying the history pkg

pkg/history/controller_history.go Outdated Show resolved Hide resolved
pkg/history/controller_history.go Outdated Show resolved Hide resolved
pkg/history/controller_history.go Outdated Show resolved Hide resolved
pkg/history/controller_history.go Outdated Show resolved Hide resolved
pkg/utils/controller/controller_utils.go Outdated Show resolved Hide resolved
pkg/utils/controller/controller_utils.go Outdated Show resolved Hide resolved
pkg/utils/controller/controller_utils.go Outdated Show resolved Hide resolved
pkg/utils/controller/controller_utils.go Outdated Show resolved Hide resolved
pkg/utils/controller/controller_utils.go Outdated Show resolved Hide resolved
pkg/history/controller_history.go Outdated Show resolved Hide resolved
test/testutils/util.go Outdated Show resolved Hide resolved
@@ -288,3 +293,24 @@ func MakeLeaderPodSpecWithTPUResource() corev1.PodSpec {
Subdomain: "default",
}
}

func RawLWSTemplate(lws *leaderworkerset.LeaderWorkerSet) runtime.RawExtension {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure we should be using a function like this one in our tests; our tests shouldn't rely on a copy the function that we partially try to test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case it is not needed, but there are cases where we do need to duplicate functions as otherwise a dependency cycle is created.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the commented out code please.

@Edwinhr716
Copy link
Contributor Author

Addressed majority of comments, still need to add more tests and rebase to be ready to merge

@ahg-g
Copy link
Contributor

ahg-g commented Dec 23, 2024

Great, i will look later today, please ensure we have sufficient integration test coverage

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 23, 2024
@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Dec 23, 2024

@Edwinhr716: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-lws-test-e2e-main-1-28 0ff8f71 link true /test pull-lws-test-e2e-main-1-28

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

}

func (r *LeaderWorkerSetReconciler) leaderWorkerSetUpdated(ctx context.Context, sts *appsv1.StatefulSet, lws *leaderworkerset.LeaderWorkerSet) (bool, error) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove empty line

return ctrl.Result{}, err
}

if err := r.createControllerRevisionIfNonExist(ctx, leaderSts, lws); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should return the revision that the function below leaderWorkerSetUpdated and rollingUpdateParameters will use.

// Creates a Controller Revision if the leader statefulset exists but no revisions have been created yet. This happens when updating from a version that doesn't
// support controller revision
func (r *LeaderWorkerSetReconciler) createControllerRevisionIfNonExist(ctx context.Context, sts *appsv1.StatefulSet, lws *leaderworkerset.LeaderWorkerSet) error {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove the empty line

return err
}

if !existingControllerRevisions {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having an existing controller revision doesn't mean we shouldn't create, what if there is no one with a matching hash label?

return revisions[0], nil
}

func ExistingControllerRevisions(ctx context.Context, k8sClient client.Client, lws *leaderworkerset.LeaderWorkerSet) (bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this function, the function above that looks up a revision of a specific hash should be the one we use.

return ctrl.Result{}, err
}

templateHash := getLeaderWorkerTemplateHash(leaderSts, lws, lwsUpdated)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not create a template hash here, it should come from the revision returned as suggested in line 109 above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we still need a hashing function that treats NetworkConfig == nil and NetworkConfig.SubdomainShared as the same though?

When updating from a version that doesn't have networkConfig the steps would be:
Create a controller revision where the NetworkConfig si nil. When checking if LWS was updated, it will determine that it isn't because the hash is the same.

Another change is made, similar to #240 (comment) triggering the lws webhook, and defaulting NetworkConfig. A new controller revision will be created, as the data will now be different. Now, the hash won't match the hash that is in the leaderSts, triggering a rolling update when it isn't supposed to.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not comparing hashes anymore though, the hash is just a key to the controller revision. We depend on semantic diff of the revisions.

if err := revisionutils.CreateLeaderWorkerSetRevision(ctx, r.Client, lws, templateHash); err != nil {
log.Error(err, "Creating LWS Revision")
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't be creating here, the creation will be done at the beginning of the reconcile function via createControllerRevisionIfNonExist

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this one + removing templateHash requires changes to the way we set the templateHash label in the revision. With these changes, we rely on leader sts to determine what templateHash to use. That is why createControllerRevisionIfNonExist checks if the leaderSts exists before attempting to create a new one. When a leaderSts doesn't exist however, we need to determine what templateHash to use. But the templateHash depends on a revision being created.

So in the case where the leaderSts doesn't exist yet, we would need to set the value of templateHash to the hashed controller revision, basically having two labels with the same value.

Or, we can change the controller.kubernetes.io/hash label to be our selector instead. In the case where we pass a templateHash, we set that label equal to the templateHash value.

@@ -387,6 +410,9 @@ func (r *LeaderWorkerSetReconciler) updateConditions(ctx context.Context, lws *l
conditions = append(conditions, makeCondition(leaderworkerset.LeaderWorkerSetUpgradeInProgress))
} else if updatedAndReadyCount == int(*lws.Spec.Replicas) {
conditions = append(conditions, makeCondition(leaderworkerset.LeaderWorkerSetAvailable))
if err := revisionutils.TruncateHistory(ctx, r.Client, lws, templateHash); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a follow up, we can store the current and updated revisions for debugging purposes. The current one can be set here, while the updated one can be set in the if block above.

@@ -118,8 +119,12 @@ func (r *PodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.R
log.V(2).Info("defer the creation of the worker statefulset because leader pod is not ready.")
return ctrl.Result{}, nil
}

statefulSet, err := constructWorkerStatefulSetApplyConfiguration(pod, leaderWorkerSet)
currentRevision, err := revisionutils.GetLeaderWorkerSetRevisionFromTemplateHash(ctx, r.Client, &leaderWorkerSet, pod.Labels[leaderworkerset.TemplateRevisionHashKey])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, I would not call it current because it could be current or the updated one, call it revision.

@@ -288,3 +293,24 @@ func MakeLeaderPodSpecWithTPUResource() corev1.PodSpec {
Subdomain: "default",
}
}

func RawLWSTemplate(lws *leaderworkerset.LeaderWorkerSet) runtime.RawExtension {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the commented out code please.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
3 participants