
fix(controller): clean stale status entries during node deletion reco… #258

Open
SeeyaVhora wants to merge 1 commit into kubernetes-sigs:main from SeeyaVhora:fix/stale-status-reconciliation

Conversation

@SeeyaVhora

Summary

This PR fixes a stale status lifecycle issue during node deletion reconciliation in the NodeReadinessRule controller.

Previously, deleted nodes could linger in the NodeEvaluations status field temporarily, and in FailedNodes indefinitely, because of the order in which reconciliation patched status and then cleaned it up.

The controller would:

  1. first patch stale status entries to the API,
  2. then execute a secondary cleanup patch to remove them.

This created:

  • redundant status patching,
  • unnecessary etcd/API churn,
  • temporary stale-status persistence windows,
  • and indefinite persistence of deleted nodes in FailedNodes.

Root Cause

cleanupDeletedNodes() executed after updateRuleStatus() and independently patched the API using its own retry loop.

As a result:

  • stale deleted-node entries were written during the first status update,
  • cleanup correctness depended on a second reconciliation patch,
  • and FailedNodes entries for deleted nodes were never cleaned consistently.

Changes

Reconciliation Ordering Fix

Reordered reconciliation flow so cleanupDeletedNodes() executes before updateRuleStatus().

This ensures the single authoritative status patch never contains stale deleted-node entries.
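The effect of the reordering can be illustrated with a small simulation. This is a minimal sketch, not the controller's code: `Status`, `recorder`, `prune`, `reconcileOld`, and `reconcileNew` are hypothetical names, and the recorder merely stands in for the API server by storing each status patch it receives.

```go
package main

import "fmt"

// Status is a simplified, hypothetical stand-in for the rule status.
type Status struct {
	NodeEvaluations []string // names of nodes with recorded evaluations
}

// recorder is a hypothetical stand-in for the API server: it keeps a copy
// of every status patch it receives so the flows can be compared.
type recorder struct {
	patches [][]string
}

func (r *recorder) patch(s Status) {
	r.patches = append(r.patches, append([]string(nil), s.NodeEvaluations...))
}

// prune drops entries whose node is no longer live, in memory only.
func prune(s *Status, live map[string]bool) {
	kept := make([]string, 0, len(s.NodeEvaluations))
	for _, n := range s.NodeEvaluations {
		if live[n] {
			kept = append(kept, n)
		}
	}
	s.NodeEvaluations = kept
}

// reconcileOld models the previous ordering: patch first, then clean up
// and patch a second time.
func reconcileOld(s Status, live map[string]bool, r *recorder) {
	r.patch(s)
	prune(&s, live)
	r.patch(s)
}

// reconcileNew models the reordered flow: clean up in memory, then issue
// the single status patch.
func reconcileNew(s Status, live map[string]bool, r *recorder) {
	prune(&s, live)
	r.patch(s)
}

func main() {
	live := map[string]bool{"node-a": true} // node-b has been deleted
	oldRec, newRec := &recorder{}, &recorder{}
	reconcileOld(Status{NodeEvaluations: []string{"node-a", "node-b"}}, live, oldRec)
	reconcileNew(Status{NodeEvaluations: []string{"node-a", "node-b"}}, live, newRec)
	fmt.Println("old flow patches:", oldRec.patches)
	fmt.Println("new flow patches:", newRec.patches)
}
```

In this model the old flow records two patches, the first of which still contains the deleted node, while the new flow records one patch with only the live node.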

In-Memory Status Cleanup

Refactored cleanupDeletedNodes() to mutate the in-memory rule.Status state directly instead of performing an independent GET/PATCH retry loop.

Persistence is now fully owned by updateRuleStatus().

Lifecycle Consistency

Extended deleted-node cleanup semantics to FailedNodes in addition to NodeEvaluations, ensuring both status fields remain consistent during node churn and autoscaling scenarios.
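An in-memory cleanup that covers both fields might look roughly like the following. This is a sketch under stated assumptions: the `RuleStatus`, `NodeEvaluation`, and `NodeFailure` types are simplified, hypothetical versions of the real API types, `keepLive` is an illustrative helper, and the actual `cleanupDeletedNodes()` signature in the controller may differ.

```go
package main

import "fmt"

// Hypothetical, simplified status types; the real API types likely carry
// more fields than a node name.
type NodeEvaluation struct{ NodeName string }
type NodeFailure struct{ NodeName, Reason string }

type RuleStatus struct {
	NodeEvaluations []NodeEvaluation
	FailedNodes     []NodeFailure
}

// keepLive returns the items whose node name is in live, preserving order.
// It is a single O(N) pass and allocates a fresh slice, so the caller's
// original backing array is not modified.
func keepLive[T any](items []T, name func(T) string, live map[string]struct{}) []T {
	kept := make([]T, 0, len(items))
	for _, it := range items {
		if _, ok := live[name(it)]; ok {
			kept = append(kept, it)
		}
	}
	return kept
}

// cleanupDeletedNodes mutates status in memory only (no GET or PATCH) and
// reports whether any entry was removed. Persisting the result is left to
// whichever function owns the status write.
func cleanupDeletedNodes(status *RuleStatus, live map[string]struct{}) bool {
	before := len(status.NodeEvaluations) + len(status.FailedNodes)
	status.NodeEvaluations = keepLive(status.NodeEvaluations,
		func(e NodeEvaluation) string { return e.NodeName }, live)
	status.FailedNodes = keepLive(status.FailedNodes,
		func(f NodeFailure) string { return f.NodeName }, live)
	return len(status.NodeEvaluations)+len(status.FailedNodes) != before
}

func main() {
	status := RuleStatus{
		NodeEvaluations: []NodeEvaluation{{NodeName: "node-a"}, {NodeName: "node-b"}},
		FailedNodes:     []NodeFailure{{NodeName: "node-b", Reason: "example failure"}},
	}
	live := map[string]struct{}{"node-a": {}} // node-b has been deleted

	changed := cleanupDeletedNodes(&status, live)
	fmt.Println(changed, len(status.NodeEvaluations), len(status.FailedNodes))
}
```

Applying the same filter to both slices keeps the two fields consistent with each other, and a second call on already-clean status is a no-op, which fits the idempotency the PR aims to preserve.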

Regression Coverage

Added deterministic regression coverage validating that deleted nodes are absent from the synchronous post-reconcile status state.
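The kind of assertion such coverage makes can be sketched as a plain function that lists any status entries referring to nodes that no longer exist. This is illustrative only and is not the test added in this PR: `staleEntries` and the string-based `RuleStatus` are hypothetical simplifications.

```go
package main

import (
	"fmt"
	"sort"
)

// RuleStatus is a hypothetical, simplified status holding node names only.
type RuleStatus struct {
	NodeEvaluations []string
	FailedNodes     []string
}

// staleEntries returns the sorted, de-duplicated node names that appear in
// either status field but are absent from the live set. A regression check
// would expect this to be empty right after reconcile.
func staleEntries(status RuleStatus, live map[string]bool) []string {
	seen := map[string]bool{}
	for _, names := range [][]string{status.NodeEvaluations, status.FailedNodes} {
		for _, n := range names {
			if !live[n] {
				seen[n] = true
			}
		}
	}
	out := make([]string, 0, len(seen))
	for n := range seen {
		out = append(out, n)
	}
	sort.Strings(out)
	return out
}

func main() {
	live := map[string]bool{"node-a": true} // node-b has been deleted
	cases := []struct {
		name   string
		status RuleStatus
	}{
		{"clean status", RuleStatus{NodeEvaluations: []string{"node-a"}}},
		{"stale evaluation", RuleStatus{NodeEvaluations: []string{"node-a", "node-b"}}},
		{"stale failure", RuleStatus{FailedNodes: []string{"node-b"}}},
	}
	for _, c := range cases {
		fmt.Printf("%s: stale=%v\n", c.name, staleEntries(c.status, live))
	}
}
```

Checking the in-memory status immediately after reconcile, rather than polling the API, is what makes a check like this deterministic.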


Before vs After

| Behavior | Before | After |
| --- | --- | --- |
| Status patch flow | Stale entries patched first, cleanup patched later | Cleaned state patched once |
| cleanupDeletedNodes() behavior | Independent GET + PATCH retry loop | In-memory cleanup only |
| Status ownership | Multiple status patch paths | Single authoritative status writer |
| FailedNodes cleanup | Deleted nodes could persist indefinitely | Deleted nodes cleaned consistently |
| API patching during cleanup | Multiple status patches | Redundant cleanup patch removed |
| Reconciliation consistency | Temporary stale-status persistence possible | Stale deleted-node entries never persisted |
| Complexity | O(N) | O(N) |
| Controller architecture | Correct but patch-heavy | Preserved with cleaner lifecycle semantics |

Impact

  • Eliminates redundant cleanup status patching during node deletion reconciliation while ensuring stale deleted-node entries are never persisted to status.
  • Preserves reconciliation idempotency and controller ownership boundaries.
  • Maintains existing time/space complexity characteristics.
  • Improves lifecycle consistency for high-churn autoscaling environments.

@netlify

netlify Bot commented May 16, 2026

Deploy Preview for node-readiness-controller canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | f0a4cf2 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/node-readiness-controller/deploys/6a08ea2f99e8b0000808e8ac |

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SeeyaVhora
Once this PR has been reviewed and has the lgtm label, please assign tallclair for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@linux-foundation-easycla

linux-foundation-easycla Bot commented May 16, 2026

CLA Signed
The committers listed above are authorized under a signed CLA.

  • ✅ login: SeeyaVhora / name: Seeya Vhora (f0a4cf2)

@k8s-ci-robot
Contributor

Welcome @SeeyaVhora!

It looks like this is your first PR to kubernetes-sigs/node-readiness-controller 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/node-readiness-controller has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label May 16, 2026
@k8s-ci-robot
Contributor

Hi @SeeyaVhora. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 16, 2026
@ajaysundark
Contributor

/cc @ajaysundark

/ok-to-test

@k8s-ci-robot k8s-ci-robot requested a review from ajaysundark May 17, 2026 01:03
@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 17, 2026
@ajaysundark
Contributor

@SeeyaVhora Can you start by creating an issue describing in detail how to reproduce the symptoms you aim to fix here?

@SeeyaVhora
Author

> @SeeyaVhora Can you start by creating an issue describing in detail how to reproduce the symptoms you aim to fix here?

Thank you for your response, @ajaysundark.
I created issue #261 with detailed reproduction steps, observed behavior, and root-cause analysis for the stale deleted-node status lifecycle issue.

Please let me know if any additional details or adjustments would be helpful.

@SeeyaVhora SeeyaVhora changed the title fix(controller): clean stale status entries during node deletion reco… fix(controller): clean stale status entries during node deletion reco… Fixes #261 May 17, 2026
@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label May 17, 2026
@SeeyaVhora SeeyaVhora changed the title fix(controller): clean stale status entries during node deletion reco… Fixes #261 fix(controller): clean stale status entries during node deletion reco… May 17, 2026
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label May 17, 2026
@SeeyaVhora
Author

Hello @ajaysundark @mrunalp @SergeyKanzhelev,

Please take a look at this PR and review it.
Let me know if any changes are needed. Details are in issue #261.

