Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MHC created CR isn't deleted by the remediator #347

Merged
merged 4 commits into from
Mar 11, 2025

Conversation

mshitrit
Copy link
Member

@mshitrit mshitrit commented Sep 9, 2024

Why we need this PR

When a CR is created by MHC it's not deleted by the remediator (SNR) in case the remediator supports multiple templates.
The reason is that when MHC creates the CR it creates a generated name for the CR which does not match the machine name.
When the same apply for node base remediation the additional (node name and template name) info is stored in annotations on the CR, but for machine based remediation the remediator doesn't access those annotations.

Changes made

  • When creating an MHC remediation CR using the machine name and not generated name.
  • Multiple support annotation "NodeName" will now hold the correct node name (instead of the machine name) or empty value in case there is an issue getting the node name.

Which issue(s) this PR fixes

ECOPROJECT-2077

Test plan

Added a regression test

This is blocked by ECOPROJECT-2187
Context:

  • this PR fixes an issue (MHC didn't work for remediators which support multiple-same-kind remediation)
  • but it also uncovers another issue: MHC isn't watching remediation CRs, so it doesn't delete the node lease when the finalizer on a remediation CR is removed
  • because of the latter, e2e test on the PR can't get green (see 4.16 test), so this PR can't be merged without the watch issue being fixed

Copy link
Contributor

openshift-ci bot commented Sep 9, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the approved label Sep 9, 2024
@mshitrit mshitrit changed the title MHC created CR isn't deleted by the remediator [WIP] MHC created CR isn't deleted by the remediator Sep 9, 2024
@mshitrit
Copy link
Member Author

mshitrit commented Sep 9, 2024

/test ?

Copy link
Contributor

openshift-ci bot commented Sep 9, 2024

@mshitrit: The following commands are available to trigger required jobs:

  • /test 4.12-ci-bundle-my-bundle
  • /test 4.12-images
  • /test 4.12-openshift-e2e
  • /test 4.12-test
  • /test 4.13-ci-bundle-my-bundle
  • /test 4.13-images
  • /test 4.13-openshift-e2e
  • /test 4.13-test
  • /test 4.14-ci-bundle-my-bundle
  • /test 4.14-images
  • /test 4.14-openshift-e2e
  • /test 4.14-test
  • /test 4.15-ci-bundle-my-bundle
  • /test 4.15-images
  • /test 4.15-openshift-e2e
  • /test 4.15-test
  • /test 4.16-ci-bundle-my-bundle
  • /test 4.16-images
  • /test 4.16-openshift-e2e
  • /test 4.16-test
  • /test 4.17-ci-bundle-my-bundle
  • /test 4.17-images
  • /test 4.17-openshift-e2e
  • /test 4.17-test

Use /test all to run all jobs.

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mshitrit
Copy link
Member Author

mshitrit commented Sep 9, 2024

/test 4.16-openshift-e2e

1. CR is created with regular (non generated) machine name when using a multiple support template
2. Multiple template Annotation placed on the CR contains the correct node/template name

Signed-off-by: Michael Shitrit <[email protected]>
This severs two purpuses:
- It will be set on NodeName annotation instead of previouslly set machine Name for MHC use case.
- Indication that for MHC remediation, in which case the normal (non generated) Machine name is set as the CR name

Signed-off-by: Michael Shitrit <[email protected]>
@mshitrit mshitrit force-pushed the wrong-cr-name-machine branch from 4d047c2 to e96b911 Compare September 10, 2024 11:10
@mshitrit
Copy link
Member Author

/test 4.16-openshift-e2e

…t machine name when checking NodeName annotation.

Signed-off-by: Michael Shitrit <[email protected]>
@mshitrit
Copy link
Member Author

/test 4.16-openshift-e2e

@mshitrit
Copy link
Member Author

/test 4.16-openshift-e2e

slintes
slintes previously approved these changes Sep 11, 2024
@openshift-ci openshift-ci bot added lgtm and removed lgtm labels Sep 11, 2024
@mshitrit
Copy link
Member Author

/test 4.16-openshift-e2e

@mshitrit
Copy link
Member Author

/test 4.17-openshift-e2e

1 similar comment
@mshitrit
Copy link
Member Author

mshitrit commented Mar 3, 2025

/test 4.17-openshift-e2e

Copy link
Member

@slintes slintes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed and even noted in the description, this is blocked by https://issues.redhat.com//browse/ECOPROJECT-2187, running tests does not make any sense IMHO

@slintes slintes dismissed their stale review March 3, 2025 13:01

see above

@mshitrit
Copy link
Member Author

mshitrit commented Mar 3, 2025

as discussed and even noted in the description, this is blocked by https://issues.redhat.com//browse/ECOPROJECT-2187, running tests does not make any sense IMHO

You are probably right, but for some reason I had a vague memory that before the SNR issue we had green e2e for this PR.
I'm probably wrong, but wanted to scratch that off.

@slintes
Copy link
Member

slintes commented Mar 8, 2025

I'd like to understand why the e2e test is green actually. I would expect that it fails, because the MHC controller does not watch remediation CRs, so when the finalizer is removed by the remediator, no reconcile is triggered for lease deletion.

The test logs are suspicious IMHO, there is a 1,5 minute delay between CR deletion and lease deletion. The latter should happen much quicker:

MHC test, see time diff between lease duration step start, and exiting the test

  �[1mSTEP:�[0m waiting for healthy node condition �[38;5;243m- /go/src/github.com/medik8s/node-healthcheck-operator/e2e/mhc_e2e_test.go:144 @ 03/03/25 13:45:36.705�[0m
  �[1mSTEP:�[0m waiting for remediation CR deletion, else cleanup fails �[38;5;243m- /go/src/github.com/medik8s/node-healthcheck-operator/e2e/mhc_e2e_test.go:147 @ 03/03/25 13:46:36.013�[0m
  �[1mSTEP:�[0m ensuring lease removed �[38;5;243m- /go/src/github.com/medik8s/node-healthcheck-operator/e2e/mhc_e2e_test.go:153 @ 03/03/25 13:46:36.062�[0m
  < Exit �[1m[It]�[0m remediates a host �[38;5;243m- /go/src/github.com/medik8s/node-healthcheck-operator/e2e/mhc_e2e_test.go:121 @ 03/03/25 13:48:02.005 (3m38.504s)�[0m

NHC test for comparison, see time diff between lease duration step start, and CR deletion step start

  �[1mSTEP:�[0m waiting for healthy node �[38;5;243m- /go/src/github.com/medik8s/node-healthcheck-operator/e2e/nhc_e2e_test.go:246 @ 03/03/25 13:31:19.06�[0m
  �[1mSTEP:�[0m ensuring lease is removed �[38;5;243m- /go/src/github.com/medik8s/node-healthcheck-operator/e2e/nhc_e2e_test.go:249 @ 03/03/25 13:32:19.459�[0m
  �[1mSTEP:�[0m waiting for remediation CR deletion, else cleanup fails �[38;5;243m- /go/src/github.com/medik8s/node-healthcheck-operator/e2e/nhc_e2e_test.go:257 @ 03/03/25 13:32:24.561�[0m
  < Exit �[1m[It]�[0m Remediates a host �[38;5;243m- /go/src/github.com/medik8s/node-healthcheck-operator/e2e/nhc_e2e_test.go:199 @ 03/03/25 13:32:24.61 (3m20.229s)�[0m

I think it's coincidence that the last e2e test run succeeded, something else must have triggered reconcile, and once again increasing test timeouts without a reason shadows an issue... Unfortunately pod logs gathering failed, so we can't check NHC logs. Let's try again.

/test 4.18-openshift-e2e

@slintes
Copy link
Member

slintes commented Mar 8, 2025

unfortunately MHC logs aren't helpful, they don't reveal what's triggering the reconcile. However, there still is too long delay between CR deletion, and lease deletion test step succeeding. I assume a Node or Machine update is triggering the MHC reconcile....

@mshitrit
Copy link
Member Author

mshitrit commented Mar 9, 2025

I assume a Node or Machine update is triggering the MHC reconcile....

This is a fair assumption, however I think this shouldn't be blocked for merge by ECOPROJECT-2187

  • IIUC This PR does no introduce any new issues
    • MHC not watching the remediation is valid issue indeed, but IIUC it applies regardless, whether multiple templates are used or a single template is used, so it isn't affected or created by this PR.
  • It does improves MHC dealing with multiple templates - it'll make sure the workload is rescheduled (even if the node compute isn't necessarily recovered).
  • Since this PR does not create new issues, does not effect the e2e tests (and improves some functionality) I think we can unblock it and fix the watch issue as part of ECOPROJECT-2187

@slintes
Copy link
Member

slintes commented Mar 10, 2025

This PR does no introduce any new issues

It does, it hides the issue by increasing the test timeout. Please revert that, then we can merge.

@mshitrit mshitrit force-pushed the wrong-cr-name-machine branch from e51cac5 to b45a5c5 Compare March 10, 2025 08:42
@openshift-ci openshift-ci bot added the lgtm label Mar 10, 2025
Copy link
Contributor

openshift-ci bot commented Mar 10, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mshitrit, slintes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mshitrit mshitrit marked this pull request as ready for review March 11, 2025 08:35
@openshift-ci openshift-ci bot requested review from razo7 and slintes March 11, 2025 08:35
@mshitrit mshitrit changed the title [WIP] MHC created CR isn't deleted by the remediator MHC created CR isn't deleted by the remediator Mar 11, 2025
Copy link
Contributor

openshift-ci bot commented Mar 11, 2025

@mshitrit: /override requires failed status contexts, check run or a prowjob name to operate on.
The following unknown contexts/checkruns were given:

  • 4.14-openshift-e2e

Only the following failed contexts/checkruns were expected:

  • ci/prow/4.14-ci-bundle-my-bundle
  • ci/prow/4.14-images
  • ci/prow/4.14-openshift-e2e
  • ci/prow/4.14-test
  • ci/prow/4.15-ci-bundle-my-bundle
  • ci/prow/4.15-images
  • ci/prow/4.15-openshift-e2e
  • ci/prow/4.15-test
  • ci/prow/4.16-ci-bundle-my-bundle
  • ci/prow/4.16-images
  • ci/prow/4.16-openshift-e2e
  • ci/prow/4.16-test
  • ci/prow/4.17-ci-bundle-my-bundle
  • ci/prow/4.17-images
  • ci/prow/4.17-openshift-e2e
  • ci/prow/4.17-test
  • ci/prow/4.18-ci-bundle-my-bundle
  • ci/prow/4.18-images
  • ci/prow/4.18-openshift-e2e
  • ci/prow/4.18-test
  • pull-ci-medik8s-node-healthcheck-operator-main-4.14-ci-bundle-my-bundle
  • pull-ci-medik8s-node-healthcheck-operator-main-4.14-images
  • pull-ci-medik8s-node-healthcheck-operator-main-4.14-openshift-e2e
  • pull-ci-medik8s-node-healthcheck-operator-main-4.14-test
  • pull-ci-medik8s-node-healthcheck-operator-main-4.15-ci-bundle-my-bundle
  • pull-ci-medik8s-node-healthcheck-operator-main-4.15-images
  • pull-ci-medik8s-node-healthcheck-operator-main-4.15-openshift-e2e
  • pull-ci-medik8s-node-healthcheck-operator-main-4.15-test
  • pull-ci-medik8s-node-healthcheck-operator-main-4.16-ci-bundle-my-bundle
  • pull-ci-medik8s-node-healthcheck-operator-main-4.16-images
  • pull-ci-medik8s-node-healthcheck-operator-main-4.16-openshift-e2e
  • pull-ci-medik8s-node-healthcheck-operator-main-4.16-test
  • pull-ci-medik8s-node-healthcheck-operator-main-4.17-ci-bundle-my-bundle
  • pull-ci-medik8s-node-healthcheck-operator-main-4.17-images
  • pull-ci-medik8s-node-healthcheck-operator-main-4.17-openshift-e2e
  • pull-ci-medik8s-node-healthcheck-operator-main-4.17-test
  • pull-ci-medik8s-node-healthcheck-operator-main-4.18-ci-bundle-my-bundle
  • pull-ci-medik8s-node-healthcheck-operator-main-4.18-images
  • pull-ci-medik8s-node-healthcheck-operator-main-4.18-openshift-e2e
  • pull-ci-medik8s-node-healthcheck-operator-main-4.18-test
  • tide

If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.

In response to this:

Tests are failing due to ECOPROJECT-2187
/override 4.14-openshift-e2e
/override ci/prow/4.15-openshift-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mshitrit
Copy link
Member Author

Tests are failing due to ECOPROJECT-2187

/override ci/prow/4.14-openshift-e2e
/override ci/prow/4.15-openshift-e2e
/override ci/prow/4.16-openshift-e2e
/override ci/prow/4.17-openshift-e2e
/override ci/prow/4.18-openshift-e2e

Copy link
Contributor

openshift-ci bot commented Mar 11, 2025

@mshitrit: Overrode contexts on behalf of mshitrit: ci/prow/4.14-openshift-e2e, ci/prow/4.15-openshift-e2e, ci/prow/4.16-openshift-e2e, ci/prow/4.17-openshift-e2e, ci/prow/4.18-openshift-e2e

In response to this:

Tests are failing due to ECOPROJECT-2187

/override ci/prow/4.14-openshift-e2e
/override ci/prow/4.15-openshift-e2e
/override ci/prow/4.16-openshift-e2e
/override ci/prow/4.17-openshift-e2e
/override ci/prow/4.18-openshift-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot bot merged commit 4bfd53b into medik8s:main Mar 11, 2025
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants