Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GracefulShutdown upgrade may get stuck in between versions #2230

Open
Crumby opened this issue Mar 14, 2025 · 2 comments
Open

GracefulShutdown upgrade may get stuck in between versions #2230

Crumby opened this issue Mar 14, 2025 · 2 comments

Comments

@Crumby
Copy link
Collaborator

Crumby commented Mar 14, 2025

There's a slight chance for 'TestOperandUpgrades' to fail in between individual version upgrades due to new StatefulSet version not being created.

Operator log:
infinispan-operator-controller-manager-5776c64f76-l889k.log

Test log:

2025/03/14 19:22:43 INFO GET: https://test-operand-upgrades-external-datagrid.apps.paris.rhdg-qe.rhocf-dev.net:443/rest/v2/caches/volatileCache?action=size
2025/03/14 19:23:06 INFO GET: https://test-operand-upgrades-external-datagrid.apps.paris.rhdg-qe.rhocf-dev.net:443/rest/v2/container/health/status
2025/03/14 19:23:07 INFO GET: https://test-operand-upgrades-external-datagrid.apps.paris.rhdg-qe.rhocf-dev.net:443/rest/v2/caches/persistentCache?action=size
2025/03/14 19:23:07 INFO Next Operand 15.0.3
2025/03/14 19:23:09 INFO Upgrading Operand to 15.0.3
2025/03/14 19:23:38 INFO Upgrade complete
2025/03/14 19:23:42 INFO GET: https://test-operand-upgrades-external-datagrid.apps.paris.rhdg-qe.rhocf-dev.net:443/rest/v2/container/health/status
2025/03/14 19:23:42 INFO GET: https://test-operand-upgrades-external-datagrid.apps.paris.rhdg-qe.rhocf-dev.net:443/rest/v2/caches/persistentCache?action=size
2025/03/14 19:23:55 INFO GET: https://test-operand-upgrades-external-datagrid.apps.paris.rhdg-qe.rhocf-dev.net:443/rest/v2/caches/volatileCache?action=size
2025/03/14 19:24:14 INFO GET: https://test-operand-upgrades-external-datagrid.apps.paris.rhdg-qe.rhocf-dev.net:443/rest/v2/container/health/status
2025/03/14 19:24:14 INFO GET: https://test-operand-upgrades-external-datagrid.apps.paris.rhdg-qe.rhocf-dev.net:443/rest/v2/caches/persistentCache?action=size
2025/03/14 19:24:14 INFO Next Operand 15.0.4
2025/03/14 19:24:16 INFO Upgrading Operand to 15.0.4
2025/03/14 19:34:17 ERROR timed out waiting for the condition
2025/03/14 19:34:18 ERROR Encountered error: previous terminated container "infinispan" in pod "14-0-13" not found
2025/03/14 19:34:19 ERROR Encountered error: previous terminated container "infinispan" in pod "14-0-17" not found
...
2025/03/14 19:34:22 ERROR Encountered error: unable to write metrics to file for pod '15-0-3:'datagrid': pod not running. status.phase='Succeeded'
2025/03/14 19:34:22 ERROR Encountered error: unable to write metrics to file for pod 'upgrade-backup:'datagrid': pod not running. status.phase='Succeeded'
goroutine 303 [running]:
runtime/debug.Stack()
	/usr/lib/golang/src/runtime/debug/stack.go:24 +0x5e
github.com/infinispan/infinispan-operator/test/e2e/utils.TestKubernetes.CleanNamespaceAndLogWithPanic({0xc000d8ecc0}, 0xc0000d2820, {0xc00004c012, 0x8}, {0x15ca3a0, 0xc00058eee0})
	/home/pdrobek/git/infinispan-operator/test/e2e/utils/kubernetes.go:226 +0x1825
github.com/infinispan/infinispan-operator/test/e2e/utils.TestKubernetes.CleanupOLMTest({0x43fce0?}, 0xc0000d2820, {0xc000c22618, 0x13}, {0x186a191, 0x13}, {0xc00004a197, 0x13}, {0xc00004c035, 0xa})
	/home/pdrobek/git/infinispan-operator/test/e2e/utils/olm.go:178 +0x178
panic({0x15ca3a0?, 0xc00058eee0?})
	/usr/lib/golang/src/runtime/panic.go:770 +0x132
github.com/infinispan/infinispan-operator/test/e2e/utils.ExpectNoError(...)
	/home/pdrobek/git/infinispan-operator/test/e2e/utils/asserts.go:21
github.com/infinispan/infinispan-operator/test/e2e/utils.TestKubernetes.WaitForInfinispanStateWithTimeout({0xc000123520?}, {0xc000b9bb78, 0x15}, {0xc000bfbb48, 0x8}, 0x8bb2c97000, 0xc0006c3e30)
	/home/pdrobek/git/infinispan-operator/test/e2e/utils/kubernetes.go:697 +0x116
github.com/infinispan/infinispan-operator/test/e2e/utils.TestKubernetes.WaitForInfinispanState(...)
	/home/pdrobek/git/infinispan-operator/test/e2e/utils/kubernetes.go:682
github.com/infinispan/infinispan-operator/test/e2e/upgrade.TestOperandUpgrades(0xc0000d2820)
	/home/pdrobek/git/infinispan-operator/test/e2e/upgrade/upgrade_operands_test.go:88 +0xcb1
testing.tRunner(0xc0000d2820, 0x194a750)
	/usr/lib/golang/src/testing/testing.go:1689 +0xfb
created by testing.(*T).Run in goroutine 1
	/usr/lib/golang/src/testing/testing.go:1742 +0x390

    kubernetes.go:227: timed out waiting for the condition 
        
2025/03/14 19:34:29 ERROR Encountered error: previous terminated container "manager" in pod "infinispan-operator-controller-manager-5776c64f76-l889k" not found
2025/03/14 19:34:30 ERROR Encountered error: previous terminated container "olm-operator" in pod "olm-operator-5569b74676-mdggw" not found
2025/03/14 19:34:30 ERROR Encountered error: previous terminated container "catalog-operator" in pod "catalog-operator-bbcfb64d6-lnsdn" not found
--- FAIL: TestOperandUpgrades (1691.07s)
@Crumby
Copy link
Collaborator Author

Crumby commented Mar 20, 2025

@ryanemerson here is complete content outputted by the failed test: TestOperandUpgrades.zip. This one is quite large as it failed quite lately but the failure can pretty much happen between any two version. What seems to be common is that it seemingly stops reconciling when the reconciler gets to the test-operand-upgrades-cert-secret readiness check. No idea if that on it's own is an issue or the step after.

2025-03-20T08:59:38.354Z	INFO	controllers.Infinispan	Done	{"infinispan": "namespace-for-testing/test-operand-upgrades", "requeue": true, "requeueAfter": "0s", "error": null}
2025-03-20T08:59:38.354Z	DEBUG	events	Warning	{"object": {"kind":"Infinispan","namespace":"namespace-for-testing","name":"test-operand-upgrades","uid":"1ba90c20-0af7-4f26-9757-d6dc5d78891e","apiVersion":"infinispan.org/v1","resourceVersion":"45557"}, "reason": "ResourceNotReady", "message": "Secret resource 'test-operand-upgrades-cert-secret' not ready"}

Crumby pushed a commit to Crumby/infinispan-operator that referenced this issue Mar 20, 2025
Crumby pushed a commit to Crumby/infinispan-operator that referenced this issue Mar 26, 2025
@Crumby
Copy link
Collaborator Author

Crumby commented Mar 26, 2025

I believe I finally managed to fully understand what's happening and why does it seems that the reconciler hits the max wait limit out of nowhere.

First part that needs to be understood is how k8s controller runtime prioritizes reconcilation requests and how it handles the exponential backoff:

  • Each Results which requeues the request without explicit delay (delay == 0), or returns an error, doubles the back off up to a max limit
  • RequeueAfter with delay > 0 should reset the limit (not really that important for this issue)
  • Requeued request exist in "future queue" and will be added to "execution queue" after the delay (or back off)
  • Reconciliation events are always immediately added to "execution queue" and executed immediately once the current reconciliation is finished
  • Each request in each queue exists only once. If the request for a particular resource is already in the queue it gets dropped

Second part that needs to be understood is Infinispan controller configuration and events it listens for, mainly:

  • The reconciler is triggered by all changes to Infinispan resources (including status changes)
  • The reconciler is triggered by all changes to the owned StatefulSet

So what happens during reconciliation is that pipeline is updating Infinispan status and StatefulSet which then trigger new events triggering reconciliation executed immediately and at the same it reques the request which is slowly increasing the back off due to requeue with 0s delay.

Due to reconciler explicitly reacting to all the events, the quickly increasing backoff isn't visible. Even is often reset by explicit delay. So what's happening is that all the reconciliation events triggered by the changes to Infinispan and StatefulSet that are immediately executed also double the backoff and once requed request gets requed again it hits the max time limit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant