(fr-6) Use dedicated startup probe with higher failure threshold #604
Deydra71 wants to merge 1 commit into
The startup probe shared the same configuration as liveness/readiness probes, giving Horizon only ~40s to start. In resource-constrained environments startup exceeds this window, causing CrashLoopBackOff.

Introduce formatStartupProbe() with FailureThreshold: 12 (allowing ~120s for startup), matching the pattern used by Cinder, Glance, and Manila operators.

Signed-off-by: Veronika Fisarova <vfisarov@redhat.com>
Assisted-by: Claude Opus 4.6
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Deydra71

The full list of commands accepted by this bot can be found here.
Build failed (check pipeline). Post ❌ openstack-k8s-operators-content-provider FAILURE in 9m 45s
recheck
    func formatStartupProbe() *corev1.Probe {
        return &corev1.Probe{
            TimeoutSeconds: 5,
@Deydra71 o/
We recently introduced a new module [1] to provide an interface for probes.
Using cinder here as just a simple example [2] that shows how to use the interface, you can basically create a ProbeSet struct with something like:

    apiProbes, err := probes.CreateProbeSet(
        int32(cinder.CinderPublicPort),
        &scheme,
        instance.Spec.Override.Probes,
        cinder.GetDefaultProbesAPI(timeout),
    )

or, for more advanced usage, like mariadb does [3], it is possible to also pass a command and the handler type.
In general I think we could improve this code in main and add the probe interface as well, so we can take advantage of the override in case we need to tune this statefulset in production.
Let me know if that aligns with the goal of this patch, otherwise we can create a dedicated follow up that enhances the interface in main and then we backport to fr6 as well.
[1] github.com/openstack-k8s-operators/lib-common/modules/common/probes
[2] https://github.com/openstack-k8s-operators/cinder-operator/blob/main/internal/cinderapi/statefuleset.go#L56C2-L62C3
[3] https://github.com/openstack-k8s-operators/mariadb-operator/blob/main/internal/mariadb/statefulset.go#L138
I see, I overlooked the lib-common module :/
Looking at the examples I think it's pretty straightforward to update it in main, and then create the backport only from the new one (and close this one).
I can work on it this week. @fmount Is there any Jira tracking the implementation of probes.OverrideSpec across the control plane? So far I could find support only in the storage operators and mariadb.
but keep in mind that this will be a CRD change. We cannot just backport it and expect it to show up without an openstack-operator change. We need to plan for when it has to be released; otherwise we have to do a short-term update and a longer-term transition to this.
Hi,
> Is there any Jira tracking the implementation of probes.OverrideSpec across the control plane? So far I could find support only in the storage operators and mariadb.

Because the interface has grown over time, the idea is to add overrides only where we need them, and I assume we can create stories under https://redhat.atlassian.net/browse/OSPRH-2490 to track the work.
I agree that we need to coordinate to make sure it will be available on a specific maintenance release and bump the openstack-operator (fr6) to get the CRD change as well.
For cinder we have an associated bug that will go out soon, not sure we want to take a similar approach, or we just close 2490 w/ FR6 and work on dedicated items (e.g. a new horizon bug that creates the bug-epic and the associated stream).
Note: The startup timeout is a recurring issue in SKMO CI runs that deploy Horizon in the main region.