Pebble does not try to restart a service if it exits too quickly the first time #240
There are a bunch of controls you have over the restarting of a service:

services:
# (Optional) Defines what happens when the service exits with a nonzero
# exit code. Possible values are: "restart" (default) which restarts
# the service after the backoff delay, "shutdown" which shuts down and
# exits the Pebble server, and "ignore" which does nothing further.
on-failure: restart | shutdown | ignore
# (Optional) Defines what happens when each of the named health checks
# fail. Possible values are: "restart" (default) which restarts
# the service once, "shutdown" which shuts down and exits the Pebble
# server, and "ignore" which does nothing further.
on-check-failure:
<check name>: restart | shutdown | ignore
# (Optional) Initial backoff delay for the "restart" exit action.
# Default is half a second ("500ms").
backoff-delay: <duration>
# (Optional) After each backoff, the backoff delay is multiplied by
# this factor to get the next backoff delay. Must be greater than or
# equal to one. Default is 2.0.
backoff-factor: <factor>
# (Optional) Limit for the backoff delay: when multiplying by
# backoff-factor to get the next backoff delay, if the result is
# greater than this value, it is capped to this value. Default is
# half a minute ("30s").
backoff-limit: <duration>
# (Optional) The amount of time afforded to this service to handle
# SIGTERM and exit gracefully before SIGKILL terminates it forcefully.
# Default is 5 seconds ("5s").
kill-delay: <duration>

Have you tried any of those? :)
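For charm authors, here is a minimal sketch of setting these fields from Python with ops; the service name, command, and durations are illustrative rather than taken from this issue:

import ops

# Illustrative layer: a long-running service that opts in to Pebble's
# restart/backoff handling (names and values are placeholders).
RESTART_LAYER: ops.pebble.LayerDict = {
    "summary": "example layer",
    "services": {
        "my-service": {
            "override": "replace",
            "command": "/usr/bin/my-server",  # placeholder binary
            "startup": "enabled",
            "on-failure": "restart",   # restart after the backoff delay
            "backoff-delay": "500ms",  # first restart after half a second
            "backoff-factor": 2.0,     # then 1s, 2s, 4s, ...
            "backoff-limit": "30s",    # delay is capped at half a minute
            "kill-delay": "5s",        # SIGTERM grace period before SIGKILL
        },
    },
}


def configure(container: ops.Container) -> None:
    """Apply the layer and (re)start the service via replan."""
    container.add_layer("my-service", RESTART_LAYER, combine=True)
    container.replan()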
I believe this is related to the fact that Pebble does not support one-shot services (and maybe it's also a bug, as IMO, regardless of the service's nature, Pebble should catch its errors). Something similar was initially reported here and implemented via …. Although this could be worth a discussion, I think the most immediate workaround for that issue is to force a …
@jnsgruk, I did try the available options:
Sleep for 1 second before attempting to start the KFP API server, in order to avoid issues with Pebble when the corresponding service fails fast. This is a temporary measure until a fix for canonical/pebble#240 is provided. Refs #220 Signed-off-by: Phoevos Kalemkeris <[email protected]>
Yeah @jnsgruk, as @phoevos mentioned, those controls are applied only to services Pebble has previously started. See this section of the Pebble docs, specifically:
IIRC, a service that exits with an error like this goes into a different state than one that has crashed and needs restarting. Pebble does not try to restart from that state, even if there are health checks. From reading the docs and the discussion in this thread, it sounds like this is intended behaviour rather than a bug. But to many, this intended behaviour is not expected and may be worth revisiting.
IMO this is different from one-shot commands. One-shot commands are intended to be a single execution. The problem here is that a long-running service may legitimately error out quickly, and users will expect Pebble to restart it. This is typical for Kubernetes controllers: commonly you create a controller and some ConfigMaps the controller needs at the same time. If the controller wins the race and starts before the ConfigMaps, it errors out and Kubernetes restarts it again soon.
Yes, @ca-scribner is correct: this is not related to exec (one-shot commands), as @beliaev-maksim is describing a long-running service, just one that exits quickly if it can't connect to a dependency.

This is actually happening by design: when you start a service manually, with … Similar with …

except ErrorWithStatus as err:
    self.model.unit.status = err.status
    self.logger.error(f"Failed to handle {event} with error: {err}")
    return

This means Juju won't retry the hook, because you've exited the Python code successfully. I think the simplest thing to do would be to change the … There are two other options I can think of:
I'd lean towards the last option, because then you get Pebble's auto-restart behaviour for free. You could either do the wait in your service's code if you have control over that, or add a simple wrapper script that waits. For example, I just tested it with this, and the service goes into the auto-restart / backoff loop as we want here:
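The exact snippet tested isn't preserved in this thread; below is a minimal sketch of that kind of wrapper, written in Python with an illustrative binary path. It runs the real server as usual and, only if the server exits with an error within the first second, sleeps past Pebble's "okay delay" before exiting non-zero, so the failure lands in the on-failure / backoff path instead of being reported as a failed start.

#!/usr/bin/env python3
"""Hypothetical wrapper used as the Pebble service command; the path is illustrative."""
import subprocess
import sys
import time

started = time.monotonic()
result = subprocess.run(["/usr/bin/my-server"])  # the real long-running service
if result.returncode != 0 and time.monotonic() - started < 1.1:
    # Exited too fast: wait out Pebble's 1-second okay delay so this counts as
    # a service failure (auto-restart/backoff) rather than a failed start.
    time.sleep(1.1)
sys.exit(result.returncode)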
It's possible we should revisit this design. But the current idea is that either the start (or replan) fails and the error is reported that way, or, if the service exits later, Pebble's auto-restart behaviour kicks in. The problem with your charm's current approach is that it's catching the exception and masking the error.
Hi @benhoyt,
I strongly believe this is against charming best practices. An error in the charm code means we do not know what to do in this scenario, and that is why we failed. Also, do not forget that the user may, in theory, set a model setting to not retry failed applications.

Yes, I also support this idea more. However, I cannot say you really get it for free. Won't we get an overhead of 1s on each service start then?

I do not think this is the root problem. Again, see the note about unhandled charm exceptions above. If we want to make the code fully backwards compatible, we can add a new Pebble option that will ignore all the errors and always try to restart according to the settings.
I'm still digesting the comments about how to make it work in a charm, so I'll come back to those. Regarding the design choice of handling <1s errors differently from >1s errors: was there a constraint that drove the design that way? The only thing I can think of would be avoiding thrashing a process that'll never succeed by restarting it over and over. Was that the issue, or was there something else?

What caught me off guard here, I think, is that I don't know of another process manager that uses a similar pattern. For example, with Kubernetes, a Pod may die instantly (image cannot be pulled, process starts and dies, etc.) and it is handled the same as one that dies after hours. This method could lead to thrashing the restarts, but they avoid that with an exponential backoff. That doesn't mean restarting after a quick death is the best thing to do, but to me it feels like the intuitive thing to do unless something really gets in the way.
@beliaev-maksim Yeah, that's fair push-back. Though I did say "simplest" thing, not "best". :-) But you're right, and charms shouldn't go into error state when they know what's wrong and they're really just waiting on something. I guess my point is that you should either handle the …

import ops
from ops import pebble


class MyCharm(ops.CharmBase):
    def _on_config_changed(self, event):
        if not self.ensure_started():  # or perhaps an @ensure_started decorator?
            return
        ...

    def _on_other_hook(self, event):
        if not self.ensure_started():
            return
        ...

    def ensure_started(self) -> bool:
        """Ensure service is started and return True, or return False if it can't be started."""
        if self.container.get_service("mysvc").is_running():
            return True
        try:
            self.container.replan()
            return True
        except pebble.ChangeError:
            self.unit.status = ops.WaitingStatus("waiting for dependency to start")
            return False

The drawback is you don't know when the next event is going to come in (and right now we have no way of telling Juju, "please wake me up in 10 seconds, will ya", though we might have something equivalent when we get workload events / Pebble Notices). But I think the other approach, the wrapper script with a short sleep, is probably better.
Well, kind of, but only when the service can't start and you need to wait for the dependency anyway. When the service starts properly, it's not an issue because it won't get to the sleep. So that's my recommendation for now, with the current system.

As you point out, this might not be the best design, and we could definitely consider a different approach where the service start always succeeds (or maybe there's a config option, to make it backwards-compatible). I think it's a little tricky, and requires adding config, so it probably needs a small spec. It's not on my roadmap, so realistically I probably won't get to it soon, but if one of you wants to start with a small OPnnn spec, we could discuss further from there? Still, that's not going to be discussed and implemented overnight, so you would likely want to use the wrapper workaround in the meantime.
@benhoyt yes, for now in 2.9 we go with sleep 1. In 3.1, as a workaround, I proposed to use Juju secrets: set an expiration time and retry in X minutes on the secret-expired hook. Yes, we can start with a spec proposal, thanks!
To me that sounds much more hacky (and an abuse of Juju Secrets) than a simple sleep.
Sounds good to me!
Sleep for 1.1 seconds before attempting to start the KFP API server, in order to avoid issues with Pebble when the corresponding service fails fast. This is a temporary measure until a fix for canonical/pebble#240 is provided. Refs #220 Signed-off-by: Phoevos Kalemkeris <[email protected]>
Following up on this since it's been a while... The Kubeflow team would appreciate it if this was delivered in 24.10, ideally implementing the same restart procedure whether something dies in <1s or >1s. We're happy to contribute to a spec or guide however you want to elaborate the problem.

For some justification of why this would be helpful: the Kubeflow team has been burned by this a few times recently. Not that we can't work around it, but we usually spend hours debugging the wrong thing before we realize this is the root cause. The trouble is that when someone creates a Pebble service with checks and restart settings, it is not intuitive that these only apply some of the time (an example of this is @jnsgruk's interpretation here -- probably everyone reading those settings expects restarts will occur even on quick failures). I doubt Kubeflow is the only team facing this, just that others might not realize what is happening and don't know this is the cause.
We haven't forgotten about this. We were going to go ahead and fix it, because it seems like a simple thing to say "Pebble should always auto-restart services", but -- as is often the case -- it's not quite that simple.

Pebble, by design, has the 1-second "okay delay" we all know and love (or hate). This is so that you'll get an error right away -- from the …

The problem this feature creates is when the service is configured correctly, but it fails in that first 1s because a dependency isn't yet ready -- as in Maksim's original example -- then the caller is forced to deal with it when they just want to leave it in Pebble's hands.

For comparison, systemd doesn't have the "okay delay" feature. Even in cases where a service's config or command line is invalid, …

Right now, you can use the workaround I suggested earlier: configure your service with …

Arguably the charm should wait till the dependency is ready: if the relation to charm 2 is not ready yet, or the charm 2 workload is not ready yet, we shouldn't start/replan the Pebble service in charm 1. So you could wait for …
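A minimal sketch of that kind of guard in the charm; the relation name ("backend"), container, and statuses are illustrative and not taken from the charms discussed here:

import ops


class Charm1(ops.CharmBase):
    def _maybe_start(self, container: ops.Container) -> None:
        # Only replan once the relation to charm 2 exists and has remote units,
        # so the service isn't started before its dependency can answer.
        relation = self.model.get_relation("backend")  # relation name is illustrative
        if relation is None or not relation.units:
            self.unit.status = ops.WaitingStatus("waiting for backend to be ready")
            return
        container.replan()
        self.unit.status = ops.ActiveStatus()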
I think that would probably be the appropriate fix here. However, I can see that this complicates the charm a little. So maybe Pebble should have a "no okay delay" or "leave it with me from the beginning" option, either as a service configuration field or an argument to start/replan:
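Purely to illustrate the shape such an option could take -- nothing below exists in Pebble or ops today; both the field name and the keyword argument are invented for this sketch:

# Hypothetical sketch only -- neither of these exists today.

# (a) As a service configuration field in the layer:
hypothetical_layer = {
    "services": {
        "mysvc": {
            "override": "replace",
            "command": "/usr/bin/my-server",
            "on-failure": "restart",
            "okay-delay": "0s",  # invented field: skip the start-up okay delay
        },
    },
}

# (b) As an argument when starting or replanning:
# container.replan(skip_okay_delay=True)  # invented keyword argument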
The downside of course is that the charm author won't get notified if they have a bug in the service config and the service will never start successfully. (Integration tests should catch this, of course.) Either way, the code change should be fairly straightforward. I hope to discuss this with Gustavo and others at the sprints, to see if we can agree on which direction to go.
The Jira ticket is marked done; does this mean that this issue is now resolved?
@dimaqq No, this is not resolved (we have an idea, but have bumped it to the next cycle). Maybe the Jira ticket represented this issue? In any case, this needs to stay open.
Discussion with Gustavo/Harry/Tiexin/Tony: …

Note: we definitely don't want … We'll create a small spec and finalise this early in the cycle.
By default, rockcraft sets Pebble as the default entrypoint. However, if the defined service finishes in under 1 second (either successfully or not), Pebble will ignore the on-success / on-failure settings and still continue running. [1]

This can be problematic for a few types of rocks meant to be used in Kubernetes. Some workload types are expected to eventually finish and stop, entering a Completed state, and Pebble can prevent that, leading to some workloads / operators not being able to work as intended. A service manager like Pebble is not needed in these cases. Some of these cases are:

- Jobs / CronJobs: These are meant as tasks with a finite execution time. They will spawn Pods which are not meant to be endlessly running.
- InitContainers: Pods may have these in order to initialize various things before the main workload containers start. The workload containers are not started until the InitContainers have finished executing and entered a Completed state.
- .*ctl type of images (e.g.: kubectl, falcoctl, longhornctl): These can be used in the previously mentioned cases, but they may also be used standalone by the user in ephemeral Pods (kubectl run).

At the moment, the workaround is to forcefully add a 1.1s sleep to the Pebble service. However, that may not be an option for bare-based rocks, as they wouldn't have the sleep command (unless it were sliced in, needlessly increasing the image size and rock-building complexity). Plus, it is an unpleasant user experience for people switching to rockcraft and hitting this issue mostly through trial and error (build the rock, upload it, deploy it in a Kubernetes environment if you have one, see whether you encounter issues with it, notice the issue, investigate, find the issue, fix it, start over).

Alternatively, some Helm charts may have the option to completely override the image command / entrypoint, bypassing this issue. But a user would have to *know* about this issue (mostly through trial and error) and amend the Helm chart values accordingly, which is not great, considering that rocks should be drop-in replacements for regular images. And not all Helm charts have this option, in which case the Helm chart would have to be patched, which is not great either.

[1] canonical/pebble#240 (comment)
Based on the discussion above, I created two PoC PRs to demonstrate these two options. Option 1 PR: #514. Option 2 PR: #513. See the descriptions in the PRs for more detail. I lean towards option 1 because the behaviour is more consistent and the code is simpler. The PoC PR for option 2 isn't optimal and is only for demonstration purposes, but compared with that, option 1 seems much easier to understand. Draft spec: https://docs.google.com/document/d/1J4XeP-e_UH5fSIiJ0urO2PWNKAmcNFC6o95igOHsH-c/edit?tab=t.0.
Hello,
we face the following issue:
Our charm's service depends on another service (a 2nd charm) being alive. When we deploy the bundle, we cannot guarantee that the 2nd charm will come up before the main charm.
What happens:
We tried to define health checks, but it looks like if the service did not start the first time, then the health checks are ignored.
Can we control the retry interval from Pebble and force it to try to start the service again?
We do not want to do service management on the charm side and prefer to rely on Pebble in this scenario.
Please see the related issue in the repo: canonical/kfp-operators#220