Pebble does not try to restart service if exited too quickly for the first time #240

Open
beliaev-maksim opened this issue Jun 20, 2023 · 17 comments · May be fixed by #520
Labels
25.04 Planned for the 25.04 cycle Feature A feature request Needs Design Needs a design or spec

Comments

@beliaev-maksim
Member

Hello,
we face the following issue: our charm's service depends on another service (a second charm) being alive. When we deploy the bundle, we cannot guarantee that the second charm comes up before the main charm.

What happens:

  1. Charm 1 installs first
  2. Charm 1 tries to start its Pebble service
  3. Charm 2 is not yet alive
  4. The service exits with an error because it tries to connect to charm 2, which is not alive ("exited too quickly")
  5. Charm 2 comes up
  6. Charm 1 never recovers, since Pebble tries to start the service only once

We tried to define health checks, but it looks like if the service did not start successfully the first time, the health checks are ignored (a sketch of the kind of layer we mean is below).

Can we control the retry interval from Pebble and force it to try to start the service again?
We do not want to do service management on the charm side and would prefer to rely on Pebble in this scenario.

Please see the related issue in the repo: canonical/kfp-operators#220
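
Roughly, the kind of layer with a health check that we mean looks like this when built from the charm. This is only a sketch: the service name, check name, and URL below are placeholders.

from ops import pebble

# Sketch of a layer with a health check; all names and the URL are placeholders.
LAYER: pebble.LayerDict = {
    "summary": "service with a readiness check on its dependency",
    "services": {
        "mysvc": {
            "override": "replace",
            "command": "mysvc --serve",
            "startup": "enabled",
            # Restart the service whenever the named check starts failing.
            "on-check-failure": {"dependency-ready": "restart"},
        },
    },
    "checks": {
        "dependency-ready": {
            "override": "replace",
            "level": "ready",
            "period": "10s",
            # Placeholder URL for the second charm's service.
            "http": {"url": "http://charm2-service:8080/healthz"},
        },
    },
}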

@jnsgruk
Member

jnsgruk commented Jun 20, 2023

Can we control the retry interval from Pebble and force it to try to start the service again?

There are a bunch of controls you have over the restarting of a service:

services:
    # (Optional) Defines what happens when the service exits with a nonzero
    # exit code. Possible values are: "restart" (default) which restarts
    # the service after the backoff delay, "shutdown" which shuts down and
    # exits the Pebble server, and "ignore" which does nothing further.
    on-failure: restart | shutdown | ignore

    # (Optional) Defines what happens when each of the named health checks
    # fail. Possible values are: "restart" (default) which restarts
    # the service once, "shutdown" which shuts down and exits the Pebble
    # server, and "ignore" which does nothing further.
    on-check-failure:
        <check name>: restart | shutdown | ignore

    # (Optional) Initial backoff delay for the "restart" exit action.
    # Default is half a second ("500ms").
    backoff-delay: <duration>

    # (Optional) After each backoff, the backoff delay is multiplied by
    # this factor to get the next backoff delay. Must be greater than or
    # equal to one. Default is 2.0.
    backoff-factor: <factor>

    # (Optional) Limit for the backoff delay: when multiplying by
    # backoff-factor to get the next backoff delay, if the result is
    # greater than this value, it is capped to this value. Default is
    # half a minute ("30s").
    backoff-limit: <duration>

    # (Optional) The amount of time afforded to this service to handle
    # SIGTERM and exit gracefully before SIGKILL terminates it forcefully.
    # Default is 5 seconds ("5s").
    kill-delay: <duration>

Have you tried any of those? :)

@cjdcordeiro
Collaborator

I believe this is related to the fact that Pebble does not support one-shot services (and maybe it's also a bug, since IMO Pebble should catch the service's errors regardless of its nature). Something similar was initially reported here and implemented via pebble exec, without ever altering the nature of the commands supported for Pebble services.

Although this could be worth a discussion, I think the most immediate workaround for this issue is to force a sleep.

@phoevos

phoevos commented Jun 20, 2023

@jnsgruk, I did try the available options:

  • on-failure seemed the most promising. I tried it (both as the default and by setting it explicitly), but it looks like this is never triggered, because the service is considered inactive (and not failed). I believe this is implied in the description:

    when the service exits with a nonzero exit code, given that the failure did not occur within less than 1 second

  • on-check-failure is the one we have been trying to use along with our health checks. Again, this doesn't work for inactive services (a minimal way to observe this state is sketched below).

  • the backoff options all control the behaviour between restarts. They won't make a difference if, from Pebble's perspective, the service hasn't even started.
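
For what it's worth, this is how that state shows up from the charm side. A minimal sketch, assuming a container named "workload" and a service named "mysvc" (both placeholders), and that the layer defining the service has already been added:

import ops


def service_failed_to_start(charm: ops.CharmBase) -> bool:
    """Return True if the service is not running.

    Sketch only: after a too-quick exit the service is reported as "inactive"
    rather than as failed, which is why on-failure and on-check-failure never
    apply to it.
    """
    container = charm.unit.get_container("workload")  # placeholder container name
    service = container.get_service("mysvc")          # placeholder service name
    return service.current == ops.pebble.ServiceStatus.INACTIVE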

phoevos added a commit to canonical/kfp-operators that referenced this issue Jun 20, 2023
Sleep for 1 second before attempting to start the KFP API server, in
order to avoid issues with Pebble when the corresponding service fails
fast. This is a temporary measure until a fix for canonical/pebble#240
is provided.

Refs #220

Signed-off-by: Phoevos Kalemkeris <[email protected]>
@ca-scribner

Yeah @jnsgruk, as @phoevos mentioned, those controls only apply to services that Pebble has previously started. See this section of the Pebble docs, specifically:

When starting a service, Pebble executes the service's command, and waits 1 second to ensure the command doesn't exit too quickly. Assuming the command doesn't exit within that time window, the start is considered successful, otherwise pebble start will exit with an error

IIRC, a service that exits with an error like this goes into a different state than one that has crashed and needs restarting. Pebble does not try to restart from this state, even if there are health checks.

From reading the docs and the discussion in this thread, it sounds like this is intended behaviour rather than a bug. But to many, this intended behaviour is not expected, and it may be worth revisiting.

@ca-scribner

IMO this is different from one-shot commands. One-shot commands are intended to be a single execution. The problem here is that a long-running service may legitimately error out quickly, and users will expect Pebble to restart it.

This is typical of Kubernetes controllers. Commonly you create a controller and some ConfigMaps that the controller needs at the same time. If the controller wins the race and starts before the ConfigMaps exist, it errors out and Kubernetes restarts it again soon.

@benhoyt
Contributor

benhoyt commented Jun 20, 2023

Yes, @ca-scribner is correct: this is not related to exec (one-shot commands), as @beliaev-maksim is describing a long-running service, just one that exits quickly if it can't connect to a dependency.

This is actually happening by design: when you start a service manually, with pebble start or ops.Container.start(), it waits up to 1 second for the service to be started. If the service exits quickly (within 1s), the start operation returns an error / raises an exception, and the idea is that you'd handle it there. So if you were calling container.start('mysvc') manually in the charm, the charm hook would raise an error and presumably Juju would try it again, so you'd get retry behaviour that way.

It's similar with replan, which is what it looks like you're doing in the charm you linked. You're calling update_layer to add a layer and call replan. This will raise a pebble.ChangeError for a service that exits too quickly, which update_layer turns into an ErrorWithStatus exception. However, this is then caught by the charm, and the hook sets an error status but returns successfully:

        except ErrorWithStatus as err:
            self.model.unit.status = err.status
            self.logger.error(f"Failed to handle {event} with error: {err}")
            return

This means Juju won't retry the hook, because you've exited the Python code successfully. I think the simplest thing to do would be to change the return above to raise, re-raising the error and forcing Juju to retry the hook a bit later, which would do the add-layer/replan again, and hopefully by then the other charm is working. I forget off-hand how quickly and how many times Juju retries running a hook.
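
Concretely, that change would look something like this (a sketch only; the surrounding lines are the charm's existing handler from above):

        except ErrorWithStatus as err:
            self.model.unit.status = err.status
            self.logger.error(f"Failed to handle {event} with error: {err}")
            raise  # re-raise so the hook fails and Juju retries it later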

There are two other options I can think of:

  • Defer the event. Again, this would mean that whenever the next event came through, it would re-do the add-layer/replan. But it's uncertain when the next Juju event is going to come through (the problem with defer).
  • Make the service wait at least 1 second before exiting, even in case of error. This would mean the service is "successfully started" from Pebble's point of view, and Pebble's auto-restart behaviour would then kick in.

I'd lean towards the last option, because then you get Pebble's auto-restart behaviour for free. You could either do the wait in your service's code if you have control over that, or add a simple wrapper script that waits. For example, I just tested it with this, and the service goes into the auto-restart / backoff loop as we want here:

services:
  srv1:
    override: replace
    command: /bin/sh -c "myservice; sleep 1.1"
    startup: enabled

It's possible we should revisit this design. But the current idea is that either the start (or replan) fails and the error is reported that way, or if the service exits later Pebble's auto-restart behaviour kicks in. The problem with your charm's current approach is that it's catching the exception and masking the error.

@beliaev-maksim
Member Author

Hi @benhoyt,

I think the simplest thing to do would be to change the return above to raise

I strongly believe this is against charming best practices. An error in the charm code means we do not know what to do in this scenario, and that is why we failed. Also, do not forget that a user may, in theory, set the model configuration to not retry failed applications.

I'd lean towards the last option, because then you get Pebble's auto-restart behaviour for free.

Yes, I also support this idea more. However, I cannot say you really get it for free. Won't we get an overhead of 1s on each service start then?

The problem with your charm's current approach is that it's catching the exception and masking the error.

I do not think this is the root problem. Again, see the note about unhandled charm exceptions above.
I think it would be better if Pebble tried to start any service regardless of the error raised. It should not lead to any system overload, thanks to the backoff-factor.

If we want to keep the code fully backwards compatible, we can add a new Pebble option that ignores all errors and always tries to restart according to the settings.

@ca-scribner

I'm still digesting the comments about how to make this work in a charm, so I'll come back to those.

Regarding the design choice of handling <1s errors differently from >1s errors: was there a constraint that drove the design that way? The only thing I can think of would be avoiding thrashing a process that will never succeed by restarting it over and over. Was that the issue, or was there something else?

What caught me off guard here, I think, is that I don't know of another process manager that uses a similar pattern. For example, with Kubernetes, a Pod may die instantly (the image cannot be pulled, the process starts and dies, etc.) and it is handled the same as one that dies after hours. This approach could lead to thrashing on restarts, but Kubernetes avoids that with an exponential backoff.

That doesn't mean restarting after a quick death is the best thing to do, but to me it feels like the intuitive thing to do unless something really gets in the way.

@benhoyt
Contributor

benhoyt commented Jun 22, 2023

I strongly believe this is against charming best practices. An error in the charm code means we do not know what to do in this scenario, and that is why we failed.

@beliaev-maksim Yeah, that's fair push-back. Though I did say the "simplest" thing, not the "best". :-) But you're right: charms shouldn't go into error state when they know what's wrong and are really just waiting on something.

I guess my point is that you should handle the start / replan error one way or the other (by having the charm retry, or by deferring the event). You could do something like this:

import ops
from ops import pebble


class MyCharm(ops.CharmBase):
    def _on_config_changed(self, event):
        if not self.ensure_started():  # or perhaps an @ensure_started decorator?
            return
        ...

    def _on_other_hook(self, event):
        if not self.ensure_started():
            return
        ...

    def ensure_started(self) -> bool:
        """Ensure the service is started and return True, or return False if it can't be started."""
        if self.container.get_service("mysvc").is_running():
            return True
        try:
            self.container.replan()
            return True
        except pebble.ChangeError:
            self.unit.status = ops.WaitingStatus("waiting for dependency to start")
            return False

The drawback is you don't know when the next event is going to come in (and right now we have no way of telling Juju, "please wake me up in 10 seconds, will ya", though we might have something equivalent when we get workload events / Pebble Notices).

But I think the other approach of the wrapper script with a sleep 1 is probably a cleaner workaround for your use case.

Won't we get an overhead of 1s on each service start then?

Well, kind of, but only when the service can't start and you need to wait for the dependency anyway. When the service starts properly, it's not an issue because it won't get to the sleep.


So that's my recommendation for now, with the current system. As you point out, this might not be the best design, and we could definitely consider a different approach where the service start always succeeds (or maybe a config option, to keep it backwards-compatible). I think it's a little tricky and requires adding config, so it probably needs a small spec. It's not on my roadmap, so realistically I probably won't get to it soon, but if one of you wants to start with a small OPnnn spec, we could discuss further from there?

Still, that's not going to be discussed and implemented overnight, so you would likely want to use the sleep 1 workaround for now.

@beliaev-maksim
Member Author

@benhoyt yes, for now on 2.9 we go with sleep 1.

In 3.1, as a workaround, I proposed using Juju secrets: set an expiration time and retry in X minutes in the secret-expire hook.
WDYT?

Yes, we can start with a spec proposal, thanks!

@benhoyt
Contributor

benhoyt commented Jun 25, 2023

In 3.1, as a workaround, I proposed using Juju secrets: set an expiration time and retry in X minutes in the secret-expire hook.
WDYT?

To me that sounds much more hacky (and an abuse of Juju Secrets) than a simple sleep 1 in your command.

Yes, we can start with spec proposal, thanks!

Sounds good to me!

phoevos added a commit to canonical/kfp-operators that referenced this issue Jun 29, 2023
Sleep for 1.1 seconds before attempting to start the KFP API server, in
order to avoid issues with Pebble when the corresponding service fails
fast. This is a temporary measure until a fix for canonical/pebble#240
is provided.

Refs #220

Signed-off-by: Phoevos Kalemkeris <[email protected]>
DnPlas pushed a commit to canonical/kfp-operators that referenced this issue Aug 4, 2023
Sleep for 1.1 seconds before attempting to start the KFP API server, in
order to avoid issues with Pebble when the corresponding service fails
fast. This is a temporary measure until a fix for canonical/pebble#240
is provided.

Refs #220

Signed-off-by: Phoevos Kalemkeris <[email protected]>
@benhoyt benhoyt added 24.10 Feature A feature request Needs Design Needs a design or spec labels Mar 13, 2024
@ca-scribner

Following up on this since it's been a while...

The Kubeflow team would appreciate it if this were delivered in 24.10, ideally implementing the same restart procedure whether something dies in <1s or >1s. We're happy to contribute to a spec or help elaborate the problem however you prefer.

For some justification of why this would be helpful: the Kubeflow team has been burned by this a few times recently. It's not that we can't work around it, but that we usually spend hours debugging the wrong thing before we realize this is the root cause. The trouble is that when someone creates a Pebble service with checks and restart settings, it is not intuitive that these only apply some of the time (an example of this is @jnsgruk's interpretation here; probably everyone reading those settings expects restarts to occur even on quick failures). I doubt Kubeflow is the only team facing this; others just might not realize what is happening and don't know this is the cause.

@benhoyt benhoyt changed the title Pebble does not try to restart if exited too quickly for the first time Pebble does not try to restart service if exited too quickly for the first time Apr 18, 2024
@benhoyt
Contributor

benhoyt commented Oct 9, 2024

We haven't forgotten about this. We were going to go ahead and fix it, because it seems like a simple thing to say "Pebble should always auto-restart services", but -- as is often the case -- it's not quite that simple.

Pebble, by design, has the 1-second "okay delay" we all know and love (or hate). This is so that you get an error right away -- from the pebble start command or the container.start() call -- if a service exits immediately. This is great when you've made a mistake in the service config or command-line args: the service would never have started correctly no matter how many automatic retries Pebble did, so you want an error to be raised right away, and you can fix the service config or command line by fixing your charm. In that case the 1s okay delay is a feature (1s is somewhat arbitrary, but it needs to be some shortish amount of time), and a start error means the caller should deal with it.

The problem this feature creates is that when the service is configured correctly but fails in that first 1s because a dependency isn't yet ready -- as in Maksim's original example -- the caller is forced to deal with it when they just want to leave it in Pebble's hands.

For comparison, systemd doesn't have the "okay delay" feature. Even in cases where a service's config or command line is invalid, systemctl start myservice will succeed. If Restart is configured, systemd will happily try to restart a service that's never going to start, over and over. But it will also restart services that will eventually come up.

Right now, you can use the workaround I suggested earlier: configure your service with command: /bin/sh -c "myservice; sleep 1.1". This will effectively get you systemd-like behaviour.
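
Applied from a charm, that workaround looks roughly like this. A sketch only: the container name "workload" and the myservice binary are placeholders, and your charm's event wiring will differ.

import ops


class MyCharm(ops.CharmBase):
    def __init__(self, framework: ops.Framework):
        super().__init__(framework)
        framework.observe(self.on["workload"].pebble_ready, self._on_pebble_ready)

    def _on_pebble_ready(self, event: ops.PebbleReadyEvent):
        layer: ops.pebble.LayerDict = {
            "summary": "sleep workaround layer",
            "services": {
                "myservice": {
                    "override": "replace",
                    # Keep the process alive past Pebble's 1-second "okay delay"
                    # even if it exits immediately, so auto-restart kicks in.
                    "command": '/bin/sh -c "myservice; sleep 1.1"',
                    "startup": "enabled",
                },
            },
        }
        container = event.workload
        container.add_layer("workaround", layer, combine=True)
        container.replan()


if __name__ == "__main__":
    ops.main(MyCharm)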

Arguably the charm should wait until the dependency is ready: if the relation to charm 2 is not ready yet, or the charm 2 workload is not ready yet, we shouldn't start/replan the Pebble service in charm 1. So you could wait for relation-joined, or check the relation, or similar. @DnPlas suggested this type of solution in this comment:

We can mitigate this error if we don't attempt to start the service (or replan it) unless we are sure that minio is active and the object-storage relation exists, which means that we have to always check for the relation before calling update_layer, deferring (or returning before) the update_layer call until the relation is present.

I think that would probably be the appropriate fix here. However, I can see that this complicates the charm a little. So maybe Pebble should have a "no okay delay" or "leave it with me from the beginning" option, either as a service configuration field or an argument to start/replan:

# Option 1: service config field
services:
  svc1:
    override: merge
    command: myservice
    okay-delay: 0  # disable okay delay and have Pebble always auto-restart

# Option 2: argument to start/replan
container.start(okay_delay=0)
container.replan(okay_delay=0)

The downside of course is that the charm author won't get notified if they have a bug in the service config and the service will never start successfully. (Integration tests should catch this, of course.)

Either way, the code change should be fairly straightforward.

I hope to discuss this with Gustavo and others at the sprints, to see if we can agree on which direction to go.

@benhoyt benhoyt added 25.04 Planned for the 25.04 cycle and removed 24.10 labels Oct 17, 2024
@dimaqq

dimaqq commented Oct 17, 2024

The Jira ticket is marked as done; does this mean that this issue is now resolved?

@benhoyt
Contributor

benhoyt commented Oct 17, 2024

@dimaqq No, this is not resolved (we have an idea, but have bumped it to the next cycle). Maybe the Jira ticket represented this issue? In any case, this needs to stay open.

@benhoyt
Contributor

benhoyt commented Oct 31, 2024

Discussion with Gustavo/Harry/Tiexin/Tony:

  • option 1: return an error but put the service in "keep trying to start" mode
  • option 2 (Harry's idea): if you specify startup: enabled it would be like okay-delay=0 above (simpler)

Note: we definitely don't want okay-delay: <x> as a service config option, as the config would then affect the UX (very unexpected).

We'll create a small spec and finalise this early in the cycle.

claudiubelu added a commit to claudiubelu/rockcraft that referenced this issue Nov 4, 2024
By default, rockcraft sets Pebble as the default entrypoint. However, if
the defined service finishes in under 1 second (either successfully or
not), Pebble will ignore the on-success / on-failure settings and keep
running. [1]

This can be problematic for a few types of rocks meant to be used in
Kubernetes. Some workload types are expected to eventually finish and
stop, entering a Completed state, and Pebble can prevent that, leaving
some workloads / operators unable to work as intended. A service manager
like Pebble is not needed in these cases. Some of these cases are:

- Jobs / CronJobs: These are meant as tasks with a finite execution
  time. They will spawn Pods which are not meant to be endlessly running.
- InitContainers: Pods may have these in order to initialize various
  things before the main workload containers start. The workload
  containers are not started until the InitContainers have finished
  executing and entered a Completed state.
- .*ctl type of images (e.g.: kubectl, falcoctl, longhornctl): These can
  be used in previously mentioned cases, but they may also be used
  standalone by the user in ephemeral Pods (kubectl run).

At the moment, the workaround is to forcefully add a 1.1-second sleep to
the Pebble service. However, that may not be an option for bare-based
rocks, as they wouldn't have the sleep command (unless it were sliced
in, needlessly increasing the image size and rock-building complexity).
Plus, it is an unpleasant user experience for people switching to
rockcraft and hitting this issue mostly through trial and error
(building the rock, uploading it, deploying it in a Kubernetes env if
you have one, seeing whether you encounter issues with it, noticing the
issue, investigating, finding the issue, fixing it, starting over).

Alternatively, some Helm charts may have the option to completely
override the image command / entrypoint, bypassing this issue. But a
user would have to *know* about this issue (mostly through trial and
error) and amend the Helm chart values accordingly, which is not great,
considering that rocks should be drop-in replacements for regular
images. And not all Helm charts have this option, in which case the
Helm chart would have to be patched, which is not great either.

[1] canonical/pebble#240 (comment)
@IronCore864
Contributor

IronCore864 commented Nov 7, 2024

Discussion with Gustavo/Harry/Tiexin/Tony:

  • option 1: return an error but put the service in "keep trying to start" mode
  • option 2 (Harry's idea): if you specify startup: enabled it would be like okay-delay=0 above (simpler)

Note: we definitely don't want okay-delay: <x> as a service config option, as the config would then affect the UX (very unexpected).

We'll create a small spec and finalise this early in the cycle.

Based on the discussion above, I created two PoC PRs to demonstrate these two options.

Option 1 PR: #514

Option 2 PR: #513

See more details in the PRs.

I lean towards option 1 because the behaviour is more consistent and the code is simpler. The PoC PR for option 2 isn't optimal (it's only for demonstration purposes), but compared with it, option 1 seems much easier to understand.

Draft spec: https://docs.google.com/document/d/1J4XeP-e_UH5fSIiJ0urO2PWNKAmcNFC6o95igOHsH-c/edit?tab=t.0.

@IronCore864 IronCore864 linked a pull request Nov 15, 2024 that will close this issue