Skip to content

Commit

Permalink
checkstate: make check runners exit cleanly
Browse files Browse the repository at this point in the history
The checks manager uses contexts to cancel check runners. This happens
during checkstate.PlanChanged() when the previous checkers are all
terminated, and the new checker definition is applied.

However, context cancellation is asynchronous (as highlighted in the
context documentation), so the check runners will continue to complete
even after the context cancellations returned.

Due to this race condition, the runner can still launch a check even after
the context cancellation is complete.

This problem is reproduced here:
https://gist.github.com/flotter/f6bf32d7cc33cb1d53a3ddc39fc3b64c

The result of this is easily seen the following unit test failure:

panic: internal error: reaper must be started

goroutine 75 [running]:
github.com/canonical/pebble/internals/reaper.StartCommand
	/home/flotter/pebble/internals/reaper/reaper.go:164
github.com/canonical/pebble/internals/overlord/checkstate.(*execChecker).check()
	/home/flotter/pebble/internals/overlord/checkstate/checkers.go:170
github.com/canonical/pebble/internals/overlord/checkstate.(*checkData).runCheck()
	/home/flotter/pebble/internals/overlord/checkstate/manager.go:199
github.com/canonical/pebble/internals/overlord/checkstate.(*checkData).loop()
	/home/flotter/pebble/internals/overlord/checkstate/manager.go:182
created by github.com/canonical/pebble/internals/overlord/checkstate.(*CheckManager).PlanChanged
	/home/flotter/pebble/internals/overlord/checkstate/manager.go:74

In this particular case, the problem is that test teardown stops the reaper
process which is responsible for reaping check commands, while the final
(edge case) run of the check is still completing.

Following the same logic, this could also cause check overlap between terminating
previous checks and starting new checks in checkstate.PlanChanged().

The following changes fix the problem:

- Use a WaitGroup to represent the group of active checks in the check manager

- Wait() on the group to complete before creating the next group of checks
  when PlanChanged is called.

- Each check runner will use a deferred exit to decrement the group counter.
  • Loading branch information
flotter committed Jun 27, 2023
1 parent 8902cbe commit 9100887
Showing 1 changed file with 13 additions and 0 deletions.
13 changes: 13 additions & 0 deletions internals/overlord/checkstate/manager.go
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ import (
// CheckManager starts and manages the health checks.
type CheckManager struct {
mutex sync.Mutex
group sync.WaitGroup
checks map[string]*checkData
failureHandlers []FailureFunc
}
Expand Down Expand Up @@ -58,13 +59,20 @@ func (m *CheckManager) PlanChanged(p *plan.Plan) {
for _, check := range m.checks {
check.cancel()
}
// Wait for all context cancellations to propagate and allow
// each goroutine to cleanly exit.
m.group.Wait()

// Set the size of the next wait group
m.group.Add(len(p.Checks))

// Then configure and start new checks.
checks := make(map[string]*checkData, len(p.Checks))
for name, config := range p.Checks {
ctx, cancel := context.WithCancel(context.Background())
check := &checkData{
config: config,
group: &m.group,
checker: newChecker(config),
ctx: ctx,
cancel: cancel,
Expand Down Expand Up @@ -155,6 +163,7 @@ const (
// checkData holds state for an active health check.
type checkData struct {
config *plan.Check
group *sync.WaitGroup
checker checker
ctx context.Context
cancel context.CancelFunc
Expand All @@ -171,6 +180,10 @@ type checker interface {
}

func (c *checkData) loop() {
// Schedule a notification on exit to indicate another
// checker in the group is complete.
defer c.group.Done()

logger.Debugf("Check %q starting with period %v", c.config.Name, c.config.Period.Value)

ticker := time.NewTicker(c.config.Period.Value)
Expand Down

0 comments on commit 9100887

Please sign in to comment.