checkstate: make tests more robust #249

Merged
merged 3 commits on Jun 28, 2023
13 changes: 13 additions & 0 deletions internals/overlord/checkstate/manager.go
@@ -27,6 +27,7 @@ import (
// CheckManager starts and manages the health checks.
type CheckManager struct {
mutex sync.Mutex
group sync.WaitGroup
checks map[string]*checkData
failureHandlers []FailureFunc
}
@@ -58,13 +59,20 @@ func (m *CheckManager) PlanChanged(p *plan.Plan) {
for _, check := range m.checks {
check.cancel()
}
// Wait for all context cancellations to propagate and allow
// each goroutine to cleanly exit.
m.group.Wait()

// Set the size of the next wait group
m.group.Add(len(p.Checks))

// Then configure and start new checks.
checks := make(map[string]*checkData, len(p.Checks))
for name, config := range p.Checks {
ctx, cancel := context.WithCancel(context.Background())
check := &checkData{
config: config,
group: &m.group,
checker: newChecker(config),
ctx: ctx,
cancel: cancel,
@@ -155,6 +163,7 @@ const (
// checkData holds state for an active health check.
type checkData struct {
config *plan.Check
group *sync.WaitGroup
checker checker
ctx context.Context
cancel context.CancelFunc
@@ -171,6 +180,10 @@ type checker interface {
}

func (c *checkData) loop() {
// Schedule a notification on exit to indicate another
// checker in the group is complete.
defer c.group.Done()

logger.Debugf("Check %q starting with period %v", c.config.Name, c.config.Period.Value)

ticker := time.NewTicker(c.config.Period.Value)
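Taken together, the manager.go hunks above implement a cancel-wait-restart lifecycle: cancel every running check, wait on the shared WaitGroup for the goroutines to drain, then size the group for the new set of checks before starting them. The following is a minimal standalone sketch of that pattern; the type and function names here are illustrative and not code from this PR.

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type manager struct {
	mutex   sync.Mutex
	group   sync.WaitGroup
	cancels []context.CancelFunc
}

// reconfigure mirrors the shape of PlanChanged: stop the old workers
// synchronously, then start a fresh set tracked by the same WaitGroup.
func (m *manager) reconfigure(names []string) {
	m.mutex.Lock()
	defer m.mutex.Unlock()

	// Cancel all running workers and wait for each goroutine to exit
	// cleanly before anything new is started.
	for _, cancel := range m.cancels {
		cancel()
	}
	m.group.Wait()

	// Size the wait group for the next generation of workers.
	m.group.Add(len(names))

	m.cancels = m.cancels[:0]
	for _, name := range names {
		ctx, cancel := context.WithCancel(context.Background())
		m.cancels = append(m.cancels, cancel)
		go m.run(ctx, name)
	}
}

func (m *manager) run(ctx context.Context, name string) {
	defer m.group.Done() // signal completion to the shared group on exit

	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			fmt.Println("tick:", name)
		}
	}
}

func main() {
	m := &manager{}
	m.reconfigure([]string{"chk1", "chk2"})
	time.Sleep(30 * time.Millisecond)
	// By the time reconfigure returns, the chk1/chk2 goroutines have exited.
	m.reconfigure([]string{"chk1"})
	time.Sleep(30 * time.Millisecond)
	m.reconfigure(nil)
}

The important ordering is Wait before the next Add: reusing a single WaitGroup across generations is only safe because the counter has drained to zero before new goroutines are registered.
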
27 changes: 19 additions & 8 deletions internals/overlord/checkstate/manager_test.go
@@ -45,12 +45,14 @@ func (s *ManagerSuite) SetUpSuite(c *C) {
setLoggerOnce.Do(func() {
logger.SetLogger(logger.New(os.Stderr, "[test] "))
})
}
flotter (Contributor Author):
Moved reaper.Start() and reaper.Stop() to test level. The command cleanup should happen before the next test starts, and it also gives a more robust test environment for race conditions, since each test must now exit cleanly before the reaper is stopped.

Note that the reaper is bound to the process ID of the package-level test binary, so moving it into per-test setup and teardown makes it incompatible with parallel testing. However, parallel testing is strictly opt-in, so this is still a valid requirement for this package's tests.
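For readers unfamiliar with gopkg.in/check.v1, the distinction this relies on is hook granularity: suite-level hooks run once around the whole suite, test-level hooks wrap every individual test. A minimal sketch of the fixture shape, assuming standard check.v1 semantics (the package and suite type names are illustrative):

package checkstate_test

import (
	. "gopkg.in/check.v1"
)

type fixtureSketch struct{}

var _ = Suite(&fixtureSketch{})

// Suite-level hooks run once around the entire suite.
func (s *fixtureSketch) SetUpSuite(c *C)    {} // once, before the first test
func (s *fixtureSketch) TearDownSuite(c *C) {} // once, after the last test

// Test-level hooks wrap every individual test, so a per-test resource
// such as the reaper is fully torn down before the next test begins.
func (s *fixtureSketch) SetUpTest(c *C)    {} // reaper.Start() now lives here
func (s *fixtureSketch) TearDownTest(c *C) {} // reaper.Stop() now lives here
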


func (s *ManagerSuite) SetUpTest(c *C) {
err := reaper.Start()
c.Assert(err, IsNil)
}

func (s *ManagerSuite) TearDownSuite(c *C) {
func (s *ManagerSuite) TearDownTest(c *C) {
err := reaper.Stop()
c.Assert(err, IsNil)
}
@@ -137,7 +139,6 @@ func (s *ManagerSuite) TestTimeout(c *C) {
c.Assert(check.Failures, Equals, 1)
c.Assert(check.Threshold, Equals, 1)
c.Assert(check.LastError, Equals, "exec check timed out")
c.Assert(check.ErrorDetails, Equals, "FOO")
flotter (Contributor Author):
As explained in the commit message, this check is too racy. There is no guarantee the command's output is captured before the timeout cuts off the command execution. Given that this test focuses on the timeout mechanism, not on command output logging, removing the assertion is justified to improve test robustness.
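To illustrate the race in question: with a context-based timeout around an exec'd command, whether the child gets to produce any output before it is killed depends purely on scheduling. A minimal sketch under that assumption (the command and durations are illustrative, not taken from the test):

package main

import (
	"bytes"
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()

	var out bytes.Buffer
	cmd := exec.CommandContext(ctx, "/bin/sh", "-c", "echo FOO; sleep 1")
	cmd.Stdout = &out
	err := cmd.Run() // killed once the context deadline passes

	// On a busy or single-core machine the shell may be killed before
	// "echo FOO" even runs, so out may or may not contain "FOO" here.
	fmt.Printf("err=%v output=%q\n", err, out.String())
}

Whether the output contains "FOO" by the time Run returns is exactly the kind of detail a loaded CI runner decides differently from run to run, which is why the assertion was dropped.
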

}

func (s *ManagerSuite) TestCheckCanceled(c *C) {
@@ -161,17 +162,15 @@ func (s *ManagerSuite) TestCheckCanceled(c *C) {
},
})

// Wait for command to start (output file grows in size)
prevSize := 0
// Wait for command to start (output file is not zero in size)
flotter (Contributor Author), Jun 27, 2023:
The previous logic did not quite make sense, so I simplified it a bit.

c.MkDir() is managed by the test framework and removed after the test. The temp file therefore always starts out empty, so we can simply check for a non-zero size. In the original code, the prevSize = len(b) assignment appeared redundant, as the preceding check would break out of the loop before it could run for any non-zero value.

for i := 0; ; i++ {
if i >= 100 {
c.Fatalf("failed waiting for command to start")
}
b, _ := ioutil.ReadFile(tempFile)
if len(b) != prevSize {
if len(b) > 0 {
break
}
prevSize = len(b)
time.Sleep(time.Millisecond)
}

@@ -185,7 +184,6 @@ func (s *ManagerSuite) TestCheckCanceled(c *C) {
stopChecks(c, mgr)

// Ensure command was terminated (output file didn't grow in size)
time.Sleep(50 * time.Millisecond)
flotter (Contributor Author):
stopChecks() calls PlanChanged(), which now terminates the previous checks synchronously, so this sleep is no longer needed.

b1, err := ioutil.ReadFile(tempFile)
c.Assert(err, IsNil)
time.Sleep(20 * time.Millisecond)
@@ -269,8 +267,20 @@ func (s *ManagerSuite) TestFailures(c *C) {
c.Assert(failureName, Equals, "")
}

// waitCheck is a time-based approach to waiting for a checker run to complete.
// The timeout value does not impact the general time it takes for tests to
// complete, but determines a worst-case waiting period before giving up.
// The timeout value must take into account single-core or very busy machines,
// so it makes sense to pick a conservative number here, as failing a test
// due to a busy test resource is more expensive than waiting a few more
// seconds.
func waitCheck(c *C, mgr *CheckManager, name string, f func(check *CheckInfo) bool) *CheckInfo {
for i := 0; i < 100; i++ {
// Worst-case waiting time for checker run(s) to complete. This
// period should be much longer (10x is good) than the longest
// check timeout value.
timeout := time.Second * 10

for start := time.Now(); time.Since(start) < timeout; {
checks, err := mgr.Checks()
c.Assert(err, IsNil)
for _, check := range checks {
@@ -280,6 +290,7 @@ func waitCheck(c *C, mgr *CheckManager, name string, f func(check *CheckInfo) bool) *CheckInfo {
}
time.Sleep(time.Millisecond)
}

c.Fatalf("timed out waiting for check %q", name)
return nil
}