Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent test failures: servstate.TestOnCheckFailure* #270

Closed
barrettj12 opened this issue Aug 2, 2023 · 8 comments · Fixed by #272 or #283
Closed

Intermittent test failures: servstate.TestOnCheckFailure* #270

barrettj12 opened this issue Aug 2, 2023 · 8 comments · Fixed by #272 or #283
Labels
Tests Test failures (usually intermittent)

Comments

@barrettj12
Copy link
Contributor

barrettj12 commented Aug 2, 2023

Observing intermittent failures in the following tests:

TestOnCheckFailureRestartWhileRunning
$ stress go test ./internals/overlord/servstate -check.f TestOnCheckFailureRestartWhileRunning -count=1 -race
2023-08-03T00:16:46.601Z [test] Service "test2" starting: /bin/sh -c 'echo x >>/tmp/check-3400122428/1/out; sync; touch /tmp/check-3400122428/0/test2; sleep 10'
2023-08-03T00:16:46.702Z [test] Check "chk1" failure 1 (threshold 1): exec: "will-fail": executable file not found in $PATH
2023-08-03T00:16:46.702Z [test] Check "chk1" failure threshold 1 hit, triggering action
2023-08-03T00:16:46.928Z [test] Service "test2" on-check-failure action is "restart", terminating process before restarting
2023-08-03T00:16:46.929Z [test] Service "test2" exited after check failure, restarting
2023-08-03T00:16:46.929Z [test] Service "test2" on-check-failure action is "restart", waiting ~100ms before restart (backoff 1)
2023-08-03T00:16:46.929Z [test] Check "chk1" failure 2 (threshold 1): exec: "will-fail": executable file not found in $PATH
2023-08-03T00:16:46.971Z [test] Service "test2" stopped while waiting for backoff

----------------------------------------------------------------------
FAIL: manager_test.go:984: S.TestOnCheckFailureRestartWhileRunning

manager_test.go:1044:
    c.Assert(checks[0].Failures, Equals, 1)
... obtained int = 2
... expected int = 1

OOPS: 0 passed, 1 FAILED
--- FAIL: Test (0.40s)
FAIL
FAIL    github.com/canonical/pebble/internals/overlord/servstate        0.458s
FAIL
TestOnCheckFailureIgnore
$ stress go test ./internals/overlord/servstate -check.f TestOnCheckFailureIgnore -count=1 -race
2023-08-03T00:18:28.213Z [test] Service "test2" starting: /bin/sh -c 'echo x >>/tmp/check-3858153059/1/out; sync; touch /tmp/check-3858153059/0/test2; sleep 10'
2023-08-03T00:18:28.313Z [test] Check "chk1" failure 1 (threshold 1): exec: "will-fail": executable file not found in $PATH
2023-08-03T00:18:28.313Z [test] Check "chk1" failure threshold 1 hit, triggering action
2023-08-03T00:18:28.444Z [test] Check "chk1" failure 2 (threshold 1): exec: "will-fail": executable file not found in $PATH
2023-08-03T00:18:28.513Z [test] Check "chk1" failure 3 (threshold 1): exec: "will-fail": executable file not found in $PATH
2023-08-03T00:18:28.544Z [test] Service "test2" stopped

----------------------------------------------------------------------
FAIL: manager_test.go:1167: S.TestOnCheckFailureIgnore

manager_test.go:1226:
    c.Assert(checks[0].Failures, Equals, 1)
... obtained int = 2
... expected int = 1

OOPS: 0 passed, 1 FAILED
--- FAIL: Test (0.34s)
FAIL
FAIL    github.com/canonical/pebble/internals/overlord/servstate        0.374s
FAIL
TestOnCheckFailureShutdown
$ stress go test ./internals/overlord/servstate -check.f TestOnCheckFailureShutdown -count=1 -race
2023-08-03T00:19:01.028Z [test] Service "test2" starting: /bin/sh -c 'echo x >>/tmp/check-1520139323/1/out; sync; touch /tmp/check-1520139323/0/test2; sleep 10'
2023-08-03T00:19:01.173Z [test] Check "chk1" failure 1 (threshold 1): exec: "will-fail": executable file not found in $PATH
2023-08-03T00:19:01.173Z [test] Check "chk1" failure threshold 1 hit, triggering action
2023-08-03T00:19:01.546Z [test] Service "test2" on-check-failure action is "shutdown", triggering server exit
2023-08-03T00:19:01.546Z [test] Check "chk1" failure 2 (threshold 1): exec: "will-fail": executable file not found in $PATH
2023-08-03T00:19:01.604Z [test] Service "test2" stopped

----------------------------------------------------------------------
FAIL: manager_test.go:1243: S.TestOnCheckFailureShutdown

manager_test.go:1302:
    c.Assert(checks[0].Failures, Equals, 1)
... obtained int = 2
... expected int = 1

OOPS: 0 passed, 1 FAILED
--- FAIL: Test (0.60s)
FAIL
FAIL    github.com/canonical/pebble/internals/overlord/servstate        0.628s
FAIL
@barrettj12

This comment was marked as outdated.

@barrettj12
Copy link
Contributor Author

I did some bisecting, and this regression appears to have been introduced in 5a54fa5 (part of #266).

@flotter
Copy link
Contributor

flotter commented Aug 2, 2023

@barrettj12 Sorry about this. I can see why the test has become even more timing sensitive. I will fix this test entirely asap.

@flotter
Copy link
Contributor

flotter commented Aug 2, 2023

#272 Will now remove all remaining races from the on-check-failure unit tests (some other stress reproducible tests remain, as noted in #266).

The issue was caused by part of the test behaviour more precisely (waiting for the sync; touch xxx to complete), and as a result made the remaining races worse.

@jnsgruk jnsgruk linked a pull request Aug 2, 2023 that will close this issue
@barrettj12
Copy link
Contributor Author

@flotter synchronising on a channel seems like a good idea. Nonetheless, I am still observing intermittent failures in the following tests:

TestOnCheckFailureRestartWhileRunning
$ stress go test ./internals/overlord/servstate -check.f TestOnCheckFailureRestartWhileRunning -count=1 -race
2023-08-03T00:16:46.601Z [test] Service "test2" starting: /bin/sh -c 'echo x >>/tmp/check-3400122428/1/out; sync; touch /tmp/check-3400122428/0/test2; sleep 10'
2023-08-03T00:16:46.702Z [test] Check "chk1" failure 1 (threshold 1): exec: "will-fail": executable file not found in $PATH
2023-08-03T00:16:46.702Z [test] Check "chk1" failure threshold 1 hit, triggering action
2023-08-03T00:16:46.928Z [test] Service "test2" on-check-failure action is "restart", terminating process before restarting
2023-08-03T00:16:46.929Z [test] Service "test2" exited after check failure, restarting
2023-08-03T00:16:46.929Z [test] Service "test2" on-check-failure action is "restart", waiting ~100ms before restart (backoff 1)
2023-08-03T00:16:46.929Z [test] Check "chk1" failure 2 (threshold 1): exec: "will-fail": executable file not found in $PATH
2023-08-03T00:16:46.971Z [test] Service "test2" stopped while waiting for backoff

----------------------------------------------------------------------
FAIL: manager_test.go:984: S.TestOnCheckFailureRestartWhileRunning

manager_test.go:1044:
    c.Assert(checks[0].Failures, Equals, 1)
... obtained int = 2
... expected int = 1

OOPS: 0 passed, 1 FAILED
--- FAIL: Test (0.40s)
FAIL
FAIL    github.com/canonical/pebble/internals/overlord/servstate        0.458s
FAIL
TestOnCheckFailureIgnore
$ stress go test ./internals/overlord/servstate -check.f TestOnCheckFailureIgnore -count=1 -race
2023-08-03T00:18:28.213Z [test] Service "test2" starting: /bin/sh -c 'echo x >>/tmp/check-3858153059/1/out; sync; touch /tmp/check-3858153059/0/test2; sleep 10'
2023-08-03T00:18:28.313Z [test] Check "chk1" failure 1 (threshold 1): exec: "will-fail": executable file not found in $PATH
2023-08-03T00:18:28.313Z [test] Check "chk1" failure threshold 1 hit, triggering action
2023-08-03T00:18:28.444Z [test] Check "chk1" failure 2 (threshold 1): exec: "will-fail": executable file not found in $PATH
2023-08-03T00:18:28.513Z [test] Check "chk1" failure 3 (threshold 1): exec: "will-fail": executable file not found in $PATH
2023-08-03T00:18:28.544Z [test] Service "test2" stopped

----------------------------------------------------------------------
FAIL: manager_test.go:1167: S.TestOnCheckFailureIgnore

manager_test.go:1226:
    c.Assert(checks[0].Failures, Equals, 1)
... obtained int = 2
... expected int = 1

OOPS: 0 passed, 1 FAILED
--- FAIL: Test (0.34s)
FAIL
FAIL    github.com/canonical/pebble/internals/overlord/servstate        0.374s
FAIL
TestOnCheckFailureShutdown
$ stress go test ./internals/overlord/servstate -check.f TestOnCheckFailureShutdown -count=1 -race
2023-08-03T00:19:01.028Z [test] Service "test2" starting: /bin/sh -c 'echo x >>/tmp/check-1520139323/1/out; sync; touch /tmp/check-1520139323/0/test2; sleep 10'
2023-08-03T00:19:01.173Z [test] Check "chk1" failure 1 (threshold 1): exec: "will-fail": executable file not found in $PATH
2023-08-03T00:19:01.173Z [test] Check "chk1" failure threshold 1 hit, triggering action
2023-08-03T00:19:01.546Z [test] Service "test2" on-check-failure action is "shutdown", triggering server exit
2023-08-03T00:19:01.546Z [test] Check "chk1" failure 2 (threshold 1): exec: "will-fail": executable file not found in $PATH
2023-08-03T00:19:01.604Z [test] Service "test2" stopped

----------------------------------------------------------------------
FAIL: manager_test.go:1243: S.TestOnCheckFailureShutdown

manager_test.go:1302:
    c.Assert(checks[0].Failures, Equals, 1)
... obtained int = 2
... expected int = 1

OOPS: 0 passed, 1 FAILED
--- FAIL: Test (0.60s)
FAIL
FAIL    github.com/canonical/pebble/internals/overlord/servstate        0.628s
FAIL

@barrettj12 barrettj12 reopened this Aug 3, 2023
@barrettj12 barrettj12 changed the title Intermittent test failure: servstate.TestOnCheckFailureRestartWhileRunning Intermittent test failures: servstate.TestOnCheckFailure* Aug 3, 2023
@flotter
Copy link
Contributor

flotter commented Aug 3, 2023

@barrettj12 I know what I have missed. For some reason my stress test did reveal my oversight. I will fix that asap, and because I missed that super obvious race, as punishment, fix a few more of them in servstate.

@benhoyt
Copy link
Contributor

benhoyt commented Aug 3, 2023

Hehe, but don't feel bad, @flotter -- you're the only one of us fixing failures at the moment, the rest of us are just complaining about them! :-)

@benhoyt benhoyt added the Tests Test failures (usually intermittent) label Aug 18, 2023
@flotter
Copy link
Contributor

flotter commented Aug 18, 2023

Should be fixed by this: #283

@jnsgruk jnsgruk linked a pull request Aug 22, 2023 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Tests Test failures (usually intermittent)
Projects
None yet
3 participants