Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test race conditions (including cases only seen under heavy cpu load) #181

Closed
flotter opened this issue Jan 18, 2023 · 14 comments
Closed

Test race conditions (including cases only seen under heavy cpu load) #181

flotter opened this issue Jan 18, 2023 · 14 comments
Labels
Tests Test failures (usually intermittent)

Comments

@flotter
Copy link
Contributor

flotter commented Jan 18, 2023

I was getting what appears to be intermittent failures on CI test runners at times when github feels slow and under heavy load.

I spent some hours investigating intermittent issues locally with Pebble test, and I would like to use this issue to track my progress, and get some feedback.

My environment details are:

flotter@flotter-pc:~$ uname -a
Linux flotter-pc 5.15.0-58-generic #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
flotter@flotter-pc:~$ cat /etc/os-release 
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"

It is possible this is related to, but I have not directly encountered those:

In general, some of the service tests relies on output written to disk or the logger to verify weather a service performed the intended work. The strategy some of the tests follow is to 'ensure' the service is started, and then proceed to verifying the service output. However, this does not guarantee any work the service process itself is doing is complete.

internal/overlord/servstate/manager_test.go:

func (s *S) TestEnvironment(c *C) {
    :
    chg := s.startServices(c, []string{"envtest"}, 1)  // Uses ensure() and wait() to make sure task completed.
    s.st.Lock() 
    c.Check(chg.Status(), Equals, state.DoneStatus, Commentf("Error: %v", chg.Err())) 
    s.st.Unlock() 
    data, err := ioutil.ReadFile(logPath)                  // Race condition

In some places there is a loop with arbitrary number of iterations to wait on the service status:

func (s *S) waitUntilService(c *C, service string, f func(svc *servstate.ServiceInfo) bool) {
    for i := 0; i < 310; i++ {
        svc := s.serviceByName(c, service)
        if f(svc) {
            return
        }
        time.Sleep(10 * time.Millisecond)
   }

Reproducing

I can easily reproduce various failures in internal/overlord/servstate/manager_test.go

Terminal 1:

    ~/> sudo apt install stress
    ~/> stress --cpu=10 --io=4    # I specify more cpu than my physical cores to make sure I overload the kernel task scheduler

Terminal 2:

    ~/pebble> cd internal/overlord/servstate
    ~/pebble/internal/overlord/servstate/> go version
    go version go1.14.15 linux/amd64
    ~/pebble/internal/overlord/servstate/> while [ true ]; do go test -count=1 -race ./... -check.vv; done
@cjdcordeiro
Copy link
Collaborator

Possibly related to #188 (comment)

@flotter
Copy link
Contributor Author

flotter commented Jun 26, 2023

#247

Not directly related, but I've seen "reaper not started messages" before, and this may be a source. This is part of my journey to make tests more reliable.

@benhoyt
Copy link
Contributor

benhoyt commented Jun 26, 2023

@flotter A noble journey, fellow Pebbler. I hope you have a good steed! ;-)

@flotter
Copy link
Contributor Author

flotter commented Jun 27, 2023

#249

checkstate should now be bullet proof.

@flotter
Copy link
Contributor Author

flotter commented Jun 28, 2023

START: daemon_test.go:1057: daemonSuite.TestHTTPAPI
START: daemon_test.go:64: daemonSuite.SetUpTest
PASS: daemon_test.go:64: daemonSuite.SetUpTest	0.000s

2023-06-28T10:14:44.077Z [test] HTTP API server listening on ":0".
2023-06-28T10:14:44.077Z [test] Started daemon.
2023-06-28T10:14:44.083Z [test] [::1]:54022 GET /v1/checks 95.701µs 401
daemon_test.go:1088:
    c.Assert(err, IsNil)
... value *net.OpError = &net.OpError{Op:"close", Net:"tcp", Source:net.Addr(nil), Addr:(*net.TCPAddr)(0xc00044baa0), Err:poll.errNetClosing{}} ("close tcp [::]:43187: use of closed network connection")

START: daemon_test.go:73: daemonSuite.TearDownTest
PASS: daemon_test.go:73: daemonSuite.TearDownTest	0.000s

FAIL: daemon_test.go:1057: daemonSuite.TestHTTPAPI

@flotter
Copy link
Contributor Author

flotter commented Jun 28, 2023

Next one to tackle.

@flotter
Copy link
Contributor Author

flotter commented Jul 3, 2023

PASS: manager_test.go:496: S.TestStartBadCommand 0.006s

START: manager_test.go:630: S.TestStartFastExitCommand
START: manager_test.go:147: S.SetUpTest
PASS: manager_test.go:147: S.SetUpTest 0.001s

2023-07-03T07:18:35.824Z [test] Service "test4" starting: echo -e 'too-fast\nsecond line'
START: manager_test.go:183: S.TearDownTest
PASS: manager_test.go:183: S.TearDownTest 0.000s

PASS: manager_test.go:630: S.TestStartFastExitCommand 0.006s

START: manager_test.go:268: S.TestStartStopServices
START: manager_test.go:147: S.SetUpTest
PASS: manager_test.go:147: S.SetUpTest 0.001s

2023-07-03T07:18:35.832Z [test] Service "test1" starting: /bin/sh -c "echo test1 | tee -a /tmp/check-327806957/37/log.txt; sleep 10"
2023-07-03T07:18:35.883Z [test] Service "test2" starting: /bin/sh -c "echo test2 | tee -a /tmp/check-327806957/37/log.txt; sleep 10"
2023-07-03T07:18:35.937Z [test] Service "test1" stopped
2023-07-03T07:18:35.937Z [test] Service "test2" starting: /bin/sh -c 'echo x >>/tmp/check-327806957/19/out; sleep 1'
2023-07-03T07:18:35.938Z [test] Service "test2" stopped
START: manager_test.go:183: S.TearDownTest
panic: internal error: reaper must be started

goroutine 300 [running]:
github.com/canonical/pebble/internals/reaper.WaitCommand(0xc000476160, 0xc000338b40, 0xc34b40, 0xc0000ba5d8)
/home/runner/work/pebble/pebble/internals/reaper/reaper.go:192 +0x794
github.com/canonical/pebble/internals/overlord/servstate.(*serviceData).startInternal.func2(0xc000476160, 0xc0001efa50, 0x5, 0xc0000991a0, 0xc0004a4080)
/home/runner/work/pebble/pebble/internals/overlord/servstate/handlers.go:420 +0x51
created by github.com/canonical/pebble/internals/overlord/servstate.(*serviceData).startInternal
/home/runner/work/pebble/pebble/internals/overlord/servstate/handlers.go:419 +0x1255
FAIL github.com/canonical/pebble/internals/overlord/servstate 3.011s

@flotter
Copy link
Contributor Author

flotter commented Jul 3, 2023

START: request_test.go:44: S.TestStop
START: manager_test.go:147: S.SetUpTest
PASS: manager_test.go:147: S.SetUpTest 0.000s

START: manager_test.go:183: S.TearDownTest
PASS: manager_test.go:183: S.TearDownTest 0.000s

PASS: request_test.go:44: S.TestStop 0.000s

START: manager_test.go:1101: S.TestStopDuringBackoff
START: manager_test.go:147: S.SetUpTest
PASS: manager_test.go:147: S.SetUpTest 0.000s

2023-07-03T07:23:48.582Z [test] Service "test2" starting: sleep 0.1
2023-07-03T07:23:48.638Z [test] Service "nowrkdir" starting: /bin/sh -c "pwd >/tmp/check-507338424/17/output; sleep 0.060000000000000005"
2023-07-03T07:23:48.683Z [test] Service "test2" stopped unexpectedly with code 0
2023-07-03T07:23:48.683Z [test] Service "test2" on-success action is "restart", waiting ~500ms before restart (backoff 1)
2023-07-03T07:23:48.684Z [test] Service "test2" stopped while waiting for backoff
START: manager_test.go:183: S.TearDownTest
PASS: manager_test.go:183: S.TearDownTest 0.000s

PASS: manager_test.go:1101: S.TestStopDuringBackoff 0.104s

START: manager_test.go:1718: S.TestStopRunning
2023-07-03T07:23:48.684Z [test] Service "test2" starting: /bin/sh -c 'echo x >>/tmp/check-507338424/23/out; sleep 1'
panic: internal error: reaper must be started

goroutine 357 [running]:
github.com/canonical/pebble/internals/reaper.StartCommand(0xc0004729a0, 0x0, 0x0)
/home/runner/work/pebble/pebble/internals/reaper/reaper.go:164 +0x3fd
github.com/canonical/pebble/internals/overlord/servstate.(*serviceData).startInternal(0xc000134980, 0xc00024e7a8, 0x800916)
/home/runner/work/pebble/pebble/internals/overlord/servstate/handlers.go:405 +0xe8d
github.com/canonical/pebble/internals/overlord/servstate.(*serviceData).backoffTimeElapsed(0xc000134980, 0x0, 0x0)
/home/runner/work/pebble/pebble/internals/overlord/servstate/handlers.go:664 +0x118
github.com/canonical/pebble/internals/overlord/servstate.(*serviceData).doBackoff.func1()
/home/runner/work/pebble/pebble/internals/overlord/servstate/handlers.go:547 +0x46
created by time.goFunc
/opt/hostedtoolcache/go/1.15.15/x64/src/time/sleep.go:167 +0x52
FAIL github.com/canonical/pebble/internals/overlord/servstate 3.099s

@flotter
Copy link
Contributor Author

flotter commented Jul 4, 2023

START: daemon_test.go:1057: daemonSuite.TestHTTPAPI
START: daemon_test.go:64: daemonSuite.SetUpTest
PASS: daemon_test.go:64: daemonSuite.SetUpTest	0.000s

2023-06-28T10:14:44.077Z [test] HTTP API server listening on ":0".
2023-06-28T10:14:44.077Z [test] Started daemon.
2023-06-28T10:14:44.083Z [test] [::1]:54022 GET /v1/checks 95.701µs 401
daemon_test.go:1088:
    c.Assert(err, IsNil)
... value *net.OpError = &net.OpError{Op:"close", Net:"tcp", Source:net.Addr(nil), Addr:(*net.TCPAddr)(0xc00044baa0), Err:poll.errNetClosing{}} ("close tcp [::]:43187: use of closed network connection")

START: daemon_test.go:73: daemonSuite.TearDownTest
PASS: daemon_test.go:73: daemonSuite.TearDownTest	0.000s

FAIL: daemon_test.go:1057: daemonSuite.TestHTTPAPI

#253 should now fix this.

@flotter flotter changed the title Test race conditions under heavy cpu load Test race conditions (including cases only seen under heavy cpu load) Jul 4, 2023
@flotter
Copy link
Contributor Author

flotter commented Jul 5, 2023

Servstate is next.

@flotter
Copy link
Contributor Author

flotter commented Jul 28, 2023

#266 will address the common "reaper not started" type panics often seen during servstate unit testing. The PR includes a fix that will prevent the same from happening at Pebble process exit.

In addition to the reaper issues, I have started remove some of the timing related races, in this case only one specific to code reading file output from started services. However, tons of timing super sensitive code remains, so this is not all stress --io x --cpu y proof yet. Since this is not really the failures we see on CI, I will first prioritise other CI failures before coming back to these.

@benhoyt benhoyt added the Tests Test failures (usually intermittent) label Aug 18, 2023
@flotter
Copy link
Contributor Author

flotter commented Aug 18, 2023

Further improvement for servstate: #283

@benhoyt
Copy link
Contributor

benhoyt commented Aug 27, 2024

@flotter Do you think this is good enough that we should just close this now? I haven't done stress tests recently, but I haven't seen these failures for a long time.

@benhoyt
Copy link
Contributor

benhoyt commented Sep 30, 2024

We haven't seen these flakiness issues for ages (thanks in large part to @flotter's earlier work). Going to close this issue. We can open specific issues if we see more flakiness.

@benhoyt benhoyt closed this as not planned Won't fix, can't repro, duplicate, stale Sep 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Tests Test failures (usually intermittent)
Projects
None yet
Development

No branches or pull requests

5 participants