
Flakiness detection #998

Open
tvolkert opened this issue Mar 14, 2019 · 13 comments

@tvolkert
Contributor

It'd be really helpful if the test runner could be configured to automatically rerun individual tests upon failure and, if a test succeeded in a rerun invocation, flag it as flaky. Then the runner would report how many tests passed, how many failed, how many were skipped, and how many flaked. Along with this, we'd be able to configure whether flakes would trigger an overall passing result or failing result (via the exit code).

Upon failure or flakiness, stdout would report the failure of the first invocation, and if a subsequent invocation passed, it'd get reported as [F] for flaky.

For example, configured to report success if flakes were detected:

$ pub run test --rerun-on-failure=5 --success_on_flake
00:04 +20 ~1 !1: test/foo_test.dart: Example test [F]                                                        
  Expected: true
    Actual: <false>
  
  package:test_api           expect
  test/foo_test.dart 10:20  main.<fn>.<fn>.<fn>
  
00:04 +40 ~1 !1: Some tests flaked.                                                                                                                           

$ echo $?
0 <-- If the only failures were identified to be flakes, then report tests passing

... and configured to report non-success if flakes were detected:

$ pub run test --rerun-on-failure=5 --no-success_on_flake
00:04 +20 ~1 !1: test/foo_test.dart: Example test [F]                                                        
  Expected: true
    Actual: <false>
  
  package:test_api           expect
  test/foo_test.dart 10:20  main.<fn>.<fn>.<fn>
  
00:04 +40 ~1 !1: Some tests flaked.                                                                                                                           

$ echo $?
[X] <-- some well-defined exit code that signals failure due only to flaky tests

Related (but not the same): #441

@jakemac53
Contributor

jakemac53 commented Mar 14, 2019

Given that we already do have the retry logic, would this request be satisfied by:

  • Reporting any test that had to be retried as a flake (in the summary at the end)
  • Adding a flag which causes any flakes to result in a non-zero exit code

The primary difference I see here compared to your proposal is that you wouldn't have a command line option to control the number of retries (although you can disable retries entirely with --no-retry). Unfortunately that also means we can't add a --retry=<num-retries> option, since --retry already exists as a boolean flag. However, retry counts are supported through dart_test.yaml, annotations, and named args on tests/groups today.

Note that you can also create different retry configurations using presets, so you can have different config on Travis etc.
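
For reference, the existing hooks mentioned above look roughly like this in test code. This is a minimal sketch of standard package:test usage (the file, library, and test names are made up), not anything proposed in this thread; the same retry count can also be set via the retry: key in dart_test.yaml, including inside presets: blocks selected with --preset on the command line.

// Sketch of the existing retry hooks in package:test (nothing new here):
// a library-level @Retry annotation plus per-test and per-group retry: args.
@Retry(2) // Every test in this file may be retried up to 2 times by default.
library retry_example;

import 'package:test/test.dart';

void main() {
  test('talks to a flaky backend', () async {
    // ... a body that occasionally fails ...
  }, retry: 5); // Overrides the library-level default for this one test.

  group('integration', () {
    test('end to end', () async {
      // ...
    });
  }, retry: 3); // Applies to every test in the group.
}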

@natebosch
Member

What does our current reporting look like when using existing retry behavior?

@jakemac53
Contributor

Currently it looks something like this:

0:01 +1 ~1 -1: flaky test [E]                                                                           
  Expected: <5>
    Actual: <0>
  
  package:test_api            expect
  hello_world_test.dart 34:5  <fn>
  
Retry: flaky test
  Expected: <5>
    Actual: <1>
  
  package:test_api            expect
  hello_world_test.dart 34:5  <fn>
  
Retry: flaky test
  Expected: <5>
    Actual: <2>
  
  package:test_api            expect
  hello_world_test.dart 34:5  <fn>
  
Retry: flaky test
  Expected: <5>
    Actual: <3>
  
  package:test_api            expect
  hello_world_test.dart 34:5  <fn>
  
Retry: flaky test
  Expected: <5>
    Actual: <4>
  
  package:test_api            expect
  hello_world_test.dart 34:5  <fn>
  
Retry: flaky test
00:02 +2 ~1: All tests passed!
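
For what it's worth, output like the above can be produced by a test along these lines. This is a minimal sketch rather than the actual test behind that run: the counter captured in main survives across retries, so the test fails on the first five attempts (Actual: <0> through <4>) and passes on the sixth.

import 'package:test/test.dart';

void main() {
  var attempts = 0;
  // Retries re-invoke the test body in the same isolate, so `attempts`
  // keeps counting up until the expectation finally holds.
  test('flaky test', () {
    expect(attempts++, 5);
  }, retry: 5);
}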

@fzyzcjy

fzyzcjy commented May 1, 2023

+1 after 4 years. For example, Flutter itself has a dedicated article about detecting and reducing flakiness (https://github.com/flutter/flutter/wiki/Reducing-Test-Flakiness), so flaky tests really should be handled separately instead of being marked as passed (or failed).

My https://github.com/fzyzcjy/flutter_convenient_test works well for integration tests, but when running widget tests / unit tests from an IDE (e.g. IntelliJ), I cannot rely on convenient_test, so I hope package:test can gain flakiness support. Thanks!

@jakemac53
Contributor

It doesn't look like we ever got any clarity here on whether the proposed UX would be sufficient -- cc @tvolkert.

Presumably we would also want to have a special flag for this in the JSON protocol, which could be breaking, cc @DanTup. We could make it non-breaking by having it just be an additional flag on successful tests, but it might be cleaner if we just had 3 result states (success, failure, flaky).
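
For context, testDone events in the JSON reporter protocol currently carry a result of "success", "failure", or "error" plus skipped/hidden booleans, so the non-breaking option would amount to one extra boolean on successful tests. The sketch below shows how a consumer might read such a field; the isFlaky name is hypothetical and not part of the protocol today.

// Sketch of a JSON-reporter consumer, assuming a hypothetical `isFlaky`
// field were added to `testDone` events. Consumers that ignore unknown
// fields would keep working; flake-aware tools could count the flakes.
import 'dart:convert';
import 'dart:io';

Future<void> main() async {
  final proc = await Process.start('dart', ['test', '--reporter', 'json']);
  var flaky = 0;
  await proc.stdout
      .transform(utf8.decoder)
      .transform(const LineSplitter())
      .where((line) => line.startsWith('{')) // One JSON event per line.
      .forEach((line) {
    final event = jsonDecode(line) as Map<String, dynamic>;
    if (event['type'] == 'testDone' &&
        event['result'] == 'success' &&
        (event['isFlaky'] as bool? ?? false)) {
      flaky++; // Hypothetical: the test passed, but only after retries.
    }
  });
  stderr.writeln('$flaky flaky test(s) detected');
  exitCode = await proc.exitCode;
}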

@jakemac53 added the needs-info label ("Additional information needed from the issue author") on May 1, 2023
@fzyzcjy

fzyzcjy commented May 1, 2023

Looks quite reasonable; I am willing to put up a PR if the design is clear.

@tvolkert
Contributor Author

tvolkert commented May 1, 2023

@pdblasi-google is the right person on the Flutter side now 🙂

The github-actions bot removed the needs-info label on May 1, 2023
@pdblasi-google

Taking inspiration from NUnit, would it make sense to have four possible states? NUnit has Passed, Failed, Inconclusive and Skipped.

Inconclusive could be what we use for tests that are marked flaky, and the additional Skipped state would be helpful for reporting.

@pdblasi-google

Inconclusive could also be a useful state for situations such as, "We were unable to connect to a device; the test itself didn't explicitly fail, but we are unable to report on it meaningfully".

@DanTup
Contributor

DanTup commented May 2, 2023

> Presumably we would also want to have a special flag for this in the JSON protocol, which could be breaking, cc @DanTup. We could make it non-breaking by having it just be an additional flag on successful tests, but it might be cleaner if we just had 3 result states (success, failure, flaky).

If this behaviour is opt-in by the IDE, then presumably that wouldn't be breaking. If it could be enabled outside of the IDE, then having a third result state could be breaking and we might need to ship updates to handle this first.

There was mention of existing retry functionality above, which I'm not sure I've seen before, and I have no idea what Dart-Code will do for it. I'm happy to update the Dart extension/DAP to handle whatever you think makes most sense for package:test, though.

FWIW, VS Code's API only allows us to report tests as passed, failed, errored, skipped (and queued). So we'd need to map the status to one of those at the end for its test runner anyway (I presume that Passed is the most reasonable, even though it doesn't make the flakes very visible).

@jakemac53
Contributor

OK, it sounds like it would make sense to have this be a separate field and keep the current success/failure status, then.

@jakemac53
Contributor

> FWIW, VS Code's API only allows us to report tests as passed, failed, errored, skipped (and queued). So we'd need to map the status to one of those at the end for its test runner anyway (I presume that Passed is the most reasonable, even though it doesn't make the flakes very visible).

Would you be able to pop up a notification that some of the tests had been flaky?

@DanTup
Contributor

DanTup commented May 3, 2023

We can show toast notifications, but they could seem spammy if we show them too much. There are some cases where we have to spawn multiple debug sessions at once (if you select a bunch of tests across multiple files in the test explorer) so doing it at the end of each session could still trigger multiple.

We can probably find some way to do it (and have a Don't Show Again option); it'll just need a little thought.
