Skip to content

Conversation

@mhamza15
Copy link
Collaborator

@mhamza15 mhamza15 commented Dec 23, 2025

Description

Switch tests to use gotestsum. gotestsum has a few benefits, namely prettier output, built-in junit report generation, automatic retries (individual tests, rather than full packages as some of our test infra does), and slow tests reporting.

The workflows will now report the top 20 slowest tests in the job summary. Click on any one of the jobs below to see the new output + job summary.

Backporting to improve release branches CI as well.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

AI Disclosure

Helped by Claude

@mhamza15 mhamza15 self-assigned this Dec 23, 2025
@github-actions github-actions bot added this to the v24.0.0 milestone Dec 23, 2025
@vitess-bot vitess-bot bot added NeedsWebsiteDocsUpdate What it says NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Dec 23, 2025
@vitess-bot
Copy link
Contributor

vitess-bot bot commented Dec 23, 2025

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@codecov
Copy link

codecov bot commented Dec 23, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.89%. Comparing base (32b8bd8) to head (0baa174).
⚠️ Report is 13 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #19076      +/-   ##
==========================================
- Coverage   69.89%   69.89%   -0.01%     
==========================================
  Files        1612     1613       +1     
  Lines      215826   216019     +193     
==========================================
+ Hits       150857   150989     +132     
- Misses      64969    65030      +61     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@mhamza15 mhamza15 force-pushed the gotestsum branch 2 times, most recently from 60f77d1 to e201c50 Compare January 8, 2026 01:19
Switch unit tests to use gotestsum. gotestsum has a few benefits, namely prettier output, built-in junit report generation, and automatic retries (individual tests, rather than full packages as some of our test infra does).

Signed-off-by: Mohamed Hamza <[email protected]>
@mhamza15 mhamza15 changed the title use gotestsum Run unit tests with gotestsum Jan 8, 2026
@arthurschreiber
Copy link
Member

What do you think about adding the output of gotestsum tool slowest in a step after running the tests? Would help track down the biggest offenders to slow CI runs more systematically.

@mhamza15
Copy link
Collaborator Author

mhamza15 commented Jan 8, 2026

What do you think about adding the output of gotestsum tool slowest in a step after running the tests? Would help track down the biggest offenders to slow CI runs more systematically.

Good call!

Signed-off-by: Mohamed Hamza <[email protected]>
@mhamza15
Copy link
Collaborator Author

mhamza15 commented Jan 8, 2026

Here's an example: https://github.com/vitessio/vitess/actions/runs/20826216650/attempts/1#summary-59828106838. Default is all tests >100ms. We have a gazillion, so changing to top 20.

@mhamza15 mhamza15 added Component: Build/CI Type: CI/Build Backport to: release-22.0 Needs to be backport to release-22.0 Backport to: release-23.0 Needs to be backport to release-23.0 and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Jan 8, 2026
Signed-off-by: Mohamed Hamza <[email protected]>
@mhamza15 mhamza15 marked this pull request as ready for review January 8, 2026 21:00
@mhamza15 mhamza15 requested a review from frouioui as a code owner January 8, 2026 21:00
Signed-off-by: Mohamed Hamza <[email protected]>
@mhamza15 mhamza15 changed the title Run unit tests with gotestsum Run tests with gotestsum Jan 8, 2026
mhamza15 added a commit to mhamza15/vitess that referenced this pull request Jan 9, 2026
Saving some time on some slow tests using [synctest](https://pkg.go.dev/testing/synctest). Slow tests pulled from changes in vitessio#19076.

Before:

```

Signed-off-by: Mohamed Hamza <[email protected]>
--- PASS: TestSleepTablet (15.01s)
    --- PASS: TestSleepTablet/default_sleep_duration (15.00s)
--- PASS: TestEmergencyReparentShardSlow (0.00s)
    --- PASS: TestEmergencyReparentShardSlow/nil_WaitReplicasTimeout_and_request_takes_29_seconds_is_ok (29.00s)
    --- PASS: TestEmergencyReparentShardSlow/nil_WaitReplicasTimeout_and_request_takes_31_seconds_is_error (30.00s)
--- PASS: TestPlannedReparentShardSlow (0.00s)
    --- PASS: TestPlannedReparentShardSlow/nil_WaitReplicasTimeout_and_request_takes_29_seconds_is_ok (28.01s)
    --- PASS: TestPlannedReparentShardSlow/nil_WaitReplicasTimeout_and_request_takes_31_seconds_is_error (30.00s)
ok  	vitess.io/vitess/go/vt/vtctl/grpcvtctldserver	30.854s
```

After:

```
--- PASS: TestSleepTablet (0.00s)
    --- PASS: TestSleepTablet/default_sleep_duration (0.00s)
--- PASS: TestEmergencyReparentShardSlow (0.00s)
    --- PASS: TestEmergencyReparentShardSlow/nil_WaitReplicasTimeout_and_request_takes_29_seconds_is_ok (0.01s)
    --- PASS: TestEmergencyReparentShardSlow/nil_WaitReplicasTimeout_and_request_takes_31_seconds_is_error (0.00s)
--- PASS: TestPlannedReparentShardSlow (0.00s)
    --- PASS: TestPlannedReparentShardSlow/nil_WaitReplicasTimeout_and_request_takes_29_seconds_is_ok (0.01s)
    --- PASS: TestPlannedReparentShardSlow/nil_WaitReplicasTimeout_and_request_takes_31_seconds_is_error (0.01s)
ok  	vitess.io/vitess/go/vt/vtctl/grpcvtctldserver	0.880s
```

Signed-off-by: Mohamed Hamza <[email protected]>
Some tests like TestPlayerBatchMode have subtests that share state.
When gotestsum retries just the failed subtest, the state is already
corrupted from subsequent subtests that ran. This flag makes gotestsum
rerun the entire root test instead.

Signed-off-by: Mohamed Hamza <[email protected]>
@mhamza15 mhamza15 marked this pull request as ready for review January 9, 2026 17:11
Copy link
Member

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@arthurschreiber
Copy link
Member

Only concern I have that this seems to stop CI from using vitessio/go-junit-report. I remember talking to @harshit-gangal some while back on why we have that fork and what we were using it for, but I don't quite remember what the reason was. I think it had something to do with flaky tests? 🤔

@arthurschreiber
Copy link
Member

Digging a bit more, #9102 is when vitessio/go-junit-report was introduced.

@arthurschreiber
Copy link
Member

arthurschreiber commented Jan 12, 2026

And I think vitessio/go-junit-report@a07af90 was the main reason for the fork. For workflows that automatically re-run failed test cases, the text output across different runs was accumulated and then send to vitessio/go-junit-report, which then had to handle cases where a test might have failed but then was successful in a re-run.

Given that gotestsum does the re-running and report generation now, we might not need this anymore? Does that mean we can get rid of the remaining references to vitessio/go-junit-report in the code? test/ci_workflow_gen.go, .github/workflows/vitess_tester_vtgate.yml, .github/workflows/codeql_analysis.yml and a few more files still seem to have references to it.

@arthurschreiber
Copy link
Member

According to gotestyourself/gotestsum#281, I think gotestsum does what we want. It records both the success and failure run, but the overall success of the test run is based on retries being eventually successful.

@arthurschreiber
Copy link
Member

I also see there's different levels of retry logic. In config.json, a few test cases have RetryMax set to some specific value, with the default being 3. How does that interact with retrying at the gotestsum layer now (or the various test type bash scripts before)? Does this need to be cleaned up at some point?

@mhamza15
Copy link
Collaborator Author

mhamza15 commented Jan 12, 2026

@arthurschreiber Thank you for taking a deep look at go-junit-report ❤️ . I agree, I believe gotestsum does everything we need. We can even have it report the failures that were retried successfully so that we are also aware of flaky tests (the job will still pass). I'll clean up the other references I had missed.

As for retries, I intentionally left out retries for gotestsum for e2e tests (which uses config.json), and let test.go handle it. I first thought of passing RetryMax through to the gotestsum script and letting it handle it, but test.go also handles creating fresh VTDATAROOTs for each try, so I kept it simple for now for concern that retries would just fail immediately do to stale leftover state between runs.

I'm looking through it again and it looks like many tests themselves also create their own subdirectories under VTDATAROOT, so we might be able to remove that bit of logic, which allows us to rely on gotestsum's retries fully. I'll take a deeper look 👀 .

gotestsum handles JUnit report generation via JUNIT_OUTPUT env var,
so go-junit-report is no longer needed.

Signed-off-by: Mohamed Hamza <[email protected]>
@mhamza15 mhamza15 force-pushed the gotestsum branch 2 times, most recently from 34a5b1f to d52af1a Compare January 12, 2026 14:27
Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
@mhamza15
Copy link
Collaborator Author

I've reverted the changes to use gotestsum for end-to-end tests since they require a bit more work to integrate nicely with the junit report generation. Will follow up in another PR.

@mhamza15 mhamza15 enabled auto-merge (squash) January 12, 2026 20:15
@mhamza15 mhamza15 merged commit 93e4601 into vitessio:main Jan 12, 2026
103 of 108 checks passed
@mhamza15 mhamza15 deleted the gotestsum branch January 12, 2026 21:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Backport to: release-22.0 Needs to be backport to release-22.0 Backport to: release-23.0 Needs to be backport to release-23.0 Component: Build/CI Type: CI/Build

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants