-
Notifications
You must be signed in to change notification settings - Fork 845
test(reexecute): add firewood chaos test [2/2] #4695
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c68a289 to
e1cf68f
Compare
| # XXX: remove this before merging | ||
| pull_request: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for testing the PR - will remove prior to merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds a chaos testing framework for Firewood to verify crash resilience during C-Chain block reexecution. The test simulates application crashes by forcefully killing the reexecution process, then verifies that Firewood can recover and resume from the persisted state.
Key Changes:
- Implements a chaos test executable that spawns, kills, and restarts reexecution test processes
- Adds GitHub Actions workflow for scheduled and manual chaos testing
- Integrates chaos testing into the project's Taskfile build system
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/reexecute/chaos/main.go | Core chaos test implementation with process management and crash recovery logic |
| tests/reexecute/chaos/deps.go | VM initialization helpers for creating mainnet C-Chain VM instances |
| Taskfile.yml | Task definitions for running chaos tests with and without data copying |
| .github/workflows/chaos-test.yml | CI workflow for scheduled and manual chaos test execution |
| .github/workflows/chaos-test.json | Configuration matrix for chaos test parameters |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ci: add chaos test job chore: nits chore: add nix installation step chore: configure AWS credentials chore: remove inputs chore: add perms chore: nit chore: nit chore: extend wait time chore: nits chore: nit chore: nit chore: Create shared `evm` module (#4690) chore(reexecute/c): remove go bench from benchmark (#4640) chore: nits fix: MAX_WAIT_TIME chore: nit chore: extend wait times chore: log errs chore: stdout tail chore: nit chore: nits chore: nits ci: improve workflow chore: nits chore: nits chore: nits chore: lint
27869b7 to
7b6d057
Compare
maru-ava
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from the implementation considerations mentioned inline, I'm ok merging what you've proposed provided its scope remains limited. Fault injection will be a priority post-monorepo and tools like antithesis and chaos mesh provide a much richer set of capabilities that I would rather not see us attempt to duplicate.
Taskfile.yml
Outdated
| cmds: | ||
| - cmd: bash -x ./scripts/copy_dir.sh {{.SRC}} {{.DST}} | ||
|
|
||
| firewood-chaos-test: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per the example of #4761, var indirection as proposed below is best avoided. Instead, please prefer passing args directly so that any changes to the inputs don't require task modification:
test-firewood-chaos:
desc: ...
cmd: go run ./tests/reexecute/chaos {{.CLI_ARGS}}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.github/workflows/chaos-test.json
Outdated
| "pull_request": { | ||
| "include": [ | ||
| { | ||
| "start-block": "101", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also as per the example of #4761, please ensure that configuration required to run a test locally is defined outside of github actions - ideally execution would be something like task run-test test-name. That could mean that only test, runner and timeout-minutes would be defined in this file, and given that both the runner and the timeout are static, the value of this file would seem questionable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also went ahead and removed the JSON file as suggested: 9b24d42
|
Will wait for #4761 to be merged in before making any of the requested changes here. |
Why this should be merged
As mentioned in ava-labs/firewood#1371, we need a crash test to make sure that Firewood is safe against application crashes. This PR fulfills this by adding the
chaosexecutable, which does the following:xamount of time, the executable kills the testThe test is considered a success if the reexecution test is able to start up again.
Closes ava-labs/firewood#1588
How this works
tests/reexecute/chaosexecutableHow this was tested
CI
https://github.com/ava-labs/avalanchego/actions/runs/20284273056/job/58254137735?pr=4695
Need to be documented in RELEASES.md?
No