Contributor

@RodrigoVillar RodrigoVillar commented Dec 15, 2025

Why this should be merged

As mentioned in ava-labs/firewood#1371, we need a crash test to verify that Firewood is safe against application crashes. This PR adds the chaos executable, which does the following:

  • Starts the reexecution test, which uses Firewood, as a child process
  • Waits for x amount of time, then kills the test
  • Starts the reexecution test again, picking up from where the child process crashed

The test is considered a success if the reexecution test is able to start up again.

Closes ava-labs/firewood#1588

How this works

  • Adds the tests/reexecute/chaos executable
  • Adds the Firewood chaos test workflow, which runs on a schedule and can also be triggered via manual dispatch
  • Extends the reexecution script to invoke the chaos test

How this was tested

CI

https://github.com/ava-labs/avalanchego/actions/runs/20284273056/job/58254137735?pr=4695

Need to be documented in RELEASES.md?

No

Comment on lines +30 to +31

  # XXX: remove this before merging
  pull_request:

Contributor Author

This is for testing the PR - will remove prior to merging.

@RodrigoVillar RodrigoVillar requested a review from rkuris December 16, 2025 22:12
@RodrigoVillar RodrigoVillar marked this pull request as ready for review December 16, 2025 22:23
@RodrigoVillar RodrigoVillar requested a review from a team as a code owner December 16, 2025 22:23
Copilot AI review requested due to automatic review settings December 16, 2025 22:23
Contributor

Copilot AI left a comment

Pull request overview

This PR adds a chaos testing framework for Firewood to verify crash resilience during C-Chain block reexecution. The test simulates application crashes by forcefully killing the reexecution process, then verifies that Firewood can recover and resume from the persisted state.

Key Changes:

  • Implements a chaos test executable that spawns, kills, and restarts reexecution test processes
  • Adds GitHub Actions workflow for scheduled and manual chaos testing
  • Integrates chaos testing into the project's Taskfile build system

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/reexecute/chaos/main.go | Core chaos test implementation with process management and crash recovery logic |
| tests/reexecute/chaos/deps.go | VM initialization helpers for creating mainnet C-Chain VM instances |
| Taskfile.yml | Task definitions for running chaos tests with and without data copying |
| .github/workflows/chaos-test.yml | CI workflow for scheduled and manual chaos test execution |
| .github/workflows/chaos-test.json | Configuration matrix for chaos test parameters |

ci: add chaos test job

chore: nits

chore: add nix installation step

chore: configure AWS credentials

chore: remove inputs

chore: add perms

chore: nit

chore: nit

chore: extend wait time

chore: nits

chore: nit

chore: nit

chore: Create shared `evm` module (#4690)

chore(reexecute/c): remove go bench from benchmark (#4640)

chore: nits

fix: MAX_WAIT_TIME

chore: nit

chore: extend wait times

chore: log errs

chore: stdout tail

chore: nit

chore: nits

chore: nits

ci: improve workflow

chore: nits

chore: nits

chore: nits

chore: lint
@RodrigoVillar RodrigoVillar force-pushed the rodrigo/firewood-chaos-test branch from 27869b7 to 7b6d057 Compare December 17, 2025 15:19
@RodrigoVillar RodrigoVillar requested a review from a team as a code owner December 17, 2025 15:19
@RodrigoVillar RodrigoVillar changed the base branch from master to rodrigo/export-new-mainnet-c-chain-vm December 17, 2025 15:20
@RodrigoVillar RodrigoVillar changed the title from `test(reexecute): add firewood chaos test` to `test(reexecute): add firewood chaos test [2/2]` Dec 17, 2025
Base automatically changed from rodrigo/export-new-mainnet-c-chain-vm to master December 17, 2025 18:09
Contributor

@maru-ava maru-ava left a comment

Apart from the implementation considerations mentioned inline, I'm ok merging what you've proposed, provided its scope remains limited. Fault injection will be a priority post-monorepo, and tools like Antithesis and Chaos Mesh provide a much richer set of capabilities that I would rather not see us attempt to duplicate.

Taskfile.yml Outdated

    cmds:
      - cmd: bash -x ./scripts/copy_dir.sh {{.SRC}} {{.DST}}

  firewood-chaos-test:

Contributor

As per the example of #4761, var indirection as proposed below is best avoided. Instead, please prefer passing args directly so that any changes to the inputs don't require task modification:

  test-firewood-chaos:
    desc: ...
    cmd: go run ./tests/reexecute/chaos {{.CLI_ARGS}}

Contributor Author

Followed the example of #4761 and moved the logic of test-firewood-chaos into its own script: 00a4852

"pull_request": {
"include": [
{
"start-block": "101",
Contributor

Also as per the example of #4761, please ensure that configuration required to run a test locally is defined outside of github actions - ideally execution would be something like task run-test test-name. That could mean that only test, runner and timeout-minutes would be defined in this file, and given that both the runner and the timeout are static, the value of this file would seem questionable.

Contributor Author

Done

Contributor Author

Also went ahead and removed the JSON file as suggested: 9b24d42

@RodrigoVillar
Contributor Author

Will wait for #4761 to be merged in before making any of the requested changes here.

@RodrigoVillar RodrigoVillar requested a review from maru-ava January 6, 2026 16:27
@maru-ava maru-ava added this pull request to the merge queue Jan 6, 2026
Merged via the queue into master with commit 5bb3329 Jan 6, 2026
56 checks passed
@maru-ava maru-ava deleted the rodrigo/firewood-chaos-test branch January 6, 2026 20:21
@github-project-automation github-project-automation bot moved this to Done 🎉 in avalanchego Jan 6, 2026