

aong-atlassian

Related issue

Fixes #3206

Description

This PR fixes a critical bug where parallel stack deployments would crash and leave infrastructure in an inconsistent state when one stack failed during deployment.

The Problem:

When running cdktf deploy --parallelism N, if a stack failed while at max parallelism:

  • Promise.race() would throw immediately (line 472 in cdktf-project.ts)
  • The execution loop would exit prematurely
  • Already-running Terraform child processes would be killed mid-deployment
  • Infrastructure would be left in an inconsistent state (partial resources created, state locks held, corrupted Terraform state files)

This bug was introduced in v0.10.0 (March 2022) and has affected all versions through v0.21.0+.

Root Cause:

The execution loop awaited an unguarded Promise.race() with no error handling:

if (runningStacks.length >= maxParallelRuns) {
  await Promise.race(runningStacks.map((s) => s.currentWorkPromise));
  continue;
}

When any of the raced promises rejects, Promise.race() rejects immediately, so the await throws. This causes the execution loop to exit, and the CLI process terminates via exit(new Error(err)), killing all active Terraform child processes.
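For context, here is a small standalone TypeScript sketch of the Promise.race() behavior involved (illustrative only, not code from the cdktf codebase; the names are made up): the awaited race rejects as soon as the first promise rejects, while the slower promise is still in flight and would be orphaned if the process exited at that point.

// Standalone illustration of the failure mode (names are hypothetical).
function delay(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function slowDeploy(name: string): Promise<string> {
  await delay(2000); // still in flight when the race rejects
  console.log(`${name} finished`);
  return name;
}

async function failingDeploy(name: string): Promise<string> {
  await delay(100);
  throw new Error(`${name} failed`);
}

async function main(): Promise<void> {
  const running = [slowDeploy("stack1"), failingDeploy("stack2")];
  try {
    await Promise.race(running); // rejects after ~100ms, long before stack1 finishes
  } catch (e) {
    // Without the fix, the CLI exits here and stack1's work is killed mid-flight.
    console.error("race rejected:", (e as Error).message);
  }
}

main();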

The Solution:

I wrapped Promise.race() in a try-catch that breaks out of the loop instead of letting the rejection propagate immediately:

if (runningStacks.length >= maxParallelRuns) {
  try {
    await Promise.race(runningStacks.map((s) => s.currentWorkPromise));
  } catch (e) {
    logger.debug(
      "Encountered an error in one of the stacks, allowing running stacks to finish before exit",
      e,
    );
    break;
  }
  continue;
}

Why this approach:

I chose break instead of rethrowing because it allows execution to reach the existing ensureAllSettledBeforeThrowing call at lines 507-510. That existing infrastructure waits for all running stacks to complete before reporting the error, ensuring:

  • No orphaned Terraform processes
  • All active deployments complete cleanly
  • Proper error reporting after all work is done
  • Clean shutdown with no inconsistent infrastructure state

In testing, it appears that rejections from these promises are already handled elsewhere, so rethrowing directly in the catch block would surface duplicate errors. The new unit test verifies that the error is still reported.
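For illustration, here is a minimal sketch of the "let everything settle, then rethrow" pattern that ensureAllSettledBeforeThrowing provides (the function below is a hypothetical stand-in, not the actual cdktf implementation):

// Hypothetical stand-in: wait for every in-flight promise to settle,
// then rethrow the original error so it is reported exactly once.
async function settleAllBeforeThrowing(
  error: unknown,
  inFlight: Promise<unknown>[],
): Promise<never> {
  // Promise.allSettled never rejects, so no running deployment is interrupted.
  await Promise.allSettled(inFlight);
  throw error;
}

Because the loop now breaks instead of rethrowing, execution falls through to this kind of call, so the failure is only surfaced after the remaining stacks have settled.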

Testing:

I added a new test: "waits for running stacks to complete when one fails with limited parallelism"

  • Uses parallelism: 2 with 4 stacks to trigger the bug scenario
  • Verifies that already-running stacks complete even when one fails
  • The test fails without the fix and passes with it

To properly test this, I also added stack4 to the parallel-error test fixture, as the bug only manifests when:

  1. Running at max parallelism AND
  2. There's another pending stack waiting (forcing re-entry into the if block) AND
  3. A running stack fails while waiting for a slot

All 25 tests pass with this change (487s runtime).
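To make the scenario concrete, here is a rough, self-contained TypeScript sketch of the scheduling behavior the test exercises (a simplified model, not the actual test, which uses the cdktf test harness and the parallel-error fixture; runWithParallelism, Task, and the stack names below are illustrative):

type Task = { name: string; run: () => Promise<void> };

// Simplified model of the fixed loop: at most `parallelism` tasks run at once;
// when one fails, stop starting new tasks but let the running ones finish.
async function runWithParallelism(tasks: Task[], parallelism: number): Promise<void> {
  const pending = [...tasks];
  const running: Promise<void>[] = [];
  let failure: unknown;

  while (pending.length > 0) {
    if (running.length >= parallelism) {
      try {
        await Promise.race(running);
      } catch (e) {
        failure = e; // same role as `break` in the PR: stop scheduling, keep running work alive
        break;
      }
      continue;
    }
    const task = pending.shift()!;
    const promise = task.run().finally(() => {
      running.splice(running.indexOf(promise), 1);
    });
    running.push(promise);
  }

  await Promise.allSettled(running); // already-running tasks complete cleanly
  if (failure !== undefined) throw failure;
}

// Example: 4 stacks, parallelism 2; stack2 fails while stack1 is still running.
const wait = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));
const completed: string[] = [];

runWithParallelism(
  [
    { name: "stack1", run: async () => { await wait(300); completed.push("stack1"); } },
    { name: "stack2", run: async () => { await wait(50); throw new Error("stack2 failed"); } },
    { name: "stack3", run: async () => { await wait(10); completed.push("stack3"); } },
    { name: "stack4", run: async () => { await wait(10); completed.push("stack4"); } },
  ],
  2,
).catch((e) => {
  // Expect: stack1 completed before the error surfaced; stack3 and stack4 never started.
  console.log(completed, (e as Error).message); // [ 'stack1' ] 'stack2 failed'
});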

Checklist

  • I have updated the PR title to match CDKTF's style guide
  • I have run the linter on my code locally
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation if applicable (N/A - this is a bug fix, no user-facing API changes)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works if applicable
  • New and existing unit tests pass locally with my changes

@aong-atlassian aong-atlassian requested a review from a team as a code owner October 6, 2025 23:03
@aong-atlassian aong-atlassian requested review from ansgarm and mutahhir and removed request for a team October 6, 2025 23:03

vercel bot commented Oct 6, 2025

@aong-atlassian is attempting to deploy a commit to the HashiCorp Team on Vercel.

A member of the Team first needs to authorize it.


CLA assistant check

Thank you for your submission! We require that all contributors sign our Contributor License Agreement ("CLA") before we can accept the contribution. Read and sign the agreement

Learn more about why HashiCorp requires a CLA and what the CLA includes

Have you signed the CLA already but the status is still pending? Recheck it.


@aong-atlassian
Author

Just waiting on my company's internal review for the CLA

Hopefully we can iterate on the fix in the meantime if needed 🙏


Development

Successfully merging this pull request may close these issues.

One stack failure when deploying multiple stack in parallel causes other stacks state to remain locked
