feat(map): Fix map replay, wait-inside-map, and concurrency race conditions by ayushiahjolia · Pull Request #229 · aws/aws-durable-execution-sdk-java

ayushiahjolia · 2026-03-18T19:45:31Z

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Issue Link, if available

#39

Description

Updated Map implementation to rely on ConcurrencyOperation.
Added tests for all map usages.

Follow-ups/Remaining Work

If a function returns null output, do we allow checkpointing it?
Is it okay to enqueue all items first, or do we want to start executing each item as soon as possible?
Doc updates will come in the next CR, which will also remove dead code from Test.
Thread issues during replay in LocalDurableTestRunner.
If minSuccessful is reached, should the remaining iterations return NOT_STARTED or SUCCESSFUL?

Demo/Screenshots

Checklist

I have filled out every section of the PR template
I have thoroughly tested this change

Testing

Unit Tests

Have unit tests been written for these changes? Updated.

Integration Tests

Have integration tests been written for these changes? Yes

Examples

Has a new example been added for the change? (if applicable) Yes

zhongkechen · 2026-03-18T23:41:43Z

Unit tests failed:

[ERROR] Errors: 
[ERROR]   ParallelOperationTest.branchCreation_multipleBranchesAllCreated:106 » IllegalState Cannot add items to a completed operation

This is due to the race condition introduced in the following change:

-        // Block until operation completes. No-op if the future is already completed.
-        completionFuture.join();
+        // Block until operation completes or execution suspends.
+        // Using runUntilCompleteOrSuspend races completionFuture against executionExceptionFuture,
+        // so when all active threads suspend (e.g., wait inside map branches), the
+        // SuspendExecutionException propagates and this thread is freed — preventing thread leaks
+        // on shared executor pools across invocations.
+        executionManager.runUntilCompleteOrSuspend(completionFuture).join();

Removing this change should fix the tests.

ayushiahjolia · 2026-03-19T04:06:37Z

When all map branches call wait(), every branch thread deregisters and suspendExecution() fires. But the parent thread blocked on completionFuture.join() in BaseConcurrentOperation.get() is stuck forever: no branch will ever complete to trigger onChildContextComplete and finalize the map's completionFuture. runUntilCompleteOrSuspend races completionFuture against executionExceptionFuture, so when suspension fires, the parent thread is freed via SuspendExecutionException instead of blocking indefinitely.
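The racing idea described above can be sketched with plain `CompletableFuture`s. This is a minimal illustration, not the SDK's `runUntilCompleteOrSuspend`: the class `RaceSketch`, the method `raceAgainstSuspend`, and the exception `SuspendSignal` are hypothetical names standing in for the real ones.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

class RaceSketch {
    /** Hypothetical stand-in for the SDK's SuspendExecutionException. */
    static class SuspendSignal extends RuntimeException {
        SuspendSignal() {
            super("execution suspended");
        }
    }

    /**
     * Returns a future that settles as soon as the completion future settles,
     * or as soon as the suspension future fails, whichever happens first.
     */
    static <T> CompletableFuture<T> raceAgainstSuspend(
            CompletableFuture<T> completion, CompletableFuture<?> suspension) {
        CompletableFuture<T> raced = new CompletableFuture<>();
        completion.whenComplete((value, error) -> {
            if (error != null) {
                raced.completeExceptionally(error);
            } else {
                raced.complete(value);
            }
        });
        // Only an exceptional suspension frees the waiter in this sketch.
        suspension.whenComplete((ignored, error) -> {
            if (error != null) {
                raced.completeExceptionally(error);
            }
        });
        return raced;
    }

    /** Joins the raced future and reports how the waiting thread was released. */
    static String describe(CompletableFuture<String> raced) {
        try {
            return "completed: " + raced.join();
        } catch (CompletionException e) {
            return "freed by: " + e.getCause().getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // Case 1: no branch ever completes, but suspension fires.
        CompletableFuture<String> neverCompletes = new CompletableFuture<>();
        CompletableFuture<Void> suspension = new CompletableFuture<>();
        CompletableFuture<String> raced = raceAgainstSuspend(neverCompletes, suspension);
        suspension.completeExceptionally(new SuspendSignal());
        System.out.println(describe(raced));

        // Case 2: the operation completes before any suspension.
        CompletableFuture<String> completes = new CompletableFuture<>();
        CompletableFuture<String> raced2 = raceAgainstSuspend(completes, new CompletableFuture<Void>());
        completes.complete("map result");
        System.out.println(describe(raced2));
    }
}
```

With a bare `join()` on the first future, case 1 would block forever; racing lets the suspension signal release the waiting thread.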

sdk/src/main/java/software/amazon/lambda/durable/operation/BaseDurableOperation.java

sdk/src/main/java/software/amazon/lambda/durable/operation/BaseConcurrentOperation.java

sdk/src/main/java/software/amazon/lambda/durable/serde/JacksonSerDes.java

sdk/src/main/java/software/amazon/lambda/durable/operation/BaseConcurrentOperation.java

sdk/src/test/java/software/amazon/lambda/durable/operation/BaseDurableOperationTest.java

sdk/src/main/java/software/amazon/lambda/durable/operation/MapOperation.java

sdk/src/main/java/software/amazon/lambda/durable/operation/ConcurrencyOperation.java

wangyb-A · 2026-03-20T00:10:50Z

sdk/src/main/java/software/amazon/lambda/durable/operation/MapOperation.java

+
+    private void addAllItems() {
+        // Enqueue all items first, then start execution. This prevents early termination
+        // criteria (e.g., minSuccessful) from completing the operation mid-loop on replay,


What is the expected behavior of early termination for map: do we skip the rest of the items?

ChildOperation will not send a checkpoint if the parent operation is completed.

Yes, we skip the remaining items, which get NOT_STARTED status.

But items that have already started will run to completion.

sdk/src/main/java/software/amazon/lambda/durable/operation/MapOperation.java

sdk/src/main/java/software/amazon/lambda/durable/operation/WaitOperation.java

zhongkechen · 2026-03-20T00:07:13Z

sdk/src/main/java/software/amazon/lambda/durable/operation/ConcurrencyOperation.java

+     * after all items have been enqueued. This prevents early termination from blocking item creation when all items
+     * are known upfront (e.g., map operations).
+     */
+    protected <R> ChildContextOperation<R> enqueueItem(


Calling enqueueItem for all items, followed by startPendingItems, seems no different from calling addItem for all items. Is it just for calculating the failure percentage?

The failure numbers can also be calculated with eager start, because the user already passes the items as a list when creating the Map?

addItem calls executeNextItemIfAllowed() immediately, and on replay child execution is synchronous, so early termination (e.g., minSuccessful=1) can mark the operation completed before the remaining items are even registered. The next addItem would then throw IllegalStateException. Enqueue-then-start ensures all items are registered before any execution begins.
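The difference between eager `addItem` and enqueue-then-start can be reproduced with a toy model. This sketch assumes synchronous child execution (as described for replay) and a `minSuccessful` of 1. `ToyOperation` and its method bodies are hypothetical simplifications, not the SDK's `ConcurrencyOperation`.

```java
import java.util.ArrayList;
import java.util.List;

class EnqueueThenStartSketch {
    /** Toy model of a concurrency operation with a minSuccessful early-termination rule. */
    static class ToyOperation {
        private final int minSuccessful;
        private final List<String> pending = new ArrayList<>();
        private final List<String> executed = new ArrayList<>();
        private boolean completed;

        ToyOperation(int minSuccessful) {
            this.minSuccessful = minSuccessful;
        }

        /** Registers an item without running it. */
        void enqueueItem(String item) {
            if (completed) {
                throw new IllegalStateException("Cannot add items to a completed operation");
            }
            pending.add(item);
        }

        /** Runs pending items synchronously until the early-termination rule is met. */
        void startPendingItems() {
            while (!completed && !pending.isEmpty()) {
                executed.add(pending.remove(0));
                if (executed.size() >= minSuccessful) {
                    completed = true;
                }
            }
        }

        /** Eager variant: registers the item and runs it immediately. */
        void addItem(String item) {
            enqueueItem(item);
            startPendingItems();
        }

        int executedCount() {
            return executed.size();
        }

        int notStartedCount() {
            return pending.size();
        }
    }

    /** Eager registration: the first item completes the operation, the second add fails. */
    static String eager(List<String> items) {
        ToyOperation op = new ToyOperation(1);
        try {
            for (String item : items) {
                op.addItem(item);
            }
            return "ok";
        } catch (IllegalStateException e) {
            return "failed: " + e.getMessage();
        }
    }

    /** Enqueue-then-start: every item is registered before anything runs. */
    static String enqueueThenStart(List<String> items) {
        ToyOperation op = new ToyOperation(1);
        for (String item : items) {
            op.enqueueItem(item);
        }
        op.startPendingItems();
        return "executed=" + op.executedCount() + ", notStarted=" + op.notStartedCount();
    }

    public static void main(String[] args) {
        List<String> items = List.of("a", "b", "c");
        System.out.println(eager(items));
        System.out.println(enqueueThenStart(items));
    }
}
```

In this model the eager path hits the same `IllegalStateException` message seen in the failing unit test, while the enqueue-then-start path leaves the unexecuted items registered but not started.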

zhongkechen · 2026-03-20T00:17:23Z

sdk/src/main/java/software/amazon/lambda/durable/operation/ConcurrencyOperation.java

-        } else {
-            handleFailure(completionStatus);
+        synchronized (this) {
+            if (isOperationCompleted()) {


As soon as this is done, we need to prevent more children from checkpointing their results so that we can keep the result consistent.

ChildContextOperation.checkpointSuccess checks if (parentOperation != null && parentOperation.isOperationCompleted()) before sending any checkpoint. So once handleComplete marks the parent as completed, any in-flight children that finish afterwards will skip their checkpoint and just call markAlreadyCompleted() locally.
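The guard described above can be illustrated with a small toy. `ToyParent`, `ToyChild`, and the checkpoint log are hypothetical. Only the shape of the check (`parent != null && parent.isOperationCompleted()`) is taken from the comment, and the sketch does not try to make the check and the checkpoint atomic.

```java
import java.util.ArrayList;
import java.util.List;

class CheckpointGuardSketch {
    /** Toy parent that only tracks whether it has completed. */
    static class ToyParent {
        private boolean completed;

        synchronized boolean isOperationCompleted() {
            return completed;
        }

        synchronized void markCompleted() {
            completed = true;
        }
    }

    /** Toy child that skips its checkpoint once the parent has completed. */
    static class ToyChild {
        private final ToyParent parent; // may be null for a top-level child
        private final String name;
        private final List<String> checkpointLog;
        private boolean completedLocally;

        ToyChild(ToyParent parent, String name, List<String> checkpointLog) {
            this.parent = parent;
            this.name = name;
            this.checkpointLog = checkpointLog;
        }

        /** Returns true if a checkpoint was recorded, false if it was skipped. */
        boolean checkpointSuccess(String result) {
            if (parent != null && parent.isOperationCompleted()) {
                // Analogous to markAlreadyCompleted(): finish locally, send nothing.
                completedLocally = true;
                return false;
            }
            checkpointLog.add(name + "=" + result);
            completedLocally = true;
            return true;
        }

        boolean isCompletedLocally() {
            return completedLocally;
        }
    }

    /** One child finishes before the parent completes, one finishes after. */
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        ToyParent parent = new ToyParent();
        new ToyChild(parent, "item-0", log).checkpointSuccess("ok");
        parent.markCompleted();
        new ToyChild(parent, "item-1", log).checkpointSuccess("ok");
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Only the child that finished before the parent completed leaves a checkpoint, so the recorded result stays consistent with the state at completion time.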

wangyb-A

We will have some minor refactors and fixes in the future.

@zhongkechen Please also take a look 😀

zhongkechen · 2026-03-20T00:20:02Z

> We will have some minor refactors and fixes in the future.
>
> @zhongkechen Please also take a look 😀

We still need some major fixes 😅

ayushiahjolia self-assigned this Mar 18, 2026

ayushiahjolia force-pushed the map_bug_fixes branch from cae0595 to 00210be March 18, 2026 19:57

feat(map): Fix map replay, wait-inside-map, and concurrency race conditions

14a215e

ayushiahjolia force-pushed the map_bug_fixes branch from 00210be to 14a215e March 18, 2026 21:22

Merge branch 'main' into map_bug_fixes

fde6a1b

ayushiahjolia marked this pull request as ready for review March 18, 2026 22:08

ayushiahjolia requested a review from a team March 18, 2026 22:08

zhongkechen force-pushed the map_bug_fixes branch from fde6a1b to 3c9974e March 18, 2026 23:41

ayushiahjolia force-pushed the map_bug_fixes branch from 3c9974e to bb907d9 March 19, 2026 03:35

ayushiahjolia force-pushed the map_bug_fixes branch from bb907d9 to 33e5548 March 19, 2026 04:07

zhongkechen requested changes Mar 19, 2026

sdk/src/main/java/software/amazon/lambda/durable/operation/BaseDurableOperation.java (outdated, resolved)

feat(map): Fix map replay, wait-inside-map, and concurrency race conditions

d7339bc

ayushiahjolia force-pushed the map_bug_fixes branch from 72e4167 to d7339bc March 19, 2026 04:42

zhongkechen requested changes Mar 19, 2026

zhongkechen reviewed Mar 19, 2026

sdk/src/main/java/software/amazon/lambda/durable/operation/BaseConcurrentOperation.java (outdated, resolved)

ayushiahjolia added 4 commits March 19, 2026 12:54

Merge branch 'main' into map_bug_fixes

fcd8120

feat(map): Fix map replay, wait-inside-map, and concurrency race conditions

6dceef9

Merge branch 'main' into map_bug_fixes

1122ba5

Fix handleComplete method for map

0ba6f4e

ayushiahjolia force-pushed the map_bug_fixes branch from 977217c to 0ba6f4e March 19, 2026 23:25

ayushiahjolia commented Mar 19, 2026

sdk/src/test/java/software/amazon/lambda/durable/operation/BaseDurableOperationTest.java (resolved)

zhongkechen self-requested a review March 19, 2026 23:40

wangyb-A reviewed Mar 19, 2026

sdk/src/main/java/software/amazon/lambda/durable/operation/MapOperation.java (resolved)

wangyb-A reviewed Mar 20, 2026

sdk/src/main/java/software/amazon/lambda/durable/operation/ConcurrencyOperation.java (resolved)

sdk/src/main/java/software/amazon/lambda/durable/operation/ConcurrencyOperation.java (resolved)

wangyb-A reviewed Mar 20, 2026

zhongkechen approved these changes Mar 20, 2026

wangyb-A approved these changes Mar 20, 2026

ayushiahjolia merged commit b625cb8 into main Mar 20, 2026
11 checks passed

ayushiahjolia deleted the map_bug_fixes branch March 20, 2026 18:16

wangyb-A mentioned this pull request Mar 20, 2026

feat: [Parallel] Add parallel result #246

Merged

2 tasks

3 participants
