feat(map): Fix map replay, wait-inside-map, and concurrency race conditions#229
ayushiahjolia merged 7 commits into main
Unit tests failed. This is due to the race condition introduced in the following change. Removing this change should fix the tests.
When all map branches call wait(), every branch thread deregisters and suspendExecution() fires. But the parent thread blocked on completionFuture.join() in BaseConcurrentOperation.get() is stuck forever, because no branch will ever complete to trigger onChildContextComplete and finalize the map's completionFuture. runUntilCompleteOrSuspend races completionFuture against executionExceptionFuture, so when suspension fires, the parent thread is released via SuspendExecutionException instead of blocking indefinitely.
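The race described above can be illustrated with plain `CompletableFuture`s. This is only a minimal sketch, not the SDK's implementation: the class `SuspendRaceSketch`, the nested `SuspendExecutionException`, and the method signature are hypothetical stand-ins that borrow the names from the comment, and the sketch assumes the exception future only ever completes exceptionally.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

class SuspendRaceSketch {
    // Hypothetical stand-in for the SDK's suspension signal.
    static class SuspendExecutionException extends RuntimeException {
        SuspendExecutionException() {
            super("execution suspended");
        }
    }

    static <T> T runUntilCompleteOrSuspend(
            CompletableFuture<T> completionFuture, CompletableFuture<Void> executionExceptionFuture) {
        try {
            // Completes as soon as either future completes, normally or exceptionally.
            CompletableFuture.anyOf(completionFuture, executionExceptionFuture).join();
        } catch (CompletionException e) {
            // join() wraps the original failure; unwrap so callers see it directly.
            if (e.getCause() instanceof RuntimeException) {
                throw (RuntimeException) e.getCause();
            }
            throw e;
        }
        // Assumption: executionExceptionFuture only ever completes exceptionally,
        // so a normal return from anyOf means completionFuture is done.
        return completionFuture.join();
    }

    public static void main(String[] args) {
        // Models the "all branches are waiting" case: nothing will complete this future.
        CompletableFuture<String> neverCompletes = new CompletableFuture<>();
        CompletableFuture<Void> suspended = new CompletableFuture<>();
        suspended.completeExceptionally(new SuspendExecutionException());
        try {
            runUntilCompleteOrSuspend(neverCompletes, suspended);
        } catch (SuspendExecutionException e) {
            System.out.println("parent released: " + e.getMessage());
        }
    }
}
```

A bare `neverCompletes.join()` in `main` would block forever; racing it against the second future is what lets the caller return.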
private void addAllItems() {
    // Enqueue all items first, then start execution. This prevents early termination
    // criteria (e.g., minSuccessful) from completing the operation mid-loop on replay,
What is the expected behavior of early termination for map? Do we skip the rest of the items?
ChildOperation will not send checkpoint if parent operation is completed.
Yes, the remaining items are skipped with NOT_STARTED status.
But any items that have already started will run to completion.
 * after all items have been enqueued. This prevents early termination from blocking item creation when all items
 * are known upfront (e.g., map operations).
 */
protected <R> ChildContextOperation<R> enqueueItem(
Calling enqueueItem for all items, followed by startPendingItems, seems no different from calling addItem for all items. Is it just for calculating the failure percentage?
The failure numbers can also be calculated with eager start, because when the user creates a Map they already pass the items as a list?
addItem calls executeNextItemIfAllowed() immediately, and on replay child execution is synchronous, so early termination (e.g., minSuccessful=1) can mark the operation completed before the remaining items are even registered. The next addItem would then throw IllegalStateException. Enqueue-then-start ensures all items are registered before any execution begins.
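The difference between the two orderings can be reproduced with a toy model. The following is a hedged sketch only: `EnqueueThenStartSketch` and its fields are hypothetical, child execution is modeled as a synchronous call that always succeeds (standing in for replay), and the real operation classes are certainly more involved.

```java
import java.util.ArrayList;
import java.util.List;

class EnqueueThenStartSketch {
    private final int minSuccessful;
    private final List<String> pending = new ArrayList<>();
    private int registered = 0;
    private int started = 0;
    private int succeeded = 0;
    private boolean completed = false;

    EnqueueThenStartSketch(int minSuccessful) {
        this.minSuccessful = minSuccessful;
    }

    // Eager variant: register the item and run it right away.
    void addItem(String item) {
        enqueueItem(item);
        startPendingItems();
    }

    // Deferred variant, step 1: register only.
    void enqueueItem(String item) {
        if (completed) {
            throw new IllegalStateException("operation already completed, cannot add " + item);
        }
        registered++;
        pending.add(item);
    }

    // Deferred variant, step 2: run registered items until early termination kicks in.
    void startPendingItems() {
        List<String> toRun = new ArrayList<>(pending);
        pending.clear();
        for (String item : toRun) {
            if (completed) {
                break; // remaining items are never started
            }
            started++;
            succeeded++; // synchronous, always-successful child (models replay)
            if (succeeded >= minSuccessful) {
                completed = true;
            }
        }
    }

    int registeredCount() { return registered; }
    int startedCount() { return started; }
    boolean isCompleted() { return completed; }

    public static void main(String[] args) {
        EnqueueThenStartSketch eager = new EnqueueThenStartSketch(1);
        eager.addItem("a");
        try {
            eager.addItem("b");
        } catch (IllegalStateException e) {
            System.out.println("eager: " + e.getMessage());
        }

        EnqueueThenStartSketch deferred = new EnqueueThenStartSketch(1);
        for (String item : List.of("a", "b", "c")) {
            deferred.enqueueItem(item);
        }
        deferred.startPendingItems();
        System.out.println("deferred: registered=" + deferred.registeredCount()
                + ", started=" + deferred.startedCount());
    }
}
```

In the deferred run every item is registered before the first one executes, so early termination leaves the later items unstarted instead of rejecting them.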
} else {
    handleFailure(completionStatus);
    synchronized (this) {
        if (isOperationCompleted()) {
As soon as this is done, we need to prevent more children from checkpointing their results so that we can keep the result consistent.
ChildContextOperation.checkpointSuccess checks if (parentOperation != null && parentOperation.isOperationCompleted()) before sending any checkpoint. So once handleComplete marks the parent as completed, any in-flight children that finish afterwards will skip their checkpoint and just call markAlreadyCompleted() locally.
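The guard described in this reply can be sketched as follows. The classes here (`CheckpointGuardSketch`, `Parent`, `Child`) are hypothetical simplifications that reuse the method names from the comment; the in-memory `checkpointLog` list merely stands in for whatever the SDK actually sends.

```java
import java.util.ArrayList;
import java.util.List;

class CheckpointGuardSketch {
    // Hypothetical parent operation exposing only its completion flag.
    static class Parent {
        private boolean completed;

        synchronized boolean isOperationCompleted() {
            return completed;
        }

        synchronized void markCompleted() {
            completed = true;
        }
    }

    // Hypothetical child operation that checks its parent before checkpointing.
    static class Child {
        private final Parent parentOperation;
        private final List<String> checkpointLog;
        private boolean alreadyCompleted;

        Child(Parent parentOperation, List<String> checkpointLog) {
            this.parentOperation = parentOperation;
            this.checkpointLog = checkpointLog;
        }

        void checkpointSuccess(String result) {
            if (parentOperation != null && parentOperation.isOperationCompleted()) {
                // Parent already finalized its result: skip the checkpoint, finish locally.
                markAlreadyCompleted();
                return;
            }
            checkpointLog.add(result); // stand-in for sending a real checkpoint
        }

        void markAlreadyCompleted() {
            alreadyCompleted = true;
        }

        boolean isAlreadyCompleted() {
            return alreadyCompleted;
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        Parent parent = new Parent();
        Child first = new Child(parent, log);
        Child late = new Child(parent, log);

        first.checkpointSuccess("first");
        parent.markCompleted(); // e.g. minSuccessful reached
        late.checkpointSuccess("late");

        System.out.println("checkpoints=" + log + ", late skipped=" + late.isAlreadyCompleted());
    }
}
```

Note that in this sketch the check and the checkpoint are not one atomic step, so a child that passes the check just before the parent completes would still record its result.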
wangyb-A left a comment
We will make some minor refactors and fixes in the future.
@zhongkechen Please also take a look 😀
We still need some major fixes 😅
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Issue Link, if available
#39
Description
Follow-ups/Remaining Work
NOT_STARTED or SUCCESSFUL?
Demo/Screenshots
Checklist
Testing
Unit Tests
Have unit tests been written for these changes? Updated.
Integration Tests
Have integration tests been written for these changes? Yes
Examples
Has a new example been added for the change? (if applicable) Yes