Fix child WF ID generation #1803

gow · 2025-02-05T07:04:24Z

What was changed

The parent now uses workflowInfo.OriginalRunID when it generates a child workflow ID.

Why?

We need this change to enable resetting workflows that have pending children. Concretely, the Go SDK must be able to replay a StartChildWorkflowExecutionInitiated event in a parent that was reset.
Without this change, the Go SDK runs into a non-determinism error because the child ID it generates (based on the new run ID) doesn't match the ID recorded in the StartChildWorkflowExecutionInitiated event. Since OriginalRunID stays the same across resets, with this change the SDK can replay StartChildWorkflowExecutionInitiated events successfully.
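
A minimal sketch of the mismatch described above, assuming the generated child ID is a run ID joined to a sequence number with an underscore (the shape suggested by the ID in the TMPRL1100 error quoted further down this thread). The helper `childWorkflowID` and the run ID values are hypothetical and are not taken from the SDK.

```go
package main

import "fmt"

// childWorkflowID is a hypothetical helper: it joins a seed (a run ID)
// and a sequence number, mirroring the "<runID>_<n>" shape assumed above.
func childWorkflowID(seed string, seq int) string {
	return fmt.Sprintf("%s_%d", seed, seq)
}

func main() {
	originalRunID := "run-A" // run that recorded StartChildWorkflowExecutionInitiated
	newRunID := "run-B"      // run created by the reset

	// ID stored in history by the original run.
	recorded := childWorkflowID(originalRunID, 14)

	// Replaying with the new run ID regenerates a different ID.
	fmt.Println(childWorkflowID(newRunID, 14) == recorded) // false

	// Replaying with a value that survives the reset reproduces it.
	fmt.Println(childWorkflowID(originalRunID, 14) == recorded) // true
}
```

Any seed that stays stable across the reset would make the second comparison hold; OriginalRunID is the one this PR initially proposed.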

Checklist

Closes
N/A
How was this tested:
Manually tested it (unit tests coming)
Any docs updates needed?
N/A

cretz · 2025-02-05T14:23:18Z

Questions here:

Can we confirm what Java does here?
IIUC Core-based languages make child workflow IDs with a seeded random that changes seed on reset midway through execution, meaning post-reset child workflow IDs would be different than they were on original run but pre-reset ones are the same. This was done very much by intention. Should we consider a similar approach in Go/Java that has a "current run ID" that can change midway through execution?
Can we assess the impact of behavior changes like this on existing users with existing expectations? I am guessing there is no impact because you could not reset to before child workflows?
Can you confirm OriginalRunID always matches RunID at this point of code today? Meaning since resets were never allowed over child workflows, it is impossible to have these two values different at child workflow creation time today?

gow · 2025-02-05T21:16:28Z

> Questions here:
>
> Can we confirm what Java does here?

Java seems to generate UUIDs when no ID is provided in the ChildWorkflow options. Additionally, when I replayed an existing StartChildWorkflowExecutionInitiated event (by resetting to a point after the child completed), it didn't run into a non-determinism error. Go, on the other hand, ran into a non-determinism error complaining about the workflow ID. ("Error":"[TMPRL1100] unknown command CommandType: ChildWorkflow, ID: 0174afb2-0213-4644-9214-8b977310cc9f_14, possible causes are nondeterministic workflow definition code or incompatible change in the workflow definition")

> IIUC Core-based languages make child workflow IDs with a seeded random that changes seed on reset midway through execution, meaning post-reset child workflow IDs would be different than they were on original run but pre-reset ones are the same. This was done very much by intention. Should we consider a similar approach in Go/Java that has a "current run ID" that can change midway through execution?

Absolutely yes. Java seems to be doing this already (at least in my testing). Looks like it changes the seed at the reset point?
If this is an easier change in the Go SDK, I'd rather do it. Otherwise, we still need this fix, since the Go SDK is currently broken and the upcoming reset changes will increase exposure to this bug (see below).

> Can we assess the impact of behavior changes like this on existing users with existing expectations? I am guessing there is no impact because you could not reset to before child workflows?

Yes. So far, all server versions have prevented resetting a workflow when the reset point falls between the child's Init and Completed events (i.e. within the lifespan of the child). So there is no potential user impact or backward-compatibility issue there.
However, the server has always allowed workflows to be reset when the chosen reset point is outside the child's Init and Completed events. When the reset point is after the child's Completed event, workers on the Go SDK run into a non-determinism error, while other SDKs seem to simply reapply the events (child Init, Start and Completed) without any issues. In that regard, this PR is more of a bug fix in the Go SDK. And with the upcoming server changes that allow reset points between Init and Completed, it's important that the Go SDK successfully replays a StartChildWorkflowExecutionInitiated event.

> Can you confirm OriginalRunID always matches RunID at this point of code today? Meaning since resets were never allowed over child workflows, it is impossible to have these two values different at child workflow creation time today?

Yes, it is possible for these two values to differ at child creation time. The server has always allowed resets when the reset point is outside the child's lifespan (Init to Completed). So when the reset point is before the child's Init event, the parent will try to create the child again, and this time OriginalRunID is different from RunID.

cretz · 2025-02-05T22:53:03Z

@gow - thanks for this info! So reading all of that I am seeing two things:

1. We need to fix Go to be at least a little like other SDKs. We have two options:
   1. Start offering a re-seeded random to users and we use it for child IDs in Go with an SDK flag to only apply to new workflows
   2. Make available (maybe only to us) a "current run ID" that is updated when reset that is used for child workflow IDs (no SDK flag needed for compatibility)
2. We are not concerned with backwards compatibility of only changing Go since it never worked anyways after reset, so long as the non-reset logic remains working

For 1, I will start an internal discussion and invite you. Regardless, I am not sure the desired outcome is what is in this PR.

gow · 2025-02-05T23:10:58Z

The desired outcome of this PR is for the Go SDK to be able to replay StartChildWorkflowExecutionInitiated events like other SDKs. Currently it fails. Using OriginalRunID will fix this issue and unblock reset feature development. But I understand it doesn't bring the Go SDK fully to parity with other SDKs.

gow · 2025-02-07T19:41:49Z

Putting this on hold. Will discuss internally and work on a better fix.

gow · 2025-02-12T23:53:32Z

After an internal discussion, we decided to bring Go SDK on par with the rest of the SDKs. So instead of simply using OriginalRunID, we are using the FirstExecutionRunId as the initial seed and updating it with the currentRunID whenever we encounter a reset.
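
A rough sketch of the seeding scheme described above, under stated assumptions: the `event` type, its `kind` strings, and `replayChildIDs` are hypothetical stand-ins for real history events and SDK internals, and the "<seed>_<n>" ID shape with a sequence counter that keeps counting across the reset is assumed for illustration only.

```go
package main

import "fmt"

// event is a simplified, hypothetical stand-in for a history event.
type event struct {
	kind  string // "started", "taskFailedReset", or "childInitiated"
	runID string // first execution run ID (started) or new run ID (reset)
}

// replayChildIDs walks a history and returns the child workflow IDs that
// would be generated. The seed starts as the first execution run ID and is
// replaced by the new run ID whenever a reset marker is encountered.
func replayChildIDs(history []event) []string {
	var seed string
	var seq int
	var ids []string
	for _, e := range history {
		switch e.kind {
		case "started":
			seed = e.runID
		case "taskFailedReset":
			seed = e.runID
		case "childInitiated":
			seq++
			ids = append(ids, fmt.Sprintf("%s_%d", seed, seq))
		}
	}
	return ids
}

func main() {
	history := []event{
		{kind: "started", runID: "first-run"},
		{kind: "childInitiated"},
		{kind: "taskFailedReset", runID: "reset-run"},
		{kind: "childInitiated"},
	}
	fmt.Println(replayChildIDs(history)) // [first-run_1 reset-run_2]
}
```

Children initiated before the reset point keep IDs derived from the first run, so their recorded events replay cleanly, while children created after it get IDs derived from the new run.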

yuandrew

overall LGTM, deferring to @Quinn-With-Two-Ns or @cretz for approval. I vaguely remember there was a test they were hoping was added, but can't remember exactly what that test was.

yuandrew · 2025-02-13T01:33:16Z

internal/internal_event_handlers.go

+		attr := event.GetWorkflowTaskFailedEventAttributes()
+		if attr.GetCause() == enumspb.WORKFLOW_TASK_FAILED_CAUSE_RESET_WORKFLOW {
+			weh.workflowInfo.childWorkflowIDSeed = attr.GetNewRunId()
+		}
 		// No Operation


stale comment

yuandrew · 2025-02-13T01:36:47Z

internal/internal_task_handlers_interfaces_test.go

@@ -176,7 +176,7 @@ func (s *PollLayerInterfacesTestSuite) TestGetNextCommands() {
 		createTestEventWorkflowTaskStarted(3),
 		{
 			EventId:   4,
-			EventType: enumspb.EVENT_TYPE_WORKFLOW_TASK_FAILED,
+			EventType: enumspb.EVENT_TYPE_WORKFLOW_TASK_TIMED_OUT,


why was this changed from failed to timed_out?

gow

> I vaguely remember there was a test they were hoping was added, but can't remember exactly what that test was.

We need to add some integration tests here. I'm trying to figure out the structuring and setup of them in this codebase. I'll probably sync with you all offline to get some pointers.

gow · 2025-02-13T02:18:06Z

internal/internal_task_handlers_interfaces_test.go

@@ -176,7 +176,7 @@ func (s *PollLayerInterfacesTestSuite) TestGetNextCommands() {
 		createTestEventWorkflowTaskStarted(3),
 		{
 			EventId:   4,
-			EventType: enumspb.EVENT_TYPE_WORKFLOW_TASK_FAILED,
+			EventType: enumspb.EVENT_TYPE_WORKFLOW_TASK_TIMED_OUT,


This is because history.prepareTask() (internal/internal_task_handlers.go) no longer skips workflow task failed events, so that they can be replayed. Some of the existing tests relied on this event being skipped, so I changed them to use other events that are still skipped.

gow · 2025-02-13T02:21:55Z

internal/internal_event_handlers.go

+	// Use the first execution run ID from the start event as the initial seed.
+	// First execution run ID stays the same for the entire chain of workflow resets.
+	// This helps us keep child workflow IDs consistent up until the next reset point.
+	weh.workflowInfo.childWorkflowIDSeed = attributes.GetFirstExecutionRunId()


Can FirstExecutionRunID ever be empty? The current version of the server is guaranteed to include it in the start event. But could there be long-running workflows whose start event has this field empty?
Asking to see whether we want to be defensive here and fall back to RunID if FirstExecutionRunID is empty.
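
The defensive fallback being asked about could look roughly like the sketch below. The helper `initialChildWorkflowIDSeed` is hypothetical and not part of the SDK; whether such a fallback is actually needed is the open question in this comment.

```go
package main

import "fmt"

// initialChildWorkflowIDSeed is a hypothetical helper: it prefers the first
// execution run ID from the start event and falls back to the current run ID
// when that field is empty.
func initialChildWorkflowIDSeed(firstExecutionRunID, runID string) string {
	if firstExecutionRunID != "" {
		return firstExecutionRunID
	}
	return runID
}

func main() {
	fmt.Println(initialChildWorkflowIDSeed("first-run", "current-run")) // first-run
	fmt.Println(initialChildWorkflowIDSeed("", "current-run"))          // current-run
}
```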

gow requested a review from a team as a code owner February 5, 2025 07:04

gow requested review from cretz, yycptt and Quinn-With-Two-Ns February 5, 2025 07:04

gow marked this pull request as draft February 7, 2025 19:41

gow force-pushed the cg/reset_child_workflow_id branch from c54c992 to ab1a0c3 Compare February 12, 2025 06:31

gow changed the title ~~Use OriginalRunID instead of RunID to generate child workflow IDs~~ Fix seed child WF ID generation Feb 12, 2025

gow changed the title ~~Fix seed child WF ID generation~~ Fix child WF ID generation Feb 12, 2025

gow added 3 commits February 12, 2025 15:45

Use OriginalRunID instead of RunID to generate child workflow IDs

9520db2

Use first execution run

ccb73b2

Unskip EVENT_TYPE_WORKFLOW_TASK_FAILED in internal_task_handlers.go

f0eb17d

gow force-pushed the cg/reset_child_workflow_id branch from ab1a0c3 to f0eb17d Compare February 12, 2025 23:49

gow marked this pull request as ready for review February 12, 2025 23:52

gow added 3 commits February 12, 2025 15:57

Remove log

7a812be

Fix existing tests

1e60a0e

Merge branch 'master' of github.com:temporalio/sdk-go into cg/reset_c…

98f65da

…hild_workflow_id

yuandrew reviewed Feb 13, 2025

gow commented Feb 13, 2025
