-
Notifications
You must be signed in to change notification settings - Fork 872
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix ErrStateMachineNotFound handling in HSM state replication #7032
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not super familiar with the replication code path, there may be a better place to put this error handling logic. Were the errors you observed on the source or target cluster?
// Based on 1 and 2, node should always be found here. | ||
// The node may not be found if: | ||
// 1. The state machine was deleted (e.g. terminal state cleanup) | ||
// 2. We're missing events that created this node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not true based on the comment that you deleted.
Already done history resend if needed before,
// and node creation today always associated with an event
I would also clarify that creation and deletion are always associated with an event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please clarify the code comments, otherwise LGTM.
tag.WorkflowID(mutableState.GetExecutionInfo().WorkflowId), | ||
tag.WorkflowRunID(mutableState.GetExecutionInfo().OriginalExecutionRunId), | ||
) | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd add a sanity check that version history in the mutable state is > the one in the request (same as the one on L265. or just return that info from compareVersionHistory), and return an error otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we already check this in compareVersionHistory? In other words, an error will be returned by compareVersionHistory
if the condition you mention exists, so we won't even get to the point of a ErrStateMachineNotFound
error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm not sure I follow. Error is returned from compareVersionHistory if last version history item of the (local) mutable state is < that in the request. The check I mentioned is for > (also not the same as the >= checked in compareVersionHistory)
What changed?
HSMStateReplicatorImpl.syncHSMNode()
to handleErrStateMachineNotFound
gracefullyOriginalExecutionRunId
TestSyncHSM_StateMachineNotFound
to verify behaviorWhy?
After adding support for state deletion in terminal states in Nexus, nightly tests started failing when sync HSM tasks tried to replicate state machines that had been legitimately deleted. Since the deletion is intentional for terminal states, we should gracefully handle these cases by logging and continuing replication of other state machines.
How did you test it?
ErrStateMachineNotFound
Potential risks
ErrStateMachineNotFound
could potentially mask other issues if the error occurs for unexpected reasonsDocumentation
No documentation changes required as this is an internal implementation detail handling error cases in the replication path.
Is hotfix candidate?
No - while this fixes test failures, it's not causing production issues that would warrant a hotfix.