
[BUG] Collisions on subworkflow launchplan execution IDs #2778

Closed
2 tasks done
hamersaw opened this issue Aug 17, 2022 · 3 comments · Fixed by flyteorg/flytepropeller#476 · May be fixed by flyteorg/flytepropeller#474
Labels
bug Something isn't working

Comments

@hamersaw
Contributor

Describe the bug

When a launchplan is executed as part of a Flyte workflow, FlytePropeller is responsible for generating an execution ID. This algorithm must be deterministic so that subsequent workflow processing checks the correct launchplan execution ID to retrieve status. Currently it uses a 32-bit hash of the subworkflow node ID. However, this only produces an 8-character execution ID (actually 7 characters plus 1 of padding). With a large number of launchplan executions, collisions may occur.
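To make the failure mode concrete, here is a minimal sketch of deterministic ID generation. The FNV-1a hash and hex encoding are illustrative assumptions, not Propeller's actual encoding; the point is the size of the space: with 32 bits, the birthday bound puts a ~50% collision probability at only about 2^16 (~65k) generated IDs.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// generateID32 sketches the current scheme: a deterministic 32-bit
// hash of the node ID, rendered as an 8-character string. Same input
// always yields the same ID, which is what lets subsequent workflow
// evaluations find the launchplan execution again.
func generateID32(nodeID string) string {
	h := fnv.New32a()
	h.Write([]byte(nodeID))
	return fmt.Sprintf("%08x", h.Sum32())
}

func main() {
	fmt.Println(generateID32("n0"))
	// Deterministic: repeated calls with the same node ID agree.
	fmt.Println(generateID32("n0") == generateID32("n0"))
}
```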

Expected behavior

Launchplan execution IDs should never collide.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@hamersaw hamersaw added the bug Something isn't working label Aug 17, 2022
@hamersaw hamersaw added this to the 1.2.0-candidate milestone Aug 17, 2022
@hamersaw
Contributor Author

For an immediate fix we will update the hashing algorithm to 64-bit (or 128-bit?) to increase the length of the hash. This should significantly reduce the probability of a collision. However, since a collision could still occur with very small probability, we should implement additional logic to check for an existing execution before starting the launchplan.
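A minimal sketch of the proposed widening, again with FNV-1a and hex encoding as illustrative assumptions: a 64-bit hash doubles the ID length to 16 characters and pushes the ~50% birthday-collision point from roughly 2^16 IDs out to roughly 2^32.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// generateID64 sketches the proposed fix: a 64-bit hash yields a
// 16-character ID instead of 8, making collisions vanishingly
// unlikely for any realistic number of launchplan executions.
func generateID64(nodeID string) string {
	h := fnv.New64a()
	h.Write([]byte(nodeID))
	return fmt.Sprintf("%016x", h.Sum64())
}

func main() {
	fmt.Println(generateID64("n0"))
}
```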

@EngHabu
Contributor

EngHabu commented Aug 23, 2022

I've considered a few different options after a brief chat with Dan.

There are three problems to solve here:

Detecting collisions [Punt]

Admin today detects collisions and returns "Already exists" iff the exec name matches an existing one. We can modify that behavior to also check the parent node exec ID column and return a different error if the exec ID matches but the parent doesn't (e.g. a "belongs to a different execution" error). Propeller can react differently to this error (possibly by erroring out early for now).

Why should we do this?
To prioritize correctness: after all the heuristics to reduce collisions, a failed workflow execution with a system error indicating an unrecoverable collision is the preferable last resort.

Reduce collisions

  1. Upgrade hashing algorithm to 64-bit
    Pros: simple change.
    Cons: no easy way to maintain backward compatibility. If an existing execution is in flight, subsequent checks for its health/status will build a different execution ID and subsequently fail.

  2. Prefix the 32-bit hash with the WF exec ID
    Cons: could potentially be longer than Admin allows (it allows 20 characters). If the original exec ID is already 20 characters and the 32-bit hash is 8 characters, we can then remove the last 8 characters of the original exec ID...
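Option 2 above can be sketched as follows (function names are invented for illustration): append the 8-character 32-bit hash to the parent execution ID, truncating the parent so the combined ID stays within the 20-character limit, i.e. a 20-character parent keeps only its first 12 characters.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// maxExecIDLen is the 20-character Admin limit mentioned above.
const maxExecIDLen = 20

// prefixedID builds <truncated parent exec ID> + <8-char 32-bit hash>.
func prefixedID(parentExecID, nodeID string) string {
	h := fnv.New32a()
	h.Write([]byte(nodeID))
	suffix := fmt.Sprintf("%08x", h.Sum32())

	// Keep at most maxExecIDLen-8 = 12 characters of the parent ID.
	keep := maxExecIDLen - len(suffix)
	if len(parentExecID) > keep {
		parentExecID = parentExecID[:keep]
	}
	return parentExecID + suffix
}

func main() {
	// A 20-character parent gets truncated so the result stays at 20.
	fmt.Println(len(prefixedID("abcdefghijklmnopqrst", "n0")))
}
```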

Backward compatibility

  1. Add a static Version in NodeStatus/WFNodeState
    Similar to EventVersion in Admin, we can have a global NodeStatusVersion field and store it in every node status (or at the CRD level). We can then bump it to indicate the change in hashing algorithm: new executions use the new hashing algorithm and in-flight ones keep using the 32-bit one.
    Pros: straightforward to implement.
    Cons: leaves if/elses all around, potentially making debugging harder.

  2. Add HashingAlgorithm to the WFNodeState
    (Similar to the option before.) New launchplan executions always use 64-bit and record that fact in the WF node state. When checking LP status, if the algorithm bit is set we use it; otherwise fall back to the 32-bit algorithm.

  3. Add a static DefinitionVersion to the CRD
    Similar to the above solution, but it can be rolled forward for any backward-incompatible changes to the CRD.
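The backward-compatibility options above share one mechanism, sketched here with invented type and field names (this is not Propeller's actual state struct): the algorithm chosen at launch time is recorded alongside the node state, so later status checks rebuild the same execution ID even after Propeller changes its default.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

type HashAlg int

const (
	// Zero value: legacy/in-flight state with no recorded algorithm
	// falls back to the original 32-bit scheme.
	HashAlg32 HashAlg = iota
	HashAlg64
)

// WFNodeState stands in for the workflow node state that would carry
// the recorded hashing algorithm.
type WFNodeState struct {
	Alg HashAlg
}

// executionID rebuilds the ID using whichever algorithm the state
// recorded at launch time.
func executionID(state WFNodeState, nodeID string) string {
	switch state.Alg {
	case HashAlg64:
		h := fnv.New64a()
		h.Write([]byte(nodeID))
		return fmt.Sprintf("%016x", h.Sum64())
	default:
		h := fnv.New32a()
		h.Write([]byte(nodeID))
		return fmt.Sprintf("%08x", h.Sum32())
	}
}

func main() {
	fmt.Println(executionID(WFNodeState{}, "n0"))                // legacy 8-char ID
	fmt.Println(executionID(WFNodeState{Alg: HashAlg64}, "n0")) // new 16-char ID
}
```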

@hamersaw
Contributor Author

@EngHabu my only comment concerns maintaining versions in different locations moving forward. For example, if we choose to add a CRD version we will have (1) individual component versions (i.e. eventing like you mentioned, plus the current others), (2) our newly defined CRD version, incremented for any changes moving forward, and (3) the k8s CRD resource version (currently v1alpha1). I know we discussed bumping the latter if we update the CRD to store strict binary for space savings (what is the relationship between a v1beta1 release and our Flyte workflow version?). I don't feel strongly about any of these options, but I think it is important to give some thought to the implications of adding another version. Whichever we choose, we can probably do some refactoring to clean up versions that don't conform to the new "standard".
