
[BUG] Collisions on subworkflow launchplan execution IDs #2778

Closed
2 tasks done
hamersaw opened this issue Aug 17, 2022 · 3 comments · Fixed by flyteorg/flytepropeller#476 · May be fixed by flyteorg/flytepropeller#474
Labels
bug Something isn't working

Comments

@hamersaw
Contributor

Describe the bug

When a launchplan is executed as part of a Flyte workflow, FlytePropeller is responsible for generating an execution ID. This algorithm must be deterministic so that subsequent workflow processing checks the correct launchplan execution ID to retrieve status. Currently it uses a 32-bit hash of the subworkflow node ID. However, this only produces an 8-character execution ID (actually 7 characters plus 1 of padding). With a large number of launchplan executions, collisions may occur.
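To make the failure mode concrete, here is a minimal sketch of deterministic ID generation. The FNV-1a hash and hex encoding are illustrative assumptions, not Propeller's actual encoding; the point is the size of the space: with 32 bits, the birthday bound puts a ~50% collision probability at only about 2^16 (~65k) generated IDs.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// generateID32 sketches the current scheme: a deterministic 32-bit
// hash of the node ID, rendered as an 8-character string. Same input
// always yields the same ID, which is what lets subsequent workflow
// evaluations find the launchplan execution again.
func generateID32(nodeID string) string {
	h := fnv.New32a()
	h.Write([]byte(nodeID))
	return fmt.Sprintf("%08x", h.Sum32())
}

func main() {
	fmt.Println(generateID32("n0"))
	// Deterministic: repeated calls with the same node ID agree.
	fmt.Println(generateID32("n0") == generateID32("n0"))
}
```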

Expected behavior

Launchplan execution IDs should never collide.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@hamersaw hamersaw added the bug Something isn't working label Aug 17, 2022
@hamersaw hamersaw added this to the 1.2.0-candidate milestone Aug 17, 2022
@hamersaw
Contributor Author

For an immediate fix we will update the hashing algorithm to 64-bit (or 128-bit?) to increase the length of the hash. This should significantly reduce the probability of a collision. However, since a collision could still occur with very small probability, we should implement additional logic to check for an existing execution before starting the launchplan.
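A minimal sketch of the proposed widening, again with FNV-1a and hex encoding as illustrative assumptions: a 64-bit hash doubles the ID length to 16 characters and pushes the ~50% birthday-collision point from roughly 2^16 IDs out to roughly 2^32.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// generateID64 sketches the proposed fix: a 64-bit hash yields a
// 16-character ID instead of 8, making collisions vanishingly
// unlikely for any realistic number of launchplan executions.
func generateID64(nodeID string) string {
	h := fnv.New64a()
	h.Write([]byte(nodeID))
	return fmt.Sprintf("%016x", h.Sum64())
}

func main() {
	fmt.Println(generateID64("n0"))
}
```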

@EngHabu
Contributor

EngHabu commented Aug 23, 2022

I've considered a few different options after a brief chat with Dan.

There are three problems to solve here:

Detecting collisions [Punt]

Admin today detects collisions and returns "Already exists" iff the exec name matches an existing one. We can modify that behavior to also check the parent node exec ID column and return a different error if the exec ID matches but the parent doesn't (e.g. a "belongs to a different execution" error). Propeller can react differently to this error (possibly by erroring out early for now).

Why should we do this?
To prioritize correctness: after all the heuristics to reduce collisions, a failed workflow execution with a system error indicating an unrecoverable collision is the preferable last resort.

Reduce collisions

  1. Upgrade hashing algorithm to 64-bit
    Pros: simple change.
    Cons: no easy way to maintain backward compatibility. If an existing execution is in flight, subsequent checks for its health/status will build a different execution ID and subsequently fail.

  2. Prefix the 32-bit hash with the WF exec ID
    Cons: could potentially be longer than Admin allows (it allows 20 characters). If the original exec ID is already 20 characters and the 32-bit hash is 8 characters, we can then remove the last 8 characters of the original exec ID...
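Option 2 above can be sketched as follows (function names are invented for illustration): append the 8-character 32-bit hash to the parent execution ID, truncating the parent so the combined ID stays within the 20-character limit, i.e. a 20-character parent keeps only its first 12 characters.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// maxExecIDLen is the 20-character Admin limit mentioned above.
const maxExecIDLen = 20

// prefixedID builds <truncated parent exec ID> + <8-char 32-bit hash>.
func prefixedID(parentExecID, nodeID string) string {
	h := fnv.New32a()
	h.Write([]byte(nodeID))
	suffix := fmt.Sprintf("%08x", h.Sum32())

	// Keep at most maxExecIDLen-8 = 12 characters of the parent ID.
	keep := maxExecIDLen - len(suffix)
	if len(parentExecID) > keep {
		parentExecID = parentExecID[:keep]
	}
	return parentExecID + suffix
}

func main() {
	// A 20-character parent gets truncated so the result stays at 20.
	fmt.Println(len(prefixedID("abcdefghijklmnopqrst", "n0")))
}
```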

Backward compatibility

  1. Add a static Version in NodeStatus/WFNodeState
    Similar to EventVersion in Admin, we can have a global NodeStatusVersion field and store it in every node status (or at the CRD level). We can then bump it to indicate the change in hashing algorithm: new executions use the new hashing algorithm and in-flight ones keep using the 32-bit one.
    Pros: straightforward to implement.
    Cons: leaves if/elses all around, potentially making debugging harder.

  2. Add HashingAlgorithm to the WFNodeState
    (Similar to the option before.) New launchplan executions always use 64-bit and record that fact in the WF node state. When checking LP status, if the algorithm bit is set we use it; otherwise fall back to the 32-bit algorithm.

  3. Add a static DefinitionVersion to the CRD
    Similar to the above solution, but it can be rolled forward for any backward-incompatible changes to the CRD.
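The backward-compatibility options above share one mechanism, sketched here with invented type and field names (this is not Propeller's actual state struct): the algorithm chosen at launch time is recorded alongside the node state, so later status checks rebuild the same execution ID even after Propeller changes its default.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

type HashAlg int

const (
	// Zero value: legacy/in-flight state with no recorded algorithm
	// falls back to the original 32-bit scheme.
	HashAlg32 HashAlg = iota
	HashAlg64
)

// WFNodeState stands in for the workflow node state that would carry
// the recorded hashing algorithm.
type WFNodeState struct {
	Alg HashAlg
}

// executionID rebuilds the ID using whichever algorithm the state
// recorded at launch time.
func executionID(state WFNodeState, nodeID string) string {
	switch state.Alg {
	case HashAlg64:
		h := fnv.New64a()
		h.Write([]byte(nodeID))
		return fmt.Sprintf("%016x", h.Sum64())
	default:
		h := fnv.New32a()
		h.Write([]byte(nodeID))
		return fmt.Sprintf("%08x", h.Sum32())
	}
}

func main() {
	fmt.Println(executionID(WFNodeState{}, "n0"))                // legacy 8-char ID
	fmt.Println(executionID(WFNodeState{Alg: HashAlg64}, "n0")) // new 16-char ID
}
```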

@hamersaw
Contributor Author

@EngHabu my only comment concerns maintaining versions in different locations moving forward. For example, if we choose to add a CRD version we will have (1) individual component versions (i.e. eventing like you mentioned, plus the current others), (2) our newly defined CRD version, incremented for any changes moving forward, and (3) the k8s CRD resource version (currently v1alpha1). I know we discussed bumping the latter if we update the CRD to store strict binary for space savings (what is the relationship between a v1beta1 release and our Flyte workflow version?). I don't feel strongly about any of these options, but I think it is important to give some thought to the implications of adding another version. Whichever we choose, we can probably do some refactoring to clean up versions that don't conform to the new "standard".
