mt-ob
Collaborator

@mt-ob mt-ob commented Aug 26, 2025

Distinguish the task that failed on its own merit (the "root cause") from the tasks that were simply terminated as collateral damage.

This PR adds a new metadata field, metaflow.task_fail_reason, which takes one of the following values:

  • exception: an exception was raised in the user's step code
  • signal_sigint: the task was interrupted by the user (Ctrl+C)
  • signal_sigterm: the task received an external termination signal (from K8s etc.)
  • killed: reported by the orchestrator to flag the tasks it terminated as collateral damage
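
As a rough illustration of how downstream code might treat these values, here is a minimal sketch. The `TaskFailReason` enum and the `is_collateral` helper are hypothetical names introduced for this example and are not part of Metaflow; only the four string values come from the description above.

```python
from enum import Enum


class TaskFailReason(str, Enum):
    """Hypothetical enum mirroring the four values listed in this PR."""

    EXCEPTION = "exception"
    SIGNAL_SIGINT = "signal_sigint"
    SIGNAL_SIGTERM = "signal_sigterm"
    KILLED = "killed"


def is_collateral(reason):
    """Return True if the reason marks a task terminated by the orchestrator."""
    return reason == TaskFailReason.KILLED.value


print(is_collateral("killed"))     # a sibling terminated by the orchestrator
print(is_collateral("exception"))  # a task that failed on its own
```

Mixing in `str` lets the enum members compare equal to the raw strings stored in metadata, so callers can pass either form.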

Test Cases:

  1. One task in a foreach failed due to a Python exception.

Task: 3 -> Status: False, Failure Reason: exception  <-- Root Cause
Task: 4 -> Status: False, Failure Reason: killed      <-- Collateral Damage
Task: 2 -> Status: False, Failure Reason: killed      <-- Collateral Damage

  2. User interruption (Ctrl+C).
    All tasks that were active at the time of the interruption were marked with signal_sigint, attributing the failure to a direct user action.

  3. A task failed with an exception and, immediately afterwards, the run was interrupted with Ctrl+C before the runtime's cleanup completed.

Task: 3 -> Status: False, Failure Reason: exception
Task: 4 -> Status: False, Failure Reason: signal_sigint
Task: 2 -> Status: False, Failure Reason: signal_sigint
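
The test cases above can be summarized programmatically. The sketch below is only an illustration: `split_failures` is a hypothetical helper, and it assumes that every reason other than killed counts as a direct failure, which is an interpretation of this PR rather than something the PR defines.

```python
def split_failures(reasons):
    """Split task ids into direct failures and collateral damage.

    `reasons` maps task id -> task_fail_reason value (or None when unset).
    Assumption: only "killed" marks collateral damage.
    """
    direct, collateral = [], []
    for task_id, reason in reasons.items():
        if reason is None:
            continue  # no failure reason recorded for this task
        if reason == "killed":
            collateral.append(task_id)
        else:
            direct.append(task_id)
    return direct, collateral


# Data taken from the foreach exception test case.
direct, collateral = split_failures({"3": "exception", "4": "killed", "2": "killed"})
print("direct:", direct, "collateral:", collateral)
```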


mt-ob commented Aug 26, 2025

Supersedes #1731


mt-ob commented Aug 26, 2025

The failure reasons can be verified with this snippet:

>>> from metaflow import Run
>>> latest_run = Run("FailureTestFlow/1756199905447021")
>>> for task in latest_run['fail_step']:
...     # Look up the fail-reason metadata entry; None if the task has none.
...     reason = next((datum.value for datum in task.metadata if datum.type == 'metaflow.task_fail_reason'), None)
...     print(f"Task: {task.id} -> Status: {task.successful}, Failure Reason: {reason}")
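
The lookup in the snippet can also be factored into a small reusable function. This is a sketch only: `fail_reason`, `FakeDatum`, and `FakeTask` are hypothetical names, and the fake objects stand in for Metaflow tasks so the example runs without a metadata service. It assumes, as the snippet does, that each metadata entry exposes `type` and `value` attributes.

```python
from collections import namedtuple

FAIL_REASON_TYPE = "metaflow.task_fail_reason"


def fail_reason(task):
    """Return the task's fail reason, or None if no such metadata entry exists."""
    return next(
        (datum.value for datum in task.metadata if datum.type == FAIL_REASON_TYPE),
        None,
    )


# Hypothetical stand-ins for a Metaflow task and its metadata entries.
FakeDatum = namedtuple("FakeDatum", ["type", "value"])
FakeTask = namedtuple("FakeTask", ["id", "metadata"])

task = FakeTask("3", [FakeDatum(FAIL_REASON_TYPE, "exception")])
print(f"Task: {task.id} -> Failure Reason: {fail_reason(task)}")
```

With a real run, the same function could be applied to each task yielded by iterating over a step, as in the snippet above.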
