Centralized errors / workflow runs reporting #505

Open
mattdurant opened this issue Nov 10, 2024 · 5 comments
Labels
enhancement New feature or request priority:medium Medium priority ticket

Comments

@mattdurant
Contributor

Is your feature request related to a problem? Please describe.
Unhandled exceptions are not easy to report on today. If a workflow errors out unexpectedly, there is no centralized place to report or alert on the error unless you add error handling to the workflow itself.

Describe the solution you'd like
A few ideas:

  1. If a workflow ends because of an exception, run an alerting configuration/workflow. The idea is a system-defined "error" workflow that we could use to do whatever we need with the error.

  2. Logging, plus a report of those errors that can be viewed on demand or scheduled.

  3. A single "runs" view that shows runs for all workflows, not just the currently open one. The view can be filtered on whether the run was successful.
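
To make idea 3 concrete, here is a minimal sketch of a cross-workflow runs list with a success/failure filter. Everything in it is hypothetical and for illustration only: `WorkflowRun`, `filter_runs`, and the sample workflow IDs are made-up names, not part of Tracecat's API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class WorkflowRun:
    """Hypothetical record for one run of one workflow."""
    workflow_id: str
    started_at: datetime
    succeeded: bool
    error: Optional[str] = None


def filter_runs(
    runs: Iterable[WorkflowRun], succeeded: Optional[bool] = None
) -> List[WorkflowRun]:
    """Return runs across all workflows, newest first.

    If `succeeded` is None, every run is returned; otherwise only runs
    with the matching outcome are kept.
    """
    selected = [r for r in runs if succeeded is None or r.succeeded == succeeded]
    return sorted(selected, key=lambda r: r.started_at, reverse=True)


# Sample data (invented) covering two workflows.
runs = [
    WorkflowRun("wf-enrich", datetime(2024, 11, 9, 8, 0), True),
    WorkflowRun("wf-notify", datetime(2024, 11, 9, 9, 0), False, "secret expired"),
    WorkflowRun("wf-enrich", datetime(2024, 11, 10, 8, 0), False, "unknown namespace"),
]

failed = filter_runs(runs, succeeded=False)
for run in failed:
    print(run.workflow_id, run.started_at.isoformat(), run.error)
```

The point of the sketch is only that the filter applies to one list spanning every workflow, so a failure is visible without opening each workflow individually.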

Describe alternatives you've considered
The only way currently to tell if an error occurred is to open each workflow and look at the previous runs or inspect logs on the worker container for errors.

Additional context

@topher-lo
Contributor

topher-lo commented Nov 10, 2024

To clarify, in your mind, this centralized list of failures consolidates all errors across all workflows in a workspace, correct?

I think this makes sense

@topher-lo
Contributor

Haven't seen option 3 in a SOAR UI. But it's the default UI/UX for data workflow orchestrators like Prefect and Airflow. And it makes...a lot of sense.

Especially because Tracecat encourages smaller composable workflows.

A list of all workflow runs is the most elegant option. It also gives nice filtering / search capabilities.

@mattdurant
Contributor Author

Yes. As an example, when we moved our production instance from somewhere around 0.9 to 0.13, some of our workflows broke because namespaces had been renamed or the configuration for UDFs had changed. There was no way to know this without proactively inspecting each workflow's recent runs. It highlighted a bigger issue: something like a secret expiring could cause an integration to stop working without anyone knowing.

@mattdurant
Contributor Author

> Haven't seen option 3 in a SOAR UI. But it's the default UI/UX for data workflow orchestrators like Prefect and Airflow. And it makes...a lot of sense.
>
> Especially because Tracecat encourages smaller composable workflows.
>
> A list of all workflow runs is the most elegant option. It also gives nice filtering / search capabilities.

I think a big piece is going to be whether the system can also notify on recurring errors in a workflow. I think we discussed something like this at one point: being able to see the "health" of a workflow. A notification when a workflow's health declines would be helpful.
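
One way to read "health" here is as a success rate over the most recent runs, with a notification only when it first drops below a threshold. The sketch below assumes that definition; `WorkflowHealth`, the window size, and the threshold are hypothetical choices, not anything Tracecat provides.

```python
from collections import deque


class WorkflowHealth:
    """Hypothetical rolling success-rate tracker for a single workflow."""

    def __init__(self, window: int = 10, threshold: float = 0.5) -> None:
        # Only the last `window` outcomes are kept.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold
        self.alerting = False

    def record(self, succeeded: bool) -> bool:
        """Record one run outcome.

        Returns True only on the transition from healthy to unhealthy,
        so a workflow that keeps failing does not notify on every run.
        """
        self.outcomes.append(succeeded)
        success_rate = sum(self.outcomes) / len(self.outcomes)
        unhealthy = success_rate < self.threshold
        should_notify = unhealthy and not self.alerting
        self.alerting = unhealthy
        return should_notify


health = WorkflowHealth(window=4, threshold=0.5)
notifications = [health.record(ok) for ok in [True, False, False, False]]
```

With this window and threshold, the third outcome is the first to push the success rate below 0.5, so that is the only run that would trigger a notification.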

@topher-lo
Contributor

topher-lo commented Nov 10, 2024

> Yes. As an example, when we moved our production instance from somewhere around 0.9 to 0.13, some of our workflows broke because namespaces had been renamed or the configuration for UDFs had changed. There was no way to know this without proactively inspecting each workflow's recent runs. It highlighted a bigger issue: something like a secret expiring could cause an integration to stop working without anyone knowing.

The namespaces issue should have been caught on our end. Moving forward we will manage migrations for integration namespaces directly in Alembic.

Another point: we're also revamping the workflow table to show: 1. last run date, 2. last run failure date, and 3. a general status flag.
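
As a rough illustration of those three columns, the sketch below derives them from a flat list of runs. The tuple layout, the `summarize` function, and the "ok" / "failing" status values are assumptions made for the example, not the actual table schema.

```python
from datetime import datetime
from typing import Dict, Iterable, Tuple

# Hypothetical run shape: (workflow_id, finished_at, succeeded)
Run = Tuple[str, datetime, bool]


def summarize(runs: Iterable[Run]) -> Dict[str, dict]:
    """Build one summary row per workflow from its runs."""
    summary: Dict[str, dict] = {}
    # Process runs oldest to newest so later runs overwrite earlier ones.
    for workflow_id, finished_at, succeeded in sorted(runs, key=lambda r: r[1]):
        row = summary.setdefault(
            workflow_id, {"last_run": None, "last_failure": None, "status": "unknown"}
        )
        row["last_run"] = finished_at
        # The status flag reflects the most recent run only.
        row["status"] = "ok" if succeeded else "failing"
        if not succeeded:
            row["last_failure"] = finished_at
    return summary


table = summarize(
    [
        ("wf-enrich", datetime(2024, 11, 9), False),
        ("wf-enrich", datetime(2024, 11, 10), True),
        ("wf-notify", datetime(2024, 11, 10), False),
    ]
)
```

Keeping the last failure date separate from the status flag means a workflow that has since recovered still shows when it last broke.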

@topher-lo topher-lo added the enhancement New feature or request label Nov 10, 2024
@topher-lo topher-lo changed the title Centralized error reporting Centralized errors / workflow runs reporting Nov 10, 2024
@topher-lo topher-lo added the priority:medium Medium priority ticket label Nov 10, 2024
2 participants