[Meta] [Investigate] Alert Incidents #209391

Open
6 tasks
dominiqueclarke opened this issue Feb 3, 2025 · 3 comments
Labels
Team:obs-ux-management Observability Management User Experience Team

Comments

@dominiqueclarke
Contributor

dominiqueclarke commented Feb 3, 2025

Failures in complex software systems have compounding effects. A failure in a single component may result in multiple alerts for different rule types alerting on numerous signals.

To help SREs make sense of these fast-moving alerts, we'll group alerts into a hypothesized "incident" by identifying groups of possibly related alerts.

@dominiqueclarke dominiqueclarke added the Team:obs-ux-management Observability Management User Experience Team label Feb 3, 2025
@elasticmachine
Contributor

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

@maryam-saeidi
Member

@dominiqueclarke Just for clarification, is an incident a new concept that will have a dedicated page (a details view), similar to investigations and cases? How would it differ from those concepts? Maybe one of the sub-tickets already has this information and I am not aware of it.

@dominiqueclarke
Contributor Author

@maryam-saeidi apologies, we've been discussing this internally, but not much of that information has made it onto this ticket.

An incident is an opinionated grouping of alerts. It's our attempt to help customers make sense of their wall of alerts by offering a more curated view of a group of alerts that may be related.

While we currently allow users to manually group by rule name, source, or a custom field, we want to provide a curated list of possibly interesting groupings that could indicate an incident.

For the MVP, these groupings will be ephemeral and not idempotent: if you visit the page multiple times, you may get different suggestions.

Andrew has used Spotify playlists as an analogy. Spotify offers curated playlist suggestions: "You might like..."

Similarly, we want to highlight groupings of alerts that may be of interest to the user, even if we're not completely confident they represent an incident. "You might be interested in this group... because all these alerts fired at the same time." "You might be interested in this group... because they are all for the same Kubernetes cluster."
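
To make the two example suggestions above concrete, here is a minimal sketch of how such groupings could be computed. It is an illustration only, not the team's implementation: the `Alert` and `SuggestedGroup` shapes, both function names, and the `kubernetes.cluster.name` field used in the usage note are all assumptions made for this example.

```typescript
// Hypothetical alert shape; field names are illustrative, not Kibana's alert schema.
interface Alert {
  id: string;
  ruleName: string;
  startedAt: number; // epoch milliseconds
  fields: Record<string, string>;
}

// A suggested grouping plus the human-readable reason it was suggested.
interface SuggestedGroup {
  reason: string;
  alertIds: string[];
}

function toGroup(alerts: Alert[], reason: string): SuggestedGroup {
  return { reason, alertIds: alerts.map((a) => a.id) };
}

// "...because all these alerts fired at the same time":
// chain alerts whose start times are within `windowMs` of the previous alert.
function groupByTimeProximity(alerts: Alert[], windowMs: number): SuggestedGroup[] {
  const sorted = alerts.slice().sort((a, b) => a.startedAt - b.startedAt);
  const groups: SuggestedGroup[] = [];
  let current: Alert[] = [];
  for (const alert of sorted) {
    const last = current[current.length - 1];
    if (last !== undefined && alert.startedAt - last.startedAt > windowMs) {
      // Gap is too large: close the running group (only groups of 2+ are suggested).
      if (current.length > 1) groups.push(toGroup(current, 'fired at about the same time'));
      current = [];
    }
    current.push(alert);
  }
  if (current.length > 1) groups.push(toGroup(current, 'fired at about the same time'));
  return groups;
}

// "...because they are all for the same Kubernetes cluster":
// bucket alerts by the value of one field and keep buckets with 2+ alerts.
function groupBySharedField(alerts: Alert[], field: string): SuggestedGroup[] {
  const byValue = new Map<string, Alert[]>();
  for (const alert of alerts) {
    const value = alert.fields[field];
    if (value === undefined) continue; // alert does not carry this field
    const bucket = byValue.get(value) || [];
    bucket.push(alert);
    byValue.set(value, bucket);
  }
  const groups: SuggestedGroup[] = [];
  byValue.forEach((bucket, value) => {
    if (bucket.length > 1) groups.push(toGroup(bucket, 'share ' + field + '=' + value));
  });
  return groups;
}
```

Calling `groupByTimeProximity(alerts, 60000)` and `groupBySharedField(alerts, 'kubernetes.cluster.name')` over the currently active alerts would give two independent lists of suggestions, each carrying the reason shown to the user. Because nothing is persisted, the suggestions are recomputed on each visit, in line with the ephemeral MVP described above.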

This concept of an incident relates only to these groups. Currently, no other features, such as collaboration, are encompassed in this concept. We don't currently plan to build full incident management capabilities. We can enable incidents to be added to a case.

The concept of an investigation is also in flux, with the original concept of a new independent Kibana asset being sunset for the time being.
