Failures in complex software systems have compounding effects. A failure in a single component may result in multiple alerts for different rule types alerting on numerous signals.
To help SREs make sense of these fast-moving alerts, we'll group alerts into a hypothesized "incident" by identifying groups of possibly related alerts.
@dominiqueclarke Just for clarification, is an incident a new concept that will have a dedicated page (a details view), similar to investigations and cases? How would it differ from those concepts? Maybe one of the sub-tickets already has this information and I'm just not aware of it.
@maryam-saeidi apologies, we've been discussing this internally, but there isn't much information on this ticket yet.
An incident is an opinionated grouping of alerts. It's our attempt to help customers make sense of their wall of alerts with a more curated view of a group of alerts that may be related.
While we currently allow users to manually group by rule name, source, or a custom field, we want to provide a curated list of possibly interesting groupings that could indicate an incident.
As an MVP, these groupings will be ephemeral and not idempotent. If you visit the page multiple times, you may get different suggestions.
Andrew has used the idea of Spotify playlists as an analogy. Spotify offers curated playlist suggestions: "You might like..."
Similarly, we want to highlight groupings of alerts that may be of interest to the user, even if we're not completely confident they represent an incident. "You might be interested in this group... because all these alerts fired at the same time." "You might be interested in this group... because they are all for the same Kubernetes cluster."
This concept of an incident only relates to these groups. Currently, no other features, such as collaboration, are part of this concept. We aren't currently planning to build full incident management capabilities. We can enable incidents to be added to a case.
The concept of an investigation is also in flux, with the original concept of a new independent Kibana asset being sunset for the time being.
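To make the two example reasons concrete, here is a rough sketch of how such suggestions might be computed. It is only an illustration under assumptions: the `Alert` shape, the `SuggestedGroup` type, the `groupByCluster` and `groupByTime` helpers, and the `windowMs` threshold are all hypothetical names invented for this example, not the actual Kibana alert schema or the planned implementation.

```typescript
// Hypothetical alert shape; field names are assumptions for illustration,
// not the real Kibana alert schema.
interface Alert {
  id: string;
  ruleName: string;
  startedAt: number; // epoch milliseconds
  cluster?: string; // e.g. a Kubernetes cluster name, when known
}

// A suggested grouping plus the human-readable reason it was suggested.
interface SuggestedGroup {
  reason: string;
  alertIds: string[];
}

// Suggest one group per cluster that has more than one alert.
function groupByCluster(alerts: Alert[]): SuggestedGroup[] {
  const idsByCluster = new Map<string, string[]>();
  for (const alert of alerts) {
    if (alert.cluster === undefined) continue;
    const ids = idsByCluster.get(alert.cluster) ?? [];
    ids.push(alert.id);
    idsByCluster.set(alert.cluster, ids);
  }
  const groups: SuggestedGroup[] = [];
  idsByCluster.forEach((alertIds, cluster) => {
    if (alertIds.length > 1) {
      groups.push({ reason: `same Kubernetes cluster (${cluster})`, alertIds });
    }
  });
  return groups;
}

// Suggest groups of alerts that started close together in time.
// Alerts are sorted by start time; a new group begins whenever the gap
// to the previous alert exceeds windowMs.
function groupByTime(alerts: Alert[], windowMs: number): SuggestedGroup[] {
  const sorted = alerts.slice().sort((a, b) => a.startedAt - b.startedAt);
  const groups: SuggestedGroup[] = [];
  let current: Alert[] = [];
  const flush = (): void => {
    // Only a run of two or more alerts is worth suggesting as a group.
    if (current.length > 1) {
      groups.push({
        reason: "fired at about the same time",
        alertIds: current.map((a) => a.id),
      });
    }
    current = [];
  };
  for (const alert of sorted) {
    const last = current[current.length - 1];
    if (last !== undefined && alert.startedAt - last.startedAt > windowMs) {
      flush();
    }
    current.push(alert);
  }
  flush();
  return groups;
}
```

Each helper returns a list of candidate groups along with the reason shown to the user, so the same alert can appear under more than one suggestion. That seems to fit the "you might be interested in..." framing better than forcing every alert into exactly one incident.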