Alerts blocked from reaching notification channel when templating issues happen #2885
-
I agree that this is a visibility problem from a user's perspective. As a point of comparison, in Prometheus, when your rules and alerts fail to evaluate or produce errors, you can clearly see that information as part of the UI. In the Alertmanager, it feels like we're missing something that could provide the same kind of information about what happened to a notification.
The current UI does a magnificent job of giving you information about the current alerts and whether or not they've been received by the Alertmanager, but end-user visibility into what's happening with the notification pipeline is certainly something we could improve. Besides templating errors, are there other things you'd like more information on as an end user?
-
@gotjosh Having some view of "broken notifications", while nice, won't really address the issue that an alert may fire at 3am and not notify/page whoever we need to fix things; that is the main issue here. I'd expect the Alertmanager, if any template failed to render, to still send a notification, even if only with some fallback value. The Grafana Labs managed Alertmanager provides a counter for "errors to send notifications", which lets us detect that something bad happened, but not what exactly. This means that when this generic failure triggers, it has to go to a generic on-call, who then has to access the Alertmanager logs (via an urgent support ticket), figure out why the notification failed (e.g. a timeout, a 403, or... a template issue), and then manually notify the correct owner. If we chase exposing some view of "this notification failed", it'd certainly aid debugging, but it wouldn't ensure that alert > notification happens even when "recoverable" errors occur inside the Alertmanager (even if it's the user's fault, such as a bad template).
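For what it's worth, the generic detection I'm describing can be approximated against the open-source Alertmanager's own metrics. A minimal sketch, assuming you scrape the Alertmanager with Prometheus; the rule name, window, and severity are illustrative, and it still only tells you that some notification failed, not which alert or why:

```yaml
# Hypothetical meta-alert: fires when any Alertmanager integration fails to
# deliver notifications. alertmanager_notifications_failed_total is the
# upstream failure counter; everything else here is an illustrative choice.
groups:
  - name: alertmanager-meta
    rules:
      - alert: AlertmanagerNotificationsFailing
        expr: sum by (integration) (increase(alertmanager_notifications_failed_total[15m])) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Alertmanager is failing to deliver notifications via {{ $labels.integration }}"
```

Of course this has exactly the limitation described above: the page goes to whoever owns the meta-alert, not to the owner of the alert whose notification was dropped.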
-
What did you do?
When defining a receiver in the Alertmanager, we reference a template named 'connect.k8s.url' in the receiver's actions. By an honest mistake, that template did not exist / was not defined.
What did you expect to see?
We expected the receiver to push the alert through its notification channel, even if some referenced template is non-existent / not defined.
What did you see instead? Under which circumstances?
The alert was blackholed (it never went through the receiver endpoint). Troubleshooting was cumbersome, but eventually we found a log entry with the following message:
msg="Notify for alerts failed" num_alerts=1 err="connect_slack-connect-alert-validation_production/slack[0]: notify retry canceled due to unrecoverable error after 1 attempts: template: :1:12: executing "" at <>: template "connect.k8s.url" not defined"
Environment
Grafana Cloud Production Instance
Grafana Cloud Hosted Alertmanager (latest version)