Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cancel pushover notifications when the alert resolves itself #4194

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

kadaan
Copy link

@kadaan kadaan commented Jan 3, 2025

Pushover emergency priority (2) alerts will continue to require acknowledgement even when the alert is resolved in Alertmanager. Currently a resolution message is sent to Pushover but this does not cancel the emergency alert. To support cancellation of emergency alerts Pushover added the ability to attach tags to an alert and a new route (https://api.pushover.net/1/receipts/cancel_by_tag/(your tag).json) which can be used to cancel all emergency priority alerts which have the specified tag.

To support this, the Pushover notifier will now add a tag to outgoing messages which emergency priority: groupKey=<group key hash>. When an alert is resolved the notifier will send the resolution message. When that is complete and the priority is 2, the cancel_by_tag request will be sent: https://api.pushover.net/1/receipts/cancel_by_tag/groupKey=<group key hash>.json

@jcs Can you confirm that my usage of the cancel_by_tag API is correct. There is a little ambiguity in the docs as to whether the request should be formatted like https://api.pushover.net/1/receipts/cancel_by_tag/groupKey=<group key hash>.json or like https://api.pushover.net/1/receipts/cancel_by_tag/<group key hash>.json

@kadaan kadaan force-pushed the support_pushover_emergecy_cancel_by_tags branch from a17cd5f to 7b2743b Compare January 3, 2025 17:39
@jcs
Copy link

jcs commented Jan 3, 2025

@jcs Can you confirm that my usage of the cancel_by_tag API is correct. There is a little ambiguity in the docs as to whether the request should be formatted like https://api.pushover.net/1/receipts/cancel_by_tag/groupKey=.json or like https://api.pushover.net/1/receipts/cancel_by_tag/.json

Hi, since you are sending a tags parameter of the result of groupKey = fmt.Sprintf("groupKey=%s", key.Hash()), you would need to use cancel_by_tag/groupKey=<group key hash>.json. The tags parameter does not need to be a key/value so you can make it groupKey_<hash> or whatever if that makes it easier to understand.

Though Pushover already has a concept of group keys for delivery groups which I'm assuming are different from what this group key means in Prometheus, so prefixing the tag with groupKey= may be confusing when trying to troubleshoot issues sending notifications to a Pushover delivery group. Perhaps naming it PrometheusGroupKey or whatever the group key actually means in this context might make it clearer.

@kadaan
Copy link
Author

kadaan commented Jan 3, 2025

@jcs Thanks for the feedback. I've changed it to promAlertGroupKey=<groupKey_hash> An example of a tag is promAlertGroupKey_d4735e3a265e16eee03f59718b9b5d03019c07d8b6c51f90da3a666eec13ab35

@kadaan
Copy link
Author

kadaan commented Jan 3, 2025

Ugh. I realize one of the issues here is that the priority in the alert is a template. Ideally we would only send the cancellation if the original alerts was emergency priority, but that is basically impossible to know without record keeping, and record keeping seems like it would be undesirable. The other alternative is to always send a cancel_by_tag message on resolution. @jcs Does the cancel_by_tag message count against user's quotas? I would hate to add this feature and effectively reduce their quota by 1/3.

@onedr0p
Copy link

onedr0p commented Jan 3, 2025

Not only that but it would add more strain to the pushover API server, doesn't seem efficient.

@kadaan
Copy link
Author

kadaan commented Jan 3, 2025

Agreed. Hopefully @jcs can shed some more light on that as Alertmanager should continue to be a good steward of Pushover's resources

@jcs
Copy link

jcs commented Jan 3, 2025

These requests would not count against the application's quota, though I agree that it would be best to avoid sending them unnecessarily.

@kadaan
Copy link
Author

kadaan commented Jan 3, 2025

That is good news about the quota, but I don't see a way to not send these short of storing state for each alert.

@SuperQ Can you chime in here?

@kadaan
Copy link
Author

kadaan commented Jan 3, 2025

@jcs Pending feedback from Prometheus developers, what is your feeling of this feature moving forward if keeping state is not permitted?

@kadaan
Copy link
Author

kadaan commented Jan 3, 2025

At least in the configs that I have seen, it is common that the priority is a template. One example is having a template which sets the priority to 2 if the alert is firing and the severity label is critical, 1 if it is firing and severity is not critical, and 0 otherwise. There are "hacky" ways that we could reduce the number of calls to cancel_by_tag (ie. check if 2 is in the template).

Alternatively we could introduce a config option on the receiver which would enable cancellation support. Users could configure alertmanager to have separate receiver which only handles those alerts which are critical. That receiver could still only include tags on messages with priority 2, but always send the cancel_by_tag message. This would ensure we will not abuse the Pushover API, but does require user's to correctly configure the system. Nothing would prevent them from configuring it so that a cancel_by_tag message is sent for every alert that is resolved.

@grobinson-grafana
Copy link
Contributor

Ugh. I realize one of the issues here is that the priority in the alert is a template. Ideally we would only send the cancellation if the original alerts was emergency priority, but that is basically impossible to know without record keeping, and record keeping seems like it would be undesirable. The other alternative is to always send a cancel_by_tag message on resolution. @jcs Does the cancel_by_tag message count against user's quotas? I would hate to add this feature and effectively reduce their quota by 1/3.

I think I can add some clarification here. Alertmanager does not officially support alert prioritization. Instead, it just has opaque key/value pairs via annotations and labels. As such, there is not way to really determine the priority of an alert, if that's what you would like to do. For example, some users might use a "priority" annotation, but other users might use an annotation with a completely different name. But importantly, Alertmanager does not reserve names for such things.

Priorities are only a feature of some integrations, such as Pushover. However, they are more of an output rather than an input. I.e. "given this set of alerts, what is the priority of this notification?". The nice thing about this is that it lets users use templates to implement their own logic to decide if a notification is a priority or not.

@grobinson-grafana
Copy link
Contributor

That is good news about the quota, but I don't see a way to not send these short of storing state for each alert.

@SuperQ Can you chime in here?

I'm not sure I understand why you need to store state though?

@kadaan
Copy link
Author

kadaan commented Jan 5, 2025

@grobinson-grafana Thanks for including that information to ensure that we are all on the same page in this discussion.

As for storing state, it was mentioned in the context of the prior statements. Let me recap.

Pushover supports messages having priority. Priority can be -2 to 2. Priority of 2 indicates an emergency and affects how Pushover handles the message, with one of the behaviors being it will keep alerting you unless you acknowledge the message.

The current problem is that Pushover continues to notify the user and requires message acknowledgement after Alertmanager resolves an alert.

Pushover developers added a way to ”cancel” priority 2 messages, the cancel_by_tag api.

Pushover protects their system by enforcing message quotas. Often, each Alertmanager alert ends up sending two messages to Pushover: one for firing and one for resolution. Adding a cancel_by_tag call would increase the number of calls that the Pushover systems need to handle.

In order to be a good citizen, Alertmanager shouldn’t overload the Pushover system. Ideally, we would know that an alert was fired with a priority of 2 and only call cancel_by_tag when those alerts are resolved. Because priority is a template we may not know in resolution what the priority was.

So, we have a few choices:

  1. Keep some Pushover specific state in Alertmanager so that we can determine if an alert sent a message to Pushover with a priority of 2. This would result in the optimal number of cancel_by_tag calls to Pushover.
  2. Send a cancel_by_tag call to Pushover for every resolution. This would potentially double the number of calls to Pushover.
  3. Try to determine if priority was a 2 by looking for the number 2 in the priority config. This is pretty hacky.
  4. Require the Alertmanager configurer to be a good citizen by configuring a second receiver that only handles alerts which send priority 2 messages in Pushover. Allow that receiver to be configured to send cancel_by_tag messages.

I would like to determine a solution that is acceptable to both the Alertmanager and Pushover maintainers.

@grobinson-grafana
Copy link
Contributor

Ideally, we would know that an alert was fired with a priority of 2 and only call cancel_by_tag when those alerts are resolved. Because priority is a template we may not know in resolution what the priority was.

But don't you know that from the priority variable? Why can't you assume that a resolved notification has the same priority as it's previous firing notifications(s)?

@kadaan
Copy link
Author

kadaan commented Jan 5, 2025

@grobinson-grafana The priority of the resolution message itself is immaterial. If you choose to send the resolution message with priority=2, that is on you. But the initially fired message with priority=2 will continue to require acknowledgment unless cancel_by_tag is called. This is all about improving the common workflow for people that use Pushover for notification.

@kadaan
Copy link
Author

kadaan commented Jan 5, 2025

Often resolution messages will have lower priority than firing messages. The priority dictates how the app responds. If the priority is 2, the app will require acknowledgement, ignores quiet hours (configurable), triggers a critical alert (iOS, configurable), etc. It would be a bad experience if resolution messages would alert without respect for quiet hous.

@kadaan
Copy link
Author

kadaan commented Jan 5, 2025

Much of this was discussed in: #634

@grobinson-grafana
Copy link
Contributor

grobinson-grafana commented Jan 5, 2025

@grobinson-grafana The priority of the resolution message itself is immaterial. If you choose to send the resolution message with priority=2, that is on you. But the initially fired message with priority=2 will continue to require acknowledgment unless cancel_by_tag is called. This is all about improving the common workflow for people that use Pushover for notification.

That's what I thought as well. But then I don't understand. Previously you mentioned you need to store state in order to "remember" if you previously sent a high priority alert notification to Pushover, but here you say the priority is immaterial. Did I completely misunderstand?

Is the priority of the resolution notification not supposed to match the priority of the previous firing notifications?

@kadaan
Copy link
Author

kadaan commented Jan 5, 2025

No, the priority of resolution message does not need to match the firing message, and likely you don't want it to. I wouldn't want my phone to wake me up in the middle of the night because an issue was resolved.

@kadaan
Copy link
Author

kadaan commented Jan 5, 2025

As I mentioned before, this is all in the scope of reducing load again Pushover's server, so that Alertmanager is not abusing it.

And as I also mentioned, I see only 4 ways of accomplishing this:

  1. store state for fired notifications to Pushover so that when the alert is resolved we can call the cancel_by_tag api
  2. call cancel_by_tag for every resolution
  3. try to determine if the priority could have been 2 (ie. check if the priority template contains 2) and call cancel_by_tag
  4. require users to enable cancellation and suggest that they create a different receiver for priority 2 alerts; alway call cancel_by_tag for that receiver (this is an optimization of item 2 in this list)

@grobinson-grafana
Copy link
Contributor

grobinson-grafana commented Jan 5, 2025

Options 2 and Option 3 seems like the most sensible option to me. I still don't understand why you would need or want to store state for this, as priority should not be "flapping" in such a way it becomes completely non deterministic between two consecutive notifications for the same aggregation group. However, if you feel you really really need to store state, you can probably find a way to add metadata to nflog entries.

Option 4 could work if the priority of a notification can be determined from a single alert. If a user's template looks at the entire aggregation group (i.e. if more than 2 firing alerts, set priority to 2) then it won't be possible.

@kadaan
Copy link
Author

kadaan commented Jan 5, 2025

Regarding Option 1, the nflog.Entry currently doesn't contain receiver specific information. I assumed, maybe incorrectly, that Alertmanager would not want to be in the business of storing receiver specific information.

Option 2 seemed like a non-starter as it would always increase the load on the Pushover servers and they would prefer we not do that.

Option 3, seemed like a hack, but it could reduce the load on the Pushover server and still enabled even better behavior via a specific receiver for priority 2 messages.

What you explained about option 4 makes sense.

I don't think the priority of the alert is "flapping" in the way you mean. Here is an example template for priority: {{ if eq .Status "firing" }}2{{ else }}0{{ end }}. It doesn't flap for consecutive notifications, but it does mean that the notification of an alert firing is priority 2 and the notification of resolution is priority 0.

Given all this, if you think Option 3 will get merged, I am happy to remove the new config option I added and simply do a strings.Contains(n.conf.Priority, "2") to determine if cancellation should be enabled.

@onedr0p
Copy link

onedr0p commented Jan 6, 2025

If I understand correctly to use this feature it will require people to set priority: 2 and not a different priority templated based on if an alert is firing or not? That seems like a good enough solution to me.

Maybe there could be some docs written around that. It might not be clear to people who already have priority templated to be 2 for firing and another value for non-firing. At least it's not a breaking change and they have to opt into changing their priority if they want emergency alerts to be automatically resolved.

@kadaan
Copy link
Author

kadaan commented Jan 6, 2025

@onedr0p Option 3, which is what is in the PR now, checks to see if the value 2 is anywhere within the config value for priority on the receiver. It will handle the case where the priority is templated based on firing. There will be some unnecessary calls of cancel_by_tag. The user could tune the alertnmanager config to eliminate these unneeded calls by adding a second receiver that only handles critical alerts. That way 2 would not be in the priority of the normal receiver.

@grobinson-grafana
Copy link
Contributor

Regarding Option 1, the nflog.Entry currently doesn't contain receiver specific information. I assumed, maybe incorrectly, that Alertmanager would not want to be in the business of storing receiver specific information.

It should be here https://github.com/prometheus/alertmanager/blob/main/nflog/nflogpb/nflog.proto#L29.

Option 2 seemed like a non-starter as it would always increase the load on the Pushover servers and they would prefer we not do that.

Is it possible to send just one message? The code right now sends two messages for a resolved notification. If the notification is a resolved message, can you just send cancel_by_tag? Then you don't increase the number of messages.

shouldRetry, err := n.sendMessage(ctx, key, u, parameters)

	// Notifications sent for firing alerts could be sent with emergency priority, but the resolution notifications
	// might be sent with a different priority. Because of this, and the desire to reduce unnecessary cancel_by_tag
	// call again Pushover, we only call cancel_by_tag if the priority config value contains.
	if err == nil && strings.Contains(n.conf.Priority, emergencyPriority) && alerts.Status() == model.AlertResolved {
		u, err = url.Parse(fmt.Sprintf("%s/%s.json", n.apiReceiptsURL, groupKeyTag))
		if err != nil {
			return false, err
		}
		shouldRetry, err = n.sendMessage(ctx, key, u, parameters)
	}
	return shouldRetry, err

Option 3, seemed like a hack, but it could reduce the load on the Pushover server and still enabled even better behavior via a specific receiver for priority 2 messages.

Why do you think it's a hack?

What you explained about option 4 makes sense.

I don't think the priority of the alert is "flapping" in the way you mean. Here is an example template for priority: {{ if eq .Status "firing" }}2{{ else }}0{{ end }}. It doesn't flap for consecutive notifications, but it does mean that the notification of an alert firing is priority 2 and the notification of resolution is priority 0.

For this specific template, that would be correct. I hope no one would write a template like that though. It should be more like {{ if eq (index .CommonLabels "severity") "high" }}2{{ else }}0{{ end }}. Then the priority doesn't change between firing and resolved notifications.

@kadaan
Copy link
Author

kadaan commented Jan 6, 2025

It should be here https://github.com/prometheus/alertmanager/blob/main/nflog/nflogpb/nflog.proto#L29.

I looked there, but there is not any receiver implementation specific data at the moment and I’m not sure there should be.

Is it possible to send just one message? The code right now sends two messages for a resolved notification. If the notification is a resolved message, can you just send cancel_by_tag? Then you don't increase the number of messages.

No. If you send only cancel_by_tag you will stop the phone from requiring acknowledgement of existing priority 2 alerts, but there will be no message on the users phone that the alert is resolved.

I think about it like email as it operates similarly. The original firing notification and the resolved notification are simply messages that show up on the users phone. Priority controls how important the message is and that alters how the app acts, such as only incrementing the badge count or sounding an alarm even though it is quiet hours.

Why do you think it's a hack?

Because it is just looking for whether the template string contains the number 2, rather than being able to deterministically know that the firing alert was sent with priority 2.

That said, I am fine with this compromise as it reduces load on the Pushover servers and doesn’t introduce new receiver implementation specific data to the nflog.

For this specific template, that would be correct. I hope no one would write a template like that though. It should be more like {{ if eq (index .CommonLabels "severity") "high" }}2{{ else }}0{{ end }}. Then the priority doesn't change between firing and resolved notifications.

People are definitely writing templates like this and there is a need. You do not want the resolution notification to bypass the user’s quiet hours. So sending the resolve notification with priority 2 doesn’t make a ton of sense. It would result in someone getting potentially woken up when there is no need.

@grobinson-grafana
Copy link
Contributor

grobinson-grafana commented Jan 6, 2025

I looked there, but there is not any receiver implementation specific data at the moment and I’m not sure there should be.

You would add support for metadata (i.e. key/value pairs) and then each receiver could store whatever metadata it needs. No field would be added that is specific to the receiver, but a generic field for any and all metadata could be accepted.

No. If you send only cancel_by_tag you will stop the phone from requiring acknowledgement of existing priority 2 alerts, but there will be no message on the users phone that the alert is resolved.

Is it possible for Pushover to add a new server method that does both? Then we don't need to send two messages.

The other consideration is that this will only happen when all alerts in an aggregation group are resolved, right? Have you benchmarked it see what the load would actually be on a typical Alertmanager deployment? If not, perhaps try it? You may find it's not that bad.

Why do you think it's a hack?

Because it is just looking for whether the template string contains the number 2, rather than being able to deterministically know that the firing alert was sent with priority 2.

That said, I am fine with this compromise as it reduces load on the Pushover servers and doesn’t introduce new receiver implementation specific data to the nflog.

One issue I realized with tracking is it won't be enough to just store the priority for the last sent notification. Instead, you will need to store the priority for all previously un-cancelled priority 2 notifications, right? For example:

00:00:01: Firing notification is sent for group key `foo` with priority 2
00:05:01: Firing notification is sent for group key `foo` with priority 1
00:10:01: Resolved notification is sent for group key `foo` with priority 1

In this example, there is an earlier priority 2 notification that has not been canceled.

People are definitely writing templates like this and there is a need. You do not want the resolution notification to bypass the user’s quiet hours. So sending the resolve notification with priority 2 doesn’t make a ton of sense. It would result in someone getting potentially woken up when there is no need.

Perhaps cancellation is not a feature we need? The implementation seems to have a lot of problems? Is there a lot of support from the community for this feature?

@kadaan
Copy link
Author

kadaan commented Jan 6, 2025

You would add support for metadata (i.e. key/value pairs) and then each receiver could store whatever metadata it needs. No field would be added that is specific to the receiver, but a generic field for any and all metadata could be accepted.

Sure, if you wanted to go that route, great. But, it seems to add a bunch of complexity for what should be a Pushover specific change and quite limited in scope.

Is it possible for Pushover to add a new server method that does both? Then we don't need to send two messages.

I'm sure it is possible, but I'm not a dev on that project. They have already added one method based on this issue, so I'm not sure if they will want to expend more effort.

The other consideration is that this will only happen when all alerts in an aggregation group are resolved, right? Have you benchmarked it see what the load would actually be on a typical Alertmanager deployment? If not, perhaps try it? You may find it's not that bad.

This is not about the load on Alertmanager. It is about additional load on the Pushover servers. See the following comment from Joshua Stein: "These requests would not count against the application's quota, though I agree that it would be best to avoid sending them unnecessarily."

One issue I realized with tracking is it won't be enough to just store the priority for the last sent notification. Instead, you will need to store the priority for all previously un-cancelled priority 2 notifications, right? For example:

00:00:01: Firing notification is sent for group key `foo` with priority 2
00:05:01: Firing notification is sent for group key `foo` with priority 1
00:10:01: Resolved notification is sent for group key `foo` with priority 1

In this example, there is an earlier priority 2 notification that has not been canceled.

This is where I think a simple configuration option would almost be best. If the receiver had a CancelOnResolve property, the person configuring would be able to decide what to do. They could segment all alerts what will be priority 2 into their own receiver using a tag similar to the severity example you listed. At that point you know that all firing notifications are priority 2. Resolution notifications on that receiver can be any priority. If Pushover notices a lot of load on their servers from cancel_by_tag calls, they can decide to add a quota that will push people to use a more optimal receiver configuration would be best.

Perhaps cancellation is not a feature we need? The implementation seems to have a lot of problems? Is there a lot of support from the community for this feature?

I and others would be able to use emergency priority alerts with Pushover with this feature in place. Currently we don't use it because of all the manual interaction needed to handle emergency priority alerts upon resolution.

@grobinson-grafana
Copy link
Contributor

This is not about the load on Alertmanager. It is about additional load on the Pushover servers. See the following comment from Joshua Stein: "These requests would not count against the application's quota, though I agree that it would be best to avoid sending them unnecessarily."

No I understand. I wasn't talking about load on Alertmanager either. My point was the condition needed for this second message to be sent only happens when all alerts in an aggregation group are resolved. Unless a user has completely disabled grouping, my suspicion is real world load increase on Pushover servers is far less than 2x, because most group flushes have at least 1 firing alert. A group flush will all resolved alerts is far less common.

This is where I think a simple configuration option would almost be best. If the receiver had a CancelOnResolve property, the person configuring would be able to decide what to do. They could segment all alerts what will be priority 2 into their own receiver using a tag similar to the severity example you listed. At that point you know that all firing notifications are priority 2. Resolution notifications on that receiver can be any priority. If Pushover notices a lot of load on their servers from cancel_by_tag calls, they can decide to add a quota that will push people to use a more optimal receiver configuration would be best.

I think that's a fine solution that we would accept (I can approve).

I and other would be able to use emergency priority alerts with Pushover with this feature in place. Currently we don't use it because of all the manual interaction needed to handle emergency priority alerts upon resolution.

👍

@kadaan
Copy link
Author

kadaan commented Jan 6, 2025

@grobinson-grafana Great! That config option would let us do something like:

...
          priority: {{ if eq .Status "firing" }}{{ if eq .CommonLabels.severity "critical" }}2{{ else }}1{{ end }}{{ else }}0{{ end }}
          cancel_on_resolve: {{ if eq .CommonLabels.severity "critical" }}true{{ else }}false{{ end }}
...

kadaan added 2 commits January 6, 2025 09:11
This reverts commit 009466d.

Signed-off-by: Joel Baranick <[email protected]>
This reverts commit 68aa21a.

Signed-off-by: Joel Baranick <[email protected]>
@kadaan kadaan force-pushed the support_pushover_emergecy_cancel_by_tags branch from 41c5f50 to 2f2cded Compare January 6, 2025 17:12
@grobinson-grafana
Copy link
Contributor

@grobinson-grafana Great! That config option would let us do something like:

...
          priority: {{ if eq .Status "firing" }}{{ if eq .CommonLabels.severity "critical" }}2{{ else }}1{{ end }}{{ else }}0{{ end }}
          cancel_on_resolve: {{ if eq .CommonLabels.severity "critical" }}true{{ else }}false{{ end }}
...

Isn't cancel_on_resolve just a boolean? Do you also want to make it a template? The problem with making it a template is that it makes it extremely difficult for users to retroactively understand why they did or didn't get a cancellation notification, and I believe we have pushed back on people wanting to make similar changes in other receivers.

@kadaan
Copy link
Author

kadaan commented Jan 6, 2025

Sure cancel_by_resolve is just a boolean. Similarly, priority is just a number, is a template, and alters the behavior of Pushover. I would contend that cancel_by_resolve as a template has less impact than priority as a template.

The above templates allow Alertmanager to send a message to Pushover with priority=2 when a severity=critical alert is firing and have cancel_on_resolve=true for only severity=critical alerts. The key thing this gets us, with regard to Pushover, is that resolve messages are not emergency and cancel_by_resolve is only called for resolved alerts which would have been priority=2. This lets me optimize the calls to Pushover and ensures that if they add cancel_by_tag to the api quota I don't waste quota

@kadaan
Copy link
Author

kadaan commented Jan 7, 2025

@grobinson-grafana The PR has been updated based on our discussion

@kadaan
Copy link
Author

kadaan commented Jan 10, 2025

@grobinson-grafana Anything else I need to do for this one?

@kadaan
Copy link
Author

kadaan commented Jan 14, 2025

@grobinson-grafana All checks are passing now. Anything else I need to do to enable this to be merged?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants