Dedupe Webhooks for queue-processor #297

Open
bwagner5 opened this issue Nov 19, 2020 · 3 comments
Labels
Priority: Low (this issue will not be seen by most users; it is a very specific use case or corner case), stalebot-ignore (do not let the stalebot update or close the issue/PR), Type: Enhancement (new feature or request)

Comments

@bwagner5
Contributor

Events like Spot rebalance recommendations, ASG lifecycle hooks, Spot ITNs, and EC2 instance status changes can cause multiple webhooks to fire, one for each event on the same instance. NTH should dedupe webhooks that have already been sent for an instance, since the drain status has not changed.

@haugenj
Contributor

haugenj commented Dec 14, 2020

I'm gonna leave this for now. If this is affecting anyone, please comment here and we'll increase the priority.

@bwagner5
Contributor Author

When commenting on #353 I realized I forgot to write down some thoughts on mitigating this:

First, we could dedupe locally within the event store. If an event comes into the event store acting on a node that has already been successfully drained within a 10-minute window, the event store could immediately mark it as completed. This would only help within a single-replica deployment of NTH.
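A rough sketch of what that in-store dedupe could look like (the names and types here are made up for illustration, not NTH's actual event-store API):

```go
package nthsketch

import (
	"sync"
	"time"
)

// drainDeduper remembers when each node was last successfully drained so that
// later events for the same node within the window can be marked completed
// immediately instead of firing another webhook.
type drainDeduper struct {
	mu      sync.Mutex
	drained map[string]time.Time // node name -> time the drain completed
	window  time.Duration        // e.g. 10 * time.Minute
}

func newDrainDeduper(window time.Duration) *drainDeduper {
	return &drainDeduper{drained: map[string]time.Time{}, window: window}
}

// MarkDrained records a successful drain for nodeName.
func (d *drainDeduper) MarkDrained(nodeName string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.drained[nodeName] = time.Now()
}

// IsDuplicate reports whether an incoming event targets a node that was
// already drained within the dedupe window.
func (d *drainDeduper) IsDuplicate(nodeName string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	t, ok := d.drained[nodeName]
	return ok && time.Since(t) < d.window
}
```

Since the state lives only in one pod's memory, this only prevents duplicates within a single replica, which is the limitation noted above.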

Another approach would be to run the Event Store as a separate deployment. NTH worker pods could then be scaled independently, and the same dedupe logic discussed above would scale to more workers. I'm not sure the approach is warranted, though, considering the relatively low traffic and the increased complexity this would add to NTH.

A third approach could be to handle events with more graceful degradation: if the node doesn't exist or has already been cordoned and drained, don't send the webhook. Since a single NTH pod can now process events in parallel, two events for the same node could be processed at the same time, so the cordoned-and-drained check fails for both and a cordon and drain call is initiated while another worker is in the middle of its own cordon and drain call. In that situation the webhooks would be duplicated. To get around the concurrency issue, we could lock event processing for an individual node, but this is only easily done within a single pod, so duplicate webhooks could still fire if two different NTH pods are concurrently processing two events for the same node.
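To make the per-node locking idea concrete, here is a sketch of how a single pod could serialize event processing per node. Everything here (the `nodeLocks` type, `alreadyDrained`, `drainAndNotify`) is hypothetical glue, not existing NTH code, and it does nothing to coordinate across replicas:

```go
package nthsketch

import "sync"

// nodeLocks hands out one mutex per node name so two goroutines in the same
// pod can't cordon/drain (and webhook) the same node concurrently.
type nodeLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newNodeLocks() *nodeLocks {
	return &nodeLocks{locks: map[string]*sync.Mutex{}}
}

func (n *nodeLocks) lockFor(nodeName string) *sync.Mutex {
	n.mu.Lock()
	defer n.mu.Unlock()
	l, ok := n.locks[nodeName]
	if !ok {
		l = &sync.Mutex{}
		n.locks[nodeName] = l
	}
	return l
}

// processEvent sketches the graceful-degradation flow: take the node's lock,
// skip the webhook if the node is gone or already cordoned and drained,
// otherwise drain and notify exactly once from this pod's point of view.
func processEvent(locks *nodeLocks, nodeName string, alreadyDrained func(string) bool, drainAndNotify func(string) error) error {
	l := locks.lockFor(nodeName)
	l.Lock()
	defer l.Unlock()
	if alreadyDrained(nodeName) {
		return nil // no webhook for a node that's already handled
	}
	return drainAndNotify(nodeName)
}
```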

Another valid approach is to do nothing and accept duplicate webhooks. In practice, no one has complained about them so far. If you're out there and find them frustrating, let us know here!

@jillmon added the Priority: High and Priority: Low labels and removed the Priority: High label on Feb 11, 2021
@jillmon added this to the Webhook Enhancements milestone on Feb 11, 2021
@jillmon added the stalebot-ignore label on Oct 19, 2021
@evandam

evandam commented Jun 9, 2022

Hey @bwagner5, just wanted to bump this since I think it would be really nice to have notifications deduplicated to keep noise low in Slack channels for example.

In my case, I see notifications for "Spot Interruption event received", "ASG Lifecycle Termination event received", and "EC2 State Change event received": three notifications when one spot instance is interrupted.

I realize it's a complicated problem, especially with multiple replicas, but the third option you stated sounds good to me. Maybe it would make sense to have a flag to say "skip webhooks when draining" and call it a day? It may still have some issues around concurrency, but it might be a pretty successful and lower-effort fix.
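Roughly what I had in mind, as a sketch only (the flag name and helper functions below are hypothetical, not existing NTH options):

```go
package nthsketch

// maybeSendWebhook sends the notification only when the skip flag is off or
// the node has not yet been cordoned, so the first event for an instance still
// produces one message and the follow-up events stay quiet.
func maybeSendWebhook(skipIfCordoned bool, isCordoned func(node string) (bool, error), send func(node string) error, node string) error {
	if skipIfCordoned {
		cordoned, err := isCordoned(node)
		if err == nil && cordoned {
			return nil // node already being drained; suppress the duplicate webhook
		}
	}
	return send(node)
}
```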
