Dedupe Webhooks for queue-processor #297
Comments
I'm going to leave this for now; if this is affecting anyone, please comment here and we'll increase the priority.
When commenting on #353 I realized I forgot to write down some thoughts on mitigating this.

First, we could dedupe locally within the event store: if an event arrives acting on a node that has already been successfully drained within a 10-minute window, the event store could immediately mark it as completed. This would only help within a single-replica deployment of NTH.

Another approach could run the event store as a separate deployment. NTH worker pods could then be scaled independently, and the same dedupe logic discussed above would scale to more workers. I'm not sure that approach is warranted, though, considering the relatively low traffic and the complexity it would add to NTH.

A third approach could be to handle events with more graceful degradation: if the node doesn't exist or has already been cordoned and drained, don't send the webhook. Since a single NTH pod can now process events in parallel, it's possible for the cordon-and-drain check to fail while another worker is in the middle of a cordon-and-drain call; in that situation the webhooks would still be duplicated. To get around the concurrency issue, we could lock event processing per node, but that is only easy within a single pod, so duplicate webhooks could still fire if two different NTH pods are concurrently processing two events for the same node.

Another valid approach is to do nothing and accept duplicate webhooks. In practice, no one has complained about them so far. If you're out there and find them frustrating, let us know here!
Hey @bwagner5, just wanted to bump this, since I think it would be really nice to have notifications deduplicated to keep noise low in Slack channels, for example. In my case, I see notifications for "Spot Interruption event received", "ASG Lifecycle Termination event received", and "EC2 State Change event received": three notifications when one spot instance is interrupted. I realize it's a complicated problem, especially with multiple replicas, but the third option you stated sounds good to me. Maybe it would make sense to have a flag that says "skip webhooks when draining" and call it a day? It may still have some issues around concurrency, but it could be a pretty successful and lower-effort fix.
Events like spot rebalance recommendations, ASG lifecycle hooks, spot ITNs, and EC2 instance status changes can each cause a webhook to fire for the same instance. NTH should dedupe webhooks that have already been sent for an instance, since the drain status has not changed.