Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix alerts duplication in HA cluster mode #4153

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

nip-was-here
Copy link

#4152

There is fix for duplicates in HA cluster mode & reduce for broadcasting several times exactly same state.

!cluster.OversizedMessage(b) is not necessary here:

On current release version (0.27.0)

To sum up: at step Choose way for send data in func there's same check with choosing way of sending data - TCP or UDP

For testing this fix, stand was created:

  • Several vmalert instances send alerts to 2 HA clusters of Alertmanager - with patch & without it
  • Each Alertmanager cluster know only about his own nodes
  • Both clusters of Alertmanager send to one self-developed receiver
  • Receiver calculates duplicates of alerts with delta < 1 minute and other stats

Results of tests with big by body incoming alerts:

{
  "main": {
    "alerts_last_minute": 2,
    "alerts_total": 10864,
    "duplicates_total": 2887
  },
  "patched": {
    "alerts_last_minute": 2,
    "alerts_total": 7579,
    "duplicates_total": 0
  }
}

@MarcWort
Copy link

I stumbled across this fix because we were sometimes duplicating notifications in Slack. Are there any plans to review this change?

@jan--f
Copy link
Contributor

jan--f commented Jan 16, 2025

Hi, thanks for the contribution. Do you think you could add a unit test that demonstrates the old bug and guards against future regressions of this issue?

@nip-was-here
Copy link
Author

Hi, @jan--f

It's a bit complicated - during finding this, I run 3 nodes of Alertmanager, one of them in debugger

It's appears only on cluster mode with big notification log size between nodes

@jan--f
Copy link
Contributor

jan--f commented Jan 16, 2025

!cluster.OversizedMessage(b) is not necessary here

Iiuc the original code avoids gossiping oversized messages since they have already been distributed. So won't this change create a dupplicate gossip message for oversized messages?

@nip-was-here
Copy link
Author

a dupplicate gossip message for oversized messages

Currently, there's no sync between nodes with big messages (reason of duplication on receiver) and flood by broadcasting n times other logs (n - how much logs there's in one announce between nodes)

@nip-was-here
Copy link
Author

In my solution, all logs will be announced further once

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants