Inhibition rule not working #4205

Open
ricksj5 opened this issue Jan 16, 2025 · 1 comment

ricksj5 commented Jan 16, 2025

We have some inhibition rules that work as expected. We are trying to add a new inhibition rule without the "equal" field, and it is not working. We also tested the new rule with an "equal" field, like the other rules, but it still did not work.

Working Rules:

  inhibit_rules:
    - source_match:
        alertname: "BlackboxProbeFailed"
      target_match_re:
        severity: "very high|high|warning"
      equal: ["hostname"]
    - source_match:
        alertname: "Network-Down"
      target_match_re:
        alertname: "BlackboxProbeFailed|Host-DOWN|prometheus-heartbeat"
      equal: ["category"]

The new rule, added after the rules above, has no "equal" field:

   - source_match:
       alertname: "Test-service-cron"
     target_match_re:
       alertname: "Test-service-sshd"

Below are the test alerting rules we created to exercise it:

          - alert: Test-service-cron
            expr: node_systemd_unit_state{name="cron.service",exported_state="active"} == 0
            for: 5m
            labels:
              severity: very high
              category: Exceptions
            annotations:
              description: "Service has been down for over 5 minutes in - {{$labels.hostname}}"
              summary: "RED - {{$labels.hostname}} - CRON Service down"

          - alert: Test-service-sshd
            expr: node_systemd_unit_state{name="sshd.service",exported_state="active"} == 0
            for: 5m
            labels:
              severity: very high
              spc: disabled
              category: Exceptions
            annotations:
              description: "Service sshd has been down for over 5 minutes in - {{$labels.hostname}}"
              summary: "RED - {{$labels.hostname}} -  SSHD Service down"

To test the new rule, we first stopped the cron service. Once the "Test-service-cron" alert fired, we stopped the sshd service. However, the "Test-service-sshd" alert also fired, so the inhibition rule does not appear to be working: it should have suppressed the target alert, but it did not. We verified the alert firing status through the "ALERTS" metric.

Questions:

  1. Are there any specific requirements or conditions for inhibition rules to work without the "equal" field?
  2. Could there be any conflicts or precedence issues with the existing inhibition rules that might affect the new rule?
  3. Could there be any version-specific issues or bugs related to inhibition rules that we should be aware of?

grobinson-grafana (Contributor) commented

I haven't been able to reproduce this, I'm afraid; it works for me. Here is the configuration file:

receivers:
  - name: test
route:
  receiver: test
inhibit_rules:
  - source_match:
      alertname: "BlackboxProbeFailed"
    target_match_re:
      severity: "very high|high|warning"
    equal: ["hostname"]
  - source_match:
      alertname: "Network-Down"
    target_match_re:
      alertname: "BlackboxProbeFailed|Host-DOWN|prometheus-heartbeat"
    equal: ["category"]
  - source_match:
      alertname: "Test-service-cron"
    target_match_re:
      alertname: "Test-service-sshd"

I added the two alerts:

./amtool --alertmanager.url=http://127.0.0.1:9093 alert add alertname=Test-service-cron
./amtool --alertmanager.url=http://127.0.0.1:9093 alert add alertname=Test-service-sshd

The debug logs show Test-service-sshd being inhibited:

time=2025-01-16T10:49:47.624Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=Test-service-cron[8bc38d5][active]
time=2025-01-16T10:49:51.561Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=Test-service-sshd[643222c][active]
time=2025-01-16T10:50:17.626Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup={}:{} alerts="[Test-service-cron[8bc38d5][active] Test-service-sshd[643222c][active]]"
time=2025-01-16T10:50:17.626Z level=DEBUG source=notify.go:579 msg="Notifications will not be sent for muted alerts" component=dispatcher alerts=[Test-service-sshd[643222c][active]] reason=inhibition

And so does the API:

[
  {
    "annotations": {},
    "endsAt": "2025-01-16T10:54:51.561Z",
    "fingerprint": "643222c68932c063",
    "receivers": [
      {
        "name": "test"
      }
    ],
    "startsAt": "2025-01-16T10:49:51.561Z",
    "status": {
      "inhibitedBy": [
        "8bc38d5516aaa89d"
      ],
      "mutedBy": [],
      "silencedBy": [],
      "state": "suppressed"
    },
    "updatedAt": "2025-01-16T10:49:51.561Z",
    "labels": {
      "alertname": "Test-service-sshd"
    }
  },
  {
    "annotations": {},
    "endsAt": "2025-01-16T10:54:47.624Z",
    "fingerprint": "8bc38d5516aaa89d",
    "receivers": [
      {
        "name": "test"
      }
    ],
    "startsAt": "2025-01-16T10:49:47.624Z",
    "status": {
      "inhibitedBy": [],
      "mutedBy": [],
      "silencedBy": [],
      "state": "active"
    },
    "updatedAt": "2025-01-16T10:49:47.624Z",
    "labels": {
      "alertname": "Test-service-cron"
    }
  }
]
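
If it helps with the comparison, here is a minimal Python sketch for pulling the inhibition status out of that API response. It assumes the response comes from `GET /api/v2/alerts` on the same `http://127.0.0.1:9093` address used with amtool above. The names `fetch_alerts`, `inhibited_alerts` and `SAMPLE` are made up for illustration. One thing worth keeping in mind: inhibition is applied inside Alertmanager when notifications are processed, so the `ALERTS` metric in Prometheus would be expected to show the target alert as firing either way. The Alertmanager API or debug logs are the place to look.

```python
import json
import urllib.request


def fetch_alerts(base_url):
    """Fetch all alerts from the Alertmanager v2 API (GET /api/v2/alerts)."""
    with urllib.request.urlopen(f"{base_url}/api/v2/alerts") as resp:
        return json.load(resp)


def inhibited_alerts(alerts):
    """Return (alertname, inhibitedBy) pairs for alerts suppressed by inhibition."""
    result = []
    for alert in alerts:
        status = alert.get("status", {})
        if status.get("state") == "suppressed" and status.get("inhibitedBy"):
            result.append((alert["labels"].get("alertname"), status["inhibitedBy"]))
    return result


# Trimmed copy of the API response shown above, so the sketch runs offline.
SAMPLE = [
    {
        "fingerprint": "643222c68932c063",
        "status": {"inhibitedBy": ["8bc38d5516aaa89d"], "silencedBy": [], "state": "suppressed"},
        "labels": {"alertname": "Test-service-sshd"},
    },
    {
        "fingerprint": "8bc38d5516aaa89d",
        "status": {"inhibitedBy": [], "silencedBy": [], "state": "active"},
        "labels": {"alertname": "Test-service-cron"},
    },
]

if __name__ == "__main__":
    # Against a live instance, replace SAMPLE with fetch_alerts("http://127.0.0.1:9093").
    for name, sources in inhibited_alerts(SAMPLE):
        print(f"{name} inhibited by {', '.join(sources)}")
```

An empty result for the two test alerts on your side would suggest the rule is not being applied by Alertmanager, which is what the debug logs should help narrow down.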

Could you do the equivalent test and share the debug logs from your Alertmanager, so we can compare?
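
On the question about rules without "equal": as a rough mental model only (not Alertmanager's actual code), the Python sketch below shows how a rule of this shape can be thought of. It assumes regex matchers are fully anchored and that a missing label is treated the same as an empty value, so an empty or absent `equal` list never blocks inhibition. The helper names `matches` and `inhibits` are hypothetical, and details such as caching and alerts that match both sides of a rule are left out.

```python
import re


def matches(labels, exact=None, regex=None):
    """True if labels satisfy every equality matcher and every anchored regex matcher."""
    for name, value in (exact or {}).items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in (regex or {}).items():
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def inhibits(rule, source, target):
    """Simplified check: does firing alert `source` inhibit `target` under `rule`?"""
    if not matches(source, rule.get("source_match"), rule.get("source_match_re")):
        return False
    if not matches(target, rule.get("target_match"), rule.get("target_match_re")):
        return False
    # Missing labels count as empty values; with no `equal` labels this is always True.
    return all(source.get(n, "") == target.get(n, "") for n in rule.get("equal", []))


# The new rule and the labels from the two test alerting rules in this issue.
RULE = {
    "source_match": {"alertname": "Test-service-cron"},
    "target_match_re": {"alertname": "Test-service-sshd"},
}
CRON = {"alertname": "Test-service-cron", "severity": "very high", "category": "Exceptions"}
SSHD = {
    "alertname": "Test-service-sshd",
    "severity": "very high",
    "spc": "disabled",
    "category": "Exceptions",
}

if __name__ == "__main__":
    print(inhibits(RULE, CRON, SSHD))  # cron as source, sshd as target
    print(inhibits(RULE, SSHD, CRON))  # reversed roles
```

Under this model the new rule applies to the two test alerts as written, which is consistent with the result above, so the difference is more likely in the labels that actually reach Alertmanager or in where the alert state is being observed.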
