fix(observability): fix expr for alerts like PodIsNotReady #1553
base: main
Conversation
Signed-off-by: Nikita Korolev <[email protected]>
Reviewer's Guide

This PR refactors the Prometheus alert rules for multiple PodIsNotReady alerts by replacing the aggregated `min by (pod)` readiness check with a per-pod expression that joins pod readiness with pod phase.

Flow diagram for updated PodIsNotReady alert logic:

```mermaid
flowchart TD
    A["kube_pod_status_ready == 0 for pod"] --> B["kube_pod_status_phase == 1 for pod (phase=Running|Succeeded)"]
    B --> C["AND on (namespace, pod)"]
    C --> D["PodIsNotReady alert fires"]
```
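For reference, the per-pod pattern that replaces the old `min by (pod)` aggregation looks like the following. This is a minimal sketch assembled from the cdi-deployment rule in this diff (the other rules differ only in the alert name and pod regex), not the complete rule file:

```yaml
- alert: D8InternalVirtualizationCDIDeploymentPodIsNotReady
  expr: |
    (kube_pod_status_ready{condition="true", namespace="d8-virtualization", pod=~"cdi-deployment-.*"} == 0)
    and on (namespace,pod)
    (kube_pod_status_phase{namespace="d8-virtualization", phase=~"Running|Succeeded", pod=~"cdi-deployment-.*"} == 1)
  labels:
    severity_level: "6"
```

Read as: fire per pod when the pod reports not-ready and the same (namespace, pod) pair has a Running or Succeeded phase series set to 1, so pods sitting in Pending or Failed no longer trip this alert.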
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `monitoring/prometheus-rules/internal-virtualization-cdi-apiservier.yaml:5-9` </location>
<code_context>
- alert: D8InternalVirtualizationCDIAPIServerPodIsNotReady
- expr: min by (pod) (kube_pod_status_ready{condition="true", namespace="d8-virtualization", pod=~"cdi-apiserver-.*"}) != 1
+ expr: |
+ min by (pod) (
+ kube_pod_status_ready{condition="true", namespace="d8-virtualization", pod=~"cdi-apiserver-.*"}
+ and on(namespace,pod)
+ kube_pod_status_phase{phase="Succeeded|Running", namespace="d8-virtualization"}
+ ) != 1
labels:
severity_level: "6"
</code_context>
<issue_to_address>
**issue:** The use of 'Succeeded|Running' as a value for the 'phase' label may not match Prometheus label matching semantics.
Use 'phase=~"Succeeded|Running"' for regex matching to ensure correct selection of pods in either phase, consistent with other files.
</issue_to_address>
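To make the matcher semantics concrete: in PromQL, `=` compares the label value as an exact string, while `=~` applies a fully anchored RE2 regular expression, so the two selectors below behave very differently:

```promql
# Exact match: selects only series whose phase label is literally the
# string "Succeeded|Running", a value kube-state-metrics never emits,
# so this selector matches nothing.
kube_pod_status_phase{phase="Succeeded|Running"}

# Regex match: selects series whose phase is "Succeeded" or "Running".
kube_pod_status_phase{phase=~"Succeeded|Running"}
```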
### Comment 2
<location> `monitoring/prometheus-rules/internal-virtualization-cdi-deployment.yaml:7` </location>
<code_context>
- alert: D8InternalVirtualizationCDIDeploymentPodIsNotReady
- expr: min by (pod) (kube_pod_status_ready{condition="true", namespace="d8-virtualization", pod=~"cdi-deployment-.*"}) != 1
+ expr: |
+ (kube_pod_status_ready{condition="true", namespace="d8-virtualization", pod=~"cdi-deployment-.*"} == 0)
+ and on (namespace,pod)
+ (kube_pod_status_phase{namespace="d8-virtualization", phase=~"Running|Succeeded", pod=~"cdi-deployment-.*"} == 1)
labels:
severity_level: "6"
</code_context>
<issue_to_address>
**suggestion:** The alert triggers only if exactly one pod matches the phase condition, which may miss cases with multiple pods.
Consider replacing '== 1' with '> 0' or a suitable aggregation to handle scenarios with multiple pods in the specified phases.
```suggestion
(kube_pod_status_phase{namespace="d8-virtualization", phase=~"Running|Succeeded", pod=~"cdi-deployment-.*"} > 0)
```
</issue_to_address>
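For context on what `== 1` filters here: kube-state-metrics publishes `kube_pod_status_phase` as one 0/1 gauge series per pod per phase, so the comparison applies to each series' value rather than to a count of pods. A single pod currently in the Running phase exposes roughly the following (the pod name is hypothetical and the namespace label is omitted for brevity):

```text
kube_pod_status_phase{pod="cdi-deployment-abc12", phase="Pending"}   0
kube_pod_status_phase{pod="cdi-deployment-abc12", phase="Running"}   1
kube_pod_status_phase{pod="cdi-deployment-abc12", phase="Succeeded"} 0
kube_pod_status_phase{pod="cdi-deployment-abc12", phase="Failed"}    0
kube_pod_status_phase{pod="cdi-deployment-abc12", phase="Unknown"}   0
```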
### Comment 3
<location> `monitoring/prometheus-rules/internal-virtualization-cdi-operator.yaml:4-7` </location>
<code_context>
- alert: D8InternalVirtualizationCDIOperatorPodIsNotReady
- expr: min by (pod) (kube_pod_status_ready{condition="true", namespace="d8-virtualization", pod=~"cdi-operator-.*"}) != 1
+ expr: |
+ (kube_pod_status_ready{condition="true", namespace="d8-virtualization", pod=~"cdi-operator-.*"} == 0)
+ and on (namespace,pod)
+ (kube_pod_status_phase{namespace="d8-virtualization", phase=~"Running|Succeeded", pod=~"cdi-operator-.*"} == 1)
labels:
severity_level: "6"
</code_context>
<issue_to_address>
**suggestion:** Consider using consistent aggregation for pod readiness and phase checks.
Aggregating over pods will ensure the alert works correctly when multiple pods match the pattern, preventing missed readiness or phase issues.
```suggestion
expr: |
(sum by (pod) (kube_pod_status_ready{condition="true", namespace="d8-virtualization", pod=~"cdi-operator-.*"}) == 0)
and on (namespace,pod)
(sum by (pod) (kube_pod_status_phase{namespace="d8-virtualization", phase=~"Running|Succeeded", pod=~"cdi-operator-.*"}) == 1)
```
</issue_to_address>
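As background for the `and on (namespace,pod)` join that both the original and suggested expressions rely on: `and` is set intersection with vector matching, returning the left-hand series (labels and values included) whose listed label pair also exists on the right-hand side. A generic sketch with placeholder metric names:

```promql
# Returns the series of metric_a whose (namespace, pod) label pair also
# appears in metric_b; the result keeps metric_a's labels and values.
# metric_a and metric_b are placeholders, not real metrics.
metric_a and on (namespace, pod) metric_b
```

Note that wrapping both operands in `sum by (pod) (...)` drops the namespace label from them, so the `on (namespace, pod)` clause then matches on an empty namespace value for both sides.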
### Comment 4
<location> `monitoring/prometheus-rules/internal-virtualization-virt-api.yaml:4-7` </location>
<code_context>
- alert: D8InternalVirtualizationVirtAPIPodIsNotReady
- expr: min by (pod) (kube_pod_status_ready{condition="true", namespace="d8-virtualization", pod=~"virt-api-.*"}) != 1
+ expr: |
+ (kube_pod_status_ready{condition="true", namespace="d8-virtualization", pod=~"virt-api-.*"} == 0)
+ and on (namespace,pod)
+ (kube_pod_status_phase{namespace="d8-virtualization", phase=~"Running|Succeeded", pod=~"virt-api-.*"} == 1)
labels:
severity_level: "6"
</code_context>
<issue_to_address>
**suggestion:** The alert logic may not handle multiple pods correctly due to equality checks.
Consider replacing strict equality checks with aggregation or range-based logic to ensure the alert accurately reflects pod readiness in deployments with multiple pods.
```suggestion
expr: |
(
sum(
kube_pod_status_ready{condition="true", namespace="d8-virtualization", pod=~"virt-api-.*"}
) by (namespace)
== 0
)
and
(
sum(
kube_pod_status_phase{namespace="d8-virtualization", phase=~"Running|Succeeded", pod=~"virt-api-.*"}
) by (namespace)
> 0
)
```
</issue_to_address>
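One behavioral difference worth noting about this suggestion: `sum(...) by (namespace)` collapses the pod label, so the rewritten alert fires at most once per namespace and can no longer name the offending pod in its labels. If aggregation is wanted while keeping per-pod granularity, grouping by both labels is one option (a sketch, not part of this PR):

```promql
sum by (namespace, pod) (
  kube_pod_status_ready{condition="true", namespace="d8-virtualization", pod=~"virt-api-.*"}
) == 0
```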
### Comment 5
<location> `monitoring/prometheus-rules/internal-virtualization-virt-controller.yaml:4-7` </location>
<code_context>
- alert: D8InternalVirtualizationVirtControllerPodIsNotReady
- expr: min by (pod) (kube_pod_status_ready{condition="true", namespace="d8-virtualization", pod=~"virt-controller-.*"}) != 1
+ expr: |
+ (kube_pod_status_ready{condition="true", namespace="d8-virtualization", pod=~"virt-controller-.*"} == 0)
+ and on (namespace,pod)
+ (kube_pod_status_phase{namespace="d8-virtualization", phase=~"Running|Succeeded", pod=~"virt-controller-.*"} == 1)
labels:
severity_level: "6"
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Equality checks for pod readiness and phase may not be robust for deployments with multiple pods.
Consider updating the logic to aggregate pod readiness, ensuring the alert triggers if any pod is not ready, rather than relying on equality checks.
```suggestion
expr: |
(
sum by (namespace, pod) (kube_pod_status_phase{namespace="d8-virtualization", phase=~"Running|Succeeded", pod=~"virt-controller-.*"})
>
sum by (namespace, pod) (kube_pod_status_ready{condition="true", namespace="d8-virtualization", pod=~"virt-controller-.*"})
)
```
</issue_to_address>
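Whichever variant lands, the expressions can be sanity-checked offline with `promtool test rules`. Below is a hedged sketch: the test file name, pod name, and the assumption that the rule file is plain `groups:`-style Prometheus YAML are all mine, and if the alert carries a `for:` clause the `eval_time` must be later than it:

```yaml
# alert_test.yaml: run with `promtool test rules alert_test.yaml`
rule_files:
  - internal-virtualization-virt-controller.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Pod reports phase=Running while readiness stays false.
      - series: 'kube_pod_status_ready{condition="true", namespace="d8-virtualization", pod="virt-controller-0"}'
        values: '0x20'
      - series: 'kube_pod_status_phase{namespace="d8-virtualization", phase="Running", pod="virt-controller-0"}'
        values: '1x20'
    alert_rule_test:
      - eval_time: 15m
        alertname: D8InternalVirtualizationVirtControllerPodIsNotReady
        exp_alerts:
          - exp_labels:
              severity_level: "6"
              condition: "true"
              namespace: d8-virtualization
              pod: virt-controller-0
```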
Signed-off-by: Nikita Korolev <[email protected]>

fix phase cdi-apiserver

Signed-off-by: Nikita Korolev <[email protected]>
Description
Fixed alerts:
- D8InternalVirtualizationCDIAPIServerPodIsNotReady
- D8InternalVirtualizationCDIDeploymentPodIsNotReady
- D8InternalVirtualizationCDIOperatorPodIsNotReady
- D8InternalVirtualizationVirtAPIPodIsNotReady
- D8InternalVirtualizationVirtControllerPodIsNotReady
Why do we need it, and what problem does it solve?
What is the expected result?
Checklist
Changelog entries