Commit c47fe46

feat(operator): Add warning alert for when LokiStack is not getting ready (#19258)
Signed-off-by: Joao Marcal <[email protected]>
1 parent 386d4e1 commit c47fe46

3 files changed: +76 -0 lines changed

operator/docs/lokistack/sop.md

Lines changed: 45 additions & 0 deletions
@@ -411,3 +411,48 @@ The schema configuration does not contain the most recent schema version and nee
 ### Steps
 
 - Add a new object storage schema V13 with a future EffectiveDate
+
+## Lokistack Components Not Ready Warning
+
+### Impact
+
+One or more LokiStack components are not ready, which can disrupt ingestion or querying and lead to degraded service.
+
+### Summary
+
+The LokiStack reports that some components have not reached the `Ready` state. This might be related to Kubernetes resources (Pods/Deployments), configuration, or external dependencies.
+
+### Severity
+
+`Warning`
+
+### Access Required
+
+- Console access to the cluster
+- Edit or view access in the namespace where the LokiStack is deployed:
+  - OpenShift
+    - `openshift-logging` (LokiStack)
+
+### Steps
+
+- Inspect the LokiStack conditions and events
+  - Describe the LokiStack resource and review status conditions:
+    - `kubectl -n <namespace> describe lokistack <name>`
+  - Check for conditions that would lead to some pods not being in the `Ready` state
+- Check operator and reconciliation status
+  - Ensure the Loki Operator is running and not reporting errors:
+    - `kubectl -n <operator-namespace> logs deploy/loki-operator-controller-manager`
+  - Look for reconcile errors related to missing permissions, invalid fields, or failed rollouts.
+- Verify component Pods and Deployments
+  - Ensure all core components are running and Ready in the LokiStack namespace:
+    - `distributor`, `ingester`, `querier`, `query-frontend`, `index-gateway`, `compactor`, `gateway`
+  - Check Pod readiness and recent restarts:
+    - `kubectl -n <namespace> get pods`
+    - `kubectl -n <namespace> describe pod <pod>`
+- Examine Kubernetes events for failures
+  - `kubectl -n <namespace> get events --sort-by=.lastTimestamp`
+  - Common causes: image pull backoffs, failed mounts, readiness probe failures, or insufficient resources
+- Validate configuration and referenced resources
+  - Confirm referenced `Secrets` and `ConfigMaps` exist and have correct keys
+  - Look into the Pod logs of the component that is still not `Ready`:
+    - `kubectl -n <namespace> logs <pod>`
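The "Check Pod readiness and recent restarts" step in the runbook above can be partly automated. The following is a minimal sketch, not part of the commit, that filters the JSON produced by `kubectl -n <namespace> get pods -o json` down to Pods whose `Ready` condition is not `True`. The function name `not_ready_pods` and the sample Pod names are hypothetical; the `status.conditions` and `status.containerStatuses[].restartCount` fields are the standard Kubernetes Pod status fields.

```python
def not_ready_pods(pod_list: dict) -> list[tuple[str, int]]:
    """Return (pod name, total container restarts) for each Pod that is not Ready.

    `pod_list` is the parsed output of `kubectl get pods -o json`.
    """
    result = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        # A Pod is Ready only if it has a condition of type "Ready" with status "True".
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        if ready:
            continue
        restarts = sum(
            cs.get("restartCount", 0) for cs in status.get("containerStatuses", [])
        )
        result.append((pod["metadata"]["name"], restarts))
    return result


# Hypothetical sample data; real input would come from kubectl.
SAMPLE = {
    "items": [
        {
            "metadata": {"name": "lokistack-distributor-0"},
            "status": {
                "conditions": [{"type": "Ready", "status": "True"}],
                "containerStatuses": [{"restartCount": 0}],
            },
        },
        {
            "metadata": {"name": "lokistack-ingester-0"},
            "status": {
                "conditions": [{"type": "Ready", "status": "False"}],
                "containerStatuses": [{"restartCount": 3}],
            },
        },
    ]
}

for name, restarts in not_ready_pods(SAMPLE):
    print(f"{name}: not Ready, {restarts} restart(s)")
```

The Pods it reports are the ones to pass to `kubectl describe pod` and `kubectl logs` in the remaining runbook steps.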

operator/internal/manifests/internal/alerts/prometheus-alerts.yaml

Lines changed: 17 additions & 0 deletions
@@ -227,3 +227,20 @@ groups:
     for: 1m
     labels:
       severity: warning
+  - alert: LokistackComponentsNotReadyWarning
+    annotations:
+      description: |-
+        The LokiStack "{{ $labels.stack_name }}" in namespace "{{ $labels.namespace }}" has components that are not ready.
+      summary: "One or more LokiStack components are not ready."
+      runbook_url: "[[ .RunbookURL ]]#Lokistack-Components-Not-Ready-Warning"
+    expr: |
+      sum (
+        label_replace(
+          lokistack_status_condition{reason="ReadyComponents", status="false"},
+          "namespace", "$1", "stack_namespace", "(.+)"
+        )
+      ) by (stack_name, namespace)
+      > 0
+    for: 15m
+    labels:
+      severity: warning
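To make the alert expression above easier to follow, here is a small Python sketch of its three stages: the label selector, `label_replace` copying `stack_namespace` into `namespace`, and `sum by (stack_name, namespace) > 0`. It is an illustration of the PromQL semantics under simplifying assumptions (only the `$1` capture group is supported, and samples are plain dicts), not how Prometheus evaluates rules. The helper names and the second sample are hypothetical.

```python
import re
from collections import defaultdict


def label_replace(series, dst, replacement, src, regex):
    """Mimic PromQL label_replace: on a full regex match of the src label,
    set dst to the replacement; otherwise return the series unchanged."""
    pattern = re.compile(regex)
    out = []
    for labels, value in series:
        new_labels = dict(labels)
        match = pattern.fullmatch(labels.get(src, ""))
        if match:
            # Simplification: only the "$1" capture reference is handled.
            new_labels[dst] = replacement.replace("$1", match.group(1))
        out.append((new_labels, value))
    return out


def sum_by(series, by):
    """Mimic `sum(...) by (...)`: add up values per combination of the given labels."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[tuple(labels.get(name, "") for name in by)] += value
    return dict(totals)


def components_not_ready(samples):
    """Evaluate the alert expression against one instant of samples."""
    selected = [
        (labels, value)
        for labels, value in samples
        if labels.get("__name__") == "lokistack_status_condition"
        and labels.get("reason") == "ReadyComponents"
        and labels.get("status") == "false"
    ]
    renamed = label_replace(selected, "namespace", "$1", "stack_namespace", "(.+)")
    totals = sum_by(renamed, ("stack_name", "namespace"))
    return {key: total for key, total in totals.items() if total > 0}


SAMPLES = [
    # Mirrors the series used in the commit's test data.
    ({"__name__": "lokistack_status_condition", "stack_name": "mystack",
      "stack_namespace": "my-ns", "reason": "ReadyComponents", "status": "false"}, 1.0),
    # Hypothetical second stack whose value is 0, so "> 0" drops it.
    ({"__name__": "lokistack_status_condition", "stack_name": "other",
      "stack_namespace": "other-ns", "reason": "ReadyComponents", "status": "false"}, 0.0),
]

print(components_not_ready(SAMPLES))
```

The surviving label set is what feeds `{{ $labels.stack_name }}` and `{{ $labels.namespace }}` in the alert description, which is presumably why the expression renames `stack_namespace` before aggregating.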

operator/internal/manifests/internal/alerts/testdata/test.yaml

Lines changed: 14 additions & 0 deletions
@@ -66,6 +66,9 @@ tests:
       - series: 'loki_discarded_samples_total{namespace="my-ns", tenant="application", reason="line_too_long"}'
         values: '0x5 0+120x25 3000'
 
+      - series: 'lokistack_status_condition{stack_name="mystack", stack_namespace="my-ns", reason="ReadyComponents", status="false"}'
+        values: '1+0x25'
+
       - series: 'loki_ingester_chunks_flush_failures_total{namespace="my-ns", pod="ingester-0"}'
         values: '0+25x20'
       - series: 'loki_ingester_chunks_flush_requests_total{namespace="my-ns", pod="ingester-0"}'
@@ -200,6 +203,17 @@ tests:
             summary: Loki is discarding samples during ingestion because they fail validation.
             runbook_url: "[[ .RunbookURL]]#Loki-Discarded-Samples-Warning"
       - eval_time: 16m
+        alertname: LokistackComponentsNotReadyWarning
+        exp_alerts:
+          - exp_labels:
+              namespace: my-ns
+              stack_name: mystack
+              severity: warning
+            exp_annotations:
+              description: 'The LokiStack "mystack" in namespace "my-ns" has components that are not ready.'
+              summary: "One or more LokiStack components are not ready."
+              runbook_url: "[[ .RunbookURL ]]#Lokistack-Components-Not-Ready-Warning"
+      - eval_time: 16m
         alertname: LokiIngesterFlushFailureRateCritical
         exp_alerts:
           - exp_labels:
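The new test case relies on promtool's expanding notation (`'1+0x25'`) and on the rule's `for: 15m` hold period being satisfied by `eval_time: 16m`. The sketch below expands the notation and checks that timing. It assumes one-minute sample and evaluation intervals, neither of which is shown in this diff, and it does not handle special tokens such as `_` or `stale`. The function names are hypothetical.

```python
import re

# Matches 'a+bxn', 'a-bxn', and the shorthand 'axn' (same as 'a+0xn').
TOKEN = re.compile(r"^(-?\d+(?:\.\d+)?)(?:([+-]\d+(?:\.\d+)?))?x(\d+)$")


def expand(notation: str) -> list[float]:
    """Expand promtool series notation: 'a+bxn' yields n+1 samples a, a+b, ..., a+n*b."""
    values = []
    for token in notation.split():
        match = TOKEN.match(token)
        if match is None:
            values.append(float(token))  # a plain literal sample
            continue
        start = float(match.group(1))
        step = float(match.group(2) or 0)
        count = int(match.group(3))
        values.extend(start + step * i for i in range(count + 1))
    return values


def is_firing(values: list[float], hold_minutes: int, eval_minute: int) -> bool:
    """Return True if `value > 0` has held continuously for at least
    `hold_minutes` at `eval_minute`. values[i] is the sample at minute i
    (assumed one-minute interval)."""
    active_since = None
    for minute in range(eval_minute + 1):
        if minute < len(values) and values[minute] > 0:
            if active_since is None:
                active_since = minute
        else:
            active_since = None
    return active_since is not None and eval_minute - active_since >= hold_minutes


samples = expand("1+0x25")
print(len(samples), is_firing(samples, hold_minutes=15, eval_minute=16))
```

Under those assumptions the series holds the value 1 from minute 0 through minute 25, so the condition has been true for longer than the 15-minute hold period by the time the test evaluates at 16m.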
