Commit d939670

feat(operator): Added LokistackPendingComponents alert
1 parent 37eddab commit d939670

3 files changed: +72 -0 lines changed

operator/docs/lokistack/sop.md

Lines changed: 45 additions & 0 deletions
@@ -365,3 +365,48 @@ The schema configuration does not contain the most recent schema version and nee
 ### Steps
 
 - Add a new object storage schema V13 with a future EffectiveDate
+
+## Lokistack Components Not Ready
+
+### Impact
+
+One or more LokiStack components are not ready, which can disrupt ingestion or querying and lead to degraded service.
+
+### Summary
+
+The LokiStack reports that some components have not reached the `Ready` state. This might be related to Kubernetes resources (Pods/Deployments), configuration, or external dependencies.
+
+### Severity
+
+`Critical`
+
+### Access Required
+
+- Console access to the cluster
+- Edit or view access in the namespace where the LokiStack is deployed:
+  - OpenShift
+    - `openshift-logging` (LokiStack)
+
+### Steps
+
+- Inspect the LokiStack conditions and events
+  - Describe the LokiStack resource and review status conditions:
+    - `kubectl -n <namespace> describe lokistack <name>`
+  - Check for conditions that would lead to some pods not being in the `Ready` state
+- Check operator and reconciliation status
+  - Ensure the Loki Operator is running and not reporting errors:
+    - `kubectl -n <operator-namespace> logs deploy/loki-operator`
+  - Look for reconcile errors related to missing permissions, invalid fields, or failed rollouts.
+- Verify component Pods and Deployments
+  - Ensure all core components are running and Ready in the LokiStack namespace:
+    - `distributor`, `ingester`, `querier`, `query-frontend`, `index-gateway`, `compactor`, `gateway`
+  - Check Pod readiness and recent restarts:
+    - `kubectl -n <namespace> get pods`
+    - `kubectl -n <namespace> describe pod <pod>`
+- Examine Kubernetes events for failures
+  - `kubectl -n <namespace> get events --sort-by=.lastTimestamp`
+  - Common causes: image pull backoffs, failed mounts, readiness probe failures, or insufficient resources
+- Validate configuration and referenced resources
+  - Confirm that the referenced `Secrets` and `ConfigMaps` exist and have the correct keys
+- Look into the Pod logs of the component that is still not `Ready`:
+  - `kubectl -n <namespace> logs <pod>`
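
Note on the "Check Pod readiness and recent restarts" step: the check can be scripted. The Python sketch below is illustrative and not part of this commit. It assumes the input is the JSON printed by `kubectl -n <namespace> get pods -o json`, and the pod names in the sample are hypothetical. It lists every pod whose `Ready` condition is not `True`, together with its total container restart count.

```python
import json


def not_ready_pods(pod_list_json: str) -> list[tuple[str, int]]:
    """Return (pod name, total restarts) for each pod that is not Ready."""
    result = []
    for pod in json.loads(pod_list_json).get("items", []):
        status = pod.get("status", {})
        # A pod counts as ready only if it has a Ready condition with status "True".
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        if not ready:
            restarts = sum(
                cs.get("restartCount", 0)
                for cs in status.get("containerStatuses", [])
            )
            result.append((pod["metadata"]["name"], restarts))
    return result


# Hypothetical sample shaped like `kubectl get pods -o json` output.
SAMPLE = json.dumps({
    "items": [
        {
            "metadata": {"name": "mystack-ingester-0"},
            "status": {
                "conditions": [{"type": "Ready", "status": "False"}],
                "containerStatuses": [{"restartCount": 4}],
            },
        },
        {
            "metadata": {"name": "mystack-distributor-0"},
            "status": {
                "conditions": [{"type": "Ready", "status": "True"}],
                "containerStatuses": [{"restartCount": 0}],
            },
        },
    ]
})

print(not_ready_pods(SAMPLE))  # [('mystack-ingester-0', 4)]
```

Pods returned here are the ones to pass to `kubectl -n <namespace> describe pod <pod>` and `kubectl -n <namespace> logs <pod>` in the later steps.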

operator/internal/manifests/internal/alerts/prometheus-alerts.yaml

Lines changed: 11 additions & 0 deletions
@@ -209,3 +209,14 @@ groups:
     for: 1m
     labels:
       severity: warning
+  - alert: LokistackComponentsNotReady
+    annotations:
+      description: |-
+        The LokiStack "{{ $labels.stack_name }}" in namespace "{{ $labels.stack_namespace }}" has components that are not ready.
+      summary: "One or more LokiStack components are not ready."
+      runbook_url: "[[ .RunbookURL ]]#Lokistack-Pending-Components"
+    expr: |
+      sum (lokistack_status_condition{reason="ReadyComponents", status="false"}) by (stack_name, stack_namespace, reason, status) == 1
+    for: 10m
+    labels:
+      severity: critical
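
Note on the alert expression: the Python sketch below shows which label groups `expr` selects. It is not part of this commit. It assumes the gauge is 1 when the labelled condition state holds and 0 otherwise, which is what the test data in this commit suggests. The second stack, `other`, is hypothetical.

```python
from collections import defaultdict

# Labels kept by the `by (...)` clause of the alert expression.
GROUP_BY = ("stack_name", "stack_namespace", "reason", "status")


def firing_groups(samples):
    """Mimic: sum(selector) by (GROUP_BY) == 1 over (labels, value) pairs."""
    sums = defaultdict(float)
    for labels, value in samples:
        # Selector: reason="ReadyComponents", status="false".
        if labels.get("reason") == "ReadyComponents" and labels.get("status") == "false":
            key = tuple(labels.get(name, "") for name in GROUP_BY)
            sums[key] += value
    # Comparison filter: keep only groups whose sum is exactly 1.
    return [dict(zip(GROUP_BY, key)) for key, total in sums.items() if total == 1]


SAMPLES = [
    ({"stack_name": "mystack", "stack_namespace": "my-ns",
      "reason": "ReadyComponents", "status": "false"}, 1),
    ({"stack_name": "mystack", "stack_namespace": "my-ns",
      "reason": "ReadyComponents", "status": "true"}, 0),
    ({"stack_name": "other", "stack_namespace": "my-ns",
      "reason": "ReadyComponents", "status": "false"}, 0),
]

print(firing_groups(SAMPLES))
```

Only `mystack` is selected, and the surviving labels (`stack_name`, `stack_namespace`, `reason`, `status`) are the ones the unit test expects in `exp_labels` alongside `severity`.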

operator/internal/manifests/internal/alerts/testdata/test.yaml

Lines changed: 16 additions & 0 deletions
@@ -66,6 +66,9 @@ tests:
       - series: 'loki_discarded_samples_total{namespace="my-ns", tenant="application", reason="line_too_long"}'
         values: '0x5 0+120x25 3000'
 
+      - series: 'lokistack_status_condition{stack_name="mystack", stack_namespace="my-ns", reason="ReadyComponents", status="false"}'
+        values: '0+0x15 1+0x10'
+
     alert_rule_test:
       - eval_time: 16m
         alertname: LokiRequestErrors
@@ -194,3 +197,16 @@ tests:
                 Samples are discarded because of "line_too_long" at a rate of 2 samples per second.
               summary: Loki is discarding samples during ingestion because they fail validation.
               runbook_url: "[[ .RunbookURL]]#Loki-Discarded-Samples-Warning"
+      - eval_time: 26m
+        alertname: LokistackComponentsNotReady
+        exp_alerts:
+          - exp_labels:
+              stack_namespace: my-ns
+              stack_name: mystack
+              reason: ReadyComponents
+              status: "false"
+              severity: critical
+            exp_annotations:
+              description: 'The LokiStack "mystack" in namespace "my-ns" has components that are not ready.'
+              summary: "One or more LokiStack components are not ready."
+              runbook_url: "[[ .RunbookURL ]]#Lokistack-Pending-Components"
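
Note on the test timing: the series `'0+0x15 1+0x10'` and the `for: 10m` clause together explain why the test evaluates at `26m`. The Python sketch below works through that arithmetic and is not part of this commit. It assumes one sample and one rule evaluation per minute, since the test's `interval` is not visible in this diff, and it handles only a subset of promtool's expanding notation (`a+bxn`, `a-bxn`, `axn`, and bare numbers).

```python
import re

# One expanding-notation term: start, optional signed step, repeat count.
TERM = re.compile(r"^(-?\d+(?:\.\d+)?)(?:([+-])(\d+(?:\.\d+)?))?x(\d+)$")


def expand(values: str) -> list[float]:
    """Expand a promtool `values` string into one sample per interval."""
    samples = []
    for term in values.split():
        match = TERM.match(term)
        if match:
            start, sign, step, count = match.groups()
            delta = float(step or 0) * (-1 if sign == "-" else 1)
            # 'a+bxn' yields the start value plus n further samples.
            samples.extend(float(start) + i * delta for i in range(int(count) + 1))
        else:
            samples.append(float(term))
    return samples


def first_firing_minute(samples, hold_minutes):
    """First minute at which `value == 1` has held for `hold_minutes`."""
    active_since = None
    for minute, value in enumerate(samples):
        if value == 1:
            if active_since is None:
                active_since = minute  # alert becomes pending here
            if minute - active_since >= hold_minutes:
                return minute  # pending long enough, alert fires
        else:
            active_since = None
    return None


samples = expand("0+0x15 1+0x10")
print(len(samples))                      # 27 samples, minutes 0 through 26
print(samples.index(1.0))                # 16, first minute the condition holds
print(first_firing_minute(samples, 10))  # 26, matching eval_time: 26m
```

Under these assumptions the condition first holds at minute 16, stays pending for 10 minutes, and fires at minute 26, which is the earliest `eval_time` at which the expected alert can appear.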
