katib-controller reconciles excessively after neighbor charm pod restart, causing test timeouts #397

Description

@dmvdm

Note: This issue was generated with AI assistance (GitHub Copilot) based on automated log analysis and triage.
Filed by @canonical/solutions-qa


Summary

When mysql-k8s (a database provider related to katib-db-manager) has its pod deleted and recreated, katib-controller receives repeated reconciliation events and runs the full reconciliation cycle approximately 4 times over about 6 minutes (11:44:15Z to 11:50:38Z in the failing run). Integration tests that wait for all apps to return to active (with a 5-minute timeout) therefore fail intermittently, because the timeout fires mid-reconciliation on the 4th pass.

Test Observer Link

https://test-observer.canonical.com/#/charms/406078?testExecutionId=443475&testResultId=10110627

Environment

  • katib-controller rev 1163, channel 0.18/stable
  • katib-db-manager rev 1123, channel 0.18/stable
  • mysql-k8s rev 400, channel 8.0/candidate
  • Juju 3.6.20, Kubernetes (ubuntu:22.04), amd64
  • Test: integration/mysql-k8s:database/mysql_client/katib-db-manager:relational-db

Observed Behaviour

After the mysql-k8s pod is deleted and recreated by the test (simulating a pod failure), the katib-controller status history shows 4 separate reconciliations before it settles, the first being the initial settle after deploy:

Time (UTC)  Event
11:44:15Z   Reconciliation 1 (initial settle after deploy)
11:46:34Z   Reconciliation 2 (~2 min after pod deletion)
11:48:16Z   Reconciliation 3 (~2 min later)
11:50:32Z   Reconciliation 4 (test times out at 11:50:35Z, 3 seconds before active)
11:50:38Z   katib-controller becomes active (too late)
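
The spacing between passes and the margin by which the test missed active can be checked directly from the timestamps in the table above. The following is a minimal sketch that only does that arithmetic; the variable names are illustrative and nothing here comes from the charm or the test suite.

```python
from datetime import datetime

# Timestamps copied from the table above (27 Mar 2026, UTC).
FMT = "%H:%M:%SZ"
reconciliations = ["11:44:15Z", "11:46:34Z", "11:48:16Z", "11:50:32Z"]
test_timeout = "11:50:35Z"
became_active = "11:50:38Z"


def parse(ts: str) -> datetime:
    """Parse a time-of-day stamp; the date part is irrelevant for differences."""
    return datetime.strptime(ts, FMT)


starts = [parse(t) for t in reconciliations]

# Gap between the start of each reconciliation and the start of the previous one.
gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
for number, gap in enumerate(gaps, start=2):
    print(f"Reconciliation {number} started {gap} after reconciliation {number - 1}")

# How long after the test timeout the charm actually reached active.
missed_by = parse(became_active) - parse(test_timeout)
print(f"Timeout fired {missed_by.total_seconds():.0f} s before active")
```

The gaps come out at roughly 2 minutes each, which matches the "~2 min" annotations in the table.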

Each reconciliation itself is healthy and fast (~6 seconds, all Kubernetes API calls return 200 OK). The issue is the charm enqueuing itself multiple times from what should be a single relation-changed event cascade: mysql-k8s pod restart → katib-db-manager:relational-db → katib-controller:k8s-service-info.

Expected Behaviour

katib-controller should settle to active after a single reconciliation pass triggered by the relation-changed event, not continue re-queuing every ~2 minutes.

Impact

Intermittent ~50% failure rate on test_pod_deletion for mysql-k8s rev 400 (2 of 4 runs failed with identical symptoms: executions 443473 and 443475).

Steps to Reproduce

  1. Deploy katib-controller + katib-db-manager + mysql-k8s
  2. Wait for all apps to go active
  3. Delete the target-0 (mysql-k8s) pod: kubectl delete pod target-0 -n <model>
  4. Wait and observe juju status — katib-controller will cycle through maintenance multiple times before settling
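
Steps 3 and 4 could be scripted roughly as below, so that the number of times katib-controller enters maintenance is counted rather than eyeballed. This is a sketch under assumptions: it assumes `juju status --format json` exposes the workload status at `applications.<app>.application-status.current`, and the function names, polling interval and observation window are hypothetical choices, not part of the failing test.

```python
import json
import subprocess
import time
from typing import Callable


def app_status(app: str) -> str:
    """Read an application's current status from `juju status` (assumed JSON layout)."""
    raw = subprocess.run(
        ["juju", "status", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(raw)["applications"][app]["application-status"]["current"]


def count_maintenance_entries(
    fetch: Callable[[], str],
    duration_s: float,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Poll `fetch` for `duration_s` seconds and count transitions into maintenance."""
    entries = 0
    previous = fetch()
    deadline = clock() + duration_s
    while clock() < deadline:
        sleep(interval_s)
        current = fetch()
        if current == "maintenance" and previous != "maintenance":
            entries += 1
        previous = current
    return entries


def reproduce(model: str, window_s: float = 600.0) -> int:
    """Delete the mysql-k8s pod (step 3), then watch katib-controller (step 4)."""
    subprocess.run(["kubectl", "delete", "pod", "target-0", "-n", model], check=True)
    return count_maintenance_entries(lambda: app_status("katib-controller"), window_s)
```

Calling `reproduce("<model>")` with the real model name would return how many times the charm entered maintenance in the window; a result above 1 would correspond to the repeated reconciliation described in this issue. Because each pass lasts only about 6 seconds, a coarse polling interval could miss one.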

Logs

The juju status log shows the repeated reconciliation pattern clearly:

27 Mar 2026 11:46:34Z  workload  maintenance  Reconciling charm: executing component kubernetes:auths-webhooks-crds-configmaps
27 Mar 2026 11:46:40Z  workload  active
27 Mar 2026 11:48:16Z  workload  maintenance  Reconciling charm: executing component kubernetes:auths-webhooks-crds-configmaps
27 Mar 2026 11:48:23Z  workload  active
27 Mar 2026 11:50:32Z  workload  maintenance  Reconciling charm: executing component kubernetes:auths-webhooks-crds-configmaps
27 Mar 2026 11:50:38Z  workload  active       ← test already timed out at 11:50:35Z
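
The repeated pattern can also be extracted mechanically from the status log. The sketch below parses the six lines shown above (with the trailing annotation on the last line dropped) and measures each maintenance to active cycle. The helper names are illustrative, and the `%b` month parsing assumes an English or C locale.

```python
from datetime import datetime

# Status history lines copied from the log excerpt above.
STATUS_LOG = """\
27 Mar 2026 11:46:34Z  workload  maintenance  Reconciling charm: executing component kubernetes:auths-webhooks-crds-configmaps
27 Mar 2026 11:46:40Z  workload  active
27 Mar 2026 11:48:16Z  workload  maintenance  Reconciling charm: executing component kubernetes:auths-webhooks-crds-configmaps
27 Mar 2026 11:48:23Z  workload  active
27 Mar 2026 11:50:32Z  workload  maintenance  Reconciling charm: executing component kubernetes:auths-webhooks-crds-configmaps
27 Mar 2026 11:50:38Z  workload  active
"""


def parse_line(line: str) -> tuple[datetime, str]:
    """Return (timestamp, status) from one status-history line."""
    # Fields: day, month, year, time, kind ("workload"), status, optional message.
    fields = line.split(None, 6)
    when = datetime.strptime(" ".join(fields[:4]), "%d %b %Y %H:%M:%SZ")
    return when, fields[5]


def reconciliation_durations(log: str) -> list[float]:
    """Seconds spent in each maintenance -> active cycle, in log order."""
    durations = []
    started = None
    for line in log.splitlines():
        if not line.strip():
            continue
        when, status = parse_line(line)
        if status == "maintenance" and started is None:
            started = when
        elif status == "active" and started is not None:
            durations.append((when - started).total_seconds())
            started = None
    return durations


durations = reconciliation_durations(STATUS_LOG)
print(len(durations), durations)  # prints: 3 [6.0, 7.0, 6.0]
```

The three cycles of 6 to 7 seconds each agree with the "~6 seconds" per reconciliation noted under Observed Behaviour.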
