katib-controller reconciles excessively after neighbor charm pod restart, causing test timeouts #397

**Note: This issue was generated with AI assistance (GitHub Copilot) based on automated log analysis and triage.**
Filed by @canonical/solutions-qa

---

## Summary

When the `mysql-k8s` pod (a database provider related to `katib-db-manager`) is deleted and recreated, `katib-controller` receives repeated reconciliation events. The status log shows the full reconciliation cycle running **4 times in roughly 6.5 minutes** (11:44:15Z to 11:50:38Z), counting the initial settle after deploy. Integration tests that wait for all apps to return to `active` with a 5-minute timeout therefore fail intermittently: in this run the timeout fired mid-reconciliation on the 4th pass.

## Test Observer Link

https://test-observer.canonical.com/#/charms/406078?testExecutionId=443475&testResultId=10110627

## Environment

- **katib-controller** rev 1163, channel `0.18/stable`
- **katib-db-manager** rev 1123, channel `0.18/stable`
- **mysql-k8s** rev 400, channel `8.0/candidate`
- Juju 3.6.20, Kubernetes (ubuntu:22.04), amd64
- Test: `integration/mysql-k8s:database/mysql_client/katib-db-manager:relational-db`

## Observed Behaviour

After the test deletes the `mysql-k8s` pod (simulating a pod failure) and the pod is recreated, `katib-controller` keeps reconciling. Counting the initial settle after deploy, it reconciles 4 separate times before settling:

| Time (UTC) | Event |
|---|---|
| 11:44:15Z | Reconciliation 1 (initial settle after deploy) |
| 11:46:34Z | Reconciliation 2 (~2 min after pod deletion) |
| 11:48:16Z | Reconciliation 3 (~2 min later) |
| 11:50:32Z | Reconciliation 4 — **test times out at 11:50:35Z**, 3 seconds before active |
| 11:50:38Z | katib-controller becomes active (too late) |

Each reconciliation is itself healthy and fast (~6 seconds, with all Kubernetes API calls returning `200 OK`). The problem appears to be the charm enqueuing itself multiple times from what should be a single `relation-changed` event cascade: `mysql-k8s pod restart` → `katib-db-manager:relational-db` → `katib-controller:k8s-service-info`.
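To make the timing concrete, the following minimal sketch recomputes the gaps between reconciliation starts and the margin by which the test missed `active`. It uses only the timestamps from the table above; the constant and function names are illustrative and not taken from the test suite.

```python
from datetime import datetime

# Timestamps copied from the table above (all on the same day, UTC).
RECONCILE_STARTS = ["11:44:15", "11:46:34", "11:48:16", "11:50:32"]
TEST_TIMEOUT_AT = "11:50:35"
ACTIVE_AT = "11:50:38"


def parse(hms: str) -> datetime:
    """Parse an HH:MM:SS string; the date does not matter for same-day deltas."""
    return datetime.strptime(hms, "%H:%M:%S")


def gaps_seconds(stamps: list[str]) -> list[int]:
    """Return the seconds between each consecutive pair of timestamps."""
    times = [parse(s) for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


gaps = gaps_seconds(RECONCILE_STARTS)
missed_by = int((parse(ACTIVE_AT) - parse(TEST_TIMEOUT_AT)).total_seconds())

print(gaps)       # [139, 102, 136]
print(missed_by)  # 3
```

The gaps of roughly 1.7 to 2.3 minutes match the "every ~2 minutes" re-queue pattern described here, and the 3-second margin shows how close this run was to passing.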

## Expected Behaviour

`katib-controller` should settle to `active` after a single reconciliation pass triggered by the relation-changed event, not continue re-queuing every ~2 minutes.

## Impact

Intermittent ~50% failure rate on `test_pod_deletion` for mysql-k8s rev 400 (2 of 4 runs failed with identical symptoms: executions 443473 and 443475).

## Steps to Reproduce

1. Deploy `katib-controller` + `katib-db-manager` + `mysql-k8s`
2. Wait for all apps to go `active`
3. Delete the `target-0` (mysql-k8s) pod: `kubectl delete pod target-0 -n <model>`
4. Wait and observe `juju status`: `katib-controller` will cycle through `maintenance` multiple times before settling
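Steps 3 and 4 could be scripted roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the unit name `katib-controller/0` is assumed, the Kubernetes namespace is assumed to match the Juju model name (as in step 3), and `kubectl` and `juju` are assumed to be on `PATH` and pointed at the right cluster and controller.

```python
import subprocess
import time


def delete_pod_cmd(model: str, pod: str = "target-0") -> list[str]:
    """Build the kubectl command from step 3 (namespace assumed equal to the model)."""
    return ["kubectl", "delete", "pod", pod, "-n", model]


def status_log_cmd(model: str, unit: str = "katib-controller/0") -> list[str]:
    """Build a command that prints the unit's status history (unit name is assumed)."""
    return ["juju", "show-status-log", "-m", model, unit]


def reproduce(model: str, observe_seconds: int = 480, poll_seconds: int = 30) -> None:
    """Delete the mysql-k8s pod, then print the status history periodically."""
    subprocess.run(delete_pod_cmd(model), check=True)
    deadline = time.monotonic() + observe_seconds
    while time.monotonic() < deadline:
        # check=False: keep polling even if a single juju call fails transiently.
        subprocess.run(status_log_cmd(model), check=False)
        time.sleep(poll_seconds)
```

Calling `reproduce("<model>")` against a deployed model should print a status history in which `maintenance` and `active` alternate several times if the behaviour reproduces.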

## Logs

The juju status log shows the repeated reconciliation pattern clearly:
```
27 Mar 2026 11:46:34Z  workload  maintenance  Reconciling charm: executing component kubernetes:auths-webhooks-crds-configmaps
27 Mar 2026 11:46:40Z  workload  active
27 Mar 2026 11:48:16Z  workload  maintenance  Reconciling charm: executing component kubernetes:auths-webhooks-crds-configmaps
27 Mar 2026 11:48:23Z  workload  active
27 Mar 2026 11:50:32Z  workload  maintenance  Reconciling charm: executing component kubernetes:auths-webhooks-crds-configmaps
27 Mar 2026 11:50:38Z  workload  active       ← test already timed out at 11:50:35Z
```
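As a cross-check on the "~6 seconds" figure, here is a minimal sketch that parses the excerpt above and measures each `maintenance` to `active` cycle. The regular expression assumes the line layout shown in this excerpt only, the trailing annotation on the last line is omitted, and month-name parsing assumes an English or C locale.

```python
import re
from datetime import datetime

# Log lines copied from the excerpt above (annotation on the last line removed).
STATUS_LOG = """\
27 Mar 2026 11:46:34Z  workload  maintenance  Reconciling charm: executing component kubernetes:auths-webhooks-crds-configmaps
27 Mar 2026 11:46:40Z  workload  active
27 Mar 2026 11:48:16Z  workload  maintenance  Reconciling charm: executing component kubernetes:auths-webhooks-crds-configmaps
27 Mar 2026 11:48:23Z  workload  active
27 Mar 2026 11:50:32Z  workload  maintenance  Reconciling charm: executing component kubernetes:auths-webhooks-crds-configmaps
27 Mar 2026 11:50:38Z  workload  active
"""

# Captures the timestamp (without the trailing Z) and the workload status word.
LINE_RE = re.compile(r"^(\d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2})Z\s+workload\s+(\w+)")


def reconcile_durations(log: str) -> list[int]:
    """Return the seconds spent in each maintenance -> active cycle."""
    durations = []
    started = None
    for line in log.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        when = datetime.strptime(match.group(1), "%d %b %Y %H:%M:%S")
        status = match.group(2)
        if status == "maintenance" and started is None:
            started = when
        elif status == "active" and started is not None:
            durations.append(int((when - started).total_seconds()))
            started = None
    return durations


print(reconcile_durations(STATUS_LOG))  # [6, 7, 6]
```

Three short cycles with long idle gaps between them are consistent with repeated re-queuing rather than one slow reconciliation.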

