AKRI forgets custom discovery handler registration when restarted #558

Open
koepalex opened this issue Feb 20, 2023 · 6 comments
Labels: bug (Something isn't working), keep-alive

Comments

@koepalex

Describe the bug
An AKRI custom discovery handler uses a gRPC call to register with the AKRI core. If the AKRI core component restarts (see #557), it seems to lose the registration information, and the discovery handler is never called to discover.
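For context, registration boils down to a single gRPC request from the handler to the agent. A minimal sketch of its shape, assuming field names along the lines of Akri's RegisterDiscoveryHandlerRequest (the names and example values here are illustrative, not copied from the agent):

```rust
// Rough shape of the registration message a custom discovery handler sends
// to the agent over gRPC. Field names follow Akri's discovery protocol as I
// understand it; treat exact names and values as assumptions.
struct RegisterDiscoveryHandlerRequest {
    name: String,                // handler name, e.g. "opcua-asset"
    endpoint: String,            // where the agent can reach the handler's Discover service
    endpoint_type: EndpointType, // how to interpret `endpoint`
    shared: bool,                // whether discovered devices are shared across nodes
}

enum EndpointType {
    Uds,     // unix domain socket path
    Network, // TCP endpoint
}
```

The agent appears to keep this registration only in memory, which is why restarting it loses the handler until it registers again.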

Output of kubectl get pods,akrii,akric -o wide

 NAME                                              READY   STATUS    RESTARTS         AGE     IP            NODE                                NOMINATED NODE   READINESS GATES
pod/akri-agent-daemonset-4786l                    1/1     Running   68 (2d18h ago)   3d19h   10.244.1.31   aks-agentpool-28475719-vmss000000   <none>           <none>
pod/akri-agent-daemonset-d7kx5                    1/1     Running   93 (2d14h ago)   3d19h   10.244.0.27   aks-agentpool-28475719-vmss000001   <none>           <none>
pod/akri-controller-deployment-6cb9b9dcbb-d4c2s   1/1     Running   117 (3d3h ago)

Kubernetes Version: AKS 1.24.9

To Reproduce
Steps to reproduce the behavior:

  1. Create cluster using AKS
  2. Install Akri with the Helm command
  3. Configure Akri to use custom discovery handler
  4. Force restart of Akri agents

Expected behavior
Once AKRI is restarted, it should also restart the referenced discovery handlers and thereby let them re-register. Alternatively, AKRI could persist the registration information and reuse the discovery handlers once it is restarted.

Logs (please share snips of applicable logs)

agent.log

It seems that this line is unexpected:

[2023-02-20T11:18:39Z TRACE agent::util::discovery_operator] delete_offline_instances - entered for configuration Some("akri-opcua-asset")

The discovery handler is still up and running, waiting for a discover call.

Additional context
n/a

koepalex added the bug label on Feb 20, 2023
@mregen

mregen commented Feb 20, 2023

This PR might be related: #385

@kate-goldenring
Contributor

Hi @koepalex! It is expected that the discovery handler will re-register with the agent if its connection with the agent is dropped (such as due to the agent restarting). @mregen points to a good example of how we do this in Akri's Discovery Handlers.
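For anyone else hitting this, a minimal sketch of that re-registration loop, assuming a tokio-based handler (`register_with_agent` stands in for the tonic-generated `RegisterDiscoveryHandler` client call, and the back-off interval is arbitrary):

```rust
use std::time::Duration;

// Stand-in for the tonic-generated Registration client call. A real
// handler would connect to the agent's registration socket, register,
// and then hold the connection open -- a drop means the agent restarted.
async fn register_with_agent() -> Result<(), Box<dyn std::error::Error>> {
    // Simulated connection drop so this sketch runs standalone.
    Err("agent connection closed".into())
}

#[tokio::main]
async fn main() {
    loop {
        if let Err(e) = register_with_agent().await {
            eprintln!("registration lost: {e}; re-registering shortly");
        }
        // Back off briefly and register again instead of exiting. Without
        // this loop, an agent restart leaves the handler unregistered and
        // it is never asked to discover again.
        tokio::time::sleep(Duration::from_secs(5)).await;
    }
}
```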

@kate-goldenring
Contributor

@koepalex while the behavior of the agent is expected, I agree that it isn't ideal for instances to be deleted as a result. With shared devices like OPC UA ones, there should be a 5 minute grace period during which the discovery handler could re-register.
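To make the grace-period idea concrete, a rough sketch of tracking offline instances instead of deleting them immediately (the types and the 5-minute window are illustrative, not the agent's actual code):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Illustrative grace period; not a value taken from the agent.
const GRACE_PERIOD: Duration = Duration::from_secs(5 * 60);

// Hypothetical tracker: instances are marked offline with a timestamp and
// only deleted once the grace period elapses, giving a restarted discovery
// handler time to re-register and rediscover them.
struct InstanceTracker {
    offline_since: HashMap<String, Instant>,
}

impl InstanceTracker {
    fn new() -> Self {
        Self { offline_since: HashMap::new() }
    }

    fn mark_offline(&mut self, instance: &str) {
        self.offline_since
            .entry(instance.to_string())
            .or_insert_with(Instant::now);
    }

    fn mark_online(&mut self, instance: &str) {
        // A re-registered handler rediscovered the device in time.
        self.offline_since.remove(instance);
    }

    // Instances whose grace period has expired and should be deleted now.
    fn expired(&self) -> Vec<String> {
        self.offline_since
            .iter()
            .filter(|(_, since)| since.elapsed() >= GRACE_PERIOD)
            .map(|(name, _)| name.clone())
            .collect()
    }
}
```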

@github-actions
Contributor

github-actions bot commented Jun 6, 2023

Issue has been automatically marked as stale due to inactivity for 90 days. Update the issue to remove label, otherwise it will be automatically closed.

@rpieczon

Any idea when this one will be fixed?

@github-actions
Contributor

Issue has been automatically marked as stale due to inactivity for 90 days. Update the issue to remove label, otherwise it will be automatically closed.

Labels: bug (Something isn't working), keep-alive
Project status: Investigating
Development: no branches or pull requests
4 participants