Charm is in maintenance but does not recover #162

Closed
beliaev-maksim opened this issue May 28, 2024 · 20 comments
Labels
bug Something isn't working

Comments

@beliaev-maksim
Member

Bug Description

I have the charm deployed, and sometimes it goes into an unrecoverable maintenance state. I assume there might be some issue with LEGO on our side, and it feels like the charm does not retry, which leads to my workload switching to the K8s self-signed certs.

prod-cla-checker@enterprise-engineering-bastion-ps6:~$ juju status
Model             Controller    Cloud/Region              Version  SLA          Timestamp
prod-cla-checker  prodstack-is  k8s-prod-general/default  3.1.8    unsupported  11:54:29Z

App                       Version  Status   Scale  Charm                     Channel  Rev  Address       Exposed  Message
charmed-cla-checker                active       1  charmed-cla-checker       edge       1  10.87.244.2   no       
httprequest-lego-k8s               waiting      1  httprequest-lego-k8s      stable    40  10.87.26.217  no       waiting for units to settle down
nginx-ingress-integrator  24.2.0   active       1  nginx-ingress-integrator  stable    95  10.87.48.165  no       Ingress IP(s): 10.141.14.128

Unit                         Workload     Agent  Address          Ports  Message
charmed-cla-checker/0*       active       idle   192.168.100.249         
httprequest-lego-k8s/0*      maintenance  idle   192.168.102.41          
nginx-ingress-integrator/0*  active       idle   192.168.103.29          Ingress IP(s): 10.141.14.128

then I have to run

  juju remove-unit httprequest-lego-k8s --num-units 1
  juju add-unit httprequest-lego-k8s
  juju status

to recover

can this be fixed?

To Reproduce

Environment

Relevant log output

-

Additional context

No response

beliaev-maksim added the bug label May 28, 2024
@ghislainbourgeois
Collaborator

Would you be able to provide the juju debug-log for this unit? One thing that could be happening: we run lego as a separate process and wait for it to complete, so maybe the timeout mechanism is broken.
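
For context, here is a minimal sketch of the "run as a separate process and wait, with a timeout" pattern, using only Python's standard `subprocess` module. It is not the charm's actual code (a Kubernetes charm would more likely execute the command inside the workload container), and `run_with_timeout` and its parameters are hypothetical names used only for illustration.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and wait for it; return (finished, returncode).

    Hypothetical helper, not taken from the charm. If the command does not
    finish within timeout_s seconds, subprocess.run kills the child process
    and raises TimeoutExpired, which is reported here as (False, None).
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, None
    return True, result.returncode
```

For example, `run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], 0.5)` gives up after half a second instead of waiting forever. If the timeout is never enforced, the caller can stay blocked, which is one way a unit could appear stuck.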

@beliaev-maksim
Member Author

@ghislainbourgeois prodstack cannot extract per-app logs; they are empty.

I can just run juju debug-log, but that is pretty much useless.

@ghislainbourgeois
Collaborator

@beliaev-maksim in the debug-log, can you see the events that the unit received? I am mostly interested in the history of events before it went into that state.

@beliaev-maksim
Member Author

In the meantime, let me update to the latest revision.

But if you can look in parallel at what could have happened, that would be great.

@beliaev-maksim
Member Author

beliaev-maksim commented May 30, 2024

debuglog.txt

@ghislainbourgeois if you can find something

@gruyaume
Contributor

gruyaume commented May 30, 2024

I'm pretty sure this issue was fixed when we moved to using the collect status event handler. In other words, if you refresh the charm you should be good to go.

@beliaev-maksim
Member Author

@ghislainbourgeois @gruyaume
now it is even worse. Now I see all the charms active, but there is no certificate

that is a UX disaster...

@gruyaume
Contributor

Yes, the charm won't show up as error/blocked if it has not provided a certificate for a request. We are planning to add a field to the status message showing the number of certificate requests fulfilled (see #154), but the charm status itself will remain Active, as the charm is functioning correctly.

Status     Message
Active     "1/3 certificate requests fulfilled"
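
As an illustration only, a message of that shape could be produced by a small helper like the one below. The exact wording and the name `status_message` are assumptions; the real format is whatever gets decided in #154.

```python
def status_message(fulfilled, total):
    """Build a status message such as "1/3 certificate requests fulfilled".

    Hypothetical helper: the name and the exact wording are assumptions.
    """
    if total < 0 or not 0 <= fulfilled <= total:
        raise ValueError("expected 0 <= fulfilled <= total")
    return f"{fulfilled}/{total} certificate requests fulfilled"
```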

@beliaev-maksim
Member Author

@gruyaume what could be done for the charm to re-request the certs?

I do not want to scale up/down every day

@gruyaume
Contributor

This is already done on update status events: every 5 min (or whatever update status interval is set on the model), the charm looks at the outstanding certificate requests and re-requests them.
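
A minimal sketch of that retry behaviour, in plain Python rather than the charm's real code: `CertificateRequest`, `retry_outstanding` and the `issue` callback are hypothetical names, and `issue` stands in for whatever actually requests the certificate.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CertificateRequest:
    """Hypothetical record of one request from a TLS requirer."""

    csr: str
    certificate: Optional[str] = None  # None means still outstanding


def retry_outstanding(
    requests: List[CertificateRequest],
    issue: Callable[[str], Optional[str]],
) -> int:
    """Call once per periodic tick; return how many requests are fulfilled.

    issue(csr) returns the certificate, or None if the attempt failed and
    should be tried again on the next tick.
    """
    for request in requests:
        if request.certificate is None:
            request.certificate = issue(request.csr)
    return sum(1 for r in requests if r.certificate is not None)
```

Requests that already have a certificate are skipped, so a failed attempt is simply picked up again on the next tick without disturbing the fulfilled ones.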

@ghislainbourgeois
Collaborator

From what I investigated yesterday, the current version does not set the status to maintenance at all, so the previous issue should not recur.

I think we can definitely improve the logging and what we set in the status.

We also have some plans to get rid of the workload completely, making this charm k8s or machine agnostic. That will also help us get more control over the certificate request process.

@gruyaume
Contributor

I'm going to close this as the original issue was addressed. The charm status message item is tracked through issue #154

@beliaev-maksim
Member Author

latest deployment

$ juju status
Model             Controller    Cloud/Region              Version  SLA          Timestamp
prod-cla-checker  prodstack-is  k8s-prod-general/default  3.1.8    unsupported  08:40:58Z

App                       Version  Status  Scale  Charm                     Channel  Rev  Address       Exposed  Message
charmed-cla-checker                active      1  charmed-cla-checker       edge       1  10.87.244.2   no       
httprequest-lego-k8s               active      1  httprequest-lego-k8s      stable    83  10.87.26.217  no       
nginx-ingress-integrator  24.2.0   active      1  nginx-ingress-integrator  stable    95  10.87.48.165  no       Ingress IP(s): 10.141.14.128

Unit                         Workload  Agent  Address          Ports  Message
charmed-cla-checker/0*       active    idle   192.168.100.249         
httprequest-lego-k8s/0*      active    idle   192.168.102.43          
nginx-ingress-integrator/0*  active    idle   192.168.103.29          Ingress IP(s): 10.141.14.128

@gruyaume
Contributor

gruyaume commented Jun 5, 2024

Reopening based on feedback from @beliaev-maksim

gruyaume reopened this Jun 5, 2024
@gruyaume
Contributor

gruyaume commented Jun 5, 2024

@beliaev-maksim can you please include more information about what the problem actually is? You mentioned having to scale the charm up/down, but that's a workaround for a problem. What is the problem?

Also, can you please provide the following information:

  • Debug Logs
  • Relation data between httprequest and the tls requirer (using jhack)

@beliaev-maksim
Copy link
Member Author

@gruyaume my workload requires TLS on the connection. I use a combination of LEGO and nginx to provide it.

From the juju status output, all the workloads look green and active. However, after some time we start to see an issue in production where requests fail due to self-signed certificates.

I assume something gets corrupted in LEGO and my workload switches to the Kubernetes self-signed certs.

To recover proper certs, I have to scale the LEGO charm down and up. That resolves the issue immediately.

You can find the debug logs in the comment above: #162 (comment)

I cannot use jhack. That is ProdStack; I do not have sudo access to install external tools.

@gruyaume
Contributor

gruyaume commented Jun 5, 2024

What is the workload that "switches to k8s self signed certs"? Could the issue be in that charm?

@beliaev-maksim
Member Author

@gruyaume I think it is nginx

@mthaddon any idea?

@beliaev-maksim
Member Author

beliaev-maksim commented Jun 10, 2024

looks like there were a bunch of TLS issues on nginx

canonical/nginx-ingress-integrator-operator#137
canonical/nginx-ingress-integrator-operator#138
canonical/nginx-ingress-integrator-operator#140

@beliaev-maksim
Member Author

Closing the issue for now; I will reopen it if I observe certificate issues again.
