oidc charm fails to replan after config public-url #126

Closed
orfeas-k opened this issue Jan 17, 2024 · 10 comments · Fixed by #128
Labels
bug Something isn't working

Comments

@orfeas-k
Contributor

orfeas-k commented Jan 17, 2024

Bug Description

  • Unit was in Blocked: public-url config required
  • Ran juju config oidc-gatekeeper public-url=http://10.64.140.43.nip.io
  • Unit went in Blocked: Failed to replan

Logs

aks-juju-debug-log.txt
aks-oidc-container-logs.txt
aks-oidc-pod-yaml.txt

ec2-juju-debug-log.txt
ec2-oidc-container-logs.txt
ec2-oidc-pod-yaml.txt

To Reproduce

With Juju 3.1 and MicroK8s 1.26, deploy this minimal Kubeflow bundle: bundle-yaml.txt.
Wait until most charms are green, then run:

juju config dex-auth public-url=http://10.64.140.43.nip.io
juju config oidc-gatekeeper public-url=http://10.64.140.43.nip.io

IIUC, this is not caught by the charm's integration tests because we pass a dummy public_url and somehow the charm goes to Active.

Container logs in the integration tests:
╰─$ kl -ntest-charm-8l1z oidc-gatekeeper-0 -c oidc-authservice    
2024-01-17T16:00:49.099Z [pebble] HTTP API server listening on ":38813".
2024-01-17T16:00:49.099Z [pebble] Started daemon.
2024-01-17T16:01:14.670Z [pebble] GET /v1/plan?format=yaml 70.649µs 200
2024-01-17T16:01:14.672Z [pebble] POST /v1/layers 199.264µs 200
2024-01-17T16:01:14.680Z [pebble] POST /v1/services 3.526964ms 202
2024-01-17T16:01:14.683Z [pebble] Service "oidc-authservice" starting: /home/authservice/oidc-authservice
2024-01-17T16:01:14.697Z [oidc-authservice] time="2024-01-17T16:01:14Z" level=info msg="Config: &{ProviderURL:http://test-url/dex ClientID:authservice-oidc ClientSecret:XUCX52JVWCEAUEFUMIN055XEZMG3NX OIDCAuthURL: RedirectURL:/authservice/oidc/callback OIDCScopes:[openid profile email groups] StrictSessionValidation:false OIDCStateStorePath:oidc_state.db AuthserviceURLPrefix:/authservice/ SkipAuthURLs:[/authservice/ /dex/] AuthHeader:Authorization Audiences:[istio-ingressgateway.istio-system.svc.cluster.local] HomepageURL:/authservice/site/homepage AfterLoginURL: AfterLogoutURL:test-url UserIDHeader:kubeflow-userid GroupsHeader:kubeflow-groups UserIDPrefix: UserIDTransformer:{rules:[]} UserIDClaim:email UserIDTokenHeader: GroupsClaim:groups IDTokenHeader:Authorization Hostname: Port:8080 WebServerPort:8082 ReadinessProbePort:8081 CABundlePath: SessionStorePath:bolt.db SessionMaxAge:86400 SessionSameSite:Lax ClientName:AuthService ThemesURL:themes Theme:kubeflow TemplatePath:[web/templates/default] UserTemplateContext:map[] GroupsAllowlist:[*]}"
2024-01-17T16:01:14.697Z [oidc-authservice] time="2024-01-17T16:01:14Z" level=info msg="Starting readiness probe at 8081"
2024-01-17T16:01:14.697Z [oidc-authservice] time="2024-01-17T16:01:14Z" level=info msg="Starting server at :8080"
2024-01-17T16:01:14.697Z [oidc-authservice] time="2024-01-17T16:01:14Z" level=info msg="Starting web server at :8082"
2024-01-17T16:01:14.850Z [oidc-authservice] time="2024-01-17T16:01:14Z" level=error msg="OIDC provider setup failed, retrying in 10 seconds: Get \"http://test-url/dex/.well-known/openid-configuration\": dial tcp: lookup test-url on 10.152.183.10:53: no such host"
2024-01-17T16:01:15.687Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.006314378s 200
2024-01-17T16:01:16.231Z [pebble] GET /v1/plan?format=yaml 122.797µs 200
2024-01-17T16:01:16.234Z [pebble] POST /v1/layers 148.084µs 200
2024-01-17T16:01:16.241Z [pebble] POST /v1/services 3.601199ms 202
2024-01-17T16:01:16.252Z [pebble] GET /v1/changes/2/wait?timeout=4.000s 10.629602ms 200
2024-01-17T16:01:24.853Z [oidc-authservice] time="2024-01-17T16:01:24Z" level=error msg="OIDC provider setup failed, retrying in 10 seconds: Get \"http://test-url/dex/.well-known/openid-configuration\": dial tcp: lookup test-url on 10.152.183.10:53: no such host"
2024-01-17T16:01:34.854Z [oidc-authservice] time="2024-01-17T16:01:34Z" level=error msg="OIDC provider setup failed, retrying in 10 seconds: Get \"http://test-url/dex/.well-known/openid-configuration\": dial tcp: lookup test-url on 10.152.183.10:53: no such host"
2024-01-17T16:01:45.023Z [oidc-authservice] time="2024-01-17T16:01:45Z" level=error msg="OIDC provider setup failed, retrying in 10 seconds: Get \"http://test-url/dex/.well-known/openid-configuration\": dial tcp: lookup test-url on 10.152.183.10:53: no such host"

Environment

Both on an EC2 instance and an AKS cluster.

EC2:

  • MicroK8s v1.26.11 revision 6237
  • Juju 3.1.7-genericlinux-amd64

AKS:

  • Juju 3.1.6-ubuntu-amd64
  • AKS 1.26

Relevant Log Output

<>

Additional Context

No response

@orfeas-k orfeas-k added the bug label Jan 17, 2024

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5205.

This message was autogenerated

@orfeas-k
Contributor Author

orfeas-k commented Jan 19, 2024

We haven't yet gotten to the bottom of this. Summarizing the debugging and findings so far (some of the original discussion is here):

Debugging

  • Deploying with public-url already configured also results in the Blocked (Failed to replan) situation:

    juju deploy oidc-gatekeeper --channel latest/edge --trust --config public-url=http://10.64.140.43.nip.io
    
  • Execing into the failing container, we see that the directory actually exists:

    root@oidc-gatekeeper-0:~# ls /home/authservice/web/templates/default/
    after_logout.html  footer.html  header.html  homepage.html

    contrary to what the container logs report:

    2024-01-17T14:12:36.386Z [oidc-authservice] time="2024-01-17T14:12:36Z" level=fatal msg="open web/templates/default: no such file or directory"
  • Adding a working-dir field to the Pebble layer configuration, as done here, fixes the issue (see the sketch after this list).

  • Reordering main and moving _check_public_url() just before update_layer() (as done here) also fixes the issue.

  • Indeed, as mentioned in this issue's description, the integration tests pass even while the workload is failing; this is also described in Implement pebble health checks #82 (comment).
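
For reference, a minimal sketch (not the charm's actual code) of what the working-dir mitigation from #128 looks like in a Pebble layer, assuming an ops/Pebble release that supports the working-dir service field (Pebble v1.3.0+); the service name and paths follow the logs above, everything else is illustrative:

from ops.pebble import Layer

# Hypothetical layer definition; the real charm builds its layer elsewhere.
OIDC_LAYER = Layer(
    {
        "summary": "oidc-authservice layer",
        "services": {
            "oidc-authservice": {
                "override": "replace",
                "summary": "OIDC AuthService",
                "command": "/home/authservice/oidc-authservice",
                "startup": "enabled",
                # Run from the same directory as the upstream Dockerfile's WORKDIR,
                # so the relative TemplatePath web/templates/default resolves.
                "working-dir": "/home/authservice",
            }
        },
    }
)

# Inside the charm, roughly:
#   container = self.unit.get_container("oidc-authservice")
#   container.add_layer("oidc-authservice", OIDC_LAYER, combine=True)
#   container.replan()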

Possible root causes

Pebble

Since adding working-dir fixed the issue, this could point to a change in Pebble's behaviour. However, such a change would probably have broken more charms, not just this one, so I don't think this theory really stands. Looking at the release changelogs, there was a related change in Pebble released on 4 September 2023 in v1.3.0:

feat: add support for services to configure working directory

However, looking at the PR, I don't see any changes to Pebble's default behaviour.

Fixing "Waiting for pod startup to complete"

One guess is that this error surfaced after we fixed #112 in #118, meaning it didn't occur before because we never handled pebble_ready events. Keep in mind, though, that #112 was intermittent, while Failed to replan is consistent. I tried removing the line we added in #118 (self.on.oidc_authservice_pebble_ready) and rebuilt the charm. The charm went to Active, but the workload didn't operate; instead, it logged repeatedly:

2024-01-19T09:00:57.503Z [oidc-authservice] time="2024-01-19T09:00:57Z" level=error msg="OIDC provider setup failed, retrying in 10 seconds: Get \"http://10.64.140.43.nip.io/dex/.well-known/openid-configuration\": dial tcp 10.64.140.43:80: connect: connection refused"
2024-01-19T09:00:58.370Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.008170238s 200
2024-01-19T09:00:59.220Z [pebble] GET /v1/plan?format=yaml 197.978µs 200
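
For context, a hedged sketch of the event wiring being discussed; the pebble-ready observer line is the one mentioned above from #118, while the class and handler names are assumed, not copied from the charm:

from ops.charm import CharmBase
from ops.main import main


class OIDCGatekeeperCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        # Handle config changes (e.g. setting public-url).
        self.framework.observe(self.on.config_changed, self._on_event)
        # Observer added in #118; removing it lets the charm go Active while
        # the workload keeps retrying, as shown in the logs above.
        self.framework.observe(self.on.oidc_authservice_pebble_ready, self._on_event)

    def _on_event(self, event):
        # Check public-url, update the Pebble layer, and replan.
        ...


if __name__ == "__main__":
    main(OIDCGatekeeperCharm)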

@orfeas-k
Contributor Author

Our understanding of how the charm works at the moment

(quoting @ca-scribner on this)

  1. public-url must be set. If it is not set, the charm reports as Blocked before the workload is started. Once it is set, the workload will be started.

  2. public-url must be a valid URL. If it is not, we hit the issue in this comment, where the workload retries every 10 seconds to reach that URL. oidc-authservice is not healthy, but it is also not crashing in a way Pebble alerts us about (we have no health checks on the workload, so we can't notice an unhealthy-but-running workload; see the health-check sketch after this list).

  3. If public-url is valid, then oidc-authservice tries to find the web resources and cannot. That's when we get the error message about not finding web/...
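
A hedged sketch of the kind of Pebble health check proposed in #82, which would surface the unhealthy-but-running state described in point 2; the port matches the readiness-probe port (8081) from the config log above, the rest is assumed:

# Pebble checks layer (Python dict form), added alongside the service layer.
CHECKS_LAYER = {
    "summary": "oidc-authservice health checks",
    "checks": {
        "oidc-ready": {
            "override": "replace",
            "level": "ready",
            "period": "10s",
            "threshold": 3,
            "http": {"url": "http://localhost:8081/"},
        }
    },
}

# In the charm, roughly:
#   container.add_layer("oidc-checks", CHECKS_LAYER, combine=True)
#   container.replan()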

@ca-scribner
Contributor

This is mitigated by #128, but we probably can't say #128 fixes it unless we better understand why the issue happened.

@DnPlas
Contributor

DnPlas commented Jan 19, 2024

I am not able to reproduce this issue in either 1.8/stable or latest/edge. Here are my steps and the relevant information:

Charmed Kubeflow 1.8/stable

Env:

  • microk8s 1.25/stable
  • juju 3.1.7 (both controller and client)
  • microk8s on an AWS VM with 200GB of storage, 32G RAM

Steps:

  1. Deploy the bundle: juju deploy kubeflow --channel 1.8/stable --trust
  2. Wait for the oidc-gatekeeper charm to go to BlockedStatus with message Blocked: public-url config required
  3. Configure the charm juju config oidc-gatekeeper public-url=http://10.64.140.43.nip.io

Logs:

  1. The charm goes to active and idle after setting the value of public-url:
Model     Controller  Cloud/Region        Version  SLA          Timestamp
kubeflow  uk8s        microk8s/localhost  3.1.7    unsupported  17:22:57Z

App              Version  Status  Scale  Charm            Channel         Rev  Address         Exposed  Message
oidc-gatekeeper           active      1  oidc-gatekeeper  ckf-1.8/stable  294  10.152.183.215  no

Unit                Workload  Agent  Address      Ports  Message
oidc-gatekeeper/0*  active    idle   10.1.60.169
  2. Logs from the container

Deploying charms from latest/edge

  1. Same env as before
  2. Deployed oidc-gatekeeper and dex-auth from latest/edge
  3. Relate oidc-gatekeeper to dex-auth and istio-pilot
  4. Wait for the oidc-gatekeeper charm to go to BlockedStatus with message Blocked: public-url config required
  5. Configure the charm juju config oidc-gatekeeper public-url=http://10.64.140.43.nip.io

Logs:

  • From a juju debug-code session, we can see that the charm updates the Pebble layer and then goes to active
  • No error status:
Model     Controller  Cloud/Region        Version  SLA          Timestamp
kubeflow  uk8s        microk8s/localhost  3.1.7    unsupported  17:43:24Z

App              Version  Status  Scale  Charm            Channel      Rev  Address        Exposed  Message
oidc-gatekeeper           active      1  oidc-gatekeeper  latest/edge  329  10.152.183.84  no

Unit                Workload  Agent  Address      Ports  Message
oidc-gatekeeper/0*  active    idle   10.1.60.140

@DnPlas
Contributor

DnPlas commented Jan 19, 2024

UPDATE: I am able to reproduce this issue when deploying oidc-gatekeeper from revision 312, which was published before #128 was merged (since #128 helps fix the issue).

I essentially deployed oidc-gatekeeper from latest/edge rev 312, related it to dex-auth and istio-pilot, then waited for the unit to go to BlockedStatus waiting for the public-url config. I then configured public-url and the unit went to Blocked: Failed to replan. Opening a debug session and trying to manually replan yields the following:

(Pdb) self._container.add_layer(self._container.name, self._oidc_layer.services, combine=True)
(Pdb) self._container.replan()
*** ops.pebble.ChangeError: cannot perform the following tasks:
- Start service "oidc-authservice" (cannot start service: exited quickly with code 1)
----- Logs from task 0 -----
2024-01-19T17:53:14Z INFO Most recent service output:
    time="2024-01-19T17:53:14Z" level=info msg="Config: &{ProviderURL:http://10.64.140.43.nip.io/dex ClientID:authservice-oidc ClientSecret:QU36Q0D91861PT09MN6KCLRIK6AR88 OIDCAuthURL: RedirectURL:/authservice/oidc/callback OIDCScopes:[openid profile email groups] StrictSessionValidation:false OIDCStateStorePath:oidc_state.db AuthserviceURLPrefix:/authservice/ SkipAuthURLs:[/authservice/ /dex/] AuthHeader:Authorization Audiences:[istio-ingressgateway.istio-system.svc.cluster.local] HomepageURL:/authservice/site/homepage AfterLoginURL: AfterLogoutURL:http://10.64.140.43.nip.io UserIDHeader:kubeflow-userid GroupsHeader:kubeflow-groups UserIDPrefix: UserIDTransformer:{rules:[]} UserIDClaim:email UserIDTokenHeader: GroupsClaim:groups IDTokenHeader:Authorization Hostname: Port:8080 WebServerPort:8082 ReadinessProbePort:8081 CABundlePath: SessionStorePath:bolt.db SessionMaxAge:86400 SessionSameSite:Lax ClientName:AuthService ThemesURL:themes Theme:kubeflow TemplatePath:[web/templates/default] UserTemplateContext:map[] GroupsAllowlist:[*]}"
    time="2024-01-19T17:53:14Z" level=info msg="Starting readiness probe at 8081"
    time="2024-01-19T17:53:14Z" level=info msg="Starting server at :8080"
    time="2024-01-19T17:53:14Z" level=info msg="Starting web server at :8082"
    time="2024-01-19T17:53:14Z" level=fatal msg="open web/templates/default: no such file or directory"
2024-01-19T17:53:14Z ERROR cannot start service: exited quickly with code 1
-----

which is the error already reported above.

You can also find the container logs here.

@DnPlas
Contributor

DnPlas commented Jan 19, 2024

I deployed oidc-gatekeeper from channel ckf-1.8/stable and exec'd into the oidc-authservice container, where the working dir defaults to /home/authservice. In contrast, in newer versions of the charm the container's working dir defaults to root, which does not contain the directory and files the service is looking for.

Executing into the container (ckf-1.8/stable)

ubuntu@ip-172-31-15-25:~$ kubectl exec -ti oidc-gatekeeper-0 -coidc-authservice -nkubeflow -- /bin/sh
/home/authservice #

It is not yet clear to me what introduced the change, but it seems like the containers from latest/edge rev 312 and ckf-1.8/stable are different from the ones in latest/edge. In fact, the "older" ones only have /bin/sh while the newer containers have /bin/bash. The only change I see between a working charm and the broken one is this, which changes the image registry. I am not sure if the published image is different or has changed in the months since we released.

I think, though, that #128 is the right fix for this issue. Thoughts @ca-scribner @orfeas-k ?

@orfeas-k
Contributor Author

From the upstream Dockerfile, we can confirm that the executable is expected to be run from the working-dir added in #128.

ENV APP_HOME=/home/$USER
WORKDIR $APP_HOME

# Copy in binary and give permissions
COPY --from=builder /go/bin/oidc-authservice $APP_HOME
COPY web $APP_HOME/web
RUN chmod +x $APP_HOME/oidc-authservice
RUN chown -R $USER:$GROUP $APP_HOME

USER $USER

ENTRYPOINT [ "./oidc-authservice" ]

The question that's left unanswered is why this surfaced now and not earlier.

In fact the "older ones" only have /bin/sh while the newer containers have /bin/bash. The only change I see from a working charm to the broken one is this, which changes the image registry. Not sure if the published image is different or has changed in the past months after we released.

Comparing the two images from the kubeflowmanifests repo and the arrikto repo, they both have the same digest (SHA), which IIUC means they are the exact same image, just pushed to another repo. This doesn't explain what @DnPlas mentioned about /bin/sh and /bin/bash.

@ca-scribner
Contributor

The only thing I can think of is that something changed in Pebble's behaviour, but I don't see why that would be the case either.

Do we just close this in light of #128 at least fixing things most of the time? I don't like accepting "we don't know" as the resolution here, but I also don't have a better idea.

@orfeas-k
Contributor Author

Closing then, since we cannot reproduce it with the current charms. Let's re-open in case we stumble on this again (although we shouldn't).
