
ArgoCD Vault Plugin loses connection to Vault #614

Open
rliskunov opened this issue Mar 11, 2024 · 2 comments

Comments

@rliskunov

Describe the bug
Periodically, the plugin loses its connection to Vault: after configuration the plugin works correctly, but after 15-20 minutes the connection is lost. A Hard Refresh of the app does not help. However, if you restart both argocd-repo-server and argocd-redis, everything works again. Restarting only one of them does not solve the problem.

I use Multitenancy with Kubernetes Authentication

To Reproduce

If you want to reproduce this, you will need the following:

  1. Install Vault in a Kubernetes cluster
  2. Enable the Kubernetes auth method in Vault
  3. Add a policy to Vault - argocd-policy:
path "secret/data/application/*" {
  capabilities = ["read"]
}
  4. Add a role to Vault - argocd-role, specifying the parameters:
Bound service account names - argocd-repo-server
Bound service account namespaces - argocd
Generated Token's Policies - argocd-policy
  5. Add a secret to Kubernetes via values.yaml of the ArgoCD Helm chart (a CLI sketch of steps 2-4 follows this list):
extraObjects:
  - apiVersion: v1
    kind: Secret
    type: Opaque
    metadata:
      name: argo-vault-secret
      namespace: argocd
    stringData:
      VAULT_ADDR: http://vault.vault.svc.cluster.local:8200
      AVP_TYPE: vault
      AVP_AUTH_TYPE: k8s
      AVP_K8S_ROLE: argocd-role
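
For reference, a minimal CLI sketch of steps 2-4 (role, policy, and ServiceAccount names are taken from this issue; the Kubernetes host and the policy file name are assumptions):

# Enable the Kubernetes auth method and point it at the cluster API
vault auth enable kubernetes
vault write auth/kubernetes/config \
    kubernetes_host="https://kubernetes.default.svc:443"

# Write the read-only policy shown above, saved locally as argocd-policy.hcl
vault policy write argocd-policy argocd-policy.hcl

# Create the role bound to the argocd-repo-server ServiceAccount in the argocd namespace
vault write auth/kubernetes/role/argocd-role \
    bound_service_account_names=argocd-repo-server \
    bound_service_account_namespaces=argocd \
    policies=argocd-policy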

Expected behavior

Once the connection to Vault is configured for an application, it should keep working stably.

Screenshots/Verbose output

Example of output

"helm template ... | argocd-vault-plugin generate -s argo-vault-secret -" failed exit status 1:
Error: Replace: could not replace all placeholders in Template: 
Error making API request. 
URL: GET http://vault.vault.svc.cluster.local:8200/v1/secret/data/application Code: 403. 
Errors: * 1 error occurred: * permission denied 
Error making API request. 

Additional context
If you don't use Multitenancy but instead use the most permissive policy possible, the connection is stable:

path "secret/data/*" {
  capabilities = ["read"]
}
@rliskunov (Author)

In general, it seems the problem is not a timeout but the ServiceAccount.

Let's say we have two applications: api and worker

A secret that grants access to Vault is generated for each of them. Example for api:

- apiVersion: v1
  kind: Secret
  type: Opaque
  metadata:
    name: argo-vault-api
    namespace: argocd
  stringData:
    VAULT_ADDR: http://vault.vault.svc.cluster.local:8200
    AVP_TYPE: vault
    AVP_AUTH_TYPE: k8s
    AVP_K8S_ROLE: argocd-api

The argocd-api role is created in Vault with the parameters:

Bound service account names - argocd-repo-server
Bound service account namespaces - argocd
Generated Token's Policies - api
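
In CLI form, the per-application roles described here would look roughly like this (the argocd-worker role and the api/worker policy names are assumptions mirroring the setup above); note that both roles are bound to the same argocd-repo-server ServiceAccount:

# Role used by the api application
vault write auth/kubernetes/role/argocd-api \
    bound_service_account_names=argocd-repo-server \
    bound_service_account_namespaces=argocd \
    policies=api

# Role used by the worker application (assumed to mirror the api role)
vault write auth/kubernetes/role/argocd-worker \
    bound_service_account_names=argocd-repo-server \
    bound_service_account_namespaces=argocd \
    policies=worker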

The argocd-repo-server pod uses the argocd-repo-server ServiceAccount. When we do a Hard Refresh in ArgoCD for api, it's as if the argocd-repo-server ServiceAccount latches onto the argo-vault-api secret and loses the Vault connection for argo-vault-worker.
If we restart the argocd-repo-server pod and do a Hard Refresh for worker, then we lose api instead.

That is why we did not encounter this problem when we used a universal role with access to all secrets.
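
The behaviour can be illustrated by hand along these lines (a sketch only; the secret paths and the SA_JWT variable holding the repo-server ServiceAccount token are assumptions): a token issued via the argocd-api role carries only the api policy, so reading a worker path fails with the same 403 as above.

# Log in through the argocd-api role with the repo-server ServiceAccount token
TOKEN=$(vault write -field=token auth/kubernetes/login \
    role=argocd-api jwt="$SA_JWT")

# Reading an api path works...
VAULT_TOKEN=$TOKEN vault kv get secret/application/api

# ...but a worker path is rejected with 403 permission denied, which matches
# what this thread describes when a token obtained for one role is reused for another app
VAULT_TOKEN=$TOKEN vault kv get secret/application/worker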

@max-veit-nc

We are seeing a similar issue, as we have a similar setup.

We have done some troubleshooting inside the avp-helm sidecar container (in our case) that we run as part of the repo-server. It seems to us that when different AppRoles are used within the same sidecar, there is an issue with the token caching.

The concept is briefly discussed here: https://argocd-vault-plugin.readthedocs.io/en/stable/usage/#caching-the-hashicorp-vault-token

We believe there is a race condition: whoever comes first to refresh a token (the default lifetime is 20 minutes) gets to execute. Running two repo-server instances, and therefore two sidecars, at the same time adds a bit of extra randomness.

This is further supported by the fact that this never happens for our second avp sidecar, which always uses the same secret, and that we can always reproduce it by running a hard refresh for all of our applications (we are using 10+ different AppRoles).
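
One way to check this from inside the sidecar is to inspect the token the plugin is currently using; a minimal sketch, assuming VAULT_ADDR and VAULT_TOKEN are set to the plugin's address and its currently cached token (extracting that token from the sidecar is not covered here), with jq used only for readability:

# Which policies does the cached token carry? A token minted for a different
# AppRole lists that role's policies, which would explain the 403 on other apps
vault token lookup -format=json | jq -r '.data.policies[]'

# How close is the token to the ~20 minute refresh window?
vault token lookup -format=json | jq -r '.data.ttl'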
