[nri-bundle] Error: couldn't find key cluster-id in Secret newrelic/pl-cluster-secrets #661

@luisdavim

Bug description

I'm trying to install the nri-bundle-3.3.0 chart using Terraform. Intermittently, the installation fails because one of the pods does not start within the wait time set for the Helm release.
I've set a Helm timeout of 900 seconds, and sometimes even that is not enough.

When I inspect the Pod that is failing to start, I see the following error in its events:

Error: couldn't find key cluster-id in Secret newrelic/pl-cluster-secrets
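
To confirm whether the key is really absent at a given moment, the Secret can be queried directly. The following is a minimal sketch, not part of the chart: the function name `secret_key_status` is hypothetical, and the namespace, Secret, and key names are taken from the error message above. It assumes `kubectl` is already configured for the affected cluster.

```shell
# Hypothetical helper: report whether a key exists in a Secret's .data map.
# Prints "present", "missing", or "lookup-failed" (kubectl itself failed,
# for example because the Secret does not exist or access was denied).
# Note: keys containing dots would need escaping in the jsonpath expression.
secret_key_status() {
  namespace="$1"
  secret="$2"
  key="$3"

  value="$(kubectl -n "$namespace" get secret "$secret" \
    -o "jsonpath={.data.$key}" 2>/dev/null)" || {
    echo "lookup-failed"
    return 2
  }

  if [ -n "$value" ]; then
    echo "present"
    return 0
  fi
  echo "missing"
  return 1
}
```

For the case in this report, the call would be `secret_key_status newrelic pl-cluster-secrets cluster-id`.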

If I wait long enough, it eventually works. One way to speed it up is to delete the failed Pod repeatedly until it starts, but I don't think that is viable: we're using Terraform to provision our clusters, which should be able to run unattended, and we end up wasting time because of this.

Version of Helm and Kubernetes

helm version
version.BuildInfo{Version:"v3.7.2", GitCommit:"663a896f4a815053445eec4153677ddc24a0a361", GitTreeState:"clean", GoVersion:"go1.16.10"}
kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:59:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.11-eks-f17b81", GitCommit:"f17b810c9e5a82200d28b6210b458497ddfcf31b", GitTreeState:"clean", BuildDate:"2021-10-15T21:46:21Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}

Which chart?

The chart is nri-bundle-3.3.0

What happened?

The Helm release fails because not all resources become ready within 900 seconds.

What you expected to happen?

I would expect the deployment to succeed. This looks like a race condition in which a Secret (pl-cluster-secrets) is created or updated only after the Pod that needs it in order to start, so I'd expect that Secret to be ready before the Deployment is created.
I would also expect 900 seconds to be enough time for any Helm release.
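
As an illustration of the ordering described above, an external check could poll until the key appears in the Secret. This is only a sketch of that idea under stated assumptions, not a fix for the chart: the function name `wait_for_secret_key` and its timeout and interval arguments are hypothetical, and it assumes `kubectl` is configured for the cluster.

```shell
# Hypothetical helper: poll until a key appears (non-empty) in a Secret.
# Arguments: namespace, secret name, key, timeout in seconds (default 300),
# poll interval in seconds (default 5; must be greater than 0).
# Returns 0 once the key is present, 1 if the timeout elapses first.
wait_for_secret_key() {
  namespace="$1"
  secret="$2"
  key="$3"
  timeout="${4:-300}"
  interval="${5:-5}"
  elapsed=0

  while [ "$elapsed" -lt "$timeout" ]; do
    # A missing Secret makes kubectl fail; treat that the same as a missing key.
    value="$(kubectl -n "$namespace" get secret "$secret" \
      -o "jsonpath={.data.$key}" 2>/dev/null || true)"
    if [ -n "$value" ]; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}
```

A call such as `wait_for_secret_key newrelic pl-cluster-secrets cluster-id 900 10` would wait up to the same 900 seconds used for the Helm release, checking every 10 seconds. It only observes the Secret; it does not change the order in which the chart creates its resources.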

How to reproduce it?

Run a normal helm install as described in the README. These are the values I'm using:

global:
  cluster: ${clusterName}
  licenseKey: ${newRelicLicenseKey}
  lowDataMode: true
kubeEvents:
  enabled: true
webhook:
  enabled: true
prometheus:
  enabled: true
logging:
  enabled: true
ksm:
  enabled: false
newrelic-infrastructure:
  privileged: true
newrelic-pixie:
  apiKey: ${pixieApiKey}
  enabled: true
pixie-chart:
  clusterName: ${clusterName}
  deployKey: ${pixieChartKey}
  enabled: true

This seems to be similar to #539
