Integration test for reinitialize-pods #309

Closed
wants to merge 5 commits into from

@alpeb alpeb commented Dec 11, 2023

(Note this will fail until linkerd/linkerd2#11699 lands)

The integration-cni-plugin.yml workflow (formerly known as cni-plugin-integration.yml) has been expanded to run the new recipe reinitialize-pods-integration, which performs the following steps:

  • Rebuilds the linkerd-reinitialize-pods crate and cni-plugin. The Dockerfile-cni-plugin file has been refactored to have two main targets, runtime and runtime-test; the latter picks up the linkerd-reinitialize-pods binary that has just been built locally.
  • Creates a new cluster at version v1.27.6-k3s1 (version required for Calico to work)
  • Triggers a new ./reinitialize-pods/integration/run.sh script which:
    • Installs Calico
    • Installs the latest linkerd-edge CLI
    • Installs linkerd-cni and waits for it to become ready
    • Installs the linkerd control plane in CNI mode
    • Installs a pause DaemonSet

The linkerd-cni instance is configured with an extra initContainer that delays its start by 15s. Because the script waits for linkerd-cni to become ready, this delay doesn't affect the initial install. A new node is then added to the cluster, and the delay lets the new pause DaemonSet replica start before the full CNI config is ready, so we can observe its failure to come up. Once the new linkerd-cni replica becomes ready, we observe the failed pause replica being replaced by a new healthy one.
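
The condition the test is looking for can be illustrated with a small, self-contained sketch. This is not the logic in `run.sh`; the `Observation` type and the `replaced_by_healthy` function are hypothetical names introduced here, and the sketch assumes the test can sample the pause pod on the new node as a sequence of (pod UID, readiness) pairs.

```rust
/// Hypothetical sample of the pause pod seen on the new node at one point in
/// time. Neither the type nor its fields come from the PR.
#[derive(Debug, Clone, PartialEq)]
pub struct Observation {
    pub pod_uid: String,
    pub ready: bool,
}

/// Returns true when the sequence contains an unready pod that is later
/// followed by a ready pod with a different UID, i.e. the failed replica was
/// replaced rather than recovering in place.
pub fn replaced_by_healthy(observations: &[Observation]) -> bool {
    // UID of the first replica observed as not ready, if any.
    let mut failed_uid: Option<&str> = None;
    for obs in observations {
        if let Some(uid) = failed_uid {
            // A ready pod with a new identity means a replacement happened.
            if obs.ready && obs.pod_uid != uid {
                return true;
            }
        } else if !obs.ready {
            failed_uid = Some(obs.pod_uid.as_str());
        }
    }
    false
}
```

Comparing UIDs rather than only readiness distinguishes a pod that was evicted and recreated from one that eventually became ready on its own, which is the distinction the paragraph above describes.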

Fixes linkerd/linkerd2#11073

This fixes the issue of injected pods that cannot acquire proper network
config because `linkerd-cni` and/or the cluster's network CNI haven't
fully started. They are left in a permanent crash loop and once CNI is
ready, they need to be restarted externally, which is what this
controller does.

This controller "`linkerd-reinitialize-pods`" watches over events on
pods in the current node, which have been injected but are in a
terminated state and whose `linkerd-network-validator` container exited
with code 95, and proceeds to evict them so they can restart with a
proper network config.
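
The selection rule described above can be sketched as a pure predicate. This is a minimal illustration and not the controller's actual code: `PodView`, `ContainerStatus`, `ContainerState` and `should_evict` are hypothetical, simplified stand-ins for the Kubernetes pod status, and only the container name `linkerd-network-validator` and exit code 95 come from the description.

```rust
/// Exit code that, per the description, marks a failed network validation.
const VALIDATOR_FAILURE_EXIT_CODE: i32 = 95;

/// Container name taken from the description.
const VALIDATOR_CONTAINER: &str = "linkerd-network-validator";

/// Hypothetical, simplified container state (not the Kubernetes API type).
#[derive(Debug, Clone, PartialEq)]
pub enum ContainerState {
    Waiting,
    Running,
    Terminated { exit_code: i32 },
}

/// Hypothetical, simplified container status.
#[derive(Debug, Clone)]
pub struct ContainerStatus {
    pub name: String,
    pub state: ContainerState,
}

/// Hypothetical, simplified view of a pod as seen by the controller.
#[derive(Debug, Clone)]
pub struct PodView {
    pub node_name: String,
    /// Whether the pod has been injected with the Linkerd proxy.
    pub injected: bool,
    pub containers: Vec<ContainerStatus>,
}

/// Returns true when the pod runs on `current_node`, has been injected, and
/// its network validator container terminated with exit code 95.
pub fn should_evict(pod: &PodView, current_node: &str) -> bool {
    if pod.node_name != current_node || !pod.injected {
        return false;
    }
    pod.containers.iter().any(|c| {
        c.name == VALIDATOR_CONTAINER
            && c.state
                == ContainerState::Terminated {
                    exit_code: VALIDATOR_FAILURE_EXIT_CODE,
                }
    })
}
```

Keeping the decision in a function with no cluster access makes it easy to unit test separately from the watch and eviction calls, which need a live API server.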

The controller is to be deployed as an additional container in the
`linkerd-cni` DaemonSet (addressed in linkerd/linkerd2#xxx).

## TO-DOs

- Figure out why `/metrics` is returning a 404 (should show process metrics)
- Integration test
@alpeb alpeb requested a review from a team as a code owner December 11, 2023 21:31
@alpeb alpeb marked this pull request as draft December 11, 2023 21:31
@alpeb alpeb force-pushed the alpeb/linkerd-reinitialize-pods-integration branch from 2f34615 to eae377d Compare December 11, 2023 22:29
@alpeb alpeb force-pushed the alpeb/linkerd-reinitialize-pods-integration branch from eae377d to 82aa524 Compare December 11, 2023 22:52
@alpeb alpeb force-pushed the alpeb/linkerd-reinitialize-pods branch from e74450f to c7d9a91 Compare December 13, 2023 11:07
Base automatically changed from alpeb/linkerd-reinitialize-pods to main January 2, 2024 16:25
alpeb commented Jan 2, 2024

Superseded by #316

@alpeb alpeb closed this Jan 2, 2024