fix: have all flavors delete themselves even in failure #1508

tommartensen · 2025-03-05T12:04:11Z

Some flavors already do this, and some (like OCP) already clean up if they fail during create, but it is better to be consistent. This is a stopgap for an issue that should still be solved properly.

rhacs-bot · 2025-03-05T12:12:44Z

A single node development cluster (infra-pr-1508) was allocated in production infra for this PR.

CI will attempt to deploy quay.io/rhacs-eng/infra-server:0.10.86-1-gdaee46d474 to it.

🔌 You can connect to this cluster with:

gcloud container clusters get-credentials infra-pr-1508 --zone us-central1-a --project acs-team-temp-dev

🛠️ And pull infractl from the deployed dev infra-server with:

nohup kubectl -n infra port-forward svc/infra-server-service 8443:8443 &
make pull-infractl-from-dev-server

🚲 You can then use the dev infra instance e.g.:

bin/infractl -k -e localhost:8443 whoami

⚠️ Any clusters that you start using your dev infra instance should have a lifespan shorter than the development cluster instance. Otherwise they will not be destroyed, because the dev infra instance ceases to exist when the development cluster is deleted. ⚠️

Further Development

☕ If you make changes, you can commit and push and CI will take care of updating the development cluster.

🚀 If you only modify configuration (chart/infra-server/configuration) or templates (chart/infra-server/{static,templates}), you can get a faster update with:

make helm-deploy

Logs

Logs for the development infra depending on your @redhat.com authuser:

Or:

kubectl -n infra logs -l app=infra-server --tail=1 -f

davdhacs

tldr: I don't know enough about argo cd to know if we should add onExit now. On some create failures, do the workflows exit without ever running the destroy step, and is the workflow then deleted before the cleanup runs?

detail:
Are there reasons not to add this?
The onExit handlers were removed from the infra workflows in #320.
And the argocd code marks onExit as deprecated (https://github.com/argoproj/argo-workflows/blame/68fde4ffcbe9b84dfad969a45133cd4cc659d346/pkg/apis/workflow/v1alpha1/workflow_types.go#L1573).

The infra periodic cleanup code should find clusters that failed while their argocd workflow was paused, and resume those workflows so they run their destroy steps (added in https://github.com/stackrox/infra/pull/47/files#diff-c18c4c95a88e0fc7894470522a609f99R639 and scheduled at https://github.com/stackrox/infra/blame/master/service/cluster/cluster.go#L131).
Maybe argocd has changed in a way that makes this invalid or unreliable?

fix: have all flavors delete themselves even in failure

daee46d

tommartensen self-assigned this Mar 5, 2025

tommartensen marked this pull request as ready for review March 5, 2025 12:04

tommartensen requested a review from a team as a code owner March 5, 2025 12:04

davdhacs requested changes Mar 6, 2025
