Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log operator deployment status and gather pod logs #3800

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ventifus
Copy link
Collaborator

@ventifus ventifus commented Aug 27, 2024

Which issue this PR addresses:

Fixes ARO-9892

What this PR does / why we need it:

While troubleshooting aro pod startup failures in #3755, not enough logging was available to diagnose the problem. An earlier version of this PR was used in

Test plan for issue:

Revert 29f9229 in order to cause aro pod startup failure. Gather logs from failed e2e run, trimmed for clarity:

2024-10-09T17:31:44.4956882Z time="2024-10-09T17:31:44Z" level=info msg="running step [Condition pkg/cluster.(*manager).aroDeploymentReady, timeout 20m0s]" func="steps.Run()" file="pkg/util/steps/runner.go:56"
2024-10-09T17:31:44.5220770Z time="2024-10-09T17:31:44Z" level=info msg="deployment \"aro-operator-master\" ok status is: false, err is: <nil>" func="deploy.(*operator).IsReady()" file="pkg/operator/deploy/deploy.go:574"
2024-10-09T17:31:44.5746018Z time="2024-10-09T17:31:44Z" level=info msg="deployment \"aro-operator-master\" status: {\"observedGeneration\":1,\"replicas\":1,\"updatedReplicas\":1,\"unavailableReplicas\":1,\"conditions\":[{\"type\":\"Progressing\",\"status\":\"True\",\"lastUpdateTime\":\"2024-10-09T17:31:11Z\",\"lastTransitionTime\":\"2024-10-09T17:30:18Z\",\"reason\":\"NewReplicaSetAvailable\",\"message\":\"ReplicaSet \\\"aro-operator-master-56489845\\\" has successfully progressed.\"},{\"type\":\"Available\",\"status\":\"False\",\"lastUpdateTime\":\"2024-10-09T17:31:12Z\",\"lastTransitionTime\":\"2024-10-09T17:31:12Z\",\"reason\":\"MinimumReplicasUnavailable\",\"message\":\"Deployment does not have minimum availability.\"}]}" func="deploy.(*operator).IsReady()" file="pkg/operator/deploy/deploy.go:593"
2024-10-09T17:31:44.6348641Z time="2024-10-09T17:31:44Z" level=info msg="replicaset \"aro-operator-master-56489845\" status: {\"replicas\":1,\"fullyLabeledReplicas\":1,\"observedGeneration\":1}" func="deploy.(*operator).IsReady()" file="pkg/operator/deploy/deploy.go:612"
2024-10-09T17:31:44.6539946Z time="2024-10-09T17:31:44Z" level=info msg="pod \"aro-operator-master-56489845-vv9mp\" status: {\"phase\":\"Running\",\"conditions\":[{\"type\":\"Initialized\",\"status\":\"True\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2024-10-09T17:30:18Z\"},{\"type\":\"Ready\",\"status\":\"False\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2024-10-09T17:31:12Z\",\"reason\":\"ContainersNotReady\",\"message\":\"containers with unready status: [aro-operator]\"},{\"type\":\"ContainersReady\",\"status\":\"False\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2024-10-09T17:31:12Z\",\"reason\":\"ContainersNotReady\",\"message\":\"containers with unready status: [aro-operator]\"},{\"type\":\"PodScheduled\",\"status\":\"True\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2024-10-09T17:30:18Z\"}],\"hostIP\":\"10.66.248.7\",\"podIP\":\"10.130.0.60\",\"podIPs\":[{\"ip\":\"10.130.0.60\"}],\"startTime\":\"2024-10-09T17:30:18Z\",\"containerStatuses\":[{\"name\":\"aro-operator\",\"state\":{\"waiting\":{\"reason\":\"CrashLoopBackOff\",\"message\":\"back-off 40s restarting failed container=aro-operator pod=aro-operator-master-56489845-vv9mp_openshift-azure-operator(76c15605-7b57-43e3-ba28-dc5c23944ba2)\"}},\"lastState\":{\"terminated\":{\"exitCode\":1,\"reason\":\"Error\",\"startedAt\":\"2024-10-09T17:31:11Z\",\"finishedAt\":\"2024-10-09T17:31:11Z\",\"containerID\":\"cri-o://3832f7527a9dc45d0c8f6fc6f03c2aaf7fed142d5897129e7688bd107a17d693\"}},\"ready\":false,\"restartCount\":3,\"image\":\"arointsvc.azurecr.io/aro:121079b\",\"imageID\":\"arointsvc.azurecr.io/aro@sha256:e8572d44cfd64e73664f620d1d0b31ee6879e85a9093ea134fdca968f07618d6\",\"containerID\":\"cri-o://3832f7527a9dc45d0c8f6fc6f03c2aaf7fed142d5897129e7688bd107a17d693\",\"started\":false}],\"qosClass\":\"BestEffort\"}" func="deploy.(*operator).IsReady()" file="pkg/operator/deploy/deploy.go:632"
2024-10-09T17:51:44.7084554Z time="2024-10-09T17:51:44Z" level=info msg="time=\"2024-10-09T17:51:32Z\" level=error msg=\"stat /.config: no such file or directory\"\n" func="cluster.(*manager).logPodLogs()" file="pkg/cluster/gatherlogs.go:160"
2024-10-09T17:51:44.7088739Z time="2024-10-09T17:51:44Z" level=info func="cluster.(*manager).logPodLogs()" file="pkg/cluster/gatherlogs.go:160"
2024-10-09T17:51:44.7515582Z time="2024-10-09T17:51:44Z" level=info msg="time=\"2024-10-09T17:51:35Z\" level=error msg=\"stat /.config: no such file or directory\"\n" func="cluster.(*manager).logPodLogs()" file="pkg/cluster/gatherlogs.go:160"
2024-10-09T17:51:44.7522341Z time="2024-10-09T17:51:44Z" level=info func="cluster.(*manager).logPodLogs()" file="pkg/cluster/gatherlogs.go:160"
2024-10-09T17:51:44.7529488Z time="2024-10-09T17:51:44Z" level=info msg="pkg/cluster.(*manager).logPodLogs: pod status aro-operator-master-56489845-vv9mp: {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2024-10-09 17:30:18 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2024-10-09 17:41:22 +0000 UTC ContainersNotReady containers with unready status: [aro-operator]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2024-10-09 17:41:22 +0000 UTC ContainersNotReady containers with unready status: [aro-operator]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2024-10-09 17:30:18 +0000 UTC  }]    10.66.248.7 10.130.0.60 [{10.130.0.60}] 2024-10-09 17:30:18 +0000 UTC [] [{aro-operator {&ContainerStateWaiting{Reason:CrashLoopBackOff,Message:back-off 5m0s restarting failed container=aro-operator pod=aro-operator-master-56489845-vv9mp_openshift-azure-operator(76c15605-7b57-43e3-ba28-dc5c23944ba2),} nil nil} {nil nil &ContainerStateTerminated{ExitCode:1,Signal:0,Reason:Error,Message:,StartedAt:2024-10-09 17:51:32 +0000 UTC,FinishedAt:2024-10-09 17:51:32 +0000 UTC,ContainerID:cri-o://8bf02036ec554d99025efc8ebf1e23ca753b4207a5f67be1b1d224cfdf41846d,}} false 9 arointsvc.azurecr.io/aro:121079b arointsvc.azurecr.io/aro@sha256:e8572d44cfd64e73664f620d1d0b31ee6879e85a9093ea134fdca968f07618d6 cri-o://8bf02036ec554d99025efc8ebf1e23ca753b4207a5f67be1b1d224cfdf41846d 0xc0006c1e20}] BestEffort []}" func="cluster.(*manager).gatherFailureLogs()" file="pkg/cluster/gatherlogs.go:55"
2024-10-09T17:51:44.7538507Z time="2024-10-09T17:51:44Z" level=info msg="pkg/cluster.(*manager).logPodLogs: pod status aro-operator-worker-85d65958cd-cx5n7: {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2024-10-09 17:30:18 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2024-10-09 17:31:24 +0000 UTC ContainersNotReady containers with unready status: [aro-operator]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2024-10-09 17:31:24 +0000 UTC ContainersNotReady containers with unready status: [aro-operator]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2024-10-09 17:30:18 +0000 UTC  }]    10.66.249.4 10.129.2.8 [{10.129.2.8}] 2024-10-09 17:30:18 +0000 UTC [] [{aro-operator {&ContainerStateWaiting{Reason:CrashLoopBackOff,Message:back-off 5m0s restarting failed container=aro-operator pod=aro-operator-worker-85d65958cd-cx5n7_openshift-azure-operator(ccbc1b28-15be-47b8-a247-9227c99ac742),} nil nil} {nil nil &ContainerStateTerminated{ExitCode:1,Signal:0,Reason:Error,Message:,StartedAt:2024-10-09 17:51:35 +0000 UTC,FinishedAt:2024-10-09 17:51:35 +0000 UTC,ContainerID:cri-o://d3a750079178f0f008b81b1b0a2244472ee239a9e69e46583e5ac61fb08f8e72,}} false 9 arointsvc.azurecr.io/aro:121079b arointsvc.azurecr.io/aro@sha256:e8572d44cfd64e73664f620d1d0b31ee6879e85a9093ea134fdca968f07618d6 cri-o://d3a750079178f0f008b81b1b0a2244472ee239a9e69e46583e5ac61fb08f8e72 0xc000da667f}] BestEffort []}" func="cluster.(*manager).gatherFailureLogs()" file="pkg/cluster/gatherlogs.go:55"

Is there any documentation that needs to be updated for this PR?

How do you know this will function as expected in production?

@ventifus
Copy link
Collaborator Author

/azp run e2e

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ventifus ventifus changed the title Log operator deployment status Log operator deployment status and gather pod logs Aug 27, 2024
@ventifus ventifus force-pushed the ventifus/ARO-9892-operator-deployment-logging branch 2 times, most recently from 94c79dd to 8a8de3a Compare August 28, 2024 16:32
@ventifus
Copy link
Collaborator Author

/azp run e2e

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ventifus ventifus force-pushed the ventifus/ARO-9892-operator-deployment-logging branch from 8a8de3a to 20bee5c Compare August 28, 2024 16:43
@ventifus
Copy link
Collaborator Author

/azp run e2e

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ventifus ventifus force-pushed the ventifus/ARO-9892-operator-deployment-logging branch from 20bee5c to bc1ab3f Compare August 28, 2024 18:52
@ventifus
Copy link
Collaborator Author

/azp run e2e

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ventifus ventifus force-pushed the ventifus/ARO-9892-operator-deployment-logging branch from bc1ab3f to 7e12c54 Compare August 28, 2024 19:47
@ventifus
Copy link
Collaborator Author

/azp run e2e

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ventifus ventifus force-pushed the ventifus/ARO-9892-operator-deployment-logging branch from 7e12c54 to e587728 Compare August 28, 2024 20:20
@ventifus
Copy link
Collaborator Author

/azp run e2e

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ventifus ventifus force-pushed the ventifus/ARO-9892-operator-deployment-logging branch 2 times, most recently from 3cfca53 to 4443a52 Compare August 29, 2024 23:56
@ventifus
Copy link
Collaborator Author

/azp run e2e

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ventifus ventifus marked this pull request as ready for review October 7, 2024 17:43
@ventifus
Copy link
Collaborator Author

ventifus commented Oct 7, 2024

/azp run ci,e2e

Copy link

Azure Pipelines successfully started running 2 pipeline(s).

Copy link
Contributor

@kimorris27 kimorris27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! One other question aside from the question I left on the code: how'd you test it and how were the results?

pkg/cluster/gatherlogs.go Outdated Show resolved Hide resolved
@ventifus ventifus force-pushed the ventifus/ARO-9892-operator-deployment-logging branch from 4443a52 to c0267fe Compare October 9, 2024 16:37
@ventifus ventifus force-pushed the ventifus/ARO-9892-operator-deployment-logging branch from c0267fe to 9dc836a Compare October 9, 2024 16:42
@ventifus
Copy link
Collaborator Author

ventifus commented Oct 9, 2024

/azp run ci,e2e

Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@ventifus ventifus force-pushed the ventifus/ARO-9892-operator-deployment-logging branch from 9dc836a to 7796284 Compare October 9, 2024 18:03
@ventifus
Copy link
Collaborator Author

ventifus commented Oct 9, 2024

/azp run ci,e2e

Copy link

Azure Pipelines successfully started running 2 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants