k8s: Error out when annotation can not be set #13711
Conversation
Force-pushed from c3f796e to b7a1b25.
lgtm, it seems the major change was the check and returning early if the broker's ordinal matches the nodeID? was this the case you were referring to in your preamble?
```diff
@@ -132,6 +132,7 @@ var _ = Describe("Redpanda cluster scale resource", func() {

 	By("Scaling down only when decommissioning is done")
 	Expect(testAdminAPI.RemoveBroker(2)).To(BeTrue())
+	testAdminAPI.AddGhostBroker(admin.Broker{NodeID: 2, MembershipStatus: admin.MembershipStatusDraining})
```
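For context, a minimal sketch of what a mock admin API with such a helper could look like. Only `NodeID`, `MembershipStatus`, `RemoveBroker`, and `AddGhostBroker` come from the hunk above; everything else here (package name, map-based storage, the `"draining"` constant value) is an assumption for illustration, not the test suite's actual code:

```go
// Package fakeadmin is a hypothetical, in-memory stand-in for the
// operator's test admin API; it only illustrates the two calls in the diff.
package fakeadmin

import "sync"

type MembershipStatus string

// Assumed value; the real constant lives in the admin package.
const MembershipStatusDraining MembershipStatus = "draining"

type Broker struct {
	NodeID           int
	MembershipStatus MembershipStatus
}

type MockAdminAPI struct {
	mu      sync.Mutex
	brokers map[int]Broker
}

// RemoveBroker drops the broker entirely, as if it had left the cluster,
// and reports whether it was present.
func (m *MockAdminAPI) RemoveBroker(id int) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.brokers[id]; !ok {
		return false
	}
	delete(m.brokers, id)
	return true
}

// AddGhostBroker re-registers a broker entry without a backing Pod, which
// mimics the "ghost broker" situation the reconciler has to clean up.
func (m *MockAdminAPI) AddGhostBroker(b Broker) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.brokers == nil {
		m.brokers = map[int]Broker{}
	}
	m.brokers[b.NodeID] = b
}
```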
interesting, so this needed to have locks, so the sequence is important?
Our reconciliation has changed and we now require a Pod annotation with the Redpanda Node ID. This test was written without that constraint. To mimic a real-world example I added the ghost broker function.

I don't fully understand.

As I described in the cover letter, the problem is with the timing. We can not run the unsafe decommission function if the annotation for every Pod doesn't match reality.
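To illustrate the timing constraint described above, here is a hedged, self-contained sketch of a guard that only allows ghost-broker decommissioning once every Pod's node-id annotation matches the broker actually reported by the admin API. The annotation key, types, and function names are illustrative assumptions, not the operator's real API:

```go
package main

import (
	"fmt"
	"strconv"
)

// Simplified stand-in for a corev1.Pod; only the fields needed here.
type pod struct {
	Name        string
	Annotations map[string]string
}

// Assumed annotation key, not necessarily the one the operator uses.
const nodeIDAnnotation = "operator.redpanda.com/node-id"

// annotationsInSync reports whether each Pod's node-id annotation matches the
// node ID the admin API currently reports for that Pod (looked up by name).
func annotationsInSync(pods []pod, liveNodeID func(podName string) (int, bool)) bool {
	for _, p := range pods {
		want, ok := liveNodeID(p.Name)
		if !ok {
			return false // broker not registered yet
		}
		got, err := strconv.Atoi(p.Annotations[nodeIDAnnotation])
		if err != nil || got != want {
			return false // stale or missing annotation
		}
	}
	return true
}

func main() {
	pods := []pod{{Name: "cluster-1", Annotations: map[string]string{nodeIDAnnotation: "1"}}}
	// Broker restarted and re-registered with a new node ID.
	live := func(string) (int, bool) { return 3, true }
	if !annotationsInSync(pods, live) {
		fmt.Println("skipping ghost-broker decommission until annotations catch up")
	}
}
```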
Force-pushed from 630c0a1 to 37b618b.
Can you move the v2 test changes into another PR to better isolate problems?
Given the following operator logic https://github.com/redpanda-data/redpanda/blob/51766019d6e85b7252a0c637043fc5d9edbc4673/src/go/k8s/controllers/redpanda/cluster_controller.go#L266-L275 the chain of events that causes the wrong decommission logic is:

* `setPodNodeIDAnnotation` failed to get the Redpanda Node ID
* `setPodNodeIDAnnotation` does not return an error, it only logs it
* `setPodNodeIDLabel` was able to get the Redpanda Node ID, so the broker was able to boot up and register its Node ID
* `decommissionGhostBrokers` performs its filtering based on the Pod annotation, which is incorrect

Logs from the wrong execution:

```
{"level":"info","ts":"2023-09-27T06:51:41.644Z","logger":"ClusterReconciler.Reconcile.setPodNodeIDLabel","msg":"setting node-id label","controller":"cluster","controllerGroup":"redpanda.vectorized.io","controllerKind":"Cluster","Cluster":{"name":"repanda-cluster","namespace":"redpanda"},"namespace":"redpanda","name":"repanda-cluster","reconcileID":"57bf8b57-60b5-4248-bedd-3ffcaa5c4e07","pod-name":"repanda-cluster-1","new-node-id":3}
{"level":"info","ts":"2023-09-27T06:51:41.718Z","logger":"ClusterReconciler.Reconcile.setPodNodeIDAnnotation","msg":"decommission old node-id","controller":"cluster","controllerGroup":"redpanda.vectorized.io","controllerKind":"Cluster","Cluster":{"name":"repanda-cluster","namespace":"redpanda"},"namespace":"redpanda","name":"repanda-cluster","reconcileID":"fa64d3d8-02cf-4c1f-8a67-13a2dfb339ea","pod-name":"repanda-cluster-1","old-node-id":1}
{"level":"info","ts":"2023-09-27T06:51:41.758Z","logger":"ClusterReconciler.Reconcile.setPodNodeIDAnnotation","msg":"setting node-id annotation","controller":"cluster","controllerGroup":"redpanda.vectorized.io","controllerKind":"Cluster","Cluster":{"name":"repanda-cluster","namespace":"redpanda"},"namespace":"redpanda","name":"repanda-cluster","reconcileID":"fa64d3d8-02cf-4c1f-8a67-13a2dfb339ea","pod-name":"repanda-cluster-1","new-node-id":3}
{"level":"error","ts":"2023-09-27T06:51:42.497Z","msg":"Reconciler error","controller":"cluster","controllerGroup":"redpanda.vectorized.io","controllerKind":"Cluster","Cluster":{"name":"repanda-cluster","namespace":"redpanda"},"namespace":"redpanda","name":"repanda-cluster","reconcileID":"fa64d3d8-02cf-4c1f-8a67-13a2dfb339ea","error":"deleting ghost brokers: failed to decommission ghost broker: request PUT http://repanda-cluster-0.repanda-cluster.redpanda.svc.cluster.local/.:9644/v1/brokers/1/decommission failed: Bad Request, body: \"{\\\"message\\\": \\\"can not update broker 1 state, invalid state transition requested\\\", \\\"code\\\": 400}\"\n"}
```
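For illustration only, a minimal sketch of the behaviour the PR title describes: propagating the error when the node-id annotation cannot be set, instead of logging it and continuing into the decommission step. The types, annotation key, and signatures below are simplified assumptions, not the operator's actual code:

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-in for a corev1.Pod.
type pod struct {
	Name        string
	Annotations map[string]string
}

// nodeIDFetcher stands in for querying the Redpanda admin API for the Pod's
// node ID; it can fail while the broker is still starting up.
type nodeIDFetcher func(podName string) (int, error)

// Assumed annotation key, not necessarily the one the operator uses.
const nodeIDAnnotation = "operator.redpanda.com/node-id"

// setPodNodeIDAnnotation mirrors the failure mode from the log excerpt:
// before the fix the error below was only logged, so later steps (such as
// ghost-broker decommissioning) ran against stale annotations. Returning
// the error makes the reconcile loop stop and requeue instead.
func setPodNodeIDAnnotation(p *pod, fetch nodeIDFetcher) error {
	nodeID, err := fetch(p.Name)
	if err != nil {
		// The key change argued for here: propagate instead of log-and-continue.
		return fmt.Errorf("setting node-id annotation on pod %s: %w", p.Name, err)
	}
	if p.Annotations == nil {
		p.Annotations = map[string]string{}
	}
	p.Annotations[nodeIDAnnotation] = fmt.Sprint(nodeID)
	return nil
}

func main() {
	p := &pod{Name: "repanda-cluster-1"}
	failing := func(string) (int, error) { return 0, errors.New("admin API not reachable") }
	if err := setPodNodeIDAnnotation(p, failing); err != nil {
		fmt.Println("reconcile aborted:", err) // the controller would requeue here
	}
}
```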
In the case where a broker is removed from the mock admin API, the controller now always fails, because the Pod annotation with the Node ID can not be set. Decommission is never called, so the StatefulSet is not updated, and the integration tests get stuck because of that.
Force-pushed from 37b618b to 55835ab.
Here you go: #13898. I removed all commits that were trying to fix the e2e v2 tests. Please review.
lgtm, a lot of maintenance changes but a good cleanup
src/go/k8s/tests/e2e/update-image-tls-client-auth-external-ca/12-clean.yaml
/backport v23.2.x
Backports Required
Release Notes
Bug Fixes