
Conversation

@Balaji632

Kubernetes Day2 SOP

@avinash-platformatory (Contributor) left a comment:

  • Since this document is just for Confluent Platform Day 2 operations, it should be renamed accordingly. A generic Kubernetes SOP plus Kafka-workload-specific SOPs would be ideal, so that the Kubernetes-specific SOP can be reused for any Kafka workload, not just Confluent.
  • Format the commands using code blocks with the language specified. This allows the operator to copy the command using a button and formats the command based on the language specified. In most cases, it would be bash.
    Example -
3. Check pod health across namespaces. No pods should be stuck in CrashLoopBackOff or Pending state.
```bash
kubectl get pods --all-namespaces
```

would become
[screenshot: the same step rendered with a copyable bash code block]

  • The document seems too generic and not specific to StreamTime. My suggestion would be to deploy a cluster and perform these activities through StreamTime and manually using kubectl, documenting the actions performed. I have not reviewed it thoroughly since it is too generic.

@@ -0,0 +1,521 @@
---
title: Kubernetes Day2 Standard Operating Procedures (SOP)
Contributor:
Remove the title so that it does not show up in the navbar. We will still be able to access it using /operations/day2sop.html.

Author:

Done.

---
title: Kubernetes Day2 Standard Operating Procedures (SOP)
nav_order: 4
layout: Operations
Contributor:

It should be parent: Operations

Author:

Done

**Rollback**
* Not applicable; this SOP is for verification only.

### 2. Namespace Creation & Resource Quota Adjustment
Contributor:

Namespace Creation will not be part of Day2 operations since the namespaces would already be created during the creation of the cluster. Rollback mentions deleting the namespace - not sure about the use case where this is helpful from a Day2 perspective.

Author:

Section deleted. It's not required for clusters created via Fleet Manager.

**Prerequisites**

* kubectl access with cluster-admin privileges
* Maintenance window approved in change management system
Contributor:

Prerequisites should include adding a new node or ensuring sufficient capacity on the other existing nodes so that workloads on this node do not go unscheduled. If the workload is a Kafka pod, just checking for capacity might not suffice due to pod anti-affinity rules.
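For reference, a hedged sketch of the pre-drain capacity check and drain (the node name is illustrative):

```bash
# Check allocatable capacity on the remaining nodes before draining
kubectl describe nodes | grep -A 5 "Allocated resources"

# Cordon first so nothing new is scheduled onto the node
kubectl cordon <node-name>

# Drain respects PodDisruptionBudgets; note that Kafka pods with
# anti-affinity rules may still stay Pending if no eligible node exists
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```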

Author:

Updated section.

**Procedure**
1. Identify broker pods:
```bash
kubectl get pods -n <cfk-namespace> -l app=kafka
```
2. Restart a broker (one at a time):
Contributor:

If the goal is to restart all brokers sequentially, doing a statefulset rollout is a better option since the rollout will be controlled by Kubernetes and not done manually.
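A minimal sketch of the controlled rollout (the StatefulSet name is illustrative):

```bash
# Kubernetes restarts the broker pods one at a time, honoring the
# StatefulSet's update strategy and any PodDisruptionBudgets
kubectl rollout restart statefulset/<kafka-statefulset> -n <cfk-namespace>
kubectl rollout status statefulset/<kafka-statefulset> -n <cfk-namespace>
```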

Author:

Updated

1. Create or update Kubernetes Secret:
```bash
kubectl create secret generic <secret-name> \
  --from-literal=username=<user> \
  --from-literal=password=<password> \
  -n <namespace>
```
2. Patch the deployment or StatefulSet to mount the updated secret.
3. Restart affected pods if they don’t pick up new secrets automatically.
Contributor:

No need to restart pods for adding a new SASL/PLAIN user in CFK operator deployed clusters.
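A hedged sketch of the restart-free flow, assuming the CFK convention of a `plain-users.json` key in the credentials secret (file and secret names are illustrative):

```bash
# Recreate the secret with the updated users file; CFK picks up the
# change without restarting the broker pods
kubectl create secret generic <secret-name> \
  --from-file=plain-users.json=<updated-users>.json \
  -n <namespace> --dry-run=client -o yaml | kubectl apply -f -
```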

Author:

Updated

* Destroy the cluster using the same tool (terraform destroy, eksctl delete cluster, etc.) if provisioning fails.


### 10. Credential Addition
Contributor:

This section needs to be specific to the cluster type and authentication mechanism, e.g., adding a SASL/PLAIN user for CFK Kafka is different from adding a basic auth user for CFK Schema Registry, which is different from adding SASL/SCRAM users for Redpanda or any other cluster type or authentication mechanism. This is too generic and not helpful.

Author:

Updated

1. Review release notes for the target CFK version (check compatibility with Kubernetes and CP)
2. Backup existing CFK configuration:
```bash
kubectl get confluent -n <namespace> -o yaml > cfk-all-backup.yaml
```
3. Upgrade the CFK Operator Helm chart or manifest:
Contributor:

Operator upgrades involve more steps such as stopping reconciliation and resuming it post upgrade. The Confluent documentation is the best resource for this. We will support this through StreamTime soon.
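A minimal sketch of the pause/resume annotations from the Confluent documentation (the CR kind and name are illustrative; repeat for each managed CR):

```bash
# Pause reconciliation before upgrading the operator
kubectl annotate kafka kafka platform.confluent.io/block-reconcile=true -n <namespace>

# ... upgrade the CFK operator ...

# Resume reconciliation after the upgrade by removing the annotation
kubectl annotate kafka kafka platform.confluent.io/block-reconcile- -n <namespace>
```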

Author:

Section updated for operator upgrades.

* For YAML manifests: apply the updated manifest
4. Monitor Operator logs to ensure no failures:
```bash
kubectl logs -n <namespace> deploy/confluent-operator
```
5. Sequentially upgrade platform components (Kafka, Zookeeper, Connect, SR, etc.) using the updated CRDs.
Contributor:

This is best done through StreamTime but the steps for doing it manually should also be documented i.e., updating the image versions in the CRD, if it is not a major version upgrade.
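A hedged sketch of the manual path for a non-major upgrade (the CR name and tag are illustrative):

```bash
# Bump the application image tag in the Kafka CR; CFK then performs
# a rolling update of the broker pods
kubectl patch kafka kafka -n <namespace> --type merge \
  -p '{"spec":{"image":{"application":"confluentinc/cp-server:<new-tag>"}}}'
```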

Author:

Section updated for operator upgrades.

**Procedure**
1. Download the updated operator manifest from Confluent’s repository.
2. Apply the updated manifest:
```bash
kubectl apply -f <cfk-operator-manifest>.yaml
```
Contributor:

Upgrades should be done through Helm and follow the procedure documented in the Confluent documentation. Operator upgrades involve more steps such as stopping reconciliation and resuming it post upgrade. We will support this through StreamTime soon.
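A minimal sketch of the Helm path, assuming the standard Confluent chart coordinates and release name:

```bash
# Add/refresh the Confluent Helm repository
helm repo add confluentinc https://packages.confluent.io/helm
helm repo update

# Reconciliation should already be paused (see above) before this step
helm upgrade confluent-operator confluentinc/confluent-for-kubernetes \
  -n <namespace>
```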

Author:

Section updated as per CFK documentation.

```
2. Get logs for last 2 hrs for a specific pod:
```bash
$> kubectl lgos <pod-name> -n <namespace> --since=2h
Contributor:

Typo: logs

```
3. Get logs for last 1 hr for a specific pod:
```bash
$> kubectl lgos <pod-name> -n <namespace> --since=1h
Contributor (@avinash-platformatory, Sep 30, 2025):

Typo: logs

**Procedure**
1. Review existing alert rules in Prometheus:
```bash
kubectl get configmap prometheus-server -n <namespace> -o yaml
```
2. Add new alert rules (e.g., high broker CPU, partition under-replicated):
Contributor:

The section still mentions "Update Prometheus rule files with thresholds." without the actual steps, i.e., `kubectl edit prometheusrule <name> -n <namespace>`.
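For illustration, a hedged sketch of an under-replicated-partitions alert (resource names, the metric name, and the threshold are all illustrative and depend on the JMX exporter configuration):

```bash
# Applied as a PrometheusRule CR, which the Prometheus operator
# auto-reloads without restarting the Prometheus server
cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-alerts
  namespace: <namespace>
spec:
  groups:
    - name: kafka.rules
      rules:
        - alert: UnderReplicatedPartitions
          expr: kafka_server_replicamanager_underreplicatedpartitions > 0
          for: 5m
          labels:
            severity: critical
EOF
```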

```bash
Update Prometheus rule files with thresholds.
```
3. Reload Prometheus configuration:
Contributor:

The Prometheus rules configuration will be auto reloaded by the operator without requiring a restart of the Prometheus server i.e., without requiring to delete the Prometheus pod.

**Using Kubernetes Secrets**

*Kafka (CFK, SASL/PLAIN)*
1. Create a Kubernetes Secret with SASL/PLAIN user credentials:
Contributor:

The secret would already exist if the cluster is using SASL/PLAIN for authentication. Adding new users should be done through the StreamTime UI / API so that there is an audit trail. If StreamTime is not available, the SOP should be to edit the secret. The secret name for CFK is `kafka-sasl-plain-credential`. Ref - https://github.com/Platformatory/fleet-manager-helm/blob/8e2a730bc49cd71e03a4ccf239c0e7615a8884d8/charts/fleet-manager-cfk/templates/kafka.yaml#L177
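A hedged sketch of the manual fallback (values inside the secret are base64-encoded):

```bash
# Edit the existing credentials secret in place: decode the users
# entry, add the new user, and re-encode it before saving
kubectl edit secret kafka-sasl-plain-credential -n <namespace>
```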

3. Restart the Schema Registry pod(s) if required.


*Redpanda (SASL/SCRAM)*
Contributor:

The document is meant for Confluent workloads. We should create a separate document for Redpanda since there are other aspects not covered for Redpanda in this document.

2. Disable resource reconciliation - To prevent Confluent Platform components from rolling restarts, temporarily disable resource reconciliation of the components in each namespace where the Confluent Platform is deployed, specifying the CR kinds and CR names (*whichever is applicable*):

```bash
kubectl annotate connect connect \
Contributor:

Connect, KSQL and RestProxy are not applicable for workloads deployed through StreamTime.

application: <component image>:<tag>
```

b. If upgrading Control Center, specify the Control Center release as the Control Center image tag, the Prometheus image tag, and the Alertmanager image tag in the ControlCenter CR. Control Center is on independent versions and does not follow Confluent Platform releases.
Contributor:

The section needs to be formatted since the markdown formatting is broken when rendered by Jekyll.
[screenshot: broken markdown rendering]

b. If upgrading Control Center, specify the Control Center release as the Control Center image tag, the Prometheus image tag, and the Alertmanager image tag in the ControlCenter CR. Control Center is on independent versions and does not follow Confluent Platform releases.
Contributor:

We do not integrate Control Center with Prometheus and AlertManager, hence the changes to these fields in the CR is not relevant.


Upgrade Confluent Platform components as below:

a. In the component CR, update the component image tag. The tag is the Confluent Platform release you want to upgrade to:
Contributor:

This can be done through StreamTime and is recommended to be done through StreamTime. If there is a new version of Confluent supported by StreamTime, there would be an "Upgrade" button in the cluster detail page, which can be used to safely perform the upgrade.
