
Conversation

@Balaji632

Kubernetes Day2 SOP

@avinash-platformatory (Contributor) left a comment:

  • Since this document is just for Confluent Platform Day 2 operations, it should be renamed accordingly. A generic Kubernetes SOP plus Kafka-workload-specific SOPs would be ideal, so that the Kubernetes-specific SOP can be reused for any Kafka workload, not just Confluent.
  • Format the commands using code blocks with the language specified. This allows the operator to copy the command using a button and formats the command based on the language specified. In most cases, it would be bash.
    Example -
3. Check pod health across namespaces. No pods should be stuck in CrashLoopBackOff or Pending state.
```bash
kubectl get pods --all-namespaces
```

would become
[screenshot: the same step rendered with a copyable bash code block]

  • The document seems too generic and not specific to StreamTime. My suggestion would be to deploy a cluster and perform these activities through StreamTime and manually using kubectl, documenting the actions performed. I have not reviewed it thoroughly since it is too generic.

@@ -0,0 +1,521 @@
---
title: Kubernetes Day2 Standard Operating Procedures (SOP)
Contributor:
Remove the title so that it does not show up in the navbar. We will still be able to access it using /operations/day2sop.html.

Author:

Done.

---
title: Kubernetes Day2 Standard Operating Procedures (SOP)
nav_order: 4
layout: Operations
Contributor:

It should be parent: Operations

Author:

Done

**Rollback**
* Not applicable; this SOP is for verification only.

### 2. Namespace Creation & Resource Quota Adjustment
Contributor:

Namespace Creation will not be part of Day2 operations since the namespaces would already be created during the creation of the cluster. Rollback mentions deleting the namespace - not sure about the use case where this is helpful from a Day2 perspective.

Author:

Section deleted. It's not required for clusters created via Fleet Manager.

**Prerequisites**

* kubectl access with cluster-admin privileges
* Maintenance window approved in change management system
Contributor:

Prerequisites should include adding a new node or ensuring sufficient capacity on the other existing nodes so that workloads on this node do not go unscheduled. If the workload is a Kafka pod, just checking for capacity might not suffice due to pod anti-affinity rules.
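For reference, a hedged sketch of the pre-drain capacity check and drain (the node name is illustrative):

```bash
# Check allocatable capacity on the remaining nodes before draining
kubectl describe nodes | grep -A 5 "Allocated resources"

# Cordon first so nothing new is scheduled onto the node
kubectl cordon <node-name>

# Drain respects PodDisruptionBudgets; note that Kafka pods with
# anti-affinity rules may still stay Pending if no eligible node exists
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```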

Author:

Updated section.

**Procedure**
1. Identify broker pods:
```bash
kubectl get pods -n <cfk-namespace> -l app=kafka
```
2. Restart a broker (one at a time):
Contributor:

If the goal is to restart all brokers sequentially, doing a statefulset rollout is a better option since the rollout will be controlled by Kubernetes and not done manually.
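A minimal sketch of the controlled rollout (the StatefulSet name is illustrative):

```bash
# Kubernetes restarts the broker pods one at a time, honoring the
# StatefulSet's update strategy and any PodDisruptionBudgets
kubectl rollout restart statefulset/<kafka-statefulset> -n <cfk-namespace>
kubectl rollout status statefulset/<kafka-statefulset> -n <cfk-namespace>
```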

Author:

Updated

1. Create or update Kubernetes Secret:
```bash
kubectl create secret generic <secret-name> \
  --from-literal=username=<user> \
  --from-literal=password=<password> \
  -n <namespace>
```
2. Patch the deployment or StatefulSet to mount the updated secret.
3. Restart affected pods if they don’t pick up new secrets automatically.
Contributor:

No need to restart pods for adding a new SASL/PLAIN user in CFK operator deployed clusters.
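A hedged sketch of the restart-free flow, assuming the CFK convention of a `plain-users.json` key in the credentials secret (file and secret names are illustrative):

```bash
# Recreate the secret with the updated users file; CFK picks up the
# change without restarting the broker pods
kubectl create secret generic <secret-name> \
  --from-file=plain-users.json=<updated-users>.json \
  -n <namespace> --dry-run=client -o yaml | kubectl apply -f -
```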

Author:

Updated

* Destroy the cluster using the same tool (terraform destroy, eksctl delete cluster, etc.) if provisioning fails.


### 10. Credential Addition
Contributor:

This section needs to be specific to the cluster type and authentication mechanism, e.g., adding a SASL/PLAIN user for CFK Kafka is different from adding a basic auth user for CFK Schema Registry, which is different from adding SASL/SCRAM users for Redpanda or any other cluster type or authentication mechanism. This is too generic and not helpful.

Author:

Updated

1. Review release notes for the target CFK version (check compatibility with Kubernetes and CP)
2. Backup existing CFK configuration:
```bash
kubectl get confluent -n <namespace> -o yaml > cfk-all-backup.yaml
```
3. Upgrade the CFK Operator Helm chart or manifest:
Contributor:

Operator upgrades involve more steps such as stopping reconciliation and resuming it post upgrade. The Confluent documentation is the best resource for this. We will support this through StreamTime soon.
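A minimal sketch of the pause/resume annotations from the Confluent documentation (the CR kind and name are illustrative; repeat for each managed CR):

```bash
# Pause reconciliation before upgrading the operator
kubectl annotate kafka kafka platform.confluent.io/block-reconcile=true -n <namespace>

# ... upgrade the CFK operator ...

# Resume reconciliation after the upgrade by removing the annotation
kubectl annotate kafka kafka platform.confluent.io/block-reconcile- -n <namespace>
```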

Author:

Section updated for operator upgrades.

* For YAML manifests: apply the updated manifest
4. Monitor Operator logs to ensure no failures:
```bash
kubectl logs -n <namespace> deploy/confluent-operator
```
5. Sequentially upgrade platform components (Kafka, Zookeeper, Connect, SR, etc.) using the updated CRDs.
Contributor:

This is best done through StreamTime but the steps for doing it manually should also be documented i.e., updating the image versions in the CRD, if it is not a major version upgrade.
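A hedged sketch of the manual path for a non-major upgrade (the CR name and tag are illustrative):

```bash
# Bump the application image tag in the Kafka CR; CFK then performs
# a rolling update of the broker pods
kubectl patch kafka kafka -n <namespace> --type merge \
  -p '{"spec":{"image":{"application":"confluentinc/cp-server:<new-tag>"}}}'
```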

Author:

Section updated for operator upgrades.

**Procedure**
1. Download the updated operator manifest from Confluent’s repository.
2. Apply the updated manifest:
```bash
kubectl apply -f <cfk-operator-manifest>.yaml
```
Contributor:

Upgrades should be done through Helm and follow the procedure documented in the Confluent documentation. Operator upgrades involve more steps such as stopping reconciliation and resuming it post upgrade. We will support this through StreamTime soon.
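A minimal sketch of the Helm path, assuming the standard Confluent chart coordinates and release name:

```bash
# Add/refresh the Confluent Helm repository
helm repo add confluentinc https://packages.confluent.io/helm
helm repo update

# Reconciliation should already be paused (see above) before this step
helm upgrade confluent-operator confluentinc/confluent-for-kubernetes \
  -n <namespace>
```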

Author:

Section updated as per CFK documentation.

```
2. Get logs for last 2 hrs for a specific pod:
```bash
$> kubectl lgos <pod-name> -n <namespace> --since=2h
Contributor:

Typo: logs

```
3. Get logs for last 1 hr for a specific pod:
```bash
$> kubectl lgos <pod-name> -n <namespace> --since=1h
Contributor (@avinash-platformatory, Sep 30, 2025):

Typo: logs

**Procedure**
1. Review existing alert rules in Prometheus:
```bash
kubectl get configmap prometheus-server -n <namespace> -o yaml
```
2. Add new alert rules (e.g., high broker CPU, partition under-replicated):
Contributor:

The section still mentions "Update Prometheus rule files with thresholds." without the actual steps, i.e., `kubectl edit prometheusrule <name> -n <namespace>`.
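For illustration, a hedged sketch of an under-replicated-partitions alert (resource names, the metric name, and the threshold are all illustrative and depend on the JMX exporter configuration):

```bash
# Applied as a PrometheusRule CR, which the Prometheus operator
# auto-reloads without restarting the Prometheus server
cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-alerts
  namespace: <namespace>
spec:
  groups:
    - name: kafka.rules
      rules:
        - alert: UnderReplicatedPartitions
          expr: kafka_server_replicamanager_underreplicatedpartitions > 0
          for: 5m
          labels:
            severity: critical
EOF
```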

```bash
Update Prometheus rule files with thresholds.
```
3. Reload Prometheus configuration:
Contributor:

The Prometheus rules configuration will be auto reloaded by the operator without requiring a restart of the Prometheus server i.e., without requiring to delete the Prometheus pod.

**Using Kubernetes Secrets**

*Kafka (CFK, SASL/PLAIN)*
1. Create a Kubernetes Secret with SASL/PLAIN user credentials:
Contributor:

The secret would already exist if the cluster is using SASL/PLAIN for authentication. Adding new users should be done through the StreamTime UI / API so that there is an audit trail. If StreamTime is not available, the SOP should be to edit the secret. The secret name for CFK is `kafka-sasl-plain-credential`. Ref - https://github.com/Platformatory/fleet-manager-helm/blob/8e2a730bc49cd71e03a4ccf239c0e7615a8884d8/charts/fleet-manager-cfk/templates/kafka.yaml#L177
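A hedged sketch of the manual fallback (values inside the secret are base64-encoded):

```bash
# Edit the existing credentials secret in place: decode the users
# entry, add the new user, and re-encode it before saving
kubectl edit secret kafka-sasl-plain-credential -n <namespace>
```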

3. Restart the Schema Registry pod(s) if required.


*Redpanda (SASL/SCRAM)*
Contributor:

The document is meant for Confluent workloads. We should create a separate document for Redpanda since there are other aspects not covered for Redpanda in this document.

2. Disable resource reconciliation - To prevent Confluent Platform components from rolling restarts, temporarily disable resource reconciliation of the components in each namespace where the Confluent Platform is deployed, specifying the CR kinds and CR names (*whichever is applicable*):

```bash
kubectl annotate connect connect \
Contributor:

Connect, KSQL and RestProxy are not applicable for workloads deployed through StreamTime.

application: <component image>:<tag>
```

b. If upgrading Control Center, specify the Control Center release as the Control Center image tag, the Prometheus image tag, and the Alertmanager image tag in the ControlCenter CR. Control Center is on independent versions and does not follow Confluent Platform releases.
Contributor:

The section needs to be formatted since the markdown formatting is broken when rendered by Jekyll.
[screenshot: broken markdown rendering]

b. If upgrading Control Center, specify the Control Center release as the Control Center image tag, the Prometheus image tag, and the Alertmanager image tag in the ControlCenter CR. Control Center is on independent versions and does not follow Confluent Platform releases.
Contributor:

We do not integrate Control Center with Prometheus and AlertManager, hence the changes to these fields in the CR is not relevant.


Upgrade Confluent Platform components as below:

a. In the component CR, update the component image tag. The tag is the Confluent Platform release you want to upgrade to:
Contributor:

This can be done through StreamTime and is recommended to be done through StreamTime. If there is a new version of Confluent supported by StreamTime, there would be an "Upgrade" button in the cluster detail page, which can be used to safely perform the upgrade.
