From a671277939cb32d35c4eb3f31a333bf9edce9954 Mon Sep 17 00:00:00 2001
From: homernajafi
Date: Mon, 11 Mar 2019 13:36:00 -0700
Subject: [PATCH] Adding Qwicklab README file

---
 expand-contract-upgrade/README-QWICKLABS.md | 371 ++++++++++++++++++++
 1 file changed, 371 insertions(+)
 create mode 100644 expand-contract-upgrade/README-QWICKLABS.md

diff --git a/expand-contract-upgrade/README-QWICKLABS.md b/expand-contract-upgrade/README-QWICKLABS.md
new file mode 100644
index 0000000..c5b03f9
--- /dev/null
+++ b/expand-contract-upgrade/README-QWICKLABS.md
@@ -0,0 +1,371 @@
+# Kubernetes Engine Expand And Contract Update
+
+## Table of Contents
+
+
+* [Introduction](#introduction)
+* [Architecture](#architecture)
+* [Deployment](#deployment)
+  * [Manual Deployment](#manual-deployment)
+  * [Automated Deployment](#automated-deployment)
+* [Validation](#validation)
+* [Tear Down](#tear-down)
+* [Troubleshooting](#troubleshooting)
+* [Relevant Material](#relevant-material)
+
+
+## Introduction
+
+This demo illustrates how to use the 'expand and contract' pattern to upgrade
+a Kubernetes Engine cluster. The pattern is designed to avoid issues with
+resource availability in the course of a Kubernetes Engine upgrade.
+
+The amount of available resources is often called Headroom. Evaluating a
+cluster's headroom requires looking at two related aspects:
+
+1. **Cluster Headroom:** The sum of all unused resources across all nodes.
+1. **Node Headroom:** The amount of unused resources on a specific node.
+
+It is possible for a cluster to have sufficient Cluster Headroom to handle
+losing a single node while not having enough Node Headroom to reschedule every
+pod from the lost node. This is especially true with StatefulSet pods that
+can only attach Google Cloud Disks in a specific Availability Zone.
+
+The Expand and Contract Upgrade pattern increases both Node Headroom and Cluster
+Headroom by adding 1 or more new nodes to the node pool prior to starting the
+upgrade. Once the upgrade has completed, the extra nodes are removed.
+
+## Architecture
+
+In this demo, we walk through the steps to correctly perform an expand and
+contract upgrade.
+
+1. We will create a regional Kubernetes Engine cluster and deploy an
+    application to it.
+1. We will increase the node pool size of the cluster.
+1. We will upgrade the Kubernetes Engine control plane and the Kubernetes
+    Engine node pool separately.
+1. We will monitor pod activity during the upgrade process.
+1. We will resize the cluster back down to its original size upon successful
+    completion of the Kubernetes Engine upgrade.
+
+To complete this example, you will run `cluster_ops.sh` contained in
+this repository. It uses `gcloud` and `kubectl` commands to interact with
+the Google Cloud Platform and the Kubernetes Engine cluster.
+
+
+
+## Deployment
+
+### Manual Deployment
+
+Run each command below from the root of this repository. The validation
+section describes commands to monitor the status of the cluster and application
+during the upgrade procedure.
+
+Note that every time the `cluster_ops.sh` script is run, it checks the
+following items: dependencies are installed, the project specified in the
+properties file exists, and the appropriate APIs have been enabled.
+
+```console
+
+Checking dependencies are installed .....
+
+Checking the project specified for the demo exists .....
+
+Checking the appropriate api's are enabled .....
+```
+
+1. **Create the Kubernetes Engine cluster:**
+    The `create` action will create a regional Kubernetes Engine Cluster and
+    deploy the example application.
+
+    ```console
+    ./cluster_ops.sh create
+    ```
+
+    After a few minutes the Kubernetes Engine cluster will be created, the
+    Elasticsearch cluster will be installed, and an index containing the works
+    of Shakespeare will be loaded. The last several lines of output will look
+    like this:
+
+    ```console
+    Creating the Shakespeare index
+    {"acknowledged":true,"shards_acknowledged":true,"index":"shakespeare"}
+    Loading Shakespeare sample data into Elasticsearch
+      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
+                                     Dload  Upload   Total   Spent    Left  Speed
+    100 62.6M  100 38.5M  100 24.1M  1642k  1029k  0:00:24  0:00:24 --:--:-- 3719k
+    Sample data successfully loaded!
+    ```
+
+1. **Increase size of the Kubernetes Engine node pool:**
+    In this step we use the `resize` command to increase the size of the
+    node pool, passing a numeric argument that indicates the new size of the
+    node pool.
+
+    ```console
+    ./cluster_ops.sh resize 2
+    ```
+
+    After several minutes the resize process will complete and the node
+    pool size will now reflect the new size. The last several lines of output
+    will look something like this:
+
+    ```console
+    Resizing the node pool to 2 nodes .....
+
+    Resizing expand-contract-cluster...
+    .....................done.
+    Updated [https://container.googleapis.com/v1/projects/rolling-updates-poc-expand/zones/us-central1/clusters/expand-contract-cluster].
+    ```
+
+1. **Upgrade the control plane:**
+
+    ```console
+    ./cluster_ops.sh upgrade-control
+    ```
+
+    After several minutes, the upgrade will be complete and the output should
+    look similar to this:
+
+    ```console
+    Upgrading the K8s control plane .....
+
+    Upgrading expand-contract-cluster...
+    .........................................................................
+    ........done.
+    Updated [https://container.googleapis.com/v1/projects/rolling-updates-poc-expand/zones/us-central1/clusters/expand-contract-cluster].
+    ```
+
+1. **Upgrade the cluster nodes:**
+
+    ```console
+    ./cluster_ops.sh upgrade-nodes
+    ```
+    After several minutes, the upgrade will be complete and the output should
+    look similar to this:
+
+    ```console
+    Upgrading the K8s nodes .....
+
+    Upgrading expand-contract-cluster...
+    ...........................................................................
+    .........................................................................
+    ........done.
+    Updated [https://container.googleapis.com/v1/projects/rolling-updates-poc-expand/zones/us-central1/clusters/expand-contract-cluster].
+    ```
+
+1. **Decrease size of the Kubernetes Engine node pool:**
+    In this step we use the `resize` command to decrease the node pool back
+    to its original size, again passing a numeric argument that indicates the
+    new size of the node pool.
+
+    ```console
+    ./cluster_ops.sh resize 1
+    ```
+
+    After several minutes the resize process will complete and the node
+    pool size will now reflect the new size. The last several lines of output
+    will look something like this:
+
+    ```console
+    Resizing the node pool to 1 nodes .....
+
+    Resizing expand-contract-cluster...
+    ................................................................................................................................................................................done.
+    Updated [https://container.googleapis.com/v1/projects/rolling-updates-poc-expand/zones/us-central1/clusters/expand-contract-cluster].
+    ```
+
+### Automated Deployment
+
+The cluster creation, upgrade, and validation can be run with one command:
+```console
+./cluster_ops.sh auto
+```
+
+## Validation
+
+* **Control Plane Upgrade:** While the control plane is upgrading, you can
+  verify that Regional Kubernetes Engine clusters have an HA control plane by
+  querying the API server in a loop:
+
+  ```console
+  while true; do kubectl get pods; sleep 5; done
+  ```
+
+  To exit the loop, use `ctrl + c`.
+
+  As each control plane node is replaced, other running control plane nodes
+  will serve requests from the `kubectl` commands, providing a zero-downtime
+  upgrade. When the control plane upgrade is complete, you can see the new
+  `Server Version` with:
+
+  ```console
+  kubectl version
+  ```
+
+* **`gcloud` monitoring:** You can also monitor the progress of cluster
+  upgrades with the `gcloud` command. Both completed and in-progress
+  upgrades will be listed. Find the appropriate operation ID to get details
+  of an upgrade.
+
+  ```console
+  gcloud container operations list
+  gcloud container operations describe OPERATION_ID \
+    --region REGION
+  ```
+
+* **Cloud console monitoring:** You can also monitor the progress of cluster
+  upgrades in the GCP Console: under Kubernetes Engine, select your cluster to
+  see the upgrade progress shown as a percentage.
+
+* **Rescheduling:** As pods are deleted and nodes are drained, you can view
+  the progress of rescheduling:
+
+  ```console
+  kubectl get pods --all-namespaces
+  ```
+
+* **Application Health:** Throughout all upgrade steps, an HA application
+  with an appropriate number of pods should continue running uninterrupted. The
+  Elasticsearch cluster in this example will continue serving search queries
+  as long as the cluster health is `green` or `yellow`. It has 3 Data Nodes,
+  3 Client Nodes, and 3 Master Eligible Nodes with one elected Master.
+
+  In one terminal, configure a port-forward from the elasticsearch service to
+  your workstation's localhost:
+
+  ```console
+  kubectl port-forward svc/elasticsearch 9200
+  ```
+
+  Then in another terminal check the cluster health in a loop:
+
+  ```console
+  while true; do
+    date "+%H:%M:%S,%3N"
+    curl --max-time 1 'http://localhost:9200/_cluster/health' | jq .
+    echo ""
+    sleep 1
+  done
+  ```
+
+  A healthy cluster with all nodes available will look like this:
+
+  ```console
+  {
+    "cluster_name": "myesdb",
+    "status": "green",
+    "timed_out": false,
+    "number_of_nodes": 9,
+    "number_of_data_nodes": 3,
+    "active_primary_shards": 5,
+    "active_shards": 10,
+    "relocating_shards": 0,
+    "initializing_shards": 0,
+    "unassigned_shards": 0,
+    "delayed_unassigned_shards": 0,
+    "number_of_pending_tasks": 0,
+    "number_of_in_flight_fetch": 0,
+    "task_max_waiting_in_queue_millis": 0,
+    "active_shards_percent_as_number": 100
+  }
+  ```
+
+  In yet another terminal window, you can run a loop to test the availability
+  of the search API, which should continue working during a Master re-election:
+  ```console
+  while true; do
+    date "+%H:%M:%S,%3N"
+    curl --max-time 1 'http://localhost:9200/shakespeare/_search?q=happy%20dagger'
+    echo ""
+    sleep 1
+  done
+  ```
+
+* **Completed Upgrade:** After the upgrade steps have been completed, the
+  `validation.sh` script will check the control plane version and each
+  node's version. Execute it from within this directory:
+  ```console
+  ./validation.sh
+  ```
+  Successful output will look like this:
+  ```console
+  Validating the control plane version...
+  Control plane is upgraded to 1.10.4-gke.2!
+  Validating the Nodes...
+  All nodes upgraded to 1.10.4-gke.2!
+  Validating the number of hello-server pods running...
+  All hello-server pods have been running.
+  ```
+
+
+## Tear Down
+To delete the Kubernetes Engine cluster and all other resources generated during
+this example, run the following command:
+
+```console
+./cluster_ops.sh delete
+```
+
+## Troubleshooting
+
+* `E0717 09:45:59.417020 1245 portforward.go:178] lost connection to pod`
+
+  The port-forward command will occasionally fail, especially as the cluster is
+  being manipulated. Execute the following command to reconnect:
+  ```console
+  kubectl port-forward svc/elasticsearch 9200
+  ```
+
+* `Currently upgrading cluster` Error:
+
+  ```console
+  ERROR: (gcloud.container.node-pools.delete) ResponseError: code=400, message=Operation operation-1529415957904-496c7278 is currently upgrading cluster blue-green-test. Please wait and try again once it is done.
+  ```
+
+* `IN_USE_ADDRESSES` Quota Error:
+
+  ```console
+  ERROR: (gcloud.container.clusters.create) ResponseError: code=403, message=Insufficient regional quota to satisfy request for resource: "IN_USE_ADDRESSES". The request requires '9.0' and is short '1.0'. The regional quota is '8.0' with '8.0' available.
+  ```
+
+  1. Open the GCP Console and navigate to `IAM & admin` -> `Quotas`.
+  1. Filter the quotas by selecting your region under `Location`.
+  1. Check the box next to `Compute Engine API In-use IP addresses global`,
+     then click `EDIT QUOTAS`.
+  1. Follow the steps to increase the quota. Quotas are not immediately
+     increased.
+
+* `CPUS` Quota Error:
+
+  ```console
+  ERROR: (gcloud.container.node-pools.create) ResponseError: code=403, message=Insufficient regional quota to satisfy request for resource: "CPUS". The request requires '12.0' and is short '3.0'. The regional quota is '24.0' with '9.0' available.
+  ```
+  1. Open the GCP Console and navigate to `IAM & admin` -> `Quotas`.
+  1. Filter the quotas by selecting your region under `Location`.
+  1. Check the box next to `Compute Engine API CPUs`, then click `EDIT QUOTAS`.
+  1. Follow the steps to increase the quota. Quotas are not immediately
+     increased.
+
+* `Upgrade` Error after resize:
+
+  ```console
+  ERROR: (gcloud.container.clusters.upgrade) ResponseError: code=400,
+  message=Operation operation-1528990089723-411a9049 is currently upgrading
+  cluster expand-contract-cluster. Please wait and try again once it is done.
+  ```
+
+  1. This is expected behavior. If a node pool is resized beyond 5 nodes,
+     Kubernetes Engine automatically scales up the control plane to manage
+     the additional resources. This is separate from the Kubernetes Engine
+     version upgrade performed as part of this demo.
+  1. Wait for the cluster to be in a green state, then continue with the next
+     step in the demo.
+
+## Relevant Material
+
+* `PodDisruptionBudgets` - [Kubernetes Disruptions](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/)
+* `readinessProbe` - [Pod lifecycle](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/)
+* [Kubernetes Engine Release Notes](https://cloud.google.com/kubernetes-engine/release-notes)
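
The Introduction's distinction between Cluster Headroom and Node Headroom can be made concrete with a small worked example. The sketch below is illustrative only: the `headroom_report` function, the node names, the CPU figures, and the 1000m pod request are all hypothetical and are not taken from `cluster_ops.sh` or the demo cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of this repository.
# Reads rows of "name allocatable_mcpu requested_mcpu" on stdin.
# $1 is the CPU request (in millicores) of a pod that must be rescheduled.
headroom_report() {
  awk -v need="$1" '
    {
      free = $2 - $3                  # Node Headroom: unused CPU on this node
      total += free                   # Cluster Headroom: sum over all nodes
      if (free > best) best = free    # largest single-node headroom seen so far
      printf "%s node_headroom=%dm\n", $1, free
    }
    END {
      printf "cluster_headroom=%dm\n", total
      printf "largest_node_headroom=%dm\n", best
      print (best >= need ? "pod_fits=yes" : "pod_fits=no")
    }'
}

# Made-up three-node pool: 1500m is free in total, but no node has 1000m free.
headroom_report 1000 <<'EOF'
node-a 2000 1500
node-b 2000 1400
node-c 2000 1600
EOF
```

With these made-up numbers the report shows `cluster_headroom=1500m` but `pod_fits=no`, because the largest single node has only 600m free. Appending an empty node such as `node-d 2000 0` to the input models the expand step and changes the result to `pod_fits=yes`.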