Create section in EKS docs on how to clone an instance #1502
Conversation
These steps currently don't work. Instead of the snapshot's data (copied onto the new volume) showing up in the new CHT instance, there is a clean install of the CHT instead.
So I'll delete the volume (and snapshot if it's a dev instance), update the steps in this PR and try again! |
@henokgetachew - can you take another look at what I might be doing wrong? I deleted the volume I created before and then created a new one, being sure to specify the AZ:
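For reference, creating a volume from a snapshot in a specific AZ looks roughly like this (a sketch; the IDs, AZ and tags mirror the describe output below, and the exact command used may have differed):

```sh
# Create a volume from the snapshot in the same AZ as the EKS worker nodes,
# carrying the cluster tags so Kubernetes can attach it.
aws ec2 create-volume \
  --availability-zone eu-west-2b \
  --snapshot-id snap-0d0840a657afe84e7 \
  --volume-type gp2 \
  --tag-specifications 'ResourceType=volume,Tags=[{Key=owner,Value=mrjones},{Key=kubernetes.io/cluster/dev-cht-eks,Value=owned},{Key=KubernetesCluster,Value=dev-cht-eks}]'

# Then check the result:
aws ec2 describe-volumes --volume-ids vol-0fee7609aa7757984
```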
Here's the description from the AWS CLI:

```json
{
"Volumes": [
{
"Attachments": [],
"AvailabilityZone": "eu-west-2b",
"CreateTime": "2024-08-28T19:42:35.650000+00:00",
"Encrypted": false,
"Size": 900,
"SnapshotId": "snap-0d0840a657afe84e7",
"State": "available",
"VolumeId": "vol-0fee7609aa7757984",
"Iops": 2700,
"Tags": [
{
"Key": "owner",
"Value": "mrjones"
},
{
"Key": "kubernetes.io/cluster/dev-cht-eks",
"Value": "owned"
},
{
"Key": "KubernetesCluster",
"Value": "dev-cht-eks"
},
{
"Key": "use",
"Value": "allies-hosting-tco-testing"
},
{
"Key": "snapshot-from",
"Value": "moh-zanzibar-Aug-26-2024"
}
],
"VolumeType": "gp2",
"MultiAttachEnabled": false
}
]
}
```

I set the volume ID in my values file:

```yaml
# tail -n4 mrjones.yml
remote:
existingEBS: "true"
existingEBSVolumeID: "vol-0fee7609aa7757984"
existingEBSVolumeSize: "900Gi"
```

And then run the deploy:
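(For context, the deploy step is roughly the following; the cht-deploy invocation and paths here are assumptions, so adjust them to your checkout:)

```sh
# Run the CHT deploy script against the values file shown above.
# NOTE: exact path and flags are an assumption - adjust to match your setup.
cd cht-core/scripts/deploy
./cht-deploy -f ~/mrjones.yml
```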
However I get a
Here's my values file - password and secret changed to protect the innocent:

```yaml
project_name: mrjones-dev
namespace: "mrjones-dev"
chtversion: 4.5.2
#cht_image_tag: 4.1.1-4.1.1 #- This is filled in automatically by the deploy script. Don't uncomment this line.
couchdb:
password: hunter2
secret: Correct-Horse-Battery-Staple
user: medic
uuid: 1c9b420e-1847-49e9-9cdf-5350b32f6c85
clusteredCouch_enabled: false
couchdb_node_storage_size: 20Gi
clusteredCouch:
noOfCouchDBNodes: 1
toleration: # This is for the couchdb pods. Don't change this unless you know what you're doing.
key: "dev-couchdb-only"
operator: "Equal"
value: "true"
effect: "NoSchedule"
ingress:
annotations:
groupname: "dev-cht-alb"
tags: "Environment=dev,Team=QA"
certificate: "arn:aws:iam::720541322708:server-certificate/2024-wildcard-dev-medicmobile-org-chain"
host: "mrjones.dev.medicmobile.org"
hosted_zone_id: "Z3304WUAJTCM7P"
load_balancer: "dualstack.k8s-devchtalb-3eb0781cbb-694321496.eu-west-2.elb.amazonaws.com"
environment: "remote" # "local" or "remote"
remote:
existingEBS: "true"
existingEBSVolumeID: "vol-0fee7609aa7757984"
existingEBSVolumeSize: "900Gi"
```

|
@mrjones-plip Okay, I have finally figured out why this didn't work for you: it's your values.yaml. You basically missed the main flag that tells helm to look for pre-existing volumes in the next sections. It should be configured like this: I have tested this configuration and it has worked for me.
|
Thanks @henokgetachew ! However, this is still not working :( I've updated this PR with the exact steps I did. I'm wondering if all the IDs in my cloned instance need to match the production instance maybe? Anyway, here's my values file with
And the deploy goes well:
And all the resources show as started:
But I get an error. Couch seems in a bad way, which is likely the main problem:
With couch down, it's not worth checking, but API and sentinel are unhappy - they both have near identical
HA Proxy is unsurprisingly
|
@dianabarsan and I did a deep dive into this today, and my test instance now starts up. At this point we suspect it might be a permissions error? Per below, the volume mounts but we can't see any of the data, so that's our guess. We found out that:
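(To illustrate that check: something along these lines, where the namespace, pod name and mount path are assumptions, shows whether the mounted volume actually contains data:)

```sh
# Confirm the 900GB volume is mounted inside the CouchDB pod...
kubectl -n mrjones-dev exec -it cht-couchdb-0 -- df -h /opt/couchdb/data

# ...and list what is actually visible at the mount point.
kubectl -n mrjones-dev exec -it cht-couchdb-0 -- ls -la /opt/couchdb/data
```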
|
I have some downtime today. I will try to have a look if it's a quick thing. |
Pushed a PR here. Let me know if that solves it. |
Thanks so much for coming off your holiday to do some work! Per my slack comment, I don't know how to test this branch in the cht-conf script |
It doesn't release beta builds for now. If the code looks good to you, then the only way to test right now is to approve and merge the PR, which should release a patch version of the helm charts that cht-deploy will pick up when deploying. |
Despite it only being 7 lines of change, I'm not really in a position to know if these changes look good. I would very much like to be able to test this, or defer to someone else who knows what these changes actually do. I'll pursue the idea of running the changes manually via |
I tried this just now and got the same result:
|
Update: I have reproduced the issue and am debugging it right now. |
Thanks @henokgetachew! It sounds like you reproduced the issue, but to be clear - the issue wasn't that the password was wrong after starting CHT, the issue was that the data wasn't even showing up on disk. That is, we'd mount a 900GB volume to |
Correct. That's what I reproduced. |
Here's your subPath issue. Unfortunately, you picked to clone a project that had pre-existing medic-os data that was migrated from 3.x to 4.x in an edge scenario. We are stuck in helm-chart madness, and haven't gotten around to adding all the possible scenarios. In a cht-core 3.x upgrade to 4.x, we didn't use the helm chart every time due to time constraints and modified deployment templates directly. The main thing that needed to be modified was subPath. Essentially, on your clone deployment trials, couchDB was searching for data in a new directory and therefore starting a fresh install. In medic-os we kept couchdb data in a different directory than the helm charts use. Sorry this was a headache for you @mrjones-plip ! To review: not a permissions issue, but some new scenarios to add to our helm charts. Investigating production
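To make the subPath point concrete, one way to see which directory the deployed chart is actually pointing CouchDB at (namespace and resource names are assumptions):

```sh
# Show the data volume mount and its subPath in the rendered statefulset.
kubectl -n mrjones-dev get statefulsets -o yaml | grep -B3 -A1 'subPath:'
```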
Here is subPath from helm-charts; you can see it's a different directory than medic-os installs |
Yup, that's correct. I just pushed two PRs earlier fixing the issues (Helm-charts, cht-deploy). The change that needs to be made is basically what the mount uses as a path. That value for this project needs to be:
The syntax is shown below. So in your new values.yaml, after the PR gets merged:
Also make sure you use the new values.yaml from the new PR. The key that has changed that's relevant to you is below (i.e. we're now supporting pre-existing data for clustered couchdb too).
|
Thanks for the updates @Hareet and @henokgetachew ! Having just restored production copies from 3 production snapshots, I can attest that the paths to
That said - there's a lot of info above which makes it complicated to test here - I'm not exactly sure what my next steps are. I'd love to just update the happy path steps and follow them to ensure it works. Please feel free to update this PR's docs directly! I've done a bit of testing over on the troubleshooting script PR in hopes of moving everything along. |
Thanks so much for adding directly to this PR's docs content @henokgetachew ! I plan on getting to this early next week. |
Thanks for the commits @henokgetachew ! I'm getting a new error following the exact steps here:
I note that the new yaml file I started with has fields I wasn't expecting. Here's the file:

```yaml
project_name: "mrjones-dev"
namespace: "mrjones-dev" # e.g. "cht-dev-namespace"
chtversion: 4.5.2
upstream_servers:
docker_registry: "public.ecr.aws/medic"
builds_url: "https://staging.dev.medicmobile.org/_couch/builds_4"
upgrade_service:
tag: 0.32
couchdb:
password: "hunter2" # Avoid using non-url-safe characters in password
secret: "45e46ee4-540e-4c21-814f-8d0e6dd88f2d" # Any value, e.g. a UUID.
user: "medic"
uuid: "45e46ee4-540e-4c21-814f-8d0e6dd88f2d" # Any UUID
clusteredCouch_enabled: false
couchdb_node_storage_size: 900Mi
clusteredCouch:
noOfCouchDBNodes: 3
toleration: # This is for the couchdb pods. Don't change this unless you know what you're doing.
key: "dev-couchdb-only"
operator: "Equal"
value: "true"
effect: "NoSchedule"
ingress:
annotations:
groupname: "dev-cht-alb"
tags: "Environment=dev,Team=QA"
certificate: "arn:aws:iam::720541322708:server-certificate/2024-wildcard-dev-medicmobile-org-chain"
host: "mrjones.dev.medicmobile.org"
hosted_zone_id: "Z3304WUAJTCM7P"
load_balancer: "dualstack.k8s-devchtalb-3eb0781cbb-694321496.eu-west-2.elb.amazonaws.com"
environment: "remote" # "local", "remote"
cluster_type: "eks" # "eks" or "k3s-k3d"
cert_source: "eks-medic" # "eks-medic" or "specify-file-path" or "my-ip-co"
certificate_crt_file_path: "/path/to/certificate.crt" # Only required if cert_source is "specify-file-path"
certificate_key_file_path: "/path/to/certificate.key" # Only required if cert_source is "specify-file-path"
nodes:
node-1: "" # This is the name of the first node where couchdb will be deployed
node-2: "" # This is the name of the second node where couchdb will be deployed
node-3: "" # This is the name of the third node where couchdb will be deployed
k3s_use_vSphere_storage_class: "false" # "true" or "false"
vSphere:
datastoreName: "DatastoreName" # Replace with your datastore name
diskPath: "path/to/disk" # Replace with your disk path
couchdb_data:
preExistingDataAvailable: "true" #If this is false, you don't have to fill in details in local_storage or remote.
dataPathOnDiskForCouchDB: "storage/medic-core/couchdb/data" # This is the path where couchdb data will be stored. Leave it as data if you don't have pre-existing data.
partition: "0" # This is the partition number for the EBS volume. Leave it as 0 if you don't have a partitioned disk.
local_storage: #If using k3s-k3d cluster type and you already have existing data.
preExistingDiskPath-1: "/var/lib/couchdb1" #If node1 has pre-existing data.
preExistingDiskPath-2: "/var/lib/couchdb2" #If node2 has pre-existing data.
preExistingDiskPath-3: "/var/lib/couchdb3" #If node3 has pre-existing data.
ebs:
preExistingEBSVolumeID-1: "vol-05b22d15773376c76" # If you have already created the EBS volume, put the ID here.
preExistingEBSVolumeID-2: "vol-0123456789abcdefg" # If you have already created the EBS volume, put the ID here.
preExistingEBSVolumeID-3: "vol-0123456789abcdefg" # If you have already created the EBS volume, put the ID here.
preExistingEBSVolumeSize: "900Gi" # The size of the EBS volume.
```

|
@Hareet - we have a call scheduled this week to go over this PR. Confirming that the above error happens when I follow the latest steps, including the latest commits. Hope to resolve all this on our call! |
@mrjones-plip the patch error is because you'd need to first delete the pv. It should work if you delete the pv and re-run the command. |
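A sketch of that cleanup (the PV name is a placeholder; check the `kubectl get pv` output for the real one):

```sh
# Find the persistentVolume created for the CouchDB claim, then delete it.
kubectl get pv
kubectl delete pv <pv-name>
```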
Okay, I've figured it out. You got your deployment stuck in a weird state (a terminating persistentVolume) because it seems at some point you pushed up a default values file. Here is the persistentVolume that your current helm-chart was unable to overwrite and create:
Essentially, the finalizer is listed as the AWS EBS container storage interface, and it's been stuck waiting for AWS to terminate a volume. You can see this listed in the error output that you pasted, with two volumeHandle values shown. As Henok noted, we've got to delete the pv, but since it's in a terminating state, we have to patch it to stop using a finalizer.
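For reference, clearing the finalizer on a stuck PV typically looks like this (the PV name is a placeholder for the one shown in the error output):

```sh
# Drop the finalizers so the Terminating persistentVolume can actually go away,
# then delete it and re-run the deploy.
kubectl patch pv <pv-name> -p '{"metadata":{"finalizers":null}}'
kubectl delete pv <pv-name>
```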
|
Wow - thanks @henokgetachew and @Hareet for all the debugging. I finally finally FINALLY got it working. I have a few last tweaks to make. cc @1yuv as maybe the only other non-SRE, not-mrjones teammate who can use this!?
Thanks so much for all your effort!!
Description
Create section in EKS docs on how to clone an instance
License
The software is provided under AGPL-3.0. Contributions to this project are accepted under the same license.