Create section in EKS docs on how to clone an instance #1502
Conversation
These steps currently don't work. Instead of the snapshot's data (copied onto the new volume) showing up in the new CHT instance, there is a clean install of the CHT instead.
So I'll delete the volume (and snapshot if it's a dev instance), update the steps in this PR and try again! |
@henokgetachew - can you take another look at what I might be doing wrong? I deleted the volume I created before and then created a new one, being sure to specify the AZ:
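For reference, creating a volume from a snapshot in a specific AZ looks roughly like this (a sketch; the IDs, AZ and tags mirror the describe output below, and the exact command used may have differed):

```sh
# Create a volume from the snapshot in the same AZ as the EKS worker nodes,
# carrying the cluster tags so Kubernetes can attach it.
aws ec2 create-volume \
  --availability-zone eu-west-2b \
  --snapshot-id snap-0d0840a657afe84e7 \
  --volume-type gp2 \
  --tag-specifications 'ResourceType=volume,Tags=[{Key=owner,Value=mrjones},{Key=kubernetes.io/cluster/dev-cht-eks,Value=owned},{Key=KubernetesCluster,Value=dev-cht-eks}]'

# Then check the result:
aws ec2 describe-volumes --volume-ids vol-0fee7609aa7757984
```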
Here's the description from the AWS CLI:

```json
{
"Volumes": [
{
"Attachments": [],
"AvailabilityZone": "eu-west-2b",
"CreateTime": "2024-08-28T19:42:35.650000+00:00",
"Encrypted": false,
"Size": 900,
"SnapshotId": "snap-0d0840a657afe84e7",
"State": "available",
"VolumeId": "vol-0fee7609aa7757984",
"Iops": 2700,
"Tags": [
{
"Key": "owner",
"Value": "mrjones"
},
{
"Key": "kubernetes.io/cluster/dev-cht-eks",
"Value": "owned"
},
{
"Key": "KubernetesCluster",
"Value": "dev-cht-eks"
},
{
"Key": "use",
"Value": "allies-hosting-tco-testing"
},
{
"Key": "snapshot-from",
"Value": "moh-zanzibar-Aug-26-2024"
}
],
"VolumeType": "gp2",
"MultiAttachEnabled": false
}
]
}
```

I set the volume ID in my values file:

```yaml
# tail -n4 mrjones.yml
remote:
existingEBS: "true"
existingEBSVolumeID: "vol-0fee7609aa7757984"
existingEBSVolumeSize: "900Gi"
```

And then run the deploy:
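(For context, the deploy step is roughly the following; the cht-deploy invocation and paths here are assumptions, so adjust them to your checkout:)

```sh
# Run the CHT deploy script against the values file shown above.
# NOTE: exact path and flags are an assumption - adjust to match your setup.
cd cht-core/scripts/deploy
./cht-deploy -f ~/mrjones.yml
```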
However I get a
Here's my values file - password and secret changed to protect the innocent:

```yaml
project_name: mrjones-dev
namespace: "mrjones-dev"
chtversion: 4.5.2
#cht_image_tag: 4.1.1-4.1.1 #- This is filled in automatically by the deploy script. Don't uncomment this line.
couchdb:
password: hunter2
secret: Correct-Horse-Battery-Staple
user: medic
uuid: 1c9b420e-1847-49e9-9cdf-5350b32f6c85
clusteredCouch_enabled: false
couchdb_node_storage_size: 20Gi
clusteredCouch:
noOfCouchDBNodes: 1
toleration: # This is for the couchdb pods. Don't change this unless you know what you're doing.
key: "dev-couchdb-only"
operator: "Equal"
value: "true"
effect: "NoSchedule"
ingress:
annotations:
groupname: "dev-cht-alb"
tags: "Environment=dev,Team=QA"
certificate: "arn:aws:iam::720541322708:server-certificate/2024-wildcard-dev-medicmobile-org-chain"
host: "mrjones.dev.medicmobile.org"
hosted_zone_id: "Z3304WUAJTCM7P"
load_balancer: "dualstack.k8s-devchtalb-3eb0781cbb-694321496.eu-west-2.elb.amazonaws.com"
environment: "remote" # "local" or "remote"
remote:
existingEBS: "true"
existingEBSVolumeID: "vol-0fee7609aa7757984"
existingEBSVolumeSize: "900Gi"
```

|
@mrjones-plip Okay, I have finally figured out why this didn't work for you: it's your values.yaml. You basically missed the main flag that tells helm to look for pre-existing volumes in the next sections. It should be configured like this: I have tested this configuration and it has worked for me.
|
Thanks @henokgetachew ! However, this is still not working :( I've updated this PR with the exact steps I did. I'm wondering if all the IDs in my cloned instance need to match the production instance maybe? Anyway, here's my values file with
And the deploy goes well:
And all the resources show as started:
But I get an error. Couch seems in a bad way, which is likely the main problem:
With couch down, it's not worth checking, but API and sentinel are unhappy - they both have near identical
HA Proxy is unsurprisingly
|
@dianabarsan and I did a deep dive into this today, and my test instance now starts up. At this point we suspect it might be a permissions error? Per below, the volume mounts but we can't see any of the data, so that's our guess. We found out that:
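(To illustrate that check: something along these lines, where the namespace, pod name and mount path are assumptions, shows whether the mounted volume actually contains data:)

```sh
# Confirm the 900GB volume is mounted inside the CouchDB pod...
kubectl -n mrjones-dev exec -it cht-couchdb-0 -- df -h /opt/couchdb/data

# ...and list what is actually visible at the mount point.
kubectl -n mrjones-dev exec -it cht-couchdb-0 -- ls -la /opt/couchdb/data
```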
|
I have some downtime today. I will try to have a look if it's a quick thing. |
Pushed a PR here. Let me know if that solves it. |
Thanks so much for coming off your holiday to do some work! Per my slack comment, I don't know how to test this branch in the cht-conf script |
It doesn't release beta builds for now. If the code looks good to you, then the only way to test right now is to approve and merge the PR, which should release a patch version of the helm charts that cht-deploy will pick up when deploying. |
Despite it only being 7 lines of change, I'm not really in a position to know if these changes look good. I would very much like to be able to test this, or defer to someone else who knows what these changes actually do. I'll pursue the idea of running the changes manually via |
I tried this just now and got the same result:
|
Update: I have reproduced the issue and am debugging it right now. |
Thanks @henokgetachew! It sounds like you reproduced the issue, but to be clear - the issue wasn't that the password was wrong after starting CHT, the issue was that the data wasn't even showing up on disk. That is, we'd mount a 900GB volume to |
Correct. That's what I reproduced. |
Here's your subPath issue. Unfortunately, you picked to clone a project that had pre-existing medic-os data that was migrated from 3.x to 4.x in an edge scenario. We are stuck in helm-chart madness, and haven't gotten around to adding all the possible scenarios. In a cht-core 3.x upgrade to 4.x, we didn't use the helm chart every time due to time constraints and modified deployment templates directly. The main thing that needed to be modified was subPath. Essentially, on your clone deployment trials, couchDB was searching for data in a new directory and therefore starting a fresh install. In medic-os we kept couchdb data in a different directory than the helm charts use. Sorry this was a headache for you @mrjones-plip ! To review: not a permissions issue, but some new scenarios to add to our helm charts. Investigating production
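To make the subPath point concrete, one way to see which directory the deployed chart is actually pointing CouchDB at (namespace and resource names are assumptions):

```sh
# Show the data volume mount and its subPath in the rendered statefulset.
kubectl -n mrjones-dev get statefulsets -o yaml | grep -B3 -A1 'subPath:'
```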
Here is subPath from helm-charts; you can see it's a different directory than medic-os installs |
Yup, that's correct. I just pushed two PRs earlier fixing the issues (Helm-charts, cht-deploy). The change that needs to be made is basically what the mount uses as a path. That value for this project needs to be:
The syntax is shown below. So in your new values.yaml, after the PR gets merged:
Also make sure you use the new values.yaml from the new PR. The key that has changed that's relevant to you is below (i.e. we're now supporting pre-existing data for clustered couchdb too).
|
Thanks for the updates @Hareet and @henokgetachew ! Having just restored production copies from 3 production snapshots, I can attest that the paths to
That said - there's a lot of info above which makes it complicated to test here - I'm not exactly sure what my next steps are. I'd love to just update the happy path steps and follow them to ensure it works. Please feel free to update this PR's docs directly! I've done a bit of testing over on the troubleshooting script PR in hopes of moving everything along. |
Thanks so much for adding directly to this PR's docs content @henokgetachew ! I plan on getting to this early next week. |
Thanks for the commits @henokgetachew ! I'm getting a new error following the exact steps here:
I note that the new yaml file I started with has fields I wasn't expecting. Here's the file:

```yaml
project_name: "mrjones-dev"
namespace: "mrjones-dev" # e.g. "cht-dev-namespace"
chtversion: 4.5.2
upstream_servers:
docker_registry: "public.ecr.aws/medic"
builds_url: "https://staging.dev.medicmobile.org/_couch/builds_4"
upgrade_service:
tag: 0.32
couchdb:
password: "hunter2" # Avoid using non-url-safe characters in password
secret: "45e46ee4-540e-4c21-814f-8d0e6dd88f2d" # Any value, e.g. a UUID.
user: "medic"
uuid: "45e46ee4-540e-4c21-814f-8d0e6dd88f2d" # Any UUID
clusteredCouch_enabled: false
couchdb_node_storage_size: 900Mi
clusteredCouch:
noOfCouchDBNodes: 3
toleration: # This is for the couchdb pods. Don't change this unless you know what you're doing.
key: "dev-couchdb-only"
operator: "Equal"
value: "true"
effect: "NoSchedule"
ingress:
annotations:
groupname: "dev-cht-alb"
tags: "Environment=dev,Team=QA"
certificate: "arn:aws:iam::720541322708:server-certificate/2024-wildcard-dev-medicmobile-org-chain"
host: "mrjones.dev.medicmobile.org"
hosted_zone_id: "Z3304WUAJTCM7P"
load_balancer: "dualstack.k8s-devchtalb-3eb0781cbb-694321496.eu-west-2.elb.amazonaws.com"
environment: "remote" # "local", "remote"
cluster_type: "eks" # "eks" or "k3s-k3d"
cert_source: "eks-medic" # "eks-medic" or "specify-file-path" or "my-ip-co"
certificate_crt_file_path: "/path/to/certificate.crt" # Only required if cert_source is "specify-file-path"
certificate_key_file_path: "/path/to/certificate.key" # Only required if cert_source is "specify-file-path"
nodes:
node-1: "" # This is the name of the first node where couchdb will be deployed
node-2: "" # This is the name of the second node where couchdb will be deployed
node-3: "" # This is the name of the third node where couchdb will be deployed
k3s_use_vSphere_storage_class: "false" # "true" or "false"
vSphere:
datastoreName: "DatastoreName" # Replace with your datastore name
diskPath: "path/to/disk" # Replace with your disk path
couchdb_data:
preExistingDataAvailable: "true" #If this is false, you don't have to fill in details in local_storage or remote.
dataPathOnDiskForCouchDB: "storage/medic-core/couchdb/data" # This is the path where couchdb data will be stored. Leave it as data if you don't have pre-existing data.
partition: "0" # This is the partition number for the EBS volume. Leave it as 0 if you don't have a partitioned disk.
local_storage: #If using k3s-k3d cluster type and you already have existing data.
preExistingDiskPath-1: "/var/lib/couchdb1" #If node1 has pre-existing data.
preExistingDiskPath-2: "/var/lib/couchdb2" #If node2 has pre-existing data.
preExistingDiskPath-3: "/var/lib/couchdb3" #If node3 has pre-existing data.
ebs:
preExistingEBSVolumeID-1: "vol-05b22d15773376c76" # If you have already created the EBS volume, put the ID here.
preExistingEBSVolumeID-2: "vol-0123456789abcdefg" # If you have already created the EBS volume, put the ID here.
preExistingEBSVolumeID-3: "vol-0123456789abcdefg" # If you have already created the EBS volume, put the ID here.
preExistingEBSVolumeSize: "900Gi" # The size of the EBS volume.
```

|
@Hareet - we have a call scheduled this week to go over this PR. Confirming that the above error happens when I follow the latest steps, including the latest commits. Hope to resolve all this on our call! |
@mrjones-plip the patch error is because you'd need to first delete the pv. It should work if you delete the pv and re-run the command. |
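A sketch of that cleanup (the PV name is a placeholder; check the `kubectl get pv` output for the real one):

```sh
# Find the persistentVolume created for the CouchDB claim, then delete it.
kubectl get pv
kubectl delete pv <pv-name>
```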
Okay, I've figured it out. You got your deployment stuck in a weird state (a terminating persistentVolume) because it seems at some point you pushed up a default values file. Here is the persistentVolume that your current helm-chart was unable to overwrite and create:
Essentially, the finalizer is listed as the AWS EBS container storage interface, and it's been stuck waiting for AWS to terminate a volume. You can see this listed in the error output that you pasted, with two volumeHandle values shown. As Henok noted, we've got to delete the pv, but since it's in a terminating state, we have to patch it to stop using a finalizer.
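For reference, clearing the finalizer on a stuck PV typically looks like this (the PV name is a placeholder for the one shown in the error output):

```sh
# Drop the finalizers so the Terminating persistentVolume can actually go away,
# then delete it and re-run the deploy.
kubectl patch pv <pv-name> -p '{"metadata":{"finalizers":null}}'
kubectl delete pv <pv-name>
```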
|
Wow - thanks @henokgetachew and @Hareet for all the debugging. I finally finally FINALLY got it working. I have a few last tweaks to make. cc @1yuv as maybe the only other non-SRE, not-mrjones teammate who can use this!?
Thanks so much for all your effort!!
Description
Create section in EKS docs on how to clone an instance
License
The software is provided under AGPL-3.0. Contributions to this project are accepted under the same license.