Can't get move-shards or remove-node to work on Couch 3 #21

Open
Hareet opened this issue Aug 7, 2024 · 4 comments

@Hareet
Member

Hareet commented Aug 7, 2024

I'm trying to cluster an echis (cht-core 4.9, CouchDB 3) project from a single node to multiple nodes.

Can you clarify how getUrl, which calls preparedCouchUrl, populates and returns the couchClusterUrl variable in utils?

I suspect that /_node/_local/ is not being passed along to later functions like getDbs and updateDbMetadata.

Specifically, in CouchDB 3.x, getDbs needs to use getUrl without the additional /_node/_local, so that _all_dbs is hit at the root endpoint (http://medic:pw@couchdb-1-host:5984/_all_dbs). Then, when updateDbMetadata calls getUrl, it needs to add /_node/_local/ before iterating over each db it retrieved; otherwise I believe we are running into our current error:

Error while getting database metadata
An unexpected error occurred HTTPResponseError: HTTP Error Response: 404 Object Not Found
    at request (/app/src/utils.js:110:11)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async Object.getDbMetadata (/app/src/utils.js:142:12)
    at async moveShard (/app/src/move-shard.js:49:22)
    at async /app/bin/move-shards.js:10:7 {
  response: { error: 'not_found', reason: 'Database does not exist.' },
  status: 404
}
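
To spell out the URL split I mean, here's a rough sketch using the same host and credentials as the curl commands further down (the second shape is what I believe updateDbMetadata should be hitting):

# database list: root endpoint, no /_node/_local prefix
curl http://medic:[email protected]:5984/_all_dbs

# per-database shard metadata on CouchDB 3: needs the /_node/_local prefix
curl http://medic:[email protected]:5984/_node/_local/_dbs/medic-users-meta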

When I ran remove-node, it gave further evidence that the additional /_node/_local prefix was missing:

Error while getting node info HTTPResponseError: HTTP Error Response: 404 Object Not Found
    at request (/app/src/utils.js:110:11)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async Object.getNodeInfo (/app/src/utils.js:173:12)
    at async removeNode (/app/src/remove-node.js:11:20)
    at async /app/bin/remove-node.js:8:5 {
  response: { error: 'not_found', reason: 'Database does not exist.' },
  status: 404
}
An unexpected error occurred Error: Error while getting node info
    at Object.getNodeInfo (/app/src/utils.js:176:11)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async removeNode (/app/src/remove-node.js:11:20)
    at async /app/bin/remove-node.js:8:5

getNodeInfo runs into the same issue.

I'm able to remove the node by finding the rev-id and hitting _nodes directly.
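
Roughly, that manual removal looked something like this (sketch only; the node name and rev are placeholders):

# look up the node document to get its current rev
curl http://medic:[email protected]:5984/_node/_local/_nodes/couchdb@<node-to-remove>

# then delete it with that rev
curl -X DELETE "http://medic:[email protected]:5984/_node/_local/_nodes/couchdb@<node-to-remove>?rev=<rev>"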

When running move-shards and getting the first error pasted above as output, I'm still able to curl that db:

root@af:/# curl http://medic:[email protected]:5984/_dbs/medic-users-meta
{"error":"not_found","reason":"Database does not exist."}

root@af:/# curl http://medic:[email protected]:5984/_node/_local/_dbs/medic-users-meta
{"_id":"medic-users-meta","_rev":"13-f6a89db990","shard_suffix":[46,49,54,56,50,57,53,54,55,57,56],"changelog"
....

Am I missing something in the compose file I'm using for this repo that would force couch-migration to prepare CouchDB 3 URLs?

migration.yml

version: '3.9'

services:
  couch-migration:
    image: public.ecr.aws/medic/couchdb-migration:1.0.3
    networks:
      - cht-net
    environment:
      - "COUCH_URL=http://medic:[email protected]:5984"
      - "COUCH_CLUSTER_PORT=5984"

networks:
  cht-net:
    name: ${CHT_NETWORK:-cht-net}
    external: true
@Hareet
Member Author

Hareet commented Aug 7, 2024

here's /_membership from couchdb-2:

root@af:/# curl http://medic:[email protected]:5984/_membership
{"all_nodes":["[email protected]","[email protected]","[email protected]"],"cluster_nodes":["[email protected]","[email protected]","[email protected]"]}

@Hareet
Member Author

Hareet commented Aug 7, 2024

Now I'm confused as to how move-shards worked when renaming [email protected] to the single node [email protected]?

investigator@echisiclone:~/cht$ echo "$shard_matrix"
{
  "00000000-15555554": "[email protected]",
  "15555555-2aaaaaa9": "[email protected]",
  "2aaaaaaa-3ffffffe": "[email protected]",
  "3fffffff-55555553": "[email protected]",
  "55555554-6aaaaaa8": "[email protected]",
  "6aaaaaa9-7ffffffd": "[email protected]",
  "7ffffffe-95555552": "[email protected]",
  "95555553-aaaaaaa7": "[email protected]",
  "aaaaaaa8-bffffffc": "[email protected]",
  "bffffffd-d5555551": "[email protected]",
  "d5555552-eaaaaaa6": "[email protected]",
  "eaaaaaa7-ffffffff": "[email protected]"
}
For that shard_matrix, this worked:

sudo shard_matrix="$shard_matrix" docker compose -f migration.yml run couch-migration move-shards "$shard_matrix"

But now, when running with a shard-map:
investigator@echisiclone:~/cht$ echo "$shard_matrix"
{
  "00000000-15555554": "[email protected]",
  "15555555-2aaaaaa9": "[email protected]",
  "2aaaaaaa-3ffffffe": "[email protected]",
  "3fffffff-55555553": "[email protected]",
  "55555554-6aaaaaa8": "[email protected]",
  "6aaaaaa9-7ffffffd": "[email protected]",
  "7ffffffe-95555552": "[email protected]",
  "95555553-aaaaaaa7": "[email protected]",
  "aaaaaaa8-bffffffc": "[email protected]",
  "bffffffd-d5555551": "[email protected]",
  "d5555552-eaaaaaa6": "[email protected]",
  "eaaaaaa7-ffffffff": "[email protected]"
}
We get the errors that I described at the top of this issue.

I haven't changed anything in migration.yml (the docker compose template that contains the couchdb-migration declaration), so I'm at a loss as to why move-shards is no longer using CouchDB 3 URLs.

@dianabarsan
Member

We do have support for Couch 3 out of the box; it was added in v1.0.2, and we have e2e tests which move shards on Couch 3.

Can you check the logs in the migration container to see what the error looks like?

@dianabarsan
Member

It looks like the CouchDB 3 detection script gets confused when COUCH_CLUSTER_PORT is set to 5984. The detection works by process of elimination, but when the clustering port is set to 5984, one of the Couch 2 URLs that it tests responds, so the migration script thinks it is migrating Couch 2.

Either not setting COUCH_CLUSTER_PORT at all, or setting it to something other than 5984, should fix it.
I'll either document this or change the detection to be bespoke.
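
To illustrate the idea (this is only a sketch of the elimination logic, not the actual detection code): CouchDB 2 answers node-level requests on a separate cluster port (5986 by default), while CouchDB 3 only exposes them through /_node/_local on the main port. So a probe along these lines picks the wrong version as soon as the "cluster" port is the main port:

# sketch only: if anything answers on the configured cluster port, assume CouchDB 2
curl -sf "http://medic:pw@couchdb-1.local:${COUCH_CLUSTER_PORT:-5986}/" \
  && echo "cluster port answered -> treat as CouchDB 2" \
  || echo "no cluster port -> treat as CouchDB 3 (/_node/_local on 5984)"

With COUCH_CLUSTER_PORT=5984 the probe hits the main CouchDB port, succeeds, and the migration takes the CouchDB 2 code path.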
