
delayed response from mainnet nodes #2

Open
jvtezsure opened this issue Oct 29, 2020 · 8 comments

Comments

@jvtezsure

I've deployed the AWS CloudFormation setup for mainnet following https://assets.tqtezos.com/docs/run-a-node/1-run-a-node-intro/.
The deployment succeeded, but I'm facing a few issues with it:

  • With tezos-client, it takes up to 2-3 minutes to get a response to the '/chains/main/blocks/head' RPC call.
  • With the JS library (ConseilJS), it throws a connection timeout. I guess that's because of the delay on the nodes' part.
  • The nodes are all in sync.
  • When I SSH into the EC2 instances for mainnet, only 1 tezos-node out of 2 is running at any given time (1 on each EC2 instance).
  • I'm able to fetch balances for addresses using tezos-client, but it takes 1-2 minutes on average.
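To put a number on the latency described above, one way is to time the RPC call directly with curl. The endpoint below is a placeholder, not a real one; substitute your own NLB's DNS name:

```shell
# Time the head-block RPC call against the node endpoint.
# NODE_ENDPOINT is a placeholder; replace it with your NLB's DNS name.
NODE_ENDPOINT="http://your-nlb-dns-name.elb.eu-west-1.amazonaws.com:8732"
curl -s -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" \
  "$NODE_ENDPOINT/chains/main/blocks/head"
```

A healthy node typically answers this call in well under a second, so a multi-minute response points at the node or the path to it, not the client.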
@lyoungblood
Contributor

I'm sorry you're having a bad experience; you should get responses much faster.

  • The way the system is designed, it is normal to have 1 container/node running on each EC2 instance. If you go into the ECS console, click on the cluster, and click on Tasks, then look at each task, you should see that it is "Healthy: True" which indicates that it is fully in sync.
  • What instance type are you using for your nodes?
  • If you ssh into the node, type docker ps, then type docker exec -it <container hash> /bin/sh and run tezos-client inside the container, is it still slow?
  • How are you connecting to the nodes, through the NLB? You should use the DNS name of the NLB itself on port 8732 without TLS, so the endpoint would be something like http://nodes-prd-dub-node-NLB-493fe64fa2696fc6.elb.eu-west-1.amazonaws.com:8732, or tezos-client -A nodes-prd-dub-node-NLB-493fe64fa2696fc6.elb.eu-west-1.amazonaws.com -P 8732 (this is just an example; that endpoint won't be reachable for you).
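The "run tezos-client inside the container" check above can be sketched as follows (the container hash is a placeholder you get from `docker ps` on your own instance):

```shell
# List running containers to find the tezos-node container's hash.
docker ps

# Open a shell inside the container
# (replace <container-hash> with the value from `docker ps`).
docker exec -it <container-hash> /bin/sh

# Inside the container, query the local node directly, bypassing the NLB.
# If this is fast but the NLB endpoint is slow, the problem is in between.
tezos-client -A localhost -P 8732 rpc get /chains/main/blocks/head
```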

@jvtezsure
Author

Thanks a lot, Luke, for the quick reply.

  • My nodes and updater tasks are running successfully, but the tasks are 'UNHEALTHY'.
  • How should I debug this? Let me know what logs/configs I can share with you.

@lyoungblood
Contributor

Most likely the updater just hasn't had a chance to sync completely, so the nodes also launch without being completely in sync; after ~45 minutes of being out of sync/unhealthy, they will be terminated and replaced by new nodes. It can take weeks to sync from scratch. If you don't want to wait that long for your updater to sync, the best option is to copy the files from an existing S3 bucket maintained by the Tezos Foundation updaters.

There are instructions for doing this here: https://assets.tqtezos.com/docs/run-a-node/4-tezos-updater/ under the section titled "Initiate the data copy," but you should first shut down your updater by updating the CloudFormation stack and setting desired tasks to 0 so that it doesn't try to overwrite those files while you are copying them.

Once the files are copied, follow the next step "Update ECS tasks" to start your updater again. At that point, your node tasks should get the latest data from the updater and be able to start and get healthy/in sync in just a few minutes.
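The stop/copy/restart sequence above can be sketched from the CLI. The cluster, service, and bucket names here are placeholders, not the actual values from your stack (the linked docs have the exact commands); setting "desired tasks" in the CloudFormation console achieves the same thing as the ECS calls:

```shell
# 1. Stop the updater so it doesn't overwrite files during the copy.
#    (Cluster/service names are placeholders for your stack's values.)
aws ecs update-service --cluster mainnet-updater \
  --service updater --desired-count 0

# 2. Copy the chain data into your bucket
#    (source and destination bucket names are illustrative).
aws s3 sync s3://source-updater-bucket/node1 s3://my-updater-bucket/node1

# 3. Restart the updater once the copy completes.
aws ecs update-service --cluster mainnet-updater \
  --service updater --desired-count 1
```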

@jvtezsure
Author

I've deployed a node updater in my region after syncing data from the S3 bucket of another region, as mentioned in the documentation.
I've followed the documentation for each deployment.
I think the node updater is up and running, though it occasionally logs 'too few connections (18/19)'.
Let me know if you want to take a look at my setup. I can arrange a call or provide logs.

@lyoungblood
Contributor

Can you take a look at the S3 bucket and get the total size of the node1 or node2 folder to see if it looks similar to this? It should be about 79.2 GB.
[Screenshots: S3 console showing the total sizes of the node1 and node2 folders]

Also, could you send logs for the updater? Not the entire thing, but maybe the most recent loop (it runs in a loop where it shuts down every 30 minutes and copies the latest data to the S3 bucket). Having only 18-19 connections is probably fine, as long as it is syncing properly.
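For the size check, the S3 console numbers can also be reproduced from the CLI (bucket name is a placeholder):

```shell
# Report the object count and total size of the node1 folder.
# --summarize prints totals at the end of the listing.
aws s3 ls s3://my-updater-bucket/node1 --recursive \
  --human-readable --summarize
```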

@jvtezsure
Author

The total size for node1/node2 is 77.6 GB in the mainnet-updater bucket.
These are the latest logs for the updater:
mainnet-update.log

@lyoungblood
Contributor

Hi, I'm so sorry about the problems you are having. I figured out what is happening. A few months ago, we switched from --history-mode=normal to --history-mode=archive for the updaters that feed the S3 bucket you copied, but the tezos-updater and node-docker repositories didn't get updated with this change.

So what is happening is that your updater and nodes are never able to properly sync data from other nodes. I just merged 2 PRs that fix this; if you merge them into your repositories, your updater should be able to sync fully to chain tip. It may take several hours for the updater to fully sync, since the data you have is several days behind; copying the S3 data again would take less time.

The most important changes are here: https://github.com/tqtezos/tezos-updater/blob/testnet/start-updater.sh#L13 and here: https://github.com/tqtezos/node-docker/blob/testnet/start-tezos.sh#L12

If you just add the line --history-mode=archive \, that alone should fix the issue, but you may want to merge in all the changes.
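For illustration, this is the kind of launch line the fix touches. The exact scripts are in the linked repos; the surrounding flags and paths below are simplified assumptions, not the actual contents of start-tezos.sh or start-updater.sh:

```shell
# Simplified sketch of the node launch line after the fix:
# the added flag is --history-mode=archive, matching the archive-mode
# data in the S3 bucket the updater copies from.
tezos-node run \
  --history-mode=archive \
  --rpc-addr 0.0.0.0:8732 \
  --data-dir /var/run/tezos/node
```

The point of the flag is that a node started in the default history mode cannot consume data produced by archive-mode updaters, which is why the nodes could never finish syncing.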

@jvtezsure
Author

Thanks a lot, Luke, for the quick help.
I have updated all my forked repos with the latest code and am also syncing my S3 bucket. I'll keep you updated on the outcome.
