
delayed response from mainnet nodes #2

Open
jvtezsure opened this issue Oct 29, 2020 · 8 comments

Comments

@jvtezsure

I've deployed the AWS CloudFormation setup for mainnet following https://assets.tqtezos.com/docs/run-a-node/1-run-a-node-intro/.
The deployment succeeded, but I'm facing a few issues with it:

  • With tezos-client, it takes up to 2-3 minutes to get a response to the '/chains/main/blocks/head' RPC call.
  • With the JS library (ConseilJS), it throws a connection timeout. I guess that's because of the delay on the nodes' part.
  • The nodes are all in sync.
  • When I SSH into the EC2 instances for mainnet, only 1 tezos-node out of 2 is running at any given time (1 on each EC2 instance).
  • I'm able to fetch balances for addresses using tezos-client, but it takes 1-2 minutes on average.
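To put a number on the latency described above, one way is to time the RPC call directly with curl. The endpoint below is a placeholder, not a real one; substitute your own NLB's DNS name:

```shell
# Time the head-block RPC call against the node endpoint.
# NODE_ENDPOINT is a placeholder; replace it with your NLB's DNS name.
NODE_ENDPOINT="http://your-nlb-dns-name.elb.eu-west-1.amazonaws.com:8732"
curl -s -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" \
  "$NODE_ENDPOINT/chains/main/blocks/head"
```

A healthy node typically answers this call in well under a second, so a multi-minute response points at the node or the path to it, not the client.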
@lyoungblood
Contributor

I'm sorry you're having a bad experience; you should get responses much faster.

  • The way the system is designed, it is normal to have 1 container/node running on each EC2 instance. If you go into the ECS console, click on the cluster, and click on Tasks, then look at each task, you should see that it is "Healthy: True" which indicates that it is fully in sync.
  • What instance type are you using for your nodes?
  • If you ssh into the node, type docker ps, then type docker exec -it <container hash> /bin/sh and run tezos-client inside the container, is it still slow?
  • How are you connecting to the nodes, through the NLB? You should use the DNS name of the NLB itself on port 8732 without TLS, so the endpoint would be something like http://nodes-prd-dub-node-NLB-493fe64fa2696fc6.elb.eu-west-1.amazonaws.com:8732, or tezos-client -A nodes-prd-dub-node-NLB-493fe64fa2696fc6.elb.eu-west-1.amazonaws.com -P 8732 (this is just an example; that endpoint won't be reachable for you).
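The "run tezos-client inside the container" check above can be sketched as follows (the container hash is a placeholder you get from `docker ps` on your own instance):

```shell
# List running containers to find the tezos-node container's hash.
docker ps

# Open a shell inside the container
# (replace <container-hash> with the value from `docker ps`).
docker exec -it <container-hash> /bin/sh

# Inside the container, query the local node directly, bypassing the NLB.
# If this is fast but the NLB endpoint is slow, the problem is in between.
tezos-client -A localhost -P 8732 rpc get /chains/main/blocks/head
```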

@jvtezsure
Author

Thanks a lot, Luke, for the quick reply.

  • My nodes and updater tasks are running successfully, but the tasks are 'UNHEALTHY'.
  • How should I debug this? Let me know what logs/configs I can share with you.

@lyoungblood
Contributor

Most likely the updater just hasn't had a chance to sync completely, so the nodes also launch without being completely in sync; after ~45 minutes of being out of sync/unhealthy, they will be terminated and replaced by new nodes. It can take weeks to sync from scratch. If you don't want to wait that long for your updater to sync, the best option is to copy the files from an existing S3 bucket maintained by the Tezos Foundation updaters.

There are instructions for doing this here: https://assets.tqtezos.com/docs/run-a-node/4-tezos-updater/ under the section titled "Initiate the data copy," but you should first shut down your updater by updating the CloudFormation stack and setting desired tasks to 0 so that it doesn't try to overwrite those files while you are copying them.

Once the files are copied, follow the next step "Update ECS tasks" to start your updater again. At that point, your node tasks should get the latest data from the updater and be able to start and get healthy/in sync in just a few minutes.
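The stop/copy/restart sequence above can be sketched from the CLI. The cluster, service, and bucket names here are placeholders, not the actual values from your stack (the linked docs have the exact commands); setting "desired tasks" in the CloudFormation console achieves the same thing as the ECS calls:

```shell
# 1. Stop the updater so it doesn't overwrite files during the copy.
#    (Cluster/service names are placeholders for your stack's values.)
aws ecs update-service --cluster mainnet-updater \
  --service updater --desired-count 0

# 2. Copy the chain data into your bucket
#    (source and destination bucket names are illustrative).
aws s3 sync s3://source-updater-bucket/node1 s3://my-updater-bucket/node1

# 3. Restart the updater once the copy completes.
aws ecs update-service --cluster mainnet-updater \
  --service updater --desired-count 1
```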

@jvtezsure
Author

I've deployed a node updater in my region after syncing data from the S3 bucket of another region, as mentioned in the documentation.
I've followed the documentation for each deployment.
I think the node updater is up and running, though it occasionally logs 'too few connections (18/19)'.
Let me know if you want to take a look at my setup. I can arrange a call or provide logs.

@lyoungblood
Contributor

Can you take a look at the S3 bucket and get the total size of the node1 or node2 folder to see if it looks similar to this? It should be about 79.2 GB.
[Screenshots: S3 console showing the total sizes of the node1 and node2 folders]

Also, could you send logs for the updater? Not the entire thing, but maybe the most recent loop (it runs in a loop where it shuts down every 30 minutes and copies the latest data to the S3 bucket). Having only 18-19 connections is probably fine, as long as it is syncing properly.
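For the size check, the S3 console numbers can also be reproduced from the CLI (bucket name is a placeholder):

```shell
# Report the object count and total size of the node1 folder.
# --summarize prints totals at the end of the listing.
aws s3 ls s3://my-updater-bucket/node1 --recursive \
  --human-readable --summarize
```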

@jvtezsure
Author

The total size for node1/node2 is 77.6 GB in the mainnet-updater bucket.
These are the latest logs for the updater:
mainnet-update.log

@lyoungblood
Contributor

Hi, I'm so sorry about the problems you are having. I figured out what is happening. A few months ago, we switched from --history-mode=normal to --history-mode=archive for the updaters that feed the S3 bucket you copied, but the tezos-updater and node-docker repositories didn't get updated with this change.

So what is happening is that your updater and nodes are never able to properly sync data from other nodes. I just merged 2 PRs that fix this; if you merge them into your repositories, your updater should be able to sync fully to chain tip. It may take several hours for the updater to fully sync, since the data you have is several days behind; copying the S3 data again would take less time.

The most important changes are here: https://github.com/tqtezos/tezos-updater/blob/testnet/start-updater.sh#L13 and here: https://github.com/tqtezos/node-docker/blob/testnet/start-tezos.sh#L12

If you just add the line --history-mode=archive \, that alone should fix the issue, but you may want to merge in all the changes.
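For illustration, this is the kind of launch line the fix touches. The exact scripts are in the linked repos; the surrounding flags and paths below are simplified assumptions, not the actual contents of start-tezos.sh or start-updater.sh:

```shell
# Simplified sketch of the node launch line after the fix:
# the added flag is --history-mode=archive, matching the archive-mode
# data in the S3 bucket the updater copies from.
tezos-node run \
  --history-mode=archive \
  --rpc-addr 0.0.0.0:8732 \
  --data-dir /var/run/tezos/node
```

The point of the flag is that a node started in the default history mode cannot consume data produced by archive-mode updaters, which is why the nodes could never finish syncing.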

@jvtezsure
Author

Thanks a lot, Luke, for the quick help.
I have updated all my forked repos with the latest code and am also syncing my S3 bucket. I'll keep you updated on the outcome.
