Node Stability #194

Open
faddat opened this issue Apr 17, 2020 · 12 comments

@faddat
Contributor

faddat commented Apr 17, 2020

Platform:

  • Vultr.com
  • Ubuntu 18.04
  • 8GB RAM
  • 4 Cores
  • 160GB Storage

Configuration:

I followed these directions exactly:
https://github.com/virtualeconomy/v-systems/wiki/How-to-Install-V-Systems-Mainnet-Node

Problem:

The node stopped syncing at block 6126862, and the systemd log showed it handshaking with the same single peer over and over.

Resolution:

I ran:
systemctl restart vsys

and the node began to sync again. I also created a teeny tiny sync monitor tool:

while true
do
curl -X GET "http://127.0.0.1:9922/blocks/height" -H "accept: application/json"
sleep 1
done

Users may want to restrict their API to localhost for security reasons, and this allows them to easily monitor sync progress, albeit in a very basic way.
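
A slightly sturdier variant of that loop (just a sketch on my part: it assumes the systemd unit is named vsys, as set up by the wiki instructions, and that jq is installed) restarts the service automatically when the reported height stops advancing:

#!/usr/bin/env bash
# Sketch of a stall watchdog: restart vsys if the block height
# has not moved for 10 consecutive checks.
last=0
stalled=0
while true
do
  height=$(curl -s "http://127.0.0.1:9922/blocks/height" | jq -r '.height')
  echo "$(date -Is) height=${height}"
  if [ -n "${height}" ] && [ "${height}" = "${last}" ]; then
    stalled=$((stalled + 1))
  else
    stalled=0
  fi
  if [ "${stalled}" -ge 10 ]; then
    echo "height stuck at ${height}, restarting vsys"
    systemctl restart vsys
    stalled=0
  fi
  last=${height}
  sleep 60
done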

@faddat faddat changed the title Full stops syncing Full node stops syncing Apr 17, 2020
@faddat faddat added the bug Something isn't working label May 1, 2020
@faddat
Contributor Author

faddat commented May 2, 2020

I haven't been able to reproduce this, so I am closing it because I assume it was a one-off issue.

@faddat faddat closed this as completed May 2, 2020
@faddat faddat reopened this May 29, 2020
@faddat
Contributor Author

faddat commented May 29, 2020

Reopened due to user report

@ghost

ghost commented May 29, 2020

Running on a VPS with a Ryzen 7 8-core processor, 128 GB of memory, and 2 GB SSD RAID storage. It didn't seem to start syncing. It was restarted a couple of times, but then after a week I tried restarting it again and now it is syncing: http://95.217.121.243:9922/blocks/height

I can only speculate that something goes into a deadlock when the chain has no blocks yet.

Here is the log:

vsys.log.gz

OK, second issue: the log leaks the private key, around the eighth line from the top.

@Icermli
Collaborator

Icermli commented May 29, 2020

@stalker-loki I didn't find any clues in the log as to what causes syncing to stop. But I suspect it could sometimes be a problem of poor network conditions, especially when your machine is behind a large firewall, for example. We will keep checking for other reasons that may result in this.

I recommend optimizing the network and adding more peers.

For issue 2: if this is a supernode, I suggest you use a cold wallet to receive rewards. The wallet address in your log file is used only for minting; don't put any balance in it. If you generate a wallet first and then start the node, the private key won't show up in the log.

By the way, cold wallet minting is one of V Systems Chain's advantages. You just fill in the reward address in the config file with a cold wallet address; rewards then go to the cold wallet rather than the minting address, which keeps your funds safe.

@ghost

ghost commented May 30, 2020

That's this one, at the bottom of this section?

  miner {
    enable = yes
    offline = no
    quorum = 1
    generation-delay = 1s
    interval-after-last-block-then-generation-is-allowed = 120h
    tf-like-scheduling = no
    reward-address = "ARNzXkeSq81HbzxKLQ9hsAZUpEtvq6sgwj1"
  }

@Icermli
Collaborator

Icermli commented Jun 1, 2020

Yes, reward-address here is not necessarily the minting address. It could be any other address, for example a cold wallet address.

@Icermli Icermli closed this as completed Jun 10, 2020
@faddat faddat reopened this Jun 28, 2020
@faddat
Contributor Author

faddat commented Jun 28, 2020

I run several VSYS full nodes, some on mainnet, some on testnet. One of them is run on my behalf by @stalker-loki.

They stay up but stop syncing. Sometimes they reach a fully synced state and run for a while at the chain's current height; other times they just stop. All of my VSYS full nodes are in top-tier datacenters, specifically hetzner.de datacenters in Germany and Finland.

There are other blockchain nodes on those machines.

The other chains run very happily and without interruption.

Unfortunately, VSYS does not run happily and without interruption.

The Ethereum blockchain weighs in at 236 GB; on my Hetzner node, I'm able to sync it in about 12 hours.

VSYS weighs in at ~10 GB, yet syncing takes 24 hours. Additionally, in my experience VSYS full nodes aren't very stable.

This is the spec of my server.
[image: server specifications]

It is in a professionally run datacenter and it's highly unlikely that there are network issues. I run additional nodes on Hetzner machines and VSYS is the only one that frequently either stops syncing during initial standup, or stops advancing block height after it has already synced.

I've observed this issue with VSYS losing sync on machines at my home, as well, where I also run nodes for other blockchains, which do not lose sync.

Today, I was attempting to record a video on one of VSYS' unique concepts, the minting average balance.

Unfortunately, across several nodes, I was unable to discern whether the node had simply gone down or whether my API request, issued once every one to two seconds, had crashed it:

 for (( ; ; )); do sleep 1; curl -X GET "http://localhost:9922/addresses/balance/details/ARB1zND1qDuNHyVpX5pCVAZSYghGNZSfvAC" -H "accept: application/json"; done

I was just using that to show the increase in MAB. I think it makes a great visual and conversation starter, since liquid staking is such a hot topic right now. Interestingly, on VSYS we already have liquid staking, no derivatives needed.
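
For anyone who wants to reproduce that visual, here is roughly what I mean (a sketch only; it assumes jq is installed and uses the address from the output below), printing just the height and mintingAverage so the MAB change is easy to watch:

while true
do
  # Print only height and mintingAverage from the balance details endpoint
  curl -s "http://localhost:9922/addresses/balance/details/ARKYdc1pgGSefeCjNNkzWwnoKBVVMDYzex7" \
    | jq -r '"height=\(.height) mintingAverage=\(.mintingAverage)"'
  sleep 2
done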

Anyhow, I was unable to complete my video, because the node had crashed:

{
  "address" : "ARKYdc1pgGSefeCjNNkzWwnoKBVVMDYzex7",
  "regular" : 1300000000,
  "mintingAverage" : 1299974400,
  "available" : 1300000000,
  "effective" : 1300000000,
  "height" : 12146933
}{
  "address" : "ARKYdc1pgGSefeCjNNkzWwnoKBVVMDYzex7",
  "regular" : 1300000000,
  "mintingAverage" : 1299974400,
  "available" : 1300000000,
  "effective" : 1300000000,
  "height" : 12146933
}{
  "address" : "ARKYdc1pgGSefeCjNNkzWwnoKBVVMDYzex7",
  "regular" : 1300000000,
  "mintingAverage" : 1299974400,
  "available" : 1300000000,
  "effective" : 1300000000,
  "height" : 12146933
}{
  "address" : "ARKYdc1pgGSefeCjNNkzWwnoKBVVMDYzex7",
  "regular" : 1300000000,
  "mintingAverage" : 1299974400,
  "available" : 1300000000,
  "effective" : 1300000000,
  "height" : 12146933
}{
  "address" : "ARKYdc1pgGSefeCjNNkzWwnoKBVVMDYzex7",
  "regular" : 1300000000,
  "mintingAverage" : 1299974400,
  "available" : 1300000000,
  "effective" : 1300000000,
  "height" : 12146933
}

As you can see, the block height stopped advancing on June 15th at block 12146933. Looking at that node's address transaction logs, I did not see anything happen around the 15th. The last transactions that went through that node are here:

https://explorer.v.systems/address/ARKYdc1pgGSefeCjNNkzWwnoKBVVMDYzex7

That's the node that I used for the db put tutorial. But my db put transactions were on June 9th, so I'm forced to conclude that this is a case of garden-variety instability.

Next, we can look at another node that I run, this one on testnet 0.3. I wanted to try my MAB queries against it. First, from the Swagger UI in my web browser, I confirmed that it had stopped syncing. Then I logged into my machine at Hetzner and ran:

root@buildbox ~ # systemctl status vsys
● vsys.service - VSYS full node
   Loaded: loaded (/lib/systemd/system/vsys.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2020-06-11 13:32:23 CEST; 2 weeks 2 days ago
 Main PID: 30025 (java)
    Tasks: 141 (limit: 4915)
   CGroup: /system.slice/vsys.service
           └─30025 java -server -Xms128m -Xmx2g -XX:+UseG1GC -XX:+UseNUMA -XX:+AlwaysPreTouch -XX:+PerfDisableSharedMem -XX:+ParallelRefProcE

Jun 28 13:02:19 buildbox vsys[30025]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerConte
Jun 28 13:02:19 buildbox vsys[30025]:         at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
Jun 28 13:02:19 buildbox vsys[30025]:         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:1
Jun 28 13:02:19 buildbox vsys[30025]:         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
Jun 28 13:02:19 buildbox vsys[30025]:         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
Jun 28 13:02:19 buildbox vsys[30025]:         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
Jun 28 13:02:19 buildbox vsys[30025]:         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
Jun 28 13:02:19 buildbox vsys[30025]:         at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
Jun 28 13:02:19 buildbox vsys[30025]:         at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
Jun 28 13:02:19 buildbox vsys[30025]:         at java.lang.Thread.run(Thread.java:748)
root@buildbox ~ # systemctl restart vsys
root@buildbox ~ # docker run --rm -itd --name vsys -p 8822:9922 -v `pwd`/vsys-chain-data:/opt/coin/data mixhq/vsystems
7d1f65092ac29e55e5ad1d42166835abc67a3e6a43550f962ab37262371b4ba2
root@buildbox ~ # docker ps
CONTAINER ID        IMAGE                 COMMAND                  CREATED             STATUS              PORTS                                                                    NAMES
7d1f65092ac2        mixhq/vsystems        "java -jar v-systems…"   4 seconds ago       Up 2 seconds        9921/tcp, 0.0.0.0:8822->9922/tcp                                         vsys
bec164193a0b        condenser_condenser   "docker-entrypoint.s…"   7 days ago          Up 7 days           0.0.0.0:8080->8080/tcp, 0.0.0.0:35729->35729/tcp, 0.0.0.0:80->8080/tcp   condenser
root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 315323
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 316838
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 317646
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 318454
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 319262
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 320070
root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 334497
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 335422
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 336230
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 336832
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 337341
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 337846
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 338310
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 338755
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 339058
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 339370
}root@buildbox ~ # for (( ; ; )); do sleep 2; curl -X GET "http://95.217.196.54^C922/addresses/balance/details/ARKYdc1pgGSefeCjNNkzWwnoKBVVMDYzex7" -H "accept: application/json"; done
root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 361148
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 361884
}root@buildbox ~ # curl -X GET "http://95.217.196.54:9922/blocks/height" -H "accept: application/json"
{
  "height" : 362389

So, after restarting the service, the node began to sync again.

By this point, I'd gotten pretty curious about stability issues. I'd noticed stability problems on three machines.

So, I put up another mainnet node on a Hetzner server, in docker, on port 8822.

While I was writing this issue, it crashed in exactly the manner that this issue describes:

root@buildbox ~ #curl -X GET "http://95.217.196.54:8822/blocks/height" -H "accept: application/json"
{
  "height" : 558228
}root@buildbox ~ # curl -X GET "http://95.217.196.54:8822/blocks/height" -H "accept: application/json"
{
  "height" : 558228
}root@buildbox ~ # curl -X GET "http://95.217.196.54:8822/blocks/height" -H "accept: application/json"
{
  "height" : 558228
}root@buildbox ~ # curl -X GET "http://95.217.196.54:8822/blocks/height" -H "accept: application/json"
{
  "height" : 558228
}root@buildbox ~ # curl -X GET "http://95.217.196.54:8822/blocks/height" -H "accept: application/json"
{
  "height" : 558228
}root@buildbox ~ # curl -X GET "http://95.217.196.54:8822/blocks/height" -H "accept: application/json"
{
  "height" : 558228

API stays up, block height stops advancing.

This server at hetzner runs a full node for whaleshares, an application specific blockchain focused on social media.

[screenshot: whaleshares node, 2020-06-28 7:20 PM]

My whaleshares node never skips a beat, staying in sync with the chain's three-second block time.

Here's my ethereum node, again, same machine:

[screenshot: Ethereum node, 2020-06-28 7:22 PM]

This has been quite a long issue, but I decided that it was necessary to provide exhaustive evidence that at present there are at least two stability problems with VSYS nodes:

  1. Crashes during initial sync, which are usually resolved by restarting the node
  2. Crashes after initial sync, which are also usually resolved by restarting the node

I chose to compare with two of the other chains I run nodes for because, unfortunately, my VSYS nodes are the only ones that exhibit this particular issue. It is not restricted to machines running in Hetzner datacenters; it also affects VSYS nodes that I have attempted to run, at various times, on my personal Mac laptop, a home server, and a Vultr.com instance.

Log files are available on Slack.

@faddat faddat changed the title Full node stops syncing Node Stability Jun 28, 2020
@faddat
Contributor Author

faddat commented Jun 28, 2020

I thought that this might be helpful in troubleshooting. The node mentioned above, which is stuck at block 12146933, is showing this:

curl -X GET "http://95.217.121.243:9922/peers/all" -H "accept: application/json"

{
  "peers": [
    {
      "address": "/3.121.94.10:9921",
      "lastSeen": 9223372036854776000
    },
    {
      "address": "/13.52.40.227:9921",
      "lastSeen": 9223372036854776000
    },
    {
      "address": "/13.55.174.115:9921",
      "lastSeen": 9223372036854776000
    },
    {
      "address": "/13.113.98.91:9921",
      "lastSeen": 9223372036854776000
    }
  ]
}

Healthy VSYS mainnet nodes typically have 34 or 35 peers.

root@buildbox ~ # curl -X GET "https://wallet.v.systems/api/peers/all" -H "accept: application/json"
{
  "peers" : [ {
    "address" : "/13.52.96.166:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/35.177.188.74:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/138.197.196.78:9921",
    "lastSeen" : 1593266885446
  }, {
    "address" : "/54.95.22.119:9921",
    "lastSeen" : 1593349688901
  }, {
    "address" : "/3.121.94.10:9921",
    "lastSeen" : 1593349688178
  }, {
    "address" : "/13.115.105.184:9921",
    "lastSeen" : 1593349688909
  }, {
    "address" : "/3.104.62.227:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/34.196.27.234:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/3.17.31.9:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/165.227.64.201:9921",
    "lastSeen" : 1593266908452
  }, {
    "address" : "/52.60.124.131:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/18.191.26.101:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/35.180.246.64:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/45.76.155.8:9921",
    "lastSeen" : 1593349688889
  }, {
    "address" : "/52.35.120.221:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/54.92.10.151:9921",
    "lastSeen" : 1593349688190
  }, {
    "address" : "/13.52.40.227:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/3.16.244.131:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/54.69.23.204:9921",
    "lastSeen" : 9223372036854775807
  }, {
    "address" : "/3.17.187.179:9921",
    "lastSeen" : 1593349688147
  } ]
}
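
A possible mitigation, sketched only (the known-peers field name is taken from the wiki's example vsys.conf and I haven't verified it against the current schema), would be to seed extra known peers so a node isn't left handshaking with a single stale one:

vsys {
  network {
    # Illustrative seed peers taken from the peer lists above;
    # substitute currently reachable mainnet nodes.
    known-peers = ["13.55.174.115:9921", "13.52.40.227:9921", "3.121.94.10:9921"]
  }
}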

@ncying
Member

ncying commented Jun 28, 2020

This issue may be caused by the connections to known peers, and users may be able to solve it with the following strategies:

  1. Add more inbound/outbound connections in the conf (120 or larger):
network {
    # How many inbound network connections can be made
    max-inbound-connections = 120

    # Number of outbound network connections
    max-outbound-connections = 120
}
  2. Increase the outbound buffer size in the conf (64M or larger):
network {
    # Network buffer size
    outbound-buffer-size = 64M
}
  3. Use the jar directly (rather than the .deb service; the stability issue may be caused by the service logic). In this case, I used

java -jar v-systems-v***.jar vsys.conf

on more than 10 machines (last week), and all of them have synced well until now. So I guess the issue may be in the .deb service (it may need some extra network-related rights/resources).

  4. About the sync speed: sadly, it also took me 12 hours to sync the whole database. Here are the reasons:
    a. Block production speed: comparing heights, ETH has ~10M blocks, but V Systems actually has 12M blocks, and sync speed is related to the number of blocks, even though each ETH block may record more data than a V Systems block. Besides, such a comparison is not very meaningful; if you sync the Bitcoin network with the core wallet, you need days or weeks. Sync time also depends on the peers you connect to and how the network connections are designed.
    b. In order to let cheaper machines sync the whole database, we require fewer CPUs and less memory, which reduces some of the node's performance.
    c. Possible solution: we may provide copies of the database so that node operators can download a copy first and start the node from some height, but we still suggest syncing from height 0.

In conclusion, most such stability issues (for services in general, not only the V node) are caused by too little system resource being allocated. If users run other services with higher resource allocations on the same machine, services allocated fewer resources can be starved. To avoid this, one may force-allocate more resources to the service. For example, if you run the V node with java directly, you can give it more memory with

java -Xmx4096m -jar ***.jar **.conf

or allocate more threads with

java -Dscala.concurrent.context.maxExtraThreads=1024 -jar ***.jar **.conf
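
For users who keep the .deb systemd service, comparable resource headroom can in principle be forced through a drop-in override (only a sketch: the unit name vsys.service comes from the status output earlier in this thread, the file-descriptor limit is an example value, and the jar/conf paths are placeholders to adjust to the installed layout):

# /etc/systemd/system/vsys.service.d/override.conf  (created with: systemctl edit vsys)
[Service]
# Allow more open sockets/files for peer connections
LimitNOFILE=65535
# Relaunch with a larger heap; the paths below are placeholders, not the packaged defaults
ExecStart=
ExecStart=/usr/bin/java -Xmx4096m -jar /path/to/v-systems.jar /path/to/vsys.conf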

For now, I will not tag this issue as a bug.

@ncying ncying removed the bug Something isn't working label Jun 28, 2020
@faddat
Contributor Author

faddat commented Jun 29, 2020

If there is an issue in the .deb file that we ship to users that causes instability and downtime, it's a problem.

I mean, we can triage and label this however we'd like, but we are shipping the .deb to users, so it is very important that it either works properly or isn't shipped.

@faddat
Contributor Author

faddat commented Sep 11, 2020

This issue is #230, I imagine.

Unfortunately, #230 has not resolved it yet.

When syncing, nodes still connect to an ancient node and then stop advancing their block height.
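
One quick way to spot that state (a sketch; it relies only on the /peers/all endpoint shown above and assumes jq is installed) is to list peers whose lastSeen is the Long.MAX_VALUE sentinel rather than a real timestamp:

# Real lastSeen values are millisecond timestamps (~1.6e12 in 2020);
# stale entries report the Long.MAX_VALUE sentinel (~9.2e18).
curl -s "http://127.0.0.1:9922/peers/all" \
  | jq '.peers[] | select(.lastSeen > 1e15) | .address'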

@faddat faddat mentioned this issue Jan 9, 2021
@faddat
Contributor Author

faddat commented Feb 20, 2021

My node stopped advancing again.

[image: stalled block height]

@faddat faddat mentioned this issue Feb 20, 2021