Suggestion for improvement of documentation of Bridge Node [documentation] #1633

Open
cryptomolot opened this issue Jun 27, 2024 · 2 comments

Comments

@cryptomolot

Dear Celestia Foundation,

I'm Vladislav from cryptomolot and we want to provide transparency and clarify the reason for our downtime on 24/06/2024. This is the link to our node.

We are participating in the MOCHA-4 testnet and have applied for the delegation program.

We ran into the following problem: after the latest update, our server running a bridge node crashed whenever LimitNOFILE=1400000 was set in the service file.

Our server meets the recommended requirements. OS: Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-144-generic x86_64). Hardware: Memory: 16 GB RAM required (we have 32 GB); CPU: 6 cores required (we have 24); Disk: 10 TB SSD storage; Bandwidth: 1 Gbps download / 1 Gbps upload.

We had to reduce this value to LimitNOFILE=65535 to ensure system stability after the latest update.

By setting the soft limit to 1400000 in the service file, we exceeded the hard limit of the system, which I believe caused the crash.

Log from the system

INFO: task ksmd:311 blocked for more than 120 seconds
INFO: task scsi_eh_0:537 blocked for more than 120 seconds
INFO: task jbd2/sda2-8:702 blocked for more than 120 seconds
INFO: task NetworkManager:120743 blocked for more than 120 seconds
INFO: task xfsaild/sdb1:123535 blocked for more than 120 seconds

We don't think the disks were the cause: according to our monitoring, neither the disks nor the hardware in general were overloaded. The problem is more likely related to the processor or the operating system. The system's hard limit on file descriptors is 1048576. When we changed the soft limit in the service file, we exceeded that hard limit, and we think this is why the system crashed.
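For anyone who wants to compare these numbers on their own machine, here is a minimal Python sketch of the comparison described above. It reads the soft and hard open-file limits of the current process and tests whether a requested value fits under the hard limit. The function name `fits_under_hard_limit` is made up for illustration, and the limits it prints are those of the shell it runs in, which may differ from what systemd applies to the service.

```python
import resource


def fits_under_hard_limit(requested: int, hard: int) -> bool:
    """Return True if `requested` does not exceed the hard limit `hard`."""
    # An unlimited hard limit accepts any requested value.
    if hard == resource.RLIM_INFINITY:
        return True
    return requested <= hard


if __name__ == "__main__":
    # Limits of the current process (an assumption: the service may differ).
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"soft={soft} hard={hard}")
    # The two LimitNOFILE values mentioned in this issue.
    for requested in (65535, 1400000):
        verdict = "fits" if fits_under_hard_limit(requested, hard) else "exceeds the hard limit"
        print(f"{requested}: {verdict}")
```

With the hard limit of 1048576 reported here, 65535 fits and 1400000 does not.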

What could help: It may be worth revising this value in the documentation to a lower setting or providing more detailed system requirements. https://docs.celestia.org/nodes/systemd#celestia-bridge-node

Btw, we checked issue 1390 and saw that this was already discussed, but it is probably worth raising again given our case. Thanks a lot for your time!

Best regards,

Vladislav.

Let's keep in touch! My telegram is t.me/tommmymlt
[Four screenshots attached: Screenshot 27-06-2024 141431, photo_2024-06-27_13-42-24, photo_2024-06-27_13-42-26, photo_2024-06-27_13-43-20]

@cryptomolot cryptomolot changed the title Suggestion for improvement of documentation of Bridge Node Suggestion for improvement of documentation of Bridge Node [documentation] Jun 27, 2024
@jcstein
Member

jcstein commented Jul 1, 2024

gm @cryptomolot please feel free to make a PR to make updates for this issue yourself 🙌

cryptomolot added a commit to cryptomolot/docs that referenced this issue Jul 2, 2024
Modify the bridge node guide (LimitNOFILE) according to celestiaorg#1633
@Wondertan
Member

Hey @cryptomolot, I don't think the soft limit of 1400000 should exceed the default system-wide limit unless you change the default. My system's default is 9223372036854775807, which is much more than we recommend for the per-process limit.

Did you change the system-wide default? If so, what for?

P.S. You can check your system-wide default via cat /proc/sys/fs/file-max
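The same check can be scripted. The Python sketch below reads the system-wide ceiling from /proc/sys/fs/file-max and, as an addition not mentioned in this thread, the per-process ceiling from /proc/sys/fs/nr_open, which on Linux caps how high the RLIMIT_NOFILE hard limit can be raised. The usual default of nr_open is 1048576, the same number reported as the hard limit above, though whether that is the source of it on the affected machine is only a guess. The helper names `parse_limit` and `read_proc_int` are hypothetical.

```python
from pathlib import Path
from typing import Optional


def parse_limit(text: str) -> int:
    """Parse the single integer that these /proc files contain."""
    return int(text.strip())


def read_proc_int(path: str) -> Optional[int]:
    """Return the integer stored at `path`, or None if it cannot be read."""
    try:
        return parse_limit(Path(path).read_text())
    except (OSError, ValueError):
        # Missing file (e.g. non-Linux system) or unexpected contents.
        return None


if __name__ == "__main__":
    # file-max: system-wide limit; nr_open: ceiling for a per-process limit.
    for name in ("file-max", "nr_open"):
        print(name, read_proc_int(f"/proc/sys/fs/{name}"))
```

If nr_open turns out to be lower than the value set in LimitNOFILE, that would be worth reporting back here.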
