bug: node getting stuck and missing messages #2921

Open
gabrielmer opened this issue Jul 19, 2024 · 2 comments
Labels: bug (Something isn't working), effort/weeks (Estimated to be completed in a few weeks)

Comments

@gabrielmer
Contributor

Problem

When running simulations, @AlbertoSoutullo found that very often one node misses a large share of the messages.
According to the logs, the node gets stuck for approximately 50 seconds.

Here is the moment where it happens:

TRC 2024-07-19 11:17:17.632+00:00 waiting for data                           topics="libp2p pubsubpeer" tid=7 file=pubsubpeer.nim:196 conn=16U*zXAqnw:669a4b1af74509e547f73df3 peer=16U*zXAqnw closed=false
TRC 2024-07-19 11:18:07.995+00:00 running heartbeat                          topics="libp2p gossipsub" tid=7 file=behavior.nim:775 instance=140313549693008

Across runs, the stall consistently lasts about 50 seconds (50.363 s between the two log lines above). It also coincides with the node establishing many connections.
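To look for similar stalls in other runs, the gap between consecutive log lines can be measured automatically. The following is a minimal Python sketch, not part of nwaku: the function names and the 10-second threshold are assumptions made for illustration, and the regular expression only assumes the `LVL YYYY-MM-DD HH:MM:SS.mmm+00:00` prefix visible in the excerpt above.

```python
import re
from datetime import datetime

# Matches the level and timestamp prefix seen in the log excerpt,
# e.g. "TRC 2024-07-19 11:17:17.632+00:00 ...".
LINE_RE = re.compile(
    r"^[A-Z]{3} (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{2}:\d{2}) "
)


def parse_timestamp(line):
    """Return the line's timestamp, or None if the line has no such prefix."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return datetime.fromisoformat(match.group(1))


def find_stalls(lines, threshold_s=10.0):
    """Return (gap_seconds, line_before, line_after) for every gap above threshold_s.

    threshold_s is an arbitrary illustrative default, not a value from the issue.
    """
    stalls = []
    prev_ts, prev_line = None, None
    for line in lines:
        ts = parse_timestamp(line)
        if ts is None:
            continue  # skip lines without a recognizable timestamp
        if prev_ts is not None:
            gap = (ts - prev_ts).total_seconds()
            if gap > threshold_s:
                stalls.append((gap, prev_line, line))
        prev_ts, prev_line = ts, line
    return stalls


# The two consecutive lines from the excerpt above.
excerpt = [
    'TRC 2024-07-19 11:17:17.632+00:00 waiting for data                           topics="libp2p pubsubpeer" tid=7 file=pubsubpeer.nim:196 conn=16U*zXAqnw:669a4b1af74509e547f73df3 peer=16U*zXAqnw closed=false',
    'TRC 2024-07-19 11:18:07.995+00:00 running heartbeat                          topics="libp2p gossipsub" tid=7 file=behavior.nim:775 instance=140313549693008',
]

for gap, before, after in find_stalls(excerpt):
    print(f"{gap:.3f} s without log output")
```

Running the same function over a full node log (for example, one from logs.zip) would list every silent period above the threshold, which makes it easier to check whether the stall length is the same in each run.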

Impact

Critical

Expected behavior

Nodes shouldn't get stuck and should receive all messages.

Screenshots/logs

logs.zip

nwaku version/commit hash

branch: release/v0.31 commit b34008e. Also reproduced in v0.30.1

gabrielmer added the bug (Something isn't working) label on Jul 19, 2024
gabrielmer self-assigned this on Jul 19, 2024
@gabrielmer
Contributor Author

It seems to be related to the node establishing lots of connections in a short timespan.

I created an image containing only a workaround that allows a maximum of 20 connections per connectivity loop iteration, and the issue stopped reproducing.

(branch debug-extra-nim-libp2p-logs-over-v0.31.0-with-limited-connections)
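The idea behind the workaround can be sketched as follows. This is an illustrative Python model only, not the nwaku implementation (which is written in Nim): `run_connectivity_loop`, `dial`, and the peer names are hypothetical, and only the cap of 20 comes from the comment above.

```python
MAX_DIALS_PER_ITERATION = 20  # the cap used in the workaround described above


def run_connectivity_loop(candidates, dial, max_dials=MAX_DIALS_PER_ITERATION):
    """Dial candidates in batches of at most max_dials per loop iteration.

    `dial` is a hypothetical callback that opens one connection.
    Returns the number of dials performed in each iteration.
    """
    if max_dials < 1:
        raise ValueError("max_dials must be at least 1")
    pending = list(candidates)
    batch_sizes = []
    while pending:
        # Take at most max_dials peers; the rest wait for a later iteration.
        batch, pending = pending[:max_dials], pending[max_dials:]
        for peer in batch:
            dial(peer)
        batch_sizes.append(len(batch))
        # A real connectivity loop would wait for its next scheduled tick here.
    return batch_sizes


# Hypothetical example: 99 candidate peers, as one node in a 100-node mesh might see.
dialed = []
sizes = run_connectivity_loop([f"peer-{i}" for i in range(99)], dialed.append)
print(sizes)  # [20, 20, 20, 20, 19]
```

Spreading the dials over several iterations bounds how much connection setup work lands in any single iteration, which matches the observation that the stall appears when many connections are established in a short timespan.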

image

@AlbertoSoutullo

AlbertoSoutullo commented Jul 19, 2024

Thanks for creating the issue! c:

To add a bit more information: we are confident that this issue is not related to the lab.
What we know so far:

  • In the simulations, we injected 60 messages in 1 minute. The mesh is formed by 100 nwaku nodes.
  • Across the injected messages, there is one peer that misses ~75% of them.
  • Analyzing the logs, we double-checked that multiple nodes actually sent each message to the problematic node.
  • After the workaround mentioned in the previous comment, this no longer happens (tested more than 10 times; previously it happened in about 50% of the tests).
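The per-node check described in the list above can be expressed as a small script. This is a hedged sketch in Python: the function names, the 50% reporting threshold, and the input shape (a mapping from node name to received message IDs) are assumptions, and the data below is synthetic, shaped after the numbers in the list rather than taken from the actual runs.

```python
def miss_rates(sent_ids, received_by_node):
    """Return, per node, the fraction of sent messages that node never received."""
    sent = set(sent_ids)
    if not sent:
        raise ValueError("sent_ids must not be empty")
    return {
        node: 1 - len(sent & set(received)) / len(sent)
        for node, received in received_by_node.items()
    }


def nodes_above(rates, threshold=0.5):
    """Return the nodes whose miss rate is at least `threshold`, sorted by name."""
    return sorted(node for node, rate in rates.items() if rate >= threshold)


# Synthetic data: 60 messages, 100 nodes, one hypothetical node that only
# receives the first 15 messages (so it misses 45 of 60).
sent_ids = list(range(60))
received_by_node = {f"node-{i}": list(sent_ids) for i in range(100)}
received_by_node["node-42"] = sent_ids[:15]

rates = miss_rates(sent_ids, received_by_node)
print(nodes_above(rates))  # ['node-42']
print(rates["node-42"])    # 0.75
```

Comparing this output before and after a change, over repeated runs, gives a direct way to confirm whether the problematic node still appears.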

gabrielmer added the effort/weeks (Estimated to be completed in a few weeks) label on Jul 24, 2024
Projects
Status: In Progress