bug: node getting stuck and missing messages #2921

gabrielmer · 2024-07-19T14:04:46Z

Problem

When running simulations, @AlbertoSoutullo found that very often one node misses a large share of the messages.
Looking at the logs, the node appears to get stuck for approximately 50 seconds.

Here's the moment where it happens:

TRC 2024-07-19 11:17:17.632+00:00 waiting for data                           topics="libp2p pubsubpeer" tid=7 file=pubsubpeer.nim:196 conn=16U*zXAqnw:669a4b1af74509e547f73df3 peer=16U*zXAqnw closed=false
TRC 2024-07-19 11:18:07.995+00:00 running heartbeat                          topics="libp2p gossipsub" tid=7 file=behavior.nim:775 instance=140313549693008

Across runs, the stall consistently lasts almost exactly 50 seconds (about 50.4 s in the excerpt above). It also happens at a moment when the node is establishing lots of connections.
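
The stall can be located in a log file by looking for large gaps between consecutive timestamps. Below is a minimal Python sketch, assuming every relevant line starts with a three-letter level tag followed by a timestamp in the format shown in the excerpt. The names `parse_ts`, `find_stalls` and the 10-second threshold are illustrative choices, not part of nwaku or nim-libp2p.

```python
import re
from datetime import datetime

# Matches the leading level tag and timestamp, e.g.
# "TRC 2024-07-19 11:17:17.632+00:00 waiting for data ..."
TS_RE = re.compile(
    r"^\w{3} (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{2}:\d{2})"
)

def parse_ts(line):
    """Return the line's timestamp, or None if the line has none."""
    m = TS_RE.match(line)
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f%z")

def find_stalls(lines, threshold_s=10.0):
    """Return (gap_seconds, line_before, line_after) for each gap above the threshold."""
    stalls = []
    prev_ts, prev_line = None, None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip lines without a timestamp
        if prev_ts is not None:
            gap = (ts - prev_ts).total_seconds()
            if gap > threshold_s:
                stalls.append((gap, prev_line, line))
        prev_ts, prev_line = ts, line
    return stalls

# The two trace lines from the report, shortened after the message text.
log = [
    'TRC 2024-07-19 11:17:17.632+00:00 waiting for data topics="libp2p pubsubpeer" tid=7',
    'TRC 2024-07-19 11:18:07.995+00:00 running heartbeat topics="libp2p gossipsub" tid=7',
]

stalls = find_stalls(log)
for gap, before, after in stalls:
    print(f"{gap:.3f}s stall")  # prints "50.363s stall"
```

Run over a node's full log, this lists every silent window longer than the threshold, which makes it easy to compare stall lengths across runs.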

Impact

Critical

Expected behavior

Nodes shouldn't get stuck and should receive all messages.

Screenshots/logs

logs.zip

nwaku version/commit hash

branch: release/v0.31 commit b34008e. Also reproduced in v0.30.1


gabrielmer · 2024-07-19T14:10:33Z

It seems to be related to the node establishing lots of connections in a short timespan.

I created an image whose only change is a workaround that allows a maximum of 20 connections in each connectivity loop iteration, and the issue no longer reproduces.

(branch debug-extra-nim-libp2p-logs-over-v0.31.0-with-limited-connections)
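
For illustration, the idea behind the workaround (capping how many connections are attempted per connectivity loop iteration) can be sketched as follows. This is a Python sketch under stated assumptions, not the actual change on the branch: nwaku is written in Nim, and `connectivity_loop`, `dial`, `fake_dial` and `interval_s` are hypothetical names. Only the limit of 20 comes from this thread.

```python
import asyncio

MAX_DIALS_PER_ITERATION = 20  # limit used by the workaround in this thread

async def connectivity_loop(candidates, dial, interval_s=1.0):
    """Dial candidate peers in batches of at most MAX_DIALS_PER_ITERATION.

    `dial` is a hypothetical coroutine function that connects to one peer.
    Returns the size of each batch, for inspection.
    """
    pending = list(candidates)
    batch_sizes = []
    while pending:
        # Take at most the allowed number of peers for this iteration.
        batch = pending[:MAX_DIALS_PER_ITERATION]
        pending = pending[MAX_DIALS_PER_ITERATION:]
        await asyncio.gather(*(dial(peer) for peer in batch))
        batch_sizes.append(len(batch))
        if pending:
            # Leave the remaining peers for the next iteration.
            await asyncio.sleep(interval_s)
    return batch_sizes

async def fake_dial(peer):
    """Stand-in for a real connection attempt."""
    await asyncio.sleep(0)

batch_sizes = asyncio.run(connectivity_loop(range(50), fake_dial, interval_s=0))
print(batch_sizes)  # [20, 20, 10]
```

Spreading the dials over several iterations trades slower mesh formation for a bounded amount of connection work per iteration.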

AlbertoSoutullo · 2024-07-19T14:13:41Z

Thanks for creating the issue! c:

To add a bit more information: we are confident that this issue is not related to the lab.
The information that we have right now is:

In the simulations, we injected 60 messages in 1 minute. The mesh is formed by 100 nwaku nodes.
Across the run, there is one peer that misses ~75% of the messages.
Analyzing the logs, we double-checked that multiple nodes actually sent each message to the problematic node.
With the workaround mentioned in the previous comment, this no longer happens (tested more than 10 times; before, it happened in about 50% of the tests).
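
A check like the one described above can be automated by comparing the injected message IDs with what each node reports as received. The sketch below uses synthetic data shaped like the simulation (60 messages, 100 nodes), not the real results. `miss_ratios` and the node names are hypothetical, and it assumes message IDs can be extracted from each node's logs.

```python
def miss_ratios(sent_ids, received_by_node):
    """Return, per node, the fraction of sent messages that it did not receive."""
    sent = set(sent_ids)
    if not sent:
        return {node: 0.0 for node in received_by_node}
    return {
        node: 1 - len(sent & set(received)) / len(sent)
        for node, received in received_by_node.items()
    }

# Synthetic example: 99 healthy nodes and one node that received only 15 of 60.
sent = range(60)
received = {f"node-{i}": list(range(60)) for i in range(99)}
received["node-99"] = list(range(15))

ratios = miss_ratios(sent, received)
worst = max(ratios, key=ratios.get)
print(worst, ratios[worst])  # node-99 0.75
```

Sorting nodes by miss ratio makes a single outlier stand out immediately, which matches the pattern of one peer missing most messages while the rest of the mesh is healthy.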

gabrielmer added the "bug" label (Something isn't working) Jul 19, 2024

gabrielmer self-assigned this Jul 19, 2024

gabrielmer added the "effort/weeks" label (Estimated to be completed in a few weeks) Jul 24, 2024
