Description
Describe the bug
On a Zenoh network using peer-to-peer multicast scouting with gossip enabled, we experienced a sudden, rapid degradation in Zenoh's ability to communicate, to the point where it generally could not deliver messages at all. We discovered that the OAM message had grown very large, in the tens of kilobytes, and listed thousands of peer IDs. This caused connections to drop: the OAM message was sent as a blocking message, and if it failed to be delivered (or held up other blocking messages), Zenoh would close the connection. Our use case involves some long-running Zenoh sessions, and we believe the gossip subsystem was remembering every Zenoh peer the network had ever seen.
The only mechanism I see for removing peers from the gossip subsystem is `gossip::Network::remove_link()`, which can be called in `close_face()`. If I understand correctly, though, that can only remove directly connected peers, not peers heard about indirectly. It seems that even with `multihop` disabled, peers still pass on indirectly heard IDs, just not their locators (based on `propagate_locators()`). This explains why the OAM message became so large despite `multihop` being off.
Possible solutions might be to clear indirectly heard IDs, perhaps with a time-based expiration, or to not gossip anything about indirectly heard peers when `multihop` is disabled.
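To make the time-based expiration idea concrete, here is a minimal standalone sketch. It is not Zenoh code: `GossipTable`, `PeerId`, `observe`, and `expire` are hypothetical names invented for illustration, and the 60-second TTL is an arbitrary assumption. It only shows the general shape of remembering when each peer ID was last heard and dropping the stale ones.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a peer ID; not the real Zenoh ID type.
type PeerId = u64;

/// Toy gossip table (an assumption for illustration, not Zenoh's
/// `gossip::Network`) that records when each peer was last heard about.
struct GossipTable {
    ttl: Duration,
    last_seen: HashMap<PeerId, Instant>,
}

impl GossipTable {
    fn new(ttl: Duration) -> Self {
        Self { ttl, last_seen: HashMap::new() }
    }

    /// Record (or refresh) a peer learned directly or through gossip.
    fn observe(&mut self, id: PeerId, now: Instant) {
        self.last_seen.insert(id, now);
    }

    /// Drop every peer not heard about within `ttl`.
    /// Returns the number of entries removed.
    fn expire(&mut self, now: Instant) -> usize {
        let before = self.last_seen.len();
        let ttl = self.ttl;
        self.last_seen
            .retain(|_, seen| now.duration_since(*seen) < ttl);
        before - self.last_seen.len()
    }

    fn len(&self) -> usize {
        self.last_seen.len()
    }
}

fn main() {
    let start = Instant::now();
    let mut table = GossipTable::new(Duration::from_secs(60));

    // Peer 1 is heard once at the start; peer 2 is heard 50 s later.
    table.observe(1, start);
    table.observe(2, start + Duration::from_secs(50));

    // At 90 s, peer 1 is 90 s old (stale) and peer 2 is 40 s old (fresh).
    let removed = table.expire(start + Duration::from_secs(90));
    println!("removed {removed}, remaining {}", table.len());
}
```

One open design question the sketch does not settle: if neighbours keep relaying a stale ID, each relay would refresh its timestamp, so a real fix would probably need to refresh entries only on first-hand evidence of the peer.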
To reproduce
1. Create two Zenoh nodes, A and B, doing peer-to-peer multicast discovery with gossip enabled and `multihop` disabled. Their IDs should be random rather than fixed.
2. Alternate between restarting A and B, allowing discovery to succeed after each restart.
3. Monitor the number of peer IDs in the OAM message. It should grow indefinitely.
System info
- Ubuntu 22.04
- Zenoh 1.4.0