Failed nodes reported as alive #627

edsharp · 2021-03-15T16:34:06Z

I'm testing out serf and it seems like a great project. I've hit one issue in my testing so far.

I set up a 3-node serf cluster on 3 VMs. Every agent reports all three members as alive, as expected. Tags update. All seems healthy.

ed@agent-two:~$ serf members
agent-two    192.168.2.17:7946  alive
agent-three  192.168.2.18:7946  alive
agent-one    192.168.2.6:7946   alive

Then I disconnected the network adaptor of the agent-two VM. agent-one and agent-three report agent-two as failed, as I'd expect:

ed@agent-one:~$ serf members
agent-two    192.168.2.17:7946  failed
agent-three  192.168.2.18:7946  alive
agent-one    192.168.2.6:7946   alive

ed@agent-three:~$ serf members
agent-three  192.168.2.18:7946  alive
agent-two    192.168.2.17:7946  failed
agent-one    192.168.2.6:7946   alive

However, agent-two reports only agent-one as failed, where I'd have expected it to report both agent-one and agent-three as failed:

ed@agent-two:~$ serf members
agent-two    192.168.2.17:7946  alive
agent-three  192.168.2.18:7946  alive
agent-one    192.168.2.6:7946   failed
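A minimal sketch for checking this mismatch automatically, assuming the plain-text `serf members` layout shown above (name, address:port, status). The helper names `parse_members` and `peers_not_failed` are hypothetical and are not part of serf:

```python
def parse_members(output):
    """Map node name -> status from plain-text `serf members` output."""
    members = {}
    for line in output.splitlines():
        fields = line.split()
        # Member rows have at least name, address:port, status.
        # Shell prompts and blank lines do not match and are skipped.
        if len(fields) >= 3 and ":" in fields[1]:
            members[fields[0]] = fields[2]
    return members


def peers_not_failed(self_name, members):
    """Peers (excluding this node) that are not reported as failed."""
    return sorted(name for name, status in members.items()
                  if name != self_name and status != "failed")


# Output copied from agent-two while its network adaptor was disconnected.
SAMPLE = """\
ed@agent-two:~$ serf members
agent-two    192.168.2.17:7946  alive
agent-three  192.168.2.18:7946  alive
agent-one    192.168.2.6:7946   failed
"""

if __name__ == "__main__":
    print(peers_not_failed("agent-two", parse_members(SAMPLE)))
```

For a fully isolated node the expectation in this report is an empty list; with the output above the sketch returns `['agent-three']`.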

In the monitor logs on agent two I can see:

2021/03/15 16:14:33 [ERR] memberlist: Failed to send ping: write udp 192.168.2.17:7946->192.168.2.18:7946: sendto: network is unreachable
2021/03/15 16:14:34 [ERR] memberlist: Push/Pull with agent-three failed: dial tcp 192.168.2.18:7946: connect: network is unreachable

This suggests to me that agent-two knows it can't communicate with agent-three, so I'm wondering why it reports agent-three as alive rather than failed.
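To make the comparison between the logs and the member list concrete, here is a rough sketch that pulls the destination addresses out of `network is unreachable` errors. The regular expression is an assumption fitted only to the two log lines quoted above, and `unreachable_addresses` is a hypothetical helper, not a serf or memberlist API:

```python
import re

# Matches the destination of a failed UDP write ("SRC->DST: sendto: ...")
# or a failed TCP dial ("dial tcp DST: connect: ...").
UNREACHABLE = re.compile(
    r"(?:->|dial tcp )(\d+\.\d+\.\d+\.\d+):\d+: \w+: network is unreachable"
)


def unreachable_addresses(log_text):
    """Return the set of destination IPs that hit 'network is unreachable'."""
    return set(UNREACHABLE.findall(log_text))


# The two log lines from agent-two quoted in this report.
LOG = """\
2021/03/15 16:14:33 [ERR] memberlist: Failed to send ping: write udp 192.168.2.17:7946->192.168.2.18:7946: sendto: network is unreachable
2021/03/15 16:14:34 [ERR] memberlist: Push/Pull with agent-three failed: dial tcp 192.168.2.18:7946: connect: network is unreachable
"""

if __name__ == "__main__":
    print(unreachable_addresses(LOG))
```

Any address returned here that still shows as alive in `serf members` on the same node is a candidate for the behaviour described in this report.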

I believe this is a bug in the sense that agent-two falsely believes (or reports) it can communicate with at least one other node when in fact it is entirely isolated.

When I reconnect the network adaptor, after a few seconds all nodes report they are all alive again.

FWIW, if I instead disconnect the network adaptors on agent-one and agent-three and then check agent-two, it correctly reports both as failed.

My config is:

{
    "interface": "ens33",
    "encrypt_key": "7VpgMKMUFTTluPMNHz7YL1gMPDLPPpkETmec1hI/jkc=",
    "snapshot_path": "/opt/serf/serf.snapshot",
    "rejoin_after_leave": true,
    "profile": "lan",
    "log_level": "warn",
    "tags": {}
}

During this time, the snapshot file on agent-two looks like this:

alive: agent-two 192.168.2.17:7946
alive: agent-one 192.168.2.6:7946
alive: agent-three 192.168.2.18:7946
clock: 43
not-alive: agent-one
alive: agent-one 192.168.2.6:7946
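A small sketch that replays these snapshot records to see which members the file ends up recording as alive. It assumes records are applied in order and that a later record for a node overrides an earlier one; that is a reading of the file shown above rather than documented behaviour, and `replay_snapshot` is a hypothetical helper:

```python
def replay_snapshot(lines):
    """Replay snapshot records in order; return name -> address of alive nodes."""
    alive = {}
    for line in lines:
        kind, _, rest = line.strip().partition(": ")
        if kind == "alive":
            name, addr = rest.split()
            alive[name] = addr
        elif kind == "not-alive":
            alive.pop(rest.strip(), None)
        # Other records (for example "clock") carry no membership state here.
    return alive


# Snapshot contents from agent-two as posted in this report.
SNAPSHOT = """\
alive: agent-two 192.168.2.17:7946
alive: agent-one 192.168.2.6:7946
alive: agent-three 192.168.2.18:7946
clock: 43
not-alive: agent-one
alive: agent-one 192.168.2.6:7946
""".splitlines()

if __name__ == "__main__":
    for name, addr in sorted(replay_snapshot(SNAPSHOT).items()):
        print(name, addr)
```

Under these assumptions the file never contains a `not-alive` record for agent-three, which matches what `serf members` shows on agent-two.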

Does anyone have any suggestions?

Platform details

Ubuntu Focal 20.04.1 LTS VMs on VMware Fusion 12.1 Pro on macOS Big Sur 11.2.1, using NAT networking.


Austinpayne mentioned this issue on Aug 24, 2021 in hashicorp/memberlist#242, "Fix udp writes not causing nodes to become suspect" (merged).

kisunji assigned kyhavlov Sep 7, 2021
