missing FailureNotificationSignal during network failure when non-master is isolated #93

Open
glassfishrobot opened this issue Sep 5, 2009 · 12 comments

Comments

@glassfishrobot

I have been trying to use Shoal with my application. Assume a cluster-like setup
with four nodes running on four different systems. If one of the nodes suddenly
drops off the network and it is not the master node, I get three
FailureSuspectedSignals but not all three FailureNotificationSignals. If the
node that dropped off the network was the master node, I get three
FailureSuspectedSignals and three FailureNotificationSignals. Shouldn't it
behave the same way in the first case as well?

Environment

Operating System: All
Platform: Windows

Affected Versions

[current]

@glassfishrobot Commented
Reported by little_zizou

@glassfishrobot Commented
@jfialli said:
More information is necessary to research this issue.

1. Please describe what is meant by "out of network".
Is the network cable being pulled from the machine?

2. We have a shoal qe test that verifies that all failure notifications
are sent to surviving group members when a non-master node is killed
(via kill -9). The test verifies that all FAILURE notifications are sent.
(Tests are run on the main branch of shoal. Please confirm you are running
these tests.)

Please submit logs (by attaching a zip of log files) that illustrate your
issue. Logging at FINE level would be sufficient to follow what is occurring.

@glassfishrobot Commented
File: Scenario1.zip
Attached By: little_zizou

@glassfishrobot Commented
little_zizou said:
Created an attachment (id=20)
Scenario 1 Testcase

@glassfishrobot Commented
little_zizou said:

> More information is necessary to research this issue.
>
> 1. Please describe what is meant by "out of network".
> Is the network cable being pulled from the machine?

I disabled my LAN connection to simulate a network failure (similar to
unplugging the network cable).

> 2. We have a shoal qe test that verifies that all failure notifications
> ... Please confirm you are running these tests.)

I have not run the tests you mentioned; instead, I have written my own test
cases to verify that nodes join the network and that failure notifications are
processed.

TestCase Description:
We have 3 systems with 3 shoal clients (Client1, Client2 & Client3), each
client running on a different system with member token names as server1, server2
and server3 respectively, all in the same group.

Scenario 1:
server2 and server3 are started before server1. When we then disable the network
on server1, I see 2 FailureSuspectedSignals and 2 FailureNotificationSignals
(for server2 and server3 respectively), as expected.

Scenario 2:
Again we have 3 clients running on 3 different systems, but the member token
that joined the group as "server1" is renamed to "server5".

The systems are started just as in the previous case: server2 and server3 are
started before server5, and then the LAN is disabled on server5. This time I see
2 FailureSuspectedSignals, but only one FailureNotificationSignal.

I have attached the test sources and logs of both Scenario1 and Scenario2 for
your reference.

@glassfishrobot Commented
little_zizou said:
Created an attachment (id=21)
Scenario 2 TestCase

@glassfishrobot Commented
File: Scenario2.zip
Attached By: little_zizou

@glassfishrobot Commented
@jfialli said:
Issue understood.

The code in question is the handling of a detected master failure (masterFailed),
together with the fact that only the new master is allowed to announce the failure.

    private void assignAndReportFailure(final HealthMessage.Entry entry) {
        final boolean masterFailed = (masterNode.getMasterNodeID()).equals(entry.id);
        if (masterNode.isMaster() && masterNode.isMasterAssigned()) {
        } else if (masterFailed) {
            // remove the failed node
            LOG.log(Level.FINE, MessageFormat.format(
                    "Master Failed. Removing System Advertisement : {0} for master named {1}",
                    entry.id.toString(), entry.adv.getName()));
            manager.getClusterViewManager().remove(entry.adv);
            masterNode.resetMaster();
            masterNode.appointMasterNode();
            if (masterNode.isMaster() && masterNode.isMasterAssigned()) {
                LOG.log(Level.FINE, MessageFormat.format(
                        "Announcing Failure Event of {0} for name {1} ...",
                        entry.id, entry.adv.getName()));
                final ClusterViewEvent cvEvent =
                        new ClusterViewEvent(ClusterViewEvents.FAILURE_EVENT, entry.adv);
                masterNode.viewChanged(cvEvent);
            }
        }
        cleanAllCaches(entry);
    }

To avoid multiple reports of a FAILURE, typically only the master is allowed to
report a failure to the rest of the cluster. For Scenario 2, when the LAN is
disabled on "server5", the reporter of this issue expects failure events for
both "server2" and "server3". In the submitted logs for Scenario 2, heartbeat
failure detection does detect that both server2 and server3 have failed (from
server5's point of view; they are in fact both still running in their own
subnet). However, the failure of server2 is not reported, because server3 is
calculated to be the new master for server5, and server3 cannot communicate
with "server5" either. Hence the missing failure announcement for server2. When
"server3" is later detected to have failed, server5 is the sole instance left in
its subnet cluster, so it becomes the master and reports that server3 has failed.

To summarize, heartbeat failure detection is working correctly. The view that
"server5" has of the cluster is correct; only the failure notification for
"server2" is missing in this scenario. The reason for the missing notification
is in the code fragment included above.
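
The two outcomes can be reproduced with a small standalone model of the isolated
node. This is only a sketch and not Shoal code: the class IsolatedNodeSketch and
its methods are hypothetical, the election rule (the member whose name sorts
first becomes master) is an assumption based on the "naming" behavior described
in this thread, and the sketch assumes that server2 is the master before the
network is disabled and that its failure is detected first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy model of failure handling on the isolated node (hypothetical, not Shoal code). */
final class IsolatedNodeSketch {
    private final String self;
    private final TreeSet<String> view;              // members this node believes are alive
    private String master;                           // current master as seen by this node
    private final List<String> announced = new ArrayList<>();

    IsolatedNodeSketch(String self, String master, List<String> members) {
        this.self = self;
        this.master = master;
        this.view = new TreeSet<>(members);
    }

    /** Follows the branch structure of the quoted assignAndReportFailure fragment. */
    void onFailureDetected(String failed) {
        if (self.equals(master)) {
            // This node is the master: it announces the failure.
            view.remove(failed);
            announced.add(failed);
        } else if (failed.equals(master)) {
            // The master failed: drop it from the view and pick a new master.
            view.remove(failed);
            master = view.first();                   // ASSUMPTION: name that sorts first wins
            if (self.equals(master)) {
                announced.add(failed);               // only the new master announces
            }
        }
        // Otherwise this node is not master and the master is alive: nothing is announced.
    }

    /** Isolates the named node from server2 (assumed master) and server3. */
    static List<String> simulate(String isolated) {
        IsolatedNodeSketch node = new IsolatedNodeSketch(
                isolated, "server2", Arrays.asList(isolated, "server2", "server3"));
        node.onFailureDetected("server2");
        node.onFailureDetected("server3");
        return node.announced;
    }

    public static void main(String[] args) {
        System.out.println("Scenario 1: " + simulate("server1")); // [server2, server3]
        System.out.println("Scenario 2: " + simulate("server5")); // [server3]
    }
}
```

In this model "server1" wins the election against "server3" and so announces both
failures, while "server5" loses the election to the unreachable "server3" and
announces nothing until "server3" itself is detected as failed.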

@glassfishrobot Commented
@jfialli said:
Started analysis of the issue from the submitted logs.
See the previous comments, made when reassigning the issue to myself.

@glassfishrobot Commented
@jfialli said:
Summary of the issue reported for Scenario 2, submitted on Sept 14th.

When the LAN fails for a non-master instance of a group, the submitter of this
issue expects the isolated instance to receive a FAILURE notification for each
group member that is no longer reachable.

Shoal's heartbeat failure detection does detect that those instances are no
longer reachable; however, the isolated instance will not receive any failure
notifications about them until it finally makes itself the master node.

For the submitted Scenario 1, "server1" becomes the master node as soon as
"server2" is no longer reachable, so no FAILURE events are dropped in that
scenario. Even though "server1" was not the master before the LAN was disabled,
it is immediately made the master node for its subnet of one, because of naming
comparisons between it and the other server names remaining in the GMS group.
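
The effect of the naming comparison on the two scenarios can be illustrated in
isolation. The following is a hedged sketch: electMaster is a hypothetical
helper, and "the name that sorts first wins" is an assumption consistent with
the outcomes reported in this thread, not a statement of Shoal's actual
election algorithm.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical name-ordered election, for illustration only (not Shoal code). */
final class NameOrderElectionSketch {
    /** ASSUMPTION: the member whose name sorts first becomes master. */
    static String electMaster(List<String> remainingMembers) {
        return Collections.min(remainingMembers);
    }

    public static void main(String[] args) {
        // Scenario 1: the isolated node is "server1"; after the old master is
        // removed, its view still lists "server3".
        System.out.println(electMaster(Arrays.asList("server1", "server3"))); // server1

        // Scenario 2: the same node is renamed "server5".
        System.out.println(electMaster(Arrays.asList("server5", "server3"))); // server3
    }
}
```

Under this assumption the isolated node elects itself only when its own name
sorts before every other name left in its view, which matches the difference
between "server1" and "server5".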

@glassfishrobot Commented
This issue was imported from java.net JIRA SHOAL-93
