
Add a post-start script to check node and cluster health #89

Closed

Conversation

MatthiasWinzeler

fix for #87

@cf-gitbot
Member

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

@gerhard
Contributor

gerhard commented Jun 20, 2018

As the PR currently stands, post-start would exit with a non-zero status in the following scenarios:

  • the node that starts hasn't finished booting
  • the node boots, but it can't connect to its peers
  • mnesia is waiting to synchronise with other nodes, but those nodes are either down or simply not ready to synchronise
  • pause_minority is used and running nodes are in minority, meaning that rabbit will not start until a majority of nodes are running
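
For context, the cluster-level state involved in these scenarios can be inspected on any node that is at least reachable over the Erlang distribution. A minimal illustration, not part of the PR:

# Illustrative only: shows cluster membership, which nodes are currently
# running, and any detected partitions.
rabbitmqctl cluster_status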

I want to review the initial scenario that led to this specific issue before making specific proposals on how to improve this.

cc @michaelklishin for visibility, no action required

@MatthiasWinzeler
Author

Hi @gerhard
Thanks for getting back to me, much appreciated!

If I understand you correctly, there are some valid cases where a node can come up while the cluster is not yet fully healthy (which means we cannot run the cluster check in post-start), but BOSH still needs to move on to the next node.

What do you think about running only the node check (and not the cluster check) in post-start? Could this already prevent some error cases from propagating, while still allowing your (valid) use cases to pass? Or could there be some other health check we could do in post-start that is more than just process checking?
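
For illustration, the kind of post-start I have in mind would look roughly like this. This is only a sketch; the job path and script name below are assumptions rather than the actual files in this release:

#!/usr/bin/env bash
# Sketch of a post-start that verifies only the local node.
# The path and script name are assumptions for illustration.
set -euo pipefail

JOB_DIR=/var/vcap/jobs/rabbitmq-server

# Node-level check only: is the Erlang VM on this instance up and answering?
"${JOB_DIR}/bin/node-check" "post-start"

# The cluster-level check is deliberately omitted, because the cluster may
# legitimately be unhealthy while BOSH is still rolling the deployment.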

That said, I agree we should figure out why the nodes failed in the first place. Do you have a starting point for investigating this?
The log looks like a successful start to me:

=INFO REPORT==== 1-Jun-2018::22:32:52 ===
Starting RabbitMQ 3.6.15 on Erlang 19.3.6.4
Copyright (C) 2007-2018 Pivotal Software, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

=INFO REPORT==== 1-Jun-2018::22:32:52 ===
node           : rabbit@05938b8fd4a523bd529c136ca66c20c2
home dir       : /var/vcap/store/rabbitmq
config file(s) : /var/vcap/jobs/rabbitmq-server/bin/../etc/rabbitmq.config
cookie hash    : <redacted>
log            : /var/vcap/sys/log/rabbitmq-server/[email protected]
sasl log       : /var/vcap/sys/log/rabbitmq-server/[email protected]
database dir   : /var/vcap/store/rabbitmq/mnesia/db

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
Memory high watermark set to 3193 MiB (3348537344 bytes) of 7983 MiB (8371343360 bytes) total

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
Enabling free disk space monitoring

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
Disk free limit set to 12557MB

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
Limiting to approx 299900 file handles (269908 sockets)

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
FHC read buffering:  OFF
FHC write buffering: ON

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
Priority queues enabled, real BQ is rabbit_variable_queue

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
Starting rabbit_node_monitor

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
Management plugin: using rates mode 'basic'

=ERROR REPORT==== 1-Jun-2018::22:32:54 ===
Discarding message {'$gen_call',{<0.1142.0>,#Ref<0.0.1.2663>},{basic_consume,false,<7190.3369.3092>,<7190.10353.2914>,false,1,<<"amq.<redacted>">>,false,[],undefined}} from <0.1142.0> to <0.16824.3255> in an old incarnation (1) of this node (2)


=INFO REPORT==== 1-Jun-2018::22:32:54 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index

=ERROR REPORT==== 1-Jun-2018::22:32:54 ===
Discarding message {'$gen_call',{<0.1147.0>,#Ref<0.0.2.651>},{basic_consume,false,<7367.4494.76>,<7367.4485.76>,false,1,<<"amq.<discarded>">>,false,[],undefined}} from <0.1147.0> to <0.31357.75> in an old incarnation (1) of this node (2)


=WARNING REPORT==== 1-Jun-2018::22:32:54 ===
msg_store_persistent: recovery terms differ from present
rebuilding indices from scratch

... some more discarding and resyncing queue messages

=INFO REPORT==== 1-Jun-2018::22:33:16 ===
Mirrored queue 'amq.gen-<discarded>' in vhost '96621d9c-4ec4-4540-82c2-3ca8920b6409': Adding mirror on node rabbit@05938b8fd4a523bd529c136ca66c20c2: <0.21299.0>

=INFO REPORT==== 1-Jun-2018::22:33:20 ===
started TCP Listener on 0.0.0.0:5672

=INFO REPORT==== 1-Jun-2018::22:33:20 ===
rabbit on node rabbit@07632a494f7a8a8fb6b7f6af08f72f2c up

@gerhard
Contributor

gerhard commented Jun 22, 2018

I will have more time for this on Monday; some relevant context for that conversation:

@michaelklishin
Member

rabbitmqctl ping requires the Erlang distribution to be running.

@gerhard
Contributor

gerhard commented Jun 22, 2018

Thank you Michael.

If I recall correctly, the Erlang distribution has to be running for Mnesia to be able to synchronise, so it's a safe post-start check even if the rabbit app is not running.

@michaelklishin
Member

I agree that it is safe, but rabbitmqctl eval would test pretty much the same set of things. There's little harm in keeping both.
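
As an illustration, keeping both stays cheap in post-start. A sketch only, assuming a rabbitmqctl version that provides the ping command and the usual BOSH package path (both are assumptions here):

#!/usr/bin/env bash
# Sketch: two lightweight checks that only need the Erlang distribution,
# not a fully started rabbit application. The package path is illustrative.
set -euo pipefail

export PATH=/var/vcap/packages/rabbitmq-server/bin:$PATH

# Is the Erlang node reachable at all?
rabbitmqctl ping

# Evaluate a trivial expression on the node; exercises much the same path.
rabbitmqctl eval 'ok.'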

@nodo

nodo commented Jul 12, 2018

Hey @MatthiasWinzeler ,

Thanks for the PR, apologies for the late reply.

At the moment, node-check simply checks that the Erlang VM is running (see here). This is a safe check to do, but it does not guarantee that the rabbitmq app is up and healthy.

For that guarantee, we need to wait until Mnesia has successfully synchronised and then execute the cluster-check script (see here). The script, in turn, runs rabbitmqctl node_health_check. There is a bit of a naming clash here, but rabbitmqctl node_health_check verifies that the rabbit application is running, which is not possible until all the nodes are up.
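
Roughly, the two levels of checking boil down to the following. This is only a sketch of the idea, not the actual scripts in the release:

#!/usr/bin/env bash
# Illustrative contrast between the two checks discussed above.
set -euo pipefail

# node-check level: the Erlang VM process exists. A process-level check only;
# it says nothing about whether the rabbit application is healthy.
pgrep -f "beam.smp" > /dev/null

# cluster-check level: the rabbit application itself is running and healthy.
# This only passes once the node has joined the cluster and Mnesia has synced.
rabbitmqctl node_health_check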

Coming back to your PR, as far as I understand it, adding a cluster-check to post-start will fail because it needs all the nodes to be up and Mnesia to be synced in order to succeed. This would prevent BOSH from marking the new node as healthy, and therefore I think it is not an option.

The interesting part is to understand why, in your case, the node is failing; from the logs you posted, nothing seems to be wrong. Do you think you could provide other logs about this issue?

cc @gerhard

@MatthiasWinzeler
Author

@nodo Thanks, sounds reasonable to me.

Which additional logs would be of interest for investigation?

@nodo

nodo commented Jul 12, 2018

@MatthiasWinzeler - it would be interesting to check the post-deploy script logs, and the other nodes' logs as well. Could you provide them?

Also, are you OK with closing the PR for now and continuing the discussion on the issue you opened here?

@gerhard
Contributor

gerhard commented Jul 12, 2018

rabbit@*.log, including rabbit@*-sasl.log, are most helpful.

@MatthiasWinzeler
Author

@nodo Let's close this for now and continue in #87
