Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

By @frederikbosch: Peer discovery: add an option to opt out of registration #13201

Merged
merged 7 commits into from
Feb 3, 2025

Conversation

michaelklishin
Copy link
Member

This is #13194 by @frederikbosch with some final touches by me.

frederikbosch and others added 7 commits February 1, 2025 13:36
to match that used with Mnesia.

In the case of Mnesia, there are 10 retries
with a 30 second delay each.

For Khepri, a single timeout is used, so it
must be ten times as long.
As described in section 7.1 of filtex-v1.0-wd09:
> Impose a limit on the complexity of each filter expression.

Here, we hard code the maximum properties within a filter expression to 16.
There should never be a use case requiring to filter on more than 16
different properties.
Contains bug fix which would crash at-most-once dead lettering during
node restarts.

Less excessive debug logging around ra log.

Fix issue that could make leader transfers take 5s+ to complete.
for the backends that support it in the first place.

When forming a cluster, registration of the node
joining the cluster might be left to (container)
orchestration tools like Nomad or Kubernetes.

This PR add a new configuration option,
'cluster_formation.registration.enable',
which defaults to true.
When set to false node registration will be skipped.

There is at least one important advantage using a
tool such as Nomad (plus Consul) over the application
(RabbitMQ) doing the registration.

When the application is not stopped gracefully for
any reason, e.g. its OOM killed,
it cannot deregister the service/node.

This leaves behind an unlinked service entry in the registry.
This problem is fundamentally avoided by allowing
Nomad (or similar tools) to register the
node'service.

See #11233  #11045 for prior discussions.

Co-authored-by: Frederik Bosch <[email protected]>
@michaelklishin
Copy link
Member Author

Will have to cherry-pick the relevant commits for v4.0.x manually.

@frederikbosch
Copy link
Contributor

@michaelklishin That is fantastic. Is there an automatic built container for this PR that I can use to test in my cluster?

@michaelklishin
Copy link
Member Author

@frederikbosch there is an OCI build job but I'd need to investigate where it publishes to.

There will be development builds once this PR is merged, too.

@michaelklishin michaelklishin changed the title By @: Peer discovery: add an option to opt out of registration By @frederikbosch: Peer discovery: add an option to opt out of registration Feb 3, 2025
@frederikbosch
Copy link
Contributor

Found it, will test tomorrow.

@michaelklishin
Copy link
Member Author

@frederikbosch that's the right account/image but I am less sure about the tag. Sounds reasonable since it matches the branch name.

@michaelklishin michaelklishin merged commit 491971f into main Feb 3, 2025
271 checks passed
@michaelklishin michaelklishin deleted the rabbitmq-server-13194 branch February 3, 2025 21:18
michaelklishin added a commit that referenced this pull request Feb 4, 2025
@michaelklishin
Copy link
Member Author

Backported manually to v4.0.x for 4.0.6 in f74c13f.

@michaelklishin
Copy link
Member Author

@frederikbosch just to confirm, does this PR work as expected for you with Nomad?

@frederikbosch
Copy link
Contributor

@michaelklishin I will need to postpone my test until tomorrow, sorry for that.

@michaelklishin
Copy link
Member Author

Sure, no rush at all.

@frederikbosch
Copy link
Contributor

frederikbosch commented Feb 5, 2025

I did some initial tests, but I am not sure what is going on. I removed my previous comments. I need to do some more testing to get what's happening. I will come back to it.

@frederikbosch
Copy link
Contributor

@michaelklishin I did some extensive testing this morning.

  • cluster_formation.registration.enabled = false is working properly
  • Cluster is formed on boot with the services registered in Consul by Nomad.
  • However, then the rabbitmq_peer_discovery_consul_health_check_helper kicks in. It queries Consul periodically and looks if the node itself is listed in Consul and if not it executes maybe_re_register causing RabbitMQ to register itself in Consul even though registration is disabled.

So I started building RabbitMQ from source and disabled the timer inside the helper when registration is disabled. This worked perfectly and I can come up with a PR for this.

But then I wondered what if a nodes gets unhealthy. Suppose one of the nodes has a networking problem and Consul notifies this with its health check (also registered with Nomad), and therefore the node does not turn up when querying the list of nodes. But how do the nodes know this? With or without registration; at the moment a node never queries the list of nodes again after formation. Right? The helper only queries whether the own node is healthy and when it is not it executes maybe_re_register.

Is it even necessary that a node knows that peers are temporarily not in the service registry?

@frederikbosch
Copy link
Contributor

After reading the documentation on Node Health Checks and Forced Removal I think I can answer my own question.

Node cleanup is a separate module (rabbit_peer_discovery_cleanup.erl) that - if configured - already periodically cleans up nodes that are not listed in the registry by running its private method service_discovery_nodes(). There is no need to add that in the Consul module itself.

So therefore, I think disabling the health check when registration is disabled is safe. It's really only responsible for executing the maybe_re_register, which is not desired when registration is disabled. Hence, I have created a PR to disable the health check when registration is disabled.

@michaelklishin
Copy link
Member Author

@frederikbosch thank you for contributing, once again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants