
Peer discovery: disable consul health check helper when registration is disabled #13210

Merged
merged 1 commit into rabbitmq:main on Feb 6, 2025

Conversation

frederikbosch
Contributor

Proposed Changes

This is a follow-up PR for #13201. In that PR a new configuration option was added: cluster_formation.registration.enabled.

For Consul, this does work during boot, but rabbitmq_peer_discovery_consul_health_check_helper keeps querying Consul periodically to see whether the node is healthy. If it is not, it re-registers the node even though registration was disabled. Since the periodic health check has no function other than registering the node when it is unhealthy, the periodic check needs to be disabled when registration is disabled.
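
For reference, the relevant rabbitmq.conf fragment for this scenario looks roughly like this (a sketch; the Consul host value is a placeholder, and registration in Consul is expected to be handled externally, e.g. by Nomad):

# peer discovery via Consul, with registration handled outside of RabbitMQ
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_consul
cluster_formation.consul.svc = rabbitmq
cluster_formation.consul.host = consul.service.consul
cluster_formation.registration.enabled = false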

I have tested this PR with Consul and Nomad.

Types of Changes

  • Bug fix (non-breaking change)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it


michaelklishin added this to the 4.1.0 milestone on Feb 6, 2025
@michaelklishin
Member

Good catch. I have tested #13201 with the etcd mechanism, which does not have this periodic operation.

michaelklishin merged commit 9bc3d5e into rabbitmq:main on Feb 6, 2025
262 of 268 checks passed
michaelklishin added a commit that referenced this pull request Feb 6, 2025
Peer discovery: disable consul health check helper when registration is disabled (backport #13210)
@frederikbosch
Contributor Author

frederikbosch commented Feb 6, 2025

@michaelklishin If I may ask a question here: I am finalizing my Nomad job file and have added a health check to the service like this. It is the stage 2 check from the docs.

service {
  address_mode = "driver"
  name = "rabbitmq"
  port = 5672
  tags = ["amqp", "${attr.unique.hostname}"]

  meta {
    erlang-node-name = "rabbit@${attr.unique.hostname}.rabbitmq.service.consul"
  }

  check {
    type    = "script"
    command = "rabbitmq-diagnostics"
    args    = ["-q", "status"]
    interval = "60s"
    timeout  = "5s"
    initial_status = "passing"
  }
}

Now, a problem arises when I add the check.

  • There is no problem if I first submit the job without the check and then, while my cluster is already running, resubmit the job with the check. All fine: Nomad adds the check to the Consul service and starts probing.
  • But if I start my cluster with the check included, I run into a (for me at least) weird problem: Error when reading /var/lib/rabbitmq/.erlang.cookie: eacces.

So my guess is that rabbitmq-diagnostics -q status already runs before the node has finished booting, and that an early execution of the diagnostics interferes with the boot. I don't see any option within Nomad/Consul to postpone the check.

Do you have any idea what might be the cause here?

PS: After I have this all running successfully, I will update the Nomad RabbitMQ integration so others can benefit from our efforts here.

@michaelklishin
Member

michaelklishin commented Feb 6, 2025

Your understanding may be correct. If the cookie file does not exist, CLI tools or RabbitMQ itself will generate a random value to seed it. This can create a race condition.

I would simply switch to the HTTP API for health checks; for every health check command there is an HTTP API endpoint.
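
For example, something along these lines (a sketch, assuming the management plugin on its default port 15672 and a dedicated monitoring user; the credentials are placeholders):

# -f makes curl exit non-zero on HTTP errors, so this works as a health check command
curl -fsS -u monitoring:s3cret http://localhost:15672/api/overview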

@frederikbosch
Contributor Author

Thanks, I was not aware of that yet.

@michaelklishin
Member

Plus, CLI tools support providing the cookie on the command line as an argument or via an environment variable. I don't think we document, or at least promote, those options.
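
For illustration only, that would look roughly like the following; treat the exact flag and variable names as assumptions and verify them against the CLI reference for the version in use:

# cookie taken from an environment variable (the value is a placeholder)
RABBITMQ_ERLANG_COOKIE="placeholder-cookie" rabbitmq-diagnostics -q status

# or passed as a command line argument
rabbitmq-diagnostics --erlang-cookie "placeholder-cookie" -q status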

@michaelklishin
Member

michaelklishin commented Feb 7, 2025

@frederikbosch go to http://localhost:15672/api/ (well, with a different hostname for a K8S/Nomad deployed instance) and search for "health".

The closest endpoint there to rabbitmq-diagnostics status is GET /api/overview. For a check that runs, say, once every 30-60 seconds, that sounds perfectly fine.

If you need the simplest health check possible that tells you that the node is up and that's it, a rabbitmq-diagnostics ping equivalent, that's likely going to be GET /api/nodes or even GET /api/whoami.
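
In the Nomad service stanza above, that could look roughly like this (a sketch; it assumes a statically known monitoring user, and the Authorization value is a placeholder):

check {
  address_mode = "driver"
  type     = "http"
  port     = 15672
  path     = "/api/whoami"
  interval = "30s"
  timeout  = "5s"

  header {
    # base64 of "monitoring:s3cret" (placeholder credentials)
    Authorization = ["Basic bW9uaXRvcmluZzpzM2NyZXQ="]
  }
}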

@frederikbosch
Contributor Author

frederikbosch commented Feb 7, 2025

@michaelklishin That seems like the best option then. However, that brings a new problem: when the username and password come from Vault, as in my case, we cannot use Vault variables (yet) inside a health check. So the health checks always return 401.

With that problem in mind, I considered going back to the rabbitmq-diagnostics command and fixing the eacces problem as suggested in this discussion, by using a start interval for the health check. However, Consul does not support a start interval and will not in the future.

So another option was to add a monitoring user on boot, or directly after it, via a definitions file with a fixed username and password that do not come from Vault. I decided not to do that because it would make the setup very complicated.

In the end I came up with a custom health check script that checks whether the Erlang cookie is there and, if it is not, considers the node to be passing the health check. My end result is the following Nomad job file. Once v4.0.6 is released, I will suggest changing the Nomad RabbitMQ integration to this setup.

job "rabbitmq" {
  datacenters = ["dc"]

  vault {
    policies = ["rabbitmq"]
  }

  group "cluster" {
    count = 3

    constraint {
      operator  = "distinct_hosts"
      value     = "true"
    }

    update {
      max_parallel = 1
    }

    migrate {
      max_parallel     = 1
      health_check     = "checks"
      min_healthy_time = "5s"
      healthy_deadline = "60s"
    }

    task "rabbit" {
      driver = "docker"

      config {
        image = "pivotalrabbitmq/rabbitmq:sha-49eeb5e435167c18a2abb3832ccc6b6051f4a283-otp27"
        hostname = "${attr.unique.hostname}.rabbitmq.service.consul"

        volumes = [
          "local/rabbitmq.conf:/etc/rabbitmq/rabbitmq.conf",
          "local/enabled_plugins:/etc/rabbitmq/enabled_plugins",
          "local/healthcheck.sh:/usr/bin/healthcheck.sh",
          "secrets/cert.key:/etc/ssl/cert.key",
          "secrets/cert.crt:/etc/ssl/cert.crt",
          "secrets/chain.crt:/etc/ssl/chain.crt",
        ]
      }

      env {
        RABBITMQ_USE_LONGNAME="true"
        RABBITMQ_MANAGEMENT_SSL_CACERTFILE=""
        RABBITMQ_MANAGEMENT_SSL_CERTFILE=""
        RABBITMQ_MANAGEMENT_SSL_KEYFILE=""
      }

      template {
        data        =  <<EOH
[rabbitmq_management,
rabbitmq_federation,
rabbitmq_federation_management,
rabbitmq_shovel,
rabbitmq_shovel_management,
rabbitmq_mqtt,
rabbitmq_stomp,
rabbitmq_peer_discovery_consul].
        EOH
        destination = "local/enabled_plugins"
      }

      template {
        data        = <<EOH
log.console.level = debug
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_consul
cluster_formation.node_cleanup.only_log_warning = true
cluster_formation.registration.enabled = false
cluster_formation.consul.svc = rabbitmq
cluster_formation.consul.host = {{env "attr.unique.network.ip-address"}}
cluster_formation.consul.acl_token = {{with secret "secret/consul/creds/rabbitmq"}}{{.Data.token}}{{end}}
        EOH
        destination = "local/rabbitmq.conf"
      }

      template {
        data        = <<EOH
#!/usr/bin/env bash
set -e
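# If the cookie file is not there yet, the node is still booting; report the
# check as passing instead of running diagnostics too early (see the eacces
# discussion above).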

if [ ! -f /var/lib/rabbitmq/.erlang.cookie ]; then
    echo "Waiting for system to boot"
    exit 0
fi

rabbitmq-diagnostics -q status
        EOH
        destination = "local/healthcheck.sh"
      }

      template {
        data = <<EOH
      # Empty lines are also ignored
      {{with secret "secret/kv1/infrastructure/v1/rabbitmq/v1"}}
        RABBITMQ_DEFAULT_USER = "{{.Data.USER}}"
        RABBITMQ_DEFAULT_PASS = "{{.Data.PASS}}"
        RABBITMQ_ERLANG_COOKIE = "{{.Data.ERLANG_COOKIE}}"
      {{end}}
      EOH

        destination = "secrets/file.env"
        env         = true
      }

      service {
        address_mode = "driver"
        name = "rabbitmq"
        port = 5672
        tags = ["amqp", "${attr.unique.hostname}"]

        meta {
          erlang-node-name = "rabbit@${attr.unique.hostname}.rabbitmq.service.consul"
        }

        check {
          type    = "script"
          command = "bash"
          args    = ["/usr/bin/healthcheck.sh"]
          interval = "60s"
          timeout  = "5s"
        }
      }

      service {
        address_mode = "driver"
        name = "rabbitmq-nomad-management"
        port = 15672
        tags = [
          "http",
          "traefik.enable=true",
          "traefik.http.routers.rabbitmq-http.rule=Host(`mq.myhost`)",
          "traefik.http.routers.rabbitmq-http.entrypoints=web",
          "traefik.http.routers.rabbitmq-http.middlewares=redirect-https@file",
          "traefik.http.routers.rabbitmq-https.rule=Host(`mq.myhost`)",
          "traefik.http.routers.rabbitmq-https.entrypoints=websecure",
          "traefik.http.routers.rabbitmq-https.tls=true",
        ]

        check {
          address_mode = "driver"
          type = "http"
          path = "/"
          port = 15672
          interval = "10s"
          timeout = "2s"
        }
      }

      resources {
        memory = 1000
      }
    }
  }
}
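
For completeness, assuming the job file is saved as rabbitmq.nomad (a placeholder name), it can be submitted with:

nomad job run rabbitmq.nomad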
