
Peer discovery: disable consul health check helper when registration is disabled #13210

Merged
merged 1 commit into rabbitmq:main on Feb 6, 2025

Conversation

frederikbosch
Contributor

Proposed Changes

This is a follow-up PR for #13201. In that PR a new configuration option was added: cluster_formation.registration.enabled.

For Consul, this does work during boot, but rabbitmq_peer_discovery_consul_health_check_helper keeps querying Consul periodically to see whether the node is healthy. If it is not, it re-registers the node even though registration was disabled. Since the periodic health check has no function other than registering the node when it is unhealthy, the periodic check needs to be disabled when registration is disabled.
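
For reference, the relevant rabbitmq.conf fragment for this scenario looks roughly like this (a sketch; the Consul host value is a placeholder, and registration in Consul is expected to be handled externally, e.g. by Nomad):

# peer discovery via Consul, with registration handled outside of RabbitMQ
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_consul
cluster_formation.consul.svc = rabbitmq
cluster_formation.consul.host = consul.service.consul
cluster_formation.registration.enabled = false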

I have tested this PR with Consul and Nomad.

Types of Changes

  • Bug fix (non-breaking change)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it


michaelklishin added this to the 4.1.0 milestone on Feb 6, 2025
@michaelklishin
Member

Good catch. I have tested #13201 with the etcd mechanism, which does not have this periodic operation.

michaelklishin merged commit 9bc3d5e into rabbitmq:main on Feb 6, 2025
262 of 268 checks passed
michaelklishin added a commit that referenced this pull request Feb 6, 2025
Peer discovery: disable consul health check helper when registration is disabled (backport #13210)
@frederikbosch
Contributor Author

frederikbosch commented Feb 6, 2025

@michaelklishin If I may ask a question here: I am finalizing my Nomad job file and have added a health check to the service like this. It is the stage 2 check from the docs.

service {
  address_mode = "driver"
  name = "rabbitmq"
  port = 5672
  tags = ["amqp", "${attr.unique.hostname}"]

  meta {
    erlang-node-name = "rabbit@${attr.unique.hostname}.rabbitmq.service.consul"
  }

  check {
    type    = "script"
    command = "rabbitmq-diagnostics"
    args    = ["-q", "status"]
    interval = "60s"
    timeout  = "5s"
    initial_status = "passing"
  }
}

Now, a problem arises when I add the check.

  • There is no problem if I first submit the job without the check and then, while my cluster is already running, resubmit the job with the check. All fine: Nomad adds the check to the Consul service and starts probing.
  • But if I start my cluster with the check included, I run into a (for me at least) weird problem: Error when reading /var/lib/rabbitmq/.erlang.cookie: eacces.

So my guess is that rabbitmq-diagnostics -q status already runs before the node has finished booting, and that an early execution of the diagnostics interferes with the boot. I don't see any option within Nomad/Consul to postpone the check.

Do you have any idea what might be the cause here?

PS: After I have this all running successfully, I will update the Nomad RabbitMQ integration so others can benefit from our efforts here.

@michaelklishin
Member

michaelklishin commented Feb 6, 2025

Your understanding may be correct. If the cookie file does not exist, CLI tools or RabbitMQ itself will generate a random value to seed it. This can create a race condition.

I would simply switch to the HTTP API for health checks; for every health check command there is an HTTP API endpoint.
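
For example, something along these lines (a sketch, assuming the management plugin on its default port 15672 and a dedicated monitoring user; the credentials are placeholders):

# -f makes curl exit non-zero on HTTP errors, so this works as a health check command
curl -fsS -u monitoring:s3cret http://localhost:15672/api/overview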

@frederikbosch
Contributor Author

Thanks, I was not aware of that yet.

@michaelklishin
Member

Plus, CLI tools support providing the cookie on the command line as an argument or via an environment variable. I don't think we document, or at least promote, those options.
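
For illustration only, that would look roughly like the following; treat the exact flag and variable names as assumptions and verify them against the CLI reference for the version in use:

# cookie taken from an environment variable (the value is a placeholder)
RABBITMQ_ERLANG_COOKIE="placeholder-cookie" rabbitmq-diagnostics -q status

# or passed as a command line argument
rabbitmq-diagnostics --erlang-cookie "placeholder-cookie" -q status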

@michaelklishin
Member

michaelklishin commented Feb 7, 2025

@frederikbosch go to http://localhost:15672/api/ (well, with a different hostname for a K8S/Nomad deployed instance) and search for "health".

The closest endpoint there to rabbitmq-diagnostics status is GET /api/overview. For a check that runs, say, once every 30-60 seconds, that sounds perfectly fine.

If you need the simplest health check possible that tells you that the node is up and that's it, a rabbitmq-diagnostics ping equivalent, that's likely going to be GET /api/nodes or even GET /api/whoami.
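
In the Nomad service stanza above, that could look roughly like this (a sketch; it assumes a statically known monitoring user, and the Authorization value is a placeholder):

check {
  address_mode = "driver"
  type     = "http"
  port     = 15672
  path     = "/api/whoami"
  interval = "30s"
  timeout  = "5s"

  header {
    # base64 of "monitoring:s3cret" (placeholder credentials)
    Authorization = ["Basic bW9uaXRvcmluZzpzM2NyZXQ="]
  }
}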

@frederikbosch
Contributor Author

frederikbosch commented Feb 7, 2025

@michaelklishin That seems like the best option then. However, that brings a new problem: when the username and password come from Vault, as in my case, we cannot use Vault variables (yet) inside a health check. So the health checks always return 401.

With that problem in mind, I considered going back to the rabbitmq-diagnostics command and fixing the eacces problem as suggested in this discussion, by using a start interval for the health check. However, Consul does not support a start interval and will not in the future.

So another option was to add a monitoring user on boot, or directly after it, via a definitions file with a fixed username and password that do not come from Vault. I decided not to do that because it would make the setup very complicated.

In the end I came up with a custom health check script that checks whether the Erlang cookie is there and, if it is not, considers the node to be passing the health check. My end result is the following Nomad job file. Once v4.0.6 is released, I will suggest changing the Nomad RabbitMQ integration to this setup.

job "rabbitmq" {
  datacenters = ["dc"]

  vault {
    policies = ["rabbitmq"]
  }

  group "cluster" {
    count = 3

    constraint {
      operator  = "distinct_hosts"
      value     = "true"
    }

    update {
      max_parallel = 1
    }

    migrate {
      max_parallel     = 1
      health_check     = "checks"
      min_healthy_time = "5s"
      healthy_deadline = "60s"
    }

    task "rabbit" {
      driver = "docker"

      config {
        image = "pivotalrabbitmq/rabbitmq:sha-49eeb5e435167c18a2abb3832ccc6b6051f4a283-otp27"
        hostname = "${attr.unique.hostname}.rabbitmq.service.consul"

        volumes = [
          "local/rabbitmq.conf:/etc/rabbitmq/rabbitmq.conf",
          "local/enabled_plugins:/etc/rabbitmq/enabled_plugins",
          "local/healthcheck.sh:/usr/bin/healthcheck.sh",
          "secrets/cert.key:/etc/ssl/cert.key",
          "secrets/cert.crt:/etc/ssl/cert.crt",
          "secrets/chain.crt:/etc/ssl/chain.crt",
        ]
      }

      env {
        RABBITMQ_USE_LONGNAME="true"
        RABBITMQ_MANAGEMENT_SSL_CACERTFILE=""
        RABBITMQ_MANAGEMENT_SSL_CERTFILE=""
        RABBITMQ_MANAGEMENT_SSL_KEYFILE=""
      }

      template {
        data        =  <<EOH
[rabbitmq_management,
rabbitmq_federation,
rabbitmq_federation_management,
rabbitmq_shovel,
rabbitmq_shovel_management,
rabbitmq_mqtt,
rabbitmq_stomp,
rabbitmq_peer_discovery_consul].
        EOH
        destination = "local/enabled_plugins"
      }

      template {
        data        = <<EOH
log.console.level = debug
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_consul
cluster_formation.node_cleanup.only_log_warning = true
cluster_formation.registration.enabled = false
cluster_formation.consul.svc = rabbitmq
cluster_formation.consul.host = {{env "attr.unique.network.ip-address"}}
cluster_formation.consul.acl_token = {{with secret "secret/consul/creds/rabbitmq"}}{{.Data.token}}{{end}}
        EOH
        destination = "local/rabbitmq.conf"
      }

      template {
        data        = <<EOH
#!/usr/bin/env bash
set -e
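# If the cookie file is not there yet, the node is still booting; report the
# check as passing instead of running diagnostics too early (see the eacces
# discussion above).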

if [ ! -f /var/lib/rabbitmq/.erlang.cookie ]; then
    echo "Waiting for system to boot"
    exit 0
fi

rabbitmq-diagnostics -q status
        EOH
        destination = "local/healthcheck.sh"
      }

      template {
        data = <<EOH
      # Empty lines are also ignored
      {{with secret "secret/kv1/infrastructure/v1/rabbitmq/v1"}}
        RABBITMQ_DEFAULT_USER = "{{.Data.USER}}"
        RABBITMQ_DEFAULT_PASS = "{{.Data.PASS}}"
        RABBITMQ_ERLANG_COOKIE = "{{.Data.ERLANG_COOKIE}}"
      {{end}}
      EOH

        destination = "secrets/file.env"
        env         = true
      }

      service {
        address_mode = "driver"
        name = "rabbitmq"
        port = 5672
        tags = ["amqp", "${attr.unique.hostname}"]

        meta {
          erlang-node-name = "rabbit@${attr.unique.hostname}.rabbitmq.service.consul"
        }

        check {
          type    = "script"
          command = "bash"
          args    = ["/usr/bin/healthcheck.sh"]
          interval = "60s"
          timeout  = "5s"
        }
      }

      service {
        address_mode = "driver"
        name = "rabbitmq-nomad-management"
        port = 15672
        tags = [
          "http",
          "traefik.enable=true",
          "traefik.http.routers.rabbitmq-http.rule=Host(`mq.myhost`)",
          "traefik.http.routers.rabbitmq-http.entrypoints=web",
          "traefik.http.routers.rabbitmq-http.middlewares=redirect-https@file",
          "traefik.http.routers.rabbitmq-https.rule=Host(`mq.myhost`)",
          "traefik.http.routers.rabbitmq-https.entrypoints=websecure",
          "traefik.http.routers.rabbitmq-https.tls=true",
        ]

        check {
          address_mode = "driver"
          type = "http"
          path = "/"
          port = 15672
          interval = "10s"
          timeout = "2s"
        }
      }

      resources {
        memory = 1000
      }
    }
  }
}
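
For completeness, assuming the job file is saved as rabbitmq.nomad (a placeholder name), it can be submitted with:

nomad job run rabbitmq.nomad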
