consul service deregistration fails if the mesos slave is down #98
And if you do not set the consul-timeout option (it defaults to 30 seconds) and you have plenty of such hosts, the service refresh cycle might take 10+ minutes to finish. I think it would be useful to split registration and de-registration, each into its own thread.
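A minimal sketch of that idea in Go (the channel and service names are illustrative, not mesos-consul's actual internals): registration and de-registration each get their own goroutine, so a de-registration call that hangs on a dead agent no longer blocks new registrations.

```go
package main

import (
	"log"
	"sync"
)

// Service stands in for whatever record the Mesos state poll produces.
type Service struct{ ID string }

func main() {
	toRegister := make(chan Service, 128)
	toDeregister := make(chan Service, 128)

	var wg sync.WaitGroup
	wg.Add(2)

	// Registration worker: slowness here no longer waits on de-registration.
	go func() {
		defer wg.Done()
		for s := range toRegister {
			log.Printf("registering %s", s.ID) // would call the Consul API here
		}
	}()

	// De-registration worker: a dead node can still burn up to consul-timeout
	// per call, but only on this goroutine.
	go func() {
		defer wg.Done()
		for s := range toDeregister {
			log.Printf("deregistering %s", s.ID) // would call the Consul API here
		}
	}()

	// The Mesos sync loop would feed both channels; closing them here just ends the demo.
	toRegister <- Service{ID: "web-1234"}
	toDeregister <- Service{ID: "web-0999"}
	close(toRegister)
	close(toDeregister)
	wg.Wait()
}
```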
There's another option to consider. I run a local Consul agent (in client/agent mode) which talks to a multi-master cluster (the recommended HA setup). So ideally the registration/deregistration should flow through the local agent. (Maybe another reason to allow overriding/forcing a specific Consul IP address when running mesos-consul.)
If you push registration through a local agent then all of the services are linked with that agent. So if that agent becomes unavailable, all of the services that were registered through that agent become unavailable.
The Consul docs have a guide specifically for the case where a local agent owns the service(s): https://www.consul.io/docs/guides/external.html
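For reference, a hedged sketch of the pattern that guide describes, using the official Go client (github.com/hashicorp/consul/api): registering through the catalog endpoint records the service against a node name of your choosing rather than tying it to the local agent that made the call. The node name, address, and service details below are illustrative assumptions.

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Talk to any reachable agent's HTTP API (defaults to 127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	reg := &api.CatalogRegistration{
		Node:    "mesos-tasks", // synthetic node name, an assumption for this example
		Address: "10.0.0.10",   // host actually running the task
		Service: &api.AgentService{
			ID:      "web-1234",
			Service: "web",
			Port:    31000,
		},
	}

	// A catalog-level registration is not owned by the agent that submitted it.
	if _, err := client.Catalog().Register(reg, nil); err != nil {
		log.Fatal(err)
	}
}
```

The trade-off is that such entries bypass the local agent's anti-entropy sync, which is why the guide treats them as "external" services.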
When I originally wrote
I see. I had a Mesos slave cluster on top of EC2 spot instances (which could get killed at any time). This would leave quite a bit of a mess in Consul. Checking the log would show that mesos-consul was unable to contact the agent on the killed nodes.
I have the same issue: mesos-consul is so busy trying to deregister services that it doesn't register new ones. |
Let's summarize: I'm still convinced that my first idea, separating registration and de-registration into different threads, is the best way to manage this.
I can give a real example :) I think just splitting the work into two tasks, or doing some async handling, would improve performance greatly.
BTW, a simple solution: would it be possible to add an option to
I could run two of them, one would only register, the other only de-register :)
Has anyone solved this? I'm having this issue as well when Mesos loses a slave. |
It seems that it tries to connect to the node that used to run the task in order to deregister the service from it.
What if you run a multi-master Consul cluster with local agents running on each box, and you lose the slave that was running the service? The service is never deregistered, and when running with log level DEBUG it seems that it will try forever to deregister the service by continuing to hit the old IP address where the service was running.
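One possible workaround, as a sketch only (this is not what mesos-consul currently does, and the node/service IDs are illustrative): remove the stale entry through the catalog endpoint of any reachable agent instead of the dead node's own agent.

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Any reachable agent in the cluster; the dead slave's agent is never contacted.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	dereg := &api.CatalogDeregistration{
		Node:      "lost-slave-node", // Consul node name of the dead Mesos slave (illustrative)
		ServiceID: "web-1234",        // the stale service instance to drop (illustrative)
	}

	if _, err := client.Catalog().Deregister(dereg, nil); err != nil {
		log.Fatal(err)
	}
}
```

Since the node is gone, there is no live agent left to re-register the entry via anti-entropy, so the catalog change should stick.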