consul service deregistration fails if the mesos slave is down #98
And if you do not set the consul-timeout option (it defaults to 30 seconds) and you have plenty of such hosts, the service refresh cycle might take 10+ minutes to finish. I think it would be useful to split registration and de-registration, each into its own thread.
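A minimal sketch of that idea in Go (the channel and service names are illustrative, not mesos-consul's actual internals): registration and de-registration each get their own goroutine, so a de-registration call that hangs on a dead agent no longer blocks new registrations.

```go
package main

import (
	"log"
	"sync"
)

// Service stands in for whatever record the Mesos state poll produces.
type Service struct{ ID string }

func main() {
	toRegister := make(chan Service, 128)
	toDeregister := make(chan Service, 128)

	var wg sync.WaitGroup
	wg.Add(2)

	// Registration worker: slowness here no longer waits on de-registration.
	go func() {
		defer wg.Done()
		for s := range toRegister {
			log.Printf("registering %s", s.ID) // would call the Consul API here
		}
	}()

	// De-registration worker: a dead node can still burn up to consul-timeout
	// per call, but only on this goroutine.
	go func() {
		defer wg.Done()
		for s := range toDeregister {
			log.Printf("deregistering %s", s.ID) // would call the Consul API here
		}
	}()

	// The Mesos sync loop would feed both channels; closing them here just ends the demo.
	toRegister <- Service{ID: "web-1234"}
	toDeregister <- Service{ID: "web-0999"}
	close(toRegister)
	close(toDeregister)
	wg.Wait()
}
```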
There's another option to consider. I run a local Consul agent (in client/agent mode) which talks to a multi-master cluster (the recommended HA setup). So ideally the registration/deregistration should flow through the local agent. (Maybe another reason to allow overriding/forcing a specific Consul IP address when running mesos-consul.)
If you push registration through a local agent then all of the services are linked with that agent. So if that agent becomes unavailable, all of the services that were registered through that agent become unavailable.
The Consul docs have a guide specifically for the case where a local agent owns the service(s): https://www.consul.io/docs/guides/external.html
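For reference, a hedged sketch of the pattern that guide describes, using the official Go client (github.com/hashicorp/consul/api): registering through the catalog endpoint records the service against a node name of your choosing rather than tying it to the local agent that made the call. The node name, address, and service details below are illustrative assumptions.

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Talk to any reachable agent's HTTP API (defaults to 127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	reg := &api.CatalogRegistration{
		Node:    "mesos-tasks", // synthetic node name, an assumption for this example
		Address: "10.0.0.10",   // host actually running the task
		Service: &api.AgentService{
			ID:      "web-1234",
			Service: "web",
			Port:    31000,
		},
	}

	// A catalog-level registration is not owned by the agent that submitted it.
	if _, err := client.Catalog().Register(reg, nil); err != nil {
		log.Fatal(err)
	}
}
```

The trade-off is that such entries bypass the local agent's anti-entropy sync, which is why the guide treats them as "external" services.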
When I originally wrote
I see. I had a Mesos slave cluster on top of EC2 spot instances (which could get killed at any time). This would leave quite a bit of a mess in Consul. Checking the log would show that mesos-consul was unable to contact the agent on the killed nodes.
I have the same issue: mesos-consul is so busy trying to deregister services that it doesn't register new ones. |
Let's summarize: I'm still convinced that my first idea, separating registration and de-registration into different threads, is the best way to manage this.
I can give a real example :) I think just splitting the work into two tasks, or doing some async handling, would improve performance greatly.
BTW, a simple solution: would it be possible to add an option to
I could run two of them, one would only register, the other only de-register :)
Has anyone solved this? I'm having this issue as well when Mesos loses a slave. |
It seems that it tries to connect to the node that used to run the task in order to deregister the service from it.
What if you run a multi-master Consul cluster with local agents running on each box, and you lose the slave that was running the service? The service is never deregistered, and when running with log level DEBUG it seems that it will try forever to deregister the service by continuing to hit the old IP address where the service was running.
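One possible workaround, as a sketch only (this is not what mesos-consul currently does, and the node/service IDs are illustrative): remove the stale entry through the catalog endpoint of any reachable agent instead of the dead node's own agent.

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Any reachable agent in the cluster; the dead slave's agent is never contacted.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	dereg := &api.CatalogDeregistration{
		Node:      "lost-slave-node", // Consul node name of the dead Mesos slave (illustrative)
		ServiceID: "web-1234",        // the stale service instance to drop (illustrative)
	}

	if _, err := client.Catalog().Deregister(dereg, nil); err != nil {
		log.Fatal(err)
	}
}
```

Since the node is gone, there is no live agent left to re-register the entry via anti-entropy, so the catalog change should stick.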