
consul service deregistration fails if the mesos slave is down #98

Open
gena01 opened this issue Aug 25, 2016 · 10 comments

@gena01

gena01 commented Aug 25, 2016

It seems that mesos-consul tries to connect to the node that used to run the task in order to deregister the service from it.

What if you run a multi-master Consul cluster with local agents on each box, and you lose the slave that was running the service? The service is never deregistered; running with log level DEBUG shows that mesos-consul will retry forever, continuing to hit the old IP address of the node where the service was running.
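Roughly what the failing call looks like, sketched with the official Consul Go API (the address and service ID here are made up for illustration; this is not mesos-consul's actual code):

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	cfg := api.DefaultConfig()
	cfg.Address = "10.0.0.5:8500" // old IP of the lost mesos slave (illustrative)

	client, err := api.NewClient(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// An agent-level deregister aimed at a node that no longer exists
	// fails with a connection error on every refresh cycle, forever.
	if err := client.Agent().ServiceDeregister("web-1"); err != nil {
		log.Printf("deregister failed: %v", err)
	}
}
```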

gena01 changed the title from "consul service deregistration fails if the mesosslave is down" to "consul service deregistration fails if the mesos slave is down" on Aug 25, 2016
@evilezh

evilezh commented Sep 20, 2016

And if you do not set the consul-timeout option (it defaults to 30 seconds) and you have plenty of such hosts, then a single service refresh cycle might take 10+ minutes to finish. Even with a 1 second timeout configured and 20+ failed nodes, the time is still too large.

I think it would be useful to split registration and de-registration into their own threads. De-registration could even use two threads/lists (see the sketch below):

  1. hot (new de-registrations)
  2. retry (failed de-registrations)

Another approach would be to implement pushback on failed entries.
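A rough sketch of that split, using hypothetical types and a stand-in deregister function rather than the actual mesos-consul code:

```go
package main

import "time"

// task identifies a service registration owned by a (possibly dead) agent.
type task struct {
	ServiceID string
	AgentAddr string
}

// deregister stands in for the real Consul call; assume it returns an
// error when the owning agent is unreachable.
func deregister(t task) error { return nil }

func main() {
	hot := make(chan task, 100)   // fresh de-registrations
	retry := make(chan task, 100) // failed attempts, retried later

	// Hot list: try once, push failures onto the retry list so a dead
	// agent never blocks new work.
	go func() {
		for t := range hot {
			if err := deregister(t); err != nil {
				retry <- t
			}
		}
	}()

	// Retry list: work through failures on its own schedule.
	go func() {
		for t := range retry {
			if err := deregister(t); err != nil {
				time.Sleep(5 * time.Second) // back off before requeueing
				retry <- t
			}
		}
	}()

	// Registration would run in its own goroutine too, so slow
	// de-registrations never delay new registrations.
	hot <- task{ServiceID: "web-1", AgentAddr: "10.0.0.5:8500"}
	select {}
}
```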

@gena01
Author

gena01 commented Sep 26, 2016

There's another option to consider. I run a local Consul agent (in client/agent mode) that talks to a multi-master cluster (the recommended HA setup). Ideally the registration/deregistration should flow through the local agent. (Maybe that's another reason to allow overriding/forcing a specific Consul IP address when running mesos-consul.)
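For reference, registering through the local agent with the official Consul Go API looks roughly like this (service name, ID, port and address are made up for illustration):

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// DefaultConfig points at the local agent on 127.0.0.1:8500.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// The local agent owns this registration and syncs it to the
	// multi-master server cluster on our behalf.
	err = client.Agent().ServiceRegister(&api.AgentServiceRegistration{
		ID:      "web-1",
		Name:    "web",
		Port:    8080,
		Address: "10.0.0.5",
	})
	if err != nil {
		log.Fatal(err)
	}
}
```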

@ChrisAubuchon
Contributor

If you push registration through a local agent then all of the services are linked with that agent. So if that agent becomes unavailable, all of the services that were registered through that agent become unavailable.

@gena01
Author

gena01 commented Sep 27, 2016

The Consul docs have a guide specifically for overcoming the "local agent owns the service(s)" case: https://www.consul.io/docs/guides/external.html
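That guide boils down to registering directly in the catalog, which in the Go API looks roughly like this (node name, address and service details are illustrative only):

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// A catalog registration is not owned by any agent, so losing a
	// mesos slave's agent does not strand the entry behind a dead IP.
	_, err = client.Catalog().Register(&api.CatalogRegistration{
		Node:    "mesos-tasks", // synthetic node name, not a real agent
		Address: "10.0.0.5",
		Service: &api.AgentService{
			ID:      "web-1",
			Service: "web",
			Port:    8080,
		},
	}, nil)
	if err != nil {
		log.Fatal(err)
	}
}
```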

@ChrisAubuchon
Contributor

When I originally wrote mesos-consul I used the catalog for registration. The services were always getting removed by anti-entropy.

@gena01
Author

gena01 commented Sep 27, 2016

I see. I had a mesos slave cluster on top of EC2 spot instances (which could get killed at any time). This would leave quite a bit of a mess in Consul. Checking the logs showed that mesos-consul was unable to contact the agent on the killed nodes.

@caussourd

I have the same issue: mesos-consul is so busy trying to deregister services that it doesn't register new ones.
Even after the agent leaves the Consul cluster, mesos-consul keeps trying to deregister services that were running on that node, i.e. services that are already deregistered. Is that normal behaviour?
If there is a way to get around this issue, please let me know. Cheers
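One possible workaround sketch for that situation: before retrying against the dead agent, check whether the node is still in the catalog and, if it is, clean the stale entry up through the catalog instead of the unreachable agent (node and service names are made up):

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Ask the catalog (via the healthy local agent) about the lost node.
	node, _, err := client.Catalog().Node("lost-slave-01", nil)
	if err != nil {
		log.Fatal(err)
	}
	if node == nil {
		fmt.Println("node already gone from the catalog, nothing to deregister")
		return
	}

	// The node is still listed: remove the stale service through the
	// catalog rather than hammering the old agent IP.
	_, err = client.Catalog().Deregister(&api.CatalogDeregistration{
		Node:      "lost-slave-01",
		ServiceID: "web-1",
	}, nil)
	if err != nil {
		log.Fatal(err)
	}
}
```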

@evilezh

evilezh commented Oct 5, 2016

Let's summarize:
Servers in the cloud are a very dynamic thing. They go up and down, and we have short-lived servers (temporary, for some task) which join the cluster for an hour or two and then leave, etc.

I'm still convinced that my first idea of separating registration and de-registration into different threads is the best way to manage this:

  1. we want robust registration
  2. we want robust de-registration (hot list)
  3. some servers might just reboot (retry list)
  4. and finally, after a day or two, they should disappear from the retry list as well (see the sketch at the end of this comment)

I can give a real example :)
AWS, a small spot fleet of around 100 cores, running around 200 tasks :) I need to upgrade the base image, etc.
It's dev, so I chose the simple way: all down, all up.
Start a second fleet, terminate the first fleet.
Within 1-2 minutes all 200 tasks are re-scheduled on the new fleet, but now we have a big problem with registration.
I have a 1 second timeout configured (down from 30 :) ) ... 200 tasks ... that is 200 seconds minimum for each run ... ;)

I think just splitting this into two tasks, or doing some async work, would improve performance greatly.
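A small sketch of the retry list with expiry from points 3 and 4, again with hypothetical types and a stand-in deregister call, not mesos-consul's actual code:

```go
package main

import "time"

// retryEntry is a failed de-registration plus a deadline after which we
// simply give up on it (e.g. a day or two).
type retryEntry struct {
	ServiceID string
	Expires   time.Time
}

// deregister stands in for the real Consul call.
func deregister(e retryEntry) error { return nil }

// retryPass attempts each pending entry once and returns what is still
// pending; expired entries are dropped so dead nodes cannot slow the
// loop down forever.
func retryPass(entries []retryEntry) []retryEntry {
	var remaining []retryEntry
	for _, e := range entries {
		if time.Now().After(e.Expires) {
			continue // gave up after the deadline
		}
		if err := deregister(e); err != nil {
			remaining = append(remaining, e) // keep for the next pass
		}
	}
	return remaining
}

func main() {
	pending := []retryEntry{
		{ServiceID: "web-1", Expires: time.Now().Add(48 * time.Hour)},
	}
	for range time.Tick(30 * time.Second) {
		pending = retryPass(pending)
	}
}
```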

@evilezh

evilezh commented Oct 15, 2016

BTW, a simpler solution: would it be possible to add an option to

  1. do only registration
  2. do only de-registration?

I could run two instances: one would only register, the other would only de-register :)

@txbm

txbm commented Feb 15, 2017

Has anyone solved this? I'm having this issue as well when Mesos loses a slave.
