mesos-consul acts up when mesos-agent is not responding #77
Is the host running the mesos agent still available, or is the entire node unreachable? mesos-consul connects to the consul agent running on the same server as the mesos agent, and if it can't reach it the service won't be deregistered.
Well, the entire node was unreachable because of a load issue. That brings up an important point: if any of the mesos slave (consul agent) machines in the cluster becomes unresponsive, is mesos-consul going to break? -Imran
It currently can't deregister that node's services if the consul agent is down. You're seeing a lot of deregistration error messages?
Yes, I was seeing a lot of deregister messages. Also, for some hosts I was seeing a bunch of "no route to host" as well as "i/o error" messages. So what I am saying is: if there are errors like that for some hosts, why would mesos-consul not update the member list correctly when I scale applications up/down in Marathon? -Imran
And it wasn't that the apps were scaled on those affected nodes. They were getting scaled up and down on other hosts, but mesos-consul was still failing to update the member list in consul. -Imran
Those errors are what I would expect since it can't connect to Consul on the down node. mesos-consul currently can't update the nodes of the service because in order to do so it needs to deregister the service from the down node, and since it can't connect, it is never deregistered. I am going to look into using the Consul catalog endpoint to deregister the service if the agent endpoint fails to deregister it. mesos-consul doesn't use the catalog endpoint normally because services deregistered via the catalog will be re-registered by Consul's anti-entropy process.
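A minimal sketch of the fallback Chris describes, assuming the official Consul Go client (github.com/hashicorp/consul/api); the function name and surrounding wiring are illustrative, not mesos-consul's actual code:

```go
package dereg

import (
	"log"

	consulapi "github.com/hashicorp/consul/api"
)

// deregisterWithFallback first asks the agent on the service's own node to
// deregister the service. If that agent is unreachable, it falls back to the
// catalog endpoint so the stale entry disappears from queries, even though
// anti-entropy may re-add it if the agent comes back with the service still
// registered locally.
func deregisterWithFallback(agentAddr, node, serviceID string) error {
	cfg := consulapi.DefaultConfig()
	cfg.Address = agentAddr // agent on the down node, e.g. "10.3.161.41:8500"
	remote, err := consulapi.NewClient(cfg)
	if err != nil {
		return err
	}

	err = remote.Agent().ServiceDeregister(serviceID)
	if err == nil {
		return nil
	}
	log.Printf("agent deregister of %s failed: %v; falling back to catalog", serviceID, err)

	// Fall back to the catalog API through a reachable (local) agent.
	local, err := consulapi.NewClient(consulapi.DefaultConfig())
	if err != nil {
		return err
	}
	_, err = local.Catalog().Deregister(&consulapi.CatalogDeregistration{
		Node:      node,
		ServiceID: serviceID,
	}, nil)
	return err
}
```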
Hi Chris,
Or the other scenario is that mesos-consul doesn't register services through the node agents at all. Rather, it registers a service through the catalog endpoint so that it is treated as an external service. Then we wouldn't have to have consul agents running on every mesos agent, and the only job of mesos-consul would be to read the marathon event bus and update the member list in consul. Something like how Bamboo/HAProxy (https://github.com/QubitProducts/bamboo) or mesos-dns (https://github.com/mesosphere/mesos-dns) does it.
@imrangit Managing the services through the catalog endpoint doesn't work. They need to be registered via an agent, otherwise they get removed by Consul's anti-entropy process. The problem here is that your node isn't actually down: the consul cluster doesn't seem to recognize the failing node as gone. If it did, consul would have removed the services from the catalog; mesos-consul would still report that the deregistration failed, but the agent's services would no longer be in the catalog. If you run consul members, what state does it report for that node?
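For readers following along, here is one way to check the same thing from code rather than the CLI, again assuming the official Consul Go client; the node name is a placeholder:

```go
package main

import (
	"fmt"
	"log"

	consulapi "github.com/hashicorp/consul/api"
)

func main() {
	client, err := consulapi.NewClient(consulapi.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Roughly what `consul members` shows: gossip status per node.
	members, err := client.Agent().Members(false)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range members {
		fmt.Printf("%s %s status=%d\n", m.Name, m.Addr, m.Status)
	}

	// Does the suspect node still have services in the catalog?
	node, _, err := client.Catalog().Node("failed-node-name", nil) // placeholder node name
	if err != nil {
		log.Fatal(err)
	}
	if node != nil {
		for id := range node.Services {
			fmt.Println("still in catalog:", id)
		}
	}
}
```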
@ChrisAubuchon so let's start from scratch.
Now my concern is how we can fix this if we keep registering services through the consul agents, because in the end what I want is for consul to reflect the right members as they are in Marathon. Also, one thing I have noticed is that when I look at the STDOUT of mesos-consul, I see a bunch of "Unable to register" messages for the failed nodes. When I ran "consul members", the failed node was reported as "failed". And when I did a "force-leave" of that failed node, my members were still out of sync. Somehow mesos-consul is not able to pick up the changes and do the right thing. Let me know if you need more information to troubleshoot this. Thanks,
Out of curiosity, what version of mesos-consul are you running?
Also, can you get the output of the catalog query for that service?
mesos-consul: 0.3.1. Here is the output of the catalog for that service. It contains 11 members as of now, but my actual member count in marathon is only 5.
Are you seeing any deregistration errors in mesos-consul's stdout about any of these services? The oddball one here looks to be logstash-ypec.
logstash-ypec maps only a single port; that is what docker ps showed. I am seeing deregistration messages for other services. Actually, none of the services in consul are in sync with marathon. Here is the mesos-consul STDOUT:
warning msg="Unable to register mesos-consul:mesos:05da0a57-8351-4b6f-9318-9e499a4085c5-S187:ms11.ev1: Put http://10.3.161.41:8500/v1/agent/service/register: dial tcp 10.3.161.41:8500: connection refused"
Those are registration errors for the unreachable node. Can you look for the logstash-ypec tasks in the mesos state?
Yes, you are right. state.json properly reflects the 5 members that marathon shows as running. But I did notice extra entries for "logstash-ypec" in state.json; they had the status "TASK_LOST", while the 5 running entries had "TASK_RUNNING". Does mesos-consul account for the various states of a task?
mesos-consul looks at the Task.State field and only registers the task if that is TASK_RUNNING.
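In sketch form, assuming tasks parsed from the Mesos master's state.json; the struct is trimmed to the relevant fields and is not the actual mesos-consul type:

```go
package state

// Task holds the fields of a Mesos task that matter here, as they appear in
// the master's state.json.
type Task struct {
	ID    string `json:"id"`
	Name  string `json:"name"`
	State string `json:"state"` // e.g. "TASK_RUNNING", "TASK_LOST", "TASK_FAILED"
}

// runningTasks keeps only the tasks that should be registered in Consul;
// anything else (TASK_LOST, TASK_KILLED, ...) is a candidate for deregistration.
func runningTasks(tasks []Task) []Task {
	var out []Task
	for _, t := range tasks {
		if t.State == "TASK_RUNNING" {
			out = append(out, t)
		}
	}
	return out
}
```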
5 of them have TASK_RUNNING. 2 of them have TASK_LOST.
@imrangit Can you try mesos-consul 0.3.2? 0.3.1 had a deregistration bug where the deregistration process would halt if it failed to deregister a service. I'll bet that's what's happening here. It's trying to deregister a service from an agent that is unreachable. It fails and stops deregistering other services.
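The difference, in sketch form only (illustrative, not the actual mesos-consul code): keep iterating when one agent is unreachable instead of returning on the first failure.

```go
package dereg

import "log"

// deregisterAll keeps going when one agent is unreachable, so a single dead
// node can no longer block cleanup of every other stale service.
func deregisterAll(serviceIDs []string, deregister func(serviceID string) error) {
	for _, id := range serviceIDs {
		if err := deregister(id); err != nil {
			// 0.3.1 effectively stopped here on the first failure;
			// 0.3.2 logs the error and moves on to the next service.
			log.Printf("unable to deregister %s: %v (continuing)", id, err)
		}
	}
}
```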
@ChrisAubuchon: should I also upgrade to the new consul 0.6.3? -Imran
@imrangit consul 0.6.0 is fine
@ChrisAubuchon: does the master branch have the 0.3.2 changes committed? main.go still shows version 0.3.1.
@imrangit It does. I don't know why the version is still 0.3.1. My local branch has it as 0.3.2 in main.go.
Maybe git push origin master?
I have a big mesos cluster with 200 nodes participating.
I am seeing issues with mesos-consul when one or more mesos-slaves are not working: the services in consul stop reflecting the members that marathon is actually running, and stale entries for the failed slaves are never deregistered.
The only way I am able to get it to work is to first resolve the issue with the mesos-slave that it thinks is part of the cluster, and then restart mesos-consul. Only then does it update the members properly from marathon to consul.
I think we need to make mesos-consul more robust to handle these scenarios for running in production, because the consul services are the source of truth. If the consul services are not reflecting the right members from marathon, it defeats the whole purpose of using mesos-consul.
-Imran