Skip to content
This repository has been archived by the owner on Jul 27, 2023. It is now read-only.

mesos-consul keep on connecting to terminated mesos-agents, causing high latency in updating records to consul #124

Open
rahul9999 opened this issue Jan 12, 2018 · 0 comments

Comments

@rahul9999
Copy link

rahul9999 commented Jan 12, 2018

Our environment is on aws ec2 with mesos cluster size of 150 nodes. Nodes scaling is managed by aws autoscaling. We are using latest version of mesos-consul 0.4.0.

We are seeing issue with mesos-consul when a node is terminated (by ASG or by other operational activities)-

  • mesos-consul try to connect to the terminated node for de-registration until its timeout (timeout is like 30 secs) and it keeps on doing so in all upcoming runs. This begin the latency problem in updates to Consul. And at some stage when the queue of such nodes will grow huge (due to autoscaling terminations, genuine node failure or due to operational activities) then latency becomes worse due to wait for each such node connection timeout.

Workaround is re-start mesos-consul which then re-build consul data from scratch (not a desired solution).
meso-consul-deregistrationerror-timeouts.txt

Example logs (also in attachment):
time="2018-01-12T12:50:25Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:tvault-ccp:172.xx.xx.2:11061: dial tcp 110.xx.xx.38:8500: i/o timeout"
time="2018-01-12T12:50:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:kafka:172.xx.xx.6:9092"
time="2018-01-12T12:50:55Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:kafka:172.xx.xx.6:9092: dial tcp 10.65.xx.38:8500: i/o timeout"
time="2018-01-12T12:50:55Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:kibana-blackbox:172.xx.xx.5:42973"
time="2018-01-12T12:51:25Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:kibana-blackbox:172.xx.xx.5:42973: dial tcp 10.xx.xx.38:8500: i/o timeout"
time="2018-01-12T12:51:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:tvault:172.xx.xx.3:49910"
time="2018-01-12T12:51:55Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:tvault:172.xx.xx.3:49910: dial tcp 10.xx.xx.38:8500: i/o timeout"
time="2018-01-12T12:51:55Z" level=info msg="Deregistering mesos-consul:mesos:6d3301d5-2bf1-439f-979d-b5ec5eb49312-S75:10.65.44.17"
time="2018-01-12T12:52:25Z" level=info msg="Deregistration error Put http://10.xx.xx.17:8500/v1/agent/service/deregister/mesos-consul:mesos:6d3301d5-2bf1-439f-979d-b5ec5eb49312-S75:10.65.44.17: dial tcp 10.xx.xx.17:8500: i/o timeout"
time="2018-01-12T12:52:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:kafka:172.xx.xx.6:9094"
time="2018-01-12T12:52:55Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:kafka:172.xx.xx.6:9094: dial tcp 10.xx.xx.38:8500: i/o timeout"
time="2018-01-12T12:52:55Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:tvault:172.xx.xx.3:49911"
time="2018-01-12T12:53:25Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:tvault:172.xx.xx.3:49911: dial tcp 10.xx.xx.38:8500: i/o timeout"
time="2018-01-12T12:53:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:tvault-ccp:172.xx.xx.2:11059"
time="2018-01-12T12:53:55Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:tvault-ccp:172.xx.xx.2:11059: dial tcp 10.xx.xx.38:8500: i/o timeout"
time="2018-01-12T12:53:55Z" level=info msg="Deregistering mesos-consul:10.xx.xx.11:uat-dcp-api-profile-uat-dcp:172.xx.xx.13:25553"
time="2018-01-12T12:53:55Z" level=info msg="Deregistering mesos-consul:mesos:e809550e-b1d9-44ae-8636-bf21c2d06e8a-S29:10.xx.xx.38"
time="2018-01-12T12:54:25Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:mesos:e809550e-b1d9-44ae-8636-bf21c2d06e8a-S29:10.65.228.38: dial tcp 10.xx.xx.38:8500: i/o timeout"
time="2018-01-12T12:54:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.33:msg-ui:172.xx.xx.11:61067"
time="2018-01-12T12:54:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.17:telemetry:172.xx.xx.6:31829"
time="2018-01-12T12:54:55Z" level=info msg="Deregistration error Put http://10.xx.xx.17:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.17:telemetry:172.xx.xx.6:31829: dial tcp 10.xx.xx.17:8500: i/o timeout"
time="2018-01-12T12:54:55Z" level=info msg="Zookeeper leader: 10.xx.xx.202:5050"
time="2018-01-12T12:54:55Z" level=info msg="reloading from master 10.xx.xx.202"
time="2018-01-12T12:54:55Z" level=info msg="Running parseState"
time="2018-01-12T12:54:55Z" level=info msg="Host found. Comparing tags: ([agent follower], [agent follower])"
time="2018-01-12T12:54:55Z" level=info msg="Host found. Comparing tags: ([agent follower], [agent follower])"

We are requesting a configuration parameter like 'max-attempts for failed nodes' which can be used to control the connection attempt count and when that count is reached mesos-consul should remove the failed node from its memory i.e should never ever connect again.

-Rahul

@rahul9999 rahul9999 changed the title mesos-consul doesnt forget terminated mesos-agents, cause high latency in updating records to consul mesos-consul keep on connecting to terminated mesos-agents, causing high latency in updating records to consul Jan 12, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant