This repository has been archived by the owner on Jul 27, 2023. It is now read-only.
Our environment runs on AWS EC2 with a Mesos cluster of 150 nodes. Node scaling is managed by AWS Auto Scaling. We are using the latest version of mesos-consul, 0.4.0.
We are seeing an issue with mesos-consul when a node is terminated (by the ASG or by other operational activities):
mesos-consul tries to connect to the terminated node for deregistration until the connection times out (the timeout is about 30 seconds), and it keeps doing so on every subsequent run. This introduces latency in updates to Consul, and as the queue of such nodes grows (due to autoscaling terminations, genuine node failures, or operational activities), the latency gets worse because each dead node's connection must time out in turn.
rahul9999 changed the title from "mesos-consul doesnt forget terminated mesos-agents, cause high latency in updating records to consul" to "mesos-consul keep on connecting to terminated mesos-agents, causing high latency in updating records to consul" on Jan 12, 2018.
A workaround is to restart mesos-consul, which then rebuilds the Consul data from scratch (not a desirable solution).
Example logs (also in the attached meso-consul-deregistrationerror-timeouts.txt):
time="2018-01-12T12:50:25Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:tvault-ccp:172.xx.xx.2:11061: dial tcp 110.xx.xx.38:8500: i/o timeout"
time="2018-01-12T12:50:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:kafka:172.xx.xx.6:9092"
time="2018-01-12T12:50:55Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:kafka:172.xx.xx.6:9092: dial tcp 10.65.xx.38:8500: i/o timeout"
time="2018-01-12T12:50:55Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:kibana-blackbox:172.xx.xx.5:42973"
time="2018-01-12T12:51:25Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:kibana-blackbox:172.xx.xx.5:42973: dial tcp 10.xx.xx.38:8500: i/o timeout"
time="2018-01-12T12:51:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:tvault:172.xx.xx.3:49910"
time="2018-01-12T12:51:55Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:tvault:172.xx.xx.3:49910: dial tcp 10.xx.xx.38:8500: i/o timeout"
time="2018-01-12T12:51:55Z" level=info msg="Deregistering mesos-consul:mesos:6d3301d5-2bf1-439f-979d-b5ec5eb49312-S75:10.65.44.17"
time="2018-01-12T12:52:25Z" level=info msg="Deregistration error Put http://10.xx.xx.17:8500/v1/agent/service/deregister/mesos-consul:mesos:6d3301d5-2bf1-439f-979d-b5ec5eb49312-S75:10.65.44.17: dial tcp 10.xx.xx.17:8500: i/o timeout"
time="2018-01-12T12:52:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:kafka:172.xx.xx.6:9094"
time="2018-01-12T12:52:55Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:kafka:172.xx.xx.6:9094: dial tcp 10.xx.xx.38:8500: i/o timeout"
time="2018-01-12T12:52:55Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:tvault:172.xx.xx.3:49911"
time="2018-01-12T12:53:25Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:tvault:172.xx.xx.3:49911: dial tcp 10.xx.xx.38:8500: i/o timeout"
time="2018-01-12T12:53:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:tvault-ccp:172.xx.xx.2:11059"
time="2018-01-12T12:53:55Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:tvault-ccp:172.xx.xx.2:11059: dial tcp 10.xx.xx.38:8500: i/o timeout"
time="2018-01-12T12:53:55Z" level=info msg="Deregistering mesos-consul:10.xx.xx.11:uat-dcp-api-profile-uat-dcp:172.xx.xx.13:25553"
time="2018-01-12T12:53:55Z" level=info msg="Deregistering mesos-consul:mesos:e809550e-b1d9-44ae-8636-bf21c2d06e8a-S29:10.xx.xx.38"
time="2018-01-12T12:54:25Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:mesos:e809550e-b1d9-44ae-8636-bf21c2d06e8a-S29:10.65.228.38: dial tcp 10.xx.xx.38:8500: i/o timeout"
time="2018-01-12T12:54:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.33:msg-ui:172.xx.xx.11:61067"
time="2018-01-12T12:54:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.17:telemetry:172.xx.xx.6:31829"
time="2018-01-12T12:54:55Z" level=info msg="Deregistration error Put http://10.xx.xx.17:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.17:telemetry:172.xx.xx.6:31829: dial tcp 10.xx.xx.17:8500: i/o timeout"
time="2018-01-12T12:54:55Z" level=info msg="Zookeeper leader: 10.xx.xx.202:5050"
time="2018-01-12T12:54:55Z" level=info msg="reloading from master 10.xx.xx.202"
time="2018-01-12T12:54:55Z" level=info msg="Running parseState"
time="2018-01-12T12:54:55Z" level=info msg="Host found. Comparing tags: ([agent follower], [agent follower])"
time="2018-01-12T12:54:55Z" level=info msg="Host found. Comparing tags: ([agent follower], [agent follower])"
We are requesting a configuration parameter such as 'max-attempts for failed nodes' to control the connection attempt count: once that count is reached, mesos-consul should remove the failed node from its memory, i.e. never attempt to connect to it again.
-Rahul