API Server crashlooping because it can't find etcd #94
Comments
The etcd server seems to be healthy. Here's the log from that:
It looks like at some point the liveness probe started failing:
@csrwng we saw an issue earlier where the liveness probe started failing because of slow DNS resolution. Can you confirm that all SDN pods are running on the master node? You can also try adding a static /etc/hosts entry to see whether that resolves the crashloop.
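One rough way to check the slow-DNS theory before editing /etc/hosts is to time a name lookup from the affected node. The sketch below is only an illustration: the function name `time_dns_lookup` is hypothetical, and `localhost` is a placeholder for whatever hostname the apiserver actually uses to reach etcd, which this thread does not state.

```python
import socket
import time


def time_dns_lookup(hostname, port=443):
    """Time a single name resolution.

    Returns (elapsed_seconds, error), where error is None on success
    or the socket.gaierror raised by the resolver.
    """
    start = time.monotonic()
    try:
        socket.getaddrinfo(hostname, port)
        error = None
    except socket.gaierror as exc:
        error = exc
    return time.monotonic() - start, error


if __name__ == "__main__":
    # "localhost" is a placeholder; substitute the name the probe resolves.
    elapsed, error = time_dns_lookup("localhost")
    status = "ok" if error is None else f"failed: {error}"
    print(f"lookup took {elapsed:.3f}s ({status})")
```

If lookups regularly take longer than the probe's timeout, a static /etc/hosts entry would bypass the resolver for that name.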
@mfojtik I am seeing this as well, but only after about 10 minutes. I ran oc cluster up on an AWS instance, and I don't think I see any SDN pods running. I just ran ansible-playbook contrib/ansible/deploy-devel-playbook.yml and ansible-playbook contrib/ansible/create-cluster-playbook.yml, and after a while cluster-operator-apiserver started crashlooping irrecoverably. What should I add to my /etc/hosts?
The logs are very similar to the above; the apiserver container panics the same way.
So etcd is not healthy after all. Reading the logs more carefully, etcd is what gets OOMKilled first, and then the apiserver container starts panicking. My etcd log is filled with messages like

2018-09-27 19:33:29.803997 W | etcdserver: request "header:<ID:7587833219866082036 > txn:<compare:<target:MOD key:"/registry/k8s.io/cluster.k8s.io/clusters/myproject/fedora-9c8wk" mod_revision:1689 > success:<request_put:<key:"/registry/k8s.io/cluster.k8s.io/clusters/myproject/fedora-9c8wk" value_size:5901 >> failure:<request_range:<key:"/registry/k8s.io/cluster.k8s.io/clusters/myproject/fedora-9c8wk" > >>" with result "size:16" took too long (398.706759ms) to execute

and

2018-09-27 19:33:33.000241 W | etcdserver: read-only range request "key:"/registry/k8s.io/cluster.k8s.io/clusters/myproject/fedora-9c8wk" " with result "range_response_count:1 size:5987" took too long (2.196201228s) to execute
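To see how bad the slowdown gets before the OOM kill, the durations in these "took too long" warnings can be pulled out of the log. This is a minimal sketch that assumes the duration always appears in the form shown in the two messages above (a number followed by µs, ms, or s); the names `SLOW_REQUEST`, `UNIT_SECONDS`, and `slow_request_seconds` are made up for this example, and longer durations such as minutes are not handled.

```python
import re

# Matches the trailing "took too long (<number><unit>) to execute" part
# of an etcdserver warning, as seen in the log lines quoted above.
SLOW_REQUEST = re.compile(r"took too long \((\d+(?:\.\d+)?)(µs|ms|s)\) to execute")

# Conversion factors from the logged unit to seconds.
UNIT_SECONDS = {"µs": 1e-6, "ms": 1e-3, "s": 1.0}


def slow_request_seconds(line):
    """Return the logged duration in seconds, or None if the line
    is not a "took too long" warning in the expected format."""
    match = SLOW_REQUEST.search(line)
    if match is None:
        return None
    value, unit = match.groups()
    return float(value) * UNIT_SECONDS[unit]


if __name__ == "__main__":
    import sys

    # Read an etcd log on stdin and print the slowest request seen.
    durations = [d for d in map(slow_request_seconds, sys.stdin) if d is not None]
    if durations:
        print(f"{len(durations)} slow requests, slowest {max(durations):.3f}s")
    else:
        print("no slow requests found")
```

Piping the etcd container log into this script gives a quick count of slow requests and the worst latency, which can be compared before and after changing the memory limit.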
Well, it seems like etcd is simply running out of memory while idle, because I increased Line 134 in ae319de
How much memory does the cluster operator's apiserver etcd typically need? (Sorry for the noise; I could have done all this debugging before commenting.)
After running on my cluster for a couple of hours, I started seeing the api server crash loop. Here's a log from a run: