Error rebuilding the cluster with a different number of nodes #2637

Closed
hitman249 opened this issue Aug 5, 2021 · 12 comments

hitman249 commented Aug 5, 2021

RKE version: v1.2.11

Docker version: (docker version, docker info preferred)

Client: Docker Engine - Community
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        f0df350
 Built:             Wed Jun  2 11:56:38 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       b0f5bc3
  Built:            Wed Jun  2 11:54:50 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.8
  GitCommit:        7eba5930496d9bbe375fdf71603e610ad737d2b2
 runc:
  Version:          1.0.0
  GitCommit:        v1.0.0-0-g84113ee
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 20.10.7
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7eba5930496d9bbe375fdf71603e610ad737d2b2
 runc version: v1.0.0-0-g84113ee
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-80-generic
 Operating System: Ubuntu 20.04.2 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.665GiB
 Name: node-2
 ID: QEKF:ILSY:KC3P:LQ7N:H5YD:YMWH:AM4C:QSXJ:6646:RIVY:XBVO:2XIS
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
Linux node-2 5.4.0-80-generic  #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Type/provider of hosts: Bare-metal

cluster.yml file:

nodes:
  - address: node-1
    internal_address: 192.168.1.11
    user: server
    role:
      - controlplane
      - etcd
      - worker
  - address: node-2
    internal_address: 192.168.1.199
    user: server
    role:
      - controlplane
      - etcd
      - worker
  - address: node-3
    internal_address: 192.168.1.38
    user: server
    role:
      - controlplane
      - etcd
      - worker

authentication:
  strategy: x509
  sans:
    - "192.168.1.11"
    - "192.168.1.199"
    - "192.168.1.38"
    - "node-1"
    - "node-2"
    - "node-3"
    - "localhost"
    - "127.0.0.1"
    - "127.0.1.1"

ssh_key_path: ~/.ssh/id_rsa

kubernetes_version: v1.20.9-rancher1-1

authorization:
  mode: rbac

network:
  plugin: canal
  # mtu: 1500
  options:
    canal_flannel_backend_type: host-gw

dns:
  provider: coredns

ingress:
  provider: "nginx"
  options:
    use-forwarded-headers: "true"

cluster_name: cluster.local

addons_include:
  - https://download.elastic.co/downloads/eck/1.6.0/all-in-one.yaml

services:
  scheduler:
    extra_args:
      leader-elect: true
      leader-elect-renew-deadline: 20s
      leader-elect-lease-duration: 30s
      leader-elect-retry-period: 4s
  kube-api:
    service_cluster_ip_range: 10.96.0.0/12
    extra_args:
      allow-privileged: true
  kube-controller:
    cluster_cidr: 10.244.0.0/16
    service_cluster_ip_range: 10.96.0.0/12
    extra_args:
      leader-elect: true
      leader-elect-renew-deadline: 20s
      leader-elect-lease-duration: 30s
      leader-elect-retry-period: 4s
      cluster-signing-cert-file: "/etc/kubernetes/ssl/kube-ca.pem"
      cluster-signing-key-file: "/etc/kubernetes/ssl/kube-ca-key.pem"
  kubelet:
    cluster_dns_server: 10.96.0.10
    extra_args:
      pod-manifest-path: "/etc/kubernetes/manifests"
      minimum-container-ttl-duration: "10s"
      kube-reserved: "cpu=300m,memory=1024Mi,ephemeral-storage=1Gi,pid=100"
      system-reserved: "cpu=100m,memory=512Mi,ephemeral-storage=500Mi,pid=100"
    extra_binds:
      - "/opt/local-storage/galera:/opt/local-storage/galera"
      - "/opt/local-storage/galera-backup:/opt/local-storage/galera-backup"
      - "/opt/local-storage/elasticsearch:/opt/local-storage/elasticsearch"
      - "/opt/local-storage/store-volume1:/opt/local-storage/store-volume1"
      - "/opt/local-storage/store-volume2:/opt/local-storage/store-volume2"

Steps to Reproduce:

  1. Build a cluster from the nodes: cluster, node-1, node-2, node-3
    Control node: cluster
    Command: rke up
  2. Remove the cluster
    Command: rke remove
  3. Remove the "cluster" node from the cluster.yml file.
  4. Clean up all nodes and reboot them with the script https://paste.4040.io/agucoqawoz.bash (a sketch of a typical cleanup is shown after this list)
  5. Build a cluster from the nodes: node-1, node-2, node-3
    Control node: node-2
    Command: rke up
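
For context, this is a minimal sketch of the kind of cleanup such a script typically performs (an assumption about the linked script's contents, not a copy of it; the directory list follows the usual RKE node cleanup guidance). Run on every node before rebuilding:

# remove all containers and volumes left over from the previous cluster
docker rm -f $(docker ps -qa) 2>/dev/null
docker volume rm $(docker volume ls -q) 2>/dev/null

# remove Kubernetes/RKE state directories
sudo rm -rf /etc/cni /etc/kubernetes /opt/cni /opt/rke \
  /var/lib/etcd /var/lib/cni /var/lib/kubelet /var/lib/rancher \
  /var/run/calico /run/flannel

# reboot to clear leftover network interfaces, iptables rules and mounts
sudo reboot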

Results:
An error will appear on a random node:

Conflict. The container name "/rke-etcd-port-listener" is already in use by container

INFO[0000] Running RKE version: v1.2.11                 
INFO[0000] Initiating Kubernetes cluster                
INFO[0000] [certificates] GenerateServingCertificate is disabled, checking if there are unused kubelet certificates 
INFO[0000] [certificates] Generating Kubernetes API server certificates 
INFO[0000] [certificates] Generating admin certificates and kubeconfig 
INFO[0000] Successfully Deployed state file at [./cluster.rkestate] 
INFO[0000] Building Kubernetes cluster                  
INFO[0000] [dialer] Setup tunnel for host [node-3]      
INFO[0000] [dialer] Setup tunnel for host [node-2]      
INFO[0000] [dialer] Setup tunnel for host [node-1]      
INFO[0000] [network] Deploying port listener containers 
INFO[0000] Pulling image [rancher/rke-tools:v0.1.77] on host [node-2], try  #1 
INFO[0000] Pulling image [rancher/rke-tools:v0.1.77] on host [node-1], try  #1 
INFO[0000] Pulling image [rancher/rke-tools:v0.1.77] on host [node-3], try  #1 
INFO[0026] Image [rancher/rke-tools:v0.1.77] exists on host [node-2] 
INFO[0026] Image [rancher/rke-tools:v0.1.77] exists on host [node-1] 
WARN[0026] Failed to create Docker container [rke-etcd-port-listener] on host [node-1]: Error response from daemon: Conflict. The container name "/rke-etcd-port-listener" is already in use by container "200e5e3170ff966fcf5f06f25f87be8411f2402b8c54968a838b0df06e68ab3b". You have to remove (or rename) that container to be able to reuse that name. 
WARN[0026] Failed to create Docker container [rke-etcd-port-listener] on host [node-1]: Error response from daemon: Conflict. The container name "/rke-etcd-port-listener" is already in use by container "200e5e3170ff966fcf5f06f25f87be8411f2402b8c54968a838b0df06e68ab3b". You have to remove (or rename) that container to be able to reuse that name. 
WARN[0026] Failed to create Docker container [rke-etcd-port-listener] on host [node-1]: Error response from daemon: Conflict. The container name "/rke-etcd-port-listener" is already in use by container "200e5e3170ff966fcf5f06f25f87be8411f2402b8c54968a838b0df06e68ab3b". You have to remove (or rename) that container to be able to reuse that name. 
INFO[0027] Starting container [rke-etcd-port-listener] on host [node-2], try  #1 
INFO[0027] [network] Successfully started [rke-etcd-port-listener] container on host [node-2] 
INFO[0038] Image [rancher/rke-tools:v0.1.77] exists on host [node-3] 
INFO[0039] Starting container [rke-etcd-port-listener] on host [node-3], try  #1 
INFO[0040] [network] Successfully started [rke-etcd-port-listener] container on host [node-3] 
FATA[0040] [Failed to create [rke-etcd-port-listener] container on host [node-1]: Failed to create Docker container [rke-etcd-port-listener] on host [node-1]: Error response from daemon: Conflict. The container name "/rke-etcd-port-listener" is already in use by container "200e5e3170ff966fcf5f06f25f87be8411f2402b8c54968a838b0df06e68ab3b". You have to remove (or rename) that container to be able to reuse that name.] 
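
The daemon error itself points at the immediate workaround: the leftover rke-etcd-port-listener container has to be removed (or renamed) on the affected node before rke up can recreate it. A hedged sketch, run on node-1 in this case:

# check whether an old port-listener container is still present
docker ps -a --filter name=rke-etcd-port-listener

# remove it so rke up can recreate it under the same name
docker rm -f rke-etcd-port-listener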
hitman249 reopened this Aug 5, 2021
superseb (Contributor) commented Aug 5, 2021

If the nodes were cleaned properly, it seems you are hitting the same issue as #2632. Can you share logging with --debug?

hitman249 (Author) commented Aug 5, 2021

I repeated all the cleaning steps above:

Cleaned up all nodes and rebooted them with the script https://paste.4040.io/agucoqawoz.bash

rke --debug up:
first run https://paste.4040.io/osapocawor.sql
second run https://paste.4040.io/uqugeforav.sql

superseb (Contributor) commented Aug 5, 2021

The title of the issue is "error rebuilding the cluster": with what RKE version, with how many and which nodes, and when was the last time you ran rke up successfully with these nodes (or a subset of them)?

hitman249 (Author) commented Aug 5, 2021

This is a test cluster of 4 nodes; one was taken away, so I had to rebuild the cluster on 3 nodes.
I ran into this problem during the rebuild.

Before rebuilding, rke v1.2.9 + v1.20.8-rancher1-1 was running there.
After rebuilding, rke v1.2.11 + v1.20.9-rancher1-1.

There is a subtlety: I did not immediately notice that the image version had changed in rke v1.2.11, so I tried several times to start rke v1.2.11 + v1.20.8-rancher1-1.

But a complete cleanup of the nodes should have removed those changes.

superseb (Contributor) commented Aug 5, 2021

When was the last time it worked with v1.2.9? Does it work with v1.2.9 now, or do you also get an error? I assume the 3 remaining nodes have stayed the same in between (specification-wise: CPU/memory/disk).

hitman249 (Author) commented

rke v1.2.9 + v1.20.8-rancher1-1 doesn't work either.
The remaining nodes are unchanged:
node-1: https://linux-hardware.org/?probe=7b9adbf809 CPU: 2 cores, 8Gb RAM, SSD
node-2: https://linux-hardware.org/?probe=ee55d78771 CPU: 4 cores, 8Gb RAM, SSD
node-3: https://linux-hardware.org/?probe=1e17007509 CPU: 2 cores, 8Gb RAM, SSD

I repeated all the cleaning steps above:

Cleaned up all nodes and rebooted them with the script https://paste.4040.io/agucoqawoz.bash

rke --debug up:
first run https://paste.4040.io/dumavazobu.sql
second run https://paste.4040.io/xegarojoqu.sql

superseb (Contributor) commented Aug 6, 2021

Not really sure, but if it's not version-dependent, you might still be running into the linked issue above. Can you test storage performance using https://www.ibm.com/cloud/blog/using-fio-to-tell-whether-your-storage-is-fast-enough-for-etcd?
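
For reference, the benchmark in that article boils down to an fio run with fdatasync against the disk that will hold the etcd data. A sketch (test-data is a placeholder directory on that disk):

# write 22 MiB in 2300-byte blocks, syncing after every write, and report
# fdatasync latency percentiles; the usual etcd guideline is a 99th
# percentile below ~10 ms
mkdir test-data
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data \
  --size=22m --bs=2300 --name=etcd-disk-check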

hitman249 (Author) commented Aug 6, 2021

[fio results posted]

superseb (Contributor) commented Aug 6, 2021

This looks good. Did you follow the article and check the values? Did you run it a couple of times to make sure it's consistent?

hitman249 (Author) commented Aug 6, 2021

Sorry, I had only checked it once.
The results improved after several passes:

node-1: https://paste.4040.io/rikelejaro.md
node-2: https://paste.4040.io/uwelijiyib.md
node-3: https://paste.4040.io/qudenulada.md

superseb (Contributor) commented Aug 6, 2021

I don't really have a lead at this moment; those errors could be related to bad disk performance, but that's not the case here. What changed on these nodes since the last successful rke up? Are you sure DNS is correct for node-1/node-2/node-3, as it might conflict? Does it work when you use a single node? Does it work when you create a completely new node and use that for rke up, just to rule out the existing nodes?

hitman249 (Author) commented

The problem turned out to be incorrect DNS entries.
They were filled in automatically and I was sure of their reliability.
Sorry for the false bug report, and thanks for the help.

The hostname-to-IP mappings did not match the internal_address values in cluster.yml:

192.168.1.199  node-1
192.168.1.38   node-2
192.168.1.11   node-3

while cluster.yml has:

nodes:
  - address: node-1
    internal_address: 192.168.1.11
    user: server
    role:
      - controlplane
      - etcd
      - worker
  - address: node-2
    internal_address: 192.168.1.199
    user: server
    role:
      - controlplane
      - etcd
      - worker
  - address: node-3
    internal_address: 192.168.1.38
    user: server
    role:
      - controlplane
      - etcd
      - worker
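
A quick way to catch this kind of mismatch before running rke up is to compare what each hostname actually resolves to on the machine running rke against the addresses in cluster.yml. A sketch (assumes the names come from /etc/hosts or local DNS):

# print what each node name resolves to
for h in node-1 node-2 node-3; do
  getent hosts "$h"
done

# each resolved IP should belong to the machine configured for that name in
# cluster.yml (here: node-1 -> 192.168.1.11, node-2 -> 192.168.1.199,
# node-3 -> 192.168.1.38)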
