Error rebuilding the cluster with a different number of nodes #2637

Closed
hitman249 opened this issue Aug 5, 2021 · 12 comments

hitman249 commented Aug 5, 2021

RKE version: v1.2.11

Docker version: (docker version, docker info preferred)

Client: Docker Engine - Community
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        f0df350
 Built:             Wed Jun  2 11:56:38 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       b0f5bc3
  Built:            Wed Jun  2 11:54:50 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.8
  GitCommit:        7eba5930496d9bbe375fdf71603e610ad737d2b2
 runc:
  Version:          1.0.0
  GitCommit:        v1.0.0-0-g84113ee
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 20.10.7
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7eba5930496d9bbe375fdf71603e610ad737d2b2
 runc version: v1.0.0-0-g84113ee
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-80-generic
 Operating System: Ubuntu 20.04.2 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.665GiB
 Name: node-2
 ID: QEKF:ILSY:KC3P:LQ7N:H5YD:YMWH:AM4C:QSXJ:6646:RIVY:XBVO:2XIS
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
Linux node-2 5.4.0-80-generic  #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Type/provider of hosts: Bare-metal

cluster.yml file:

nodes:
  - address: node-1
    internal_address: 192.168.1.11
    user: server
    role:
      - controlplane
      - etcd
      - worker
  - address: node-2
    internal_address: 192.168.1.199
    user: server
    role:
      - controlplane
      - etcd
      - worker
  - address: node-3
    internal_address: 192.168.1.38
    user: server
    role:
      - controlplane
      - etcd
      - worker

authentication:
  strategy: x509
  sans:
    - "192.168.1.11"
    - "192.168.1.199"
    - "192.168.1.38"
    - "node-1"
    - "node-2"
    - "node-3"
    - "localhost"
    - "127.0.0.1"
    - "127.0.1.1"

ssh_key_path: ~/.ssh/id_rsa

kubernetes_version: v1.20.9-rancher1-1

authorization:
  mode: rbac

network:
  plugin: canal
  # mtu: 1500
  options:
    canal_flannel_backend_type: host-gw

dns:
  provider: coredns

ingress:
  provider: "nginx"
  options:
    use-forwarded-headers: "true"

cluster_name: cluster.local

addons_include:
  - https://download.elastic.co/downloads/eck/1.6.0/all-in-one.yaml

services:
  scheduler:
    extra_args:
      leader-elect: true
      leader-elect-renew-deadline: 20s
      leader-elect-lease-duration: 30s
      leader-elect-retry-period: 4s
  kube-api:
    service_cluster_ip_range: 10.96.0.0/12
    extra_args:
      allow-privileged: true
  kube-controller:
    cluster_cidr: 10.244.0.0/16
    service_cluster_ip_range: 10.96.0.0/12
    extra_args:
      leader-elect: true
      leader-elect-renew-deadline: 20s
      leader-elect-lease-duration: 30s
      leader-elect-retry-period: 4s
      cluster-signing-cert-file: "/etc/kubernetes/ssl/kube-ca.pem"
      cluster-signing-key-file: "/etc/kubernetes/ssl/kube-ca-key.pem"
  kubelet:
    cluster_dns_server: 10.96.0.10
    extra_args:
      pod-manifest-path: "/etc/kubernetes/manifests"
      minimum-container-ttl-duration: "10s"
      kube-reserved: "cpu=300m,memory=1024Mi,ephemeral-storage=1Gi,pid=100"
      system-reserved: "cpu=100m,memory=512Mi,ephemeral-storage=500Mi,pid=100"
    extra_binds:
      - "/opt/local-storage/galera:/opt/local-storage/galera"
      - "/opt/local-storage/galera-backup:/opt/local-storage/galera-backup"
      - "/opt/local-storage/elasticsearch:/opt/local-storage/elasticsearch"
      - "/opt/local-storage/store-volume1:/opt/local-storage/store-volume1"
      - "/opt/local-storage/store-volume2:/opt/local-storage/store-volume2"

Steps to Reproduce:

  1. Build a cluster from the nodes: cluster, node-1, node-2, node-3
    Control node: cluster
    Command: rke up
  2. Remove the cluster
    Command: rke remove
  3. Remove the "cluster" node from the cluster.yml file.
  4. Clean up all nodes and reboot them with the script https://paste.4040.io/agucoqawoz.bash (a sketch of a typical cleanup is shown after this list)
  5. Build a cluster from the nodes: node-1, node-2, node-3
    Control node: node-2
    Command: rke up
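
For context, this is a minimal sketch of the kind of cleanup such a script typically performs (an assumption about the linked script's contents, not a copy of it; the directory list follows the usual RKE node cleanup guidance). Run on every node before rebuilding:

# remove all containers and volumes left over from the previous cluster
docker rm -f $(docker ps -qa) 2>/dev/null
docker volume rm $(docker volume ls -q) 2>/dev/null

# remove Kubernetes/RKE state directories
sudo rm -rf /etc/cni /etc/kubernetes /opt/cni /opt/rke \
  /var/lib/etcd /var/lib/cni /var/lib/kubelet /var/lib/rancher \
  /var/run/calico /run/flannel

# reboot to clear leftover network interfaces, iptables rules and mounts
sudo reboot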

Results:
An error will appear on a random node:

Conflict. The container name "/rke-etcd-port-listener" is already in use by container

INFO[0000] Running RKE version: v1.2.11                 
INFO[0000] Initiating Kubernetes cluster                
INFO[0000] [certificates] GenerateServingCertificate is disabled, checking if there are unused kubelet certificates 
INFO[0000] [certificates] Generating Kubernetes API server certificates 
INFO[0000] [certificates] Generating admin certificates and kubeconfig 
INFO[0000] Successfully Deployed state file at [./cluster.rkestate] 
INFO[0000] Building Kubernetes cluster                  
INFO[0000] [dialer] Setup tunnel for host [node-3]      
INFO[0000] [dialer] Setup tunnel for host [node-2]      
INFO[0000] [dialer] Setup tunnel for host [node-1]      
INFO[0000] [network] Deploying port listener containers 
INFO[0000] Pulling image [rancher/rke-tools:v0.1.77] on host [node-2], try  #1 
INFO[0000] Pulling image [rancher/rke-tools:v0.1.77] on host [node-1], try  #1 
INFO[0000] Pulling image [rancher/rke-tools:v0.1.77] on host [node-3], try  #1 
INFO[0026] Image [rancher/rke-tools:v0.1.77] exists on host [node-2] 
INFO[0026] Image [rancher/rke-tools:v0.1.77] exists on host [node-1] 
WARN[0026] Failed to create Docker container [rke-etcd-port-listener] on host [node-1]: Error response from daemon: Conflict. The container name "/rke-etcd-port-listener" is already in use by container "200e5e3170ff966fcf5f06f25f87be8411f2402b8c54968a838b0df06e68ab3b". You have to remove (or rename) that container to be able to reuse that name. 
WARN[0026] Failed to create Docker container [rke-etcd-port-listener] on host [node-1]: Error response from daemon: Conflict. The container name "/rke-etcd-port-listener" is already in use by container "200e5e3170ff966fcf5f06f25f87be8411f2402b8c54968a838b0df06e68ab3b". You have to remove (or rename) that container to be able to reuse that name. 
WARN[0026] Failed to create Docker container [rke-etcd-port-listener] on host [node-1]: Error response from daemon: Conflict. The container name "/rke-etcd-port-listener" is already in use by container "200e5e3170ff966fcf5f06f25f87be8411f2402b8c54968a838b0df06e68ab3b". You have to remove (or rename) that container to be able to reuse that name. 
INFO[0027] Starting container [rke-etcd-port-listener] on host [node-2], try  #1 
INFO[0027] [network] Successfully started [rke-etcd-port-listener] container on host [node-2] 
INFO[0038] Image [rancher/rke-tools:v0.1.77] exists on host [node-3] 
INFO[0039] Starting container [rke-etcd-port-listener] on host [node-3], try  #1 
INFO[0040] [network] Successfully started [rke-etcd-port-listener] container on host [node-3] 
FATA[0040] [Failed to create [rke-etcd-port-listener] container on host [node-1]: Failed to create Docker container [rke-etcd-port-listener] on host [node-1]: Error response from daemon: Conflict. The container name "/rke-etcd-port-listener" is already in use by container "200e5e3170ff966fcf5f06f25f87be8411f2402b8c54968a838b0df06e68ab3b". You have to remove (or rename) that container to be able to reuse that name.] 
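
The daemon error itself points at the immediate workaround: the leftover rke-etcd-port-listener container has to be removed (or renamed) on the affected node before rke up can recreate it. A hedged sketch, run on node-1 in this case:

# check whether an old port-listener container is still present
docker ps -a --filter name=rke-etcd-port-listener

# remove it so rke up can recreate it under the same name
docker rm -f rke-etcd-port-listener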
hitman249 reopened this Aug 5, 2021
superseb (Contributor) commented Aug 5, 2021

If the nodes were cleaned properly, it seems you are hitting the same issue as #2632. Can you share logging with --debug?

hitman249 (Author) commented Aug 5, 2021

I repeated all the cleaning steps above:

Cleaned up all nodes and rebooted them with the script https://paste.4040.io/agucoqawoz.bash

rke --debug up:
first run https://paste.4040.io/osapocawor.sql
second run https://paste.4040.io/uqugeforav.sql

superseb (Contributor) commented Aug 5, 2021

The title of the issue is "error rebuilding the cluster": with what RKE version, with how many and which nodes, and when was the last time you ran rke up successfully with these nodes (or a subset of them)?

hitman249 (Author) commented Aug 5, 2021

This is a test cluster of 4 nodes; one was taken away, so I had to rebuild the cluster on 3 nodes.
I ran into this problem during the rebuild.

Before rebuilding, rke v1.2.9 + v1.20.8-rancher1-1 was running there.
After rebuilding, rke v1.2.11 + v1.20.9-rancher1-1.

There is a subtlety: I did not immediately notice that the image version had changed in rke v1.2.11, so I tried several times to start rke v1.2.11 + v1.20.8-rancher1-1.

But a complete cleanup of the nodes should have removed those changes.

superseb (Contributor) commented Aug 5, 2021

When was the last time it worked with v1.2.9? Does it work with v1.2.9 now, or do you also get an error? I assume the 3 remaining nodes have stayed the same in between (specification-wise: CPU/memory/disk).

hitman249 (Author) commented

rke v1.2.9 + v1.20.8-rancher1-1 doesn't work either.
The remaining nodes are unchanged:
node-1: https://linux-hardware.org/?probe=7b9adbf809 CPU: 2 cores, 8Gb RAM, SSD
node-2: https://linux-hardware.org/?probe=ee55d78771 CPU: 4 cores, 8Gb RAM, SSD
node-3: https://linux-hardware.org/?probe=1e17007509 CPU: 2 cores, 8Gb RAM, SSD

I repeated all the cleaning steps above:

Cleaned up all nodes and rebooted them with the script https://paste.4040.io/agucoqawoz.bash

rke --debug up:
first run https://paste.4040.io/dumavazobu.sql
second run https://paste.4040.io/xegarojoqu.sql

superseb (Contributor) commented Aug 6, 2021

Not really sure, but if it's not version-dependent, you might still be running into the linked issue above. Can you test storage performance using https://www.ibm.com/cloud/blog/using-fio-to-tell-whether-your-storage-is-fast-enough-for-etcd?
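
For reference, the benchmark in that article boils down to an fio run with fdatasync against the disk that will hold the etcd data. A sketch (test-data is a placeholder directory on that disk):

# write 22 MiB in 2300-byte blocks, syncing after every write, and report
# fdatasync latency percentiles; the usual etcd guideline is a 99th
# percentile below ~10 ms
mkdir test-data
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data \
  --size=22m --bs=2300 --name=etcd-disk-check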

hitman249 (Author) commented Aug 6, 2021

[fio results posted]

superseb (Contributor) commented Aug 6, 2021

This looks good. Did you follow the article and check the values? Did you run it a couple of times to make sure it's consistent?

hitman249 (Author) commented Aug 6, 2021

Sorry, I had only checked it once.
The results improved after several passes:

node-1: https://paste.4040.io/rikelejaro.md
node-2: https://paste.4040.io/uwelijiyib.md
node-3: https://paste.4040.io/qudenulada.md

superseb (Contributor) commented Aug 6, 2021

I don't really have a lead at this moment; those errors could be related to bad disk performance, but that's not the case here. What changed on these nodes since the last successful rke up? Are you sure DNS is correct for node-1/node-2/node-3, as it might conflict? Does it work when you use a single node? Does it work when you create a completely new node and use that for rke up, just to rule out the existing nodes?

hitman249 (Author) commented

The problem turned out to be incorrect DNS entries.
They were filled in automatically and I was sure of their reliability.
Sorry for the false bug report, and thanks for the help.

The hostname-to-IP mappings did not match the internal_address values in cluster.yml:

192.168.1.199  node-1
192.168.1.38   node-2
192.168.1.11   node-3

while cluster.yml has:

nodes:
  - address: node-1
    internal_address: 192.168.1.11
    user: server
    role:
      - controlplane
      - etcd
      - worker
  - address: node-2
    internal_address: 192.168.1.199
    user: server
    role:
      - controlplane
      - etcd
      - worker
  - address: node-3
    internal_address: 192.168.1.38
    user: server
    role:
      - controlplane
      - etcd
      - worker
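
A quick way to catch this kind of mismatch before running rke up is to compare what each hostname actually resolves to on the machine running rke against the addresses in cluster.yml. A sketch (assumes the names come from /etc/hosts or local DNS):

# print what each node name resolves to
for h in node-1 node-2 node-3; do
  getent hosts "$h"
done

# each resolved IP should belong to the machine configured for that name in
# cluster.yml (here: node-1 -> 192.168.1.11, node-2 -> 192.168.1.199,
# node-3 -> 192.168.1.38)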
