long reboot time due to nvme reconnect #1764

Open
todeb opened this issue Oct 28, 2024 · 10 comments



todeb commented Oct 28, 2024

Describe the bug
After initiating a reboot of a k8s node running Mayastor, the OS spent about 8 minutes trying to reconnect NVMe devices before shutting down.

To Reproduce
Initiate a reboot of the OS.

Expected behavior
I don't know how these NVMe devices are handled, but IMO they should not block the system reboot, especially if they are failing to connect.

Screenshots

Oct 24 14:56:38 node1 systemd[1]: Unmounted var-lib-kubelet-pods-c4376117\x2d0426\x2d4eda\x2d8590\x2d5d9fbc8c8a5d-volumes-kubernetes.io\x7ecsi-pvc\x2d81c7ff78\x2d70b7\x2d4cab\x2db72e\x2d76bfd6888e42-mount.mount - /var/lib/>
Oct 24 14:56:38 node1 systemd[1]: var-lib-kubelet-pods-73932d89\x2d7b3a\x2d4ecd\x2da250\x2d94292beb8bbf-volumes-kubernetes.io\x7ecsi-pvc\x2d793eefb1\x2d829a\x2d4e3f\x2da2db\x2d5108f865e410-mount.mount: Mount process still >
Oct 24 14:56:38 node1 systemd[1]: var-lib-kubelet-pods-73932d89\x2d7b3a\x2d4ecd\x2da250\x2d94292beb8bbf-volumes-kubernetes.io\x7ecsi-pvc\x2d793eefb1\x2d829a\x2d4e3f\x2da2db\x2d5108f865e410-mount.mount: Failed with result '>
Oct 24 14:56:38 node1 systemd[1]: var-lib-kubelet-pods-73932d89\x2d7b3a\x2d4ecd\x2da250\x2d94292beb8bbf-volumes-kubernetes.io\x7ecsi-pvc\x2d793eefb1\x2d829a\x2d4e3f\x2da2db\x2d5108f865e410-mount.mount: Unit process 2456087>
Oct 24 14:56:38 node1 systemd[1]: Unmounted var-lib-kubelet-pods-73932d89\x2d7b3a\x2d4ecd\x2da250\x2d94292beb8bbf-volumes-kubernetes.io\x7ecsi-pvc\x2d793eefb1\x2d829a\x2d4e3f\x2da2db\x2d5108f865e410-mount.mount - /var/lib/>
Oct 24 14:56:38 node1 systemd[1]: Stopped target local-fs-pre.target - Preparation for Local File Systems.
Oct 24 14:56:38 node1 systemd[1]: Reached target umount.target - Unmount All Filesystems.
Oct 24 14:56:38 node1 systemd[1]: Stopping lvm2-monitor.service - Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
Oct 24 14:56:38 node1 systemd[1]: Stopping multipathd.service - Device-Mapper Multipath Device Controller...
Oct 24 14:56:38 node1 multipathd[3636859]: multipathd: shut down
Oct 24 14:56:38 node1 systemd[1]: systemd-tmpfiles-setup-dev.service: Deactivated successfully.
Oct 24 14:56:38 node1 systemd[1]: Stopped systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev.
Oct 24 14:56:38 node1 systemd[1]: systemd-tmpfiles-setup-dev-early.service: Deactivated successfully.
Oct 24 14:56:38 node1 systemd[1]: Stopped systemd-tmpfiles-setup-dev-early.service - Create Static Device Nodes in /dev gracefully.
Oct 24 14:56:38 node1 kernel: block nvme0n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme0n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme1n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme1n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme2n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme2n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme3n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme3n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme5n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 kernel: block nvme5n1: no usable path - requeuing I/O
Oct 24 14:56:38 node1 systemd[1]: multipathd.service: Deactivated successfully.
Oct 24 14:56:38 node1 systemd[1]: Stopped multipathd.service - Device-Mapper Multipath Device Controller.
Oct 24 14:56:38 node1 systemd[1]: multipathd.service: Consumed 1min 20.798s CPU time, 19.0M memory peak, 0B memory swap peak.
Oct 24 14:56:38 node1 systemd[1]: systemd-remount-fs.service: Deactivated successfully.
Oct 24 14:56:38 node1 systemd[1]: Stopped systemd-remount-fs.service - Remount Root and Kernel File Systems.
Oct 24 14:56:40 node1 kernel: tasks_rcu_exit_srcu_stall: rcu_tasks grace period number 1785 (since boot) gp_state: RTGS_POST_SCAN_TASKLIST is 460794 jiffies old.
Oct 24 14:56:40 node1 kernel: Please check any exiting tasks stuck between calls to exit_tasks_rcu_start() and exit_tasks_rcu_finish()
Oct 24 14:56:40 node1 kernel: nvme nvme7: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme0: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme8: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme8: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme8: Reconnecting in 10 seconds...
Oct 24 14:56:40 node1 kernel: nvme nvme1: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme1: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme1: Reconnecting in 10 seconds...
Oct 24 14:56:40 node1 kernel: nvme nvme5: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme5: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme5: Reconnecting in 10 seconds...
Oct 24 14:56:40 node1 kernel: nvme nvme3: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme3: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme3: Reconnecting in 10 seconds...
Oct 24 14:56:40 node1 kernel: nvme nvme6: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme6: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme6: Reconnecting in 10 seconds...
Oct 24 14:56:40 node1 kernel: nvme nvme2: failed to connect socket: -111
Oct 24 14:56:40 node1 kernel: nvme nvme2: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme2: Reconnecting in 10 seconds...
Oct 24 14:56:40 node1 kernel: nvme nvme7: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme0: Failed reconnect attempt 46
Oct 24 14:56:40 node1 kernel: nvme nvme7: Reconnecting in 10 seconds...
Oct 24 14:56:40 node1 kernel: nvme nvme0: Reconnecting in 10 seconds...
Oct 24 14:56:50 node1 kernel: tasks_rcu_exit_srcu_stall: rcu_tasks grace period number 1785 (since boot) gp_state: RTGS_POST_SCAN_TASKLIST is 471034 jiffies old.
Oct 24 14:56:50 node1 kernel: Please check any exiting tasks stuck between calls to exit_tasks_rcu_start() and exit_tasks_rcu_finish()
Oct 24 14:56:50 node1 kernel: nvme nvme6: failed to connect socket: -111
Oct 24 14:56:50 node1 kernel: nvme nvme5: failed to connect socket: -111
Oct 24 14:56:50 node1 kernel: nvme nvme1: failed to connect socket: -111
Oct 24 14:56:50 node1 kernel: nvme nvme1: Failed reconnect attempt 47
Oct 24 14:56:50 node1 kernel: nvme nvme1: Reconnecting in 10 seconds...
Oct 24 14:56:50 node1 kernel: nvme nvme0: failed to connect socket: -111
Oct 24 14:56:50 node1 kernel: nvme nvme0: Failed reconnect attempt 47
Oct 24 14:56:50 node1 kernel: nvme nvme0: Reconnecting in 10 seconds...
Oct 24 14:56:50 node1 kernel: nvme nvme2: failed to connect socket: -111
Oct 24 14:56:50 node1 kernel: nvme nvme2: Failed reconnect attempt 47
Oct 24 14:56:50 node1 kernel: nvme nvme2: Reconnecting in 10 seconds...
Oct 24 14:56:50 node1 kernel: nvme nvme7: failed to connect socket: -111
.....
Oct 24 15:04:41 node1 kernel: nvme nvme0: failed to connect socket: -111
Oct 24 15:04:41 node1 kernel: nvme nvme0: Failed reconnect attempt 93
Oct 24 15:04:41 node1 kernel: nvme nvme0: Reconnecting in 10 seconds...
Oct 24 15:04:41 node1 kernel: nvme nvme6: Failed reconnect attempt 93
Oct 24 15:04:41 node1 kernel: nvme nvme5: Reconnecting in 10 seconds...
Oct 24 15:04:41 node1 kernel: nvme nvme6: Reconnecting in 10 seconds...
Oct 24 15:04:49 node1 systemd-shutdown[1]: Syncing filesystems and block devices - timed out, issuing SIGKILL to PID 2456488.
Oct 24 15:04:49 node1 systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Oct 24 15:04:49 node1 systemd-journald[734]: Journal stopped

OS info (please complete the following information):

  • Distro: Ubuntu 24.04.1 LTS
  • Kernel version: 6.8.0-47-generic
  • MayaStor revision or container image: openebs.io/version=2.7.0


@tiagolobocastro (Contributor) commented:

Did you reboot without draining the node? The nvme kernel initiator will keep trying to connect for some time.
If you really want to do this, then you need to do something like this on the host:

for dev in /sys/class/nvme/*/ctrl_loss_tmo; do echo 10 | sudo tee -a $dev; done
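
To check what values are currently in effect (read-only, just for reference):

grep -H . /sys/class/nvme/*/ctrl_loss_tmo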

But I suggest you drain the node before rebooting; otherwise the filesystems may not unmount gracefully, which could potentially result in data loss.


todeb commented Oct 28, 2024

Do you mean the kubectl-mayastor drain command or the usual kubectl drain to evict pods?
I'm not draining the node because I'm using a single replica and my apps handle replication themselves, so I don't want them to be scheduled on other nodes.

@tiagolobocastro (Contributor) commented:

Do you mean the kubectl-mayastor drain command or the usual kubectl drain to evict pods?

I mean the usual kubectl drain to evict the pods using the mayastor volumes to other nodes.
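
For reference, the usual form is something like this (the node name below is just a placeholder):

kubectl drain node1 --ignore-daemonsets --delete-emptydir-data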

I'm not draining the node because I'm using a single replica and my apps handle replication themselves, so I don't want them to be scheduled on other nodes.

Ah I see... are you using the GracefulNodeShutdown feature?
IIRC that should ensure the apps are stopped gracefully, but the volumeattachment and connections would still remain? CC @Abhinandan-Purkait
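
A quick way to check whether it is configured, assuming the kubelet config lives at /var/lib/kubelet/config.yaml (the feature is driven by the shutdownGracePeriod and shutdownGracePeriodCriticalPods fields of the KubeletConfiguration):

grep -E 'shutdownGracePeriod' /var/lib/kubelet/config.yaml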


Abhinandan-Purkait commented Oct 29, 2024

Ah I see... are you using the GracefulNodeShutdown feature? IIRC that should ensure the apps are stopped gracefully, but the volumeattachment and connections would still remain? CC @Abhinandan-Purkait

Yes, that's true. IIRC we need to remove the attachments manually.
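
Something along these lines should do it (the attachment name is just a placeholder):

kubectl get volumeattachments
kubectl delete volumeattachment <attachment-name>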


todeb commented Oct 29, 2024

not using it.


todeb commented Dec 3, 2024

My Kubernetes node lost network connectivity for a while; now I see a lot of these errors in dmesg:

[Tue Dec  3 16:33:14 2024] Please check any exiting tasks stuck between calls to exit_tasks_rcu_start() and exit_tasks_rcu_finish()
[Tue Dec  3 16:33:24 2024] tasks_rcu_exit_srcu_stall: rcu_tasks grace period number 5445 (since boot) gp_state: RTGS_POST_SCAN_TASKLIST is 707338240 jiffies old.
[Tue Dec  3 16:33:24 2024] Please check any exiting tasks stuck between calls to exit_tasks_rcu_start() and exit_tasks_rcu_finish()
[Tue Dec  3 16:33:34 2024] tasks_rcu_exit_srcu_stall: rcu_tasks grace period number 5445 (since boot) gp_state: RTGS_POST_SCAN_TASKLIST is 707348480 jiffies old.
[Tue Dec  3 16:33:34 2024] Please check any exiting tasks stuck between calls to exit_tasks_rcu_start() and exit_tasks_rcu_finish()
[Tue Dec  3 16:33:44 2024] tasks_rcu_exit_srcu_stall: rcu_tasks grace period number 5445 (since boot) gp_state: RTGS_POST_SCAN_TASKLIST is 707358720 jiffies old.
[Tue Dec  3 16:33:44 2024] Please check any exiting tasks stuck between calls to exit_tasks_rcu_start() and exit_tasks_rcu_finish()
[Tue Dec  3 16:33:55 2024] tasks_rcu_exit_srcu_stall: rcu_tasks grace period number 5445 (since boot) gp_state: RTGS_POST_SCAN_TASKLIST is 707368960 jiffies old.
[Tue Dec  3 16:33:55 2024] Please check any exiting tasks stuck between calls to exit_tasks_rcu_start() and exit_tasks_rcu_finish()
[Tue Dec  3 16:34:05 2024] tasks_rcu_exit_srcu_stall: rcu_tasks grace period number 5445 (since boot) gp_state: RTGS_POST_SCAN_TASKLIST is 707379200 jiffies old.
[Tue Dec  3 16:34:05 2024] Please check any exiting tasks stuck between calls to exit_tasks_rcu_start() and exit_tasks_rcu_finish()

Can it be related to NVMe / Mayastor?

@tiagolobocastro (Contributor) commented:

Hi @todeb ,

I'm not sure about those errors above; I've never seen them.
But regarding the long reboot:

I can reproduce this quite easily. I think the problem is that the network is brought down before the NVMe devices are disconnected.
This is a problem because the devices get into the connecting state and are not able to reconnect, since the network is already down.

I was able to somewhat address this with a systemd service, like so:

[Unit]
Description=Cleanup leftover NVMe-Of connections
DefaultDependencies=no
Requires=local-fs.target
Wants=network.target

[Service]
Type=oneshot
ExecStop=/var/local/mayastor/nvmf-cleanup.sh
TimeoutStopSec=30
RemainAfterExit=true
Environment="CTRL_LOSS_TMO=10"

[Install]
WantedBy=local-fs.target

And script:

#!/usr/bin/env bash

shopt -s nullglob

CTRL_LOSS_TMO=${CTRL_LOSS_TMO:-10}

set_all_ctrl_loss_tmo() {
  for ctrl in /sys/class/nvme/*; do
    if ! grep -q "Mayastor NVMe controller" "$ctrl/model"; then
      # not our device
      continue 
    fi
    # set controller loss timeout so a broken device can be removed after the timeout
    set_ctrl_loss_tmo $ctrl $CTRL_LOSS_TMO
  done
}
set_ctrl_loss_tmo() {
  ctrl=$1
  ctrl_tmo=$ctrl/ctrl_loss_tmo
  ctrl_delay=$ctrl/reconnect_delay
  tmo=$2
  dl=5
  if [ -f $ctrl_tmo ]; then
    echo "Setting ReconnectDelay of $dl for $ctrl" 
    echo $dl | tee -a $ctrl_delay >/dev/null
    echo "Setting CtrlLossTmo of $tmo for $ctrl" 
    echo $tmo | tee -a $ctrl_tmo >/dev/null
  fi
}

device_from_ctrl() {
  local input=$1
  if [[ $input =~ ^nvme([0-9]{1,5})c[0-9]{1,5}n([0-9]{1,5})$ ]]; then
    echo "nvme${BASH_REMATCH[1]}n${BASH_REMATCH[2]}"
    return 0
  fi
  return 1
}

wait_fs_device_off() {
  dev_name=$1

  count=0

  while true; do
    files=(/sys/fs/*/$dev_name)
    if [ ${#files[@]} -eq 0 ]; then
      break
    fi
    echo "File still exists, waiting..."
    count=$((count+1))
    if [ $count -ge 10 ]; then
      echo "Maximum iterations reached, exiting."
      break
    fi
    sleep 0.1
  done 
}

cleanup_device() {
  local ctrl=$1
  local dev_name=$2
  local dev=$3
  findmnt $dev

  local mounts="no"
  for mnt in $(findmnt $dev -no target); do
    echo "Found $mnt for device $dev"
    mounts="yes"
  done

  if [ "$mounts" = "yes" ]; then
    echo -n "Unmounting $dev... "
    # We've set the ctrl loss timeout, so shouldn't lock up?
    umount $dev -A &
    echo "done"
  fi
  findmnt $dev

  wait_fs_device_off $dev_name

  echo -n "Deleting controller $ctrl... "
  echo 1 > $ctrl/delete_controller
  echo "done"
}

cleanup_nvmf() {
  # todo: may want to do multiple devices at once
  for ctrl in /sys/class/nvme/*; do
    if ! grep -q "Mayastor NVMe controller" "$ctrl/model"; then
      # not our device
      continue 
    fi
    for match in $ctrl/nvme*c*n*; do
      echo -n "Finding block device for $ctrl... "
      local dev_name
      # device_from_ctrl prints nothing on failure, so just check for a non-empty name
      dev_name=$(device_from_ctrl "${match##*/}")
      local dev="/dev/$dev_name"
      if [ -n "$dev_name" ]; then
        echo "$dev"
        cleanup_device $ctrl $dev_name $dev
      else
        echo "failed"
      fi
    done
  done
}

echo "Starting NVMf Cleanup"

# set the ctrl loss timeout right away
set_all_ctrl_loss_tmo

# cleanup all mounts and controllers
cleanup_nvmf

echo "Done"

If you can give this a try, let me know how it goes!


todeb commented Dec 12, 2024

Looks very good, sure, I will check this.

Btw, about the load: I noticed it is pretty well correlated with the I/O time of the device Mayastor is using, which already went from <1% to more than 4%.

[attached screenshot: device I/O time graph]

Btw, what would be the consequence of stopping/killing the io-engine? Would pods just lose access to storage for a few seconds?

@tiagolobocastro (Contributor) commented:

Btw, what would be the consequence of stopping/killing the io-engine? Would pods just lose access to storage for a few seconds?

Yeah, the path will go into the connecting state, and IO may freeze until it comes back up or another path takes its place.
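
To see what state each controller is in, something like this works:

# controller states (live / connecting / deleting / ...)
cat /sys/class/nvme/*/state
# or, with nvme-cli, list subsystems and their paths
sudo nvme list-subsys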

@tiagolobocastro (Contributor) commented:

I've got an improvement on the scripts (I think :)).
I'm also realising we might need 2 services: one to keep the network online and the other to disconnect gracefully after the pod containers and filesystems are down.
Anyway, this is already better; I'm now quickly rebooting a node with 100 volumes.

The service:

[Unit]
Description=Cleanup leftover NVMe-Of connections
Wants=network-online.target remote-fs-pre.target
After=network-online.target
Before=remote-fs-pre.target
DefaultDependencies=no
Conflicts=shutdown.target
Before=shutdown.target

[Service]
Type=oneshot
ExecStop=/root/nvmf-cleanup.sh
TimeoutStopSec=120
RemainAfterExit=true
Environment="CTRL_LOSS_TMO=1"
Environment="REC_DELAY=1"

[Install]
WantedBy=multi-user.target

The script:

#!/usr/bin/env bash

shopt -s nullglob

CTRL_LOSS_TMO=${CTRL_LOSS_TMO:-1}
REC_DELAY=${REC_DELAY:-1}

set_ctrl_loss_tmo() {
  ctrl=$1
  ctrl_tmo=$ctrl/ctrl_loss_tmo
  ctrl_delay=$ctrl/reconnect_delay
  tmo=$2
  dl=$3
  if [ -f $ctrl_tmo ]; then
    echo "Setting ReconnectDelay of $dl for $ctrl" 
    echo $dl | tee -a $ctrl_delay >/dev/null
    echo "Setting CtrlLossTmo of $tmo for $ctrl" 
    echo $tmo | tee -a $ctrl_tmo >/dev/null
  fi
}

device_from_ctrl() {
  local input=$1
  if [[ $input =~ ^nvme([0-9]{1,5})c[0-9]{1,5}n([0-9]{1,5})$ ]]; then
    echo "nvme${BASH_REMATCH[1]}n${BASH_REMATCH[2]}"
    return 0
  fi
  return 1
}

remount_dev_ro() {
  local ctrl=$1
  local dev_name=$2
  local dev=$3

  for mnt in $(findmnt $dev -no target); do
    echo "Found $mnt for device $dev"
    nohup sh -c "mount -o remount,ro $mnt" &
  done

  systemd-umount $dev --no-block

  # set controller loss timeout so a broken device can be removed after the timeout
  set_ctrl_loss_tmo $ctrl $CTRL_LOSS_TMO $REC_DELAY
  
  nohup sh -c "umount $dev" &
}

delete_controller() {
  local ctrl=$1
  local dev_name=$2
  local dev=$3

  echo -n "Deleting controller $ctrl... "
  nohup sh -c "echo 1 > $ctrl/delete_controller" &
  echo "done"
}

cleanup_nvmf() {
  # todo: may want to do multiple devices at once
  devices=()
  for ctrl in /sys/class/nvme/*; do
    if ! grep -q "Mayastor NVMe controller" "$ctrl/model"; then
      # not our device
      continue 
    fi
    for match in $ctrl/nvme*c*n*; do
      echo -n "Finding block device for $ctrl... "
      local dev_name
      # device_from_ctrl prints nothing on failure, so just check for a non-empty name
      dev_name=$(device_from_ctrl "${match##*/}")
      local dev="/dev/$dev_name"
      if [ -n "$dev_name" ]; then
        echo "$dev"
        remount_dev_ro $ctrl $dev_name $dev
        devices+=("$dev_name")
      else
        echo "failed"
      fi
    done
  done
  sleep $REC_DELAY
  sleep $CTRL_LOSS_TMO
  sleep $CTRL_LOSS_TMO
  for ctrl in /sys/class/nvme/*; do
    if ! grep -q "Mayastor NVMe controller" "$ctrl/model"; then
      # not our device
      continue 
    fi
    for match in $ctrl/nvme*c*n*; do
      echo -n "Finding block device for $ctrl... "
      local dev_name
      # device_from_ctrl prints nothing on failure, so just check for a non-empty name
      dev_name=$(device_from_ctrl "${match##*/}")
      local dev="/dev/$dev_name"
      if [ -n "$dev_name" ]; then
        echo "$dev"
        delete_controller $ctrl $dev_name $dev
      else
        echo "failed"
      fi
    done
  done
  echo "Waiting for devices: ${devices[@]}"
  local leftovers="no"    
  for n in $(seq 1 10); do
    sleep 1
    leftovers="no"    
    for dev_name in "${devices[@]}"; do
      files=(/sys/fs/*/$dev_name)
      echo "For $dev_name, found: ${files[@]}"
      if [ ${#files[@]} -eq 0 ]; then
        files=(/proc/fs/jbd2/$dev_name-*)
        if [ ${#files[@]} -eq 0 ]; then
          continue
        fi
      fi
      leftovers="yes"
      break
    done
    if [ "$leftovers" = "no" ]; then
      break
    fi
  done
  if [ "$leftovers" = "yes" ]; then
    echo "Timed out waiting for all devices..." 
  fi
}

echo "Starting NVMf Cleanup"

cleanup_nvmf

echo "Finished NVMf Cleanup"
