
xfsprogs in csi-driver container and on host do not match #2588

Open

kaitimmer opened this issue Nov 7, 2024 · 10 comments

@kaitimmer
What happened:
When we try to mount a new XFS volume to a pod (via a volumeClaimTemplate), we see the following error:

 AttachVolume.Attach succeeded for volume 'pvc-83d914c5-8359-4a5a-b659-ca6d46344792'
  Warning  FailedMount             7s (x7 over 51s)  kubelet                  MountVolume.MountDevice failed for volume 'pvc-83d914c5-8359-4a5a-b659-ca6d46344792' : rpc error: code = Internal desc = could not format /dev/disk/azure/scsi1/lun0(lun: 0), and mount it at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/51acc19d37a450db470a345ce2ef9a54140278ef8cef648ba3d28f187913efbb/globalmount, failed with mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t xfs -o noatime,defaults,nouuid,defaults /dev/disk/azure/scsi1/lun0 /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/51acc19d37a450db470a345ce2ef9a54140278ef8cef648ba3d28f187913efbb/globalmount
Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/51acc19d37a450db470a345ce2ef9a54140278ef8cef648ba3d28f187913efbb/globalmount: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error.
       dmesg(1) may have more information after failed mount system call.

What you expected to happen:

Mounting new volumes should just work.

How to reproduce it:

Create a new disk with the following StorageClass:

allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  labels:
    kustomize.toolkit.fluxcd.io/name: kube-system
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: default-v2-xfs-noatime
  resourceVersion: "4669108598"
  uid: 7e0be630-073f-461c-9513-9dd3131f578c
mountOptions:
- noatime
- defaults
parameters:
  DiskIOPSReadWrite: "3000"
  DiskMBpsReadWrite: "125"
  cachingMode: None
  fstype: xfs
  storageaccounttype: PremiumV2_LRS
provisioner: disk.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

And mount it to a pod.
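For reference, a minimal PVC and pod along these lines is enough to reproduce it (the names and size here are illustrative, not taken from our cluster):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: xfs-test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: default-v2-xfs-noatime
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: xfs-test-pod
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: xfs-test-pvc
EOF

Because the StorageClass uses WaitForFirstConsumer, the disk is only created and formatted once the pod is scheduled; the MountDevice step then fails as shown above.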

Anything else we need to know?:
When we log in to the AKS node and run dmesg, we get the following information:

[42449.910213] XFS (sdb): Superblock has unknown incompatible features (0x20) enabled.
[42449.910217] XFS (sdb): Filesystem cannot be safely mounted by this kernel.
[42449.910228] XFS (sdb): SB validate failed with error -22.

This is the xfs_db output for the new XFS disk on the node:

root@aks-zone1node-35463436-vmss000011:/# xfs_db -r /dev/disk/azure/scsi1/lun8
xfs_db> version
versionnum [0xbca5+0x18a] = V5,NLINK,DIRV2,ALIGN,LOGV2,EXTFLG,SECTOR,MOREBITS,ATTR2,LAZYSBCOUNT,PROJID32BIT,CRC,FTYPE,FINOBT,SPARSE_INODES,RMAPBT,REFLINK,INOBTCNT,BIGTIME

On other nodes with older XFS volumes, the mount still works. The difference in the XFS format is:

xfs_db> version
versionnum [0xbca5+0x18a] = V5,NLINK,DIRV2,ALIGN,LOGV2,EXTFLG,SECTOR,MOREBITS,ATTR2,LAZYSBCOUNT,PROJID32BIT,CRC,FTYPE,FINOBT,SPARSE_INODES,REFLINK,INOBTCNT,BIGTIME

So the new XFS volumes get the RMAPBT feature, which apparently cannot be handled by the kernel in the Ubuntu AKS node image.

Our workaround for now is to log in to the AKS node and reformat the volume with 'mkfs.xfs -f /dev/sdX', as sketched below.
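Roughly, the workaround on the node looks like this (the device path is just an example; double-check which LUN belongs to the broken PVC first, and note that reformatting destroys anything already on the disk):

DEV=/dev/disk/azure/scsi1/lun0     # example device, verify the LUN first
xfs_db -r -c version "$DEV"        # confirm the superblock has RMAPBT set
mkfs.xfs -f "$DEV"                 # reformat with the node's own (older) xfsprogs
# alternatively, a newer mkfs.xfs could be told to skip the feature with -m rmapbt=0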

Also, I would assume that commit ffbeb55 might already be the hotfix for this. So I'm mainly raising this for awareness, so that others do not spend hours tracking down the issue on their end.

I do not think this is the best way to fix this, but it'll do as a quick solution.

A proper solution might be to mount the filesystem tools directly from the host into the container, so that this version mismatch cannot happen again (see the sketch below).
Or do it like this: https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/Dockerfile, which also prevents the mismatch.
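As a rough illustration of the host-tools idea (a hypothetical fragment, not the driver's actual manifest), the node DaemonSet could mount the host's binaries into the container, for example:

# hypothetical fragment of the csi-azuredisk-node DaemonSet, for illustration only
containers:
- name: azuredisk
  volumeMounts:
  - name: host-sbin
    mountPath: /host/sbin
    readOnly: true
volumes:
- name: host-sbin
  hostPath:
    path: /usr/sbin
    type: Directory

The mount/format code would then need to prefer the binaries under /host/sbin (for example via PATH) so that mkfs.xfs always matches the host kernel.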

Can you please cut a new release that includes the fix?

Environment:

  • CSI Driver version: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.30.5
  • Kubernetes version (use kubectl version): 1.30.3
  • OS (e.g. from /etc/os-release): AKSUbuntu-2204gen2containerd-202410.15.0
@andyzhangx
Member

andyzhangx commented Nov 7, 2024

Thanks for providing the info. @kaitimmer the problem is that the AKS node is still using the Ubuntu 5.15 kernel, which does not support the XFS RMAPBT feature (kernel 6.x does), while the new Alpine 3.20.2 image ships an xfsprogs that creates filesystems with RMAPBT by default; that is what causes the incompatibility. I would rather revert to the old Alpine image.
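You can confirm the mismatch like this (illustrative commands; pod and container names depend on how the driver is installed):

# on the AKS node
uname -r          # 5.15.x on the current Ubuntu / Azure Linux 2 node images
# inside the azuredisk container of a csi-azuredisk-node pod
mkfs.xfs -V       # reports 6.8.0 on the v1.30.5 image, 6.2.0 on v1.30.4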

If you have the AKS managed CSI driver and want to revert to v1.30.4, just mail me, thanks.

Note that this bug only impacts new XFS disks created with Azure Disk CSI driver v1.30.5.

@kaitimmer
Author

> If you have the AKS managed CSI driver and want to revert to v1.30.4, just mail me, thanks.

I reached out to you for our specific clusters via Email.
Thanks for your help!

@monotek
Member

monotek commented Nov 7, 2024

@andyzhangx
Can you estimate how long we would need to stay on the old version of the CSI driver Docker image?
I guess this means CSI driver container updates would be disabled for the time being?
Would you advise using Azure Linux instead of Ubuntu?

@andyzhangx
Member

> @andyzhangx Can you estimate how long we would need to stay on the old version of the CSI driver Docker image? I guess this means CSI driver container updates would be disabled for the time being? Would you advise using Azure Linux instead of Ubuntu?

@monotek we will upgrade to the Alpine 3.18.9 base image, which also fixes the CVE; here is the PR: #2590
btw, Azure Linux or Ubuntu makes no difference: neither of the two AKS-supported node images supports the XFS RMAPBT feature, since both are on kernel 5.15.
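For reference, the Dockerfile change is roughly along these lines (a sketch based on the driver's existing Dockerfile, not necessarily the exact contents of #2590):

FROM alpine:3.18.9
RUN apk upgrade --available --no-cache && \
    apk add --no-cache util-linux e2fsprogs e2fsprogs-extra ca-certificates udev xfsprogs xfsprogs-extra btrfs-progs btrfs-progs-extra
# Alpine 3.18 ships xfsprogs 6.2.0, which does not create RMAPBT filesystems by default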

@monotek
Member

monotek commented Nov 7, 2024

Ah, ok, Thanks! :)

We were not sure about that as we saw Kernel 6.6 here too: https://github.com/microsoft/azurelinux/releases/tag/3.0.20240824-3.0

So I guess the AKS nodes would still use Azure Linux 2.x?

@andyzhangx
Member

andyzhangx commented Nov 7, 2024

> Ah, ok, Thanks! :)
>
> We were not sure about that as we saw Kernel 6.6 here too: https://github.com/microsoft/azurelinux/releases/tag/3.0.20240824-3.0
>
> So I guess the AKS nodes would still use Azure Linux 2.x?

@monotek Azure Linux 3.x (preview) is on kernel 6.6, while Azure Linux 2.x is on kernel 5.15

  • not working versions
/ # apk list xfsprogs
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.20/main: No such file or directory
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.20/community: No such file or directory
xfsprogs-6.8.0-r0 x86_64 {xfsprogs} (LGPL-2.1-or-later) [installed]
/ # apk list xfsprogs-extra
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.20/main: No such file or directory
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.20/community: No such file or directory
xfsprogs-extra-6.8.0-r0 x86_64 {xfsprogs} (LGPL-2.1-or-later) [installed]
  • working versions
/ # apk list xfsprogs
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.18/main: No such file or directory
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.18/community: No such file or directory
xfsprogs-6.2.0-r2 x86_64 {xfsprogs} (LGPL-2.1-or-later) [installed]
/ # apk list xfsprogs-extra
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.18/main: No such file or directory
WARNING: opening from cache https://dl-cdn.alpinelinux.org/alpine/v3.18/community: No such file or directory
xfsprogs-extra-6.2.0-r2 x86_64 {xfsprogs} (LGPL-2.1-or-later) [installed]

and unfortunately we cannot pin xfsprogs to the older version on the newer Alpine base image:

#0 1.476 ERROR: unable to select packages:
#0 1.478   xfsprogs-6.8.0-r0:
#0 1.478     breaks: world[xfsprogs=6.2.0-r2]
#0 1.478     satisfies: xfsprogs-extra-6.8.0-r0[xfsprogs]
#0 1.479   xfsprogs-extra-6.8.0-r0:
#0 1.479     breaks: world[xfsprogs-extra=6.2.0-r2]
------
Dockerfile:16
--------------------
  15 |     FROM alpine:3.20.3
  16 | >>> RUN apk upgrade --available --no-cache && \
  17 | >>>     apk add --no-cache util-linux e2fsprogs e2fsprogs-extra ca-certificates udev xfsprogs=6.2.0-r2 xfsprogs-extra==6.2.0-r2 btrfs-progs btrfs-progs-extra

@ctrmcubed

I appreciate the help from @andyzhangx in resetting azuredisk-csi to v1.30.4 on our cluster. However, we restart the cluster daily, and each restart upgrades it to v1.30.5 again.

Is there an enduring solution?

@andyzhangx
Member

@ctrmcubed the hotfix has now been rolled out completely in the northeurope and westeurope regions, please check. We will also roll it out to the other regions next.
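One way to check which image the cluster is running (the DaemonSet name may differ depending on how the driver is installed):

kubectl -n kube-system get ds csi-azuredisk-node \
  -o jsonpath='{.spec.template.spec.containers[*].image}'; echo
# the fix is included once this shows ...azuredisk-csi:v1.30.6 (or v1.30.4)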

@andyzhangx
Member

andyzhangx commented Nov 20, 2024

btw, the issue only affects formatting of new XFS PVC disks (there is no data-loss risk here). If your cluster has been restored with the fix (CSI driver v1.30.6 or v1.30.4), you need to delete the existing broken XFS PVCs and then create new PVCs again with the fixed CSI driver version. (Only Azure Disk CSI driver v1.30.5 is broken here.)
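For example (names are placeholders; make sure nothing still needs the data on the broken volume):

# after the cluster is on the fixed driver version
kubectl delete pvc <broken-xfs-pvc> -n <namespace>
# for a StatefulSet volumeClaimTemplate, also delete the pod so the claim is recreated
kubectl delete pod <pod-using-the-pvc> -n <namespace>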

@ctrmcubed

> @ctrmcubed the hotfix has now been rolled out completely in the northeurope and westeurope regions, please check. We will also roll it out to the other regions next.

Confirmed this is now working in my region using v1.30.6.
