Case: Sysroot on Btrfs

Andrei Pozolotin edited this page Apr 26, 2020 · 4 revisions

based on issue #36

The problems

We need a backup system for a custom Linux-based NAS. This is a SOHO application, so the solution must be inexpensive yet reliable enough, and fully encrypted, up to plausible deniability.

Features needed and their implementation

  • Because it is a backup, it cannot be on premises so we need some way to control it remotely. This is done at 3 different levels:
    • HW power up: wake on LAN or any out of band controller
    • boot up: is implemented in initramfs and is the main focus of this page
    • running: standard ssh and other usual linux tools.
  • Inexpensive hard disks: common SATA hard drives
  • Reliability as in: we know that at some stage some part of the PC will fail, but we still want to get our data back and salvage as much as possible from the current machine, if it makes sense. That, with other factors, means SW RAID.
  • Security: we want to encrypt as much as possible: rootfs, swap, backup-data. Of course, that means we need to be able to decrypt the rootfs somehow, both locally and remotely. Hiding partitions in an encrypted LUKS container is well documented, so that's not an issue.

I know only one FS that:

  • is supported natively by linux (because we don't have an IT department to patch it)
  • offers SW RAID out of the box
  • supports various device geometries to be added and removed to/from the same FS

and that's BTRFS.

Now I know that it comes with some caveats as far as RAID5 and 6 are concerned, but these look manageable compared to the price of a more elaborate and expensive solution.

System boot up

So basically a system startup looks like this:

  1. Power up the machine via ILO
  2. The machine starts GRUB
  3. GRUB waits 3 secs and then boots the default OS (linux, with our custom kernel parameters)
  4. Linux starts and runs our initramfs
  5. Our initramfs runs systemd
  6. systemd asks for a password both via a SSH server for remote users and a console for a local operator
  7. The correct password causes the rootfs to be decrypted and mounted to /sysroot
  8. initramfs switches root to /sysroot to boot the real system
  9. real system decrypts and mounts the backup-data fs
  10. taaaaadaaaaa

The solutions

System layout

Some other considerations have gone into this design but are out of scope here. The result is that we decided to apply the same partition layout to each and every hard drive in the system.

About LVM: we decided to put rootfs and swap in the same LUKS container and split them using LVM. I think we liked the flexibility of LVM compared to partition tables...

| Device | Size | Type | FS | Mounted | Redundancy |
|---|---|---|---|---|---|
| /dev/sdX1 | 256 MB | BIOS / UEFI partition | BIOS / UEFI | n/a for bios | n/a |
| /dev/sdX2 | 256 MB | "keys" partition | ext4 on LUKS | /root/keys/diskXX/ | rsync script |
| /dev/sdX3 | 512 MB | "/boot" partition | ext4 | /boot (for the active one), /root/boot/diskXX/ (for all) | rsync script |
| /dev/sdX4 | 60 GB | "sysYY" LUKS | LVM on LUKS | /dev/mapper/sysYY | n/a |
| | 50 GB | LVM logical volume "rootYYvg-root" | BTRFS raid5 | / (subvol @), /home (subvol @home) | BTRFS |
| | 10 GB | LVM logical volume "rootYYvg-swap" | swap | n/a | |
| /dev/sdX5 | rest | bakYY LUKS | BTRFS raid5 on LUKS | /srv/main_backup (subvol @<date>) | BTRFS |

Decrypting a multidevice FS with a single password

Say we have 3 HDDs in our box: we don't want to prompt the user for 3 passwords. There are several strategies, but I did not find one that worked well enough for us both remotely and locally. So we came up with the following idea: the given password does not directly unlock the rootfs partition but instead a dedicated "keys" partition that contains the keys to each and every "sysYY" LUKS container. So the system boot-up really looks like:

  1. Our initramfs runs systemd
    1. systemd parses [initramfs]/etc/crypttab and sees a first LUKS container to decrypt with key "none", which means it requires the user to enter a password
  2. Get a password from user
  3. The correct password decrypts the "keys" LUKS container
    1. Decrypted /dev/mapper/keys is mounted to initramfs /root/keys
    2. systemd parses the next line in [initramfs]/etc/crypttab which says that "sys00" is a LUKS container to be decrypted with the now available key file /root/keys/sys00
    3. then "sys01"
    4. then "sys02"
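The ordering above can be read straight off the crypttab. As a minimal sketch (not part of the actual setup; `parse_crypttab` and `unlock_plan` are hypothetical helper names), the following splits the entries into those that prompt for a password (key field "none" or "-") and those unlocked with key files stored on the keys filesystem. The sample data is the crypttab from the complete solution further down, with the keys filesystem assumed to be mounted at /root/keys/source as in that fstab.

```python
from collections import namedtuple

CryptEntry = namedtuple("CryptEntry", "name device key options")


def parse_crypttab(text):
    """Parse crypttab text into entries, skipping blank lines and comments."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # name and device are mandatory
        # The key and options fields are optional in crypttab.
        key = fields[2] if len(fields) > 2 else "none"
        options = fields[3] if len(fields) > 3 else ""
        entries.append(CryptEntry(fields[0], fields[1], key, options))
    return entries


def unlock_plan(entries, keys_mount):
    """Return (names needing a password, names keyed from files under keys_mount)."""
    prefix = keys_mount.rstrip("/") + "/"
    prompted = [e.name for e in entries if e.key in ("none", "-")]
    from_keys = [e.name for e in entries if e.key.startswith(prefix)]
    return prompted, from_keys


CRYPTTAB = """\
# Keys disk
keys   UUID=0d0e6aa4-1234-1234-1234-081b5b0a57b8  none                       luks
# System disks
sys00  UUID=9286a856-1234-1234-1234-982e2574c318  /root/keys/source/s00.key  luks
sys01  UUID=9eedc985-1234-1234-1234-c526da7479dc  /root/keys/source/s01.key  luks
sys02  UUID=db94e9a4-1234-1234-1234-2264a5c5c8f9  /root/keys/source/s02.key  luks
"""

prompted, from_keys = unlock_plan(parse_crypttab(CRYPTTAB), "/root/keys/source")
print("ask a password for:", prompted)
print("unlock with key files:", from_keys)
```

Only one container ends up in the first list, which is the whole point: a single prompt, then everything else follows from the keys filesystem.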

To make sure systemd does not attempt to mount /sysroot before the keys are available, you need to remove /sysroot from the [initramfs]/etc/fstab and express it as an explicit mount unit with a "Requires" constraint on the mount of the "keys" fs.

Oddities

For some reason, I have not managed to get this setup working with /sysroot in fstab... So I needed to express it as an explicit mount unit.

I have tried numerous combinations and a lot of things are pretty odd with systemd. For example, mount units seem to ignore After= and Requires= constraints and will attempt to mount as soon as the device is available.

I really wish /sysroot could stay in fstab too.

Synchronization problems

Every time a "sysYY" is decrypted, the following happens:

  • /dev/mapper/sysYY is added to /dev and is scanned by udev
  • udev discovers that it contains an LVM signature and triggers an LVM scan
  • the LVM scanning adds /dev/mapper/rootYYvg-root to /dev and again udev scans it
  • udev discovers that it contains a BTRFS signature and triggers a BTRFS scan
  • the BTRFS scanning adds /dev/disk/by-uuid/c081646d-1234-1234-1234-2d821fbea470 to /dev

There is however a pitfall here, due to BTRFS RAID devices sharing a common UUID.

Take a look at the system once it is booted:

NAME                FSTYPE       SIZE UUID                                   FSAVAIL FSUSE% MOUNTPOINT
sda                              2.7T
├─sda1                           256M
├─sda2              crypto_LUKS  256M 0d0e6aa4-1234-1234-1234-081b5b0a57b8
│ └─keys            ext4         254M e2e4c5ab-1234-1234-1234-189b1dca208a                  /root/keys/disk00
├─sda3              ext4         512M 6e923908-1234-1234-1234-9f3de4205460                  /root/boot/disk00
├─sda4              crypto_LUKS   60G 9286a856-1234-1234-1234-982e2574c318
│ └─sys00           LVM2_member   60G vinItc-hjDE-lP45-1234-1234-1234-2fYSPO
│   ├─root00vg-swap swap          10G 60e4b89f-1234-1234-1234-7001a1a4d6eb                  [SWAP]
│   └─root00vg-root btrfs         50G c081646d-1234-1234-1234-2d821fbea470     95.7G     2% /home
└─sda5              crypto_LUKS  2.7T caac0f4d-1234-1234-1234-d5dc13b113d8
  └─bak00           btrfs        2.7T bf21d462-1234-1234-1234-15b3a1f6ec9a
sdb                              2.7T
├─sdb1                           256M
├─sdb2              crypto_LUKS  256M 3cd815f1-1234-1234-1234-3f8149685a2b
│ └─keys01          ext4         254M 6356506a-1234-1234-1234-e3a5967c10b2                  /root/keys/disk01
├─sdb3              ext4         512M 7e8a0dbf-1234-1234-1234-64c51f702c67                  /root/boot/disk01
├─sdb4              crypto_LUKS   60G 9eedc985-1234-1234-1234-c526da7479dc
│ └─sys01           LVM2_member   60G iDUNBk-AEFD-XcHp-1234-1234-1234-0nCxbG
│   ├─root01vg-swap swap          10G 211d8057-1234-1234-1234-2ae2394294ec                  [SWAP]
│   └─root01vg-root btrfs         50G c081646d-1234-1234-1234-2d821fbea470
└─sdb5              crypto_LUKS  2.7T 36ca46d8-1234-1234-1234-2e13776ecc1a
  └─bak01           btrfs        2.7T bf21d462-1234-1234-1234-15b3a1f6ec9a
sdc                              2.7T
├─sdc1                           256M
├─sdc2              crypto_LUKS  256M 0a1806e0-1234-1234-1234-6d68a14bc782
│ └─keys02          ext4         254M e8cbfa1c-1234-1234-1234-42163c55aadc                  /root/keys/disk02
├─sdc3              ext4         512M e3e6de8f-1234-1234-1234-1335f3beb562                  /root/boot/disk02
├─sdc4              crypto_LUKS   60G db94e9a4-1234-1234-1234-2264a5c5c8f9
│ └─sys02           LVM2_member   60G B78tVH-0VoC-vUsf-1234-1234-1234-wQgdYa
│   ├─root02vg-swap swap          10G 41561260-1234-1234-1234-3d3a4f9ea244                  [SWAP]
│   └─root02vg-root btrfs         50G c081646d-1234-1234-1234-2d821fbea470
└─sdc5              crypto_LUKS  2.7T 00780e65-1234-1234-1234-afd6f1dbea29
  └─bak02           btrfs        2.7T bf21d462-1234-1234-1234-15b3a1f6ec9a      4.4T    12% /srv/main_backup

As you can see, the root BTRFS volumes on all three physical disks share the same UUID, c081646d-1234-1234-1234-2d821fbea470, even though they sit in LVM logical volumes with different UUIDs, themselves embedded in LUKS containers with different UUIDs. The same goes for the bakYY volumes: they all share the UUID bf21d462-1234-1234-1234-15b3a1f6ec9a, and they don't use LVM.
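To spot such shared UUIDs without reading the tree by eye, a small sketch like the following could group BTRFS members by UUID. It assumes input shaped like the output of `lsblk --json -o NAME,FSTYPE,UUID`; the sample below is a hand-built, abbreviated stand-in for the listing above (LUKS and LVM UUIDs left out), and `btrfs_members` and `disk` are hypothetical helper names.

```python
import json
from collections import defaultdict


def btrfs_members(lsblk_json):
    """Map each BTRFS filesystem UUID to the block devices that carry it."""
    groups = defaultdict(list)

    def walk(nodes):
        for node in nodes:
            if node.get("fstype") == "btrfs" and node.get("uuid"):
                groups[node["uuid"]].append(node["name"])
            walk(node.get("children") or [])

    walk(json.loads(lsblk_json)["blockdevices"])
    return dict(groups)


ROOT_UUID = "c081646d-1234-1234-1234-2d821fbea470"


def disk(letter, index):
    """Build one abbreviated disk entry (illustrative data, not real lsblk output)."""
    return {"name": f"sd{letter}", "fstype": None, "uuid": None, "children": [
        {"name": f"sd{letter}4", "fstype": "crypto_LUKS", "uuid": None, "children": [
            {"name": f"sys{index:02d}", "fstype": "LVM2_member", "uuid": None, "children": [
                {"name": f"root{index:02d}vg-root", "fstype": "btrfs", "uuid": ROOT_UUID},
            ]},
        ]},
    ]}


SAMPLE = json.dumps({"blockdevices": [disk("a", 0), disk("b", 1), disk("c", 2)]})
members = btrfs_members(SAMPLE)
for uuid, names in members.items():
    print(uuid, "->", ", ".join(names))
```

Any UUID that maps to more than one device is a multi-device filesystem, so a /dev/disk/by-uuid path for it says nothing about how many members are already present.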

On a slow system like ours, the above sequence of events, repeated for each of the 3 disks, will trigger an attempt by systemd to mount /dev/disk/by-uuid/c081646d-1234-1234-1234-2d821fbea470 to /sysroot as soon as the first disk is ready, even though the other 2 are not. Of course, it will fail.

How to solve this?

We have to somehow make sure that systemd waits for ALL the disks to be ready (= LVM and BTRFS scanned and found).

Our first idea was to simply add forward dependencies to sysroot.mount:

Requires=dev-mapper-root00vg-root.device
After=dev-mapper-root00vg-root.device

But this does not work: device units are transient, and no such devices exist at mkinitcpio build time.

So the solution is to use reverse-dependencies, using systemd drop-ins.

For each device that we want to wait for, we create a drop-in directory and add an override.conf file to it that contains the new dependency:

[Install]
RequiredBy=sysroot.mount

Now these directories and files will not be pulled in automatically by mkinitcpio, so you will have to declare them explicitly somewhere (the complete solution below does it with InitrdPath entries).
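With more than a couple of disks, writing these drop-ins by hand gets tedious. A possible way to generate them, as a sketch only: `write_dropins` and `initrd_path_lines` are hypothetical helpers, the unit names are copied verbatim from this page, and the example writes into a scratch directory rather than /etc/systemd/system.

```python
import tempfile
from pathlib import Path

# Content of every override.conf, as shown above.
DROPIN_BODY = "[Install]\nRequiredBy=sysroot.mount\n"


def write_dropins(unit_names, system_dir):
    """Create <unit>.d/override.conf under system_dir for each unit name."""
    written = []
    for unit in unit_names:
        dropin_dir = Path(system_dir) / f"{unit}.d"
        dropin_dir.mkdir(parents=True, exist_ok=True)
        conf = dropin_dir / "override.conf"
        conf.write_text(DROPIN_BODY)
        written.append(conf)
    return written


def initrd_path_lines(unit_names, system_dir="/etc/systemd/system"):
    """Build the InitrdPath= lines that declare the drop-ins to the tool."""
    return [f"InitrdPath={system_dir}/{unit}.d/override.conf" for unit in unit_names]


UNITS = [f"dev-mapper-root{i:02d}vg-root.device" for i in range(3)]

with tempfile.TemporaryDirectory() as scratch:
    written = write_dropins(UNITS, scratch)
    relative = [str(path.relative_to(scratch)) for path in written]

print("\n".join(relative))
print("\n".join(initrd_path_lines(UNITS)))
```

To use it for real, point `system_dir` at /etc/systemd/system and paste the printed InitrdPath lines into the [X-SystemdTool] section of a unit that is included in the initramfs.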

Limitations

We're lucky: it works for us. It might not work with a big number of devices.

Each LVM LV that is found and attached in /dev/mapper fulfills a "RequiredBy" of sysroot.mount, and that's good enough for 3 devices. In reality, however, one more step takes place between the LV showing up and the mount becoming possible: the discovery of the BTRFS FS on the LVM LV...

Since the BTRFS UUID is shared among devices, I can't really use that UUID to trigger anything, unless there were a way to say that a device is required 3 times (which would mean hard-coding a number somewhere, and I would not like that)... I don't think this is workable...

Another way would be to have the LVM attachment trigger a service that checks whether the BTRFS raid is mountable. It would fail the first time, the second time too (although the FS would be mountable in degraded mode), but succeed the third time (or the xth time, where x is the number of devices taking part in that raid)... I tried that, but I could not get sysroot.mount to wait for anything: it ignored any After or Requires and mounted as soon as the What device was available... Not sure if this is intended or a bug...

Also: I found no tool that tells you whether a volume is mountable or not... There are 2 possibilities: really try to mount it, or parse the output of "BTRFS device info xxxx" for "missing device"... Neither is nice.
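For the parsing option, here is a minimal sketch under loud assumptions: it takes text in the style of `btrfs filesystem show <device>` (a stand-in for the command alluded to above), compares the "Total devices" count with the number of listed "devid" lines, and also looks for a "missing" note. The exact output format may differ between btrfs-progs versions, and both sample texts are made up for illustration. btrfs-progs also ships a `btrfs device ready <device>` subcommand meant for this kind of check; the sketch does not rely on it.

```python
import re


def all_devices_present(show_output):
    """Return True if every device announced in the output is also listed."""
    total = re.search(r"Total devices (\d+)", show_output)
    if total is None:
        return False  # not a filesystem listing at all
    listed = re.findall(r"^[ \t]*devid[ \t]+\d+[ \t]", show_output, flags=re.MULTILINE)
    reports_missing = "missing" in show_output.lower()
    return len(listed) == int(total.group(1)) and not reports_missing


# Illustrative text only, not captured from the machine described here.
HEADER = (
    "Label: none  uuid: c081646d-1234-1234-1234-2d821fbea470\n"
    "\tTotal devices 3 FS bytes used 1.90GiB\n"
)
COMPLETE = HEADER + (
    "\tdevid    1 size 50.00GiB used 2.00GiB path /dev/mapper/root00vg-root\n"
    "\tdevid    2 size 50.00GiB used 2.00GiB path /dev/mapper/root01vg-root\n"
    "\tdevid    3 size 50.00GiB used 2.00GiB path /dev/mapper/root02vg-root\n"
)
DEGRADED = HEADER + (
    "\tdevid    1 size 50.00GiB used 2.00GiB path /dev/mapper/root00vg-root\n"
    "\t*** Some devices missing\n"
)

print("complete:", all_devices_present(COMPLETE))
print("degraded:", all_devices_present(DEGRADED))
```

Counting devices this way avoids hard-coding the number 3, since the expected count comes from the filesystem itself.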

For this one, however, I thought that a cleaner solution would be to have a service check the BTRFS mountability of sysroot and then simply add a new symlink at /dev/mapper/mountable_sysroot (via udev?), with sysroot.mount using this path as its What. I did not try it... I'm not sure there is a way to get udev to do that...

Complete solution

/etc/mkinitcpio-systemd-tool/config/fstab

# The partition that contains the keys to the sysXX partitions
UUID=e2e4c5ab-1234-1234-1234-189b1dca208a       /root/keys/source       ext4            rw,noatime,stripe=4,x-systemd.device-timeout=9999h,x-systemd.before=sysroot.mount,x-systemd.required-by=sysroot.mount      0 1

/etc/mkinitcpio-systemd-tool/config/crypttab

# Keys disk
keys           UUID=0d0e6aa4-1234-1234-1234-081b5b0a57b8    none                         luks

# System disks
sys00          UUID=9286a856-1234-1234-1234-982e2574c318    /root/keys/source/s00.key    luks
sys01          UUID=9eedc985-1234-1234-1234-c526da7479dc    /root/keys/source/s01.key    luks
sys02          UUID=db94e9a4-1234-1234-1234-2264a5c5c8f9    /root/keys/source/s02.key    luks

Copy /usr/lib/systemd/system/initrd-dropbear.service to /etc/systemd/system and add:

# We want to login with password
[Service]
ExecStart=
ExecStart=/bin/dropbear -j -k -F -p ${SSHD_PORT}

[X-SystemdTool]
InitrdBuild=/usr/lib/mkinitcpio-systemd-tool/initrd-build.sh command=do_root_login_enable

# We need these to make sure BTRFS (well, LVM in fact) scan is completed BEFORE sysroot is mounted
InitrdPath=/etc/systemd/system/dev-mapper-root00vg-root.device.d/override.conf
InitrdPath=/etc/systemd/system/dev-mapper-root01vg-root.device.d/override.conf
InitrdPath=/etc/systemd/system/dev-mapper-root02vg-root.device.d/override.conf

Create /etc/systemd/system/dev-mapper-rootXXvg-root.device.d/override.conf as follows:

dev-mapper-root00vg-root.device.d
└── override.conf
dev-mapper-root01vg-root.device.d
└── override.conf
dev-mapper-root02vg-root.device.d
└── override.conf

with content:

[Install]
RequiredBy=sysroot.mount

/etc/systemd/system/sysroot.mount:

[Unit]
Requires=root-keys-source.mount
After=root-keys-source.mount
Before=initrd-root-fs.target
ConditionPathExists=/etc/initrd-release
DefaultDependencies=false

[Mount]
What=/dev/disk/by-uuid/c081646d-1234-1234-1234-2d821fbea470
Where=/sysroot
Type=btrfs
Options=noatime,subvol=/@

[Install]
WantedBy=initrd-root-fs.target
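The name root-keys-source.mount used in Requires= and After= is what systemd derives from the mount point /root/keys/source declared in the fstab. A simplified sketch of that path escaping rule follows (`escape_path` is a hypothetical helper; it ignores corner cases such as "." and ".." components, and `systemd-escape --path` remains the reference):

```python
import string

# Characters that systemd keeps as-is in unit names.
SAFE = set(string.ascii_letters + string.digits + ":_.")


def escape_path(path):
    """Simplified version of systemd's path escaping for unit names."""
    trimmed = "/".join(part for part in path.split("/") if part)
    if not trimmed:
        return "-"  # the root directory
    out = []
    for index, byte in enumerate(trimmed.encode("utf-8")):
        char = chr(byte)
        if char == "/":
            out.append("-")
        elif char in SAFE and not (index == 0 and char == "."):
            out.append(char)
        else:
            out.append("\\x%02x" % byte)  # e.g. "-" becomes \x2d
    return "".join(out)


print(escape_path("/root/keys/source") + ".mount")
print(escape_path("/sysroot") + ".mount")
print(escape_path("/dev/mapper/root00vg-root") + ".device")
```

Note that a literal hyphen in a path is escaped as \x2d, so the names of the device drop-in directories are worth double-checking with `systemd-escape --path --suffix=device` on the actual device path.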

You must remove the "root" option from the kernel parameters!