Case: Sysroot on Btrfs
based on issue #36
We need a backup system for a custom Linux-based NAS. This is a SOHO application, so it must be inexpensive yet reliable enough, and fully encrypted, ideally up to plausible deniability.
- Because it is a backup, it cannot be on premises, so we need some way to control it remotely. This is done at 3 different levels:
  - HW power up: Wake-on-LAN or any out-of-band controller
  - boot up: implemented in the initramfs, and the main focus of this page
  - running: standard SSH and other usual Linux tools
- Inexpensive hard disks: common SATA hard drives
- Reliability, as in: we know that at some stage some part of the PC will fail, but we still want to get our data back and salvage as much as possible from the current machine, if it makes sense. That, with other factors, means SW RAID.
- Security: we want to encrypt as much as possible: rootfs, swap, backup-data. Of course, that means we need to be able to decrypt the rootfs somehow, both locally and remotely. Having partitions hidden in an encrypted LUKS container is well documented, so that's not an issue.
I know only one FS that:
- is supported natively by Linux (because we don't have an IT department to patch it)
- offers SW RAID out of the box
- supports various device geometries to be added and removed to/from the same FS
and that's BTRFS.
Now I know that it comes with some caveats as far as RAID5 and 6 are concerned, but these look manageable compared to the price of a more elaborate and expensive solution.
So basically a system startup looks like this:
- Power up the machine via ILO
- The machine starts GRUB
- GRUB waits 3 secs and then boots the default OS (linux, with our custom kernel parameters)
- Linux starts and runs our initramfs
- Our initramfs runs systemd
- systemd asks for a password, both via an SSH server for remote users and on a console for a local operator
- The correct password causes the rootfs to be decrypted and mounted to /sysroot
- initramfs chroots to /sysroot to boot the real system
- real system decrypts and mounts the backup-data fs
- taaaaadaaaaa
Some other considerations have gone into this design but are out of scope here. The result is that we decided to apply the same partition layout to each and every hard drive in the system.
About LVM: we decided to have rootfs and swap in the same LUKS container but split using LVM. I think we liked the better flexibility of LVM compared to partition tables...
| Device | Size | Type | FS | Mounted | Redundancy |
|---|---|---|---|---|---|
| /dev/sdX1 | 256 MB | BIOS / UEFI partition | BIOS / UEFI | n/a for BIOS | n/a |
| /dev/sdX2 | 256 MB | "keys" partition | ext4 on LUKS | /root/keys/diskXX/ | rsync script |
| /dev/sdX3 | 512 MB | "/boot" partition | ext4 | /boot (for the active one), /root/boot/diskXX/ (for all) | rsync script |
| /dev/sdX4 | 60 GB | "sysYY" LUKS | LVM on LUKS | /dev/mapper/sysYY | n/a |
| | 50 GB | LVM logical volume "rootYYvg-root" | BTRFS raid5 | / (subvol @), /home (subvol @home) | BTRFS |
| | 10 GB | LVM logical volume "rootYYvg-swap" | swap | n/a | |
| /dev/sdX5 | rest | "bakYY" LUKS | BTRFS raid5 on LUKS | /srv/main_backup (subvol @<date>) | BTRFS |
Say we have 3 HDDs in our box: we don't want to prompt the user for 3 passwords. There are several strategies, but I did not find one that worked well enough for us both remotely and locally. So we came up with the following idea: the given password does not directly unlock the rootfs partition but instead a dedicated "keys" partition that contains the keys to each and every "sysYY" LUKS container. So the system boot-up really looks like:
- Our initramfs runs systemd
- systemd parses [initramfs]/etc/crypttab and sees a first LUKS container to decrypt with key "none", which means it requires the user to enter a password
- Get a password from user
- The correct password decrypts the "keys" LUKS container
- Decrypted /dev/mapper/keys is mounted to initramfs /root/keys
- systemd parses the next line in [initramfs]/etc/crypttab which says that "sys00" is a LUKS container to be decrypted with the now available key file /root/keys/sys00
- then "sys01"
- then "sys02"
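The key files that the later crypttab entries rely on have to be created once and enrolled in each "sysYY" container. A minimal sketch of that preparation step, assuming 64-byte random keys; the function names, the key size and the `/dev/sdX4` placeholder are illustrative and not taken from the original setup. `print_enroll_cmd` only prints the `cryptsetup luksAddKey` command so it can be reviewed before being run as root.

```shell
# Hypothetical helpers; names, key size and paths are assumptions.

make_keyfile() {
    # $1: path of the key file to create (64 random bytes, owner read-only)
    ( umask 077; head -c 64 /dev/urandom > "$1" )
    chmod 0400 "$1"
}

print_enroll_cmd() {
    # $1: LUKS device (placeholder, e.g. /dev/sdX4), $2: key file
    # Prints the enrollment command instead of executing it.
    printf 'cryptsetup luksAddKey %s %s\n' "$1" "$2"
}
```

For example, `make_keyfile /root/keys/disk00/s00.key` followed by `print_enroll_cmd /dev/sdX4 /root/keys/disk00/s00.key` would create one key and show the command that adds it to the matching container.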
To make sure systemd does not attempt to mount /sysroot before the keys are available, you need to remove /sysroot from [initramfs]/etc/fstab and express it as an explicit mount unit with a "Requires" constraint on the "keys" fs being mounted.
I have not managed to get this setup working with /sysroot in fstab, for reasons I could not pin down, hence the explicit mount unit.
I have tried numerous combinations, and a lot of things are pretty odd with systemd. For example, mount units seem to ignore After= and Requires= constraints and will attempt to mount as soon as the device is available.
I really wish /sysroot could stay in fstab too.
Every time a "sysYY" is decrypted, the following happens:
- /dev/mapper/sysYY is added to /dev and is scanned by udev
- udev discovers that it contains an LVM signature and triggers an LVM scan
- the LVM scanning adds /dev/mapper/rootYYvg-root to /dev and again udev scans it
- udev discovers that it contains a BTRFS signature and triggers a BTRFS scan
- the BTRFS scanning adds /dev/disk/by-uuid/c081646d-1234-1234-1234-2d821fbea470 to /dev
There is however a pitfall here, due to BTRFS RAID devices sharing a common UUID.
Take a look at the system once it is booted:
NAME FSTYPE SIZE UUID FSAVAIL FSUSE% MOUNTPOINT
sda 2.7T
├─sda1 256M
├─sda2 crypto_LUKS 256M 0d0e6aa4-1234-1234-1234-081b5b0a57b8
│ └─keys ext4 254M e2e4c5ab-1234-1234-1234-189b1dca208a /root/keys/disk00
├─sda3 ext4 512M 6e923908-1234-1234-1234-9f3de4205460 /root/boot/disk00
├─sda4 crypto_LUKS 60G 9286a856-1234-1234-1234-982e2574c318
│ └─sys00 LVM2_member 60G vinItc-hjDE-lP45-1234-1234-1234-2fYSPO
│ ├─root00vg-swap swap 10G 60e4b89f-1234-1234-1234-7001a1a4d6eb [SWAP]
│ └─root00vg-root btrfs 50G c081646d-1234-1234-1234-2d821fbea470 95.7G 2% /home
└─sda5 crypto_LUKS 2.7T caac0f4d-1234-1234-1234-d5dc13b113d8
└─bak00 btrfs 2.7T bf21d462-1234-1234-1234-15b3a1f6ec9a
sdb 2.7T
├─sdb1 256M
├─sdb2 crypto_LUKS 256M 3cd815f1-1234-1234-1234-3f8149685a2b
│ └─keys01 ext4 254M 6356506a-1234-1234-1234-e3a5967c10b2 /root/keys/disk01
├─sdb3 ext4 512M 7e8a0dbf-1234-1234-1234-64c51f702c67 /root/boot/disk01
├─sdb4 crypto_LUKS 60G 9eedc985-1234-1234-1234-c526da7479dc
│ └─sys01 LVM2_member 60G iDUNBk-AEFD-XcHp-1234-1234-1234-0nCxbG
│ ├─root01vg-swap swap 10G 211d8057-1234-1234-1234-2ae2394294ec [SWAP]
│ └─root01vg-root btrfs 50G c081646d-1234-1234-1234-2d821fbea470
└─sdb5 crypto_LUKS 2.7T 36ca46d8-1234-1234-1234-2e13776ecc1a
└─bak01 btrfs 2.7T bf21d462-1234-1234-1234-15b3a1f6ec9a
sdc 2.7T
├─sdc1 256M
├─sdc2 crypto_LUKS 256M 0a1806e0-1234-1234-1234-6d68a14bc782
│ └─keys02 ext4 254M e8cbfa1c-1234-1234-1234-42163c55aadc /root/keys/disk02
├─sdc3 ext4 512M e3e6de8f-1234-1234-1234-1335f3beb562 /root/boot/disk02
├─sdc4 crypto_LUKS 60G db94e9a4-1234-1234-1234-2264a5c5c8f9
│ └─sys02 LVM2_member 60G B78tVH-0VoC-vUsf-1234-1234-1234-wQgdYa
│ ├─root02vg-swap swap 10G 41561260-1234-1234-1234-3d3a4f9ea244 [SWAP]
│ └─root02vg-root btrfs 50G c081646d-1234-1234-1234-2d821fbea470
└─sdc5 crypto_LUKS 2.7T 00780e65-1234-1234-1234-afd6f1dbea29
└─bak02 btrfs 2.7T bf21d462-1234-1234-1234-15b3a1f6ec9a 4.4T 12% /srv/main_backup
As you can see, the rootYYvg-root BTRFS volumes on each physical disk all have the same UUID c081646d-1234-1234-1234-2d821fbea470, even though they sit in LVM logical volumes with different UUIDs, themselves embedded in LUKS containers with different UUIDs. Note that the same can be said about the bakYY volumes: they all share the UUID bf21d462-1234-1234-1234-15b3a1f6ec9a, and they don't use LVM.
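To see how many members of a BTRFS filesystem are currently visible, the listing can be filtered by UUID. The sketch below assumes input in the form "NAME FSTYPE UUID", one device per line, as produced by `lsblk -rno NAME,FSTYPE,UUID`; the function name `count_members` is made up for this example.

```shell
# Count how many block devices report a given filesystem UUID.
# Reads "NAME FSTYPE UUID" lines on stdin; lines with fewer fields are ignored.
count_members() {
    # $1: filesystem UUID to look for
    awk -v uuid="$1" '$3 == uuid { n++ } END { print n + 0 }'
}
```

On the machine shown above, `lsblk -rno NAME,FSTYPE,UUID | count_members c081646d-1234-1234-1234-2d821fbea470` should report one entry per unlocked system disk, which makes it easy to spot a boot where only some of the disks have been decrypted.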
On a slow system like ours, the above sequence of events is repeated for each of the 3 disks, and systemd will attempt to mount /dev/disk/by-uuid/c081646d-1234-1234-1234-2d821fbea470 to /sysroot as soon as the first disk is ready, even though the other 2 are not. Of course, it will fail.
How to solve this?
We have to somehow make sure that systemd waits for ALL the disks to be ready (= LVM and BTRFS scanned and found).
Our first idea was to simply add forward dependencies to sysroot.mount:
Requires=dev-mapper-root00vg-root.device
After=dev-mapper-root00vg-root.device
But this does not work: device units are transient, there are no devices at mkinitcpio build time.
So the solution is to use reverse-dependencies, using systemd drop-ins.
For each device that we want to wait for, we create a drop-in directory and add an override.conf file to it that contains the new dependency:
[Install]
RequiredBy=sysroot.mount
Now these directories and files will not be pulled in automatically by mkinitcpio, so you will have to declare them explicitly somewhere (here, via InitrdPath= entries in a service's [X-SystemdTool] section).
We're lucky: it works for us. It might not work with a big number of devices.
Each LVM LV that is found and attached in /dev/mapper fulfills a "RequiredBy" of sysroot.mount, and that's good enough for 3 devices. In reality, though, an extra step that we also need takes place between the two: the discovery of the BTRFS FS on the LVM LV...
Since the BTRFS UUID is shared among devices, I can't really use that UUID to trigger anything, unless there were a way to say that a device is required 3 times (which would mean hard-coding a number somewhere, and I would not like that). I don't think this is workable.
Another way would be to have the LVM attachment trigger a service that checks whether the BTRFS raid is mountable. It would fail the first time, the second time too (although the FS would be mountable in degraded mode), but succeed the third time (or the xth time, where x is the number of devices taking part in that raid). I tried that, but I could not get sysroot.mount to wait for anything: it was ignoring any After or Requires and was mounting as soon as the What condition was workable. Not sure if that is intended or a bug.
Also: I did not find a tool that tells you whether a volume is mountable or not. 2 possibilities: really try to mount it, or parse the output of a command such as "btrfs filesystem show" for missing devices. Both are not nice.
For this one, however, I thought a cleaner solution would be to have a service check for the BTRFS mountability of sysroot and then simply add a new symlink such as /dev/mapper/mountable_sysroot (via udev?) and have sysroot.mount use this path as its What condition. I did not try it, and I'm not sure there is a way to fool udev into doing that.
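One candidate worth evaluating for the mountability check is the `btrfs device ready <device>` subcommand from btrfs-progs, which exits with 0 once all member devices of a multi-device filesystem have been registered with the kernel. The sketch below polls it; the function name, the retry count and the `POLL_INTERVAL` variable are assumptions, and it has not been tried inside this initramfs.

```shell
# Poll until btrfs reports that every member device of the filesystem
# on device $1 is registered, or give up after $2 attempts (default 30).
wait_btrfs_ready() {
    dev=$1
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        if btrfs device ready "$dev"; then
            return 0
        fi
        tries=$((tries - 1))
        # POLL_INTERVAL (seconds between attempts) is an assumed knob.
        sleep "${POLL_INTERVAL:-1}"
    done
    return 1
}
```

A service ordered before sysroot.mount could call `wait_btrfs_ready /dev/mapper/root00vg-root 60` and only create the symlink discussed above on success. Whether sysroot.mount would then honour that ordering is exactly the open question described in the text.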
/etc/mkinitcpio-systemd-tool/config/fstab
# The partition that contains the keys to the sysXX partitions
UUID=e2e4c5ab-1234-1234-1234-189b1dca208a /root/keys/source ext4 rw,noatime,stripe=4,x-systemd.device-timeout=9999h,x-systemd.before=sysroot.mount,x-systemd.required-by=sysroot.mount 0 1
/etc/mkinitcpio-systemd-tool/config/crypttab
# Keys disk
keys UUID=0d0e6aa4-1234-1234-1234-081b5b0a57b8 none luks
# System disks
sys00 UUID=9286a856-1234-1234-1234-982e2574c318 /root/keys/source/s00.key luks
sys01 UUID=9eedc985-1234-1234-1234-c526da7479dc /root/keys/source/s01.key luks
sys02 UUID=db94e9a4-1234-1234-1234-2264a5c5c8f9 /root/keys/source/s02.key luks
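Since every "sysYY" line depends on a key file being present on the keys partition, a small sanity check before rebuilding the initramfs can help. This sketch lists the key files a crypttab refers to, skipping comments and entries whose key field is `none` or `-` (both mean "ask for a passphrase"); the function name `list_keyfiles` is hypothetical.

```shell
# Print the key file path of every crypttab entry that uses one.
# Reads crypttab content on stdin.
list_keyfiles() {
    awk '$1 !~ /^#/ && NF >= 3 && $3 != "none" && $3 != "-" { print $3 }'
}
```

Running `list_keyfiles < /etc/mkinitcpio-systemd-tool/config/crypttab` on the file above would print the three `/root/keys/source/sNN.key` paths, which can then be compared with what is actually stored on the keys partition.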
Copy /usr/lib/systemd/system/initrd-dropbear.service to /etc/systemd/system and add:
# We want to login with password
[Service]
ExecStart=
ExecStart=/bin/dropbear -j -k -F -p ${SSHD_PORT}
[X-SystemdTool]
InitrdBuild=/usr/lib/mkinitcpio-systemd-tool/initrd-build.sh command=do_root_login_enable
# We need these to make sure BTRFS (well, LVM in fact) scan is completed BEFORE sysroot is mounted
InitrdPath=/etc/systemd/system/dev-mapper-root00vg-root.device.d/override.conf
InitrdPath=/etc/systemd/system/dev-mapper-root01vg-root.device.d/override.conf
InitrdPath=/etc/systemd/system/dev-mapper-root02vg-root.device.d/override.conf
Create /etc/systemd/system/dev-mapper-rootXXvg-root.device.d/override.conf as follows:
dev-mapper-root00vg-root.device.d
└── override.conf
dev-mapper-root01vg-root.device.d
└── override.conf
dev-mapper-root02vg-root.device.d
└── override.conf
with content:
[Install]
RequiredBy=sysroot.mount
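With more than a couple of disks, creating these drop-ins by hand gets tedious. A minimal sketch of a generator, assuming the unit names are passed in explicitly; `make_dropins` and its "target root" argument are inventions for this example, the latter so that the result can be inspected in a scratch directory first.

```shell
# Create <root>/etc/systemd/system/<unit>.d/override.conf for each unit name.
make_dropins() {
    # $1: target root directory ("/" for the live system), rest: unit names
    root=${1%/}
    shift
    for unit in "$@"; do
        dir="$root/etc/systemd/system/$unit.d"
        mkdir -p "$dir"
        printf '[Install]\nRequiredBy=sysroot.mount\n' > "$dir/override.conf"
    done
}
```

For the three disks used here that would be `make_dropins / dev-mapper-root00vg-root.device dev-mapper-root01vg-root.device dev-mapper-root02vg-root.device`. Each generated file still needs its own InitrdPath= entry so that it ends up in the initramfs.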
/etc/systemd/system/sysroot.mount:
[Unit]
Requires=root-keys-source.mount
After=root-keys-source.mount
Before=initrd-root-fs.target
ConditionPathExists=/etc/initrd-release
DefaultDependencies=false
[Mount]
What=/dev/disk/by-uuid/c081646d-1234-1234-1234-2d821fbea470
Where=/sysroot
Type=btrfs
Options=noatime,subvol=/@
[Install]
WantedBy=initrd-root-fs.target
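Unit names such as root-keys-source.mount are derived from paths: systemd turns "/" into "-" and encodes a literal "-" inside a path component as `\x2d`. On a real system `systemd-escape -p --suffix=device /dev/mapper/root00vg-root` gives the authoritative name, and it may be worth comparing that output with the hand-written device unit names used on this page. The function below is a simplified stand-in that handles only "/" and "-", written purely to illustrate the rule; whether this has any bearing on the dependency problems described earlier was not verified.

```shell
# Simplified path-to-unit-name escaping: only "/" and "-" are handled.
# The real rules (see systemd-escape) also cover other characters.
escape_path() {
    # $1: absolute path, e.g. /root/keys/source
    printf '%s\n' "$1" |
        sed -e 's|^/*||' -e 's|/*$||' -e 's|-|\\x2d|g' -e 's|/|-|g'
}
```

With this rule, `/root/keys/source` maps to `root-keys-source`, matching the mount unit referenced in sysroot.mount, while a path containing a hyphen in one of its components gets a `\x2d` in the resulting name.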
You must remove the "root" option from the kernel parameters!
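A quick way to confirm this is to inspect the kernel command line. The helper below is a small sketch (the name `has_root_param` is made up) that reports whether a command line string still carries a `root=` parameter, without being fooled by related options such as `rootflags=`.

```shell
# Return 0 if the given kernel command line contains a root= parameter.
has_root_param() {
    # $1: kernel command line, e.g. the content of /proc/cmdline
    case " $1 " in
        *" root="*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On a booted machine, `has_root_param "$(cat /proc/cmdline)" && echo "root= is still set"` shows at a glance whether the bootloader configuration needs another look.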