Volume Management #8367
Comments
This feature is going to be shifted to Talos 1.8.0 (only the first bits might appear in Talos 1.7.0). Talos 1.8.0 will be released as soon as this feature is ready.
Some software, like Longhorn, might not respect limits and fill the whole disk. It would be great if a misbehaving pod could not destroy etcd or other core parts of Talos just by claiming all the available disk space.
This is good to know. I have always liked keeping Talos on a dedicated disk to prevent unknowns/complications like this. Any ideas on how we could impose those limitations?
From my point of view, something like LVM with partitions for each part would help. I used a similar setup in k3s and never had issues like this. LVM would also make the encryption part easy, because you only have to encrypt one device…
Allow the choice of any block device. A partition is also a block device. People could partition a single SSD with sufficient space for Talos and then an additional partition for general use. Filling up the general-use partition isn't going to affect the Talos partition(s).
Similar to how the newer ESXi installer does when using the
I think at a minimum we would need two partitions or LVM volumes. It would be great if we could have an option to say: we also need 100 GB of Longhorn space and 50 GB of local-path space. Those are just examples; we would just need a volume size and a mount path. All remaining space could be assigned to the general-purpose partition, and the default setting should be to use all space. With something like LVM we could also allow fixing the general volume to a specific size and leaving the remaining space unused. That would allow for expansion of other volumes, or ensure that all nodes are identical even if one has a bigger disk. (See the sketch below.)
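For illustration only, here is a rough sketch of the kind of layout being asked for, expressed with the machine.disks mechanism that comes up later in this thread. The device name, mount paths and sizes are purely illustrative, and machine.disks applies to additional disks rather than the Talos system disk:

```yaml
machine:
  disks:
    - device: /dev/sdb # illustrative: a dedicated data disk, not the Talos system disk
      partitions:
        - mountpoint: /var/lib/longhorn # example: 100 GB reserved for Longhorn
          size: 100GB
        - mountpoint: /var/mnt/local-path # illustrative path: 50 GB for local-path storage
          size: 50GB
        - mountpoint: /var/mnt/general # size omitted: the config template says such a partition is sized to occupy the full (remaining) disk
```

Under the current design these mounts would typically also need to be exposed to the kubelet via extraMounts, as shown further down the thread.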
Thank you for clarifying, @smira! If I set up a cluster with 1.7 today, will there be a migration path in 1.8 to have Talos managing disks as proposed in this issue?
Talos is always backwards compatible, so an upgrade to 1.8 will always work. You would be able to start using volume management features, but some of them (e.g. shrinking
OpenEBS had a component called ndm (node-disk-manager) that was quite handy for managing block devices. HostPath and OS disks could be excluded with filters, e.g.:

```yaml
filterconfigs:
  - key: os-disk-exclude-filter
    name: os disk exclude filter
    state: true
    exclude: "/,/etc/hosts,/boot,/var/mnt/openebs/nvme-hostpath-xfs"
```

This was used by the localpv-device SC, letting you assign a whole block device to a pod. Unfortunately they have stopped supporting ndm and localpv-device with the release of OpenEBS 4.0. It would be great if Talos had a similar feature!
This is early WIP. See siderolabs#8367 Signed-off-by: Andrey Smirnov <[email protected]>
This is incredibly exciting; happy to give it a whirl once you get an RC/beta or something, @smira. Thank you!
@smira TopoLVM requires a
What I get from this issue is that we would at least be able to allocate the system disk with free space remaining, which is already 99% of the way there! Does it also allow us to use LVM using
I was asked to add some notes about software RAID for on-premises (i.e. metal) deployments. Not looking for a response, just wanted to add another viewpoint.
@nadenf I don't get what you are aiming for here with this response.
You're sharing a viewpoint that is irrelevant to the issue. As it's already been decided to add this feature, your opinion or implementation of it (aka viewpoint) is of absolutely zero relevance. Beyond that, just because you say you're not looking for responses, posting off-topic crap like this (which even contains wrong information) is always going to get the inherent response of "stfu", as you're pushing yourself into the notifications of all 14 people watching this issue with information that is of zero relevance. It has as much relevance to this issue as sharing your favorite recipe for fried kangaroo.
I was asked to add it here, by Steve Francis, the CEO of Sidero Labs. And perhaps the thinking is that, since RAID is a form of volume, the design, whilst not supporting software RAID, could at least be done in a way that does not prevent it from being added in the future.
I bet you also emailed him some off-topic crap out of the blue like you did here. In that case, freaking say that.
@PrivatePuffin, I (Sidero CEO) did ask @nadenf to post to this epic. Looking closer, it is the wrong epic - it's volume, not disk/partition management - but that is on me. Posting why a user wants a feature, without asking for a specific response, is very valuable to us. The reason Talos has such limited volume/partition/disk management features is that it has a very architecturally pure design - and in the pure design, it is best if Talos owns the entire disk, so that it can completely wipe/erase/repartition it on any upgrade, all other disk management is done on other disks, and if things go awry, you add another node (cattle, not pets). This doesn't work in real life, but our product team appreciates reminders as to why, which is exactly the context that was posted. While I appreciate your wish to keep issues relevant, your impolite and aggressive tone is completely inappropriate. To quote yourself, you could have just said:
Instead, @PrivatePuffin, you are "rambling on about all sorts of things that have nothing to do with the issue" or the point you are trying to make, and are further being rude, leaving a bad taste in the community.
Just wanted to pop in and say the entire community loves Talos; just ignore the one butthead. Keep up the good work, guys!
Did you have any joy with this approach, @isometry? Would love to see an example if so!
It would be helpful to be able to create custom partitions.
Any workarounds for now?

```yaml
apiVersion: v1alpha1
kind: VolumeConfig
name: EPHEMERAL
provisioning:
  maxSize: 30GiB
  grow: false
```

And then I manually created a custom volume for Ceph metadata:

How bad is this as a temporary solution? I even tried
It's designed to work this way. Keep in mind that talosctl reset without specific labels will still blow away the whole disk, but we are working on a solution for that as well.
Unfortunately, Ceph doesn't support the use of partitions for OSD:
It would be cool to have a way to manage NVMe namespaces in the machine configuration.
I have previously used partitions for Rook, so I know this used to work, but that was with kubeadm clusters, so things might have changed since I switched to Talos two years ago. That being said, this mentions partitions several times as something one can use. Edit: I would try wiping the partition if you are sure it's ready for Rook.
My cluster runs NixOS right now, waiting for this feature in Talos. I specify the disk by UUID for Ceph though, not the partition name, and it works fine.
Hmm.
After shrinking the EPHEMERAL volume, how can I mount the remaining disk space as a separate partition to e.g.

```yaml
apiVersion: v1alpha1
kind: VolumeConfig
name: EPHEMERAL
provisioning:
  diskSelector:
    match: system_disk
  minSize: 200GB
  maxSize: 200GB
  grow: false
```

the
Did you manage to manually create and bind the volume with Talos?
No, each node in my cluster has two HDDs for Ceph data and one NVMe SSD for Talos and Ceph metadata.
Yes, I've (only just) finished testing this successfully. I applied the following configuration snippets to a freshly reset Turing RK1 with Talos v1.9.0 installed to the MMC, and with a single, unpartitioned 1TB NVMe drive attached at

```yaml
machine:
  # ...elided...
  disks:
    - device: /dev/nvme0n1
      partitions:
        - mountpoint: /var/lib/longhorn
          size: 800GB
  # ...elided...
  kubelet:
    extraMounts:
      - destination: /var/lib/longhorn # Destination is the absolute path where the mount will be placed in the container.
        type: bind # Type specifies the mount kind.
        source: /var/lib/longhorn # Source specifies the source path of the mount.
        # Options are fstab style mount options.
        options:
          - rbind
          - rshared
          - rw
  # ...elided...
---
apiVersion: v1alpha1
kind: VolumeConfig
name: EPHEMERAL
provisioning:
  diskSelector:
    match: disk.transport == "nvme"
  maxSize: 128GB
  grow: false
```

This results in the following volume configuration:

```
$ talosctl get volumestatus
NODE         NAMESPACE   TYPE           ID               VERSION   PHASE   LOCATION         SIZE
turingpi-1   runtime     VolumeStatus   /dev/nvme0n1-1   2         ready   /dev/nvme0n1p2   800 GB
turingpi-1   runtime     VolumeStatus   EPHEMERAL        1         ready   /dev/nvme0n1p1   128 GB
turingpi-1   runtime     VolumeStatus   META             2         ready   /dev/mmcblk0p4   1.0 MB
turingpi-1   runtime     VolumeStatus   STATE            3         ready   /dev/mmcblk0p5   105 MB

$ talosctl mounts | egrep 'NODE|nvme'
NODE         FILESYSTEM       SIZE(GB)   USED(GB)   AVAILABLE(GB)   PERCENT USED   MOUNTED ON
turingpi-1   /dev/nvme0n1p1   127.93     5.44       122.50          4.25%          /var
turingpi-1   /dev/nvme0n1p2   871.78     16.76      855.01          1.92%          /var/lib/longhorn
```

I was a little surprised that the Longhorn is happily running atop the
Warning: following my previously described configuration, local storage is broken by an upgrade (to v1.9.1) :-/

```
$ talosctl -n 10.0.88.11 get volumestatus
NODE         NAMESPACE   TYPE           ID               VERSION   PHASE   LOCATION         SIZE
turingpi-1   runtime     VolumeStatus   /dev/nvme0n1-1   1         ready   /dev/nvme0n1p1   128 GB
turingpi-1   runtime     VolumeStatus   EPHEMERAL        1         ready   /dev/nvme0n1p1   128 GB
turingpi-1   runtime     VolumeStatus   META             2         ready   /dev/mmcblk0p4   1.0 MB
turingpi-1   runtime     VolumeStatus   STATE            2         ready   /dev/mmcblk0p5   105 MB
```

I suspect that it is necessary to pre-partition the NVMe for all

UPDATE: Confirmed. Pre-creating the appropriate

I personally did this via:
I might be a bit too early, and this might not be implemented yet, or I completely misunderstood how this works, but I tried the following, based on what @isometry wrote:

But I always get:

The EPHEMERAL volume is correctly sized to 100 GB, but the following machine mount does not work:
@Blarc check my follow-up warning. Ensure that your target disk has a GPT label and a single 400GB partition (which will be used for your
Is there any chance you could provide an example of how to ensure the disk has a GPT label?
I don't think GPT or not matters. Just make sure that your disk contains a single 400GB partition before you apply your configuration (most easily done by resetting the host, applying just the
Hi! So is there any documentation (except the template in the .yaml config file) on how to use machine.disks and how to interact with the default EPHEMERAL? Or is this currently in a testing state anyway? Setting Talos up the way @isometry recommends (meaning requiring a certain order of commands to initialize) is kind of exactly what Talos wants to avoid, right? Also, if I set machine.disks like this:

```yaml
disks:
  - device: /dev/nvme0n1 # The name of the disk to use.
    # A list of partitions to create on the disk.
    partitions:
      - mountpoint: /var/lib/longhorn # Where to mount the partition.
        # # The size of partition: either bytes or human readable representation. If `size:` is omitted, the partition is sized to occupy the full disk.
        # # Human readable representation.
        size: 400GB
        # # Precise value in bytes.
        # size: 1073741824
```

I get the following:

(not the proper size, and the phase is failed). Same result if I set the EPHEMERAL volume to e.g. 40 GB using a volume config entry.
@rserbitar "quite". From what I can tell, the machine disks mechanism is strictly not appropriate at present for trying to create or manage non-system partitions on the system disk. The Longhorn/EPHEMERAL-on-NVMe configuration I found to work is sub-optimal, as you say, in that it requires multiple configuration passes, and it only works at all when you have (at least) a second disk for machine disks (I'm using the Turing RK1 embedded MMC for Talos system/meta/state, on the understanding that these should be almost read-only during normal operation).
Closely related: #8016
Problem Statement
Talos Linux is not flexible in the way it manages volumes: it occupies the whole system disk, creating an EPHEMERAL partition covering 99% of the disk space. User disk management is fragile, requires extra steps to get it to work properly (mounting into the kubelet), doesn't support wiping disks, etc. Talos also does not properly detect various partition types, which leads to wiping user data (e.g. Ceph bluestore).
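To make the "extra steps" concrete: with the current design, a user disk mounted at /var/lib/longhorn also has to be bind-mounted into the kubelet, as in the snippet shared in the comments above (the path is just an example):

```yaml
machine:
  kubelet:
    extraMounts:
      - destination: /var/lib/longhorn # must match the user disk mountpoint
        type: bind
        source: /var/lib/longhorn
        options:
          - rbind
          - rshared
          - rw
```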
There were the following requests from users/customers which can't be addressed in the current design:
- running STATE/EPHEMERAL on tmpfs
- mounting /var from an NVMe/SSD
- moving the containerd state / etcd data directory to a separate disk
- limiting the size of parts of /var (e.g. containerd state)
- mounting additional paths (e.g. /var/mnt/foo)
- user data volumes (e.g. /data)
The proposed design provides an option to solve the issues mentioned above.
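As a flavour of the direction, the comments above already exercise an explicit VolumeConfig document for the EPHEMERAL system volume; a minimal sketch (the disk selector and size are illustrative):

```yaml
apiVersion: v1alpha1
kind: VolumeConfig
name: EPHEMERAL
provisioning:
  diskSelector:
    match: disk.transport == "nvme" # CEL expression selecting the backing disk
  maxSize: 128GB
  grow: false
```

Limiting maxSize and disabling grow leaves the rest of the disk free for other volumes, which is exactly the class of request listed above.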
Groundwork
Before we move into volume management operations, there is some amount of work that needs to be done to improve the block device management operations:
- blkid, to allow easier identification of storage objects

Installation Process
Talos installation should do the bare minimum to make sure that Talos can be booted from the disk, without touching the pieces which are not strictly required to boot Talos. This might include installing Talos without having machine configuration.
So the install should only touch the following partitions/objects:
- BOOT/EFI partitions (boot assets, the boot loader itself)
- META partition
- MBR

Any management of the storage/volumes should be deferred to Talos running on the host (i.e. creating /var, /system/state, etc.)

Volumes
Let's introduce a new concept of volumes, which will address the ideas mentioned above and allow us to take storage management to the next level.

There are two kinds of volumes:
Every volume has several most important features:
Volumes support a basic set of operations:
Volume types:
Volume formats:
Volume additional options:
System Volumes
As of today, Talos implicitly has the following volume types:
Volume Lifecycle
Talos services can express their dependency on the volumes. For example, the kubelet service can only be started when the kubelet data volume is available. In the same way, if the kubelet data volume is going to be unmounted, kubelet should be stopped first.

The boot process should naturally stop when a required volume is not available. E.g. maintenance mode of Talos implies that the boot can't proceed as long as the volume configuration is not available.

Volume Configuration
System volumes have implicit configuration, which is applied as long as v1alpha1.Config is applied to the machine. Some properties are configurable in v1alpha1.Config, e.g. disk encryption. If an explicit volume configuration is provided, Talos uses that.

For example, if the user configures EPHEMERAL to be tmpfs of size 10 GiB, it will be created on each boot as instructed.

Users might provide configuration for user volumes (similar to the user disks feature today), which might be critical for the pods to be started; otherwise e.g. extension services might provide a dependency on the additional volumes.

Some system volumes might be optional, i.e. configured by the users - for example, the container image cache.

Upgrades and Wiping
Talos Linux upgrades should not wipe anything by default, and wiping should be an additional operation which can be done without an upgrade, or can optionally be combined with an upgrade.

The update itself should only modify boot assets/the boot loader, i.e. ensure that the new version of Talos Linux can be booted from the disk device.

Wiping is volume-based; examples:
- wipe EPHEMERAL, which implies wiping all volumes which have EPHEMERAL as a parent (e.g. the subdirectory volume of /var/lib/etcd); all services which depend on EPHEMERAL or its children should be stopped, but a reboot is not necessary, as EPHEMERAL will be re-provisioned after the wipe
- wipe etcd data, which in the default configuration implies leaving etcd, stopping etcd services, performing rm -rf /var/lib/etcd, and re-starting the etcd join process

Notes
As pointed out by @utkuozdemir, EPHEMERAL might be a bad name, given that the partition is not supposed to be force-wiped by default.

Tasks