Support linear zpool #16754
Comments
Seems that you can achieve this in both cases by simply not creating the "to-be-used-later" vdev until running out of space.
Well, this puts a rather strong restriction on how the pool is used, and needs either manual intervention when a disk runs out of space or writing monitoring programs to automate it. More specifically, your proposed approach assumes files are only added, never modified or deleted. That's true for some people, but it's a strong restriction for most. If I already have more than one vdev and then mutate existing data, new data will still be striped across different vdevs. Also, IIRC, ZFS performs badly (e.g. fragmentation) when the space is close to full, so you can't wait for it to run out of space.
This feels like hell for management and data recovery. I can think of several tunables (like
Good point about the shared IO scheduler queue. I didn't think about it, I appreciate you bringing it up, and it's a nice thing to have. However, it makes the point stronger that this needs some support in ZFS rather than existing tunables.

Can you elaborate a little more on the management and data recovery concerns, and how this messes up the ZFS internals?

For management, did you mean it makes the CLI more complicated? It could be as simple as a property on a vdev, where vdevs with the same value are in the same group (see the sketch at the end of this comment). Ordinary users wouldn't need to care about it, while advanced users could set it to enable the feature. In this way there is no need to introduce any new subcommand to the CLI. If you mean it is difficult for users to set up the zpool this way, then for use case 1 it is a one-off thing, and a much better experience than what Harry suggested in terms of management; for use case 2 it is only a little pain at setup time, while to do shrinking, a simple

Keep in mind this is for users who have these requirements, so they pay a little inconvenience for all the benefits they get; for those who don't need it, it won't make any difference anyway. Also, for both use cases, people could write reusable scripts or programs to automate the related operations.

For data recovery, there is no on-disk format change, so scrub/resilver/... are not affected, except for the shared IO scheduler you mentioned. If you mean that putting vdevs on the same physical drive makes data loss easier: nothing prevents people from doing that today, so it doesn't make things worse; on the flip side, it reduces disk seeks, which potentially makes drives last longer. Both use cases I gave tolerate a single disk failure, which is no less secure than a normal raidz setup; with a proper setup, it wouldn't harm data safety. Of course, when doing a replace you now need to run multiple replace commands, but that's not a huge pain IMO, and again it can be automated.

In fact, by making ZFS aware of the underlying physical drives, it opens up other opportunities as well, e.g. warning users if their setup can't tolerate a single disk failure. Also, in the future,

Again, I am happy to contribute this feature, but I need a thumbs-up before I really spend hours on it. I don't want to spend many hours on it only to get a rejection.
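To make the property idea above concrete, here is a rough sketch reusing the existing zpool set <property> <pool> <vdev> syntax (vdev user properties in recent OpenZFS releases); the property name org.openzfs:linear_group is purely hypothetical, not something that exists today:

```
# Hypothetical sketch only: "org.openzfs:linear_group" is an invented name.
# vdevs carrying the same value would be treated as one linear group that
# shares the same physical disks.
zpool set org.openzfs:linear_group=disks-abcd tank raidz1-0
zpool set org.openzfs:linear_group=disks-abcd tank raidz1-1
zpool set org.openzfs:linear_group=disks-abcd tank mirror-2
```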
When one of the disks fails, somebody will need to recreate exactly the same configuration to replace it. If the replacement disk happens to have a different size, since time has passed, I bet you'll want some even weirder topology in addition to the existing one. If after all this you have a bad accident and need to contact a data rescue company, it will be quite an explanation of where to look for your data, especially after a few years have passed since you set it up. Don't complicate your life, it is already complicated enough.
I must say the same remains true for your proposal, i.e. using vdevs in a pre-defined order.
Let's take use case 1 as an example. First of all, there is no need to discuss the case where the replacement disk is smaller, as that would break other zpool setups as well. If the disk getting bigger is the largest one, C or D, then the additional space can't be used in the zpool. This is the same as unRAID, so I don't see it as a big problem.

Let's assume disk A gets replaced with a bigger one; it can be 10TB or 12TB. For any size in between, e.g. 11TB, the extra 1TB can't be added to the pool, which is fine. Let's take A 6TB -> 12TB as an example; other cases, such as A -> 10TB or B -> 12TB, are similar. Recall the original configuration is

In this way, the new topology does not get worse over time. If there is sufficient free space, we can even remove the 4TBx3 vdev, resize partitions, and rely on

All the above steps can be automated, and it is fairly simple to write some scripts to do that and share them with the community.
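For concreteness, a rough sketch of what the A 6TB -> 12TB replacement could look like; the device names, partition sizes, and the mirror member chosen for the attach are illustrative, not from this discussion:

```
# 1. Partition the new 12TB disk to match the layout of the largest disks
#    (6TB + 4TB + remainder). /dev/sdx is an example device.
sgdisk -n 1:0:+6T -n 2:0:+4T -n 3:0:0 /dev/sdx
# 2. Replace old A's 6TB member in the first raidz vdev with the new 6TB partition.
zpool replace tank sda1 /dev/sdx1
# 3. Optionally use the remaining partitions to grow other vdevs, e.g. attach
#    the 2TB partition to the 2TBx2 mirror, making it a 3-way mirror.
zpool attach tank sdc3 /dev/sdx3
```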
Not really. With your suggestion, the allocator is forced to find space in small holes in the existing vdevs, which causes fragmentation; with more vdevs added beforehand, when the existing vdevs don't have sufficiently large contiguous free space, it can allocate from the other vdevs. As I said, the group is treated as a single continuous space, so the allocator is not forced to completely fill up the 1st vdev before going to the 2nd one. In my suggestion I also mentioned an alternative, which still writes evenly to the vdevs but with large stripes; that would also work and not suffer from the fragmentation issue.
Yes -- but there will eventually be a threshold (either hardcoded or configurable) to decide when the allocator starts to use a non-empty vdev.
Well, if I fill up 90% of my 10TBx3 space, I deserve the fragmentation; that's my problem. But I don't deserve the fragmentation when I only have 90% of 1TBx3 worth of data.
I do enough recovery work on pools where people tried to be too clever, so I appreciate this:
That said, I think your original description lays out some of the moving parts required and provides a guide to getting here:
Breaking this down, it seems like you want:
1. a way to attach properties to vdevs;
2. a way to group vdevs so a property can apply to a whole group;
3. an allocation policy that treats the vdevs in a group as one linear space instead of striping across them.
1 and 2 are both wanted in other contexts. We can already set arbitrary properties on a vdev, but we haven't yet settled on how to do property inheritance. "Groups" are one possibility (I have a prototype of something similar that I call "templates", but it's not ready to show, and I wouldn't mind if something different came along). These facilities, built in good faith with the involvement of the rest of the community, will give you the tools you need to implement 3, and along the way you'll probably gain the knowledge you need to know whether or not it's something that makes sense, or should be done differently.
Thanks, Rob, for your input.
And according to Alexander, there is a
Do you want to give some examples of the "other contexts" you mentioned? The solution I proposed was very specific to my use cases, e.g. the property is only meaningful for top-level vdevs, so inheritance is not a problem. I'd like to hear the requirements from other contexts so we can come up with a uniform and future-proof approach that covers as many use cases as possible, to avoid adding more chaos to the project.
Describe the feature you would like to see added to OpenZFS
I'd like to see ZFS support marking group(s) of vdevs as linear. A group of linear vdevs is considered to be on the same physical device, so ZFS should avoid striping when writing to them.
To implement this feature, no change needs to be made to the on-disk format, except adding some pool-level metadata that remembers these groups. At allocation time, treat all vdevs in a group as a single, linear space; or alternatively, write a sufficiently large amount of data to one vdev before moving to the next. The latter approach sounds similar to setting metaslab_aliquot to a large value, but that module parameter is global, and I'd like this to apply only within the group, or at least on a per-pool basis.
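For reference, this is how the existing global knob is adjusted today on Linux; the value shown is only an example, and the setting affects every pool on the system, which is exactly the limitation described above:

```
# metaslab_aliquot is a zfs kernel module parameter: roughly, how many bytes
# are written to one top-level vdev before the allocator rotates to the next.
cat /sys/module/zfs/parameters/metaslab_aliquot
# Raise it globally (as root); there is no per-pool or per-group form today.
echo $((64 * 1024 * 1024)) > /sys/module/zfs/parameters/metaslab_aliquot
```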
How will this feature improve OpenZFS?
I believe this will enable some very useful and interesting use cases.
To see how this feature allows supporting disks of different sizes, let's assume there are four disks: A is 6TB, B is 10TB, and C and D are 12TB. This is not uncommon in a homelab setup, where we incrementally add new drives and the drives on the market get bigger over time.
We can then divide B into 6TB + 4TB, and C and D into 6TB + 4TB + 2TB. Now we can form a zpool with three top-level vdevs: 6TB x 4 as raidz, 4TB x 3 as raidz, and 2TB x 2 as mirror. Without this feature this works, but performance would suffer, because ZFS stripes between the three vdevs, so a sequential write constantly seeks between three regions of the same disks, which is quite slow and wears out the drives. With this feature, this layout would work well.
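A sketch of how such a pool could be created today (the proposed linear grouping on top of it is what this issue asks for); the device and partition names are illustrative:

```
#   sda (6TB):  sda1=6T
#   sdb (10TB): sdb1=6T sdb2=4T
#   sdc (12TB): sdc1=6T sdc2=4T sdc3=2T   (sdd partitioned the same way)
# -f is needed because mixing raidz widths and a mirror triggers a
# "mismatched replication level" warning.
zpool create -f tank \
  raidz  sda1 sdb1 sdc1 sdd1 \
  raidz  sdb2 sdc2 sdd2 \
  mirror sdc3 sdd3
```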
Imagine we have 4 x 10TB disks. Instead of setting them up as one raidz vdev of 10TBx3, we can divide each disk into 10 x 1TB partitions and create 10 linear top-level raidz vdevs, each of them 1TBx3. By marking them as linear, in the future we can shrink the pool to 9TBx3 by removing the last vdev, 8TBx3 by removing the last 2 vdevs, etc. We can even get something like 7.5TBx3 by removing the last 3 vdevs and then adding 0.5TBx3.
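A sketch of the shrinking step; note this part is aspirational, since today zpool remove can evacuate mirror and single-device top-level vdevs but not raidz ones, so this use case would also need device removal to learn about raidz:

```
# Hypothetical, assuming raidz top-level removal were supported; vdev names
# follow the usual raidz1-N numbering.
zpool remove tank raidz1-9        # drop the last 1TBx3 vdev, shrinking the pool
zpool wait -t remove tank         # wait for the evacuation to complete
zpool list -v tank                # confirm the new layout
```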
I would say these two use cases address two important pain points of ZFS for many people, and if they were supported, many of those people would have chosen ZFS.
Additional context
I'm not familiar with ZFS internals, but if everyone else is very busy, I'm happy to implement this and send a PR, if someone can give me some guidance on how to start working on this project.