
Conversation

@pcd1193182 (Contributor) commented Jul 25, 2025

Sponsored by: Eshtek, creators of HexOS; Klara, Inc.

Motivation and Context

For industry and commercial use cases, the existing redundancy solutions in ZFS (mirrors and RAIDZ) work great. They provide high-performance, reliable, efficient storage. For enthusiast users, however, they have a drawback: RAIDZ and mirrors treat every drive in the vdev as though it were the size of the smallest drive, so that they can provide their reliability guarantees. If you can afford to buy a new box of drives for your pool, like large-scale enterprise users, that's fine. But if you already have a mix of hard drives of various sizes, and you want to use all of the space they have available while still benefiting from ZFS's reliability and feature set, there isn't currently a great solution for that problem.

Description

The goal of Anyraid is to fill that niche. Anyraid allows devices of mismatched sizes to be combined together into a single top-level vdev. In the current version, Anyraid only supports mirror-type parity, but raidz-type parity is planned for the near future.

Anyraid works by dividing each of the disks that make up the vdev into tiles. These tiles are the same size across all disks within a given anyraid vdev. The size of a tile is 1/64th of the size of the smallest disk present at creation time, or 16GiB, whichever is larger. These tiles are then combined to form the logical vdev that anyraid presents, with sets of tiles from different disks acting as mini-mirrors, allowing the reliability guarantees to be preserved. Tiles are allocated on demand; when a write comes into a part of the logical vdev that doesn't have backing tiles yet, the Anyraid logic picks the nparity + 1 disks with the most unallocated tiles and allocates one tile from each of them. These physical tiles are combined into one logical tile, which is used to store data for that section of the logical vdev.
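As a rough illustration of that selection rule (a sketch only, not the actual vdev_anyraid code; the structure and function names below are made up), allocating one logical tile looks roughly like this:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: stand-in for per-disk tile accounting. */
typedef struct {
	int d_id;          /* disk index within the anyraid vdev */
	int d_free_tiles;  /* unallocated tiles remaining on this disk */
} toy_disk_t;

/*
 * Pick the nparity + 1 disks with the most unallocated tiles and take one
 * tile from each; together those physical tiles back one logical tile.
 * Returns 0 on success, -1 if too few disks still have free tiles.
 */
static int
toy_allocate_logical_tile(toy_disk_t *disks, int ndisks, int nparity,
    int *chosen /* out: nparity + 1 disk ids */)
{
	for (int n = 0; n < nparity + 1; n++) {
		int best = -1;
		for (int i = 0; i < ndisks; i++) {
			bool already = false;
			for (int j = 0; j < n; j++) {
				if (chosen[j] == disks[i].d_id)
					already = true;
			}
			if (already || disks[i].d_free_tiles == 0)
				continue;
			if (best < 0 ||
			    disks[i].d_free_tiles > disks[best].d_free_tiles)
				best = i;
		}
		if (best < 0)
			return (-1);
		chosen[n] = disks[best].d_id;
		disks[best].d_free_tiles--;
	}
	return (0);
}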

One important consequence of this design is that the mapping from logical offset to tiles (and therefore to actual physical disk locations) must be available before anything can be read from the pool. As a result, we cannot store the mapping in the MOS, since that would create a bootstrap problem. Instead, we reserve a region at the start of each disk for the Anyraid tile map. This region holds 4 copies of all the data necessary to reconstruct the mapping, updated in rotating order, like uberblocks. In addition, each disk has a full copy of all 4 maps, ensuring that as long as any one drive's copy survives, the tile map for a given TXG can be read successfully. The size of one copy of the tile map is 64MiB; that size determines the maximum number of tiles an anyraid vdev can have, which is 2^24: up to 2^8 disks, and up to 2^16 tiles per disk. This does mean that the largest device that can be fully used by an anyraid vdev is 1024 times the size of the smallest disk that was present at vdev creation time. This was considered an acceptable tradeoff, though it is a limit that could be relaxed in the future if needed; the primary difficulty is that either the tile map needs to grow substantially, or logic needs to be added to handle or prevent the tile map filling up.
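For concreteness, the limits above fall out of a little arithmetic (the macro names below are invented for illustration; the constants come from the description):

#include <stdint.h>
#include <stdio.h>

/* Constants taken from the description above; macro names are invented. */
#define TOY_MAX_DISKS           (1ULL << 8)    /* up to 2^8 disks */
#define TOY_MAX_TILES_PER_DISK  (1ULL << 16)   /* up to 2^16 tiles per disk */
#define TOY_MAX_TILES (TOY_MAX_DISKS * TOY_MAX_TILES_PER_DISK)  /* 2^24 total */

int
main(void)
{
	uint64_t smallest = 1ULL << 40;        /* e.g. smallest disk is 1TiB */
	uint64_t tile_size = smallest / 64;    /* 16GiB tiles in this example */

	/*
	 * One disk can contribute at most 2^16 tiles, so the largest disk
	 * that can be fully used is 2^16 * (smallest / 64) = 1024 * smallest.
	 */
	uint64_t max_usable = TOY_MAX_TILES_PER_DISK * tile_size;
	printf("tile: %llu GiB, largest fully usable disk: %llu TiB\n",
	    (unsigned long long)(tile_size >> 30),
	    (unsigned long long)(max_usable >> 40));
	return (0);
}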

Anyraid vdevs support all the operations that normal vdevs do: they can be resilvered, removed, and scrubbed. They also support expansion; new drives can be attached to the anyraid vdev, and their tiles will be used in future allocations. There is currently no support for rebalancing tiles onto new devices, although that is planned, as is vdev contraction.

New ZDB functionality was added to print out information about the anyraid mapping, to aid in debugging and understanding. A number of tests were also added, and ztest support for the new type of vdev was implemented.

How Has This Been Tested?

In addition to the tests added to the test suite and zloop runs, I also ran many manual tests of unusual configurations to verify that the tile layout behaves correctly. There was also some basic performance testing to verify that nothing was obviously wrong. Performance is not the primary design goal of anyraid, however, so in-depth analysis was not performed.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

@behlendorf behlendorf added the Status: Design Review Needed Architecture or design is under discussion label Jul 25, 2025
@pcd1193182 pcd1193182 force-pushed the anyraid branch 4 times, most recently from d8526e8 to c3b8110 Compare August 7, 2025 20:05
vdev_t *vd = mg->mg_vd;
if (B_FALSE) {
weight = 2 * weight - (msp->ms_id * weight) / vd->vdev_ms_count;
weight = MIN(weight, METASLAB_MAX_WEIGHT);
Member

I am suspicious about this math:

  1. Unlike the linear space-based weight, the segment-based weight is exponential. So by doubling the weight of the first metaslabs, you actually increase it exponentially. I.e., if the first metaslab has only one free segment of only 1MB, and therefore a weight with INDEX=20 and COUNT=1, doubling that gives INDEX=40 and COUNT=2, which makes it absolutely unbeatable for most metaslabs, since beating it would require a free segment of up to 1TB (see the sketch below).
  2. All free metaslabs are identical and are selected based on their offset, which is OK if that is expected, but they also don't go through this path, so they have very little chance of ever being used until all the earlier metaslabs are filled to the brim.
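
To make item 1 concrete, here is a toy model of the packed weight (the bit layout below is assumed purely for illustration and is not the exact metaslab weight encoding):

#include <stdint.h>
#include <stdio.h>

/* Toy packing: segment-size index in the high bits, count in the low bits. */
#define TOY_INDEX_SHIFT 55
#define TOY_COUNT_MASK  ((1ULL << TOY_INDEX_SHIFT) - 1)

int
main(void)
{
	/* One free 1MB segment: index = 20 (2^20 bytes), count = 1. */
	uint64_t weight = ((uint64_t)20 << TOY_INDEX_SHIFT) | 1;
	uint64_t doubled = 2 * weight;

	printf("index %llu count %llu\n",
	    (unsigned long long)(doubled >> TOY_INDEX_SHIFT),
	    (unsigned long long)(doubled & TOY_COUNT_MASK));
	/*
	 * Prints "index 40 count 2": the doubled weight now looks like a
	 * metaslab with a free segment around 2^40 bytes (~1TB), which
	 * almost no other metaslab can beat.
	 */
	return (0);
}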

@pcd1193182 (Contributor Author)

Hm, yeah, doubling is probably overkill for this. But we do need something that will interpolate nicely later into the vdev. Perhaps what we do is something like "add 3 - ((ms_id * 4) / vdev_ms_count) to the index and add 20 - (((ms_id * 20) / (vdev_ms_count / 5)) % 20) to the count"? So for the first quarter we would add 3 to the index, then 2, 1, and 0. And within each quarter, we would add 19 to the first metaslab, 18 to the second, etc. It's not ideal, since larger vdevs will have large plateaus where the modifications are the same, but that's probably alright. We just want to concentrate writes generally earlier; a little mixing in adjacent metaslabs on large pools is probably fine. And with the largest index difference being 3, the last metaslabs only need to have segments 8x as large as the first one to compete.

@amotin (Member) commented Aug 8, 2025

For purposes of spinning-disk performance we don't really need a smooth curve. Having just several "speed zones" would be enough to get most of the performance. I propose to leave the first zone as it is; for the second zone, subtract 1 from the index while doubling the count to keep it equivalent; for the third zone, subtract another one from the index and again double the count, etc. This logic may not be great beyond 3-4 zones, but IMO that should be enough to get most of the speed, and I think it could actually be applied independently of anyraid to any rotating vdevs. We may choose not to apply this to metaslabs with an index below some threshold, since sequential speed there is not worth much, and we should better focus on lower fragmentation.

For purposes of anyraid's tile allocation, I think you only need to prefer already-used metaslabs over empty ones to a certain degree. Once some metaslabs are used, I am not sure (yet?) why you would really prefer one used metaslab to another, since a single used block is enough for a tile to stay allocated no matter what. So I think it may be enough to account free metaslabs not as one free segment (an index of the full size and a count of 1, which makes them unbeatable now), but to split that a few times, in the way I described above for HDDs, down to a level just below (or equal to) the last speed zone of used metaslabs. Free metaslabs do not need zones, since the current code already sorts identical metaslabs by offset.

@pcd1193182 (Contributor Author)

I agree we probably don't need gradations beyond the few top-level speed zones. I think the simplest algorithm that satisfies our goals is to just add N - 1 to the index in the first zone, N - 2 in the second, etc, for N zones. That will prefer earlier metaslabs, and since untouched metaslabs won't hit this code, they will naturally end up with nothing added to the index, and end up sorting with everything in the final zone (and then earlier ones will sort first, as you said). I see the idea behind decreasing the index and doubling the count to match, but I don't think that's actually important; we don't multiply the segment size and count together to get the available space anywhere, so we don't need to preserve that relationship.

This prefers earlier metaslabs for the general rotational performance bonus, and prefers already used metaslabs for anyraid. We can also disable this logic if the index is below a critical value (24?) so that if we're very fragmented, we abandon this and only focus on the actual free space.
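
A minimal sketch of that scheme (the names, the zone count, and the cutoff value are placeholders, not the final patch):

#include <stdint.h>

#define TOY_NZONES     4
#define TOY_MIN_INDEX  24  /* below this, skip the bias and use the raw weight */

/*
 * Bias the segment-weight index so that earlier metaslabs (lower ms_id)
 * sort ahead: zone 0 gets +(N-1), the last zone gets +0. Untouched
 * metaslabs never reach this path and so sort with the final zone.
 */
static unsigned
toy_bias_index(unsigned index, uint64_t ms_id, uint64_t ms_count)
{
	if (index < TOY_MIN_INDEX)
		return (index);

	unsigned zone = (unsigned)((ms_id * TOY_NZONES) / ms_count);
	return (index + (TOY_NZONES - 1 - zone));
}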

Member

We can also disable this logic if the index is below a critical value (24?) so that if we're very fragmented, we abandon this and only focus on the actual free space.

Yea. In an unrelated context, I was also recently thinking that we could do more when we reach some fragmentation or capacity threshold.

@pcd1193182 (Contributor Author)

One annoying caveat about changing the weighting algorithm dynamically is that the assertions that the weight must not decrease while a metaslab is unloaded mean we have to be a little careful to design and test it with those constraints in mind.

Comment on lines +1880 to +1884
if (vd->vdev_parent->vdev_ops == &vdev_anyraid_ops) {
vdev_anyraid_write_map_sync(vd, zio, txg, good_writes, flags,
status);
@amotin (Member) commented Aug 8, 2025

I suppose this will write all the maps every time? In addition to 2 one-sector uberblock writes per leaf vdev we now write up to 64MB per one?

@pcd1193182 (Contributor Author)

Yes, we now write out the whole map every TXG. In theory that could be 64MiB, but in practice it's usually on the order of kilobytes; an anyraid vdev with 32 disks, with an average of 256 tiles each and 80% of them mapped, would use about 26KB. We have 64MiB here because that's the maximum the mapping could possibly reach, not because we expect it to be anywhere close to that in practice.
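
A back-of-the-envelope check of that figure, assuming roughly 4 bytes of map space per mapped tile (the actual per-entry size in the on-disk format may differ):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	const uint64_t bytes_per_entry = 4;   /* assumed, not from the code */
	uint64_t disks = 32, tiles_per_disk = 256;
	uint64_t mapped = disks * tiles_per_disk * 80 / 100;  /* ~6553 tiles */

	printf("~%llu KB of map data\n",
	    (unsigned long long)(mapped * bytes_per_entry / 1024));
	/* ~25-26KB, versus the 64MiB worst-case reservation. */
	return (0);
}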

We could add an optimization that doesn't update the map if nothing changed in a given txg; we'd still want to update the header, but the mapping itself could be left unmodified.

void *buf = abd_borrow_buf(map_abd, SPA_MAXBLOCKSIZE);

rw_enter(&var->vd_lock, RW_READER);
anyraid_tile_t *cur = avl_first(&var->vd_tile_map);
@amotin (Member) commented Aug 8, 2025

As I understand it, you have only one copy of the tile map that covers all TXGs. That may be OK as long as you don't need precise accounting and are never going to free tiles. What is not OK, I suppose, is that for writes done in open context (such as Direct I/O and the ZIL), maps will not be written until the end of the next committed TXG. Direct I/O might not care, but it may be impossible to replay the ZIL after a crash if its blocks, or blocks they reference, were written to new tiles that are not yet synced.

@pcd1193182 (Contributor Author)

That's not correct; there are 4 copies per vdev, which rotate per TXG. So there are plenty of copies of the map for each TXG, and there are 4 TXGs' worth of maps.

As for the ZIL issue, that is a good point. We need to prevent ZIL blocks from ending up in unmapped tiles. While this probably wouldn't actually cause a problem in practice (when you import the pool again the tile would get mapped in the same way as before, since the vdev geometry hasn't changed), it definitely could in theory (if you had multiple new tiles in the same TXG, and they got remapped in a different order than they were originally).

The best fix I came up with is to prevent ZIL writes from being allocated to unmapped tiles in the first place. I also considered trying to stall those writes until the TXG synced, but that's slow and also technically annoying. I also considered having a journal of tile mappings that we would update immediately, but that adds a lot of complexity. Preventing the allocation nicely solves the problem, and if they can't find a place to allocate in all the mapped tiles, we already have logic to force the ZIL to fall back to txg_wait_synced.
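
A toy model of that restriction (none of these names exist in the actual patch; this only illustrates the check being described):

#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the synced portion of the tile map. */
typedef struct {
	const uint8_t *synced;   /* one flag per logical tile: mapping on disk? */
	uint64_t tile_size;
} toy_tile_map_t;

/*
 * Open-context (ZIL) writes may only land in tiles whose mapping has
 * already been synced; otherwise ZIL replay after a crash could reference
 * blocks in tiles that were never persisted. Allocations that cannot be
 * placed fall back to the existing txg_wait_synced path.
 */
static bool
toy_zil_alloc_allowed(const toy_tile_map_t *m, uint64_t offset)
{
	return (m->synced[offset / m->tile_size] != 0);
}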

@amotin amotin added the Status: Revision Needed Changes are required for the PR to be accepted label Aug 8, 2025
@tonyhutter (Contributor)

Overall this is a really nice feature! I haven't looked at the code yet, but did kick the tires a little and have some comments/questions.

  1. Regarding:

The size of a tile is 1/64th of the size of the smallest disk present at creation time, or 16GiB, whichever is larger.

How did you arrive at the 16GiB minimum tile size? (Forgive me if this is mentioned in the code comments.) I ask since it would be nice to have a smaller tile size to accommodate smaller vdevs (and give more free space, since it's rounded to tile-sized boundaries).

  2. We should tell the user the minimum anyraid vdev size if they pass too small a vdev. Currently the error is:
$ sudo ./zpool create tank anyraid ./8gb_file1 ./8gb_file2
cannot create 'tank': one or more devices is out of space
  3. We should document that autoexpand=on|off is ignored by anyraid to mitigate any confusion/ambiguity.

  4. I was able to create an anyraid1 pool with an anyraid1 special device, which is nice. However, I could not create an anyraid1 pool with a mirror special device, even though they're the same redundancy level (special devices must have same redundancy level as the pool). We should update the checks to allow mirror/raidz/anyraid/anyraidz equivalent redundancy levels with special vdevs.

  5. This PR uses anyraid, anyraid0, anyraid1, anyraid2 naming for the TLD type. What if we copied the current "mirror"/"raidz" naming convention, like?

anymirror, anymirror0, anymirror1, anymirror2

anyraidz, anyraidz1

That way there's no ambiguity if the anyraid TLD is a mirror or raidz flavor. It also opens the path to anyraidz1, which was mentioned in the Anyraid announcement:

"With ZFS AnyRaid, we will see at least two new layouts added: AnyRaid-Mirror and AnyRaid-Z1. The AnyRaid-Mirror feature will come first, and will allow users to have a pool of more than two disks of varying sizes while ensuring all data is written to two different disks. The AnyRaid-Z1 feature will apply the same concepts of ZFS RAID-Z1, but while supporting mixed size disks."

https://hexos.com/blog/introducing-zfs-anyraid-sponsored-by-eshtek

@tonyhutter (Contributor)

I also noticed that the anyraid TLD names don't include the parity level. They all just say "anyraid":

	  anyraid-0                 ONLINE       0     0     0

We should have it match the raidz TLD convention where the parity level is included:

	  raidz1-0                  ONLINE       0     0     0
	  raidz2-0                  ONLINE       0     0     0
	  raidz3-0                  ONLINE       0     0     0

@pcd1193182 (Contributor Author) commented Aug 27, 2025

  1. Regarding:

The size of a tile is 1/64th of the size of the smallest disk present at creation time, or 16GiB, whichever is larger.

How did you arrive at the 16GiB min tile size? (forgive me if this is mentioned in the code comments) I ask, since it would be nice to have a smaller tile size to accommodate smaller vdevs (and give more free space, since it's rounded to tile-sized bounderies).

16GiB was selected mostly because that makes the minimum line up with the standard fraction (1/64th) at a 1TiB disk. That's a nice round number, and a pretty reasonable size for "a normal size disk" these days; anything less than 1TiB is definitely on the smaller side. The other effect of this value is that with this tile size, you can have any disk up to 1PiB in size and still be able to use all the space; any disk that's more than 2^16 tiles can't all be used.

It is possible to have smaller tile sizes; we do it in the test suite a bunch. There is a tunable, zfs_anyraid_min_tile_size, that controls this.
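
Sketch of that sizing rule as described (the function is illustrative; only the zfs_anyraid_min_tile_size tunable name comes from the comment above):

#include <stdint.h>

/*
 * 16GiB default floor; the real module exposes this as the
 * zfs_anyraid_min_tile_size tunable mentioned above.
 */
static uint64_t toy_min_tile_size = 16ULL << 30;

/* 1/64th of the smallest disk at creation time, clamped to the minimum. */
static uint64_t
toy_tile_size(uint64_t smallest_disk_size)
{
	uint64_t tile = smallest_disk_size / 64;

	return (tile < toy_min_tile_size ? toy_min_tile_size : tile);
}

/*
 * With the 16GiB floor and at most 2^16 tiles per disk, a disk of up to
 * 16GiB * 2^16 = 1PiB can be fully used, matching the figure above.
 */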

  2. We should tell the user the minimum anyraid vdev size if they pass too small a vdev. Currently the error is:
$ sudo ./zpool create tank anyraid ./8gb_file1 ./8gb_file2
cannot create 'tank': one or more devices is out of space

That's fair, we could have a better error message for this case. I can work on that.

  3. We should document that autoexpand=on|off is ignored by anyraid to mitigate any confusion/ambiguity.

I think autoexpand works like normal? It doesn't affect the tile size or anything, because the tile size is locked in immediately when the vdev is created, but it should affect the disk sizes like normal. Maybe the tile capacity doesn't change automatically? But that's probably a bug, if so. Did you run into this in your testing?

  4. I was able to create an anyraid1 pool with an anyraid1 special device, which is nice. However, I could not create an anyraid1 pool with a mirror special device, even though they're the same redundancy level (special devices must have same redundancy level as the pool). We should update the checks to allow mirror/raidz/anyraid/anyraidz equivalent redundancy levels with special vdevs.

Interesting, I will investigate why that happened. Those should be able to mix for sure.

  5. This PR uses anyraid, anyraid0, anyraid1, anyraid2 naming for the TLD type. What if we copied the current "mirror"/"raidz" naming convention, like?
anymirror, anymirror0, anymirror1, anymirror2

anyraidz, anyraidz1

That way there's no ambiguity if the anyraid TLD is a mirror or raidz flavor. It also opens the path to anyraidz1, which was mentioned in the Anyraid announcement:

...

I'm open to new naming options. My vague plan was to use anyraidz{1,2,3} for the RAID-Z-style parity when that support is added. But having the mirror parity carry a clearer name probably does make sense. I'm open to anymirror; I was also thinking about anyraidm as a possibility.

I also noticed that the anyraid TLD names don't include the parity level. They all just say "anyraid":

	  anyraid-0                 ONLINE       0     0     0

Good point, I will fix that too.

@junkbustr commented Aug 30, 2025

This is a nit, but in the description I believe there is a typo:

"Anyraid works by diving each of the disks that makes up the vdev..."

I believe the intent was for dividing.

@github-actions github-actions bot removed the Status: Revision Needed Changes are required for the PR to be accepted label Sep 2, 2025
@pcd1193182 pcd1193182 force-pushed the anyraid branch 3 times, most recently from 64b9223 to ae25ac3 Compare September 5, 2025 18:16
@pcd1193182 pcd1193182 force-pushed the anyraid branch 4 times, most recently from 6e96d25 to 7c87ca3 Compare September 10, 2025 22:35
@tonyhutter (Contributor) commented Sep 15, 2025

  1. The validation logic will need to be tweaked to allow differing numbers of vdevs per anyraid TLD:
$ truncate -s 30G file1_30g
$ truncate -s 40G file2_40g
$ truncate -s 20G file3_20g
$ truncate -s 35G file4_35g
$ truncate -s 35G file5_35g
$ sudo ./zpool create tank anyraid ./file1_30g ./file2_40g ./file3_20g anyraid ./file4_35g ./file5_35g
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: both 3-way and 2-way anyraid vdevs are present
  2. The anyraid TLD type string needs checks as well:
$ ./zpool create tank anyraid-this_should_not_work ./file1_30g
$ sudo ./zpool status
  pool: tank
 state: ONLINE
config:

	NAME                            STATE     READ WRITE CKSUM
	tank                            ONLINE       0     0     0
	  anyraid0-0                    ONLINE       0     0     0
	    /home/hutter/zfs/file1_30g  ONLINE       0     0     0

errors: No known data errors
  3. Regarding:

This PR uses anyraid, anyraid0, anyraid1, anyraid2 naming for the TLD type. What if we copied the current "mirror"/"raidz" naming convention, like? anymirror, anymirror0, anymirror1, anymirror2, anyraidz, anyraidz1

I'm open to new naming options. My vague plan was to use anyraidz{1,2,3} for the RAID-Z-style parity when that support is added. But having mirror-parity have a clearer name does probably make sense. I'm open to anymirror; I was also think about anyraidm as a possibility.

I prefer the anymirror name over anyraidm, just to keep convention with mirror. Same with my preference for the future anyraidz name for the same reasons.

  4. I don't know if this has anything to do with this PR, but I noticed the rep_dev_size in the JSON was a little weird. Here I create an anyraid pool with 30GB, 40GB, and 20GB vdevs:
$ sudo ./zpool status -j | jq
 ...
              "vdevs": {
                "/home/hutter/zfs/file1_30g": {
                  "name": "/home/hutter/zfs/file1_30g",
                  "vdev_type": "file",
                  "guid": "2550367119017510955",
                  "path": "/home/hutter/zfs/file1_30g",
                  "class": "normal",
                  "state": "ONLINE",
                  "rep_dev_size": "16.3G",
                  "phys_space": "30G",
...
                },
                "/home/hutter/zfs/file2_40g": {
                  "name": "/home/hutter/zfs/file2_40g",
                  "vdev_type": "file",
                  "guid": "17589174087940051454",
                  "path": "/home/hutter/zfs/file2_40g",
                  "class": "normal",
                  "state": "ONLINE",
                  "rep_dev_size": "16.3G",
                  "phys_space": "40G",
...
                },
                "/home/hutter/zfs/file3_20g": {
                  "name": "/home/hutter/zfs/file3_20g",
                  "vdev_type": "file",
                  "guid": "6265258539420333029",
                  "path": "/home/hutter/zfs/file3_20g",
                  "class": "normal",
                  "state": "ONLINE",
                  "rep_dev_size": "261M",
                  "phys_space": "20G",
...

I'm guessing the first two vdevs report a rep_dev_size of 16.3G due to tile alignment. What I don't get is the 261M value for the 3rd vdev. I would have expected a 16.3G value there.

Paul Dagnelie added 9 commits October 3, 2025 10:28
Signed-off-by: Paul Dagnelie <[email protected]>
Sponsored-by: Eshtek, creators of HexOS
Sponsored-by: Klara, Inc.
Comment on lines +9330 to +9335
static int
log_10(uint64_t v) {
char buf[32];
snprintf(buf, sizeof (buf), "%llu", (u_longlong_t)v);
return (strlen(buf));
}
Contributor

Could you use log10() from math.h?

@pcd1193182 (Contributor Author)

Technically yes, but I don't actually care about the value of the logarithm, really. The thing I want is the ability to know how much space the number will take when printed, so I can format properly. I'm going to rename this to reflect that, rather than using the math function and having to do rounding and everything. This could also matter with different LOCALEs in theory.

}

zfeature_register(SPA_FEATURE_ANYRAID,
"com.klarasystems:anyraid", "anyraid", "Support for anyraid VDEV",
Contributor

I like that the overarching name for anymirror+anyraid is called "anyraid". For this feature flag though, do we need to be more specific and call it "anymirror", and then add another feature flag for the actual raid part follow-on ("anyraid")? Or can we use the same feature flag for both anymirror and the future anyraid?

@pcd1193182 (Contributor Author)

We should be able to use the same feature flag for both. Pools with anyraidz will fail to import on systems without the relevant logic, but it should fail gracefully with ENOTSUP. Given that, I don't think we need a separate feature flag for the two.

* Initialize private VDEV specific fields from the nvlist.
*/
static int
vdev_anyraid_init(spa_t *spa, nvlist_t *nv, void **tsd)
Contributor

Should this be?:

- vdev_anyraid_init(spa_t *spa, nvlist_t *nv, void **tsd)
+ vdev_anyraid_init(spa_t *spa, nvlist_t *nv, vdev_anyraid_t **tsd)

@pcd1193182 (Contributor Author) commented Oct 7, 2025

I believe this doesn't work because the vdev_*_init functions get used to set the vdev_op_init field of the vdev ops, which requires a vdev_init_func_t, which is typedef int vdev_init_func_t(spa_t *spa, nvlist_t *nv, void **tsd); You'll note that raidz, for example, also uses a void ** here.
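
A compressed illustration of that constraint (the struct below is heavily abbreviated; the real vdev_ops_t has many more fields):

/* Forward declarations just to make the illustration self-contained. */
typedef struct spa spa_t;
typedef struct nvlist nvlist_t;

/* The shared init signature every vdev type must match. */
typedef int vdev_init_func_t(spa_t *spa, nvlist_t *nv, void **tsd);

typedef struct toy_vdev_ops {
	vdev_init_func_t *vdev_op_init;  /* one slot shared by all vdev types */
	/* ... many more function pointers ... */
} toy_vdev_ops_t;

/*
 * Because the same slot is used for mirror, raidz, anyraid, etc., each
 * type's init function takes void **tsd and casts internally rather than
 * declaring its own private type (e.g. vdev_anyraid_t **) in the signature.
 */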

"feature@redaction_list_spill"
"feature@dynamic_gang_header"
"feature@physical_rewrite"
"feature@anyraid"
Contributor

"feature@anymirror"?

for type in "mirror" "anymirror1"; do
log_must zpool create -f $TESTPOOL $type $DISK1 $DISK2
if [[ "$type" == "anymirror1" ]]; then
log_must dd if=/dev/urandom of=/$TESTPOOL/f1 bs=1M count=2k
Contributor

Is this dd necessary? 1:52 of the total 1:55 test time is the dd:

19:33:02.54 SUCCESS: zpool create -f testpool anymirror1 loop0 loop1             
19:34:54.98 SUCCESS: dd if=/dev/urandom of=/testpool/f1 bs=1M count=2k 

@pcd1193182 (Contributor Author) commented Oct 7, 2025

Yes. Well, maybe. Mostly, anyway.

The way that initialize and trim works with anyraid is a little weird. Because there is no logical-to-physical mapping for a given offset before the backing tile gets allocated, initialize on an empty anyraid device will complete basically instantly. We need to actually allocate all the tiles to get the initialize to take enough time for the test to work.

We can speed this up by using /dev/zero instead of /dev/urandom and setting compress=off first. And it's possible that a smaller file size could work, though I think I was seeing flakiness with smaller sizes. Really what we want is to be able to just force all the tiles to get allocated, but there's no way to force that right now aside from writing data. Maybe with the new zhack metaslab leak functionality we could allocate most of the space in each metaslab and then just write a little bit to each one.

Contributor

using /dev/zero instead but setting compress=off first

Alternatively, you could use file_write -d R ...

file_write -d R is pseudorandom enough not to compress, and much faster:

$ time sudo dd if=/dev/urandom of=testfile bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 3.79406 s, 566 MB/s

real	0m3.973s
user	0m0.005s
sys	0m0.003s

$ time ./tests/zfs-tests/cmd/file_write -o create -f testfile -b $((1024 * 1024)) -c 2048 -d R
real	0m0.646s
user	0m0.057s
sys	0m0.513s

@owlshrimp commented Oct 6, 2025

@pcd1193182 I apologize for my tardiness, as I only just found the time to watch the leadership meeting introducing anyraid. This is with regard to the throughput concern in the single-writer case, where in (e.g.) a two-disk-mirror anyraid that writer will only ever be writing to two given tiles [1] on two given disks at any one time. I have a potential solution, though I expect everyone will recoil in horror at it.

What seems to be lost in anyraid is that the top level of the pool is able to raid-0 stripe writes across all vdevs, whereas in anyraid you are essentially creating a series of mirror vdevs which are hidden from the top-level stripe. They must be filled completely in sequential order. It's as if you had a rather naive operator adding a series of normal mirror vdevs to a normal pool, taking action only when each one filled completely to capacity (and who never touches old data).

The suggestion: You would pre-generate the mirror/raidZ mappings slightly in advance, say for a complete "round" [2] at a time. You would then internally in the anyraid borrow the toplevel pool's raid-0 logic [3], striping/swizzling writes coming into the anyraid across each set of mirror/raidZ mappings. This would very much suck if you have very unbalanced disk sizes as in [2] below (which has high contention on single disk C and would probably be no better than existing sequential anyraid) but may be significantly less bad and more balanced for configurations like {4TB, 4TB, 8TB, 8TB} or {4TB, 4TB, 4TB}. In this way, you don't have a series of temporary mirrors (or raidZs), but a series of raid-0 stripes containing balanced sets of overlapping mirrors (or raidZs) with more parallel writing that is more balanced across more disks.

Admittedly, the idea of inserting raid-0 vdevs into the middle of the stack is an extremely unpleasant one. That said, with proper implementation of the mirror or raidz logic by the anyraid, my guess(?) is that it's no more hazardous than the existing architecture of anyraid or a normal zpool. The conditions for failure of the array seem the same (to be verified). I expect there will probably be a few funny quirks though. This is also likely far more palatable than my initial idea of anyraid offering up multiple vdevs on a silver platter to the toplevel stripe, which would expose the anyraid's internal scheduling to it and make a colossal mess.

[1] 64GB regions, allocated from available disks by anyraid
[2] in the case of 3 disks {A:4TB, B:4TB, C:8TB} one round would be an allocation of a pair of mirror tiles from A+C and then B+C
[3] perhaps it could be borrowed like mirror and raidz logic is borrowed by draid and anyraid, by making it organizationally more like a raid-0 vdev?

@pcd1193182 (Contributor Author)

@pcd1193182 I apologize for my tardiness, as I only just found the time to watch the leadership meeting introducing anyraid.

No worries, thanks for your feedback!

The suggestion: You would pre-generate the mirror/raidZ mappings slightly in advance, say for a complete "round" [2] at a time.

So, a "round" is not a thing that we have any real concept of inside of anyraid. It's one of those things that's easy to think about in specific cases in your head, but hard to define in the general case in the code. The simplest definition is probably "a round is N allocations that result in at least one allocation to every disk". Which works fine, except that with extreme vdev layouts this can result in you allocating a lot of tiles. Consider a pair of 10T disks and three 1T disks. No tiles will be allocated on the 1T disks until the 10T disks are most of the way full.

"Surely, Paul, this is just a result of the choice of the algorithm for selecting tiles. Wouldn't a different algorithm fix this problem?" you might justifiably ask. Let's say we go by capacity %age first, and then by raw amount of free space. This does result in every disk getting a tile allocated early on... which can result in less efficient space usage in some extreme scenarios (a disk with 10 tiles and 10 disks with one tile should be able to store 10 tiles in parity=1, but with this algorithm stores only 6). Even for more normal cases, like 10T and 1T disks, sure the first round will be just a quick set of tiles from each disk. But the second set will go back to being several from the 10T disks before there are any more from the 1T disks.

Now, it is probably not worth optimizing for extreme layouts when, in practice, not many disks are going to store only a few tiles. But even with a "perfect" selection algorithm, this only goes so far, as we will discuss. The best we can do, unfortunately, is smooth things out a little bit.

You would then internally in the anyraid borrow the toplevel pool's raid-0 logic

To dive into the internals a bit here, there is no "raid-0 logic" in the vdev code. There is no raid-0 top-level vdev to borrow from, because there are no raid-0 top level vdevs. The top-level vdevs are actually combined together in the metaslab code, where we have the rotors in the metaslab classes that are used to move allocations between top level vdevs, and the allocation throttles and queues that control how much goes to each one. This is where we'd want to hook in probably.

striping/swizzling writes coming into the anyraid across each set of mirror/raidZ mappings. This would very much suck if you have very unbalanced disk sizes ... but may be significantly less bad and more balanced for configurations like ...

It's worth noting that because of the way that allocators work in ZFS, you don't actually just write to a single tile at a time. Different metaslabs get grabbed by different allocators, and those can and will be on different tiles (especially if we increase the size of metaslabs, as we've been discussing for quite a while now and it seems high time for). Writes are distributed across those roughly evenly, so writes will spread to different tiles. And so if you have relatively even disk sizes, everything is basically already going to work out for you. Tiles will mostly spread out nicely, and you'll be distributing writes across your disks, and later reads will do the same.

The problem mostly only appears with very different disk sizes, and this is where we run into the fundamental problem that underlies all of this: If a disk has 10x as much space as another, we need to send 10x as many writes to it to fill it. So unless that disk is 10x as fast as the other one, it will be the performance bottleneck. We can smooth things out a little by trying to write a little bit to all the other disks while we write to that disk, but all that does is remove the lumps from the performance curve. Which is not a bad thing to do! But no matter how you slice it, if you have disks with very different sizes, you are going to have performance problems where the big disks bottleneck you.

Admittedly, the idea of inserting raid-0 vdevs into the middle of the stack is extremely unpleasant one.

Especially when you take into account that the raid-0 code (to the extent that it exists) lives at the metaslab level. This is ultimately why I didn't implement something like this. It does have a benefit, and would probably result in improved user experience, but the code when I sketched it was very unpleasant.

And one nice thing about this idea and anyraid is that it doesn't have to be part of the initial implementation. No part of the proposed idea depends on anything about the ondisk format or the fundamentals of the design, which means that it can be iterated on. If performance proves to be a problem for end users, there can be another patch that tries out this idea, or a related one. But by keeping this PR smaller and more focused on just the vdev architecture itself, we can keep the review process more focused and hopefully get things integrated more efficiently.

The initial goal for anyraid is correctness and space maximization. Performance is secondary, but it is also separable, and can be iterated on later or by others who are focused on it.

This is also likely far more palatable than my initial idea of anyraid offering up multiple vdevs on a silver platter to the toplevel stripe, which would expose the anyraid's internal scheduling to it and make a colossal mess.

Unfortunately, that idea is basically this idea, because of the details of ZFS's internals :)

@owlshrimp

So, a "round" is not a thing that we have any real concept of inside of anyraid. It's one of those things that's easy to think about in specific cases in your head, but hard to define in the general case in the code. The simplest definition is probably "a round is N allocations that result in at least one allocation to every disk". Which works fine, except that with extreme vdev layouts this can result in you allocating a lot of tiles. Consider a pair of 10T disks and three 1T disks. No tiles will be allocated on the 1T disks until the 10T disks are most of the way full.

I suppose one could cap the maximum number of allocations opened at once (at the cost of reduced balance) but yeah the rest of this pretty much falls apart from there. Especially since I was under the impression the toplevel raid-0 metaphor was a lot more literal than it clearly actually is.

The initial goal for anyraid is correctness and space maximization. Performance is secondary, but it is also separable, and can be iterated on later or by others who are focused on it.

Separability is a wonderful thing. :)
