Add support for anyraid vdevs #17567
Conversation
vdev_t *vd = mg->mg_vd;
if (B_FALSE) {
	weight = 2 * weight - (msp->ms_id * weight) / vd->vdev_ms_count;
	weight = MIN(weight, METASLAB_MAX_WEIGHT);
I am suspicious about this math:
- Unlike the space-based weight, which is roughly linear, the segment-based weight is exponential. So by doubling the weight of the first metaslabs you are actually increasing it exponentially. I.e., if the first metaslab has only one free segment of only 1MB, and so a weight with INDEX=20 and COUNT=1, doubling that will give INDEX=40 and COUNT=2, which would make it absolutely unbeatable for most metaslabs, since competing with it would require a free segment of up to 1TB.
- All free metaslabs are identical and are selected based on their offset, which is OK if that is expected, but they also don't go through this path, so they have very little chance of ever being used until all the earlier metaslabs are filled to the brim.
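To make the exponential encoding concrete, here is a tiny standalone illustration (the bit positions are a simplified stand-in for the segment-based weight layout, not the exact metaslab_impl.h macros): doubling the packed value doubles the index field, which multiplies the implied segment size rather than the free space.

```c
#include <stdio.h>
#include <stdint.h>

/* Simplified packing: index in bits 60-55, count in the bits below. */
#define	IDX_SHIFT	55
#define	IDX_MASK	0x3fULL
#define	COUNT_MASK	((1ULL << IDX_SHIFT) - 1)

static uint64_t
pack_weight(uint64_t index, uint64_t count)
{
	return (((index & IDX_MASK) << IDX_SHIFT) | (count & COUNT_MASK));
}

static void
print_weight(const char *label, uint64_t w)
{
	printf("%s: index=%llu count=%llu\n", label,
	    (unsigned long long)((w >> IDX_SHIFT) & IDX_MASK),
	    (unsigned long long)(w & COUNT_MASK));
}

int
main(void)
{
	uint64_t w = pack_weight(20, 1);	/* one free 1MB segment */
	print_weight("original", w);		/* index=20 count=1 */
	print_weight("doubled ", 2 * w);	/* index=40 count=2 */
	return (0);
}
```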
Hm, yeah, doubling is probably overkill for this. But we do need something that will interpolate nicely later into the vdev. Perhaps what we do is something like: add `3 - ((ms_id * 4) / vdev_ms_count)` to the index and add `20 - (((ms_id * 20) / (vdev_ms_count / 5)) % 20)` to the count? So for the first quarter we would add 3 to the index, then 2, 1, and 0. And within each quarter, we would add 19 to the first metaslab, 18 to the second, etc. It's not ideal, since larger vdevs will have large plateaus where the modifications are the same, but that's probably alright. We just want to concentrate writes generally earlier; a little mixing in adjacent metaslabs on large pools is probably fine. And with the largest index difference being 3, the last metaslab only needs to have segments 8x as large as the first one to compete.
For the purposes of spinning-disk performance we don't really need a smooth curve. Having just several "speed zones" would be enough to get most of the performance. I propose leaving the first zone as it is; for the second zone, subtracting 1 from the index while doubling the count to keep it equivalent; for the third zone, subtracting another 1 from the index and again doubling the count; and so on. This logic may not be great beyond 3-4 zones, but IMO that should be enough to get most of the speed, and I think it could actually be applied to any rotating vdevs, independently of anyraid. We might not apply this to metaslabs with an index below some threshold, since sequential speed there is not worth much and we would be better off focusing on lower fragmentation.
For the purposes of anyraid's tile allocation, I think you only need to prefer already-used metaslabs over empty ones to a certain degree. Once some metaslabs are used, I am not sure (yet?) why you would really prefer one used metaslab to another, since a single used block is enough for a tile to stay allocated no matter what. So I think it may be enough to account free metaslabs not as one free segment (an index of the full size and a count of 1, which is what makes them unbeatable now), but split a few times in the way I described above for HDDs, down to a level just below (or equal to) the last speed zone of used metaslabs. Free metaslabs do not need zones, since the current code already sorts identical metaslabs by offset.
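To spell out the "keep it equivalent" step above: the transform preserves the implied free space, 2^index * count. A trivial check with made-up numbers:

```c
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint64_t index = 24, count = 3;	/* hypothetical zone-1 weight */
	uint64_t zone1 = (1ULL << index) * count;
	uint64_t zone2 = (1ULL << (index - 1)) * (count * 2);	/* -1 index, x2 count */
	printf("%llu == %llu\n", (unsigned long long)zone1,
	    (unsigned long long)zone2);	/* same implied free space */
	return (0);
}
```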
I agree we probably don't need gradations beyond the few top-level speed zones. I think the simplest algorithm that satisfies our goals is to just add `N - 1` to the index in the first zone, `N - 2` in the second, etc., for `N` zones. That will prefer earlier metaslabs, and since untouched metaslabs won't hit this code, they will naturally end up with nothing added to the index and sort with everything in the final zone (and then earlier ones will sort first, as you said). I see the idea behind decreasing the index and doubling the count to match, but I don't think that's actually important; we don't multiply the segment size and count together to get the available space anywhere, so we don't need to preserve that relationship.
This prefers earlier metaslabs for the general rotational performance bonus, and prefers already-used metaslabs for anyraid. We can also disable this logic if the index is below a critical value (24?) so that if we're very fragmented, we abandon this and only focus on the actual free space.
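A rough sketch of that shape, purely illustrative (the zone count, the index-24 cutoff, and the helper name are all just the values under discussion here, not anything settled):

```c
#include <stdint.h>

#define	N_ZONES			4
#define	MIN_INDEX_FOR_BONUS	24

/*
 * Illustrative helper: earlier metaslabs get a larger index bonus, in
 * N_ZONES coarse steps; heavily fragmented metaslabs get no bonus.
 */
static uint64_t
zone_index_bonus(uint64_t ms_id, uint64_t ms_count, uint64_t cur_index)
{
	if (cur_index < MIN_INDEX_FOR_BONUS)
		return (0);	/* very fragmented: sort on real free space only */
	uint64_t zone = (ms_id * N_ZONES) / ms_count;	/* 0 .. N_ZONES - 1 */
	return ((N_ZONES - 1) - zone);	/* first zone +3, ..., last zone +0 */
}
```

Untouched metaslabs never reach this path, so they naturally land in the +0 bucket and keep sorting by offset.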
> We can also disable this logic if the index is below a critical value (24?) so that if we're very fragmented, we abandon this and only focus on the actual free space.
Yea. In an unrelated context I was also recently thinking that we could do more when we reach some fragmentation or capacity threshold.
One annoying caveat about changing the weighting algorithm dynamically is that the assertions that the weight not decrease while a metaslab is unloaded mean we have to be a little careful to design/test it with those constraints in mind.
if (vd->vdev_parent->vdev_ops == &vdev_anyraid_ops) {
	vdev_anyraid_write_map_sync(vd, zio, txg, good_writes, flags,
	    status);
I suppose this will write all the maps every time? In addition to 2 one-sector uberblock writes per leaf vdev we now write up to 64MB per one?
Yes, we now write out the whole map every TXG. In theory that could be 64MiB, but in practice it's usually on the order of kilobytes; an anyraid with 32 disks, with an average of 256 tiles each, and 80% of them mapped would use 26KB. We have 64MB here because that's the maximum the mapping could possibly reach, not because we expect it to be anywhere close to that in practice.
We could add an optimization that doesn't update the map if nothing changed in a given txg; we'd still want to update the header, but the mapping itself could be left unmodified.
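For concreteness, the arithmetic behind that estimate, assuming roughly 4 bytes of map space per mapped tile (the per-entry size is an assumption for illustration, not the actual on-disk format):

```c
#include <stdio.h>

int
main(void)
{
	unsigned disks = 32, tiles_per_disk = 256;
	double mapped = 0.8;		/* fraction of tiles actually mapped */
	unsigned bytes_per_tile = 4;	/* assumed entry size, for illustration */
	double bytes = disks * tiles_per_disk * mapped * bytes_per_tile;
	printf("~%.0f KiB written per map update\n", bytes / 1024.0);	/* ~26 KiB */
	return (0);
}
```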
void *buf = abd_borrow_buf(map_abd, SPA_MAXBLOCKSIZE);
...
rw_enter(&var->vd_lock, RW_READER);
anyraid_tile_t *cur = avl_first(&var->vd_tile_map);
As I understand it, you have only one copy of the tile map, which covers all TXGs. That may be OK as long as you don't need precise accounting and are never going to free the tiles. But what is not OK, I suppose, is that for writes done in open context (such as Direct I/O and ZIL), maps will not be written until the end of the next committed TXG. Direct I/O might not care, but it may be impossible to replay the ZIL after a crash if its blocks, or the blocks they reference, were written to new tiles that are not yet synced.
That's not correct, there are 4 copies per vdev, which rotate per txg. So there are plenty of copies of the map for each TXG, and there are 4 TXGs worth of maps.
As for the ZIL issue, that is a good point. We need to prevent ZIL blocks from ending up in unmapped tiles. While this probably wouldn't actually cause a problem in practice (when you import the pool again the tile would get mapped in the same way as before, since the vdev geometry hasn't changed), it definitely could in theory (if you had multiple new tiles in the same TXG, and they got remapped in a different order than they were originally).
The best fix I came up with is to prevent ZIL writes from being allocated to unmapped tiles in the first place. I also considered trying to stall those writes until the TXG synced, but that's slow and also technically annoying. I also considered having a journal of tile mappings that we would update immediately, but that adds a lot of complexity. Preventing the allocation nicely solves the problem, and if they can't find a place to allocate in all the mapped tiles, we already have logic to force the ZIL to fall back to txg_wait_synced.
Overall this is a really nice feature! I haven't looked at the code yet, but did kick the tires a little and have some comments/questions.
How did you arrive at the 16GiB min tile size? (Forgive me if this is mentioned in the code comments.) I ask since it would be nice to have a smaller tile size to accommodate smaller vdevs (and give more free space, since it's rounded to tile-sized boundaries).
That way there's no ambiguity about whether the anyraid TLD is a mirror or raidz flavor. It also opens the path to anyraidz1, which was mentioned in the Anyraid announcement: "With ZFS AnyRaid, we will see at least two new layouts added: AnyRaid-Mirror and AnyRaid-Z1. The AnyRaid-Mirror feature will come first, and will allow users to have a pool of more than two disks of varying sizes while ensuring all data is written to two different disks. The AnyRaid-Z1 feature will apply the same concepts of ZFS RAID-Z1, but while supporting mixed size disks." https://hexos.com/blog/introducing-zfs-anyraid-sponsored-by-eshtek
I also noticed that the anyraid TLD names don't include the parity level. They all just say "anyraid":
We should have it match the raidz TLD convention where the parity level is included:
16GiB was selected mostly because that would make the minimum line up with the standard fraction (1/64th) at a 1TiB disk. That's a nice round number, and a pretty reasonable size for "a normal size disk" these days. Anything less than 1TiB is definitely on the smaller side. The other effect of this value is that with this tile size, you can have any disk up to 1PiB in size and still be able to use all the space; any disk that's more than 2^16 tiles can't all be used. It is possible to have smaller tile sizes; we do it in the test suite a bunch. There is a tunable,
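For reference, the arithmetic behind the 1/64th, 16GiB, and 1PiB figures above (a standalone illustration, not the module's code):

```c
#include <stdio.h>
#include <stdint.h>

#define	GiB	(1ULL << 30)
#define	TiB	(1ULL << 40)

int
main(void)
{
	uint64_t smallest = 1 * TiB;
	uint64_t tile = smallest / 64;		/* 1/64th of the smallest disk */
	if (tile < 16 * GiB)
		tile = 16 * GiB;		/* 16GiB floor */
	uint64_t max_usable = (1ULL << 16) * tile;	/* 2^16 tiles per disk */
	printf("tile = %llu GiB, largest fully-usable disk = %llu TiB\n",
	    (unsigned long long)(tile / GiB),
	    (unsigned long long)(max_usable / TiB));	/* 16 GiB, 1024 TiB (1 PiB) */
	return (0);
}
```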
That's fair, we could have a better error message for this case. I can work on that.
I think autoexpand works like normal? It doesn't affect the tile size or anything, because the tile size is locked in immediately when the vdev is created, but it should affect the disk sizes like normal. Maybe the tile capacity doesn't change automatically? But that's probably a bug, if so. Did you run into this in your testing?
Interesting, I will investigate why that happened. Those should be able to mix for sure.
I'm open to new naming options. My vague plan was to use
Good point, I will fix that too.
This is a nit, but in the description I believe there is a typo: "Anyraid works by diving each of the disks that makes up the vdev..." I believe the intent was for dividing.
I prefer the
I'm guessing the first two vdevs report a
static int
log_10(uint64_t v) {
	char buf[32];
	snprintf(buf, sizeof (buf), "%llu", (u_longlong_t)v);
	return (strlen(buf));
}
Could you use `log10()` from math.h?
Technically yes, but I don't actually care about the value of the logarithm. The thing I want is to know how much space the number will take when printed, so I can format properly. I'm going to rename this to reflect that, rather than using the math function and having to do rounding and everything. This could also matter with different LOCALEs in theory.
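For reference, a digit-count helper that avoids both snprintf and floating point; the name here is just illustrative, not necessarily what the rename will end up being:

```c
#include <stdint.h>

/* Number of characters needed to print v in decimal. */
static int
ndigits(uint64_t v)
{
	int n = 1;
	while (v >= 10) {
		v /= 10;
		n++;
	}
	return (n);
}
```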
}
...
zfeature_register(SPA_FEATURE_ANYRAID,
    "com.klarasystems:anyraid", "anyraid", "Support for anyraid VDEV",
I like that the overarching name for anymirror+anyraid is called "anyraid". For this feature flag though, do we need to be more specific and call it "anymirror", and then add another feature flag for the actual raid part follow-on ("anyraid")? Or can we use the same feature flag for both anymirror and the future anyraid?
We should be able to use the same feature flag for both. Pools with anyraidz will fail to import on systems without the relevant logic, but it should fail gracefully with ENOTSUP. Given that, I don't think we need a separate feature flag for the two.
 * Initialize private VDEV specific fields from the nvlist.
 */
static int
vdev_anyraid_init(spa_t *spa, nvlist_t *nv, void **tsd)
Should this be?:
- vdev_anyraid_init(spa_t *spa, nvlist_t *nv, void **tsd)
+ vdev_anyraid_init(spa_t *spa, nvlist_t *nv, vdev_anyraid_t **tsd)
I believe this doesn't work because the `vdev_*_init` functions get used to set the `vdev_op_init` field of the vdev ops, which requires a `vdev_init_func_t`, which is `typedef int vdev_init_func_t(spa_t *spa, nvlist_t *nv, void **tsd);`. You'll note that raidz, for example, also uses a `void **` here.
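A compressed illustration of that constraint, using stand-in types (the real spa_t/nvlist_t come from the ZFS headers, and the function name here is made up); the typed state is simply handed back through the generic pointer:

```c
#include <stdlib.h>
#include <errno.h>

/* Stand-in types, just to show the shape of the interface. */
typedef struct spa spa_t;
typedef struct nvlist nvlist_t;
typedef struct vdev_anyraid { int placeholder; } vdev_anyraid_t;

/* Same shape as the vdev_init_func_t typedef: every vdev type must match it. */
typedef int vdev_init_func_t(spa_t *spa, nvlist_t *nv, void **tsd);

static int
anyraid_init_sketch(spa_t *spa, nvlist_t *nv, void **tsd)
{
	(void) spa; (void) nv;
	vdev_anyraid_t *var = calloc(1, sizeof (*var));
	if (var == NULL)
		return (ENOMEM);
	*tsd = var;	/* typed state stored through the void ** */
	return (0);
}

/* A version taking vdev_anyraid_t ** would not be assignable here. */
static vdev_init_func_t *const init_func = anyraid_init_sketch;
```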
"feature@redaction_list_spill" | ||
"feature@dynamic_gang_header" | ||
"feature@physical_rewrite" | ||
"feature@anyraid" |
"feature@anymirror"?
for type in "mirror" "anymirror1"; do
	log_must zpool create -f $TESTPOOL $type $DISK1 $DISK2
	if [[ "$type" == "anymirror1" ]]; then
		log_must dd if=/dev/urandom of=/$TESTPOOL/f1 bs=1M count=2k
Is this `dd` necessary? 1:52 of the total 1:55 test time is the `dd`:
19:33:02.54 SUCCESS: zpool create -f testpool anymirror1 loop0 loop1
19:34:54.98 SUCCESS: dd if=/dev/urandom of=/testpool/f1 bs=1M count=2k
Yes. Well, maybe. Mostly, anyway.
The way that initialize and trim works with anyraid is a little weird. Because there is no logical-to-physical mapping for a given offset before the backing tile gets allocated, initialize on an empty anyraid device will complete basically instantly. We need to actually allocate all the tiles to get the initialize to take enough time for the test to work.
We can speed this up by not using `/dev/urandom` and using `/dev/zero` instead, but setting compress=off first. And it's possible that a smaller file size could work, though I think I was seeing flakiness with smaller sizes. Really what we want is to be able to just force all the tiles to get allocated, but there's no way to force that right now aside from writing data. Maybe with the new `zhack metaslab leak` functionality we could allocate most of the space in each metaslab and then just write a little bit to each one.
> using /dev/zero instead but setting compress=off first
Alternatively, you could use `file_write -d R ...`. `file_write -d R` is pseudorandom enough not to compress, and much faster:
$ time sudo dd if=/dev/urandom of=testfile bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 3.79406 s, 566 MB/s
real 0m3.973s
user 0m0.005s
sys 0m0.003s
$ time ./tests/zfs-tests/cmd/file_write -o create -f testfile -b $((1024 * 1024)) -c 2048 -d R
real 0m0.646s
user 0m0.057s
sys 0m0.513s
@pcd1193182 I apologize for my tardiness, as I only just found the time to watch the leadership meeting introducing anyraid. This is wrt the throughput concern in the single-writer case, where in (e.g.) a two-disk-mirror anyraid that writer will only ever be writing to two given tiles [1] on two given disks at any one time. I have a potential solution, though I expect everyone will recoil in horror at it.

What seems to be lost in anyraid is that the toplevel of the pool is able to raid-0 stripe writes across all vdevs, whereas in anyraid you are essentially creating a series of mirror vdevs which are hidden from the toplevel stripe. They must be filled completely in sequential order. It's as if you had a rather naive operator adding a series of normal mirror vdevs to a normal pool, taking action only when each one filled completely to capacity (and who never touches old data).

The suggestion: you would pre-generate the mirror/raidZ mappings slightly in advance, say for a complete "round" [2] at a time. You would then internally in the anyraid borrow the toplevel pool's raid-0 logic [3], striping/swizzling writes coming into the anyraid across each set of mirror/raidZ mappings. This would very much suck if you have very unbalanced disk sizes as in [2] below (which has high contention on single disk C and would probably be no better than existing sequential anyraid), but it may be significantly less bad and more balanced for configurations like {4TB, 4TB, 8TB, 8TB} or {4TB, 4TB, 4TB}. In this way, you don't have a series of temporary mirrors (or raidZs), but a series of raid-0 stripes containing balanced sets of overlapping mirrors (or raidZs) with more parallel writing that is more balanced across more disks.

Admittedly, the idea of inserting raid-0 vdevs into the middle of the stack is an extremely unpleasant one. That said, with proper implementation of the mirror or raidz logic by the anyraid, my guess(?) is that it's no more hazardous than the existing architecture of anyraid or a normal zpool. The conditions for failure of the array seem the same (to be verified). I expect there will probably be a few funny quirks though.

[1] 64GB regions, allocated from available disks by anyraid
No worries, thanks for your feedback!
So, a "round" is not a thing that we have any real concept of inside of anyraid. It's one of those things that's easy to think about in specific cases in your head, but hard to define in the general case in the code. The simplest definition is probably "a round is N allocations that result in at least one allocation to every disk". Which works fine, except that with extreme vdev layouts this can result in you allocating a lot of tiles. Consider a pair of 10T disks and three 1T disks. No tiles will be allocated on the 1T disks until the 10T disks are most of the way full. "Surely, Paul, this is just a result of the choice of the algorithm for selecting tiles. Wouldn't a different algorithm fix this problem?" you might justifiably ask. Let's say we go by capacity %age first, and then by raw amount of free space. This does result in every disk getting a tile allocated early on... which can result in less efficient space usage in some extreme scenarios (a disk with 10 tiles and 10 disks with one tile should be able to store 10 tiles in parity=1, but with this algorithm stores only 6). Even for more normal cases, like 10T and 1T disks, sure the first round will be just a quick set of tiles from each disk. But the second set will go back to being several from the 10T disks before there are any more from the 1T disks. Now, it is probably not worth optimizing for extreme layouts when in practice, not many disks are only going to store a few tiles. But even with a "perfect" selection algorithm, though, this only goes so far, as we will discuss. The best we can do, unfortunately, is smooth things out a little bit.
To dive into the internals a bit here, there is no "raid-0 logic" in the vdev code. There is no raid-0 top-level vdev to borrow from, because there are no raid-0 top level vdevs. The top-level vdevs are actually combined together in the metaslab code, where we have the rotors in the metaslab classes that are used to move allocations between top level vdevs, and the allocation throttles and queues that control how much goes to each one. This is where we'd want to hook in probably.
It's worth noting that because of the way that allocators work in ZFS, you don't actually just write to a single tile at a time. Different metaslabs get grabbed by different allocators, and those can and will be on different tiles (especially if we increase the size of metaslabs, as we've been discussing for quite a while now and it seems high time for). Writes are distributed across those roughly evenly, so writes will spread to different tiles. And so if you have relatively even disk sizes, everything is basically already going to work out for you. Tiles will mostly spread out nicely, you'll be distributing writes across your disks, and later reads will do the same. The problem mostly only appears with very different disk sizes, and this is where we run into the fundamental problem that underlies all of this: if a disk has 10x as much space as another, we need to send 10x as many writes to it to fill it. So unless that disk is 10x as fast as the other one, it will be the performance bottleneck. We can smooth things out a little by trying to write a little bit to all the other disks while we write to that disk, but all that does is remove the lumps from the performance curve. Which is not a bad thing to do! But no matter how you slice it, if you have disks with very different sizes, you are going to have performance problems where the big disks bottleneck you.
Especially when you take into account that the raid-0 code (to the extent that it exists) lives at the metaslab level. This is ultimately why I didn't implement something like this. It does have a benefit, and would probably result in improved user experience, but the code when I sketched it was very unpleasant. And one nice thing about this idea and anyraid is that it doesn't have to be part of the initial implementation. No part of the proposed idea depends on anything about the ondisk format or the fundamentals of the design, which means that it can be iterated on. If performance proves to be a problem for end users, there can be another patch that tries out this idea, or a related one. But by keeping this PR smaller and more focused on just the vdev architecture itself, we can keep the review process more focused and hopefully get things integrated more efficiently. The initial goal for anyraid is correctness and space maximization. Performance is secondary, but it is also separable, and can be iterated on later or by others who are focused on it.
Unfortunately, that idea is basically this idea, because of the details of ZFS's internals :)
I suppose one could cap the maximum number of allocations opened at once (at the cost of reduced balance) but yeah the rest of this pretty much falls apart from there. Especially since I was under the impression the toplevel raid-0 metaphor was a lot more literal than it clearly actually is.
Separability is a wonderful thing. :)
Sponsored by: Eshtek, creators of HexOS; Klara, Inc.
Motivation and Context
For industry/commercial use cases, the existing redundancy solutions in ZFS (mirrors and RAIDZ) work great. They provide high performance, reliable, efficient storage options. For enthusiast users, however, they have a drawback. RAIDZ and mirrors will use the size of the smallest drive that is part of the vdev as the size of every drive, so they can provide their reliability guarantees. If you can afford to buy a new box of drives for your pool, like large-scale enterprise users, that's fine. But if you already have a mix of hard drives, of various sizes, and you want to use all of the space they have available while still benefiting from ZFS's reliability and featureset, there isn't currently a great solution for that problem.
Description
The goal of Anyraid is to fill that niche. Anyraid allows devices of mismatched sizes to be combined together into a single top-level vdev. In the current version, Anyraid only supports mirror-type parity, but raidz-type parity is planned for the near future.
Anyraid works by dividing each of the disks that makes up the vdev into tiles. These tiles are the same size across all disks within a given anyraid vdev. The size of a tile is 1/64th of the size of the smallest disk present at creation time, or 16GiB, whichever is larger. These tiles are then combined to form the logical vdev that anyraid presents, with sets of tiles from different disks acting as mini-mirrors, allowing the reliability guarantees to be preserved. Tiles are allocated on demand; when a write comes into a part of the logical vdev that doesn't have backing tiles yet, the Anyraid logic picks the `nparity + 1` disks with the most unallocated tiles and allocates one tile from each of them. These physical tiles are combined together into one logical tile, which is used to store data for that section of the logical vdev.

One important note with this design is that we need to understand this mapping from logical offset to tiles (and therefore to actual physical disk locations) in order to read anything from the pool. As a result, we cannot store the mapping in the MOS, since that would result in a bootstrap problem. To solve this issue, we allocate a region at the start of each disk where we store the Anyraid tile map. The tile map is made up of 4 copies of all the data necessary to reconstruct it. These copies are updated in rotating order, like uberblocks. In addition, each disk has a full copy of all 4 maps, ensuring that as long as any drive's copy survives, the tile map for a given TXG can be read successfully. The size of one copy of the tile map is 64MiB; that size determines the maximum number of tiles an anyraid vdev can have, which is 2^24. This is made up of up to 2^8 disks and up to 2^16 tiles per disk. This does mean that the largest device that can be fully used by an anyraid vdev is 1024 times the size of the smallest disk that was present at vdev creation time. This was considered an acceptable tradeoff, though it is a limit that could be alleviated in the future if needed; the primary difficulty is that either the tile map needs to grow substantially, or logic needs to be added to handle/prevent the tile map filling up.
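As a rough standalone illustration of the `nparity + 1` selection rule described above (the free-tile counts are made up for the example; this is not the vdev code):

```c
#include <stdio.h>

int
main(void)
{
	/* Hypothetical per-disk counts of unallocated tiles. */
	int free_tiles[] = { 120, 40, 40, 7 };
	int ndisks = 4, nparity = 1, picked[4] = { 0 };

	/* One logical tile = one physical tile from each of nparity+1 disks. */
	for (int copy = 0; copy < nparity + 1; copy++) {
		int best = -1;
		for (int d = 0; d < ndisks; d++) {
			if (picked[d])
				continue;	/* copies must land on distinct disks */
			if (best == -1 || free_tiles[d] > free_tiles[best])
				best = d;
		}
		picked[best] = 1;
		free_tiles[best]--;	/* allocate one tile from that disk */
		printf("copy %d -> disk %d\n", copy, best);
	}
	return (0);
}
```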
Anyraid vdevs support all the operations that normal vdevs do. They can be resilvered, removed, and scrubbed. They also support expansion; new drives can be attached to the anyraid vdev, and their tiles will be used in future allocations. There is currently no support for rebalancing tiles onto new devices, although that is also planned. VDEV Contraction is also planned for the future.
New ZDB functionality was added to print out information about the anyraid mapping, to aid in debugging and understanding. A number of tests were also added, and ztest support for the new type of vdev was implemented.
How Has This Been Tested?
In addition to the tests added to the test suite and zloop runs, I also ran many manual tests of unusual configurations to verify that the tile layout behaves correctly. There was also some basic performance testing to verify that nothing was obviously wrong. Performance is not the primary design goal of anyraid, however, so in-depth analysis was not performed.