-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Update dnode_next_offset_level to accept blkid instead of offset #17792
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Converting this back to draft as I've been staring at the offset calculations for the new version and found an oddity in 8a8970e. That commit works for the case where a match occurs, but it returns a higher than expected offset in the non-matching case when 1) the starting offset points at an indirect hole and 2) the effect of The code prior to the commit above was leaving the offset unchanged when searching up the tree when Meanwhile all callers of TL;DR I'm going to study this a bit more before proposing the final form of this PR. I think the blkid + index means the next/previous behavior of |
Currently this function uses L0 offsets which: 1. is hard to read since it maps offsets to blkid and back each call 2. necessitates dnode_next_block to handle edge cases at limits 3. makes it hard to tell if the traversal can loop infinitely Instead, update this and dnode_next_offset to work in (blkid, index). This way the blkid manipulations are clear, and it's also clear that the traversal always terminates since blkid goes one direction. I've also considered updating dnode_next_offset to operate on blkid. Callers use both patterns, so maybe another PR can split the cases? While here tidy up dnode_next_offset_level comments. Signed-off-by: Robert Evans <[email protected]>
After much staring, this is ready for review. See the top comment for the full analysis. TL;DR: iterating by (blkid, index) is clearer, simpler; and also helps uncover and address rough edges around offset handling PTAL @behlendorf when you get a chance; thanks in advance. |
*/ | ||
index = BF64_GET(blkid, 0, epbs) + | ||
((flags & DNODE_FIND_BACKWARDS) ? -1 : 1); | ||
blkid = blkid >> epbs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand, when searching backwards, once it reach blkid == 0
, this will start climbing levels until lvl
hit maxlvl
. Previous code exited earlier once dnode_next_block()
saw DNODE_FIND_BACKWARDS
and blkid == 0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. After this PR search always ends at maxlvl (error == ESRCH
) or minlvl (error == 0).
The previous code had to break to prevent a loop for all the cases where *offset
ends up the same at the higher level. Now that's avoided directly, and the loop conditions are simpler.
*index = i; | ||
if (span < 8 * sizeof (*offset)) { | ||
uint64_t nblk = blkid << epbs; | ||
if (i >= 0 || blkid != 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess if i < 0
, then we return ESRCH
and offset does not matter. Am I wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're not wrong. If i < 0
then dnode_next_offset_level
returns ESRCH
, and dnode_next_offset
doesn't use the offset to find the next block after this PR.
However there is some offset returned even when the entire search ends with ESRCH
from dnode_next_offset
. That final offset depends on which blocks exist in the tree since each level updates the offset only for blocks that exist with some value if greater or smaller than the last (or initial) offset.
None of the current kernel callers use the ESRCH
result from dnode_next_offset
as far as I can tell (especially for backwards search). That said, I'm preserving the behavior for now for the sake of a smaller PR.
I'm open to feedback if you think we want to do more here. The returned offset in the ESRCH
case is pretty dubious overall.
if ((nblk >> (8 * sizeof (*offset) - span)) == 0) | ||
*offset = (flags & DNODE_FIND_BACKWARDS) ? | ||
/* backwards: position offset at the end */ | ||
MIN(*offset, ((nblk + 1) << span) - 1) : | ||
MAX(*offset, nblk << span); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are searching for a hole forward from the last block of a file with size close to 2^^64, and this level of indirection does not end at 2^^64, I suppose the code above will produce error == 0
, but offset
will not be updated here due to overflow. Won't it look like the hole starts at the current offset? Shouldn't we return ESRCH
or do something else in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you're right. This is also a problem before this PR (and before 16025). I can write a separate PR to address it.
This is the case where there are no holes from the starting offset up to the end of the object, the object ends at 264 - 1, and there's an unallocated BP at L0 offset 264 in the indirect block at dn_nlevels
or dn_nlevels - 1
.
f.e. This happens with datablkshift=17, indblkshift=17, at L5 block 0 which covers [0, 267). Only BPs at indices [0,127] in that block ever get used each one covering 257 L0 offsets. In your scenario, we're at i == 128
.
That case indeed should do something else, but note that this has always been broken.
Before this would compute *offset << span
, shift would overflow to zero as 128 << 57
== 264, and the result would be start
for that level despite error == 0
.
The fix I've got in mind for this is to limit epb
to 1ULL << (64 - span)
in the loop so that we don't search the indices with L0 offsets that would overflow. Then the i >= epb
test causes ESRCH
, etc etc. I can write another PR for that.
Meanwhile I think in practice this does not have any practical effect since VFS layers don't allow files >= 263 bytes long because signed offsets.
Note this case arises when all of the physical BPs are allocated (virtual hole case).
Currently
dnode_next_offset_level
uses L0 offsets as input and output which:dnode_next_block
to handle edge cases at limitsThis PR updates
dnode_next_offset
to uselvl
,blkid
, andindex
as the iteration position.Together these three variables point uniquely to an iteration position in some block of an object.
lvl
andblkid
point to a block in the object, andindex
points to some dnode/BP (if0 <= index < N
)index == N
)index == -1
)Unlike offsets, these:
After this,
dnode_next_offset_level
only usesoffset
as an output to return the resulting offset to the caller ofdnode_next_offset
.To search upwards, instead of
dnode_next_block
, the lvl+1index
is set to the low bits of theblkid
plus one to point to the position of the current block's pointer sibling -- or one past the end if it was the last child of that block (and similarly minus one for backwards search).This PR has three minor effects beyond refactoring:
Upwards search no longer quits as soon as the L0 offset is < 0 or ≥ 264
This is no longer needed since
blkid
andindex
can correctly represent positions outside of the normal range of offsets. Removing this condition simplifies the iteration.When such a condition occurs, the search will proceed up to
maxlvl
and terminate withESRCH
.There is no effect on the search outcome since objects cannot have offsets ≥ 264.
Upwards search no longer spills into the parent's sibling when searching the last (or first) child block.
This is because
index
can point at one past the end (or beginning).Consider searching a block tree with nlevels == 3 and datablkshift=12 and indblkshift=17.
dnode_next_offset_level
returns with*offset == 0x100000000
Before this PR, the search proceeds at L2 block 1 from offset 0.
After this PR, the search proceeds at L2 block 0 at index 1024 (one past its end).
This difference doesn't change what is found, but it does eliminate the work to load and search the L2 block 1 if it was never going to match.
Instead the cached L3 block will point to the correct next block.
This matters less for hole search (no I/O), but the extra steps are wasteful and unnecessary.
For
ESRCH
, this restores the logic to return the same *offset as before backtracking.For
error == 0
and mostESRCH
cases, the offset is the same as before dnode_next_offset: backtrack if lower level does not match #16025.But for
error == ESRCH
case, the result is different for exactly the case above when all subsequent indirect blocks are holes.Before, the search would continue from offset
0x100000000
:dnode_hold_impl
returns ENOENT0x100000000
After, the search again continues from offset
0x100000000
:dnode_next_block
updates offset to0x200000000
The result differs since
dnode_next_block
unconditionally adds 1 at each level searching up the tree, while before it was only changed if an indirect block was scanned.This difference was observed using
ZFS_IOC_NEXT_OBJ
.1<<45 == 35184372088832
.35149978763231
== 0b111111111011111111101111111110111111111011111
dnode_next_block
anddmu_object_next
add:0b100000000010000000001000000000100001
2<<45 == 70368744177664
1<<45
.The return value from
dnode_next_offset
onESRCH
does not appear to be used except for:virtual hole
case (which should be unaffected since it deals only in populated blocks)ZFS_IOC_NEXT_OBJ
which returns the value to userspaceThis PR restores the
ESRCH
semantics back to how they were. This happens naturally withindex
plus one because the search will not spill into the next block during upwards traversal.Meanwhile, the value itself is underspecified and of questionable utility.
minlvl
have such offsets that would be greater than or equal 264Or for backwards search:
blkid
is clamped to zero when searching backwardsNeither of these seem to be deliberately implemented; they are instead side-effects of setting *offset to the larger (or smaller) of the initial offset or the resulting offset along with the the clamp to zero behavior.
For forward search, when the blkid is too large, the shift overflows to zero which means that the initial offset is returned instead.
Luckily, the result is never used for backwards search. This PR maintains the same semantics to minimize change.
Future ideas:
ESRCH
result so that the initial*offset
is returned instead.dnode_next_offset
variant that returns blkids natively. Many callers want to iterate over blocks but have to deal with L0 offsets.Motivation and Context
Code cleanup, readability, and minor changes to edge cases.
Description
Refactored to iterate by blkid instead of offsets.
See above for details of minor changes to edge cases.
How Has This Been Tested?
ztest, ZTS, llseek stressor
Types of changes
Checklist:
Signed-off-by
.