@rrevans rrevans commented Sep 26, 2025

Currently dnode_next_offset_level uses L0 offsets as input and output which:

  1. is hard to read since it maps offsets to blkid and back each call
  2. necessitates dnode_next_block to handle edge cases at limits
  3. obscures loop invariants required for traversal to always terminate
  4. entangles iteration with integer overflow for large objects

This PR updates dnode_next_offset to use lvl, blkid, and index as the iteration position.

Together these three variables point uniquely to an iteration position in some block of an object.

  • lvl and blkid point to a block in the object, and
  • index points to some dnode/BP (if 0 <= index < N)
  • ... or to one past the end (if index == N)
  • ... or to one past the beginning (if index == -1)

Unlike offsets, these:

  • never run out of precision, even for objects with maximum indirection
  • can point at positions with L0 offset < 0 or ≥ 2^64
  • can distinguish past-the-end of one block vs. the start of the next
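
The last point can be illustrated with a small sketch. The helper `pos_to_l0_offset` is hypothetical (it is not part of the patch) and assumes the position names slot `index` of an indirect block at `lvl` >= 1, with `epbs` = indblkshift - 7 since block pointers are 128 bytes. Two distinct positions, past-the-end of one block and the start of the next, collapse to the same L0 offset:

```c
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: L0 byte offset addressed by
 * slot `index` of indirect block `blkid` at level `lvl` (lvl >= 1).
 * Assumes the result fits in 64 bits; overflow is not handled here.
 */
static uint64_t
pos_to_l0_offset(int lvl, uint64_t blkid, int64_t index,
    int datablkshift, int epbs)
{
	/* log2 of the L0 bytes covered by one block pointer at this level. */
	int span = (lvl - 1) * epbs + datablkshift;
	/* Slot number counted across all blocks of this level. */
	uint64_t slot = (blkid << epbs) + (uint64_t)index;

	return (slot << span);
}
```

With datablkshift=12 and epbs=10, `pos_to_l0_offset(2, 0, 1024, 12, 10)` and `pos_to_l0_offset(2, 1, 0, 12, 10)` both evaluate to 0x100000000, so an offset alone cannot tell the search which L2 block it was in.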

After this, dnode_next_offset_level only uses offset as an output to return the resulting offset to the caller of dnode_next_offset.

To search upwards, instead of dnode_next_block, the lvl+1 index is set to the low bits of the blkid plus one to point to the position of the current block's pointer sibling -- or one past the end if it was the last child of that block (and similarly minus one for backwards search).
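
A minimal sketch of that upward step, using a hypothetical `step_up` helper rather than the code in the patch. It assumes `epbs` is log2 of the number of block pointers per indirect block.

```c
#include <stdint.h>

/*
 * Hypothetical helper: move the position from block `*blkid` at some level
 * to its parent. On return `*blkid` is the parent's id and the result is
 * the slot next to the child's pointer in the search direction, which may
 * be -1 (before the first slot) or 1 << epbs (one past the last slot).
 */
static int64_t
step_up(uint64_t *blkid, int epbs, int backwards)
{
	/* Low epbs bits of blkid: the child's slot in its parent. */
	int64_t slot = (int64_t)(*blkid & ((1ULL << epbs) - 1));

	*blkid >>= epbs;
	return (slot + (backwards ? -1 : 1));
}
```

Starting from blkid 1023 with epbs=10, a forward step yields parent block 0 and index 1024, which is one past the end of a 1024-entry block. Starting from blkid 1024, a backward step yields parent block 1 and index -1.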

This PR has three minor effects beyond refactoring:

  1. Upwards search no longer quits as soon as the L0 offset is < 0 or ≥ 2^64

    This is no longer needed since blkid and index can correctly represent positions outside of the normal range of offsets. Removing this condition simplifies the iteration.

    When such a condition occurs, the search will proceed up to maxlvl and terminate with ESRCH.

    There is no effect on the search outcome since objects cannot have offsets ≥ 2^64.

  2. Upwards search no longer spills into the parent's sibling when searching the last (or first) child block.

    This is because index can point at one past the end (or beginning).

    Consider searching a block tree with nlevels == 3 and datablkshift=12 and indblkshift=17.

    • Suppose search starts forward from offset 0xfffff000 at lvl=1
    • This is L1 block 1023, at index 1023 in that block.
    • ... which is the child at index 1023 in L2 block 0
    • If nothing is found, dnode_next_offset_level returns with *offset == 0x100000000
    • ... which is L1 block 1024 at index 0 in that block
    • ... which is the child at index 0 in L2 block 1
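
    The arithmetic in this example can be checked with a short sketch. `block_slot_at` is a hypothetical helper, not a function from the patch; the example values assume epbs = indblkshift - 7 = 10 because block pointers are 128 bytes.

```c
#include <stdint.h>

/* Hypothetical result type: which block at `lvl`, and which slot in it. */
typedef struct block_slot {
	uint64_t blkid;
	uint64_t index;
} block_slot_t;

/*
 * Locate the block pointer covering L0 byte `offset` at indirect level
 * `lvl` (lvl >= 1). Assumes (lvl - 1) * epbs < 64.
 */
static block_slot_t
block_slot_at(uint64_t offset, int lvl, int datablkshift, int epbs)
{
	/* Id of the child block (at lvl - 1) that contains the offset. */
	uint64_t child = (offset >> datablkshift) >> ((lvl - 1) * epbs);
	block_slot_t bs = { child >> epbs, child & ((1ULL << epbs) - 1) };

	return (bs);
}
```

    For offset 0xfffff000 this gives block 1023, index 1023 at lvl=1 and block 0, index 1023 at lvl=2. For offset 0x100000000 it gives block 1024, index 0 at lvl=1 and block 1, index 0 at lvl=2.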

    Before this PR, the search proceeds at L2 block 1 from offset 0.

    • If the result is found in this block, then search continues downward.
    • If not, search goes upwards to L3 block 0 and searches from index 2.

    After this PR, the search proceeds at L2 block 0 at index 1024 (one past its end).

    • This always returns ESRCH since 1024 is greater than the number of BPs.
    • Search then goes upwards to L3 block 0 and searches from index 1.

    This difference doesn't change what is found, but it does eliminate the work to load and search the L2 block 1 if it was never going to match.

    Instead the cached L3 block will point to the correct next block.

    This matters less for hole search (no I/O), but the extra steps are wasteful and unnecessary.

  3. For ESRCH, this restores the logic to return the same *offset as before backtracking.

    For error == 0 and most ESRCH cases, the offset is the same as before #16025 ("dnode_next_offset: backtrack if lower level does not match").

    But for the error == ESRCH case, the result is different for exactly the case above when all subsequent indirect blocks are holes.

    Before #16025, the search would continue from offset 0x100000000:

    • All indirect blocks beyond that offset are holes
    • So for L2 block 1, dnode_hold_impl returns ENOENT
    • Then offset is unmodified, and the result is 0x100000000

    After #16025, the search again continues from offset 0x100000000:

    • After L2 block 1, dnode_next_block updates offset to 0x200000000

    The result differs since dnode_next_block unconditionally adds 1 at each level searching up the tree, while before it was only changed if an indirect block was scanned.

    This difference was observed using ZFS_IOC_NEXT_OBJ.

    • After the last dnode, the result is 1<<45 == 35184372088832.
    • ... unless the starting position is 35149978763231
      • == 0b111111111011111111101111111110111111111011111
    • Because dnode_next_block and dmu_object_next add:
      • 0b100000000010000000001000000000100001
    • ... which yields 2<<45 == 70368744177664
    • But only if all of the 10-bit subwords are == 1023. If not there's no overflow.
    • This is most curious as this means some values above 35149978763231 also return 1<<45.

    The return value from dnode_next_offset on ESRCH does not appear to be used except for:

    • the virtual hole case (which should be unaffected since it deals only in populated blocks)
    • ZFS_IOC_NEXT_OBJ which returns the value to userspace

    This PR restores the ESRCH semantics back to how they were. This happens naturally with index plus one because the search will not spill into the next block during upwards traversal.

    Meanwhile, the value itself is underspecified and of questionable utility.

    • It is the first offset at or after which there are no physically allocated indirect blocks
    • ... unless that offset would be greater than or equal to 2^64
    • ... in which case it is the first offset after which there are no indirect blocks at nlevels-1
    • ... unless that offset would be greater than or equal to 2^64
    • ... in which case it is the first offset after which there are no indirect blocks at nlevels-2
    • ... etc.
    • ... unless all levels down to minlvl have such offsets that would be greater than or equal to 2^64
    • ... in which case the result is simply the initial offset

    Or for backwards search:

    • It is the last offset at or before which there are no physically allocated indirect blocks
    • ... unless the search ends at offset zero
    • ... in which case it is the last offset covered by the indirect block at whatever level offset zero is reached
    • ... which happens because blkid is clamped to zero when searching backwards

    Neither of these seems to be deliberately implemented; they are instead side effects of setting *offset to the larger (or smaller) of the initial offset or the resulting offset, along with the clamp-to-zero behavior.

    For forward search, when the blkid is too large, the shift overflows to zero which means that the initial offset is returned instead.
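
    The overflow can be seen in isolation with a sketch of a guarded shift. `blkid_to_offset` is a hypothetical helper; the guard follows the `(nblk >> (8 * sizeof (*offset) - span)) == 0` test in the hunk under review, and `span` is assumed to be in the range 1..63.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a block id whose blocks each cover
 * 2^span bytes into an L0 byte offset. Returns false, leaving *offset
 * untouched, when the shift would lose high bits. Requires 0 < span < 64.
 */
static bool
blkid_to_offset(uint64_t nblk, int span, uint64_t *offset)
{
	if ((nblk >> (64 - span)) != 0)
		return (false);	/* nblk << span does not fit in 64 bits */
	*offset = nblk << span;
	return (true);
}
```

    An unguarded `128ULL << 57` wraps to 0, and the larger of the starting offset and 0 is just the starting offset, which is the behavior described above.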

    Luckily, the result is never used for backwards search. This PR maintains the same semantics to minimize change.

Future ideas:

  • Remove the ESRCH result so that the initial *offset is returned instead.
  • Implement dnode_next_offset variant that returns blkids natively. Many callers want to iterate over blocks but have to deal with L0 offsets.

Motivation and Context

Code cleanup, readability, and minor changes to edge cases.

Description

Refactored to iterate by blkid instead of offsets.

See above for details of minor changes to edge cases.

How Has This Been Tested?

ztest, ZTS, llseek stressor

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@github-actions github-actions bot added the Status: Work in Progress Not yet ready for general review label Sep 26, 2025
@rrevans rrevans marked this pull request as ready for review September 26, 2025 12:06
@github-actions github-actions bot added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Sep 26, 2025
@behlendorf behlendorf self-requested a review September 26, 2025 22:32
@rrevans rrevans marked this pull request as draft September 29, 2025 12:28
@github-actions github-actions bot added Status: Work in Progress Not yet ready for general review and removed Status: Code Review Needed Ready for review and testing labels Sep 29, 2025

rrevans commented Sep 29, 2025

Converting this back to draft as I've been staring at the offset calculations for the new version and found an oddity in 8a8970e.

That commit works for the case where a match occurs, but it returns a higher than expected offset in the non-matching case when 1) the starting offset points at an indirect hole and 2) the effect of dnode_next_block accumulates and carries the offset into the next block at lvl+2 or higher.

The code prior to the commit above was leaving the offset unchanged when searching up the tree when dnode_next_offset_level encounters sparse indirect blocks. That left the returned offset set to the first L0 offset that has no physically allocated BP, ignoring pointers in indirect blocks whose next sibling would start at an L0 offset >= 2**64, or to the initial offset if no such BP exists.

Meanwhile all callers of dnode_next_offset ignore the offset upon ESRCH result except ZFS_IOC_NEXT_OBJ and the "virtual hole" case in dnode_next_offset (but it should be unaffected by the above as it deals only in populated indirect blocks).

TL;DR I'm going to study this a bit more before proposing the final form of this PR. I think the blkid + index means the next/previous behavior of dnode_next_block is both unnecessary and undesirable, but I'll need a minute to convince myself and write up my findings.

Currently this function uses L0 offsets which:
1. is hard to read since it maps offsets to blkid and back each call
2. necessitates dnode_next_block to handle edge cases at limits
3. makes it hard to tell if the traversal can loop infinitely

Instead, update this and dnode_next_offset to work in (blkid, index).
This way the blkid manipulations are clear, and it's also clear that
the traversal always terminates since blkid goes one direction.

I've also considered updating dnode_next_offset to operate on blkid.
Callers use both patterns, so maybe another PR can split the cases?

While here tidy up dnode_next_offset_level comments.

Signed-off-by: Robert Evans <[email protected]>
@rrevans rrevans marked this pull request as ready for review October 9, 2025 05:26
@github-actions github-actions bot added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Oct 9, 2025

rrevans commented Oct 9, 2025

After much staring, this is ready for review.

See the top comment for the full analysis. TL;DR: iterating by (blkid, index) is clearer and simpler, and it also helps uncover and address rough edges around offset handling.

PTAL @behlendorf when you get a chance; thanks in advance.

*/
index = BF64_GET(blkid, 0, epbs) +
    ((flags & DNODE_FIND_BACKWARDS) ? -1 : 1);
blkid = blkid >> epbs;
Member

As I understand, when searching backwards, once it reaches blkid == 0, this will start climbing levels until lvl hits maxlvl. Previous code exited earlier once dnode_next_block() saw DNODE_FIND_BACKWARDS and blkid == 0.

Contributor Author

Indeed. After this PR search always ends at maxlvl (error == ESRCH) or minlvl (error == 0).

The previous code had to break to prevent a loop for all the cases where *offset ends up the same at the higher level. Now that's avoided directly, and the loop conditions are simpler.

*index = i;
if (span < 8 * sizeof (*offset)) {
	uint64_t nblk = blkid << epbs;
	if (i >= 0 || blkid != 0)
Member

I guess if i < 0, then we return ESRCH and offset does not matter. Am I wrong?

Contributor Author

You're not wrong. If i < 0 then dnode_next_offset_level returns ESRCH, and dnode_next_offset doesn't use the offset to find the next block after this PR.

However, some offset is returned even when the entire search ends with ESRCH from dnode_next_offset. That final offset depends on which blocks exist in the tree, since each level updates the offset only for blocks that exist, and only if the new value is greater (or, for backwards search, smaller) than the last (or initial) offset.

None of the current kernel callers use the ESRCH result from dnode_next_offset as far as I can tell (especially for backwards search). That said, I'm preserving the behavior for now for the sake of a smaller PR.

I'm open to feedback if you think we want to do more here. The returned offset in the ESRCH case is pretty dubious overall.

Comment on lines +2615 to +2619
if ((nblk >> (8 * sizeof (*offset) - span)) == 0)
	*offset = (flags & DNODE_FIND_BACKWARDS) ?
	    /* backwards: position offset at the end */
	    MIN(*offset, ((nblk + 1) << span) - 1) :
	    MAX(*offset, nblk << span);
Member

@amotin amotin Oct 9, 2025

If we are searching for a hole forward from the last block of a file with size close to 2^^64, and this level of indirection does not end at 2^^64, I suppose the code above will produce error == 0, but offset will not be updated here due to overflow. Won't it look like the hole starts at the current offset? Shouldn't we return ESRCH or do something else in this case?

Contributor Author

Yes, you're right. This is also a problem before this PR (and before 16025). I can write a separate PR to address it.

This is the case where there are no holes from the starting offset up to the end of the object, the object ends at 2^64 - 1, and there's an unallocated BP at L0 offset 2^64 in the indirect block at dn_nlevels or dn_nlevels - 1.

For example, this happens with datablkshift=17, indblkshift=17, at L5 block 0, which covers [0, 2^67). Only BPs at indices [0,127] in that block ever get used, each one covering 2^57 L0 offsets. In your scenario, we're at i == 128.

That case indeed should do something else, but note that this has always been broken.

Before, this would compute *offset << span; the shift would overflow to zero since 128 << 57 == 2^64, and the result would be the starting offset for that level despite error == 0.

The fix I've got in mind for this is to limit epb to 1ULL << (64 - span) in the loop so that we don't search the indices with L0 offsets that would overflow. Then the i >= epb test causes ESRCH, etc etc. I can write another PR for that.
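
A rough sketch of that idea, not the eventual fix: `usable_epb` is a hypothetical helper that clamps the number of slots searched so that no slot starts at an L0 offset of 2^64 or more. It assumes `span` is log2 of the bytes covered by one block pointer at this level, and it only considers block 0 of the level, as in the example above.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of slots in block 0 of a level whose L0
 * offsets are below 2^64. `epbs` is log2 of the slots per block and
 * `span` is log2 of the bytes covered by one slot; requires 0 < span < 64.
 */
static uint64_t
usable_epb(int epbs, int span)
{
	uint64_t epb = 1ULL << epbs;
	uint64_t limit = 1ULL << (64 - span);

	return (epb < limit ? epb : limit);
}
```

For the datablkshift=17, indblkshift=17 case, the slots of L5 block 0 have span 57, so `usable_epb(10, 57)` is 128 and index 128 would fail the `i >= epb` test.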

Meanwhile, I think this has no practical effect since VFS layers don't allow files >= 2^63 bytes long because offsets are signed.

Note this case arises when all of the physical BPs are allocated (virtual hole case).
