Update dnode_next_offset_level to accept blkid instead of offset #17792

rrevans · 2025-09-26T04:18:30Z

Currently dnode_next_offset_level uses L0 offsets as input and output which:

- is hard to read since it maps offsets to blkid and back each call
- necessitates dnode_next_block to handle edge cases at limits
- obscures loop invariants required for traversal to always terminate
- entangles iteration with integer overflow for large objects

This PR updates dnode_next_offset to use lvl, blkid, and index as the iteration position.

Together these three variables point uniquely to an iteration position in some block of an object.

lvl and blkid point to a block in the object, and
index points to some dnode/BP (if 0 <= index < N)
... or to one past the end (if index == N)
... or to one before the beginning (if index == -1)

Unlike offsets, these:

- never run out of precision, even for objects with maximum indirection
- can point at positions with L0 offset < 0 or ≥ 2⁶⁴
- can distinguish past-the-end of one block vs. the start of the next

After this, dnode_next_offset_level only uses offset as an output to return the resulting offset to the caller of dnode_next_offset.

To search upwards, instead of dnode_next_block, the lvl+1 index is set to the low bits of the blkid plus one to point to the position of the current block's pointer sibling -- or one past the end if it was the last child of that block (and similarly minus one for backwards search).
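
The upward step described above can be sketched as a standalone function. This is only an illustration: step_up is a hypothetical helper name, and a plain mask stands in for the BF64_GET(blkid, 0, epbs) call used in the actual change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Move from block *blkid at some level to its parent one level up.
 * Returns the index in the parent at which to resume searching: the
 * sibling after (or before) the pointer to the block we just left.
 * The result may be one past the end (1 << epbs) or -1.
 */
static int64_t
step_up(uint64_t *blkid, int epbs, bool backwards)
{
	assert(epbs > 0 && epbs < 63);

	uint64_t mask = (1ULL << epbs) - 1;
	int64_t index = (int64_t)(*blkid & mask) + (backwards ? -1 : 1);

	/* The parent's blkid is the high bits of the child's blkid. */
	*blkid >>= epbs;
	return (index);
}
```

Because the returned index can sit outside [0, N), the caller never needs to wrap into a neighboring parent block; a search at that index simply finds nothing and climbs again.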

This PR has three minor effects beyond refactoring:

First, upwards search no longer quits as soon as the L0 offset is < 0 or ≥ 2⁶⁴.

This is no longer needed since blkid and index can correctly represent positions outside of the normal range of offsets. Removing this condition simplifies the iteration.

When such a condition occurs, the search will proceed up to maxlvl and terminate with ESRCH.

There is no effect on the search outcome since objects cannot have offsets ≥ 2⁶⁴.

Second, upwards search no longer spills into the parent's sibling when searching the last (or first) child block.

This is because index can point at one past the end (or beginning).

Consider searching a block tree with nlevels == 3 and datablkshift=12 and indblkshift=17.
- Suppose search starts forward from offset 0xfffff000 at lvl=1
- This is L1 block 1023, at index 1023 in that block.
- ... which is the child at index 1023 in L2 block 0
- If nothing is found, dnode_next_offset_level returns with *offset == 0x100000000
- ... which is L1 block 1024 at index 0 in that block
- ... which is the child at index 0 in L2 block 1
Before this PR, the search proceeds at L2 block 1 from offset 0.
- If the result is found in this block, then search continues downward.
- If not, search goes upwards to L3 block 0 and searches from index 2.
After this PR, the search proceeds at L2 block 0 at index 1024 (one past its end).
- This always returns ESRCH since index 1024 is past the last BP (valid indices are 0 to 1023).
- Search then goes upwards to L3 block 0 and searches from index 1.
This difference doesn't change what is found, but it does eliminate the work to load and search the L2 block 1 if it was never going to match.

Instead the cached L3 block will point to the correct next block.

This matters less for hole search (no I/O), but the extra steps are wasteful and unnecessary.
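
To make the example above concrete, the following sketch contrasts the two ways of picking the L2 resume position, using the same shifts (datablkshift=12, and epbs=10 from indblkshift=17 with 128-byte block pointers). The type and function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { DATABLKSHIFT = 12, EPBS = 10 };

typedef struct l2_pos {
	uint64_t	blkid;	/* which L2 block */
	int64_t		index;	/* which entry in that block */
} l2_pos_t;

/* Offset-based: find the L2 entry that covers a given L0 offset. */
static l2_pos_t
l2_from_offset(uint64_t offset)
{
	uint64_t l1_blkid = offset >> (DATABLKSHIFT + EPBS);
	l2_pos_t p = {
		l1_blkid >> EPBS,
		(int64_t)(l1_blkid & ((1ULL << EPBS) - 1))
	};
	return (p);
}

/* blkid-based: the entry just after L1 block l1_blkid, same L2 block. */
static l2_pos_t
l2_after_l1(uint64_t l1_blkid)
{
	l2_pos_t p = {
		l1_blkid >> EPBS,
		(int64_t)(l1_blkid & ((1ULL << EPBS) - 1)) + 1
	};
	/* The index may equal the entry count: one past the end. */
	assert(p.index >= 1 && p.index <= (1LL << EPBS));
	return (p);
}
```

Here l2_from_offset(0x100000000) lands on L2 block 1 at index 0, while l2_after_l1(1023) stays in L2 block 0 at index 1024, so the neighboring L2 block is never loaded.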
Third, for ESRCH, this restores the logic to return the same *offset as before backtracking.

For error == 0 and most ESRCH cases, the offset is the same as it was before #16025 (dnode_next_offset: backtrack if lower level does not match).

But for the error == ESRCH case, the result is different for exactly the case above when all subsequent indirect blocks are holes.

Before #16025, the search would continue from offset 0x100000000:
- All indirect blocks beyond that offset are holes
- So for L2 block 1, dnode_hold_impl returns ENOENT
- Then offset is unmodified, and the result is 0x100000000
After #16025, the search again continues from offset 0x100000000:
- After L2 block 1, so dnode_next_block updates offset to 0x200000000
The result differs since dnode_next_block unconditionally adds 1 at each level searching up the tree, while before it was only changed if an indirect block was scanned.

This difference was observed using ZFS_IOC_NEXT_OBJ.
- After the last dnode, the result is 1<<45 == 35184372088832.
- ... unless the starting position is 35149978763231
  - == 0b111111111011111111101111111110111111111011111
- Because dnode_next_block and dmu_object_next add:
  - 0b100000000010000000001000000000100001
- ... which yields 2<<45 == 70368744177664
- But only if all of the 10-bit subwords are == 1023. If not there's no overflow.
- This is most curious as this means some values above 35149978763231 also return 1<<45.
The return value from dnode_next_offset on ESRCH does not appear to be used except for:
- the virtual hole case (which should be unaffected since it deals only in populated blocks)
- ZFS_IOC_NEXT_OBJ which returns the value to userspace
This PR restores the ESRCH semantics back to how they were. This happens naturally with index plus one because the search will not spill into the next block during upwards traversal.

Meanwhile, the value itself is underspecified and of questionable utility.
- It is the first offset at or after which there are no physically allocated indirect blocks
- ... unless that offset would be greater than or equal to 2⁶⁴
- ... in which case it is the first offset after which there are no indirect blocks at nlevels-1
- ... unless that offset would be greater than or equal to 2⁶⁴
- ... in which case it is the first offset after which there are no indirect blocks at nlevels-2
- ... etc.
- ... unless all levels down to minlvl have such offsets that would be greater than or equal to 2⁶⁴
- ... in which case the result is simply the initial offset
Or for backwards search:
- It is the last offset at or before which there are no physically allocated indirect blocks
- ... unless the search ends at offset zero
- ... in which case it is the last offset covered by the indirect block at whatever level offset zero is reached
- ... which happens because blkid is clamped to zero when searching backwards
Neither of these seems to be deliberately implemented; they are instead side effects of setting *offset to the larger (or smaller) of the initial offset or the resulting offset, along with the clamp-to-zero behavior.

For forward search, when the blkid is too large, the shift overflows to zero which means that the initial offset is returned instead.

Luckily, the result is never used for backwards search. This PR maintains the same semantics to minimize change.

Future ideas:

Remove the ESRCH result so that the initial *offset is returned instead.
Implement dnode_next_offset variant that returns blkids natively. Many callers want to iterate over blocks but have to deal with L0 offsets.

Motivation and Context

Code cleanup, readability, and minor changes to edge cases.

Description

Refactored to iterate by blkid instead of offsets.

See above for details of minor changes to edge cases.

How Has This Been Tested?

ztest, ZTS, llseek stressor

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Quality assurance (non-breaking change which makes the code more robust against bugs)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

rrevans · 2025-09-29T12:54:40Z

Converting this back to draft as I've been staring at the offset calculations for the new version and found an oddity in 8a8970e.

That commit works for the case where a match occurs, but it returns a higher than expected offset in the non-matching case when 1) the starting offset points at an indirect hole and 2) the effect of dnode_next_block accumulates and carries the offset into the next block at lvl+2 or higher.

The code prior to the commit above was leaving the offset unchanged when searching up the tree when dnode_next_offset_level encounters sparse indirect blocks. That left the returned offset set to the first L0 offset that has no physically allocated BP (ignoring pointers in indirect blocks whose next sibling would start at an L0 offset >= 2**64), or to the initial offset if no such BP exists.

Meanwhile all callers of dnode_next_offset ignore the offset upon ESRCH result except ZFS_IOC_NEXT_OBJ and the "virtual hole" case in dnode_next_offset (but it should be unaffected by the above as it deals only in populated indirect blocks).

TL;DR I'm going to study this a bit more before proposing the final form of this PR. I think the blkid + index means the next/previous behavior of dnode_next_block is both unnecessary and undesirable, but I'll need a minute to convince myself and write up my findings.

Currently this function uses L0 offsets which:

1. is hard to read since it maps offsets to blkid and back each call
2. necessitates dnode_next_block to handle edge cases at limits
3. makes it hard to tell if the traversal can loop infinitely

Instead, update this and dnode_next_offset to work in (blkid, index). This way the blkid manipulations are clear, and it's also clear that the traversal always terminates since blkid goes one direction.

I've also considered updating dnode_next_offset to operate on blkid. Callers use both patterns, so maybe another PR can split the cases?

While here tidy up dnode_next_offset_level comments.

Signed-off-by: Robert Evans <[email protected]>

rrevans · 2025-10-09T05:34:59Z

After much staring, this is ready for review.

See the top comment for the full analysis. TL;DR: iterating by (blkid, index) is clearer and simpler, and it also helps uncover and address rough edges around offset handling.

PTAL @behlendorf when you get a chance; thanks in advance.

amotin · 2025-10-09T18:44:12Z

module/zfs/dnode.c

 			 */
+			index = BF64_GET(blkid, 0, epbs) +
+			    ((flags & DNODE_FIND_BACKWARDS) ? -1 : 1);
+			blkid = blkid >> epbs;


As I understand, when searching backwards, once it reaches blkid == 0, this will start climbing levels until lvl hits maxlvl. Previous code exited earlier once dnode_next_block() saw DNODE_FIND_BACKWARDS and blkid == 0.

rrevans · 2025-10-10

Indeed. After this PR, the search always ends at maxlvl (error == ESRCH) or minlvl (error == 0).

The previous code had to break to prevent a loop for all the cases where *offset ends up the same at the higher level. Now that's avoided directly, and the loop conditions are simpler.

amotin · 2025-10-09T19:20:01Z

module/zfs/dnode.c

+		*index = i;
+		if (span < 8 * sizeof (*offset)) {
+			uint64_t nblk = blkid << epbs;
+			if (i >= 0 || blkid != 0)


I guess if i < 0, then we return ESRCH and offset does not matter. Am I wrong?

rrevans · 2025-10-10

You're not wrong. If i < 0 then dnode_next_offset_level returns ESRCH, and dnode_next_offset doesn't use the offset to find the next block after this PR.

However, dnode_next_offset still returns some offset even when the entire search ends with ESRCH. That final offset depends on which blocks exist in the tree, since each level updates the offset only for blocks that exist, and only when the new value is greater (or, when searching backwards, smaller) than the last (or initial) offset.

None of the current kernel callers use the ESRCH result from dnode_next_offset as far as I can tell (especially for backwards search). That said, I'm preserving the behavior for now for the sake of a smaller PR.

I'm open to feedback if you think we want to do more here. The returned offset in the ESRCH case is pretty dubious overall.

amotin · 2025-10-09T19:35:09Z

module/zfs/dnode.c

+			if ((nblk >> (8 * sizeof (*offset) - span)) == 0)
+				*offset = (flags & DNODE_FIND_BACKWARDS) ?
+				    /* backwards: position offset at the end */
+				    MIN(*offset, ((nblk + 1) << span) - 1) :
+				    MAX(*offset, nblk << span);


If we are searching for a hole forward from the last block of a file with size close to 2⁶⁴, and this level of indirection does not end at 2⁶⁴, I suppose the code above will produce error == 0, but offset will not be updated here due to overflow. Won't it look like the hole starts at the current offset? Shouldn't we return ESRCH or do something else in this case?

rrevans · 2025-10-10

Yes, you're right. This is also a problem before this PR (and before #16025). I can write a separate PR to address it.

This is the case where there are no holes from the starting offset up to the end of the object, the object ends at 2⁶⁴ - 1, and there's an unallocated BP at L0 offset 2⁶⁴ in the indirect block at dn_nlevels or dn_nlevels - 1.

For example, this happens with datablkshift=17, indblkshift=17, at L5 block 0, which covers [0, 2⁶⁷). Only the BPs at indices [0,127] in that block ever get used, each one covering 2⁵⁷ L0 offsets. In your scenario, we're at i == 128.
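
The numbers in this example can be checked with a small sketch. The helpers bp_span and usable_bps are hypothetical names for illustration only; usable_bps counts entries of block 0 at a level whose starting L0 offset still fits in 64 bits, which is roughly the limit a later fix might apply to the search loop.

```c
#include <assert.h>
#include <stdint.h>

/* Bits of L0 offset covered by one BP in a block at level lvl >= 1. */
static int
bp_span(int datablkshift, int epbs, int lvl)
{
	assert(lvl >= 1);
	return (datablkshift + (lvl - 1) * epbs);
}

/*
 * Number of BPs in block 0 at level lvl whose starting L0 offset is
 * below 2^64. Entries beyond this count cannot be expressed as offsets.
 */
static uint64_t
usable_bps(int datablkshift, int epbs, int lvl)
{
	int span = bp_span(datablkshift, epbs, lvl);

	assert(epbs > 0 && epbs < 63);
	if (span >= 64)
		return (1);		/* only index 0 (offset 0) fits */
	if (64 - span >= epbs)
		return (1ULL << epbs);	/* every entry fits */
	return (1ULL << (64 - span));
}
```

For datablkshift=17 and epbs=10 (indblkshift=17), bp_span at lvl=5 is 57 and usable_bps is 128, matching indices [0,127] above.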

That case indeed should do something else, but note that this has always been broken.

Before this PR, the code would compute *offset << span; the shift would overflow to zero since 128 << 57 == 2⁶⁴, and the result would be start for that level despite error == 0.
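
A minimal sketch of the wraparound and of the style of guard shown in the diff above (the helper name shift_to_offset is hypothetical, and 64 is written out in place of 8 * sizeof (*offset)):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Compute nblk << span into *out. Returns false, leaving *out
 * untouched, when the shift would drop nonzero high bits.
 */
static bool
shift_to_offset(uint64_t nblk, int span, uint64_t *out)
{
	assert(span > 0 && span < 64);

	/* Any bit set in the top span bits would be shifted out. */
	if ((nblk >> (64 - span)) != 0)
		return (false);

	*out = nblk << span;
	return (true);
}
```

With span == 57, nblk == 127 is accepted, while nblk == 128 is rejected; the unguarded expression (uint64_t)128 << 57 evaluates to 0.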

The fix I've got in mind for this is to limit epb to 1ULL << (64 - span) in the loop so that we don't search the indices with L0 offsets that would overflow. Then the i >= epb test causes ESRCH, etc etc. I can write another PR for that.

Meanwhile, I think this has no practical effect, since VFS layers don't allow files ≥ 2⁶³ bytes long because offsets are signed.

Note this case arises when all of the physical BPs are allocated (virtual hole case).

github-actions bot added the Status: Work in Progress Not yet ready for general review label Sep 26, 2025

rrevans marked this pull request as ready for review September 26, 2025 12:06

github-actions bot added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Sep 26, 2025

rrevans mentioned this pull request Sep 26, 2025

dnode_next_offset: backtrack if lower level does not match #16025

Merged

13 tasks

behlendorf self-requested a review September 26, 2025 22:32

rrevans marked this pull request as draft September 29, 2025 12:28

github-actions bot added Status: Work in Progress Not yet ready for general review and removed Status: Code Review Needed Ready for review and testing labels Sep 29, 2025

rrevans force-pushed the level_blkid branch from 4220ff7 to a7333bd Compare October 9, 2025 05:06

rrevans marked this pull request as ready for review October 9, 2025 05:26

github-actions bot added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Oct 9, 2025

amotin reviewed Oct 9, 2025
