Skip to content

Conversation

@amotin
Copy link
Member

@amotin amotin commented Jul 24, 2025

Motivation and Context

The earlier implemented zfs rewrite functionality for simplicity updated logical birth times of all rewritten blocks. It makes them look modified from perspective of replication, snapshot diffs, etc, even though the actual user data remain the same. While some people found it useful to recover corrupted remote backups, for majority replication of large extra amounts of logically unchanged blocks can be a huge waste of time and resources.

Description

This PR implements a new variation of rewrite, called "physical rewrite", controlled by the new -P argument to the zfs rewrite subcommand. When possible, it tries to keep logical birth times unchanged. It allows to distinguish blocks that were just relocated within a pool from blocks that were actually modified by users. While the first may occupy additional disk space due to snapshots, block cloning, etc, that should be accounted as such, they should be ignored by replication, etc.

Previously we've had block pointers with physical birth times bigger than logical birth times only as result of device removal remap process. But in that case space usage accounting was still based on block's logical birth times. Since physical rewrites require space reallocation accounted based on the physical birth times, to differentiate those two cases this PR introduces new "R"/"rewrite" flag in the block pointer structure. When set, it means the block's space accounting should use physical birth time instead of traditional logical birth time. Since read-only pool imports do not really care about space accounting, the new per-dataset pool feature "physical_rewrite" gating this is declared as read-compatible. The feature will be activated on first use and deactivated when last of affected datasets is deleted.

There are two exceptions when logical birth time might still be modified around physical rewrite:

  • In case of dedup hit, producing a different block pointer due to change of checksum algorithm or number of copies. Since the physical birth time must come from the DDT record, we can not put the current TXG there for space accounting, and have to update logical birth time instead, as done for logical rewrite. Aside of this quite rare case attempts to do physical rewrite on dedup'ed blocks should be NOP, but it can be used to enable/disable dedup.
  • In case of device is removal after physical rewrite. Since pointer remapping after device removal must set physical birth times to the removal time, it has to remove the rewrite flag and copy the physical birth times of the blocks into logical birth times to maintain correct space accounting.

Now that we have different birth times in block pointers, traversal code got new TRAVERSE_LOGICAL flag, allowing to choose between traversing only logical changes (replication, diff, etc), or physical changes (scrub/resilver, dataset destroy, etc).

How Has This Been Tested?

Several successful CI runs. Manual testing with zfs rewrite and zfs rewrite -P vs zfs send -i.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin amotin added the Status: Code Review Needed Ready for review and testing label Jul 24, 2025
@amotin amotin mentioned this pull request Jul 24, 2025
13 tasks
@amotin amotin force-pushed the physical_rewrite branch from 4093181 to 6f42692 Compare July 25, 2025 01:22
@gamanakis
Copy link
Contributor

@amotin, thank you for this! on a first pass it looks good to me.

During regular block writes ZFS sets both logical and physical
birth times equal to the current TXG.  During dedup and block
cloning logical birth time is still set to the current TXG, but
physical may be copied from the original block that was used.
This represents the fact that logically user data has changed,
but the physically it is the same old block.

But block rewrite introduces a new situation, when block is not
changed logically, but stored in a different place of the pool.
From ARC, scrub and some other perspectives this is a new block,
but for example for user applications or incremental replication
it is not.  Somewhat similar thing happen during remap phase of
device removal, but in that case space blocks are still acounted
as allocated at their logical birth times.

This patch introduces a new "rewrite" flag in the block pointer
structure, allowing to differentiate physical rewrite (when the
block is actually reallocated at the physical birth time) from
the device reval case (when the logical birth time is used).

The new functionality is not used at this point, and the only
expected change is that error log is now kept in terms of physical
physical birth times, rather than logical, since if a block with
logged error was somehow rewritten, then the previous error does
not matter any more.

This change also introduces a new TRAVERSE_LOGICAL flag to the
traverse code, allowing zfs send, redact and diff to work in
context of logical birth times, ignoring physical-only rewrites.
It also changes nothing at this point due to lack of those writes,
but they will come in a following patch.

Signed-off-by:	Alexander Motin <[email protected]>
@amotin amotin force-pushed the physical_rewrite branch 2 times, most recently from 3b724aa to c9382ac Compare July 30, 2025 17:22
@amotin
Copy link
Member Author

amotin commented Jul 30, 2025

Just a rebase and conflict resolution.

Based on previous commit this implements `zfs rewrite -P` flag,
making ZFS to keep blocks logical birth times while rewriting
files.  It should exclude the rewritten blocks from incremental
sends, snapshot diffs, etc.  Snapshots space usage same time will
reflect the additional space usage from newly allocated blocks.

Since this begins to use new "rewrite" flag in the block pointers,
this commit introduces a new read-compatible per-dataset feature
physical_rewrite.  It must be enabled for the command to not fail,
it is activated on first use and deactivated on deletion of the
last affected dataset.

Signed-off-by:  Alexander Motin <[email protected]>
@amotin amotin force-pushed the physical_rewrite branch from c9382ac to 03ae3fc Compare July 31, 2025 03:04
@amotin amotin requested review from grwilson, pcd1193182 and robn July 31, 2025 15:54
Copy link
Member

@robn robn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks pretty straightforward to me, good job. That one comment describing the macros made the whole thing very understandable.

@robn
Copy link
Member

robn commented Aug 1, 2025

Question: it seems like the only time you wouldn't want the physical rewrite option is if you don't want the pool feature? If true, could we make it the default somehow? I know its tricky if you don't have the feature enabled, and I don't think zfs rewrite with no options should choose or anything (it should still be predictable). I just don't like having the best/common case require a flag.

@amotin
Copy link
Member Author

amotin commented Aug 1, 2025

@robn As I mentioned, somebody was actually happy to use logical rewrite to force replication, but that is a minor case. The requirement of the feature was indeed the main factor for me though. It was not feeling right to me also to change the behavior based on feature status. Plus logical rewrite was implemented first, and we've already merged it to our branch of ZFS 2.3 in upcoming TrueNAS 25.10, so it would be a change in behavior, while we can't merge the physical rewrite there too due to the feature. I don't object it being default, but I am open to hear how.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Aug 1, 2025
@stuartthebruce
Copy link

Are either of the rewrite options mature enough to consider for 2.3.4?

@amotin
Copy link
Member Author

amotin commented Aug 5, 2025

@stuartthebruce The logical rewrite is quite simple, that's why we've merged them into our TrueNAS ZFS sources. I could include it into #17595, if people desire. Physical rewrite though extends the on-disk format and so require a new pool feature, that's why it will not go below 3.4.

@stuartthebruce
Copy link

@stuartthebruce The logical rewrite is quite simple, that's why we've merged them into our TrueNAS ZFS sources. I could include it into #17595, if people desire.

+1

@vedranmiletic
Copy link

Physical rewrite though extends the on-disk format and so require a new pool feature, that's why it will not go below 3.4.

As a user, I would like to have it, but is there a precedent for extending on-disk format in a patch release? It is not so easy to deduce this from the changelog.

@Momi-V
Copy link

Momi-V commented Aug 6, 2025

I believe getting logical rewrite out sooner rather than later is a good thing, especially if it keeps the different OpenZFS Forks mostly feature aligned. It's just strictly superior to the way things have to be done without it.

On disk format changes obviously are a different thing so they can wait, but it's not like running a zsh script checking mtime for consistency is a robust solution.

@behlendorf behlendorf closed this in 4ae8bf4 Aug 6, 2025
behlendorf pushed a commit that referenced this pull request Aug 6, 2025
Based on previous commit this implements `zfs rewrite -P` flag,
making ZFS to keep blocks logical birth times while rewriting
files.  It should exclude the rewritten blocks from incremental
sends, snapshot diffs, etc.  Snapshots space usage same time will
reflect the additional space usage from newly allocated blocks.

Since this begins to use new "rewrite" flag in the block pointers,
this commit introduces a new read-compatible per-dataset feature
physical_rewrite.  It must be enabled for the command to not fail,
it is activated on first use and deactivated on deletion of the
last affected dataset.

Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by:  Alexander Motin <[email protected]>
Closes #17565
@amotin amotin deleted the physical_rewrite branch August 7, 2025 02:12
@amotin
Copy link
Member Author

amotin commented Aug 7, 2025

but is there a precedent for extending on-disk format in a patch release?

Not that I can recall, at least since 2.0. It would be a bad practice to have patch releases incompatible. It should be a decision of distributions, and through them users, when to switch to the next stable branch, accepting risks of possible incompatibilities. Patch release updates should remain a safe routine.

amotin added a commit to amotin/zfs that referenced this pull request Aug 13, 2025
Physical rewrite patch changed the meaning of BP_GET_BIRTH(), but
I missed update one of its occurences, ending up asserting equal
logical birth times instead of equal physical birth times.

Signed-off-by: Alexander Motin <[email protected]>
Fixes openzfs#17565
amotin added a commit that referenced this pull request Aug 13, 2025
Physical rewrite patch changed the meaning of BP_GET_BIRTH(), but
I missed update one of its occurences, ending up asserting equal
logical birth times instead of equal physical birth times.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Fixes #17565
Closes #17631
@tonyhutter tonyhutter mentioned this pull request Aug 19, 2025
14 tasks
spauka pushed a commit to spauka/zfs that referenced this pull request Aug 30, 2025
During regular block writes ZFS sets both logical and physical
birth times equal to the current TXG.  During dedup and block
cloning logical birth time is still set to the current TXG, but
physical may be copied from the original block that was used.
This represents the fact that logically user data has changed,
but the physically it is the same old block.

But block rewrite introduces a new situation, when block is not
changed logically, but stored in a different place of the pool.
From ARC, scrub and some other perspectives this is a new block,
but for example for user applications or incremental replication
it is not.  Somewhat similar thing happen during remap phase of
device removal, but in that case space blocks are still acounted
as allocated at their logical birth times.

This patch introduces a new "rewrite" flag in the block pointer
structure, allowing to differentiate physical rewrite (when the
block is actually reallocated at the physical birth time) from
the device reval case (when the logical birth time is used).

The new functionality is not used at this point, and the only
expected change is that error log is now kept in terms of physical
physical birth times, rather than logical, since if a block with
logged error was somehow rewritten, then the previous error does
not matter any more.

This change also introduces a new TRAVERSE_LOGICAL flag to the
traverse code, allowing zfs send, redact and diff to work in
context of logical birth times, ignoring physical-only rewrites.
It also changes nothing at this point due to lack of those writes,
but they will come in a following patch.

Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Closes openzfs#17565
spauka pushed a commit to spauka/zfs that referenced this pull request Aug 30, 2025
Based on previous commit this implements `zfs rewrite -P` flag,
making ZFS to keep blocks logical birth times while rewriting
files.  It should exclude the rewritten blocks from incremental
sends, snapshot diffs, etc.  Snapshots space usage same time will
reflect the additional space usage from newly allocated blocks.

Since this begins to use new "rewrite" flag in the block pointers,
this commit introduces a new read-compatible per-dataset feature
physical_rewrite.  It must be enabled for the command to not fail,
it is activated on first use and deactivated on deletion of the
last affected dataset.

Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by:  Alexander Motin <[email protected]>
Closes openzfs#17565
spauka pushed a commit to spauka/zfs that referenced this pull request Aug 30, 2025
Physical rewrite patch changed the meaning of BP_GET_BIRTH(), but
I missed update one of its occurences, ending up asserting equal
logical birth times instead of equal physical birth times.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Fixes openzfs#17565
Closes openzfs#17631
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Status: Accepted Ready to integrate (reviewed, tested)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants