Conversation

@ihoro (Contributor) commented Feb 25, 2025

Motivation and Context

There are production cases where loading a metaslab leads to a ZFS panic, presumably due to unexpected entries in its spacemap. The assertions in zfs_range_tree_add_impl() and zfs_range_tree_remove_impl() fail due to overlapping or missing segments, etc. A business may want to keep using such pools while the root cause is being investigated.

Description

The idea is to allow loading such metaslabs, accepting a potential space leak as a trade-off instead of potential data loss.

We already have the zfs_recover module parameter to mitigate various issues, including some range tree cases; this patch adds a zfs_recover_ms parameter that localizes the recovery behavior to the metaslab loading process only.
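
For context, the existing zfs_recover knob is consulted through zfs_panic_recover(); paraphrased from the OpenZFS sources (module/zfs/spa_misc.c, details may differ slightly from the current tree), it is roughly the following. zfs_recover_ms applies the same warn-instead-of-panic idea, scoped to metaslab loading:

    /*
     * Paraphrase of the existing recovery hook: when zfs_recover is set,
     * a would-be panic is downgraded to a warning.
     */
    void
    zfs_panic_recover(const char *fmt, ...)
    {
        va_list adx;

        va_start(adx, fmt);
        vcmn_err(zfs_recover ? CE_WARN : CE_PANIC, fmt, adx);
        va_end(adx);
    }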

The following diagram should help with the details:

[diagram: zfs_recover_ms]

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@github-actions github-actions bot added the Status: Work in Progress Not yet ready for general review label Feb 25, 2025
@ihoro ihoro changed the title from "Add zfs_recover_ms parameter" to "range_tree: Add zfs_recover_rt parameter and extra debug info" Mar 4, 2025
@ihoro (Contributor, Author) commented Mar 4, 2025

The update includes:

  • Rework into a zfs_recover_rt parameter instead of zfs_recover_ms; it covers all range trees.
  • Add the range tree name as extra debug info, resulting in messages like zfs: rt_instance=vdev_obsolete_segments: ... (see the sketch after this list).
  • The metaslab-related trees provide even more detail, e.g. zfs: rt_instance={spa=p1 vdev_guid=4127788562752866619 ms_id=0 ms_allocatable}: ....
  • Update the zfs.4 man page accordingly.
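
As a rough illustration of how the rt_instance name could end up in such messages, here is a sketch of a range-tree-aware wrapper. Apart from zfs_recover_rt and the rt_instance= prefix, the names and the exact formatting below are assumptions, not taken from the patch:

    /*
     * Sketch only -- not the actual patch.  The idea is to prepend the
     * range tree instance name and honor zfs_recover_rt the way
     * zfs_panic_recover() honors zfs_recover.
     */
    static int zfs_recover_rt = B_FALSE;    /* proposed module parameter */

    static void
    rt_panic_recover_sketch(const char *rt_name, const char *fmt, ...)
    {
        va_list adx;
        char msg[512];

        /* Format the caller's message first, then prefix the tree name. */
        va_start(adx, fmt);
        (void) vsnprintf(msg, sizeof (msg), fmt, adx);
        va_end(adx);

        cmn_err(zfs_recover_rt ? CE_WARN : CE_PANIC,
            "zfs: rt_instance=%s: %s", rt_name, msg);
    }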

@ihoro ihoro marked this pull request as ready for review March 4, 2025 13:12
@github-actions github-actions bot added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Mar 4, 2025
@tonyhutter (Contributor) commented:

Looks like there's a build issue:

    CC       module/zfs/libzpool_la-dsl_bookmark.lo
    CC       module/zfs/libzpool_la-dsl_crypt.lo
  module/zfs/dnode.c: In function 'dnode_free_range':
  module/zfs/dnode.c:2441:59: error: passing argument 7 of 'zfs_range_tree_create_usecase' discards 'const' qualifier from pointer target type [-Werror=discarded-qualifiers]
   2441 |                             ZFS_RANGE_TREE_UC_FREE_SPACE, "dn_free_ranges");
        |                                                           ^~~~~~~~~~~~~~~~
  In file included from ./include/sys/space_map.h:34,
                   from ./include/sys/spa.h:46,
                   from ./include/sys/dbuf.h:32,
                   from module/zfs/dnode.c:28:
  ./include/sys/range_tree.h:294:45: note: expected 'char *' but argument is of type 'const char *'
    294 |     zfs_range_tree_usecase_t usecase, char *instance);
        |                                       ~~~~~~^~~~~~~~
    CC       module/zfs/libzpool_la-dsl_dataset.lo
    CC       module/zfs/libzpool_la-dsl_deadlist.lo
  cc1: all warnings being treated as errors
  make[4]: *** [Makefile:10313: module/zfs/libzpool_la-dnode.lo] Error 1
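
The error boils down to const-correctness: the callers pass string literals (const char *) where the prototype declares a plain char *. A minimal standalone illustration of the warning and the obvious fix (not the ZFS code itself; tree_set_name() is invented for the example):

    /* Standalone illustration, not ZFS code. */
    #include <stdio.h>

    /*
     * Declaring the parameter as "char *name" makes calls with string
     * literals warn under -Werror=discarded-qualifiers; const-qualifying
     * it, as below, accepts literals cleanly.
     */
    static void
    tree_set_name(const char *name)
    {
        printf("rt_instance=%s\n", name);
    }

    int
    main(void)
    {
        tree_set_name("dn_free_ranges");    /* string literal is const char * */
        return (0);
    }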

@ihoro (Contributor, Author) commented Mar 7, 2025

> Looks like there's a build issue:

Indeed, there should have been one more dev iteration on my side. It should be fixed now.

@tonyhutter (Contributor) commented:

For some reason a bunch of the runners are failing. I'm going to manually restart them.

@tonyhutter (Contributor) commented:

Please rebase on master; that will pull in the new .github/workflow/* files and allow the tests to complete.

@ihoro (Contributor, Author) commented Mar 11, 2025

> For some reason a bunch of the runners are failing. I'm going to manually restart them.

Thank you.

> Please rebase on master; that will pull in the new .github/workflow/* files and allow the tests to complete.

Sure, it's been rebased 3 commits up, on top of the recent workflow changes; it should work now.

@amotin (Member) left a comment:

I see plenty of places where use case is not defined. We could look better. Also I think in many cases we could still log pool and vdev, even if we have no metaslab number to report.

Commit message:

There are production cases where unexpected range tree segment
adding/removal leads to panic. The root cause investigation requires
more debug info about the range tree and the segments in question when
it happens. In addition, the zfs_recover_rt parameter allows converting
such panics into warnings with a potential space leak as a trade-off.

Signed-off-by: Igor Ostapenko <[email protected]>
@ihoro (Contributor, Author) commented Mar 13, 2025

> I see plenty of places where use case is not defined. We could look better.

It seems it's time to rename the _UC_UNKNOWN flag to the more appropriate _UC_GENERIC. It depicts the intention better: a range tree without special treatment, where zfs_recover_rt simply does not panic upon unexpected additions/removals.

The unknown (now generic) flag is intentionally used for range tree instances where special treatment is not expected. Sometimes it's not about allocated/free space, or it's a temporary tree based on other, already "recovered" ones. Anyway, I think I could review the instances once again; probably some of them should not be GENERIC.

> Also I think in many cases we could still log pool and vdev, even if we have no metaslab number to report.

Yes, it's worth the extra code to be maximally useful. It will come with the next iteration of the patch.

* name string, which can be marked as dynamic to be freed along with the tree
* instance destruction.
*/
#define ZFS_RANGE_TREE_F_UC_GENERIC (1 << 0)
@amotin (Member) commented on the diff above:

I don't think "GENERIC" is really meaningful. Easier and cleaner, I think, would be to just pass 0 if we can't say anything better (and we really should).

@GregorKopka (Contributor) commented:

I disagree with the logic outlined in the diagrams, regardless of whether it can already be triggered by some module parameter:
Freeing something that is already free is completely different from allocating something that is already in use.

It is IMHO a very bad idea to allow the latter to happen.

if (delta < 0 && delta * -1 >= zfs_rs_get_fill(rs, rt)) {
zfs_panic_recover("zfs: attempting to decrease fill to or "
"below 0; probable double remove in segment [%llx:%llx]",
zfs_panic_recover_rt("zfs: rt_instance=%s: attempting to "
@amotin (Member) commented on the diff above:

Cosmetics, but here and in other places I would reduce "rt_instance=" to "rt=", since the longer form provides no extra information and the line is too long. Or otherwise write the full "range tree" to make it more human-readable, if we don't care about length.

@amotin (Member) commented Jun 19, 2025

@ihoro So how about removing the unneeded ZFS_RANGE_TREE_F_UC_GENERIC constant as I asked, and addressing the one other comment above? I don't really believe in the "recovery" part, but I do like the more informative panic messages.

@amotin (Member) commented Jul 25, 2025

@ihoro ping?

@ihoro (Contributor, Author) commented Jul 30, 2025

The debug info part has been extracted into a separate PR: #17581

@amotin (Member) commented Aug 5, 2025

This needs a rebase now that #17581 has been merged.

@behlendorf behlendorf self-requested a review August 6, 2025 19:43
@behlendorf behlendorf added the Status: Revision Needed Changes are required for the PR to be accepted label Aug 7, 2025
@behlendorf (Contributor) commented:

@ihoro can you rebase this now that #17581 has been merged?

@ihoro (Contributor, Author) commented Oct 5, 2025

I've gotten back into the context of this, and there is a desire to revise the interface and expectations from the end-user perspective. I think we could discuss the following topics:

  • First of all, a reminder that this is not a fix. If a pool has an overlapping issue, then there is a chance it is already in trouble, i.e., some data is already lost. There might be reasons to keep it running anyway, but users should understand that ignoring such troubles opens an undefined-behavior path and potentially leads to more issues on top of the existing ones.
  • The above means reworking the man page to avoid wording like "to recover from fatal issues", etc. The old patch simply follows the existing zfs_recover param's concept.
  • There is a strong intuition that this behavior must be detached from the existing zfs_recover. The old patch extends the area the zfs_recover param covers and allows activating only part of it via zfs_recover_rt. What do you think about keeping zfs_recover as is and simply adding a new set of behavior that is activated separately? And, obviously, it should not have "recover" in its name, to avoid being confused with the existing zfs_recover.
  • Also, there is an intuition that the interface should provide a way to ignore such issues only during metaslab loading and still panic upon all other cases. Otherwise, ignoring all range tree overlap issues touches all aspects of ZFS, not only those related to allocation accounting.
  • And we could provide a "full mode" to ignore all range tree issues.

The above is a conceptual discussion of what we would like to have. If we go down the technical road, then we could discuss naming options, whether we want bool-like knobs or bitfield ones, whether it should collaborate with the existing metaslab_debug_load parameter instead of introducing a brand-new one, and so on. (A rough sketch of the bitfield option follows.)
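
To make the bitfield idea concrete, here is a sketch under the assumption of a single bitmask module parameter; every name below (zfs_rt_tolerate_errors and both flag macros) is hypothetical and not taken from the patch:

    /*
     * Hypothetical sketch only -- parameter and flag names are invented
     * for illustration, not part of the patch or OpenZFS.
     */
    #define ZFS_RT_TOLERATE_MS_LOAD     (1 << 0)  /* only while loading metaslabs */
    #define ZFS_RT_TOLERATE_ALL         (1 << 1)  /* "full mode": all range trees */

    static uint_t zfs_rt_tolerate_errors = 0;     /* bitmask tunable, default off */

    static boolean_t
    zfs_rt_error_is_tolerated(boolean_t during_ms_load)
    {
        if (zfs_rt_tolerate_errors & ZFS_RT_TOLERATE_ALL)
            return (B_TRUE);
        if (during_ms_load &&
            (zfs_rt_tolerate_errors & ZFS_RT_TOLERATE_MS_LOAD))
            return (B_TRUE);
        return (B_FALSE);    /* everything else still panics */
    }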

@amotin (Member) commented Oct 6, 2025

> ignore such issues only during metaslab loading and still panic upon all other cases

This makes some sense to me. Duplicate frees found during loading have already happened, are already on disk, and haven't led anywhere so far. Were it to happen earlier (in other places), it would be more informative.

@behlendorf (Contributor) commented:

> ignore such issues only during metaslab loading and still panic upon all other cases

To build on this a little bit, what I think we want is to add a metaslab_skip_unloadable tunable. When set, if any issues are encountered while loading a metaslab from disk (like duplicate frees), then: 1) log an error to the debug log, 2) abort the metaslab load, 3) mark the metaslab as damaged in memory so we don't try it again, and 4) try loading a different metaslab. This would at least allow the pool to be imported read/write without risking further damage to that metaslab. In all other cases I agree we need to panic.
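
A very rough sketch of that control flow follows; only metaslab_skip_unloadable comes from the comment above, while metaslab_sketch_t, ms_damaged, and load_spacemaps_sketch() are placeholders rather than the actual OpenZFS API:

    /*
     * Illustrative sketch; the type, field, and loader names below are
     * placeholders, not the OpenZFS metaslab code.
     */
    typedef struct metaslab_sketch {
        uint64_t    ms_id;
        boolean_t   ms_damaged;    /* skip this metaslab from now on */
    } metaslab_sketch_t;

    extern int load_spacemaps_sketch(metaslab_sketch_t *ms);  /* placeholder */

    static int metaslab_skip_unloadable = 0;    /* proposed tunable */

    static int
    metaslab_load_sketch(metaslab_sketch_t *ms)
    {
        int error = load_spacemaps_sketch(ms);

        if (error != 0 && metaslab_skip_unloadable) {
            /* 1) log an error to the debug log */
            zfs_dbgmsg("skipping unloadable metaslab %llu: error %d",
                (u_longlong_t)ms->ms_id, error);
            /* 2) abort the load and 3) remember it as damaged in memory */
            ms->ms_damaged = B_TRUE;
            /* 4) the caller moves on and tries a different metaslab */
        }
        return (error);
    }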

@GregorKopka (Contributor) commented Oct 7, 2025

> In all other cases I agree we need to panic.

Would it be feasible to 'panic' just the affected pool, preferably in a way that allows (force-)unloading anything related to it without needing a (hard) reboot, instead of halting the whole system?

@amotin (Member) commented Oct 7, 2025

> Would it be feasible to 'panic' just the affected pool

If we were able to handle it as Brian described, we would not need to panic at all.
