Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vdev_open: clear async remove flag after reopen #16921

Merged
merged 1 commit into from
Jan 3, 2025

Conversation

robn
Copy link
Member

@robn robn commented Jan 3, 2025

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

It's possible[*] for a vdev to be flagged for async remove after the pool has suspended. If the removed device has been returned when the pool is resumed, the ASYNC_REMOVE task will still run at the end of txg, and remove the device from the pool again.

[*] Maybe only theoretical. I think to happen the pool would have to suspend, then a different IO would need to return an error, and the media check fail, flagging the async remove. I haven't triggered it on stock OpenZFS. I can trip it regularly on a (not yet published) branch that chains writes and flush IOs, and so is more likely to have outstanding IO after the pool has suspended.

Description

To fix, we clear the async remove flag at reopen, just as we did for the async fault flag in 5de3ac2.

How Has This Been Tested?

ZTS run completed successfully.

I am not really sure how to write a test to prove this (we don't really have the tools), but I think intuitively it makes sense - it matches the existing async fault case.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

It's possible for a vdev to be flagged for async remove after the pool
has suspended. If the remvoed device has been returned when the pool is
resumed, the ASYNC_REMOVE task will still run at the end of txg, and
remove the device from the pool again.

To fix, we clear the async remove flag at reopen, just as we did for the
async fault flag in 5de3ac2.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to make sense. But reminds me #16864 (comment), and makes me think that it would be nice if we would be able to remove (close) vdev even when suspended. In some cases OS may be unable to detach device before it is closed, and in some cases it can not reattach one until it is fully detached.

@amotin amotin added the Status: Code Review Needed Ready for review and testing label Jan 3, 2025
@allanjude
Copy link
Contributor

Seems to make sense. But reminds me #16864 (comment), and makes me think that it would be nice if we would be able to remove (close) vdev even when suspended. In some cases OS may be unable to detach device before it is closed, and in some cases it can not reattach one until it is fully detached.

It might even make sense to try to close all of the devices on suspend, and reopen them for resume? For this case, but also just to make sure we don't have anything stale, and as you mention, are not typing up any resources.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jan 3, 2025
@behlendorf behlendorf merged commit c02e1cf into openzfs:master Jan 3, 2025
22 of 24 checks passed
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jan 3, 2025
It's possible for a vdev to be flagged for async remove after the pool
has suspended. If the removed device has been returned when the pool is
resumed, the ASYNC_REMOVE task will still run at the end of txg, and
remove the device from the pool again.

To fix, we clear the async remove flag at reopen, just as we did for the
async fault flag in 5de3ac2.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#16921
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants