Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

On-demand log-spacemap flush; zpool condense command #16747

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

robn
Copy link
Member

@robn robn commented Nov 13, 2024

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

Normally, log spacemaps are flushed out to the metaslabs when the pool is exported. For large logs, this can lead to export taking an inordinate amount of time.

This PR adds a an on-demand variant of the log spacemap flush, and a zpool condense command to trigger it. With it, an operator can request that log spacemaps be flushed ahead of time, so that there is relatively little work to be done at export time.

Description

There's two halves to this.

First, we add a "mode" to the existing "flushall" behaviour in spa_flush_metaslabs(). The traditional behaviour is now "export mode", and flushes all logs. Some new functions are added to start and stop the flush with a given mode. Then we add a "request" mode, for use by the operator. This follows the same logic of walking the logs and flushing them out, but skips any that were modified after the flush request was made.

The second part is the addition of the zpool condense command, and support library and ioctl additions. This takes a -t <target> parameter, which is the "thing" to condense, flush, garbage-collect or otherwise accelerate background processing for. It's designed so it could be wired up to any similar background process in the future. In particular, I had the dedup log in mind while putting it together.

All the trimmings you'd expect are there. Condense operations can be cancelled, restoring the flushing behaviour to its original pace or schedule. They can be waited on, via condense -w or wait -t condense. The latter combines all condense targets into one signal value, theoretically allowing multiple things to be condensed at the same time, and wait until they're all finished. Without a second or maybe even third target it's unclear to me if this it what the user will expect, but I don't think this is far off and it can be changes when the next thing gets hooked up.

I've included a kstat exposing the pool "unflushed" counters. We used this in our initial investigations. I thought it was going to be useful for the ZTS test, but it ended up being too difficult to control reliably. So nothing here uses it directly, but it doesn't hurt anything to have it there and may help someone, so why not.

How Has This Been Tested?

ztest gets support, and has had many tens of runs without issue. ZTS test has been added, and the entire suite run to successful completion.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Normally, log spacemaps are flushed out to the metaslabs when the pool
is exported. For large logs, this can lead to export taking an
inordinate amount of time.

This commit adds a "mode" parameter for the log spacemap "flushall"
operation, and functions for starting and stopping it in a particular
mode. The existing behaviour of flushing everything is now the "export"
mode.

Then, we add a new "request" mode, that can be triggered externally.
This mode differs in that it only flushes spacemaps that were dirtied
before the current transaction, stopping when the only dirty ones
remaining, if any, are newer.

This commit only adds the behaviours and sets up the entry points; the
next commit will add something to call them.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
@robn robn changed the title On-demand log-spacemap flush; zpool condense command. On-demand log-spacemap flush; zpool condense command Nov 13, 2024
The idea is to have a single command that could signal to any background
cleanup task that it should do its work faster, or care less about not
getting in the way of user IO, or whatever.

This adds the the `zpool condense` command, the `ZFS_IOC_POOL_CONDENSE`
ioctl and counters so userspace can get progress. Because the target
could be anything, there's no particular unit, just a total number of
items to condense and count of how many done.

It also adds a `log-spacemap` condense target, which calls the "request
log flush" function.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be I've missed something, but what will make transactions move and more metaslabs flushed if the pool is idle?

Also, if the pool is idle, will it flush only 5 transactions per transaction? I worry about the amount of new dirty metaslabs/transactions it may produce until it finally converge.

Comment on lines +8799 to +8800
if (argc < 1) {
(void) fprintf(stderr, gettext("missing pool name argument\n"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zpool sync we allow without arguments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I didn't want an "all pools" mode here, because I don't know what might be there in the future, and it might mean different things for different pools. Same way zpool scrub requires a pool arg.

Comment on lines +8791 to +8792
if (cb.type == POOL_CONDENSE_TYPES) {
(void) fprintf(stderr, gettext("missing condense target\n"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be nice to have "all" or allow multiple?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure! I figure we could add it when the second one comes along. Comma-separated might be nice, like wait targets etc.

boolean_t
spa_flush_all_logs_requested(spa_t *spa)
void
spa_log_flushall_start(spa_t *spa, spa_log_flushall_mode_t mode, uint64_t txg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In both the field and function names you just say "log", but you've already mention that aside of spacemap logs we might get dedup logs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, maybe I wasn't clear. These are just the controls for log spacemap flushall. The next thing would be something separate, just triggered from zfs_ioc_condense().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not saying that this function will handle other logs. Just that it is not very specific for externally visible name. Even more for spa_log_flushall_mode actually. But looking now there are other names around it that are also not metaslab-specific, so I don't insist.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I see what you mean. Yeah, just following the surrounding style... :/

@robn
Copy link
Member Author

robn commented Nov 13, 2024

May be I've missed something, but what will make transactions move and more metaslabs flushed if the pool is idle?

Also, if the pool is idle, will it flush only 5 transactions per transaction? I worry about the amount of new dirty metaslabs/transactions it may produce until it finally converge.

@amotin mm, you may be right. It didn't come up in testing, but we hadn't gone out of our way to stop pool activity. (Also I wrote this last year, so probably didn't know about this at the time!)

I'll study it and post an update soon, thanks!

@robn robn added the Status: Revision Needed Changes are required for the PR to be accepted label Nov 13, 2024
@robn
Copy link
Member Author

robn commented Nov 14, 2024

Right, I think I've swapped back in everything I need.

So yes, you're right - when the pool is idle, nothing is pushing things along (same for dedup log, incidentally). The 5s timeout will see some flushed out, but that's all. This is existing behaviour, so I'm ok with that, I think.

So it seems to me that there's two questions.

If the operator has requested spacemap log flush, should we push the sync along a bit? I think it's reasonable to say yes, in theory. Similar idea to the dsl_scan_active() call in txg_sync_thread().

Then, what should the amount be. I forget why we chose minimum 5; probably it was just a number that was easy to see vs 1. Of course, we should use more the quieter things are. Is that just a much higher minimum? Maybe a % of the total, or of the amount dirty? Or is it more like, some larger minimum + inverse of the change rate, so we always get a big amount, and more if there's space to do it.

I'll ask around. Let me know if you have any thoughts.

@amotin
Copy link
Member

amotin commented Nov 14, 2024

@robn I think once user requested condense, we should do it as fast as possible, since user likely waits for it to reboot, export, whatever. This operation makes no sense to do just routinely. The only limitation is not hurt other workload too much.

The amount of sync is a good question. I'd guess we don't want to extend transaction group for too long, neither consume to much memory on dirty data, etc. But my last trip to spacemaps was some time ago, so I don't have specific recommendations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Revision Needed Changes are required for the PR to be accepted
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants