Linux: sync mapped data on umount #16817
base: master
Conversation
Since we have to disable zpl_prune_sb when umounting (the shrinker gets freed before zpl_kill_sb is ever called), the umount might be unable to sync open files. Let's do the sync in zfs_preumount instead.
Signed-off-by: Pavel Snajdr <[email protected]>
Here are some thoughts on how to mitigate the impact of this as much as possible: #16324 (comment)
So it would seem that my patch in #16770 is actually innocent; we can reproduce this even with it reverted. Thing is, it's not entirely deterministic: I can't reproduce it on my dev env, only my buddy @ vpsFree can, and on the production nodes too. The issue goes like this: a member reported that their Alpine upgrades end up with corrupted files on those nodes. Where this reproduces, it can be triggered with a simple C program (see the sketch below).
Steps to reproduce:
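A minimal sketch of the kind of reproducer described: dirty a file through a shared mmap and exit without msync(), so the pages can only reach disk via background writeback or umount. This is an illustrative assumption, not the exact program used in the report.

```c
/* Illustrative reproducer sketch (assumption, not the exact program used). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const size_t len = 1 << 20;	/* 1 MiB test file */

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file-on-zfs-dataset>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDWR | O_CREAT, 0644);
	if (fd < 0 || ftruncate(fd, len) != 0) {
		perror("open/ftruncate");
		return 1;
	}

	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Dirty the mapped pages... */
	memset(p, 'x', len);

	/*
	 * ...and deliberately exit without msync()/fsync(), so the only way
	 * the data reaches disk is background writeback or the umount path.
	 */
	close(fd);
	return 0;
}
```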
So while I'm glad I didn't cause this, it makes me scratch my head even more. Why my dev node doesn't reproduce this is the most burning question I currently have. Ping @robn; @satmandu, @TheUbuntuGuy and @AllKind might also be interested. So while this PR seems to fix the issue, I'm not happy, as I'd like a full root cause analysis before taking any action... will need to spend more time on this.
		if (zp->z_sa_hdl)
			filemap_write_and_wait(ZTOI(zp)->i_mapping);
	}
	mutex_exit(&zfsvfs->z_znodes_lock);
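For readability, here is a hedged reconstruction of the context the fragment above appears to come from: the usual walk over zfsvfs->z_all_znodes under z_znodes_lock. The wrapper name zfs_sync_all_mappings is hypothetical, and this is a sketch rather than the exact diff.

```c
/*
 * Sketch only: reconstruction of the surrounding loop, assuming the
 * standard z_all_znodes walk; the helper name is hypothetical.
 */
static void
zfs_sync_all_mappings(zfsvfs_t *zfsvfs)
{
	znode_t *zp;

	mutex_enter(&zfsvfs->z_znodes_lock);
	for (zp = list_head(&zfsvfs->z_all_znodes); zp != NULL;
	    zp = list_next(&zfsvfs->z_all_znodes, zp)) {
		/* Skip znodes whose SA handle is already torn down. */
		if (zp->z_sa_hdl)
			filemap_write_and_wait(ZTOI(zp)->i_mapping);
	}
	mutex_exit(&zfsvfs->z_znodes_lock);
}
```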
I'd have expected this writeback to happen in iput()->iput_final()->… when the last reference on the inode is dropped. Clearly that isn't happening; we'll need to get to the bottom of why.
Would you have any hints on where to look for what could be relevant, given that I can only reproduce this on older pools? With a sufficiently new pool (circa 2022+) I can't reproduce it and the data consistently ends up on disk as it should...
Turns out this was the only difference between my dev setup, where it doesn't reproduce, and the rest, where it does: production nodes tend to have pools dating from the install time of that machine (and sometimes that goes back a few HW generations). I just didn't see how it could be relevant, so I left it as the last thing to try - and boom :D
If I create a new pool with exactly the same set of features as the older pools, I get nothing, so it really must be an older pool. Tried meddling with xattr=on|sa too (this is all with xattr=on, FWIW).
Wasn't there a difference in how the root znode/dentry of a dataset is set up? Could that be relevant? What I don't understand is how it could be, if we're creating new datasets using new code now.
Actually, …
That looks suspiciously close to the older issues with mmap. Are there any zfs parameters different between those systems? zfs_dmu_offset_next_sync?
No. It's all based off of the same Nix definitions, same nixpkgs revision, etc. - those builds are close to fully reproducible, plus or minus a few details (definitions are hashed, not outputs; in practice it's good enough). Btw there might still be "the old problems" with mmap: filemap has an invalidate lock we're supposed to be using but are not. Though I would expect that to be a different problem than the one I'm trying to solve here. Generally, the problem is that dirty data is not synced on umount, but it only happens if you have an old pool created with older code...
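For context, the lock referred to above is presumably the page-cache invalidate lock (mapping->invalidate_lock, available since Linux 5.15). Below is a minimal sketch of how a filesystem typically holds it around page-cache invalidation; the function name and hole-punch scenario are assumptions, not a proposed ZFS change.

```c
/*
 * Sketch only: illustrates filemap_invalidate_lock()/unlock() usage
 * (Linux 5.15+); the function name and hole-punch context are assumptions.
 */
#include <linux/fs.h>
#include <linux/mm.h>

static void
example_invalidate_range(struct inode *ip, loff_t off, loff_t len)
{
	struct address_space *mapping = ip->i_mapping;

	/* Block concurrent page faults from re-instantiating pages. */
	filemap_invalidate_lock(mapping);
	truncate_inode_pages_range(mapping, off, off + len - 1);
	/* ... filesystem-specific invalidation / hole punching here ... */
	filemap_invalidate_unlock(mapping);
}
```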
Can you quantify "old pool created with older code" a little more? Which versions were you testing with? Which feature flags are enabled for the pools?
It's not even as far back into the past as I thought, actually. So no historical stuff with any dentry setup or anything like that.
We tried current master and a 2.3.0-rc4 WIP taken a few days ago (pools created with this version also don't reproduce the problem). The two pools where this reproduces are from 2020; the two pools where it doesn't, where it works without patching, are from 2022. Worst case I'll have to script out a really long bisect :D No better ideas at this point.
The oldest pools I have (that are in systems with a working Docker daemon) are from ~July 2021. I tried your reproducer, and through the magic of having whole OS snapshots, I can tell those pools were created using the ZFS 2.1.0 release. So if I had to guess, we are looking at behaviour in 2.0 or 0.x.
Motivation and Context
I kind of had the feeling this could be the case, but I hadn't managed to actually produce the problem while testing the patch for #16770... In the end, though, there truly is a possibility that dirty data in the pagecache might not get synced properly due to pruning being disabled during the umount.
Description
How Has This Been Tested?
This was originally reported by our member, so we validated using Alpine apk upgrade and container shutdown, which on top of ZFS leads to cleanup_mnt->deactivate_super->deactivate_locked_super->zpl_kill_sb. Without this patch some of the upgraded files might have regions filled with zeros (probably thanks to init_on_{alloc,free}=1).
Types of changes
Checklist:
All commit messages are properly formatted and contain Signed-off-by.