Linux: sync mapped data on umount #16817

snajpa · 2024-11-28T22:01:24Z

Motivation and Context

I kinda had the feeling this could be the case but I haven't managed to actually produce the problem while I was testing the patch for #16770... but in the end there truly is a possibility that dirty data in pagecache might not get synced properly due to disabled pruning during the umount.

Description

Linux: sync mapped data on umount
    
Since we have to disable zpl_prune_sb when umounting,
because the shrinker gets freed before zpl_kill_sb is
ever called, the umount might be unable to sync open files.
    
Let's do it in zfs_preumount.

How Has This Been Tested?

This was originally reported by our member, so we validated using Alpine apk upgrade and container shutdown, which on top of ZFS leads to cleanup_mnt->deactivate_super->deactivate_locked_super->zpl_kill_sb - without this patch some of the upgraded files might have regions filled with zeros (probably thx to init_on_{alloc,free}=1)

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

Since we have to disable zpl_prune_sb when umounting, because the shrinker gets freed before zpl_kill_sb is ever called, the umount might be unable to sync open files. Let's do it in zfs_preumount. Signed-off-by: Pavel Snajdr <[email protected]>

snajpa · 2024-11-28T22:49:18Z

Here are some thoughts how to mitigate the impact from this as much as possible: #16324 (comment)

snajpa · 2024-12-02T19:10:55Z

so it would seem that actually my patch in #16770 is innocent, we can reproduce this even with that reverted... thing is, it's not entirely deterministic, I can't reproduce it on my dev env, only my buddy @ vpsFree can - and the production nodes too...

the issue goes like this: a member reported that their Alpine apk upgrade and subsequent reboot in the container lead to corrupted files - we found that apk doesn't msync() at all, which is interesting (and I'd consider it a bug on their part) - but on a regular linux-native filesystem that isn't any issue; with OpenZFS, when we start a container with Alpine on top of a ZFS dataset, do the apk upgrade, then after we stop the container, if we run zfs umount, on those nodes where it reproduces, that zfs umount command is fast as it's not supposed to... and files that apk touched contain zero-filled regions

on those nodes where this reproduces, it can be reproduced with this simple C using mmap and no msync:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <sys/sysinfo.h>

int main() {
    char filename[64];
    snprintf(filename, sizeof(filename), "tempfile_%ld.tmp", time(NULL) + rand());

    int fd = open(filename, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    //size_t mem_size = 1024*1024*1024; // 1 GB
    size_t mem_size = 10*1024*1024; // 10 MiB
    if (ftruncate(fd, mem_size) == -1) {
        perror("ftruncate");
        close(fd);
        exit(EXIT_FAILURE);
    }
    void *map = mmap(NULL, mem_size, PROT_WRITE | PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        close(fd);
        exit(EXIT_FAILURE);
    }

    for (size_t i = 0; i < mem_size; i++) {
        ((char *)map)[i] = 0xAA;
    }

    //msync(map, mem_size, MS_SYNC);
    munmap(map, mem_size);
    close(fd);

    printf("Temporary file '%s' created and filled.\n", filename);
    return 0;
}

Steps to reproduce:

start a container with rootfs on top of ZFS dataset
compile and run mmap.c in that container
stop the container
zfs umount + zfs mount
start the container
observe the temp* file created with ./mmap it's not all filled with 0xAA, there are zero-filled regions which shouldn't be there

So while I'm glad I didn't cause this, it makes me scratch my head even more. Why doesn't my dev node reproduce this is the most burning question I currently have.

ping @robn and also @satmandu @TheUbuntuGuy @AllKind might be interested

so while this PR seems to fix that issue, I'm not happy as I'd like a full root cause analysis before taking any action... will need to spend more time on this

behlendorf · 2024-12-03T03:01:13Z

module/os/linux/zfs/zfs_vfsops.c

+			if (zp->z_sa_hdl)
+				filemap_write_and_wait(ZTOI(zp)->i_mapping);
+		}
+		mutex_exit(&zfsvfs->z_znodes_lock);


I'd have expected this writeback to happen in iput()->iput_final()-> when the last reference on the inode is dropped. Clearly that isn't happening, we'll need to get to the bottom of why.

Would you have any hints where to look what can be relevant if I can only reproduce this on older pools? With a sufficiently new pool (cca 2022+) I can't reproduce it and the data ends up consistently on disk as it should...

turns out this was the only difference between my dev setup, where it doesn't reproduce - and the rest where it does, production nodes tend to have pools from install time of that machine (and sometimes it goes back a few HW generations also) - I just didn't see how it could be relevant so I left it as a last thing to try - and boom :D

if I create a new pool with exactly the same set of features as the older pools, I get nothing, so it really must be an older pool - tried meddling with xattr=on|sa too (this is all with xattr=on FWIW)

wasn't there a difference in how the root znode/dentry of a dataset is set up? could that be relevant? what I don't understand is how could it, if we're creating new datasets using new code now?

snajpa · 2024-12-03T03:20:14Z

Actually, apk only uses mmap(2) for reads, it doesn't write that way, interesting.

IvanVolosyuk · 2024-12-05T12:59:12Z

That looks suspiciously close to the older issues with mmap. Are there any zfs parameters different between those systems? zfs_dmu_offset_next_sync?

snajpa · 2024-12-05T13:44:36Z

Are there any zfs parameters different between those systems? zfs_dmu_offset_next_sync?

No. It's all based off of same Nix definions, same nixpkgs revision, etc. - those builds are close to fully reproducible, plusminus a few details (definitions are hashed, not outputs, in practice it's good enough).

Btw there might still be "the old problems" with mmap, filemap has an invalidate lock we're supposed to be using but are not.

Though I would expect that to be a different problem than this thing I'm trying to solve, here, generally, the problem is that dirty data is not synced on umount, but it only happens if you have an old pool created with older code...

behlendorf · 2024-12-05T22:04:36Z

Can you quantify "old pool created with older code" a little more. Which versions were you testing with. Which feature flags are enabled for the pools?

snajpa · 2024-12-05T23:03:11Z

It's not even that back into the past as I thought, actually. So no historical stuff with any dentry or anything like that.

node1.stg (staging node) reproduces the problem with Alpine (and the mmap "reproducer", even if it's not fully what apk does, works too), has a pool from 2020, the patch in this PR mitigates the problem
node24.prg is a relatively new prod node from 2022 and doesn't reproduce the problem, neither with apk or the above mmap.c, no patching needed
and two additional datapoints are my and @aither64's qemu dev pools - my doesn't have a problem with any umounts, Aither's does reproduce this.

We tried current master and 2.3.0-rc4 WIP taken a few days ago (pools created with this version also don't reproduce the problem).

The two where this reproduces, pools from 2020

[[email protected]]
 ~ # zfs get creation tank
NAME  PROPERTY  VALUE                  SOURCE
tank  creation  Tue Jul 28 19:28 2020  -
[[email protected]]
 ~ # zpool get all tank | grep feature@
tank  feature@async_destroy          enabled                        local
tank  feature@empty_bpobj            active                         local
tank  feature@lz4_compress           active                         local
tank  feature@multi_vdev_crash_dump  enabled                        local
tank  feature@spacemap_histogram     active                         local
tank  feature@enabled_txg            active                         local
tank  feature@hole_birth             active                         local
tank  feature@extensible_dataset     active                         local
tank  feature@embedded_data          active                         local
tank  feature@bookmarks              enabled                        local
tank  feature@filesystem_limits      enabled                        local
tank  feature@large_blocks           active                         local
tank  feature@large_dnode            active                         local
tank  feature@sha512                 enabled                        local
tank  feature@skein                  enabled                        local
tank  feature@edonr                  enabled                        local
tank  feature@userobj_accounting     active                         local
tank  feature@encryption             enabled                        local
tank  feature@project_quota          active                         local
tank  feature@device_removal         enabled                        local
tank  feature@obsolete_counts        enabled                        local
tank  feature@zpool_checkpoint       enabled                        local
tank  feature@spacemap_v2            active                         local
tank  feature@allocation_classes     enabled                        local
tank  feature@resilver_defer         enabled                        local
tank  feature@bookmark_v2            enabled                        local
tank  feature@redaction_bookmarks    enabled                        local
tank  feature@redacted_datasets      enabled                        local
tank  feature@bookmark_written       enabled                        local
tank  feature@log_spacemap           active                         local
tank  feature@livelist               enabled                        local
tank  feature@device_rebuild         enabled                        local
tank  feature@zstd_compress          disabled                       local
tank  feature@draid                  disabled                       local
tank  feature@zilsaxattr             disabled                       local
tank  feature@head_errlog            disabled                       local
tank  feature@blake3                 disabled                       local
tank  feature@block_cloning          disabled                       local
tank  feature@vdev_zaps_v2           disabled                       local
tank  feature@redaction_list_spill   disabled                       local
tank  feature@raidz_expansion        disabled                       local

[root@vpsadminos-aither:~]# zfs get creation tank
NAME  PROPERTY  VALUE                  SOURCE
tank  creation  Sat Feb 29 16:12 2020  -

[root@vpsadminos-aither:~]# zpool get all tank | grep feature@
tank  feature@async_destroy          enabled                        local
tank  feature@empty_bpobj            active                         local
tank  feature@lz4_compress           active                         local
tank  feature@multi_vdev_crash_dump  enabled                        local
tank  feature@spacemap_histogram     active                         local
tank  feature@enabled_txg            active                         local
tank  feature@hole_birth             active                         local
tank  feature@extensible_dataset     active                         local
tank  feature@embedded_data          active                         local
tank  feature@bookmarks              enabled                        local
tank  feature@filesystem_limits      enabled                        local
tank  feature@large_blocks           enabled                        local
tank  feature@large_dnode            enabled                        local
tank  feature@sha512                 enabled                        local
tank  feature@skein                  enabled                        local
tank  feature@edonr                  enabled                        local
tank  feature@userobj_accounting     active                         local
tank  feature@encryption             enabled                        local
tank  feature@project_quota          active                         local
tank  feature@device_removal         enabled                        local
tank  feature@obsolete_counts        enabled                        local
tank  feature@zpool_checkpoint       enabled                        local
tank  feature@spacemap_v2            active                         local
tank  feature@allocation_classes     enabled                        local
tank  feature@resilver_defer         enabled                        local
tank  feature@bookmark_v2            enabled                        local
tank  feature@redaction_bookmarks    enabled                        local
tank  feature@redacted_datasets      enabled                        local
tank  feature@bookmark_written       enabled                        local
tank  feature@log_spacemap           active                         local
tank  feature@livelist               enabled                        local
tank  feature@device_rebuild         disabled                       local
tank  feature@zstd_compress          disabled                       local
tank  feature@draid                  disabled                       local
tank  feature@zilsaxattr             disabled                       local
tank  feature@head_errlog            disabled                       local
tank  feature@blake3                 disabled                       local
tank  feature@block_cloning          disabled                       local
tank  feature@vdev_zaps_v2           disabled                       local
tank  feature@redaction_list_spill   disabled                       local
tank  feature@raidz_expansion        disabled                       local
tank  feature@fast_dedup             disabled                       local
tank  feature@longname               disabled                       local
tank  feature@large_microzap         disabled                       local

And the two pools where it doesn't, where it works without patching, pools from 2022

[[email protected]]
 ~ # zfs get creation tank
NAME  PROPERTY  VALUE                  SOURCE
tank  creation  Thu Sep 29 15:00 2022  -
[[email protected]]
 ~ # zpool get all tank | grep feature@
tank  feature@async_destroy          enabled                        local
tank  feature@empty_bpobj            active                         local
tank  feature@lz4_compress           active                         local
tank  feature@multi_vdev_crash_dump  enabled                        local
tank  feature@spacemap_histogram     active                         local
tank  feature@enabled_txg            active                         local
tank  feature@hole_birth             active                         local
tank  feature@extensible_dataset     active                         local
tank  feature@embedded_data          active                         local
tank  feature@bookmarks              enabled                        local
tank  feature@filesystem_limits      enabled                        local
tank  feature@large_blocks           active                         local
tank  feature@large_dnode            active                         local
tank  feature@sha512                 enabled                        local
tank  feature@skein                  enabled                        local
tank  feature@edonr                  enabled                        local
tank  feature@userobj_accounting     active                         local
tank  feature@encryption             enabled                        local
tank  feature@project_quota          active                         local
tank  feature@device_removal         enabled                        local
tank  feature@obsolete_counts        enabled                        local
tank  feature@zpool_checkpoint       enabled                        local
tank  feature@spacemap_v2            active                         local
tank  feature@allocation_classes     enabled                        local
tank  feature@resilver_defer         enabled                        local
tank  feature@bookmark_v2            enabled                        local
tank  feature@redaction_bookmarks    enabled                        local
tank  feature@redacted_datasets      enabled                        local
tank  feature@bookmark_written       enabled                        local
tank  feature@log_spacemap           active                         local
tank  feature@livelist               enabled                        local
tank  feature@device_rebuild         enabled                        local
tank  feature@zstd_compress          enabled                        local
tank  feature@draid                  enabled                        local
tank  feature@zilsaxattr             disabled                       local
tank  feature@head_errlog            disabled                       local
tank  feature@blake3                 disabled                       local
tank  feature@block_cloning          disabled                       local
tank  feature@vdev_zaps_v2           disabled                       local
tank  feature@redaction_list_spill   disabled                       local
tank  feature@raidz_expansion        disabled                       local
tank  feature@fast_dedup             disabled                       local
tank  feature@longname               disabled                       local
tank  feature@large_microzap         disabled                       local

[root@vpsadminos:~]# zfs get creation tank
NAME  PROPERTY  VALUE                  SOURCE
tank  creation  Thu Jan 13 10:49 2022  -

[root@vpsadminos:~]# zpool get all tank | grep feature@
tank  feature@async_destroy          enabled                        local
tank  feature@empty_bpobj            active                         local
tank  feature@lz4_compress           active                         local
tank  feature@multi_vdev_crash_dump  enabled                        local
tank  feature@spacemap_histogram     active                         local
tank  feature@enabled_txg            active                         local
tank  feature@hole_birth             active                         local
tank  feature@extensible_dataset     active                         local
tank  feature@embedded_data          active                         local
tank  feature@bookmarks              enabled                        local
tank  feature@filesystem_limits      enabled                        local
tank  feature@large_blocks           enabled                        local
tank  feature@large_dnode            enabled                        local
tank  feature@sha512                 enabled                        local
tank  feature@skein                  enabled                        local
tank  feature@edonr                  enabled                        local
tank  feature@userobj_accounting     active                         local
tank  feature@encryption             enabled                        local
tank  feature@project_quota          active                         local
tank  feature@device_removal         enabled                        local
tank  feature@obsolete_counts        enabled                        local
tank  feature@zpool_checkpoint       enabled                        local
tank  feature@spacemap_v2            active                         local
tank  feature@allocation_classes     enabled                        local
tank  feature@resilver_defer         enabled                        local
tank  feature@bookmark_v2            enabled                        local
tank  feature@redaction_bookmarks    enabled                        local
tank  feature@redacted_datasets      enabled                        local
tank  feature@bookmark_written       enabled                        local
tank  feature@log_spacemap           active                         local
tank  feature@livelist               enabled                        local
tank  feature@device_rebuild         enabled                        local
tank  feature@zstd_compress          enabled                        local
tank  feature@draid                  enabled                        local
tank  feature@zilsaxattr             disabled                       local
tank  feature@head_errlog            disabled                       local
tank  feature@blake3                 disabled                       local
tank  feature@block_cloning          disabled                       local
tank  feature@vdev_zaps_v2           disabled                       local
tank  feature@redaction_list_spill   disabled                       local
tank  feature@raidz_expansion        disabled                       local
tank  feature@fast_dedup             disabled                       local
tank  feature@longname               disabled                       local
tank  feature@large_microzap         disabled                       local

Worst case I'll have to script out a really long bisect :D No better ideas at this point.

TheUbuntuGuy · 2024-12-12T06:21:43Z

The oldest pools I have (that are in systems with a working Docker daemon) are from ~July 2021:

zfs get creation rootpool
NAME      PROPERTY  VALUE                  SOURCE
rootpool  creation  Sat Jul 24 13:18 2021  -

I tried your mmap.c reproducer, and did not observe any zeroes. They were created with these options:

History for 'rootpool':
2021-07-24.13:18:25 zpool create -o ashift=12 -O compression=lz4 -O xattr=sa -O mountpoint=none -O acltype=posixacl rootpool /dev/nvme0n1p2

and through the magic of having whole OS snapshots, were created using ZFS version 2.1.0 release. So if I had to guess, we are looking at behaviour in 2.0 or 0.x.

Linux: sync mapped data on umount

7a376ea

Since we have to disable zpl_prune_sb when umounting, because the shrinker gets freed before zpl_kill_sb is ever called, the umount might be unable to sync open files. Let's do it in zfs_preumount. Signed-off-by: Pavel Snajdr <[email protected]>

github-actions bot added the Status: Work in Progress Not yet ready for general review label Nov 28, 2024

snajpa mentioned this pull request Nov 28, 2024

Linux: syncfs(2) should sync all cached files #16818

Draft

13 tasks

snajpa marked this pull request as ready for review November 28, 2024 22:19

github-actions bot added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Nov 28, 2024

snajpa mentioned this pull request Nov 28, 2024

Kernel oops page fault triggered by Docker in arc_prune #16324

Open

This was referenced Nov 29, 2024

zfs-2.3.0-rc4 patchset #16760 #16794

Merged

zfs-2.2.7 patchset #16720

Merged

behlendorf reviewed Dec 3, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Linux: sync mapped data on umount #16817

Linux: sync mapped data on umount #16817

snajpa commented Nov 28, 2024 •

edited

Loading

snajpa commented Nov 28, 2024

snajpa commented Dec 2, 2024 •

edited

Loading

behlendorf Dec 3, 2024

snajpa Dec 3, 2024 •

edited

Loading

snajpa commented Dec 3, 2024

IvanVolosyuk commented Dec 5, 2024

snajpa commented Dec 5, 2024 •

edited

Loading

behlendorf commented Dec 5, 2024

snajpa commented Dec 5, 2024 •

edited

Loading

TheUbuntuGuy commented Dec 12, 2024

Linux: sync mapped data on umount #16817

Are you sure you want to change the base?

Linux: sync mapped data on umount #16817

Conversation

snajpa commented Nov 28, 2024 • edited Loading

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

snajpa commented Nov 28, 2024

snajpa commented Dec 2, 2024 • edited Loading

behlendorf Dec 3, 2024

Choose a reason for hiding this comment

snajpa Dec 3, 2024 • edited Loading

Choose a reason for hiding this comment

snajpa commented Dec 3, 2024

IvanVolosyuk commented Dec 5, 2024

snajpa commented Dec 5, 2024 • edited Loading

behlendorf commented Dec 5, 2024

snajpa commented Dec 5, 2024 • edited Loading

TheUbuntuGuy commented Dec 12, 2024

snajpa commented Nov 28, 2024 •

edited

Loading

snajpa commented Dec 2, 2024 •

edited

Loading

snajpa Dec 3, 2024 •

edited

Loading

snajpa commented Dec 5, 2024 •

edited

Loading

snajpa commented Dec 5, 2024 •

edited

Loading