Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Linux: sync mapped data on umount #16817

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

snajpa
Copy link
Contributor

@snajpa snajpa commented Nov 28, 2024

Motivation and Context

I kinda had the feeling this could be the case but I haven't managed to actually produce the problem while I was testing the patch for #16770... but in the end there truly is a possibility that dirty data in pagecache might not get synced properly due to disabled pruning during the umount.

Description

Linux: sync mapped data on umount
    
Since we have to disable zpl_prune_sb when umounting,
because the shrinker gets freed before zpl_kill_sb is
ever called, the umount might be unable to sync open files.
    
Let's do it in zfs_preumount.

How Has This Been Tested?

This was originally reported by our member, so we validated using Alpine apk upgrade and container shutdown, which on top of ZFS leads to cleanup_mnt->deactivate_super->deactivate_locked_super->zpl_kill_sb - without this patch some of the upgraded files might have regions filled with zeros (probably thx to init_on_{alloc,free}=1)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Since we have to disable zpl_prune_sb when umounting,
because the shrinker gets freed before zpl_kill_sb is
ever called, the umount might be unable to sync open files.

Let's do it in zfs_preumount.

Signed-off-by: Pavel Snajdr <[email protected]>
@github-actions github-actions bot added the Status: Work in Progress Not yet ready for general review label Nov 28, 2024
@snajpa snajpa marked this pull request as ready for review November 28, 2024 22:19
@github-actions github-actions bot added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Nov 28, 2024
@snajpa
Copy link
Contributor Author

snajpa commented Nov 28, 2024

Here are some thoughts how to mitigate the impact from this as much as possible: #16324 (comment)

This was referenced Nov 29, 2024
@snajpa
Copy link
Contributor Author

snajpa commented Dec 2, 2024

so it would seem that actually my patch in #16770 is innocent, we can reproduce this even with that reverted... thing is, it's not entirely deterministic, I can't reproduce it on my dev env, only my buddy @ vpsFree can - and the production nodes too...

the issue goes like this: a member reported that their Alpine apk upgrade and subsequent reboot in the container lead to corrupted files - we found that apk doesn't msync() at all, which is interesting (and I'd consider it a bug on their part) - but on a regular linux-native filesystem that isn't any issue; with OpenZFS, when we start a container with Alpine on top of a ZFS dataset, do the apk upgrade, then after we stop the container, if we run zfs umount, on those nodes where it reproduces, that zfs umount command is fast as it's not supposed to... and files that apk touched contain zero-filled regions

on those nodes where this reproduces, it can be reproduced with this simple C using mmap and no msync:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <sys/sysinfo.h>

int main() {
    char filename[64];
    snprintf(filename, sizeof(filename), "tempfile_%ld.tmp", time(NULL) + rand());

    int fd = open(filename, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    //size_t mem_size = 1024*1024*1024; // 1 GB
    size_t mem_size = 10*1024*1024; // 10 MiB
    if (ftruncate(fd, mem_size) == -1) {
        perror("ftruncate");
        close(fd);
        exit(EXIT_FAILURE);
    }
    void *map = mmap(NULL, mem_size, PROT_WRITE | PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        close(fd);
        exit(EXIT_FAILURE);
    }

    for (size_t i = 0; i < mem_size; i++) {
        ((char *)map)[i] = 0xAA;
    }

    //msync(map, mem_size, MS_SYNC);
    munmap(map, mem_size);
    close(fd);

    printf("Temporary file '%s' created and filled.\n", filename);
    return 0;
}

Steps to reproduce:

  • start a container with rootfs on top of ZFS dataset
  • compile and run mmap.c in that container
  • stop the container
  • zfs umount + zfs mount
  • start the container
  • observe the temp* file created with ./mmap it's not all filled with 0xAA, there are zero-filled regions which shouldn't be there

So while I'm glad I didn't cause this, it makes me scratch my head even more. Why doesn't my dev node reproduce this is the most burning question I currently have.

ping @robn and also @satmandu @TheUbuntuGuy @AllKind might be interested

so while this PR seems to fix that issue, I'm not happy as I'd like a full root cause analysis before taking any action... will need to spend more time on this

if (zp->z_sa_hdl)
filemap_write_and_wait(ZTOI(zp)->i_mapping);
}
mutex_exit(&zfsvfs->z_znodes_lock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have expected this writeback to happen in iput()->iput_final()-> when the last reference on the inode is dropped. Clearly that isn't happening, we'll need to get to the bottom of why.

Copy link
Contributor Author

@snajpa snajpa Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you have any hints where to look what can be relevant if I can only reproduce this on older pools? With a sufficiently new pool (cca 2022+) I can't reproduce it and the data ends up consistently on disk as it should...

turns out this was the only difference between my dev setup, where it doesn't reproduce - and the rest where it does, production nodes tend to have pools from install time of that machine (and sometimes it goes back a few HW generations also) - I just didn't see how it could be relevant so I left it as a last thing to try - and boom :D

if I create a new pool with exactly the same set of features as the older pools, I get nothing, so it really must be an older pool - tried meddling with xattr=on|sa too (this is all with xattr=on FWIW)

wasn't there a difference in how the root znode/dentry of a dataset is set up? could that be relevant? what I don't understand is how could it, if we're creating new datasets using new code now?

@snajpa
Copy link
Contributor Author

snajpa commented Dec 3, 2024

Actually, apk only uses mmap(2) for reads, it doesn't write that way, interesting.

@IvanVolosyuk
Copy link
Contributor

That looks suspiciously close to the older issues with mmap. Are there any zfs parameters different between those systems? zfs_dmu_offset_next_sync?

@snajpa
Copy link
Contributor Author

snajpa commented Dec 5, 2024

Are there any zfs parameters different between those systems? zfs_dmu_offset_next_sync?

No. It's all based off of same Nix definions, same nixpkgs revision, etc. - those builds are close to fully reproducible, plusminus a few details (definitions are hashed, not outputs, in practice it's good enough).

Btw there might still be "the old problems" with mmap, filemap has an invalidate lock we're supposed to be using but are not.

Though I would expect that to be a different problem than this thing I'm trying to solve, here, generally, the problem is that dirty data is not synced on umount, but it only happens if you have an old pool created with older code...

@behlendorf
Copy link
Contributor

Can you quantify "old pool created with older code" a little more. Which versions were you testing with. Which feature flags are enabled for the pools?

@snajpa
Copy link
Contributor Author

snajpa commented Dec 5, 2024

It's not even that back into the past as I thought, actually. So no historical stuff with any dentry or anything like that.

  • node1.stg (staging node) reproduces the problem with Alpine (and the mmap "reproducer", even if it's not fully what apk does, works too), has a pool from 2020, the patch in this PR mitigates the problem
  • node24.prg is a relatively new prod node from 2022 and doesn't reproduce the problem, neither with apk or the above mmap.c, no patching needed
  • and two additional datapoints are my and @aither64's qemu dev pools - my doesn't have a problem with any umounts, Aither's does reproduce this.

We tried current master and 2.3.0-rc4 WIP taken a few days ago (pools created with this version also don't reproduce the problem).

The two where this reproduces, pools from 2020
[[email protected]]
 ~ # zfs get creation tank
NAME  PROPERTY  VALUE                  SOURCE
tank  creation  Tue Jul 28 19:28 2020  -
[[email protected]]
 ~ # zpool get all tank | grep feature@
tank  feature@async_destroy          enabled                        local
tank  feature@empty_bpobj            active                         local
tank  feature@lz4_compress           active                         local
tank  feature@multi_vdev_crash_dump  enabled                        local
tank  feature@spacemap_histogram     active                         local
tank  feature@enabled_txg            active                         local
tank  feature@hole_birth             active                         local
tank  feature@extensible_dataset     active                         local
tank  feature@embedded_data          active                         local
tank  feature@bookmarks              enabled                        local
tank  feature@filesystem_limits      enabled                        local
tank  feature@large_blocks           active                         local
tank  feature@large_dnode            active                         local
tank  feature@sha512                 enabled                        local
tank  feature@skein                  enabled                        local
tank  feature@edonr                  enabled                        local
tank  feature@userobj_accounting     active                         local
tank  feature@encryption             enabled                        local
tank  feature@project_quota          active                         local
tank  feature@device_removal         enabled                        local
tank  feature@obsolete_counts        enabled                        local
tank  feature@zpool_checkpoint       enabled                        local
tank  feature@spacemap_v2            active                         local
tank  feature@allocation_classes     enabled                        local
tank  feature@resilver_defer         enabled                        local
tank  feature@bookmark_v2            enabled                        local
tank  feature@redaction_bookmarks    enabled                        local
tank  feature@redacted_datasets      enabled                        local
tank  feature@bookmark_written       enabled                        local
tank  feature@log_spacemap           active                         local
tank  feature@livelist               enabled                        local
tank  feature@device_rebuild         enabled                        local
tank  feature@zstd_compress          disabled                       local
tank  feature@draid                  disabled                       local
tank  feature@zilsaxattr             disabled                       local
tank  feature@head_errlog            disabled                       local
tank  feature@blake3                 disabled                       local
tank  feature@block_cloning          disabled                       local
tank  feature@vdev_zaps_v2           disabled                       local
tank  feature@redaction_list_spill   disabled                       local
tank  feature@raidz_expansion        disabled                       local
[root@vpsadminos-aither:~]# zfs get creation tank
NAME  PROPERTY  VALUE                  SOURCE
tank  creation  Sat Feb 29 16:12 2020  -

[root@vpsadminos-aither:~]# zpool get all tank | grep feature@
tank  feature@async_destroy          enabled                        local
tank  feature@empty_bpobj            active                         local
tank  feature@lz4_compress           active                         local
tank  feature@multi_vdev_crash_dump  enabled                        local
tank  feature@spacemap_histogram     active                         local
tank  feature@enabled_txg            active                         local
tank  feature@hole_birth             active                         local
tank  feature@extensible_dataset     active                         local
tank  feature@embedded_data          active                         local
tank  feature@bookmarks              enabled                        local
tank  feature@filesystem_limits      enabled                        local
tank  feature@large_blocks           enabled                        local
tank  feature@large_dnode            enabled                        local
tank  feature@sha512                 enabled                        local
tank  feature@skein                  enabled                        local
tank  feature@edonr                  enabled                        local
tank  feature@userobj_accounting     active                         local
tank  feature@encryption             enabled                        local
tank  feature@project_quota          active                         local
tank  feature@device_removal         enabled                        local
tank  feature@obsolete_counts        enabled                        local
tank  feature@zpool_checkpoint       enabled                        local
tank  feature@spacemap_v2            active                         local
tank  feature@allocation_classes     enabled                        local
tank  feature@resilver_defer         enabled                        local
tank  feature@bookmark_v2            enabled                        local
tank  feature@redaction_bookmarks    enabled                        local
tank  feature@redacted_datasets      enabled                        local
tank  feature@bookmark_written       enabled                        local
tank  feature@log_spacemap           active                         local
tank  feature@livelist               enabled                        local
tank  feature@device_rebuild         disabled                       local
tank  feature@zstd_compress          disabled                       local
tank  feature@draid                  disabled                       local
tank  feature@zilsaxattr             disabled                       local
tank  feature@head_errlog            disabled                       local
tank  feature@blake3                 disabled                       local
tank  feature@block_cloning          disabled                       local
tank  feature@vdev_zaps_v2           disabled                       local
tank  feature@redaction_list_spill   disabled                       local
tank  feature@raidz_expansion        disabled                       local
tank  feature@fast_dedup             disabled                       local
tank  feature@longname               disabled                       local
tank  feature@large_microzap         disabled                       local
And the two pools where it doesn't, where it works without patching, pools from 2022
[[email protected]]
 ~ # zfs get creation tank
NAME  PROPERTY  VALUE                  SOURCE
tank  creation  Thu Sep 29 15:00 2022  -
[[email protected]]
 ~ # zpool get all tank | grep feature@
tank  feature@async_destroy          enabled                        local
tank  feature@empty_bpobj            active                         local
tank  feature@lz4_compress           active                         local
tank  feature@multi_vdev_crash_dump  enabled                        local
tank  feature@spacemap_histogram     active                         local
tank  feature@enabled_txg            active                         local
tank  feature@hole_birth             active                         local
tank  feature@extensible_dataset     active                         local
tank  feature@embedded_data          active                         local
tank  feature@bookmarks              enabled                        local
tank  feature@filesystem_limits      enabled                        local
tank  feature@large_blocks           active                         local
tank  feature@large_dnode            active                         local
tank  feature@sha512                 enabled                        local
tank  feature@skein                  enabled                        local
tank  feature@edonr                  enabled                        local
tank  feature@userobj_accounting     active                         local
tank  feature@encryption             enabled                        local
tank  feature@project_quota          active                         local
tank  feature@device_removal         enabled                        local
tank  feature@obsolete_counts        enabled                        local
tank  feature@zpool_checkpoint       enabled                        local
tank  feature@spacemap_v2            active                         local
tank  feature@allocation_classes     enabled                        local
tank  feature@resilver_defer         enabled                        local
tank  feature@bookmark_v2            enabled                        local
tank  feature@redaction_bookmarks    enabled                        local
tank  feature@redacted_datasets      enabled                        local
tank  feature@bookmark_written       enabled                        local
tank  feature@log_spacemap           active                         local
tank  feature@livelist               enabled                        local
tank  feature@device_rebuild         enabled                        local
tank  feature@zstd_compress          enabled                        local
tank  feature@draid                  enabled                        local
tank  feature@zilsaxattr             disabled                       local
tank  feature@head_errlog            disabled                       local
tank  feature@blake3                 disabled                       local
tank  feature@block_cloning          disabled                       local
tank  feature@vdev_zaps_v2           disabled                       local
tank  feature@redaction_list_spill   disabled                       local
tank  feature@raidz_expansion        disabled                       local
tank  feature@fast_dedup             disabled                       local
tank  feature@longname               disabled                       local
tank  feature@large_microzap         disabled                       local
[root@vpsadminos:~]# zfs get creation tank
NAME  PROPERTY  VALUE                  SOURCE
tank  creation  Thu Jan 13 10:49 2022  -

[root@vpsadminos:~]# zpool get all tank | grep feature@
tank  feature@async_destroy          enabled                        local
tank  feature@empty_bpobj            active                         local
tank  feature@lz4_compress           active                         local
tank  feature@multi_vdev_crash_dump  enabled                        local
tank  feature@spacemap_histogram     active                         local
tank  feature@enabled_txg            active                         local
tank  feature@hole_birth             active                         local
tank  feature@extensible_dataset     active                         local
tank  feature@embedded_data          active                         local
tank  feature@bookmarks              enabled                        local
tank  feature@filesystem_limits      enabled                        local
tank  feature@large_blocks           enabled                        local
tank  feature@large_dnode            enabled                        local
tank  feature@sha512                 enabled                        local
tank  feature@skein                  enabled                        local
tank  feature@edonr                  enabled                        local
tank  feature@userobj_accounting     active                         local
tank  feature@encryption             enabled                        local
tank  feature@project_quota          active                         local
tank  feature@device_removal         enabled                        local
tank  feature@obsolete_counts        enabled                        local
tank  feature@zpool_checkpoint       enabled                        local
tank  feature@spacemap_v2            active                         local
tank  feature@allocation_classes     enabled                        local
tank  feature@resilver_defer         enabled                        local
tank  feature@bookmark_v2            enabled                        local
tank  feature@redaction_bookmarks    enabled                        local
tank  feature@redacted_datasets      enabled                        local
tank  feature@bookmark_written       enabled                        local
tank  feature@log_spacemap           active                         local
tank  feature@livelist               enabled                        local
tank  feature@device_rebuild         enabled                        local
tank  feature@zstd_compress          enabled                        local
tank  feature@draid                  enabled                        local
tank  feature@zilsaxattr             disabled                       local
tank  feature@head_errlog            disabled                       local
tank  feature@blake3                 disabled                       local
tank  feature@block_cloning          disabled                       local
tank  feature@vdev_zaps_v2           disabled                       local
tank  feature@redaction_list_spill   disabled                       local
tank  feature@raidz_expansion        disabled                       local
tank  feature@fast_dedup             disabled                       local
tank  feature@longname               disabled                       local
tank  feature@large_microzap         disabled                       local

Worst case I'll have to script out a really long bisect :D No better ideas at this point.

@TheUbuntuGuy
Copy link

The oldest pools I have (that are in systems with a working Docker daemon) are from ~July 2021:

zfs get creation rootpool
NAME      PROPERTY  VALUE                  SOURCE
rootpool  creation  Sat Jul 24 13:18 2021  -

I tried your mmap.c reproducer, and did not observe any zeroes. They were created with these options:

History for 'rootpool':
2021-07-24.13:18:25 zpool create -o ashift=12 -O compression=lz4 -O xattr=sa -O mountpoint=none -O acltype=posixacl rootpool /dev/nvme0n1p2

and through the magic of having whole OS snapshots, were created using ZFS version 2.1.0 release. So if I had to guess, we are looking at behaviour in 2.0 or 0.x.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants