
zil_prune_commit_list causes write to be acknowledged without being written durably #16333

atkinsam opened this issue Jul 9, 2024 · 3 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@atkinsam (Contributor) commented Jul 9, 2024

System information

Reproduced with:

  • Amazon Linux / kernel 6.1.92 / aarch64 / OpenZFS 2.1.7
  • Amazon Linux / kernel 6.1.92 / x86_64 / OpenZFS 2.1.7
  • Amazon Linux / kernel 6.1.92 / aarch64 / OpenZFS 2.2.4

Describe the problem you're observing

During some disk availability testing, specifically a test involving preempt-and-abort'ing NVMe reservations on disks on an imported zpool, we observed ZFS occasionally acknowledge sync writes without those writes being durably written to disk.

Describe how to reproduce the problem

So far, we have only observed this problem when using NVMe reservations, specifically when preempt-and-abort'ing reservations on the disks of an imported zpool. The zpool contains two identical disks in a non-redundant configuration (no raidz, mirror, etc.) and has no SLOG device. We run a fio random-write workload against the pool, preempt-and-abort the reservations on both disks, crash the file server, and then have fio verify that all ack'd writes are durable.

We observe this issue more frequently when putting these two disks behind an LVM logical volume:

```
pvcreate /dev/sdf
pvcreate /dev/sdg
vgcreate zfsvg /dev/sdf /dev/sdg
lvcreate --type raid0 -n zfslv -l '100%VG' zfsvg -I 128k -i 2
zpool create -o ashift=12 -O recordsize=128K scratch /dev/zfsvg/zfslv
```

But we have also observed the issue without LVM in the mix. Our assumption is that LVM affects the frequency because it sets REQ_FAILFAST_MASK, increasing the probability of hitting this race.
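The issue does not include the exact fio job used; as a rough sketch of the workload described above (sync random writes with data verification, with hypothetical file path, block size, and job name), something like the following job file would fit:

```ini
; Hypothetical fio job approximating the described workload; the exact
; options the reporter used are not shown in this issue.
[zil-repro]
filename=/scratch/testfile
rw=randwrite
bs=8k
size=1g
ioengine=psync
sync=1
verify=crc32c
verify_fatal=1
; Run the write phase, preempt-and-abort the reservations and crash the
; server, then re-run the same job with --verify-only to check that every
; ack'd write is actually durable.
```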

We added some probes throughout the code and found that the zil_commit_waiter for our ack'd write is always entering this condition:

zfs/module/zfs/zil.c, lines 2784 to 2793 at commit 2566592:

```c
lwb_t *last_lwb = zilog->zl_last_lwb_opened;
if (last_lwb == NULL ||
    last_lwb->lwb_state == LWB_STATE_FLUSH_DONE) {
	/*
	 * All of the itxs this waiter was waiting on
	 * must have already completed (or there were
	 * never any itx's for it to wait on), so it's
	 * safe to skip this waiter and mark it done.
	 */
	zil_commit_waiter_skip(itx->itx_private);
```

and subsequently getting marked as done in zil_commit_waiter_skip. It looks like zl_last_lwb_opened is moved to LWB_STATE_FLUSH_DONE even though it has a non-zero zio error, and the zio error on that previous lwb is never checked when this code decides to mark the waiter as done (the zios may already have been freed by that point anyway).

We tried removing the call to zil_prune_commit_list in zil_commit_writer:

```c
zil_prune_commit_list(zilog);
```

With this call removed, our reproducer has not failed again across thousands of runs.

@atkinsam added the Type: Defect label on Jul 9, 2024
@ahrens (Member) commented Jul 11, 2024

@prakashsurya I think you wrote this originally, and @amotin did some work in the ZIL more recently. Do either of you have thoughts on what might be happening here? It does look like zil_lwb_flush_vdevs_done() is called whenever the zio completes, regardless of whether it has an error, but it isn't checking zio->io_error. If the flush zio fails, maybe we should retry it or wait for a txg sync?

@amotin (Member) commented Jul 11, 2024

As I see it, in case of error zil_lwb_flush_vdevs_done() sets zcw->zcw_zio_error, which should be checked by zil_commit_impl(), which then calls txg_wait_synced() before returning. But I suspect there may be a race under parallel workloads: while one thread is waiting for the transaction commit, a parallel one may see lwb_state == LWB_STATE_FLUSH_DONE and skip waiting.

I need to run right now, but I wonder if we could reuse the lwb_error field, currently used only for allocation errors, to signal write errors as well, and also check it in zil_prune_commit_list() to block the optimization. Though I may need to look deeper to recall all the error handling paths there; it was a while ago.

@amotin (Member) commented Jul 12, 2024

I think that disabling zil_prune_commit_list() may work around only part of the problem. It probably helps because the new LWB write it produces also fails, and that error is what you end up handling, not the original one. It should not help with transient write errors, which may leave the ZIL unreplayable after the failed LWB. If the previous LWB's ZIO is still running when the next is issued, it will propagate its errors to all following ZIOs via parent-child relations. But if the LWB is in the LWB_STATE_FLUSH_DONE state, there is no longer a ZIO for zil_lwb_set_zio_dependency() to call zio_add_child() on to propagate the error, so all the following ZIOs up to the TXG commit completion may complete successfully, yet will never be replayed in case of a crash. It also seems non-trivial to decide how far across LWBs to propagate the error, since on the edge of TXGs zil_sync() will free the stale part of the ZIL chain, and that may vary depending on which ZIOs are still running, etc. It needs much deeper thought, but my head is currently in several other projects.

@robn I wonder if you'd be interested as part of your #16314 work.
