
zil_prune_commit_list causes write to be acknowledged without being written durably #16333

atkinsam opened this issue Jul 9, 2024 · 3 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@atkinsam (Contributor) commented Jul 9, 2024

System information

Reproduced with:

  • Amazon Linux / kernel 6.1.92 / aarch64 / OpenZFS 2.1.7
  • Amazon Linux / kernel 6.1.92 / x86_64 / OpenZFS 2.1.7
  • Amazon Linux / kernel 6.1.92 / aarch64 / OpenZFS 2.2.4

Describe the problem you're observing

During some disk availability testing, specifically a test involving preempt-and-abort'ing NVMe reservations on disks on an imported zpool, we observed ZFS occasionally acknowledge sync writes without those writes being durably written to disk.

Describe how to reproduce the problem

So far, we have only observed this problem when using NVMe reservations, specifically when preempt-and-abort'ing reservations on the disks of an imported zpool. The zpool contains two identical disks in a non-redundant configuration (no raidz, mirror, etc.) and has no SLOG device. We run a fio random-write workload against the pool, preempt-and-abort the reservations on both disks, crash the file server, and then have fio verify that all ack'd writes are durable.

We observe this issue more frequently when putting these two disks behind an LVM logical volume:

```
pvcreate /dev/sdf
pvcreate /dev/sdg
vgcreate zfsvg /dev/sdf /dev/sdg
lvcreate --type raid0 -n zfslv -l '100%VG' zfsvg -I 128k -i 2
zpool create -o ashift=12 -O recordsize=128K scratch /dev/zfsvg/zfslv
```

But we have also observed the issue without LVM in the mix. Our assumption is that LVM affects the frequency because it sets REQ_FAILFAST_MASK, increasing the probability of hitting this race.
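The issue does not include the exact fio job used; as a rough sketch of the workload described above (sync random writes with data verification, with hypothetical file path, block size, and job name), something like the following job file would fit:

```ini
; Hypothetical fio job approximating the described workload; the exact
; options the reporter used are not shown in this issue.
[zil-repro]
filename=/scratch/testfile
rw=randwrite
bs=8k
size=1g
ioengine=psync
sync=1
verify=crc32c
verify_fatal=1
; Run the write phase, preempt-and-abort the reservations and crash the
; server, then re-run the same job with --verify-only to check that every
; ack'd write is actually durable.
```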

We added some probes throughout the code and found that the zil_commit_waiter for our ack'd write is always entering this condition:

zfs/module/zfs/zil.c, lines 2784 to 2793 at commit 2566592:

```c
lwb_t *last_lwb = zilog->zl_last_lwb_opened;
if (last_lwb == NULL ||
    last_lwb->lwb_state == LWB_STATE_FLUSH_DONE) {
	/*
	 * All of the itxs this waiter was waiting on
	 * must have already completed (or there were
	 * never any itx's for it to wait on), so it's
	 * safe to skip this waiter and mark it done.
	 */
	zil_commit_waiter_skip(itx->itx_private);
```

and subsequently getting marked as done in zil_commit_waiter_skip. It looks like zl_last_lwb_opened is moved to LWB_STATE_FLUSH_DONE even though it has a non-zero zio error, and the zio error on that previous lwb is never checked when this code decides to mark the waiter as done (the zios may already have been freed by that point anyway).

We tried removing the call to zil_prune_commit_list in zil_commit_writer:

```c
zil_prune_commit_list(zilog);
```

With this call removed, our reproducer has not failed again across thousands of runs.

@atkinsam added the Type: Defect label on Jul 9, 2024
@ahrens (Member) commented Jul 11, 2024

@prakashsurya I think you wrote this originally, and @amotin did some work in the ZIL more recently. Do either of you have thoughts on what might be happening here? It does look like zil_lwb_flush_vdevs_done() is called whenever the zio completes, regardless of whether it has an error, but it isn't checking zio->io_error. If the flush zio fails, maybe we should retry it or wait for a txg sync?

@amotin (Member) commented Jul 11, 2024

As I see it, in case of error zil_lwb_flush_vdevs_done() sets zcw->zcw_zio_error, which should be checked by zil_commit_impl(), which then calls txg_wait_synced() before returning. But I suspect there may be a race under parallel workloads: while one thread is waiting for the transaction commit, a parallel one may see lwb_state == LWB_STATE_FLUSH_DONE and skip waiting.

I need to run right now, but I wonder if we could reuse the lwb_error field, currently used only for allocation errors, to signal write errors as well, and also check it in zil_prune_commit_list() to block the optimization. Though I may need to look deeper to recall all the error handling paths there; it was a while ago.

@amotin (Member) commented Jul 12, 2024

I think that disabling zil_prune_commit_list() may work around only part of the problem. It probably helps because the new LWB write it produces also fails, and that error is what you end up handling, not the original one. It should not help with transient write errors, which may leave the ZIL unreplayable after the failed LWB. If the previous LWB's ZIO is still running when the next is issued, it will propagate its errors to all following ZIOs via parent-child relations. But if the LWB is in the LWB_STATE_FLUSH_DONE state, there is no longer a ZIO for zil_lwb_set_zio_dependency() to call zio_add_child() on to propagate the error, so all the following ZIOs up to the TXG commit completion may complete successfully, yet will never be replayed in case of a crash. It also seems non-trivial to decide how far across LWBs to propagate the error, since on the edge of TXGs zil_sync() will free the stale part of the ZIL chain, and that may vary depending on which ZIOs are still running, etc. It needs much deeper thought, but my head is currently in several other projects.

@robn I wonder if you'd be interested as part of your #16314 work.
