Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assert if we're logging after final txg was set #16795

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

snajpa
Copy link
Contributor

@snajpa snajpa commented Nov 21, 2024

Motivation and Context

Running into this from time to time when running ZTS

Description

This should avoid this VERIFY trip:

VERIFY3(tx->tx_txg <= spa_final_dirty_txg(os->os_spa)) failed (63 <= 62 PANIC at dbuf.c:2324:dbuf_dirty()
Showing stack for process 587858
CPU: 1 PID: 587858 Comm: txg_sync Tainted: P           OE      6.10.12-
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-
Call Trace:
 <TASK>
 dump_stack_lvl+0x64/0x80
 spl_panic+0x100/0x120 [spl]
 dbuf_dirty+0xd5a/0x1300 [zfs]
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? dmu_buf_will_dirty_impl+0xdf/0x330 [zfs]
 spa_history_log_sync+0x11e/0x620 [zfs]
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? list_head+0x9/0x30 [zfs]
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __cv_destroy+0x92/0x250 [spl]
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? mutex_lock+0x12/0x30
 dsl_sync_task_sync+0xb9/0x120 [zfs]
 dsl_pool_sync+0x4c7/0x6a0 [zfs]
 spa_sync_iterate_to_convergence+0xd8/0x320 [zfs]
 spa_sync+0x308/0x8e0 [zfs]
 ? __wake_up+0x44/0x60
 txg_sync_thread+0x272/0x3c0 [zfs]
 ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
 ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
 thread_generic_wrapper+0x66/0x90 [spl]
 kthread+0xd2/0x100
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x34/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@snajpa snajpa marked this pull request as ready for review November 21, 2024 16:02
@behlendorf
Copy link
Contributor

This issue was recently resolved by #16782. Are you still seeing this with this patch applied?

@snajpa
Copy link
Contributor Author

snajpa commented Nov 21, 2024

Here's an example: https://github.com/openzfs/zfs/actions/runs/11938548960/job/33284373903 - it's (barely) visible in vm1 serial console log

@snajpa
Copy link
Contributor Author

snajpa commented Nov 21, 2024

This issue was recently resolved by #16782. Are you still seeing this with this patch applied?

I'm going to force push the other three PRs I have open rebased on current master to re-check, I've seen this but thought this might be something separate so I didn't re-validate whether it fixes it, so let's find out :)

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Nov 21, 2024
@snajpa
Copy link
Contributor Author

snajpa commented Nov 21, 2024

Pushed, so in ~3hrs we'll know :)

@amotin
Copy link
Member

amotin commented Nov 21, 2024

I agree with Brian that I should have fixed one of the case. Related to this change I think instead of ignoring the messages we should assert it, otherwise it is very difficult to diagnose it. The logging is only a symptom.

@snajpa
Copy link
Contributor Author

snajpa commented Nov 21, 2024

I agree with Brian that I should have fixed one of the case. Related to this change I think instead of ignoring the messages we should assert it, otherwise it is very difficult to diagnose it. The logging is only a symptom.

Good idea, I'll let the tests in the other PRs finish and redo this one to VERIFY or an ASSERT, which one fits better?

@amotin
Copy link
Member

amotin commented Nov 21, 2024

Good idea, I'll let the tests in the other PRs finish and redo this one to VERIFY or an ASSERT, which one fits better?

I'd say ASSERT() + ignore, since it should not be fatal for production, but would be good to catch on debug. BTW, logging the message in the panic was really helpful for diagnostics.

@snajpa
Copy link
Contributor Author

snajpa commented Nov 21, 2024

BTW, logging the message in the panic was really helpful for diagnostics.

So maybe no change would be for the best here? :)

@github-actions github-actions bot added Status: Work in Progress Not yet ready for general review and removed Status: Work in Progress Not yet ready for general review labels Nov 21, 2024
@amotin
Copy link
Member

amotin commented Nov 21, 2024

BTW, logging the message in the panic was really helpful for diagnostics.

So maybe no change would be for the best here? :)

I meant using ASSERT3UF() just when the logging is requested, including the message text, instead of useless panic when it is applied much later. That is how I actually diagnosed the L2ARC issue:

@@ -535,6 +536,7 @@ log_internal(nvlist_t *nvl, const char *operation, spa_t *spa,
        fnvlist_add_uint64(nvl, ZPOOL_HIST_TXG, tx->tx_txg);
        fnvlist_add_uint64(nvl, ZPOOL_HIST_TIME, gethrestime_sec());
 
+       VERIFY3UF(tx->tx_txg, <=, spa_final_dirty_txg(spa), "%s %s", operation, fmt);
        if (dmu_tx_is_syncing(tx)) {
                spa_history_log_sync(nvl, tx);
        } else {

I am not saying the patch is correct, but it was helpful.

@snajpa
Copy link
Contributor Author

snajpa commented Nov 21, 2024

@amotin I think the change as you've made makes sense, there doesn't seem to be anything left to figure out, so I'd say the change is yours, would you mind if I mark you as an author? Maybe I'd just change the format to include what's going on, like "Logged after final txg was set: %s %s"

snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request Nov 21, 2024
This change was developed under this PR:

openzfs#16795

Signed-off-by: Pavel Snajdr <[email protected]>
snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request Nov 21, 2024
This change was developed under this PR:

openzfs#16795

Signed-off-by: Pavel Snajdr <[email protected]>
@snajpa
Copy link
Contributor Author

snajpa commented Nov 21, 2024

(forgot to change it into an actual assert, that's why the second push)

@amotin
Copy link
Member

amotin commented Nov 21, 2024

There is another entrance to spa_history_log_sync() via spa_history_log_nvl(), where I haven't looked deep how to log the message.

snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request Nov 21, 2024
This change was developed under this PR:

openzfs#16795

Signed-off-by: Pavel Snajdr <[email protected]>
snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request Nov 21, 2024
This change was developed under this PR:

openzfs#16795

Signed-off-by: Pavel Snajdr <[email protected]>
@snajpa
Copy link
Contributor Author

snajpa commented Nov 21, 2024

Right, how about I just duplicate the assert in spa_history_log_nvl then?

Didn't know I could expect more logging paths, but they seem to have in common that they call spa_history_log_sync, so grepping for that it seems it's exactly these two paths and that's it, right?

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, except I don't like the extra empty line added before if (err) {.

@snajpa
Copy link
Contributor Author

snajpa commented Nov 21, 2024

oh oops I'll remove it

snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request Nov 21, 2024
This change was developed under this PR:

openzfs#16795

Signed-off-by: Pavel Snajdr <[email protected]>
@snajpa
Copy link
Contributor Author

snajpa commented Nov 21, 2024

Funny, I can't use the 3UF assertion in a way I could with printf, the varargs are required it seems... can't seem to find any variant that would take str without VA? So I'll do it this dumber way :D

This change was developed under this PR:

openzfs#16795

Signed-off-by: Pavel Snajdr <[email protected]>
@snajpa snajpa changed the title Avoid logging after final dirty txg is set Assert if we're logging after final txg was set Nov 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants