Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wayland event polling crash when VSync is disabled, with message "explicit sync is used, but no release point is set" #93669

Closed
akien-mga opened this issue Jun 27, 2024 · 3 comments · Fixed by #93684

Comments

@akien-mga
Copy link
Member

Tested versions

  • Reproducible in 4.3.beta (cae2f85)
  • Introduced in 4.3.beta1, not reproducible in 4.3.dev6 and earlier

System information

Fedora Linux 40 (KDE Plasma) - Wayland - Vulkan (Forward+) - dedicated AMD Radeon RX 7600M XT (RADV NAVI33) - AMD Ryzen 7 7840HS w/ Radeon 780M Graphics (16 Threads)

Issue description

A project using Wayland and with VSync disabled seems to segfault shortly after starting.

Here is the config in project.godot:

[display]

display_server/driver.linuxbsd="wayland"
window/vsync/vsync_mode=0

And the stacktrace:

wp_linux_drm_syncobj_surface_v1@61: error 3: explicit sync is used, but no buffer is attached
ERROR: Wayland protocol error 3 on interface wp_linux_drm_syncobj_surface_v1@61.
   at: _poll_events_thread (platform/linuxbsd/wayland/wayland_thread.cpp:2740)

Thread 21 "godot-git" received signal SIGILL, Illegal instruction.
[Switching to Thread 0x7fffdd8006c0 (LWP 142689)]
0x0000000005ba46a4 in WaylandThread::_poll_events_thread (p_data=0xbba9700) at platform/linuxbsd/wayland/wayland_thread.cpp:2740
2740                                    CRASH_NOW_MSG(vformat("Wayland protocol error %d on interface %s@%d.", error_code, wl_interface ? wl_interface->name : "unknown", id));

(gdb) bt
#0  0x0000000005ba46a4 in WaylandThread::_poll_events_thread (p_data=0xbba9700) at platform/linuxbsd/wayland/wayland_thread.cpp:2740
#1  0x000000000a5fa180 in Thread::callback (p_caller_id=21, p_settings=..., p_callback=0x5ba4490 <WaylandThread::_poll_events_thread(void*)>, p_userdata=0xbba9700) at ./core/os/thread.cpp:64
#2  0x000000000a60fd43 in std::__invoke_impl<void, void (*)(unsigned long, Thread::Settings const&, void (*)(void*), void*), unsigned long, Thread::Settings, void (*)(void*), void*>
    (__f=@0xbbb5de8: 0xa5fa0f2 <Thread::callback(unsigned long, Thread::Settings const&, void (*)(void*), void*)>) at /usr/include/c++/14/bits/invoke.h:61
#3  0x000000000a60fc0c in std::__invoke<void (*)(unsigned long, Thread::Settings const&, void (*)(void*), void*), unsigned long, Thread::Settings, void (*)(void*), void*>
    (__fn=@0xbbb5de8: 0xa5fa0f2 <Thread::callback(unsigned long, Thread::Settings const&, void (*)(void*), void*)>) at /usr/include/c++/14/bits/invoke.h:96
#4  0x000000000a60fac7 in std::thread::_Invoker<std::tuple<void (*)(unsigned long, Thread::Settings const&, void (*)(void*), void*), unsigned long, Thread::Settings, void (*)(void*), void*> >::_M_invoke<0ul, 1ul, 2ul, 3ul, 4ul> (this=0xbbb5dc8) at /usr/include/c++/14/bits/std_thread.h:301
#5  0x000000000a60e5fa in std::thread::_Invoker<std::tuple<void (*)(unsigned long, Thread::Settings const&, void (*)(void*), void*), unsigned long, Thread::Settings, void (*)(void*), void*> >::operator()
    (this=0xbbb5dc8) at /usr/include/c++/14/bits/std_thread.h:308
#6  0x000000000a60d572 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(unsigned long, Thread::Settings const&, void (*)(void*), void*), unsigned long, Thread::Settings, void (*)(void*), void*> > >::_M_run (this=0xbbb5dc0) at /usr/include/c++/14/bits/std_thread.h:253
#7  0x000000000b080fc4 in execute_native_thread_routine ()
#8  0x00007ffff7d621b7 in start_thread () at /lib64/libc.so.6
#9  0x00007ffff7de43cc in clone3 () at /lib64/libc.so.6

Steps to reproduce

  • On Linux, enable prefer_wayland and disable Vsync
  • Run an empty scene

Minimal reproduction project (MRP)

test12.zip

@Riteo
Copy link
Contributor

Riteo commented Jun 27, 2024

All right, so, after a short panic, I think I got it.

Funnily enough, another user reported this to me in a more exotic usecase and I, out of experience, shrugged it off as a driver bug (oh was I naive). Note that they used vsync, so this is certainly not a vsync issue.

It looks like putting wl_surface_commit inside some asynchronous methods (namely WaylandThread::_frame_callback_on_done) made a threading issue evident, after the huge "explicit sync revolution" (apparently also firefox has this issue1).

If, due to multithreading, the command gets queued while an explicit sync is getting set up and before a buffer is set, the no_buffer error rightfully shows up and it explodes.

This apparently got also noticed while developing the new protocol2 and it has been concluded as misuse of the protocol by clients like us, so we ought to fix that by limiting commits to at least the same thread (ideally once by the renderer). I already tried reducing commits before, but wasn't able to do so because I forgot about low-consumption mode (my archnemesis).

I think I got a quick patch that should remove even more unneeded commits and sync the only one needed for frame timeout to the main thread . It's not pretty, but if it works we can look to perhaps only spam-commit if in low-usage mode or only when checking, not sure. (this should get only called each frame anyways so whatever).

What?

If all of that flew over the reader's head, don't worry, I'm still a bit panicking :gdsweat: . Please try this patch:

UPDATE: There's a PR now which is a bit more complex: #93684

diff --git a/platform/linuxbsd/wayland/display_server_wayland.cpp b/platform/linuxbsd/wayland/display_server_wayland.cpp
index 3fad8c2987..363c14bbcd 100644
--- a/platform/linuxbsd/wayland/display_server_wayland.cpp
+++ b/platform/linuxbsd/wayland/display_server_wayland.cpp
@@ -1192,6 +1192,7 @@ void DisplayServerWayland::process_events() {
 
 	wayland_thread.keyboard_echo_keys();
 
+	wl_surface_commit(wayland_thread.window_get_wl_surface(MAIN_WINDOW_ID));
 	if (!suspended) {
 		if (emulate_vsync) {
 			// Due to various reasons, we manually handle display synchronization by
diff --git a/platform/linuxbsd/wayland/wayland_thread.cpp b/platform/linuxbsd/wayland/wayland_thread.cpp
index 63a8db07df..02b204f6be 100644
--- a/platform/linuxbsd/wayland/wayland_thread.cpp
+++ b/platform/linuxbsd/wayland/wayland_thread.cpp
@@ -968,7 +968,6 @@ void WaylandThread::_frame_wl_callback_on_done(void *data, struct wl_callback *w
 
 	ws->frame_callback = wl_surface_frame(ws->wl_surface),
 	wl_callback_add_listener(ws->frame_callback, &frame_wl_callback_listener, ws);
-	wl_surface_commit(ws->wl_surface);
 
 	if (ws->wl_surface && ws->buffer_scale_changed) {
 		// NOTE: We're only now setting the buffer scale as the idea is to get this
@@ -3241,10 +3240,6 @@ void WaylandThread::window_create(DisplayServer::WindowID p_window_id, int p_wid
 	ws.frame_callback = wl_surface_frame(ws.wl_surface);
 	wl_callback_add_listener(ws.frame_callback, &frame_wl_callback_listener, &ws);
 
-	// NOTE: This commit is only called once to start the whole frame callback
-	// "loop".
-	wl_surface_commit(ws.wl_surface);
-
 	if (registry.xdg_exporter) {
 		ws.xdg_exported = zxdg_exporter_v1_export(registry.xdg_exporter, ws.wl_surface);
 		zxdg_exported_v1_add_listener(ws.xdg_exported, &xdg_exported_listener, &ws);

Note: I've not tested the above as I don't have an explicit sync setup (yet), I'm working on getting one ASAP.

See also: NVIDIA/egl-wayland#110

Edit: made wording a little bit clearer, it sounded like the protocol was wrong or something. We're in the wrong and have always been, it popped up just now.

Footnotes

  1. https://bugzilla.mozilla.org/show_bug.cgi?id=1898476

  2. https://gitlab.freedesktop.org/wayland/wayland-protocols/-/merge_requests/90#note_2243522

@Riteo
Copy link
Contributor

Riteo commented Jun 27, 2024

Ok, update, I managed to set up an environment with explicit-sync and I can indeed replicate this bug.

It seems indeed that the aforementioned patch fixes the issue! I'd be tempted to make a PR as-is, but I'll look at it harder to make sure whether committing each process frame is a problem (I don't think) and if so whether I can avoid doing that (detecting low-usage mode?)

@Riteo
Copy link
Contributor

Riteo commented Jun 27, 2024

All right, I think that, finally, the PR is ready! Note that it was a bit more complicated than I expected due to low processor usage mode but according to my tests, it fixes the problem!

Here it is: #93684.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Release Blocker
Development

Successfully merging a pull request may close this issue.

2 participants