-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Safe darktable shutdown #18061
base: master
Are you sure you want to change the base?
Safe darktable shutdown #18061
Conversation
Lua doesn't need the GUI to run, except for the case of
Can you give me an instance or scenario where it fails or causes a crash? Or are you talking about the final lock in
Can I have a definition or example of that so I have some idea what to test? |
OK. So this remarks should be changed. Maybe you make a proposal or i just take your comment ?
Ok, as long as as does not read "fresh" data or uses the lib stuff.
Sure - just look at the commit code :-)
|
What are backjobs? Is that in 5.0? EDIT: So why not move dt_lua_finalize() after all the backjobs? dt_lua_lock() works until dt_lua_finalize() runs. |
Lua has been working just fine until backjobs were added. So since the only thing that changed was backjobs, that's where the problem is, not in Lua. The "problematic" code is that the backjobs are firing events that Lua is supposed to handle. The reason for the crashes is that the backjobs were added after Lua has stopped. So the fix is to put the backjobs before Lua finalizes so that Lua can handle the events. |
df0bad0
to
8e00c68
Compare
Well, the backjobs were not added but continue running ... Just mentioning this, i was also
I agree. I was a bit reluctant to do so as that has been in that position for a "long time" . Had that here in "test commit" and indeed could not find any issue the last days doing that, so squashed and pushing that now. I would appreciate testing on your side as you will know best if anythings behaves strange. |
8e00c68
to
3ef0845
Compare
Fixed, squashed and force-pushed after above discussion, updating the top "presentation" to current status |
3ef0845
to
8093805
Compare
8093805
to
5319fa8
Compare
Again fixed, squashed and force-pushed after above discussion, updating the top "presentation" to current status |
Added a 4-th commit with two additional and i think rather important fixes. Affected were |
a2b085a
to
8442b6a
Compare
8442b6a
to
d59c97e
Compare
@TurboGit i did a lot more testing the last week or so, checking what could possibly be done for the 5.0 branch as a bugfix. Remember this all is about safe closing without a crash if there is a background job still running when you do ctrl-q. If we would take only a497c56 the chance for a safe close is already pretty good (all the other commits fix bugs that happen mostly with heavy cache pressure by multiple background jobs or by discarding a job) so i personally would suggest to take that for 5.0 too. |
Potentially fixes #18208 |
Nope. This PR is fixing instabilities and crashes while shutting down in the background. |
How is darktable shutdown working safely? 1. dt_control_quit() is called via user request, it sets control->running anatomically to DT_CONTROL_STATE_CLEANUP implying that dt_control_running() will not be TRUE any more. It finally calls gtk_main_quit() so with current code we don't have gtk events after that. 2. Quitting gtk also means **we are exactly here** to continue the shutdown. Anything requiring a still active UI must be done before ... 3. dt_control_shutdown() first waits for all threads to be joined, that means waiting for all pending jobs to be done completely. 4. So we have to ensure: 1) a full working software stack including image_cache, mipmap_cache and darktable.imageio to allow processing the pixelpipe and full access to all images. 2) we don't access darktable.lib related stuff with dereferencing. 3) The pipeline processing uses access to gui related data - focus, active module ... so make sure to avoid those too. 4) As lua events might be fired by the backthreads it's state mutex must still be unlocked. 5. After dt_control_shutdown() has finished we are sure there are no background threads running any more so we can safely close all mentioned subsystems and continue. 6. In shutdown mode - tested via dt_control_running() is FALSE - we will not add new jobs. Avoids unlikely but possible race conditions for now.
1. A very annoying bug in dt_control_progress_cancel() has been fixed. Calling the progress->cancel callback is somewhat tricky as that might include destroying the progress structure which also includes destroying the process->mutex. _control_progress_cancel_callback() doing in this order --> dt_control_job_cancel() --> _control_job_set_state --> dt_control_progress_destroy So in this case - we used dt_control_progress_attach_job() a) we **must not** call progress->cancel with the mutex locked as that would make the mutex destroy failing leading to assert() failures in debug builds and an undefined behaviour in release builds. b) We unlock a mutex immediately after that but that location is not existing any more. Otherwise, do the cancel callback in mutex locked state as the progress struct will not be destroyed. 2. We strictly avoid progress de-referencing if being NULL for safety, in case of getters we return safe dummies, in case of putters we avoid the function.
…own part 3) In dt_image_update_final_size() we **must not** use dt_cache_release() if the entry is NULL. In dt_image_get_final_size() we kept a pointer to a dt_image_t struct and later used it after the original dt_image_t has been released. If system is under heavy cache load this could very well fail - and it does.
If we get `dt_image_t *img = dt_image_cache_get()` from the image cache, img can be NULL for various reasons. In all those cases we must not de-reference img in any way and must use a meaningful value - like when getting the imgid - and avoid further processing on that data. We had that being checked in most cases, a few ones have been overseen yet.
d59c97e
to
77c1255
Compare
EDIT again 01/04
This is a comprehensive collection of everything necessary to avoid darktable crashing after closing the app via ctrl-q or click on close-button. There have been lengthy discussions in previous pr's #18023 and #17984 for those of you interested in the progress we made. It has been a very hard and bumpy ride to get this all sorted out, with this pr it is very hard to get dt crashing after closing the ui. Thanks to all of you having tested or reviewed commented on those pr's.
I had still observed rare and spurious problems with the late PRs if there are many backjobs running exporting and flipping the same images for example in rawspeed loader, exif blob reading and mem access in nirvana.
Also there might be issues in dbus as reported by @zisoft but i couldn't trigger those.
The lua terminations problems have been finally understood and fixed.
The good one: The bunch is intended for master and 5.0 branch if reviewed for this.
The bad one for 5.0, to work safely the first 3 commits! are required
Commit 1 Fixes the order in control shutdown
Commit 2+3 fix crashes related to wrong cache usage. All 3 big bugs were not detected and were likely not happening if backthread processing on images had much lower cache pressure.
Proposal 4 Some additional easy fixes related to mipmap caching and bad dt_image_t returned and not fully checking for safe results. Not as important ...
Let's go into details from the commit messages for
easier
understanding why this all necessary.Control quit and shutdown order issue (shutdown part 1)
How is darktable shutdown working safely?
dt_control_quit()
is called via user request, it setscontrol->running
anatomically to DT_CONTROL_STATE_CLEANUP implying thatdt_control_running()
will not be TRUE any more.It finally calls
gtk_main_quit()
so with current code we don't have gtk events after that. Avoid race conditions by using an anatomic flag ensuring the valid code part is executed only once.Quitting gtk also means we are exactly here to continue the shutdown. Anything requiring a still active UI must be done before ...
dt_control_shutdown()
first waits for all threads to be joined, that means waiting for all pending jobs to be done completely.So we have to ensure:
After
dt_control_shutdown()
has finished we are sure there are no background threads running any more so we can safely close all mentioned subsystems and continue.In shutdown mode - tested via
dt_control_running()
is FALSE - we will not add new jobs. Avoids unlikely but possible race conditions for now.Progress destroy and safety (shutdown part 2)
A very annoying bug in
dt_control_progress_cancel()
has been fixed.Calling the
progress->cancel
callback is somewhat tricky as that might include destroying the progress structure which also includes destroying the process->mutex.dt_control_progress_cancel_callback()
is doing that as it deep-down callsdt_control_job_cancel()
which finally wants to destroy the mutex but as it's locked it can't do so.So, if
progress->cancel == &dt_control_progress_cancel_callback
we must not callprogress->cancel
with the mutex locked as that would make the mutex destroy failing leading to assert() failures in debug builds and we can't use that mutex anyway leading to a crash.Otherwise do the cancel callback in mutex locked state.
We strictly avoid progress de-referencing if being NULL for safety, in case of getters we return safe dummies,
in case of putters we avoid the function.
Fix
dt_image_update_final_size()
anddt_image_get_final_size()
(shutdown part 3)In
dt_image_update_final_size()
we must not usedt_cache_release()
if the entry is NULL because of dereferencing.In
dt_image_get_final_size()
we kept a pointer to adt_image_t
struct and later used it after the originaldt_image_t
has been released.If system is under heavy cache load this could very well fail - and it does.
dt_image_t not available fixes (proposal shutdown part 4)
If we get
dt_image_t *img = dt_image_cache_get()
from the image cacheimg
can beNULL
for various reasons.In all those cases we must not de-reference img in any way and must use a meaningful value - like when getting the imgid - and avoid further processing on that data.
We had that being checked in most cases, a few ones have been overseen yet.
Above has been edited to current commit status