
Error out if we fail to kill child test processes #80

Open · yarikoptic opened this issue Jul 8, 2024 · 13 comments

Labels: performance (More efficient use of time and space)

Comments

@yarikoptic
Member

Inspired by the davfs2 stall, during which our test processes also stalled and did not succumb to kill -9: we should verify (for up to a minute) that a process we kill upon timeout actually dies. If it doesn't, even after we tried kill -9, error out the entire process to bring attention to the matter.

@jwodder
Member

jwodder commented Jul 8, 2024

@yarikoptic When I ran kill -9 on the test processes last week, it was outside of, and independent from, the dandisets-healthstatus script. It has nothing to do with the timeout implemented within the script, which simply does:

with anyio.fail_after(TIMEOUT):
    await anyio.run_process( ... )

When it comes to just dandisets-healthstatus, all of the termination of timed-out processes is handled by anyio, and I don't know what signals it sends.
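
For concreteness, a minimal self-contained version of that pattern might look like the following; TIMEOUT and COMMAND here are placeholder values, not the ones the healthstatus script actually uses:

import anyio

TIMEOUT = 3600  # seconds; placeholder value
COMMAND = ["sleep", "10"]  # placeholder test command

async def run_one_test() -> None:
    try:
        with anyio.fail_after(TIMEOUT):
            # On timeout, anyio cancels this call, which kills the subprocess
            # and then waits for it to exit before re-raising.
            await anyio.run_process(COMMAND)
    except TimeoutError:
        print("test timed out")

if __name__ == "__main__":
    anyio.run(run_one_test)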

@yarikoptic
Member Author

so this code might need to be modified here after analysis of what anyio does internally.

@jwodder jwodder changed the title error out if we fail to kill started by us child test process Error out if we fail to kill child test processes started by us Jul 9, 2024
@jwodder jwodder changed the title Error out if we fail to kill child test processes started by us Error out if we fail to kill child test processes Jul 9, 2024
@jwodder
Member

jwodder commented Jul 9, 2024

@yarikoptic I believe the relevant code in anyio is:

https://github.com/agronholm/anyio/blob/5675f09e7ac9e1b9c5e6d81ab523fd83f6a0b00f/src/anyio/_backends/_asyncio.py#L961-L968

(Shouldn't GitHub show the code in the comment here? It's not doing it for me.)

Specifically, if a subprocess needs to be cancelled (e.g., due to an enclosing timeout), anyio kills the process and then waits, uncancellably, for it to exit, and that's where our processes were hanging.

I believe that handling things any other way (without reimplementing anyio) would require changes to anyio itself.
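
Put differently, the cleanup at that link amounts to roughly the following (a paraphrase for illustration, not a verbatim copy of anyio's code):

from anyio import CancelScope

async def _cleanup_on_cancel(process) -> None:
    # Paraphrase of what anyio does when a run_process() call is cancelled:
    process.kill()                    # send SIGKILL to the child
    with CancelScope(shield=True):    # shielded, so the wait itself cannot be cancelled
        await process.wait()          # blocks indefinitely if the child never actually dies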

@yarikoptic
Member Author

> (Shouldn't GitHub show the code in the comment here? It's not doing it for me.)

It used to do that for me, I think, but it no longer does.

Could we subclass the Process class there with the desired logic and use it instead?

Could you then prepare a PR to anyio with the needed logic?

@jwodder
Member

jwodder commented Jul 10, 2024

@yarikoptic Subclassing Process isn't enough; we'd need to copy & modify anyio.run_process() to make it use the new subclass, and I would rather not do that.

I've filed a feature request with anyio: agronholm/anyio#757

@jwodder
Member

jwodder commented Jul 11, 2024

@yarikoptic The maintainer of anyio replied on the issue I linked above; it seems the requested functionality would be undesirable because it would leave zombie processes behind.

@yarikoptic
Member Author

I followed up there. Could we have another thread (or similar) that monitors whether any of the tests stall and does extra killing/erroring out?

@jwodder
Member

jwodder commented Jul 11, 2024

@yarikoptic Even if we could come up with a decent way to check for stalling, if we wanted the program to error out on a stall, the exception would still trigger the same process cleanup code I linked to above, and any cleanup currently stalled would just continue to be stalled. I believe the only way out would be for the healthstatus program to send SIGKILL to itself, which doesn't seem like a good idea.

@jwodder jwodder added the performance More efficient use of time and space label Jul 12, 2024
@jwodder
Member

jwodder commented Jul 22, 2024

@yarikoptic Ping; do you still want to do this somehow?

@jwodder
Member

jwodder commented Jul 26, 2024

@yarikoptic Ping.

@yarikoptic
Member Author

Well, the main problem at the moment is that we simply have no indication that a stall has happened. If we detect it, exit with an error, and thereby cause some kind of email notification to be sent, we would be fine even if a stuck process remains -- we would know there is an issue and could come mitigate it.

Overall it might also be a matter of establishing some "invocation/progress monitoring", e.g. via https://healthchecks.io/, where we curl the ping endpoint after each run completes and expect that to happen at least daily. WDYT?
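
A minimal sketch of such a ping (the URL below is a placeholder, not an actual check for this project):

from urllib.request import urlopen

HEALTHCHECK_URL = "https://hc-ping.com/<check-uuid>"  # placeholder; a real check gets its UUID from healthchecks.io

def ping_healthcheck() -> None:
    # A plain GET marks the check as "up"; if no ping arrives within the
    # configured period (e.g. one day), healthchecks.io sends an alert.
    with urlopen(HEALTHCHECK_URL, timeout=10) as resp:
        resp.read()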

@jwodder
Member

jwodder commented Jul 26, 2024

@yarikoptic I think just using healthchecks.io or similar for monitoring should be cleaner.

@yarikoptic
Member Author

Let's close this whenever we add some such monitoring (I thought to do it now but am failing to log in... will try later).
