
Error out if we fail to kill child test processes #80

Open · yarikoptic opened this issue Jul 8, 2024 · 13 comments

Labels: performance (More efficient use of time and space)

Comments

@yarikoptic
Member

Inspired by the davfs2 stall, during which our test processes also stalled and did not succumb to kill -9: we should verify (for up to a minute) that a process we kill upon timeout actually dies. If it doesn't, even after we tried kill -9, error out the entire process to bring attention to the matter.

@jwodder
Member

jwodder commented Jul 8, 2024

@yarikoptic When I ran kill -9 on the test processes last week, it was outside of, and independent from, the dandisets-healthstatus script. It has nothing to do with the timeout implemented within the script, which simply does:

with anyio.fail_after(TIMEOUT):
    await anyio.run_process( ... )

When it comes to just dandisets-healthstatus, all of the termination of timed-out processes is handled by anyio, and I don't know what signals it sends.
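
For concreteness, a minimal self-contained version of that pattern might look like the following; TIMEOUT and COMMAND here are placeholder values, not the ones the healthstatus script actually uses:

import anyio

TIMEOUT = 3600  # seconds; placeholder value
COMMAND = ["sleep", "10"]  # placeholder test command

async def run_one_test() -> None:
    try:
        with anyio.fail_after(TIMEOUT):
            # On timeout, anyio cancels this call, which kills the subprocess
            # and then waits for it to exit before re-raising.
            await anyio.run_process(COMMAND)
    except TimeoutError:
        print("test timed out")

if __name__ == "__main__":
    anyio.run(run_one_test)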

@yarikoptic
Member Author

so this code might need to be modified here after analysis of what anyio does internally.

@jwodder jwodder changed the title error out if we fail to kill started by us child test process Error out if we fail to kill child test processes started by us Jul 9, 2024
@jwodder jwodder changed the title Error out if we fail to kill child test processes started by us Error out if we fail to kill child test processes Jul 9, 2024
@jwodder
Member

jwodder commented Jul 9, 2024

@yarikoptic I believe the relevant code in anyio is:

https://github.com/agronholm/anyio/blob/5675f09e7ac9e1b9c5e6d81ab523fd83f6a0b00f/src/anyio/_backends/_asyncio.py#L961-L968

(Shouldn't GitHub show the code in the comment here? It's not doing it for me.)

Specifically, if a subprocess needs to be cancelled (e.g., due to an enclosing timeout), anyio kills the process and then waits, uncancellably, for it to exit, and that's where our processes were hanging.

I believe that handling things any other way (without reimplementing anyio) would require changes to anyio itself.
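
Put differently, the cleanup at that link amounts to roughly the following (a paraphrase for illustration, not a verbatim copy of anyio's code):

from anyio import CancelScope

async def _cleanup_on_cancel(process) -> None:
    # Paraphrase of what anyio does when a run_process() call is cancelled:
    process.kill()                    # send SIGKILL to the child
    with CancelScope(shield=True):    # shielded, so the wait itself cannot be cancelled
        await process.wait()          # blocks indefinitely if the child never actually dies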

@yarikoptic
Member Author

> (Shouldn't GitHub show the code in the comment here? It's not doing it for me.)

It used to do that for me, I think, but it no longer does.

Could we subclass the Process class there with the desired logic and use it instead?

Could you then prepare a PR to anyio with the needed logic?

@jwodder
Member

jwodder commented Jul 10, 2024

@yarikoptic Subclassing Process isn't enough; we'd need to copy & modify anyio.run_process() to make it use the new subclass, and I would rather not do that.

I've filed a feature request with anyio: agronholm/anyio#757

@jwodder
Member

jwodder commented Jul 11, 2024

@yarikoptic The maintainer of anyio replied on the issue I linked above; it seems the requested functionality would be undesirable because it would leave zombie processes behind.

@yarikoptic
Member Author

I followed up there. Could we have another thread (or similar) that monitors whether any of the tests stall and does extra killing/erroring out?

@jwodder
Member

jwodder commented Jul 11, 2024

@yarikoptic Even if we could come up with a decent way to check for stalling, if we wanted the program to error out on a stall, the exception would still trigger the same process cleanup code I linked to above, and any cleanup currently stalled would just continue to be stalled. I believe the only way out would be for the healthstatus program to send SIGKILL to itself, which doesn't seem like a good idea.

@jwodder jwodder added the performance More efficient use of time and space label Jul 12, 2024
@jwodder
Member

jwodder commented Jul 22, 2024

@yarikoptic Ping; do you still want to do this somehow?

@jwodder
Member

jwodder commented Jul 26, 2024

@yarikoptic Ping.

@yarikoptic
Member Author

Well, the main problem at the moment is that we simply have no indication that a stall has happened. If we detect it, exit with an error, and thereby cause some kind of email notification to be sent, we would be fine even if a stuck process remains -- we would know there is an issue and could come mitigate it.

Overall it might also be a matter of establishing some "invocation/progress monitoring", e.g. via https://healthchecks.io/, where we curl the ping endpoint after each run completes and expect that to happen at least daily. WDYT?
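
A minimal sketch of such a ping (the URL below is a placeholder, not an actual check for this project):

from urllib.request import urlopen

HEALTHCHECK_URL = "https://hc-ping.com/<check-uuid>"  # placeholder; a real check gets its UUID from healthchecks.io

def ping_healthcheck() -> None:
    # A plain GET marks the check as "up"; if no ping arrives within the
    # configured period (e.g. one day), healthchecks.io sends an alert.
    with urlopen(HEALTHCHECK_URL, timeout=10) as resp:
        resp.read()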

@jwodder
Member

jwodder commented Jul 26, 2024

@yarikoptic I think just using healthchecks.io or similar for monitoring should be cleaner.

@yarikoptic
Member Author

Let's close this whenever we add some such monitoring (I thought to do it now but am failing to log in... will try later).
