Clear out /dev/shm when things go wrong #378
@o-smirnov we should do this for the cubical and ddfacet cabs!
Great idea. But rather do this for DDF:
And this for CubiCal:
Cool, I'll add these commands after cubical and ddfacet runs.
Whoops! I'll steal your method instead.
Hehe, have you been mistakenly nuking people's DDF jobs, @IanHeywood?
Well, I book entire nodes for DDFacet, so if I have, it's what they get for sneaking around trying to jump the queue.
Sadly, filesystem permissions won't let you do that to others. But it's a great way to shoot yourself in the foot!
Yup... let's please have an atexit handler, at minimum to handle SIGINT gracefully. I'm seeing a lot of stuff left in shared memory. The gridder already has an atexit handler, so it must be something in the visibility machine...
My DDFacet job just failed on an IDIA node with /dev/shm issues. Nothing surprising whatsoever about that, but after the job had failed I noticed:
which turns out to be from a (probably failed) CubiCal run on May 11:
That's hogging a fair amount of real estate there. Some suggestions for good citizenship, in order of decreasing hassle for users (increasing hassle for devs):
1. Users remember to log in and clean up their mess after a crash.
2. Every time I invoke DDFacet in a script I've been running CleanSHM.py immediately afterwards. Pipeline people could consider implementing something similar. I guess this script could be modified to handle the CubiCal output in /dev/shm.
3. Develop some kind of Lazarus-the-Janitor feature for CubiCal, where it comes back to life just long enough to tidy up after it's been killed.
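A minimal analogue of the CleanSHM-style sweep suggested above might look like this sketch. The `cubical.` filename prefix is an assumption (match it to whatever prefix the tool actually uses), and restricting deletion to files you own mirrors what the filesystem permissions enforce anyway.

```python
# Sketch of a CleanSHM-style sweeper: delete only shared-memory files
# owned by the current user that match a given name prefix.
# The "cubical." prefix is an assumption, not CubiCal's confirmed scheme.
import os

def sweep_shm(shm_dir="/dev/shm", prefix="cubical."):
    """Remove stale segments under shm_dir that we own and that match prefix."""
    removed = []
    uid = os.getuid()
    for name in os.listdir(shm_dir):
        if not name.startswith(prefix):
            continue
        path = os.path.join(shm_dir, name)
        try:
            if os.stat(path).st_uid == uid:
                os.unlink(path)
                removed.append(path)
        except OSError:
            pass  # vanished mid-sweep, or not ours to delete
    return removed

if __name__ == "__main__":
    for path in sweep_shm():
        print("removed", path)
```

Run after a pipeline step (or from a crash-recovery hook), this only ever touches the invoking user's own segments, so it is safe to wire into a script unconditionally.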
I think this is important, especially when testing out pipelines on new systems. Build-up of this junk might make things fail when they otherwise wouldn't.
Cheers.