Clear out /dev/shm when things go wrong #378
@o-smirnov we should do this for the cubical and ddfacet cabs!
Great idea. But rather do this for DDF:
And this for CubiCal:
Cool, I'll add these commands after cubical and ddfacet runs.
Whoops! I'll steal your method instead.
Hehe, have you been mistakenly nuking people's DDF jobs, @IanHeywood?
Well, I book entire nodes for DDFacet, so if I have, it's what they get for sneaking around trying to jump the queue.
Sadly, filesystem permissions won't let you do that to others. But it's a great way to shoot yourself in the foot!
Yup... let's please have an atexit handler, at minimum to handle SIGINT gracefully. I'm seeing a lot of stuff left in shared memory. The gridder already has an atexit handler, so it must be something in the visibility machine...
My DDFacet job just failed on an IDIA node with /dev/shm issues. Nothing surprising whatsoever about that, but after the job had failed I noticed:
which turns out to be from a (probably failed) CubiCal run on May 11:
That's hogging a fair amount of real estate there. Some suggestions for good citizenship, in order of decreasing hassle for users (increasing hassle for devs):
1. Users remember to log in and clean up their mess after a crash.
2. Every time I invoke DDFacet in a script I've been running CleanSHM.py immediately afterwards. Pipeline people could consider implementing something similar. I guess this script could be modified to handle the CubiCal output in /dev/shm.
3. Develop some kind of Lazarus-the-Janitor feature for CubiCal, where it comes back to life just long enough to tidy up after it's been killed.
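A minimal analogue of the CleanSHM-style sweep suggested above might look like this sketch. The `cubical.` filename prefix is an assumption (match it to whatever prefix the tool actually uses), and restricting deletion to files you own mirrors what the filesystem permissions enforce anyway.

```python
# Sketch of a CleanSHM-style sweeper: delete only shared-memory files
# owned by the current user that match a given name prefix.
# The "cubical." prefix is an assumption, not CubiCal's confirmed scheme.
import os

def sweep_shm(shm_dir="/dev/shm", prefix="cubical."):
    """Remove stale segments under shm_dir that we own and that match prefix."""
    removed = []
    uid = os.getuid()
    for name in os.listdir(shm_dir):
        if not name.startswith(prefix):
            continue
        path = os.path.join(shm_dir, name)
        try:
            if os.stat(path).st_uid == uid:
                os.unlink(path)
                removed.append(path)
        except OSError:
            pass  # vanished mid-sweep, or not ours to delete
    return removed

if __name__ == "__main__":
    for path in sweep_shm():
        print("removed", path)
```

Run after a pipeline step (or from a crash-recovery hook), this only ever touches the invoking user's own segments, so it is safe to wire into a script unconditionally.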
I think this is important, especially when testing out pipelines on new systems. Build-up of this junk might make things fail when they otherwise wouldn't.
Cheers.