bomb out with lots of complaints if I/O worker dies #439
Comments
@o-smirnov Any fix/workaround for this yet? It's gotten me twice this weekend. I tried reducing --dist-ncpu and --dist-min-chunks from 7 to 4, to no avail:

INFO 19:42:07 - main [4.0/85.0 18.2/131.8 247.6Gb] Exiting with exception: BrokenProcessPool(A process in the process pool was terminated abruptly while the future was running or pending.)
Traceback (most recent call last):
  File "/home/CubiCal/cubical/main.py", line 582, in main
    stats_dict = workers.run_process_loop(ms, tile_list, load_model, single_chunk, solver_type, solver_opts, debug_opts, out_opts)
  File "/home/CubiCal/cubical/workers.py", line 226, in run_process_loop
    return _run_multi_process_loop(ms, load_model, solver_type, solver_opts, debug_opts, out_opts)
  File "/home/CubiCal/cubical/workers.py", line 312, in _run_multi_process_loop
    stats = future.result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
Decrease the chunk size and set --dist-max-chunks instead of --dist-min-chunks. It will run in serial but it will reduce the footprint.
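For illustration, that advice translates into an invocation along these lines. This is a sketch only: the option names are the ones used in this thread, but the gocubical entry point, the parset name, and the specific values are assumptions and would need tuning to the data and the node.

# Hypothetical example: shrink the time/freq chunks and cap the number of
# chunks in flight with --dist-max-chunks rather than forcing a minimum
# with --dist-min-chunks (values are placeholders).
gocubical my.parset \
    --data-time-chunk=36 \
    --data-freq-chunk=128 \
    --dist-ncpu=4 \
    --dist-max-chunks=1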
I'm running into this BrokenProcessPool error with an oom-kill notice at the end of the log file - I take that to mean the system thinks I'll run out of RAM at some point, so it kills the job. What I can't understand is that earlier in the log, when it calculates all the memory requirements, it says my maximum memory requirement will be ~57 GB - the system I'm running on has at most 62 GB available, so I don't know why things are being killed. I'm using --data-freq-chunk=256 (reduced down from 1024), --data-time-chunk=36, --dist-max-chunks=2, and ncpus=20 (the max available on the node). What other memory-related knobs can I twiddle to try to solve this? It's only 2 hours of data, but I'm running into the same issue with even smaller MSs as well.
The memory estimation is just that - a guess based on some empirical experiments I did - so take it with a pinch of salt. If it is an option, I would really suggest taking a look at QuartiCal: it is much less memory-hungry, and has fewer knobs to boot. I am only too happy to help you out on that front. That said, could you please post your log and config? That will help identify what is going wrong.
@JSKenyon I'm running it as part of oxkat - I guess we can have a chat about incorporating QuartiCal on an ad hoc basis. I'll take a look at it. But for now, here's the log and the parset, and the command that was run was
OK, in this instance I suspect it is just the fact that the memory footprint is underestimated. I think the easiest solution is to set
Ok thanks, I'll give that a go.
The oxkat defaults are tuned so that they work on standard worker nodes at IDIA and CHPC for standard MeerKAT continuum processing (assuming 1024-channel data). The settings should actually leave a fair bit of overhead to account for things like differing numbers of antennas, and the slurm / PBS controllers being quite trigger-happy when jobs step out of line in terms of memory usage. But if you have a node with 64 GB of RAM then the defaults will certainly be too ambitious. Is this running on hippo? Also, I'm not sure whether moving from a single solution for the entire band (

Cheers.

PS: @JSKenyon swapping to QuartiCal remains on my to-do list!
If the I/O worker dies, this is a little hard for the end user to diagnose, as the solver workers carry on and fill up the log with messages. The error message is then buried somewhere mid-log and the whole process hangs waiting on I/O, instead of exiting with an error.
Surely a subprocess error is catchable at the main process level. #319 is related.
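On that point, a minimal sketch of the kind of handling being asked for - not CubiCal's actual code; run_chunk and the chunk list are hypothetical placeholders - showing that a dead worker surfaces as BrokenProcessPool on future.result() in the parent process, where it can be reported once and turned into a clean exit instead of a hang:

# Minimal sketch, not CubiCal's actual code: run_chunk and the chunk list
# are placeholders. A worker that is killed (e.g. by the OOM killer)
# surfaces as BrokenProcessPool on future.result() in the parent, which
# can then log one clear error and exit rather than hang on I/O.
import sys
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def run_chunk(chunk):
    # stand-in for per-chunk solver / I/O work
    return chunk

def main(chunks):
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(run_chunk, c) for c in chunks]
        try:
            return [f.result() for f in futures]
        except BrokenProcessPool as exc:
            print(f"A worker process died unexpectedly: {exc}", file=sys.stderr)
            sys.exit(1)

if __name__ == "__main__":
    print(main(range(8)))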