BiocParallel does not keep logs during job failure #103
Comments
Hi @andr-kun it is possible to
Note: Since saving the entire registry can be expensive, please submit a smaller job to debug if you have cluster limitations. Otherwise, you can inspect the logs of your entire job with this option.
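For readers following along, here is a minimal sketch of what keeping the registry might look like. It assumes a BiocParallel version in which `BatchtoolsParam()` accepts the `registryargs` and `saveregistry` arguments; the directory name `bp_registry`, the worker count, and the toy workload are hypothetical and not taken from this thread. It only runs on a machine where LSF is available.

```r
library(BiocParallel)

# Assumption: "bp_registry" is a hypothetical directory on a file system
# shared between the submitting host and the cluster nodes.
registryargs <- batchtoolsRegistryargs(
    file.dir = file.path(getwd(), "bp_registry"),
    work.dir = getwd()
)

# Assumption: 4 workers and the default LSF template are enough for a
# small debugging run; adjust to your cluster.
param <- BatchtoolsParam(
    workers = 4,
    cluster = "lsf",
    registryargs = registryargs,
    saveregistry = TRUE  # keep the registry instead of discarding it
)

# Toy workload standing in for the real job.
res <- bplapply(1:4, sqrt, BPPARAM = param)
```

Keeping the run small, as the note suggests, limits how much registry data ends up on disk.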
Thanks @nturaga for the information regarding The problem with my situation is that some nodes in the cluster can just fail without any warning - so most of the jobs will actually work and suddenly a few jobs will start failing due to being assigned to failed nodes, which stops the entire From the testing I have done with * There is a possibility of getting the logs from the cluster management software in LSF, but this is only kept for a short period of time.
Hello,
I am currently running parallel jobs (with bplapply) on an LSF cluster using BatchtoolsParam, and I found that no logs are produced in runs where a few jobs fail.
Here is the error log from the failed run:
When I tried to check the logs from the batchtools jobs, I noticed that no logs were being produced at all, which made it difficult to figure out why the jobs failed. I eventually managed to capture the logs manually by copying the temporary registry directory before the bplapply job finished. From those logs I found that the failures were caused by a missing executable on a few of the cluster nodes, so the jobs exited before R was even executed.
It would be really useful to be able to get the logs from the batchtools jobs even if some of the jobs fail to execute R, especially on an LSF cluster, as the logs contain the job execution information.
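The manual workaround described above can be sketched in base R roughly as follows. This is an illustration only: `snapshot_dir()` is a hypothetical helper, not part of BiocParallel or batchtools, and the registry path has to be supplied by the caller.

```r
# Hypothetical helper: copy a directory tree (for example a temporary
# registry directory) into a time-stamped snapshot folder.
snapshot_dir <- function(src, dest_root) {
    stopifnot(dir.exists(src))
    dest <- file.path(dest_root,
                      format(Sys.time(), "snapshot-%Y%m%d-%H%M%S"))
    dir.create(dest, recursive = TRUE, showWarnings = FALSE)
    # With recursive = TRUE, file.copy() places `src` itself inside `dest`.
    ok <- file.copy(src, dest, recursive = TRUE)
    if (!all(ok)) stop("could not copy ", src)
    # Return the path of the copied directory.
    file.path(dest, basename(src))
}
```

Because bplapply blocks until it returns, a helper like this would have to be called from a separate R session (or an equivalent background process) while the run is still in progress.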