
BiocParallel does not keep logs during job failure #103

Open
andr-kun opened this issue Sep 4, 2019 · 2 comments

andr-kun commented Sep 4, 2019

Hello,

I am currently running parallel jobs (with bplapply) on an LSF cluster using BatchtoolsParam, and I have found an issue where no logs are produced in runs where a few jobs fail.

Here is the error log from the failed run:

Quitting from lines 86-88 (demultiplex.Rmd) 
Error in .reduceResultsList(ids, fun, ..., missing.val = missing.val,  : 
  All jobs must be have been successfully computed
Calls: <Anonymous> ... bplapply -> bplapply -> <Anonymous> -> .reduceResultsList
Execution halted

When I tried to check the logs from the batchtools jobs, I noticed that no logs had been produced at all, which made figuring out the reason for the job failures difficult. I eventually managed to capture the logs manually by copying the temporary registry directory before the bplapply call finished, where I found that the job failures were caused by a missing executable on a few of the cluster nodes, resulting in those jobs exiting before R was even executed.

It would be really useful to be able to get the logs from the batchtools jobs even if some of the jobs failed to execute R, especially on LSF clusters, as the logs contain the job execution information.
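For reference, a minimal sketch of the kind of setup described above (the template path and the toy workload are placeholders, not my actual code):

```r
## Minimal sketch of running bplapply on an LSF cluster via BatchtoolsParam.
## "lsf.tmpl" is an assumed batchtools template file for the LSF scheduler.
library(BiocParallel)

param <- BatchtoolsParam(
    workers  = 4,
    cluster  = "lsf",
    template = "lsf.tmpl"   # assumed LSF template path
)

## If any worker node fails before R starts, this call errors out and the
## temporary registry (with its logs) is removed.
res <- bplapply(1:4, sqrt, BPPARAM = param)
```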


nturaga commented Sep 6, 2019

Hi @andr-kun

It is possible to set saveregistry=TRUE to avoid deleting your logs.

> BiocParallel::BatchtoolsParam
function (workers = batchtoolsWorkers(cluster), cluster = batchtoolsCluster(),
    registryargs = batchtoolsRegistryargs(), saveregistry = FALSE,
    resources = list(), template = batchtoolsTemplate(cluster),
    stop.on.error = TRUE, progressbar = FALSE, RNGseed = NA_integer_,
    timeout = 30L * 24L * 60L * 60L, exportglobals = TRUE, log = FALSE,
    logdir = NA_character_, resultdir = NA_character_, jobname = "BPJOB")


saveregistry: 'logical(1)'
     Option given to store the entire registry for the job(s). This
     functionality should only be used when debugging. The storage of
     the entire registry can be time and space expensive on disk. The
     registry will be saved in the directory specified by 'file.dir' in
     'registryargs'; the default location is the current working
     directory. The saved registry directories will have suffix "-1",
     "-2" and so on, for each time the 'BatchtoolsParam' is used.

Note: Since saving the entire registry can be expensive, please submit a smaller job to debug if you have cluster limitations. Otherwise, you can inspect the logs of your entire job with this option.
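A hedged sketch of the pattern described above (template path and file.dir are assumptions for illustration):

```r
## Keep the batchtools registry on disk for post-mortem inspection.
## With saveregistry = TRUE, the registry is saved under the directory
## given by file.dir in registryargs, with suffixes "-1", "-2", ...
## for each use of the param.
library(BiocParallel)

param <- BatchtoolsParam(
    workers      = 4,
    cluster      = "lsf",
    template     = "lsf.tmpl",    # assumed LSF template path
    saveregistry = TRUE,          # keep the registry after the run
    registryargs = batchtoolsRegistryargs(file.dir = "registry")
)

res <- bplapply(1:4, sqrt, BPPARAM = param)
## The saved "registry-1", "registry-2", ... directories can then be
## inspected, e.g. with batchtools::loadRegistry() / batchtools::getLog().
```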

@nturaga nturaga self-assigned this Sep 6, 2019

andr-kun commented Sep 7, 2019

Thanks @nturaga for the information regarding saveregistry=TRUE. This would definitely be useful for debugging smaller jobs, as you mentioned.

The problem in my situation is that some nodes in the cluster can fail without any warning: most of the jobs will work, and then a few jobs will suddenly start failing because they were assigned to failed nodes, which stops the entire bplapply run. Given that the run can take hours to finish, I am now looking into bptry and BPREDO to try to recover from the failed jobs.

From the testing I have done with bptry and BPREDO, I noticed that there are still no logs produced by BiocParallel in cases where a job fails to even start (with the same error of All jobs must be have been successfully computed returned by bptry(bplapply(...))). It would be really helpful if BiocParallel could recover the logs for these cases, as they can be used for reporting issues to the cluster administrator and for blacklisting the nodes in future job runs. This is especially needed on LSF clusters, since LSF logs are only available from the log files themselves*, rather than from the cluster management software as in SLURM.

* There is a possibility of getting the logs from the cluster management software in LSF, but this is only kept for a short period of time.
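The bptry/BPREDO recovery pattern I have been testing looks roughly like this (inputs and process_fun are hypothetical placeholders for my real workload):

```r
## Sketch of recovering from partial job failures with bptry() and BPREDO.
library(BiocParallel)

param <- BatchtoolsParam(
    workers       = 4,
    cluster       = "lsf",
    template      = "lsf.tmpl",   # assumed LSF template path
    stop.on.error = FALSE         # let the run continue past failed jobs
)

## inputs / process_fun stand in for the actual workload.
inputs      <- 1:8
process_fun <- sqrt

## bptry() returns partial results instead of stopping on error.
res <- bptry(bplapply(inputs, process_fun, BPPARAM = param))

## Re-run only the failed elements, reusing the successful results.
if (!all(bpok(res))) {
    res <- bplapply(inputs, process_fun, BPREDO = res, BPPARAM = param)
}
```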
