
Handle worker abort better #249

Open
mtmorgan opened this issue Apr 12, 2023 · 1 comment

Comments

@mtmorgan (Collaborator)

When a worker aborts (e.g., out of memory?), the result is an error from BiocParallel internals rather than an error understandable by the user.

> bplapply(1:2, \(...) q())
Error in reducer$value.cache[[as.character(idx)]] <- values :
  wrong args for environment subassignment
In addition: Warning message:
In parallel::mccollect(wait = FALSE, timeout = 1) :
  1 parallel job did not deliver a result
@HenrikBengtsson (Contributor)

To add some ideas: I run a post-mortem analysis when this happens and include the findings in the error message, e.g.

> library(future)
> plan(multicore)
> f <- future({ tools::pskill(Sys.getpid()) })
> value(f)
Error: Failed to retrieve the result of MulticoreFuture (<none>) from the forked
worker (on localhost; PID 118742). Post-mortem diagnostic: No process exists
with this PID, i.e. the forked localhost worker is no longer alive
In addition: Warning message:
In mccollect(jobs = jobs, wait = TRUE) :
  1 parallel job did not deliver a result

and

> library(future)
> plan(multisession)
> f <- future(tools::pskill(Sys.getpid()))
TRACKER:  loadedNamespaces() changed:  1 package loaded ('crayon')
> value(f)
Error in unserialize(node$con) : 
  MultisessionFuture (<none>) failed to receive results from cluster
RichSOCKnode #1 (PID 119302 on localhost 'localhost'). The reason
reported was 'error reading from connection'. Post-mortem diagnostic: No
process exists with this PID, i.e. the localhost worker is no longer alive
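The core of such a post-mortem diagnostic is checking whether the worker process still exists. A minimal sketch of that liveness check (an assumption for illustration, not future's actual implementation; in practice the parallelly package provides `isNodeAlive()` for this):

```r
# Sketch of a post-mortem liveness check for a parallel worker.
# Assumption: POSIX semantics, where sending signal 0 to a PID tests
# for process existence without affecting the process.
worker_is_alive <- function(pid) {
  isTRUE(tools::pskill(pid, signal = 0L))
}

# Hypothetical helper that turns the check into a user-facing message,
# in the spirit of the diagnostics quoted above.
diagnose_worker <- function(pid) {
  if (worker_is_alive(pid)) {
    sprintf("Worker (PID %d) is still alive; the failure lies elsewhere", pid)
  } else {
    sprintf("No process exists with PID %d, i.e. the worker is no longer alive", pid)
  }
}
```

Appending the result of such a check to the retrieval error is what distinguishes "the worker crashed" from a generic "error reading from connection".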

In some cases, we can give more clues. For example, when a non-exportable object may be in play, e.g.

> library(future)
> plan(multisession)
> library(XML)
> doc <- xmlParse(system.file("exampleData", "tagnames.xml", package = "XML"))
> a <- getNodeSet(doc, "/doc//a[@status]")[[1]]
> f <- future(xmlGetAttr(a, "status"))
> value(f)
Error in unserialize(node$con) :
  MultisessionFuture (<none>) failed to receive results from cluster
RichSOCKnode #1 (PID 31541 on localhost 'localhost'). The reason
reported was 'error reading from connection'. Post-mortem diagnostic:
No process exists with this PID, i.e. the localhost worker is no
longer alive. Detected a non-exportable reference ('externalptr' of
class 'XMLInternalElementNode') in one of the globals ('a' of class
'XMLInternalElementNode') used in the future expression. The total
size of the 1 globals exported is 520 bytes. There is one global: 'a'
(520 bytes of class 'externalptr')

That exported non-exportable XML object causes XML to segfault the parallel worker, cf. https://future.futureverse.org/articles/future-4-non-exportable-objects.html#package-xml.
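A rough sketch of how one could scan a value for `externalptr` references before exporting it to a worker (a hypothetical helper for illustration; the actual detection in future/globals is more thorough, covering connections, weak references, etc.):

```r
# Sketch: recursively scan an object for 'externalptr' references,
# which typically cannot be serialized to another R process.
# Hypothetical helper, not the actual future/globals implementation;
# does not guard against cyclic environments.
contains_externalptr <- function(x) {
  if (typeof(x) == "externalptr") return(TRUE)
  if (is.environment(x)) x <- as.list(x, all.names = TRUE)
  if (is.list(x) && length(x) > 0L &&
      any(vapply(x, contains_externalptr, logical(1)))) return(TRUE)
  # Attributes (e.g. on S4 or classed objects) may also hold pointers.
  attrs <- attributes(x)
  if (!is.null(attrs) &&
      any(vapply(attrs, contains_externalptr, logical(1)))) return(TRUE)
  FALSE
}
```

Running a check like this over the globals of a failed future is what makes it possible to report "Detected a non-exportable reference ('externalptr' ...)" instead of leaving the user with a bare serialization error.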

I have found that these types of error messages help users help themselves, and they also save me a lot of time when someone reaches out for help.

@mtmorgan mtmorgan assigned mtmorgan and unassigned mtmorgan Aug 16, 2023