Handle worker abort better #249
To add some ideas, I run a post-mortem analysis when this happens and include the findings in the error message, e.g.

> library(future)
> plan(multicore)
> f <- future({ tools::pskill(Sys.getpid()) })
> value(f)
Error: Failed to retrieve the result of MulticoreFuture (<none>) from the forked
worker (on localhost; PID 118742). Post-mortem diagnostic: No process exists
with this PID, i.e. the forked localhost worker is no longer alive
In addition: Warning message:
In mccollect(jobs = jobs, wait = TRUE) :
1 parallel job did not deliver a result

and

> library(future)
> plan(multisession)
> f <- future(tools::pskill(Sys.getpid()))
TRACKER: loadedNamespaces() changed: 1 package loaded ('crayon')
> value(f)
Error in unserialize(node$con) :
MultisessionFuture (<none>) failed to receive results from cluster
RichSOCKnode #1 (PID 119302 on localhost 'localhost'). The reason
reported was 'error reading from connection'. Post-mortem diagnostic: No
process exists with this PID, i.e. the localhost worker is no longer alive

In some cases, we can give more clues, for example when a non-exportable object may be in play:

> library(future)
> plan(multisession)
> library(XML)
> doc <- xmlParse(system.file("exampleData", "tagnames.xml", package = "XML"))
> a <- getNodeSet(doc, "/doc//a[@status]")[[1]]
> f <- future(xmlGetAttr(a, "status"))
> value(f)
Error in unserialize(node$con) :
MultisessionFuture (<none>) failed to receive results from cluster
RichSOCKnode #1 (PID 31541 on localhost 'localhost'). The reason
reported was 'error reading from connection'. Post-mortem diagnostic:
No process exists with this PID, i.e. the localhost worker is no
longer alive. Detected a non-exportable reference ('externalptr' of
class 'XMLInternalElementNode') in one of the globals ('a' of class
'XMLInternalElementNode') used in the future expression. The total
size of the 1 globals exported is 520 bytes. There is one global: 'a'
(520 bytes of class 'externalptr')

Exporting that non-exportable XML object causes XML to segfault the parallel worker, cf. https://future.futureverse.org/articles/future-4-non-exportable-objects.html#package-xml.

I have found that these types of error messages help users help themselves, and they also save me a lot of time when someone reaches out for help.
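
For illustration, the post-mortem idea could be sketched roughly as below. This is not the code used in future or BiocParallel: `worker_alive()`, `post_mortem()` and `collect_with_diagnostic()` are hypothetical names, and the sketch assumes a Unix-like system where `tools::pskill()` with signal 0 only probes for the process instead of terminating it.

```r
# Hypothetical helpers; a sketch, not part of future or BiocParallel.

# Check whether a process with this PID exists.
# Assumption: Unix-like OS, where signal 0 delivers nothing and only tests
# whether the process can be signalled. A process owned by another user is
# reported as not alive, because the probe is not permitted.
worker_alive <- function(pid) {
  stopifnot(.Platform$OS.type == "unix")
  isTRUE(tools::pskill(pid, signal = 0L))
}

# Turn the liveness check into a human-readable diagnostic.
post_mortem <- function(pid) {
  if (worker_alive(pid)) {
    sprintf("A process with PID %d still exists, so the worker may be alive but unresponsive", pid)
  } else {
    sprintf("No process exists with PID %d, i.e. the worker is no longer alive", pid)
  }
}

# Call collect() (any function that retrieves a worker's result) and, if it
# fails, rethrow the low-level error with the diagnostic appended.
collect_with_diagnostic <- function(collect, pid) {
  tryCatch(
    collect(),
    error = function(e) {
      stop(sprintf(
        "Failed to retrieve the result from the worker (PID %d). The reason reported was '%s'. Post-mortem diagnostic: %s",
        pid, conditionMessage(e), post_mortem(pid)
      ), call. = FALSE)
    }
  )
}
```

The point of wrapping the collection step is that the user sees the worker's fate next to the original low-level reason, rather than only an error such as 'error reading from connection'.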
When a worker aborts (e.g., out of memory?) the result is an error in BiocParallel code, rather than an error understandable by the user.