Recovering from OOMkill pod failures/evictions  #10

@1beb

One critical thing that I think makes this challenging to use at a larger scale is that R is a garbage-collected language.

There are a number of odd situations, especially when reading or writing files, where memory keeps growing: it ought to be garbage collected but never is. We were discussing this a little in the future repository. Henrik suggested using the callr plan, which works extremely well when you're working on a single computer but is incompatible with the setup command specified in the future-kubernetes helm chart.

I've been thinking about a number of alternative approaches:

  • Find a way to restart the R process on a pod when it finishes a task, before the next iteration.
  • Instead of setting up the cluster via the helm chart, use an SSH-based cluster, distributing tasks over SSH from within your primary parallel loop.
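
To make the first idea concrete, here is a rough sketch of what I mean by a fresh R process per iteration. It assumes the callr package is installed wherever the task runs, and `process_chunk` and `run_isolated` are made-up names standing in for the real work. I have not tried this inside the workers the helm chart starts, so treat it as an illustration of the idea rather than a fix.

```r
# Sketch only: run each task in a throwaway R process, so whatever memory
# the task accumulates is released to the OS when that process exits.
library(callr)

# Hypothetical task: allocates a large temporary object, returns a small result.
process_chunk <- function(i) {
  scratch <- numeric(1e6)  # stands in for memory that tends not to get freed
  i^2
}

# Wrapper that runs one task in a new R session via callr::r().
# The function must be self-contained, because it is serialized and sent
# to the child process.
run_isolated <- function(i) {
  callr::r(process_chunk, args = list(i = i))
}

results <- lapply(1:3, run_isolated)
```

The trade-off is the startup cost of a new R session for every task, so this only makes sense when each task is reasonably long-running.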

Do you have any thoughts on how one might approach this?
