Custom parallel job queue #21

Open
andnp opened this issue Dec 1, 2023 · 0 comments


andnp commented Dec 1, 2023

GNU parallel is no longer cutting it. A few issues:

  • Need to pass custom signals on to children. parallel always eats these signals.
  • Need finer-grained control over ssh processes. Would love to avoid srun where possible, since interacting with the scheduler is extremely costly. Should be able to replace srun with custom ssh + environment build scripts.
  • Hard to keep things homogeneous across different compute backends. Likely losing access to Compute Canada soon, but don't want these scripts to go to waste! Need some homogeneous way to make use of a Beowulf cluster.
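
For the signal-forwarding point, a minimal sketch of what a custom runner could do, assuming a Python implementation on a POSIX system. The names `forward_signals` and `relay` are hypothetical and not part of any existing code in this repo; the idea is only that the parent installs a handler for each signal of interest and re-sends it to every child that is still running.

```python
import signal
import subprocess
import sys


def forward_signals(children, signums):
    """Install handlers that relay each signal in `signums` to every live child.

    `children` is a list of subprocess.Popen objects. Must be called from the
    main thread, since Python only allows signal handlers to be set there.
    """
    def relay(signum, _frame):
        for child in children:
            if child.poll() is None:  # only signal children that are still running
                child.send_signal(signum)

    for signum in signums:
        signal.signal(signum, relay)


# Hypothetical usage (commented out so importing this file starts nothing):
# children = [subprocess.Popen([sys.executable, "job.py", str(i)]) for i in range(4)]
# forward_signals(children, [signal.SIGUSR1, signal.SIGTERM])
# exit_codes = [child.wait() for child in children]
```

The same relay could in principle wrap an `ssh` child instead of a local process, though whether the signal reaches the remote job then depends on how the ssh session is set up, which this sketch does not address.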

Notes:

  • Need a way to get num_cpus and hostnames from Slurm when allocated across nodes.
  • Need a way to use MIG when available.
  • Need a way to batch jobs within a process (e.g. so one sub-process can handle a batch).
  • Nice-to-have: mark a parameter as "batchable". For instance, we can jax.vmap over stepsizes, but not neural net sizes. Would be nice to handle that internally, so that we can take advantage of vmap.
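
A rough sketch of two of the notes above, in Python. Everything here is an assumption rather than a settled design: the helper names (`expand_cpus_per_node`, `allocated_nodes`, `group_batchable`) are made up for illustration, `allocated_nodes` assumes `scontrol` is on the PATH and that `SLURM_JOB_CPUS_PER_NODE` lists counts in the same order as the hostnames from `SLURM_JOB_NODELIST`, and `group_batchable` assumes every parameter value is hashable.

```python
import os
import re
import subprocess
from collections import defaultdict


def expand_cpus_per_node(spec):
    """Expand Slurm's compressed CPU list, e.g. "16(x2),8" -> [16, 16, 8]."""
    cpus = []
    for part in spec.split(","):
        match = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", part.strip())
        if match is None:
            raise ValueError(f"unrecognized entry: {part!r}")
        cpus.extend([int(match.group(1))] * int(match.group(2) or 1))
    return cpus


def allocated_nodes():
    """Return {hostname: num_cpus} for the current allocation.

    Assumes it runs inside a Slurm job with scontrol available.
    """
    hosts = subprocess.run(
        ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    cpus = expand_cpus_per_node(os.environ["SLURM_JOB_CPUS_PER_NODE"])
    return dict(zip(hosts, cpus))


def group_batchable(configs, batchable):
    """Group configs that differ only in the keys named in `batchable`.

    Each group can go to one sub-process, which is then free to vmap over
    the batchable keys (e.g. stepsize) while the rest stay fixed.
    """
    groups = defaultdict(list)
    for cfg in configs:
        # The key is every non-batchable (name, value) pair, order-independent.
        key = tuple(sorted((k, v) for k, v in cfg.items() if k not in batchable))
        groups[key].append(cfg)
    return list(groups.values())
```

With `batchable={"stepsize"}`, configs that share a network size land in the same group regardless of stepsize, while a different network size starts a new group.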