-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EMCEE Batch Submit CLI Tool #394
base: dev
Are you sure you want to change the base?
Conversation
Functionality to render jinja2 templates to make creating batch files easier. Also opens up the door to future use of templates like creating config files. Associated docs+tests included.
Added `gempyor.logging` module with `_get_logging_level`, `get_script_logger`, and `ClickHandler` along with corresponding documentation and unit tests.
Added private helper `gempyor.utils._shutil_which` which has a `check` argument that behaves very similar in spirit to the same argument of `subprocess.run`. Corresponding docs/tests included.
Pytest hooks into the root logger, which breaks the use case here where we only want a single handler to emit a log record. Create a special carve out for pytest and put it in `get_script_logger` to make testing logging results of any function easy.
Internal utility to submit a batch file to slurm via `sbatch` along with detailed logging and corresponding unit tests and docs.
Added an internal utility function to get the sha commit corresponding to the head of a given git repository. For use in creating manifest files.
Added `gempyor.manifest.write_manifest` for writing batch manifest JSON files containing metadata about the run. Also has extended functionality to store more metadata then the current required.
The `manifest` module will be too small to have as its own, moved into `batch` and will start consolidating other batch functionality into as well.
Moved the `_slurm` module into the new `batch` module to similar reasons as `manifest`. There is not enough slurm functionality to justify an individual module and slurm interactions will only occur in the context of working with batch jobs.
Added a dataclass to represent batch job sizes, including light validation.
Added a class method to `gempyor.batch.JobSize` class that is slightly more convenient to use from a CLI script. Provides a good home for future code realted to GH-360. Corresponding tests and docs.
In anticipation of GH-358, minimize future work.
Added internal utility to convert a dict of options into a list appropriate for `subprocess.run` and similar.
Added `gempyor.batch._sbatch_template` that is a thin wrapper around `_sbatch` that takes advantage of templating. There are docs, but no unit tests. Still determining how this interface should be best setup.
Draft solution to GH-342 with the ability to extend to other types of generic config/info in the future. Add `gempyor.info` module with exported function `get_cluster_info` which pulls and parses an information yaml into a pydantic model.
Very draft, needs to be better connected to a CLI to fully implement.
Added specification for path modifications to the cluster info (namely to add AWS cli on rockfish to path). Specification includes the path to add, prepend vs append, and a flag to error if path is missing.
Not fleshed out at all, but at least contains the main sections that will need to be filled in. Now includes dynamic modules and $PATH modifications that mirror the objects returned from `get_cluster_info`.
Just rebased `dev` into `GH-365/emcee-submit` which now contains black formatting for python.
Added a helper to log the inputs given to a CLI tool when the verbosity is set to debug. Mostly for user/dev debugging. Also added corresponding docs and unit tests.
Internal utility to generate a human readable unique batch job name along with corresponding docs and tests.
Added internal `gempyor.batch._resolve_batch_system` utility to make sense of batch system CLI options.
Very preliminary work on the `flepimop batch` CLI tool including some light testing for the not implemented errors thrown.
* Changed to generic import of `click` since the `gempyor.batch` module relies on it heavily. * Add manifest json to `flepimop batch`. * Add subpops to the job size calculation of `flepimop batch`. * Leftover TODO note about resume/continuation locations.
Added a representation of batch job time limits, similar to `JobSize`, along with corresponding documentation and unit tests.
Added the `JobTimeLimit.from_per_simulation_time` class method to easily create instances from user provided inputs. Also added comparison methods for ease of unit testing.
Started work constructing a call to `_sbatch_template` in the `flepimop batch` CLI. Incorporated `JobTimeLimit` class and added additional comments into the `_click_batch` function.
Added the `DurationParamType` for specifying custom durations in a user friendly way for CLI interfaces along with corresponding unit tests. Also incorporated the new custom param type into the `flepimop batch` CLI.
* Changed default from `timedelta` to `str`. Looks like the `default` value is passed to the `convert` method of `type` if instance of `click.ParamType`. * Began work incorporating cluster info into batch submission with slurm.
* Final config file is now parsed out to a file, either user specified or temp. * Further work on the batch script, mostly in translating variables given to `_click_batch` to the template.
Bug in `MemoryParamType.convert` was failing if a unit was not given with a value. Similar to `DurationParamType` it assumes that numbers without a unit are given in the default unit of the type.
Moved from the slurm specific code path in `_submit_scenario_job`, should allow for using the `--local` flag on the cluster.
d36aca0
to
4f936db
Compare
Made the `ValueError` from `_infer_cluster_from_fqdn` optional, but default to `True` for current behavior.
* Export `$FLEPI_PATH` in the GH action before running tests to run the unit tests for `get_cluster_info`. * Attempt to fix Jinja2 lader issue by falling back to a file system loader if a package loader fails.
02d550e
to
b76bb3b
Compare
To deal with a `DeprecationWarning` that python keeps producing.
445624f
to
6f51a3d
Compare
f74cdd5
to
0149bc7
Compare
While this PR is ready for review I think it's unlikely there will be sufficient time to get this into the upcoming release. If folks have the bandwidth to review that's great, but definitely prioritize the other PRs noted in slack first. |
Thanks Tim - I intended to test this functionality properly when I'm done with RSV too and that might take a while to cover all bases. Will prioritise the others first. |
I'm converting this back to a draft as I extract some of the functionality here into smaller PRs that are easier to review. The first PR extracted from this one is GH-446. |
Describe your changes.
This pull request adds a
flepimop submit
subcommand for submitting batch jobs. Still a work in progress, so it only works for EMCEE inference with slurm. A quick example of how using this would look is:Although I will add more documentation/usage guides after some initial refinement.
Running todo list of what's remaining:
gempyor.utils._git_checkout
gempyor.batch._click_submit
, including thinking about initial attempts at integration like testing.gempyor.batch._local/_local_template/_sbatch/_sbatch_template
and add further unit testing (especially for the_local/sbatch_template
functions`).gempyor.batch._submit_scenario_job
.inference_*.j2
templates. Only has to concern a few details since the individual components are well tested._infer_cluster_from_fqdn
togempyor.info
to "guess" the cluster fromsocket.getfqdn()
if not given an explicit one.gempyor.shared_cli.DurationParamType/MemoryParamType
.Does this pull request make any user interface changes? If so please describe.
The user interface changes are the addition of a
flepimop submit
subcommand. This is new and causes no other interfaces changes.These changes have not been documented yet, this is a draft PR for some initial feedback but will add appropriate documentation after some refinement.
What does your pull request address? Tag relevant issues.
This pull request addresses:
gempyor
#310