EMCEE Batch Submit CLI Tool #394

Draft
wants to merge 78 commits into
base: dev

TimothyWillard
Contributor

@TimothyWillard TimothyWillard commented Nov 8, 2024

Describe your changes.

This pull request adds a flepimop submit subcommand for submitting batch jobs. It is still a work in progress and currently only supports EMCEE inference with slurm. A quick example of its usage:

(flepimop-env) [twillard@longleaf-login6 RSV_USA]$ flepimop submit --jobs 4 --slurm --cluster longleaf --debug -vvv --simulations 2 --blocks 1 --initial-time 2h --memory 12288 --email [email protected]
2024-11-08 16:11:24,408:DEBUG:gempyor.shared_cli> CLI was given 36 arguments:
2024-11-08 16:11:24,408:DEBUG:gempyor.shared_cli> jobs                        = 4.
2024-11-08 16:11:24,408:DEBUG:gempyor.shared_cli> slurm                       = True.
2024-11-08 16:11:24,408:DEBUG:gempyor.shared_cli> cluster                     = longleaf.
2024-11-08 16:11:24,408:DEBUG:gempyor.shared_cli> debug                       = True.
2024-11-08 16:11:24,408:DEBUG:gempyor.shared_cli> verbosity                   = 3.
2024-11-08 16:11:24,408:DEBUG:gempyor.shared_cli> simulations                 = 2.
2024-11-08 16:11:24,408:DEBUG:gempyor.shared_cli> blocks                      = 1.
2024-11-08 16:11:24,408:DEBUG:gempyor.shared_cli> initial_time                = 2:00:00.
2024-11-08 16:11:24,408:DEBUG:gempyor.shared_cli> memory                      = 12288.
2024-11-08 16:11:24,408:DEBUG:gempyor.shared_cli> email                       = [email protected].
2024-11-08 16:11:24,408:DEBUG:gempyor.shared_cli> config_files                = ().
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> config_filepath             = (PosixPath('/work/users/t/w/twillard/RSV_USA/config_rsvnet_2024_test.yml'),).
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> seir_modifiers_scenarios    = ().
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> outcome_modifiers_scenarios = ().
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> nslots                      = None.
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> in_run_id                   = None.
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> out_run_id                  = None.
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> in_prefix                   = None.
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> first_sim_index             = 1.
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> stoch_traj_flag             = False.
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> write_csv                   = False.
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> write_parquet               = True.
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> batch_system                = None.
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> aws                         = False.
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> local                       = False.
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> flepi_path                  = /work/users/t/w/twillard/flepiMoP.
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> project_path                = /work/users/t/w/twillard/RSV_USA.
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> partition                   = None.
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> simulation_time             = 0:03:00.
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> conda_env                   = flepimop-env.
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> config_out                  = None.
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> run_id                      = None.
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> prefix                      = None.
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> nodes                       = None.
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> cpus                        = None.
2024-11-08 16:11:24,409:DEBUG:gempyor.shared_cli> dry_run                     = False.
2024-11-08 16:11:24,640:INFO:gempyor.batch> Assigning job name of 'rsv_2024_maternal-20241108T211124'.
2024-11-08 16:11:24,640:INFO:gempyor.batch> Using a run id of '20241108_211124UTC'.
2024-11-08 16:11:24,640:INFO:gempyor.batch> Constructing a job to submit to BatchSystem.SLURM.
2024-11-08 16:11:24,641:INFO:gempyor.batch> Preparing a job with size JobSize(jobs=4, simulations=2, blocks=1).
2024-11-08 16:11:24,641:INFO:gempyor.batch> Setting a total job time limit of 126 minutes.
2024-11-08 16:11:24,643:INFO:gempyor.batch> Utilizing info for the 'longleaf' cluster to construct this job.
2024-11-08 16:11:24,643:DEBUG:gempyor.batch> The full settings for cluster 'longleaf' are {'name': 'longleaf', 'modules': [{'name': 'gcc', 'version': '9.1.0'}, {'name': 'anaconda', 'version': '2023.03'}, {'name': 'git', 'version': None}, {'name': 'aws', 'version': None}], 'path_exports': []}.
2024-11-08 16:11:24,681:INFO:gempyor.batch> Writing manifest metadata to '/work/users/t/w/twillard/RSV_USA/manifest.json'.
2024-11-08 16:11:24,766:INFO:gempyor.batch> Dumped the final config for this batch submission to /work/users/t/w/twillard/RSV_USA/config_rsv_2024_maternal-20241108T211124.yml.
2024-11-08 16:11:24,776:DEBUG:gempyor.batch> Using batch script '/tmp/tmpbd6fitcj.sbatch' to submit to slurm.
2024-11-08 16:11:24,776:INFO:gempyor.batch> Submitting to slurm with: /usr/bin/sbatch --export=ALL /tmp/tmpbd6fitcj.sbatch.
2024-11-08 16:11:24,829:DEBUG:gempyor.batch> Captured stdout from sbatch submission: Submitted batch job 53653650

I will add more documentation and usage guides after some initial refinement.

Running todo list of what's remaining:

  • Add AWS support. This will likely get punted to a new issue/PR, I don't think there's a strong need for that at the moment.
  • Incorporate restart/resume/continuation locations. Likely will get moved to a separate issue if the goal is to incorporate into Release cycle, December 2024 #375.
  • Add unit tests for gempyor.utils._git_checkout
  • Expand testing of gempyor.batch._click_submit, including thinking about initial attempts at integration-like testing.
  • Consolidate the gempyor.batch._local/_local_template/_sbatch/_sbatch_template and add further unit testing (especially for the _local/sbatch_template functions).
  • Add documentation and tests for gempyor.batch._submit_scenario_job.
  • Add testing for the inference_*.j2 templates. Only has to concern a few details since the individual components are well tested.
  • Smooth out the operations UX (requires some solid feedback first).
  • Add support for slack notifications. Already have email notifications for slurm.
  • Add _infer_cluster_from_fqdn to gempyor.info to "guess" the cluster from socket.getfqdn() if not given an explicit one.
  • Add documentation for gempyor.shared_cli.DurationParamType/MemoryParamType.

Does this pull request make any user interface changes? If so please describe.

The only user interface change is the addition of a flepimop submit subcommand. It is new and causes no other interface changes.

These changes have not been documented yet. This is a draft PR for initial feedback, and appropriate documentation will be added after some refinement.

What does your pull request address? Tag relevant issues.

This pull request addresses:

Functionality to render jinja2 templates to make creating batch files
easier. Also opens up the door to future use of templates like creating
config files. Associated docs+tests included.
Added `gempyor.logging` module with `_get_logging_level`,
`get_script_logger`, and `ClickHandler` along with corresponding
documentation and unit tests.
Added private helper `gempyor.utils._shutil_which` which has a `check`
argument that behaves very similar in spirit to the same argument of
`subprocess.run`. Corresponding docs/tests included.
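
The `check` behaviour described for `_shutil_which` might look roughly like the sketch below. This is only an illustration built on the standard library's `shutil.which`; the function name `which_checked` and the choice of `OSError` are assumptions, not the helper's actual signature.

```python
import shutil
from typing import Optional


def which_checked(cmd: str, check: bool = True) -> Optional[str]:
    """Locate `cmd` on the PATH (hypothetical stand-in for the private helper).

    When `check` is true a missing command raises, mirroring the spirit of
    `subprocess.run(..., check=True)`. The exception type is an assumption.
    """
    path = shutil.which(cmd)
    if path is None and check:
        raise OSError(f"Did not find '{cmd}' on the path.")
    return path
```

With `check=False` the caller gets `None` back and can decide how to handle a missing executable, which is convenient for optional tools.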
Pytest hooks into the root logger, which breaks the use case here where
we only want a single handler to emit a log record. Create a special
carve out for pytest and put it in `get_script_logger` to make testing
logging results of any function easy.
Internal utility to submit a batch file to slurm via `sbatch` along with
detailed logging and corresponding unit tests and docs.
Added an internal utility function to get the sha commit corresponding
to the head of a given git repository. For use in creating manifest
files.
Added `gempyor.manifest.write_manifest` for writing batch manifest JSON
files containing metadata about the run. Also has extended functionality
to store more metadata than is currently required.
The `manifest` module would be too small to stand on its own, so it was
moved into `batch`, which will start consolidating other batch
functionality as well.
Moved the `_slurm` module into the new `batch` module for similar reasons
as `manifest`. There is not enough slurm functionality to justify an
individual module and slurm interactions will only occur in the context
of working with batch jobs.
Added a dataclass to represent batch job sizes, including light
validation.
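
A dataclass with light validation along these lines could be sketched as follows. The field names come from the `JobSize(jobs=4, simulations=2, blocks=1)` log line above; the specific validation rule (positive integers only) is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobSize:
    """Sketch of a batch job size; not the actual gempyor implementation."""

    jobs: int
    simulations: int
    blocks: int

    def __post_init__(self) -> None:
        # Assumed rule: every dimension must be a positive integer.
        for name in ("jobs", "simulations", "blocks"):
            value = getattr(self, name)
            if not isinstance(value, int) or value < 1:
                raise ValueError(
                    f"The '{name}' size must be a positive integer, got {value!r}."
                )
```

Validating in `__post_init__` keeps invalid sizes from ever being constructed, so downstream code does not need to re-check them.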
Added a class method to `gempyor.batch.JobSize` class that is slightly
more convenient to use from a CLI script. Provides a good home for
future code related to GH-360. Corresponding tests and docs.
In anticipation of GH-358, minimize future work.
Added internal utility to convert a dict of options into a list
appropriate for `subprocess.run` and similar.
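
One way such a conversion could work is sketched below. The function name and the formatting rules (`--key=value`, underscores turned into hyphens, `True` as a bare flag, `None`/`False` skipped) are assumptions chosen to match the `--export=ALL` style seen in the log above, not the utility's real behaviour.

```python
from typing import Any, Dict, List


def options_to_args(options: Dict[str, Any]) -> List[str]:
    """Convert a dict of options into a flat argument list (hypothetical)."""
    args: List[str] = []
    for key, value in options.items():
        # Assumed convention: unset or disabled options are omitted.
        if value is None or value is False:
            continue
        flag = "--" + key.replace("_", "-")
        if value is True:
            args.append(flag)  # boolean switch with no value
        else:
            args.append(f"{flag}={value}")
    return args
```

The result can be spliced into a command list, for example `["sbatch", *options_to_args({"export": "ALL"}), "job.sbatch"]`.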
Added `gempyor.batch._sbatch_template` that is a thin wrapper around
`_sbatch` that takes advantage of templating. There are docs, but no
unit tests. Still determining how this interface should be best setup.
Draft solution to GH-342 with the ability to extend to other types of
generic config/info in the future. Add `gempyor.info` module with
exported function `get_cluster_info` which pulls and parses an
information yaml into a pydantic model.
Very draft, needs to be better connected to a CLI to fully implement.
Added specification for path modifications to the cluster info (namely
to add AWS cli on rockfish to path). Specification includes the path to
add, prepend vs append, and a flag to error if path is missing.
Not fleshed out at all, but at least contains the main sections that
will need to be filled in. Now includes dynamic modules and $PATH
modifications that mirror the objects returned from `get_cluster_info`.
Just rebased `dev` into `GH-365/emcee-submit` which now contains black
formatting for python.
Added a helper to log the inputs given to a CLI tool when the verbosity
is set to debug. Mostly for user/dev debugging. Also added corresponding
docs and unit tests.
Internal utility to generate a human readable unique batch job name
along with corresponding docs and tests.
Added internal `gempyor.batch._resolve_batch_system` utility to make
sense of batch system CLI options.
Very preliminary work on the `flepimop batch` CLI tool including some
light testing for the not implemented errors thrown.
* Changed to generic import of `click` since the `gempyor.batch` module
  relies on it heavily.
* Add manifest json to `flepimop batch`.
* Add subpops to the job size calculation of `flepimop batch`.
* Leftover TODO note about resume/continuation locations.
Added a representation of batch job time limits, similar to `JobSize`,
along with corresponding documentation and unit tests.
Added the `JobTimeLimit.from_per_simulation_time` class method to easily
create instances from user provided inputs. Also added comparison
methods for ease of unit testing.
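
As a rough sketch of how a time limit could be derived from user inputs: the formula below (initial time plus simulations times blocks times the per-simulation time) is an assumption, although it does reproduce the 126 minute limit in the log above (2h initial, 3 minutes per simulation, 2 simulations, 1 block). The parameter names and the `minutes` helper are also hypothetical.

```python
import math
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True, order=True)
class JobTimeLimit:
    """Sketch of a batch job time limit; not the actual gempyor class."""

    time_limit: timedelta

    @classmethod
    def from_per_simulation_time(
        cls,
        simulations: int,
        blocks: int,
        time_per_simulation: timedelta,
        initial_time: timedelta,
    ) -> "JobTimeLimit":
        # Assumed formula: fixed startup cost plus time for every simulation.
        return cls(initial_time + simulations * blocks * time_per_simulation)

    def minutes(self) -> int:
        """Total limit rounded up to whole minutes."""
        return math.ceil(self.time_limit.total_seconds() / 60)
```

Passing `order=True` to the dataclass generates the comparison methods, which is one simple way to get the comparisons mentioned for unit testing.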
Started work constructing a call to `_sbatch_template` in the `flepimop
batch` CLI. Incorporated `JobTimeLimit` class and added additional
comments into the `_click_batch` function.
Added the `DurationParamType` for specifying custom durations in a user
friendly way for CLI interfaces along with corresponding unit tests.
Also incorporated the new custom param type into the `flepimop batch`
CLI.
* Changed default from `timedelta` to `str`. Looks like the `default`
  value is passed to the `convert` method of `type` if instance of
 `click.ParamType`.
* Began work incorporating cluster info into batch submission with
  slurm.
* Final config file is now parsed out to a file, either user specified
  or temp.
* Further work on the batch script, mostly in translating variables
  given to `_click_batch` to the template.
Bug in `MemoryParamType.convert` was failing if a unit was not given
with a value. Similar to `DurationParamType` it assumes that numbers
without a unit are given in the default unit of the type.
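
The "bare number means default unit" rule shared by `DurationParamType` and `MemoryParamType` can be illustrated with a standard-library-only duration parser. This is a sketch under assumptions: the accepted unit suffixes, the default unit of minutes, and the function name are all hypothetical, and the real types are `click` parameter types rather than plain functions.

```python
import re
from datetime import timedelta

# Assumed unit suffixes mapped onto timedelta keyword arguments.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}
_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]*)\s*$")


def parse_duration(value: str, default_unit: str = "minutes") -> timedelta:
    """Parse strings like '2h' or '90s'; bare numbers use `default_unit`."""
    match = _PATTERN.match(value)
    if match is None:
        raise ValueError(f"'{value}' is not a valid duration.")
    number = float(match.group(1))
    unit = match.group(2).lower()
    if not unit:
        kwarg = default_unit  # no unit given, fall back to the default
    elif unit in _UNITS:
        kwarg = _UNITS[unit]
    else:
        raise ValueError(f"Unknown duration unit '{unit}' in '{value}'.")
    return timedelta(**{kwarg: number})
```

Under these assumptions `parse_duration("2h")` gives a two hour `timedelta`, which prints as `2:00:00` like the `initial_time` entry in the debug log.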
Moved from the slurm specific code path in `_submit_scenario_job`,
should allow for using the `--local` flag on the cluster.
Made the `ValueError` from `_infer_cluster_from_fqdn` optional, but
default to `True` for current behavior.
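
A minimal sketch of inferring a cluster from the FQDN with an optional error is shown below. The hostname patterns are purely hypothetical placeholders, as are the `raise_error` parameter name and the explicit `fqdn` argument (added here so the function is easy to test); only the use of `socket.getfqdn()` and the optional `ValueError` come from the description above.

```python
import re
import socket
from typing import Optional

# Hypothetical hostname patterns, for illustration only.
CLUSTER_PATTERNS = {
    "longleaf": re.compile(r"^longleaf-login\d+(\.|$)"),
    "rockfish": re.compile(r"^login\d+\.rockfish(\.|$)"),
}


def infer_cluster_from_fqdn(
    fqdn: Optional[str] = None, raise_error: bool = True
) -> Optional[str]:
    """Guess the cluster name from a fully qualified domain name."""
    fqdn = socket.getfqdn() if fqdn is None else fqdn
    for name, pattern in CLUSTER_PATTERNS.items():
        if pattern.match(fqdn):
            return name
    if raise_error:
        raise ValueError(f"Could not infer a cluster from the FQDN '{fqdn}'.")
    return None  # caller opted out of the error
```

Defaulting `raise_error` to `True` preserves the strict behaviour, while callers that can fall back to a local run may pass `False` and handle `None`.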
@TimothyWillard TimothyWillard marked this pull request as ready for review November 26, 2024 20:55
* Export `$FLEPI_PATH` in the GH action before running tests to run the
  unit tests for `get_cluster_info`.
* Attempt to fix Jinja2 loader issue by falling back to a file system
  loader if a package loader fails.
@TimothyWillard
Contributor Author

While this PR is ready for review I think it's unlikely there will be sufficient time to get this into the upcoming release. If folks have the bandwidth to review that's great, but definitely prioritize the other PRs noted in slack first.

@saraloo
Contributor

saraloo commented Dec 2, 2024

Thanks Tim - I intended to test this functionality properly when I'm done with RSV too and that might take a while to cover all bases. Will prioritise the others first.

@TimothyWillard TimothyWillard marked this pull request as draft January 6, 2025 22:46
@TimothyWillard
Contributor Author

I'm converting this back to a draft as I extract some of the functionality here into smaller PRs that are easier to review. The first PR extracted from this one is GH-446.

Labels
  • batch: Relating to batch processing.
  • cli: Relating to command line interfaces.
  • enhancement: Request for improvement or addition of new feature(s).
  • gempyor: Concerns the Python core.
Development

Successfully merging this pull request may close these issues.

[Feature request]: EMCEE integration with inference_job.py
2 participants