Batch timeseries analysis in slurm #120

Merged
merged 29 commits into from
Nov 28, 2023

Conversation

Collaborator

@ledm ledm commented Nov 24, 2023

Closes #118

It's looking like this script is working now.

This PR adds a tool for batch processing of single-job timeseries analyses through the slurm queue.

It's got the following features:

  • Loads from existing input_yaml files.
  • Runs with a single command.
  • Fully parallelises the slowest part of the analysis_comparison tool: the single job analysis.
  • Won't submit the same jobID if it already exists.*
  • *However, if two analyses call the same jobID with different suites (i.e. one has bgc and one has physics), it will only run the first one.
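
The last two points can be illustrated with a minimal sketch. It assumes the input_yaml files have already been reduced to (jobID, suite) pairs; the function name, the data layout and the example job IDs are hypothetical and are not taken from batch_timeseries.py.

```python
def select_jobs_to_submit(requests, already_queued=()):
    """Return the (jobID, suite) pairs that would be submitted.

    requests: iterable of (jobID, suite) pairs, in the order they were read.
    already_queued: jobIDs that already have a job (assumed to be known).
    """
    seen = set(already_queued)
    to_submit = []
    for job_id, suite in requests:
        if job_id in seen:
            # Same jobID requested again, possibly with a different suite: skipped.
            continue
        seen.add(job_id)
        to_submit.append((job_id, suite))
    return to_submit


# Hypothetical job IDs: the second request for u-ab123 (physics) is dropped.
requests = [("u-ab123", "bgc"), ("u-ab123", "physics"), ("u-cd456", "physics")]
print(select_jobs_to_submit(requests))
```

Keying on the jobID alone is what produces the caveat: the suite of the second request is never looked at. Keying on the (jobID, suite) pair instead would run both, at the cost of repeating any work the suites share.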

Need to do:

  • Documentation in README.md
  • Is it @DrYool proof?

Owner

@valeriupredoi valeriupredoi left a comment

hey bud, nice and fast turnaround - I am a bit concerned about the actual submission process (the call to subprocess, see my comment) - I'd also think you should add job requirements, no? Also, I'd rename it to batch so and so, since it's not quite parallel-parallel 😁

.gitignore Show resolved Hide resolved
else:
    # Submit job:
    print('Submitting:', ' '.join(command_txt))
    command1 = subprocess.Popen(command_txt)
Owner

I'd defo encase this in a try/except with except fishing for some key elements in stderr, or just printing the whole stderr to screen; if you don't pipe stderr as out it'll be hidden, and the user won't know why their jobs have not been submitted when they thought they have
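
A minimal sketch of that suggestion, assuming the command is a list of strings like command_txt in the snippet above. The function name `submit` and its `dry_run` argument are illustrative, and this is not the change that was eventually committed.

```python
import subprocess
import sys


def submit(command_txt, dry_run=False):
    """Run a submission command and surface stderr if it fails.

    Returns the command's stdout, or None on a dry run.
    """
    print('Submitting:', ' '.join(command_txt))
    if dry_run:
        return None
    try:
        result = subprocess.run(
            command_txt,
            capture_output=True,  # collect stdout and stderr instead of losing them
            text=True,            # decode bytes to str
            check=True,           # raise CalledProcessError on a non-zero exit code
        )
    except FileNotFoundError as err:
        # The executable (e.g. sbatch) is not on PATH.
        print('Command not found:', err, file=sys.stderr)
        raise
    except subprocess.CalledProcessError as err:
        # Show the user why the job was not submitted.
        print('Submission failed with exit code', err.returncode, file=sys.stderr)
        print(err.stderr, file=sys.stderr)
        raise
    return result.stdout
```

Unlike a bare `subprocess.Popen`, `subprocess.run` waits for the command to finish, so a failed submission is reported immediately rather than going unnoticed.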

setup.py Outdated Show resolved Hide resolved
@valeriupredoi
Owner

oh and also - these things really do deserve a test, not thru and thru with SLURM submission, but everything up to that. I can write the test when it's about ready 👍
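
One way such a test might look, as a sketch only: `submit_jobs` is a hypothetical stand-in for the submission loop, and `unittest.mock.patch` replaces `subprocess.run` so that no real sbatch call is made. The test that was later added to the repository may look quite different.

```python
import subprocess
from unittest import mock


def submit_jobs(commands, dry_run=False):
    """Hypothetical stand-in for the submission loop; returns the commands it handled."""
    handled = []
    for command in commands:
        if not dry_run:
            subprocess.run(command, check=True)
        handled.append(command)
    return handled


def test_commands_built_without_calling_slurm():
    commands = [["sbatch", "job_a.sh"], ["sbatch", "job_b.sh"]]
    # Replace subprocess.run so the test never touches the real queue.
    with mock.patch("subprocess.run") as fake_run:
        handled = submit_jobs(commands)
    assert handled == commands
    assert fake_run.call_count == 2
    fake_run.assert_any_call(["sbatch", "job_a.sh"], check=True)


test_commands_built_without_calling_slurm()
```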

Owner

@valeriupredoi valeriupredoi left a comment

let's make sure the piping is done correctly; also, have you tried this in practice? We don't need any special env to pass to sbatch do we? Like any special environment variables

bgcval2/batch_timeseries.py Outdated Show resolved Hide resolved
@ledm
Collaborator Author

ledm commented Nov 27, 2023

let's make sure the piping is done correctly; also, have you tried this in practice? We don't need any special env to pass to sbatch do we? Like any special environment variables

I've been using this for a few days and it works on jasmin. Your amendment to the subprocess call works too.

If batch_timeseries fails, you get normal python errors. If it fails inside the sbatch script, then you get error messages in the places that we tell it to fail.

@valeriupredoi
Owner

yeh that's how we want it to behave, so stdout can be piped to eg a file. Looks good, bud! Let me write a test for it!

@ledm
Collaborator Author

ledm commented Nov 27, 2023

I'm not ready to merge. Still need to add documentation & maybe get @DrYool to try it.

README.md Outdated Show resolved Hide resolved
@ledm
Collaborator Author

ledm commented Nov 27, 2023

The next question I have:

  • Do we want to make this the default behaviour? This would mean that we run this script from inside analysis_compare when the -s flag is absent instead of calling the analysis_timeseries command?

@ledm
Collaborator Author

ledm commented Nov 27, 2023

Basically, the process for adding a new job with input.yml is:

  1. analysis_compare -s -y input.yml: This generates the job download commands, which will run overnight. It also creates an html report, which breaks if no data has been downloaded yet.
  2. Wait overnight for data to download on mass.
  3. batch_timeseries -y input.yml: This submits the job timeseries onto the processing nodes.
  4. analysis_compare -s -y input.yml: This job creates the html report.
  5. ./rsync_to_esmeval.sh: This copies the html to the web-visible location on disk.

I suspect that these can be merged into fewer commands!
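
As a starting point for merging them, the commands from the steps above can be written down as data. This is only a sketch: the function name `workflow_commands` is hypothetical, the commands and flags are copied from the list above, and nothing here is executed.

```python
def workflow_commands(input_yml):
    """Return the commands from the steps above, in order, as argument lists."""
    return [
        ["analysis_compare", "-s", "-y", input_yml],  # step 1: generate download commands
        # step 2 (waiting for the download) has no command
        ["batch_timeseries", "-y", input_yml],        # step 3: submit timeseries jobs
        ["analysis_compare", "-s", "-y", input_yml],  # step 4: build the html report
        ["./rsync_to_esmeval.sh"],                    # step 5: copy html to the web location
    ]


for command in workflow_commands("input.yml"):
    print(" ".join(command))
```

Steps 1 and 3 cannot simply be chained, because the download in step 2 has to finish first. Steps 3 to 5 also need the submitted jobs to complete before the report is built, so any merged command would have to wait on the queue.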

@valeriupredoi
Owner

@ledm I added some test gubbins, make sure to pull or merge so there are no conflictseses

@ledm
Collaborator Author

ledm commented Nov 27, 2023

The logo in the README points towards a file on the main branch, but of course it's not available yet until this PR is merged.

@ledm ledm changed the title Parrallel timeseries analysis in slurm Batch timeseries analysis in slurm Nov 28, 2023
return args


def submits_lotus(compare_yml, config_user, dry_run=False):
Collaborator Author

We're not actually using config_user here, but I'm keeping it for completeness, just in case.

README.md Outdated Show resolved Hide resolved
@ledm
Collaborator Author

ledm commented Nov 28, 2023

Okay @valeriupredoi, I'm happy with this now.

@valeriupredoi
Owner

all good by me too, bud! Go ahead and merge when you good 🍺

@ledm ledm merged commit b086386 into main Nov 28, 2023
1 check passed
Development

Successfully merging this pull request may close these issues.

analysis_timeseries parrallelisation?
2 participants