
Improve the current SDA workflow to support North American runs with 6400 sites #3340

Open
wants to merge 51 commits into base: develop
Conversation

@DongchenZ (Contributor) commented Jul 22, 2024

Description

Motivation and Context

  1. The foreach package appears to handle memory allocation better than the furrr package. This PR therefore replaces every use of furrr with foreach throughout the general SDA workflow.
  2. Computational power and memory are limited when certain SDA procedures run locally (e.g., splitting meteorology files, writing configuration files, reading SDA outputs, running the Bayesian MCMC analysis, and removing files such as NC files after the first SDA run). This PR therefore adds support for qsub job submission during the SDA workflow, specified by the batch.settings section of the XML file and documented in the PEcAn book.
  3. To avoid complex if-else logic inside the current sda.enkf_Multisite function for these batch job submissions, I developed a new sda.enkf_NorthAmerica function (largely copied from sda.enkf_Multisite). It is cleaner and is used only when batch.settings is non-empty.

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • My name is in the list of CITATION.cff
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Comment on lines 697 to 700
PEcAn.logger::logger.info("Checking results.")
while (length(list.files(outdir, pattern = "results.Rdata", recursive = TRUE)) < folder.num) {
  Sys.sleep(60)
}
Member:

Flagging that this has high potential to become an infinite loop if any job fails / is rejected by the queue / etc

Member:

More generally, I'm cautious of submitting anything to qsub and waiting for it to complete before returning from the submitting function -- Part of the point of qsub is that it does the waiting for you.

Contributor Author:

I am not sure if I resolved this correctly. Instead, I now detect job completion by checking whether the job still exists on the server. This ensures a finite while loop even if, for some reason, some jobs never finish and are killed by the server.

Member:

> detect the job completion by checking if the job still exists on the server

👍 That seems like a solid improvement. -- still has the potential to be waiting a long time if the queue is stalled / moving slowly, but will only be waiting for outputs the server still plans to provide.
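A bounded polling loop along these lines could guard against both failure modes, stalled queues included. This is a sketch, not the PR's actual code: the helper name `wait_for_jobs`, the `job.ids` vector, and the way qstat output is parsed are all assumptions.

```r
# Sketch: wait for queued jobs with a hard timeout instead of looping forever.
# A job is considered done once its ID no longer appears in the qstat listing.
wait_for_jobs <- function(job.ids, timeout.sec = 48 * 3600, poll.sec = 60) {
  start <- Sys.time()
  repeat {
    queued <- suppressWarnings(system2("qstat", stdout = TRUE))
    remaining <- job.ids[vapply(job.ids,
                                function(id) any(grepl(id, queued, fixed = TRUE)),
                                logical(1))]
    if (length(remaining) == 0) return(TRUE)
    if (as.numeric(difftime(Sys.time(), start, units = "secs")) > timeout.sec) {
      warning("Timed out waiting for jobs: ", paste(remaining, collapse = ", "))
      return(FALSE)
    }
    Sys.sleep(poll.sec)
  }
}
```

With an empty job list (or once all IDs have left the queue) the function returns immediately rather than sleeping.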

load(file.path(folder.path, "block.Rdata"))
# initialize parallel.
cores <- parallel::detectCores()
if (cores > 28) cores <- 28
Member:

This looks like a BU-specific number and should be configurable

Contributor Author:

Fixed.

Comment on lines 751 to 753
results <- foreach::foreach(l = blocks, .packages = c("Kendall", "purrr"), .options.snow = opts) %dopar% {
  MCMC_block_function(l)
}
Member:

Naive question: How closely equivalent is this to parallel::parLapply(cl, blocks, MCMC_block_function)? Would it be worth considering that approach since it only uses existing dependencies?

Contributor Author:

I haven't used that parallel function before. I chose foreach because it manages memory well for large data inputs, but I can try parLapply to see whether it performs similarly.
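For comparison, the parLapply alternative suggested above would look roughly like this; `blocks` and `MCMC_block_function` are stand-ins for the PR's actual objects, and the real call would also need `clusterExport` for any shared data the workers use.

```r
library(parallel)

# Stand-ins for the PR's block list and per-block analysis function.
blocks <- list(1:3, 4:6, 7:9)
MCMC_block_function <- function(l) sum(l)

cl <- makeCluster(2)
# Export any objects the worker function needs, e.g.:
# clusterExport(cl, "some_shared_object")
results <- parLapply(cl, blocks, MCMC_block_function)
stopCluster(cl)

unlist(results)
```

This uses only the base parallel package, so no new dependency is introduced.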

dir.create(folder.path)
# save corresponding block list to the folder.
blocks <- block.list[head.num:tail.num]
save(blocks, file = file.path(folder.path, "block.Rdata"))
Member:

I find it hard to tell what the structure of blocks actually is, but is this something that could be saved as a csv dataframe or something similar to avoid Rdata? At the minimum I'd recommend switching to RDS, which is more transparent about what objects are being reloaded.

blocks <- block.list[head.num:tail.num]
save(blocks, file = file.path(folder.path, "block.Rdata"))
# create job file.
jobsh <- readLines(con = system.file("analysis_qsub.job", package = "PEcAn.ModelName"), n = -1, warn=FALSE)
Member:

Why is the generic template model being hard-coded here? Do we really want to create a package dependency of the SDA module on the generic template model?

##' @title qsub_analysis_submission
##' @param block.list list: MCMC configuration lists for the block SDA analysis.
##' @param outdir character: SDA output path.
##' @param job.per.folder numeric: number of jobs per folder.
Member:

This logic seems backwards to me. Generally with a HPC or cloud queueing system you want to specify the number of cores and/or nodes that you have, and then divide jobs across them.

Contributor Author:

Jobs are now divided based on folder numbers.

for (i in 1:folder.num) {
# create folder for each set of job runs.
# calculate start and end index for the current folder.
head.num <- (i-1)*job.per.folder + 1
Member:

Are all jobs expected to take the same amount of time? If not (e.g. could one block have 1 site while another block has 1000 sites?), can we estimate which are expected to be longer or shorter so that we can load balance a bit more intelligently than doing so uniformly?

Contributor Author:

Should be balanced by number of sites within each block.

jobsh <- gsub("@FOLDER_PATH@", folder.path, jobsh)
writeLines(jobsh, con = file.path(folder.path, "job.sh"))
# qsub command.
qsub <- "qsub -l h_rt=48:00:00 -l buyin -pe omp 28 -V -N @NAME@ -o @STDOUT@ -e @STDERR@ -S /bin/bash"
Member:

qsub settings are system and user specific, and thus need to be read from the settings object, not hard-coded. Also, I'd STRONGLY recommend setting up a single array-style qsub over submitting multiple jobs in a loop.
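An array-style submission built from settings rather than hard-coded could be assembled along these lines. The `settings` fields shown here are assumptions for illustration, not the actual PEcAn settings structure, and the `-t` flag is the Grid Engine array-job option.

```r
# Sketch: one array job covering all folders, with qsub options read
# from a settings object instead of hard-coded.
settings <- list(host = list(
  qsub     = "qsub -l h_rt=@WALLTIME@ -pe omp @CORES@",
  walltime = "48:00:00",
  cores    = "28"))
folder.num <- 10

qsub <- settings$host$qsub
qsub <- gsub("@WALLTIME@", settings$host$walltime, qsub)
qsub <- gsub("@CORES@", settings$host$cores, qsub)
# "-t 1:N" submits one task per folder; job.sh would read $SGE_TASK_ID
# to pick its folder, so only one submission hits the scheduler.
cmd <- paste0(qsub, " -t 1:", folder.num, " job.sh")
cmd
```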

Member:

See remote execution. Also, a lot of this code seems to already exist in the PEcAn.remote package, for example start_qsub.

@@ -0,0 +1,5 @@
#!/bin/bash -l
module load R/4.1.2
Member:

This is very server-specific. In pecan.xml we have a section for qsub that allows you to specify the modules that need to be loaded. See https://pecanproject.github.io/pecan-documentation/master/xml-core-config.html#xml-host for an example.
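For reference, the host section described above looks roughly like this in pecan.xml; the values here are illustrative, and the exact tags should be checked against the linked documentation.

```xml
<host>
  <name>cluster.example.edu</name>
  <!-- qsub template with placeholders filled in by PEcAn.remote -->
  <qsub>qsub -V -N @NAME@ -o @STDOUT@ -e @STDERR@ -S /bin/bash</qsub>
  <!-- commands run before the job, e.g. loading environment modules -->
  <prerun>module load R/4.1.2</prerun>
</host>
```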

# loop over sub-folders.
folder.paths <- job.ids <- c()
PEcAn.logger::logger.info(paste("Submitting", folder.num, "jobs."))
for (i in 1:folder.num) {
Member:

Some systems will penalize you if you do many submissions in parallel, or will only run two at a time and then wait for those to be done; the more jobs you submit, the lower your priority in the queue. To overcome some of this I added the modellauncher. It creates a text file whose first line is the command to execute, followed by the list of folders in which to run the command.
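The modellauncher input file described above can be sketched as follows; the file name and folder names are placeholders, not the actual paths PEcAn generates.

```r
# Sketch of the modellauncher input file: first line is the command,
# then one run folder per line.
launcher.file <- file.path(tempdir(), "launcher.txt")
writeLines(c("./job.sh", paste0("folder_", 1:3)), launcher.file)
readLines(launcher.file)
```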

jobsh <- gsub("@FOLDER_PATH@", folder.path, jobsh)
writeLines(jobsh, con = file.path(folder.path, "job.sh"))
# qsub command.
qsub <- "qsub -l h_rt=48:00:00 -l buyin -pe omp 28 -V -N @NAME@ -o @STDOUT@ -e @STDERR@ -S /bin/bash"
Member:

See remote execution. Also, a lot of this code seems to already exist in the PEcAn.remote package, for example start_qsub.

The github-actions bot added the Base label on Sep 20, 2024.
@@ -25,7 +26,9 @@ qsub_run_finished <- function(run, host, qstat) {
}

if (length(out) > 0 && substring(out, nchar(out) - 3) == "DONE") {
PEcAn.logger::logger.debug("Job", run, "for run", run_id_string, "finished")
if (verbose) {
PEcAn.logger::logger.debug("Job", run, "for run", run_id_string, "finished")
Member:

PEcAn.logger already has built-in verbosity control via logger.setLevel() -- is a function-specific verbose flag needed here or is it enough for the user to set the logger level to something higher than debug so that this message isn't printed?

Contributor Author:

  1. I am unsure whether I fully understand how logger.setLevel() works inside the PEcAn.logger package.
  2. For a small number of jobs, I think it is still helpful to have job info printed instead of creating a progress bar.
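For reference, PEcAn.logger's level control works roughly as sketched below: messages below the current level are suppressed globally, which may make a per-function verbose flag redundant. The block guards on the package being installed, since it is not on CRAN.

```r
# Sketch: logger.setLevel() controls verbosity globally, so callers can
# silence debug output without a function-specific flag.
if (requireNamespace("PEcAn.logger", quietly = TRUE)) {
  PEcAn.logger::logger.setLevel("INFO")   # DEBUG messages now suppressed
  PEcAn.logger::logger.debug("not printed")
  PEcAn.logger::logger.info("printed")
  PEcAn.logger::logger.setLevel("DEBUG")  # re-enable debug output
}
```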

@@ -6,9 +6,6 @@
##' @param var.names vector names of state variable names.
##' @param X a matrix of state variables. In this matrix rows represent ensembles, while columns show the variables for different sites.
##' @param localization.FUN This is the function that performs the localization of the Pf matrix and it returns a localized matrix with the same dimensions.
##' @param t not used
##' @param blocked.dis passed to `localization.FUN`
##' @param ... passed to `localization.FUN`
@infotroph (Member) commented Sep 21, 2024:

This is undoing a fix I made in #3346 (and apparently forgot to update Rcheck_reference.log, sorry! That's why the checks didn't complain about this being undone.)

Member:

...But if you have better descriptions for the parameters, please do improve my wording!

Contributor Author:

Fixed.

#' @description State Variable Data Assimilation: Ensemble Kalman Filter and Generalized ensemble filter. Check out SDA_control function for more details on the control arguments.
#'
#' @return NONE
#' @import nimble
Member:

Can we importFrom just the functions we need instead of bringing all of nimble into the namespace?
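If nimble imports were still needed, the roxygen change suggested above would look roughly like this; the two function names are illustrative stand-ins for whichever nimble functions the code actually calls.

```r
#' @importFrom nimble nimbleModel nimbleMCMC
```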

Contributor Author:

I removed it because I found no place where nimble is actually used.
