Problem when running on HPC computing cluster. #326

Closed
samreenzafer opened this issue Jun 27, 2024 · 7 comments
Labels
user-query User queries & requests

Comments

@samreenzafer

Hi.
I've been able to run a few traits with my data on the command line (on our department's computing cluster), and I'm now trying to scale it up to thousands of PGS IDs using the LSF queue system. I've finally been able to get one job running, as shown below, but it fails at a different point every time I submit it. The log files are quite large, so I'll upload them here instead of pasting them.

My main job exited with an error, but I still see one of the sub-jobs that the workflow creates and submits to the cluster sitting in the PENDING state, which is strange.

I submitted the main job as below.

job=job.CNICS.lsf.sh
pop="CAU"
bsub -J CNICS.$pop  -P acc_rareADRs -q premium -n 2 -W 4:00 -R rusage[mem=5000] -oo $job.$pop.o -eo $job.$pop.e -L /bin/bash sh $job $pop

The shell script looks like this:

[zafers02@li03c02 test_nextflow_CNICSonly]$ cat job.CNICS.lsf.sh

dir=`readlink -f .`
cd $dir 

ml proxies
ml singularity/3.11.0
ml nextflow/24.04.2.5914

pop=$1  ### selects the sample sheet: plink_files/samplesheet.CNICSAFR.csv or samplesheet.CNICSCAU.csv
mkdir -p `pwd`/genotypes_cache/CNICS/$pop

export NXF_SINGULARITY_CACHEDIR="/---myfullpath--/PRS/pgsc-calc/work/singularity/"
export NXF_ANSI_LOG=false
export NXF_OPTS="-Xms500M -Xmx2G"

nextflow run pgscatalog/pgsc_calc --max_cpus 1 -profile singularity --input plink_files/samplesheet.CNICS${pop}.csv --pgs_id PGS000036,PGS003446,PGS002237,PGS002280 --target_build GRCh37 -w `pwd`/work/  --genotypes_cache `pwd`/genotypes_cache/CNICS/$pop  --run_ancestry /---myfullpath--/PRS/pgsc-calc/resources/pgsc_HGDP+1kGP_v1.tar.zst  --min_overlap 0.20 -c nextflow.lsf.config

And this is what the nextflow.lsf.config file looks like.

[zafers02@li03c02 test_nextflow_CNICSonly]$ cat nextflow.lsf.config

process {
    queue = 'premium'
    clusterOptions = ' -P acc_CranioProject '
    scratch = true

    withLabel:process_low {
        cpus   = 1
        memory = 4.GB
        time   = 2.h
    }
    withLabel:process_medium {
        cpus   = 8
        memory = 64.GB
        time   = 4.h
    }
}

executor {
    name = 'lsf'
    jobName = { "$task.hash" }
}

I still see the following sub-job pending on the cluster queue, even though the main job "CNICS.CAU" exited with an error.
[zafers02@li03c02 test_nextflow_CNICSonly]$ bjobs
JOBID USER JOB_NAME STAT QUEUE FROM_HOST EXEC_HOST SUBMIT_TIME START_TIME TIME_LEFT
131073368 zafers02 *3900ddf6662 PEND premium lc02c03.ch - Jun 27 14:13 - -

-rw-rw-rw- 1 zafers02 nicolp01a 52K Jun 27 14:15 .nextflow.log
-rw-rw-rw- 1 zafers02 nicolp01a 0 Jun 27 14:15 job.CNICS.lsf.sh.CAU.e
-rw-rw-rw- 1 zafers02 nicolp01a 8.0K Jun 27 14:15 job.CNICS.lsf.sh.CAU.o

I am attaching the job.CNICS.lsf.sh.CAU.o file as job.CNICS.lsf.sh.CAU.o.txt and the .nextflow.log file as job1.nextflow.log.txt.
I'm wondering what I'm doing wrong here.
Thank you for your time.

job.CNICS.lsf.sh.CAU.o.txt
job1.nextflow.log.txt

@nebfield
Member

nebfield commented Jul 1, 2024

On an HPC it's normal for Nextflow to submit many smaller jobs when you use the lsf executor. Pending jobs can sometimes get stuck if Nextflow exits suddenly and doesn't have time to clean up.
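If a stray sub-job is still sitting in the queue after the main run has died, you can usually just remove it by hand using the job ID from your bjobs output, e.g.:

bkill 131073368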

When a process exits with code 137, it means the process was killed because it exceeded the resources it requested. The EXTRACT_DATABASE process was killed by your scheduler, which caused the workflow to exit with an error (exit code 1).

Here's a configuration profile I use for UK Biobank: #328 (comment)

It works fine for ~150 scores. This configuration does a few things (see the sketch after this list):

  1. It automatically resubmits jobs up to 3 times if they fail because of resource problems, requesting more resources on each attempt
  2. It defines precisely the amount of resources each process needs
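As a minimal sketch of that retry pattern (the process name is taken from your log, but the starting values here are placeholders to tune, not the UK Biobank numbers):

process {
    executor      = 'lsf'
    errorStrategy = 'retry'   // resubmit a failed task instead of stopping the run
    maxRetries    = 3

    withName: 'EXTRACT_DATABASE' {
        cpus   = 2
        // multiply by task.attempt so each retry asks the scheduler for more
        memory = { 16.GB * task.attempt }
        time   = { 2.h * task.attempt }
    }
}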

@samreenzafer
Author

Thank You.
I tried using your configuration profile, but changed a few lines, as highlighted in BOLD below:

**executor {
    name = 'lsf'
    jobName = { "$task.hash" }
}**

process {
    errorStrategy = 'retry'
    maxRetries = 3
    maxErrors = '-1'
    **executor = 'lsf'
    queue = 'premium'
    clusterOptions = ' -P acc_CranioProject '**

    withName: 'SAMPLESHEET_JSON' {

and so on... 

Then I submitted the job with the new config file as follows (I asked for 4 cores and 64 GB, since the largest process in your config file required that much):
bsub -J CNICS.$pop -P acc_rareADRs -q premium -n 4 -W 4:00 -R rusage[mem=64000] -oo $job.$pop.o -eo $job.$pop.e -L /bin/bash sh $job $pop
I did not get error 137 this time, but DOWNLOAD_SCORE failed on all three attempts with exit status 11 and "pgscatalog.core.lib.pgsexceptions.QueryError: Can't query PGS Catalog API". I then tried running the DOWNLOAD_SCORE command manually on the command line and it did in fact work, as shown below. So I'm confused about which other settings I should consider changing.

[zafers02@li03c02 test_nextflow_CNICSonly]$ singularity shell  ../pgsc-calc/work/singularity/ghcr.io-pgscatalog-pygscatalog-pgscatalog-utils-1.1.2-singularity.img  ^C
[zafers02@li03c02 test_nextflow_CNICSonly]$ cat work/7f/357285bcc8f87e5542676c30dce421/.command.sh 
#!/bin/bash -euo pipefail
pgscatalog-download -i PGS000036 PGS003446 PGS002237 PGS002280                                    -b GRCh37         -o $PWD         -v         -c pgsc_calc/2.0.0-beta

cat <<-END_VERSIONS > versions.yml
DOWNLOAD_SCOREFILES:
    pgscatalog.core: $(echo $(python -c 'import pgscatalog.core; print(pgscatalog.core.__version__)'))
END_VERSIONS
[zafers02@li03c02 test_nextflow_CNICSonly]$ singularity shell  ../pgsc-calc/work/singularity/ghcr.io-pgscatalog-pygscatalog-pgscatalog-utils-1.1.2-singularity.img  
Singularity> pgscatalog-download -i PGS000036 PGS003446 PGS002237 PGS002280                                    -b GRCh37         -o $PWD         -v         -c pgsc_calc/2.0.0-beta
pgscatalog.core.cli.download_cli: 2024-07-01 12:45:56 DEBUG    Verbose logging enabled
pgscatalog.core.cli.download_cli: 2024-07-01 12:45:56 INFO     Setting user agent to pgsc_calc/2.0.0-beta
pgscatalog.core.cli.download_cli: 2024-07-01 12:45:56 INFO     Downloading scoring files that have been harmonised to build=GenomeBuild.GRCh37
pgscatalog.core.cli.download_cli: 2024-07-01 12:45:57 INFO     Submitting ScoringFile('PGS000036', target_build=GenomeBuild.GRCh37) download
pgscatalog.core.cli.download_cli: 2024-07-01 12:45:57 INFO     Submitting ScoringFile('PGS002237', target_build=GenomeBuild.GRCh37) download
pgscatalog.core.cli.download_cli: 2024-07-01 12:45:57 INFO     Submitting ScoringFile('PGS002280', target_build=GenomeBuild.GRCh37) download
pgscatalog.core.cli.download_cli: 2024-07-01 12:45:57 INFO     Submitting ScoringFile('PGS003446', target_build=GenomeBuild.GRCh37) download
  0%|                                                                                                                                                                        | 0/4 [00:00<?, ?it/s]pgscatalog.core.cli.download_cli: 2024-07-01 12:45:59 INFO     Download complete
 25%|████████████████████████████████████████                                                                                                                        | 1/4 [00:01<00:05,  1.89s/it]pgscatalog.core.cli.download_cli: 2024-07-01 12:46:03 INFO     Download complete
 50%|████████████████████████████████████████████████████████████████████████████████                                                                                | 2/4 [00:05<00:06,  3.19s/it]pgscatalog.core.cli.download_cli: 2024-07-01 12:46:04 INFO     Download complete
 75%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                        | 3/4 [00:07<00:02,  2.28s/it]pgscatalog.core.cli.download_cli: 2024-07-01 12:46:07 INFO     Download complete
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:10<00:00,  2.53s/it]
pgscatalog.core.cli.download_cli: 2024-07-01 12:46:07 INFO     All downloads finished
Singularity> 
 

Here is my nextflow log file
job2.nextflow.log.txt
and job_output_log file
job2.job.CNICS.lsf.sh.CAU.o.txt

I am going to try deleting the entire work folder and re-run the job.

@samreenzafer
Author

Can I download all PGS traits beforehand and have nextflow use the downloaded files from a directory, rather than trying to download files live while the pipeline is running? Something similar to the reference files?

@nebfield
Member

nebfield commented Jul 2, 2024

You could use pgscatalog-download to preload scoring files

The --scorefile parameter supports multiple local scoring files.

You can install the pgscatalog.core package with pip or bioconda.
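As a rough sketch (directory names are placeholders, and the glob assumes the harmonised .txt.gz files that pgscatalog-download writes):

# one-off download of the harmonised scoring files to a local directory
mkdir -p scorefiles
pgscatalog-download -i PGS000036 PGS003446 PGS002237 PGS002280 -b GRCh37 -o scorefiles

# then point the pipeline at the local copies instead of using --pgs_id
nextflow run pgscatalog/pgsc_calc -profile singularity \
    --input plink_files/samplesheet.CNICSCAU.csv \
    --target_build GRCh37 \
    --scorefile "scorefiles/*.txt.gz"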

@samreenzafer
Author

Thanks. I downloaded some PRS scoring files and then ran a job testing one PRS trait (using --scorefile), and it ran to completion. So I've bypassed the issue of the job not being able to execute the "DOWNLOAD_SCORE" step.

I will now begin benchmarking the resource requirements for multiple traits, since we intend to run all ~4800 traits from the PGS Catalog for both our CAU and AFR subsets of target data, each of which has >4000 samples. I did have to cap the memory requests for sub-tasks (in the config file) at 8 GB wherever they asked for more, otherwise my jobs sat pending in the cluster queues without ever entering the "run" stage.
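Roughly, the kind of override I mean (the label and values are illustrative, not my exact config):

process {
    // cap anything that asked for more than 8 GB so the job can actually be scheduled
    withLabel: process_medium {
        cpus   = 4
        memory = 8.GB
        time   = 4.h
    }
}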

If you ever figure out the original "pgscatalog.core.lib.pgsexceptions.QueryError: Can't query PGS Catalog API" error, or if our HPC informaticians help me work it out, I will update here.

thank you.

@smlmbrt
Member

smlmbrt commented Jul 3, 2024

> I will now begin benchmarking the resource requirements for multiple traits, since we intend to run all ~4800 traits from the PGS Catalog for both our CAU and AFR subsets of target data, each of which has >4000 samples. I did have to cap the memory requests for sub-tasks (in the config file) at 8 GB, otherwise my jobs sat pending in the cluster queues without ever entering the "run" stage.

Hi @samreenzafer, you probably won't be able to run all scores at once with only 8 GB of RAM. Running all scores at once increases the RAM used in matching (more variants are read into memory) and in scoring (it has to create a scores x samples matrix; plink will likely complain with less than 16 GB for that amount of data).

@smlmbrt added the user-query (User queries & requests) label on Jul 3, 2024
@samreenzafer
Author

Yes, I did in fact get that error, even when I tried 150 traits at a time. I might just fire one job per trait and merge the individual results CSVs at the end, although I would like to see a single HTML report with multiple traits for easier comparison, which is why I thought it would be better to run in batches of 100 or 150 together.
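For the merge I'm imagining something simple like the sketch below (the per-trait results layout and file name are hypothetical; I'd substitute whatever CSV each run actually writes):

# hypothetical layout: one pgsc_calc run per trait under results/<PGS id>/
out=all_traits_scores.csv
first=1
for f in results/PGS*/scores.csv; do    # scores.csv is a placeholder name
    if [ "$first" -eq 1 ]; then
        cat "$f" > "$out"               # keep the header from the first file
        first=0
    else
        tail -n +2 "$f" >> "$out"       # skip the repeated header line
    fi
done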

Thank You.
