SAMtools sort memory allocation #199

yashpatel6 · 2022-06-08T00:26:18Z

Benchmark samtools sort memory usage with a large sample and update the allocation if necessary.

tyamaguchi-ucla · 2022-06-13T23:07:23Z

Related to this PR - #189

jarbet · 2022-06-29T18:45:12Z

See #213 for testing results on CPCG0196-B1. Here is the Nextflow report for the BWA-MEM2 aligner:
hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-merge-multiple-bams/CPCG0196-B1/BWA-MEM2/align-DNA-8.0.0/CPCG0196-B1/log-align-DNA-8.0.0-20220628T174653Z/nextflow-log/report.html

jarbet · 2022-07-07T22:50:58Z

Here are some interesting external testing results for samtools sort. I will still do more internal testing, but here are a few notes:

Interestingly, their results for the number of CPUs are similar to our results in #189, i.e. the runtime benefit seems to level off at 9-12 cores.

There does not appear to be a strong relationship between memory per thread and runtime: the lines appear flat with more than 6 threads, suggesting no benefit from adding more memory.

This paper fixed their memory per thread to 3648 MiB, although they don't explain why.

Overall, my current understanding is that we need to increase the memory per thread beyond the default of 768 MiB, because this will throw errors under our current F72 config of 12 cores with 10 Gb total, as seen here. So I'm guessing we need somewhere between 1-5 Gb per thread in order to reliably avoid memory errors, and I would not expect much difference in runtime within this range, but we'll see.
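
For reference, the arithmetic behind the numbers above can be sketched as follows. This is only a rough budget, assuming the sort's footprint is approximately the per-thread limit (the `-m` value of samtools sort) times the number of threads (`-@`), and reading Gb as 1024 MiB. The function names and the 15% headroom are hypothetical, not measured values.

```python
MIB_PER_GIB = 1024


def sort_memory_mib(threads: int, mem_per_thread_mib: int) -> int:
    """Approximate sort buffer memory: per-thread limit times thread count."""
    return threads * mem_per_thread_mib


def max_mem_per_thread_mib(total_gib: int, threads: int, headroom_pct: int = 15) -> int:
    """Largest per-thread limit (MiB) that fits in total_gib.

    headroom_pct is an assumed safety margin left for everything that is
    not a sort buffer; it is an illustrative value only.
    """
    usable_mib = total_gib * MIB_PER_GIB * (100 - headroom_pct) // 100
    return usable_mib // threads


if __name__ == "__main__":
    # Default 768 MiB per thread on 12 threads: 9216 MiB, i.e. 9 of the 10 Gb.
    print(sort_memory_mib(12, 768))
    # Per-thread limit that fits a 10 Gb or a 15 Gb allocation with headroom.
    print(max_mem_per_thread_mib(10, 12))
    print(max_mem_per_thread_mib(15, 12))
```

With the default setting, 12 threads already claim about 9 Gb of a 10 Gb allocation, which leaves roughly 1 Gb for everything else. Raising the per-thread limit therefore only makes sense together with a larger total allocation.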

jarbet · 2022-07-15T16:29:40Z

@uclahs-cds/nextflow-wg: I want to test align-DNA on a large sample in order to determine the memory configuration for run_sort_SAMtools. Are there any samples you recommend I test on?

The largest sample I tested on in the past was here: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/input/CPCG0196-B1.csv

tyamaguchi-ucla · 2022-07-19T19:08:07Z

> @uclahs-cds/nextflow-wg: I want to test align-DNA on a large sample in order to determine the memory configuration for run_sort_SAMtools. Are there any samples you recommend I test on?

The paired tumor CPCG0196-F1 has a higher coverage (multiple libraries and lanes). It looks like one of the previous undergrad students processed the sample here - /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/outputs/bwa-mem2_and_hisat2-2.2.1/bwa-mem2/bams/a-full-CPCG0196-F1.

tyamaguchi-ucla · 2022-07-19T19:15:00Z

> Here are some interesting external testing results for samtools sort.

I thought we might want to look into sambamba at some point (I know Sanger uses, or used, the tool in their pipeline), but they dropped CRAM support: https://github.com/biod/sambamba#no-cram-support

tyamaguchi-ucla · 2022-07-21T20:36:54Z

> > @uclahs-cds/nextflow-wg: I want to test align-DNA on a large sample in order to determine the memory configuration for run_sort_SAMtools. Are there any samples you recommend I test on?
>
> The paired tumor CPCG0196-F1 has a higher coverage (multiple libraries and lanes). It looks like one of the previous undergrad students processed the sample here - /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/outputs/bwa-mem2_and_hisat2-2.2.1/bwa-mem2/bams/a-full-CPCG0196-F1.

@jarbet I found the csv for you /hot/user/aanand/input/full/tumours/a-full-CPCG0196-F1/pipeline-alignDNA.inputs.CPCG0196-F1.csv

The registered files are under /hot/data/PRAD/PRAD0000003/CPCG0000000196/CPCG0000000196-T001-P01-F/DNA/WGS/raw/FASTQ

The CSV input was created by the student that I mentioned so please make sure that the file contains all the FASTQs and correct info if you're using the CSV.

jarbet · 2022-07-22T16:25:41Z

> @jarbet I found the csv for you /hot/user/aanand/input/full/tumours/a-full-CPCG0196-F1/pipeline-alignDNA.inputs.CPCG0196-F1.csv

Thanks, I will start testing on CPCG0196-F1 shortly. In the meantime, here are the memory results from run_sort_SAMtools when testing on CPCG0196-B1 in #220 (recall we gave sort 15Gb):

HISAT2

/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-merge-cpus-mem/align-DNA-8.0.0/CPCG0196-B1/log-align-DNA-8.0.0-20220721T194452Z/nextflow-log/report.html

BWA-MEM2

/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-merge-cpus-mem/align-DNA-8.0.0/CPCG0196-B1/log-align-DNA-8.0.0-20220721T194452Z/nextflow-log/report.html.1

jarbet · 2022-07-25T04:44:43Z

@tyamaguchi-ucla, @yashpatel6: I tested the pipeline on CPCG0196-F1, but it failed after ~10 hours, once it reached the run_sort_SAMtools step.

Can you please take a look at the attached nextflow reports? I am not sure what the error message means but I think run_sort_SAMtools is running out of memory (15 Gb may not be enough for this sample). What do you think?

nextflow_reports.zip

tyamaguchi-ucla · 2022-07-25T14:53:27Z

It looks like this is not a memory issue but a lack of scratch space. Did you take a look at the sbatch Nextflow log? (In general, tailing the last 100-1000 lines will show errors/issues, if any.)

/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/BWA-MEM2-CPCG0196-F1.log

samtools sort: failed to create temporary file "CPCG-0196-Pr-P-PE-746-WG-18.sorted.bam.tmp.0103.bam": No space left on device
samtools sort: failed to create temporary file "CPCG-0196-Pr-P-PE-746-WG-18.sorted.bam.tmp.0104.bam": No space left on device
samtools sort: failed to create temporary file "CPCG-0196-Pr-P-PE-746-WG-18.sorted.bam.tmp.0105.bam": No space left on device
samtools sort: failed to create temporary file "CPCG-0196-Pr-P-PE-746-WG-18.sorted.bam.tmp.0106.bam": No space left on device
samtools sort: failed to create temporary file "CPCG-0196-Pr-P-PE-746-WG-18.sorted.bam.tmp.0107.bam": No space left on device

Which node did you use? (F32 or F72?)
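
The log check described above could be scripted roughly like this. It is a minimal sketch that scans only the tail of a log for a couple of error substrings. The marker list and function names are assumptions for illustration, not part of the pipeline.

```python
from collections import deque
from typing import Iterable, List

# Substrings treated as error markers; this list is an assumption for
# illustration and would need extending for other failure modes.
ERROR_MARKERS = (
    "No space left on device",
    "terminated with an error exit status",
)


def tail_lines(lines: Iterable, n: int = 1000) -> List:
    """Return the last n items of an iterable, keeping at most n in memory."""
    return list(deque(lines, maxlen=n))


def find_errors(lines: Iterable[str], n: int = 1000) -> List[str]:
    """Return the lines among the last n that contain a known error marker."""
    return [
        line.rstrip("\n")
        for line in tail_lines(lines, n)
        if any(marker in line for marker in ERROR_MARKERS)
    ]


def find_errors_in_file(path: str, n: int = 1000) -> List[str]:
    """Same as find_errors, reading the lines from a log file on disk."""
    with open(path, errors="replace") as handle:
        return find_errors(handle, n)
```

Calling `find_errors_in_file` on the sbatch log path would return the matching lines, which makes it easier to tell a "No space left on device" failure apart from an out-of-memory one.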

jarbet · 2022-07-25T16:36:32Z

> Did you take a look at the sbatch Nextflow log? (In general, tailing the last 100-1000 lines would show errors/issues if any)
>
> /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/BWA-MEM2-CPCG0196-F1.log

Yes, I checked that log as well but I also thought it meant the process was running out of memory.

> Which node did you use? (F32 or F72?)

F32, should I try F72 instead?

tyamaguchi-ucla · 2022-07-25T18:17:42Z

> > Which node did you use? (F32 or F72?)
>
> F32, should I try F72 instead?

I would like you to propose which one to use and understand why at this point. (If you're not sure, please review the cluster training and cluster configuration on confluence)

jarbet · 2022-07-25T18:46:17Z

> I would like you to propose which one to use and understand why at this point. (If you're not sure, please review the cluster training and cluster configuration on confluence)

From the error message, it's unclear to me whether run_sort_SAMtools is running out of RAM or scratch space. I understand you think it is running out of scratch space, in which case, I will try an F72 since it has 2000 GB scratch space compared to 500 GB for F32.

That being said, 500 GB scratch sounds like a lot to me, and I don't understand how the log files indicate more than 500 GB is needed.

tyamaguchi-ucla · 2022-07-25T18:49:55Z

> That being said, 500 GB scratch sounds like a lot to me, and I don't understand how the log files indicate more than 500 GB is needed.

Yup, this can be calculated by the total input size. How big is it in total?

jarbet · 2022-07-25T19:05:54Z

> Yup, this can be calculated by the total input size. How big is it in total?

I see from the I/O section in the failed nextflow html reports that the Max input memory was 785.1 Gb, so at least this amount of scratch space is needed, which explains why the job failed on F32. I am running the job on an F72 now!

tyamaguchi-ucla · 2022-07-25T19:11:22Z

> > Yup, this can be calculated by the total input size. How big is it in total?
>
> I see from the I/O section in the failed nextflow html reports that the Max input memory was 785.1 Gb, so at least this amount of scratch space is needed, which explains why the job failed on F32. I am running the job on an F72 now!

Hmm, I think we can make a better estimate by just looking at the input FASTQ files (I meant ls -lh ${fastq-filename}).

jarbet · 2022-07-25T19:25:06Z

> Hmm, I think we can make a better estimate by just looking at the input FASTQ files (I meant ls -lh ${fastq-filename}).

The total input size is only 385.2 Gb, which suggests F32 would be sufficient. However, the I/O section of the Nextflow HTML report shows that more than 500 Gb is being written to scratch, so F32 is not enough.

tyamaguchi-ucla · 2022-07-25T19:27:07Z

> > Hmm, I think we can make a better estimate by just looking at the input FASTQ files (I meant ls -lh ${fastq-filename}).
>
> The total input size is only 385.2 Gb, which suggests F32 would be sufficient.

Are you sure about this? How about intermediate files?

jarbet · 2022-07-25T19:43:13Z

> Are you sure about this? How about intermediate files?

You asked about the total file size of all input fastqs, which is 385.2 Gb. I checked this in 2 ways:

du -sh /hot/data/PRAD/PRAD0000003/CPCG0000000196/CPCG0000000196-T001-P01-F/DNA/WGS/raw/FASTQ

This shows 386 Gb. Then I wrote a simple R script to calculate the file size of all fastqs listed in an input.csv file, and then take the sum:

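# Path to the align-DNA input CSV that lists the FASTQ pairs
path.input.csv <- '/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/input/csv/CPCG0196-F1.csv';

input.csv <- read.csv(path.input.csv);

# Size in bytes of every read 1 and read 2 FASTQ
input.memory <- c(
    file.info(input.csv$read1_fastq)$size,
    file.info(input.csv$read2_fastq)$size
    );

# Report the total; the legacy unit 'Gb' is 1024^3 bytes
format(
    x = structure(sum(input.memory), class = 'object_size'),
    units = 'Gb'
    );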

This gives 385.2 Gb. I understand more space than this will be needed when making intermediate files, but again, I don't understand how I should have known a priori that 500 Gb was not enough. Nevertheless, the Nextflow I/O section clearly shows 500 Gb is not enough, so now I'm running the job on an F72.
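
For anyone who prefers Python, the same check might look roughly like the sketch below. It assumes the input CSV has `read1_fastq` and `read2_fastq` columns, as in the R snippet above. The function names are hypothetical, and the size is reported in both binary (GiB, what R calls Gb) and decimal (GB) units to avoid ambiguity.

```python
import csv
import os
from typing import Sequence


def total_fastq_bytes(
    csv_path: str,
    columns: Sequence[str] = ("read1_fastq", "read2_fastq"),
) -> int:
    """Sum the on-disk size in bytes of every FASTQ listed in the CSV."""
    total = 0
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            for column in columns:
                total += os.path.getsize(row[column])
    return total


def format_size(num_bytes: int) -> str:
    """Format a byte count in binary (GiB) and decimal (GB) units."""
    gib = num_bytes / 1024**3
    gb = num_bytes / 1000**3
    return f"{gib:.1f} GiB ({gb:.1f} GB)"
```

Passing the CSV path to `total_fastq_bytes` and the result to `format_size` gives one line that shows both units side by side.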

tyamaguchi-ucla · 2022-07-25T19:51:50Z

> This gives 385.2 Gb. I understand more space than this will be needed when making intermediate files, but again, I don't understand how I should have known a priori that 500 Gb was not enough.

Yup, what I was getting at is: how much disk space would you need to hold both the aligned and sorted BAMs under /scratch, given that the FASTQs total 385.2 GB? (By the way, I guess you meant GB, not Gb?)

jarbet · 2022-07-25T21:09:41Z

> > This gives 385.2 Gb. I understand more space than this will be needed when making intermediate files, but again, I don't understand how I should have known a priori that 500 Gb was not enough.
>
> Yup, what I was getting at is: how much disk space would you need to hold both the aligned and sorted BAMs under /scratch, given that the FASTQs total 385.2 GB? (By the way, I guess you meant GB, not Gb?)

The previous results I gave were in Gb (binary units of 1024^3 bytes, as reported by du -sh and by R). In GB (10^9 bytes), the total input file size is 414 GB (R gives 413.6 GB):

du -sh --block-size=GB /hot/data/PRAD/PRAD0000003/CPCG0000000196/CPCG0000000196-T001-P01-F/DNA/WGS/raw/FASTQ

Okay, so in order to have both aligned and sorted BAMs under /scratch you would need about 2 times the total input fastq file size.

So moving forward, as a rule of thumb, I need at least 2 times the total input fastq file size in scratch space, correct?
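
As a worked version of this rule of thumb, here is a small sketch using the numbers from this thread: 385.2 Gb (binary) of FASTQs, and 500 GB versus 2000 GB of scratch on F32 and F72. The function names are hypothetical, and the factor of 2 is the lower bound discussed above rather than a guarantee.

```python
from typing import Dict, Optional

GIB = 1024**3  # bytes in a binary gigabyte (R's "Gb", du -sh's "G")
GB = 1000**3   # bytes in a decimal gigabyte

# Scratch capacities as quoted in this thread, in GB.
NODE_SCRATCH_GB: Dict[str, int] = {"F32": 500, "F72": 2000}


def gib_to_gb(gib: float) -> float:
    """Convert binary gigabytes to decimal gigabytes."""
    return gib * GIB / GB


def required_scratch_gb(input_gb: float, factor: float = 2.0) -> float:
    """Rule-of-thumb scratch need: at least factor times the FASTQ size."""
    return input_gb * factor


def smallest_sufficient_node(
    input_gb: float,
    nodes: Dict[str, int] = NODE_SCRATCH_GB,
    factor: float = 2.0,
) -> Optional[str]:
    """Return the node with the least scratch that still fits, or None."""
    need = required_scratch_gb(input_gb, factor)
    candidates = [(cap, name) for name, cap in nodes.items() if cap >= need]
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    input_gb = gib_to_gb(385.2)
    print(round(input_gb, 1))                  # 413.6
    print(smallest_sufficient_node(input_gb))  # F72
```

Twice 413.6 GB is about 827 GB, which exceeds the 500 GB on F32, so the estimate points to F72 before the job is ever submitted.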

tyamaguchi-ucla · 2022-07-25T21:22:07Z

> So moving forward, as a rule of thumb, I need at least 2 times the total input fastq file size in scratch space, correct?

That's right. The pipeline used to require even more scratch space, until Yash added a process to remove intermediate files after sorting (see https://github.com/uclahs-cds/pipeline-Nextflow-module/tree/main/modules/common/intermediate_file_removal).

pipeline-align-DNA/module/align_DNA_BWA_MEM2.nf

Lines 104 to 107 in acf77d9

    
    remove_intermediate_files(
        run_sort_SAMtools.out.bam_for_deletion,
        "decoy_signal"
        )

jarbet · 2022-07-26T17:18:37Z

@tyamaguchi-ucla, @yashpatel6 : CPCG0196-F1 failed due to run_MarkDuplicatesSpark_GATK, which I think is related to issue #225. Here is the testing info:

Testing info

BWA-MEM2 (failed after 19 hours)
- sample: CPCG0196-F1
- input csv: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/input/csv/CPCG0196-F1.csv
- config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/BWA-MEM2-CPCG0196-F1.config
- output: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/align-DNA-8.0.0/CPCG0196-F1/log-align-DNA-8.0.0-20220725T174134Z/nextflow-log/report.html
  - /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/BWA-MEM2-CPCG0196-F1.log
HISAT2 (failed after 22 hours)
- config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/HISAT2-CPCG0196-F1.config
- output: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/align-DNA-8.0.0/CPCG0196-F1/log-align-DNA-8.0.0-20220725T174652Z/nextflow-log/report.html
  - /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/HISAT2-CPCG0196-F1.log

Note that BWA-MEM2 and HISAT2 give slightly different error messages. Both say the following:

Error executing process > 'align_DNA_HISAT2_workflow:run_MarkDuplicatesSpark_GATK'
Caused by:
Process align_DNA_HISAT2_workflow:run_MarkDuplicatesSpark_GATK terminated with an error exit status (3)

But only HISAT2 says the following (several times) regarding run_MarkDuplicatesSpark_GATK:

No space left on device

Proposed plan

The good news is 15GB memory was enough for run_sort_SAMtools. Thus I'm proposing the following:

- Keep the default configuration of 15GB for run_sort_SAMtools and close this issue. The purpose of this issue was to check whether 15GB was enough for CPCG0196-F1, and it is.
- Work on fixing the configuration of run_MarkDuplicatesSpark_GATK in #226 (add retry method for run_MarkDuplicatesSpark_GATK and run_sort_SAMtools).
  - Note that I am adding a retry method for both run_sort_SAMtools and run_MarkDuplicatesSpark_GATK, in case they run out of RAM.

Thoughts?
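
To illustrate the retry idea in the plan above, here is a rough Python sketch of scaling the memory request with the attempt number. It is not the implementation in #226. The function names, the linear scaling, and the cap are all hypothetical.

```python
from typing import Callable, Optional


def memory_for_attempt(base_gb: int, attempt: int, cap_gb: int) -> int:
    """Memory to request on a 1-based attempt: base, 2x base, ... up to cap."""
    return min(base_gb * attempt, cap_gb)


def run_with_retries(
    task: Callable[[int], None],
    base_gb: int,
    cap_gb: int,
    max_attempts: int = 3,
) -> Optional[int]:
    """Run task(memory_gb), retrying with more memory on MemoryError.

    Returns the memory (GB) of the attempt that succeeded, or None if
    every attempt ran out of memory.
    """
    for attempt in range(1, max_attempts + 1):
        memory_gb = memory_for_attempt(base_gb, attempt, cap_gb)
        try:
            task(memory_gb)
            return memory_gb
        except MemoryError:
            continue  # try again with a larger request
    return None
```

The point of the design is that the common case keeps the 15GB default, and only samples that actually run out of RAM pay for a larger request.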

tyamaguchi-ucla · 2022-07-26T17:34:26Z

Yes, I'm ok with closing this issue. We can create a new issue for the spark error and discuss it there.

jarbet · 2022-07-29T17:59:57Z

> Yes, I'm ok with closing this issue. We can create a new issue for the spark error and discuss it there.

I created a new issue, #229. I will close this issue now.

yashpatel6 assigned graceooh and yashpatel6 Jun 8, 2022

tyamaguchi-ucla assigned jarbet and unassigned graceooh Jun 13, 2022

tyamaguchi-ucla added the enhancement label Jun 13, 2022

tyamaguchi-ucla unassigned yashpatel6 Jun 14, 2022

tyamaguchi-ucla mentioned this issue Jun 28, 2022: try merging BAMs after SAMtools sort #213 (Merged)

jarbet mentioned this issue Jul 19, 2022: optimize #CPUs for run_merge_SAMtools #220 (Merged)

tyamaguchi-ucla mentioned this issue Jul 20, 2022: Release 8.1.0 #222 (Closed)

This was referenced Jul 29, 2022: add retry method for run_MarkDuplicatesSpark_GATK and run_sort_SAMtools #226 (Merged); run_MarkDuplicatesSpark_GATK error exit status (3) with CPCG0196-F1 #229 (Open)

jarbet closed this as completed Jul 29, 2022
