SAMtools sort memory allocation #199

Closed
yashpatel6 opened this issue Jun 8, 2022 · 25 comments
Assignees
Labels
enhancement New feature or request

Comments

@yashpatel6
Contributor

Benchmark samtools sort memory usage with large sample and update allocation if necessary.

@tyamaguchi-ucla tyamaguchi-ucla assigned jarbet and unassigned graceooh Jun 13, 2022
@tyamaguchi-ucla tyamaguchi-ucla added the enhancement New feature or request label Jun 13, 2022
@tyamaguchi-ucla
Contributor

Related to this PR - #189

@jarbet
Contributor

jarbet commented Jun 29, 2022

See #213 for testing results on CPCG0196-B1. Here is the Nextflow report for the BWA-MEM2 aligner:
hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-merge-multiple-bams/CPCG0196-B1/BWA-MEM2/align-DNA-8.0.0/CPCG0196-B1/log-align-DNA-8.0.0-20220628T174653Z/nextflow-log/report.html

@jarbet
Contributor

jarbet commented Jul 7, 2022

Here are some interesting external testing results for samtools sort. I will still do more internal testing, but here are a few notes:

Interestingly, their results for the number of CPUs are similar to our results in #189, i.e. the runtime benefit seems to level off at 9-12 cores
[Screenshot: runtime versus number of CPUs]

There does not appear to be a strong relationship between memory-per-thread and runtime: (the lines appear flat when you have >6 threads, suggesting no benefit of adding more memory)
[Image: runtime versus memory per thread]

This paper fixed their memory per thread to 3648 MiB, although they don't explain why.

Overall, my current understanding is that we need to increase the memory per thread beyond the default of 768 MiB, because this will throw errors under our current F72 config of 12 cores with 10 total Gb, as seen here. So I'm guessing we need somewhere between 1-5 Gb per thread in order to reliably avoid memory errors, and I would not expect much difference in runtime within this range, but we'll see.
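
To make the arithmetic behind this concrete, here is a minimal Python sketch. It assumes the sort buffers scale as threads times memory-per-thread (the `-@` and `-m` options of samtools sort) and ignores any overhead on top of that, so the product should be treated as a rough estimate rather than a hard cap. The function names are hypothetical, and the 15 GiB budget in the last line is only an example value.

```python
MIB_PER_GIB = 1024


def total_sort_buffer_mib(threads: int, mem_per_thread_mib: int) -> int:
    """Rough buffer total: one buffer of mem_per_thread_mib per sorting thread."""
    return threads * mem_per_thread_mib


def max_mem_per_thread_mib(total_budget_mib: int, threads: int) -> int:
    """Largest whole-MiB per-thread value whose product stays within the budget."""
    return total_budget_mib // threads


if __name__ == "__main__":
    threads = 12
    # 768 MiB is the samtools default; 3648 MiB is the value fixed in the external benchmark.
    for per_thread in (768, 3648):
        total = total_sort_buffer_mib(threads, per_thread)
        print(f"{threads} threads x {per_thread} MiB = {total} MiB ({total / MIB_PER_GIB:.2f} GiB)")
    # Per-thread ceiling if a 15 GiB allocation were split evenly over 12 threads.
    print(max_mem_per_thread_mib(15 * MIB_PER_GIB, threads), "MiB per thread")
```

With 12 threads the default already comes to 9216 MiB (9 GiB), which leaves very little headroom under a 10 Gb limit, while 3648 MiB per thread would need roughly 42.75 GiB.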

@jarbet
Contributor

jarbet commented Jul 15, 2022

@uclahs-cds/nextflow-wg: I want to test align-DNA on a large sample in order to determine the memory configuration for run_sort_SAMtools. Are there any samples you recommend I test on?

The largest sample I tested on in the past was here: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/input/CPCG0196-B1.csv

@tyamaguchi-ucla
Contributor

@uclahs-cds/nextflow-wg: I want to test align-DNA on a large sample in order to determine the memory configuration for run_sort_SAMtools. Are there any samples you recommend I test on?

The paired tumor CPCG0196-F1 has a higher coverage (multiple libraries and lanes). It looks like one of the previous undergrad students processed the sample here - /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/outputs/bwa-mem2_and_hisat2-2.2.1/bwa-mem2/bams/a-full-CPCG0196-F1.

@tyamaguchi-ucla
Contributor

Here are some interesting external testing results for samtools sort.

I thought we might want to look into sambamba at some point (I know Sanger uses, or used, the tool in their pipeline), but they dropped CRAM support: https://github.com/biod/sambamba#no-cram-support

@tyamaguchi-ucla
Contributor

@uclahs-cds/nextflow-wg: I want to test align-DNA on a large sample in order to determine the memory configuration for run_sort_SAMtools. Are there any samples you recommend I test on?

The paired tumor CPCG0196-F1 has a higher coverage (multiple libraries and lanes). It looks like one of the previous undergrad students processed the sample here - /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/outputs/bwa-mem2_and_hisat2-2.2.1/bwa-mem2/bams/a-full-CPCG0196-F1.

@jarbet I found the csv for you /hot/user/aanand/input/full/tumours/a-full-CPCG0196-F1/pipeline-alignDNA.inputs.CPCG0196-F1.csv

The registered files are under /hot/data/PRAD/PRAD0000003/CPCG0000000196/CPCG0000000196-T001-P01-F/DNA/WGS/raw/FASTQ

The CSV input was created by the student that I mentioned so please make sure that the file contains all the FASTQs and correct info if you're using the CSV.

@jarbet
Contributor

jarbet commented Jul 22, 2022

@jarbet I found the csv for you /hot/user/aanand/input/full/tumours/a-full-CPCG0196-F1/pipeline-alignDNA.inputs.CPCG0196-F1.csv

Thanks, I will start testing on CPCG0196-F1 shortly. In the meantime, here are the memory results from run_sort_SAMtools when testing on CPCG0196-B1 in #220 (recall we gave sort 15Gb):

HISAT2

  • /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-merge-cpus-mem/align-DNA-8.0.0/CPCG0196-B1/log-align-DNA-8.0.0-20220721T194452Z/nextflow-log/report.html

[Image: run_sort_SAMtools memory usage from the Nextflow report, HISAT2]

BWA-MEM2

  • /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-merge-cpus-mem/align-DNA-8.0.0/CPCG0196-B1/log-align-DNA-8.0.0-20220721T194452Z/nextflow-log/report.html.1

[Image: run_sort_SAMtools memory usage from the Nextflow report, BWA-MEM2]

@jarbet
Contributor

jarbet commented Jul 25, 2022

@tyamaguchi-ucla, @yashpatel6: I tested the pipeline on CPCG0196-F1, but it failed after ~10 hours once it reached the run_sort_SAMtools step.

Can you please take a look at the attached nextflow reports? I am not sure what the error message means but I think run_sort_SAMtools is running out of memory (15 Gb may not be enough for this sample). What do you think?

nextflow_reports.zip

@tyamaguchi-ucla
Contributor

tyamaguchi-ucla commented Jul 25, 2022

It looks like this is not a memory issue but a lack of scratch space. Did you take a look at the sbatch Nextflow log? (In general, tailing the last 100-1000 lines would show errors/issues if any)

/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/BWA-MEM2-CPCG0196-F1.log

samtools sort: failed to create temporary file "CPCG-0196-Pr-P-PE-746-WG-18.sorted.bam.tmp.0103.bam": No space left on device
  samtools sort: failed to create temporary file "CPCG-0196-Pr-P-PE-746-WG-18.sorted.bam.tmp.0104.bam": No space left on device
  samtools sort: failed to create temporary file "CPCG-0196-Pr-P-PE-746-WG-18.sorted.bam.tmp.0105.bam": No space left on device
  samtools sort: failed to create temporary file "CPCG-0196-Pr-P-PE-746-WG-18.sorted.bam.tmp.0106.bam": No space left on device
  samtools sort: failed to create temporary file "CPCG-0196-Pr-P-PE-746-WG-18.sorted.bam.tmp.0107.bam": No space left on device

Which node did you use? (F32 or F72?)
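
The log check suggested above can be sketched in a few lines of Python. This is only an illustration: the helper names are hypothetical, and the signature table contains just the one message that actually appears in this thread ("No space left on device"); any further signatures would have to be added by hand.

```python
from collections import deque

# Substrings treated as failure signatures, mapped to a short label.
# Only this one message is taken from the thread; extend as needed.
SIGNATURES = {
    "No space left on device": "disk",
}


def tail_lines(path, n=1000):
    """Return the last n lines of a text file without holding the whole file in memory."""
    with open(path, "r", errors="replace") as handle:
        return list(deque(handle, maxlen=n))


def classify_failures(lines):
    """Count how many lines match each known failure signature."""
    counts = {}
    for line in lines:
        for needle, label in SIGNATURES.items():
            if needle in line:
                counts[label] = counts.get(label, 0) + 1
    return counts
```

Calling `classify_failures(tail_lines("BWA-MEM2-CPCG0196-F1.log"))` on a log like the one quoted above would report the matching lines under the "disk" label, which points at scratch space rather than RAM.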

@jarbet
Contributor

jarbet commented Jul 25, 2022

Did you take a look at the sbatch Nextflow log? (In general, tailing the last 100-1000 lines would show errors/issues if any)

/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/BWA-MEM2-CPCG0196-F1.log

Yes, I checked that log as well but I also thought it meant the process was running out of memory.

Which node did you use? (F32 or F72?)

F32, should I try F72 instead?

@tyamaguchi-ucla
Contributor

Which node did you use? (F32 or F72?)

F32, should I try F72 instead?

I would like you to propose which one to use and understand why at this point. (If you're not sure, please review the cluster training and cluster configuration on confluence)

@jarbet
Contributor

jarbet commented Jul 25, 2022

I would like you to propose which one to use and understand why at this point. (If you're not sure, please review the cluster training and cluster configuration on confluence)

From the error message, it's unclear to me whether run_sort_SAMtools is running out of RAM or scratch space. I understand you think it is running out of scratch space, in which case, I will try an F72 since it has 2000 GB scratch space compared to 500 GB for F32.

That being said, 500 GB scratch sounds like a lot to me, and I don't understand how the log files indicate more than 500 GB is needed.

@tyamaguchi-ucla
Contributor

That being said, 500 GB scratch sounds like a lot to me, and I don't understand how the log files indicate more than 500 GB is needed.

Yup, this can be calculated by the total input size. How big is it in total?

@jarbet
Contributor

jarbet commented Jul 25, 2022

Yup, this can be calculated by the total input size. How big is it in total?

I see from the I/O section in the failed nextflow html reports that the Max input memory was 785.1 Gb, so at least this amount of scratch space is needed, which explains why the job failed on F32. I am running the job on an F72 now!

@tyamaguchi-ucla
Contributor

Yup, this can be calculated by the total input size. How big is it in total?

I see from the I/O section in the failed nextflow html reports that the Max input memory was 785.1 Gb, so at least this amount of scratch space is needed, which explains why the job failed on F32. I am running the job on an F72 now!

Hmm, I think we can make a better estimate by just looking at the input FASTQ files (I meant ls -lh ${fastq-filename}).

@jarbet
Contributor

jarbet commented Jul 25, 2022

Hmm, I think we can make a better estimate by just looking at the input FASTQ files (I meant ls -lh ${fastq-filename}).

The total input size is only 385.2 Gb, which suggests F32 would be sufficient. However, the I/O section of the Nextflow html report shows that > 500 Gb is being written to scratch, so F32 is not enough.

@tyamaguchi-ucla
Contributor

Hmm, I think we can make a better estimate by just looking at the input FASTQ files (I meant ls -lh ${fastq-filename}).

The total input size is only 385.2 Gb, which suggests F32 would be sufficient.

Are you sure about this? How about intermediate files?

@jarbet
Contributor

jarbet commented Jul 25, 2022

Are you sure about this? How about intermediate files?

You asked about the total file size of all input fastqs, which is 385.2 Gb. I checked this in 2 ways:

du -sh /hot/data/PRAD/PRAD0000003/CPCG0000000196/CPCG0000000196-T001-P01-F/DNA/WGS/raw/FASTQ

This shows 386 Gb. Then I wrote a simple R script to calculate the file size of all fastqs listed in an input.csv file, and then take the sum:

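# CSV listing the FASTQ pairs for this sample
path.input.csv <- '/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/input/csv/CPCG0196-F1.csv'

input.csv <- read.csv(path.input.csv)

# size in bytes of every read1 and read2 FASTQ
input.memory <- c(
    file.info(input.csv$read1_fastq)$size,
    file.info(input.csv$read2_fastq)$size
    );

# total size; the legacy unit 'Gb' is 1024^3 bytes, whereas 'GB' would be 1000^3
utils:::format.object_size(
    x = sum(input.memory),
    units = 'Gb'
    );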

This gives 385.2 Gb. I understand more memory than this will be needed when making intermediate files, but again, I don't understand how I should have known a priori that 500 Gb was not enough. Nevertheless, the Nextflow I/O section clearly shows 500 Gb is not enough, so now I'm running the job on an F72.

@tyamaguchi-ucla
Contributor

tyamaguchi-ucla commented Jul 25, 2022

This gives 385.2 Gb. I understand more memory than this will be needed when making intermediate files, but again, I don't understand how I should have known a priori that 500 Gb was not enough.

Yup, what I was getting at is: how much disk space would you need to hold both the aligned and sorted BAMs under /scratch, given that the total size of the FASTQs is 385.2 GB? (btw, I guess you meant GB, not Gb?)

@jarbet
Contributor

jarbet commented Jul 25, 2022

This gives 385.2 Gb. I understand more memory than this will be needed when making intermediate files, but again, I don't understand how I should have known a priori that 500 Gb was not enough.

Yup, what I was getting at is: how much disk space would you need to hold both the aligned and sorted BAMs under /scratch, given that the total size of the FASTQs is 385.2 GB? (btw, I guess you meant GB, not Gb?)

The previous results I gave were in Gb. In GB, the total input file size is 414 GB (R gives 413.6 GB):

du -sh --block-size=GB /hot/data/PRAD/PRAD0000003/CPCG0000000196/CPCG0000000196-T001-P01-F/DNA/WGS/raw/FASTQ
[Image: du output]
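
The gap between the two figures is just the unit definition. As a small sketch (the constant and function names are my own), assuming "Gb" here means R's legacy unit of 1024^3 bytes and "GB" means 1000^3 bytes:

```python
GIB = 1024 ** 3  # bytes in R's legacy 'Gb' unit (also what `du -h` uses)
GB = 1000 ** 3   # bytes in an SI gigabyte (R's 'GB', `du --block-size=GB`)


def gib_to_gb(size_gib: float) -> float:
    """Convert a size in 1024^3-byte units to 1000^3-byte units."""
    return size_gib * GIB / GB


print(f"{gib_to_gb(385.2):.1f} GB")
```

This prints 413.6 GB, matching the R result; the 414 GB from du is consistent with du rounding up to the next whole block.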

Okay, so in order to have both aligned and sorted BAMs under /scratch you would need about 2 times the total input fastq file size.

So moving forward, as a rule of thumb, I need at least 2 times the total input fastq file size in scratch space, correct?
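
As a rough sketch of that rule of thumb, using the scratch sizes quoted earlier in this thread (500 GB for F32, 2000 GB for F72); the function names are hypothetical and the factor of 2 is only the estimate discussed here, not a guarantee:

```python
# Scratch capacities quoted in the thread, in GB.
NODE_SCRATCH_GB = {"F32": 500, "F72": 2000}


def required_scratch_gb(input_fastq_gb: float, factor: float = 2.0) -> float:
    """Rule of thumb from the thread: scratch >= factor x total FASTQ size."""
    return input_fastq_gb * factor


def nodes_with_enough_scratch(input_fastq_gb: float, factor: float = 2.0) -> list:
    """List the node types whose scratch capacity covers the estimated need."""
    need = required_scratch_gb(input_fastq_gb, factor)
    return [node for node, capacity in NODE_SCRATCH_GB.items() if capacity >= need]


print(required_scratch_gb(413.6))
print(nodes_with_enough_scratch(413.6))
```

For the 413.6 GB of FASTQs in CPCG0196-F1 this estimates about 827 GB of scratch, which exceeds F32's 500 GB and leaves F72 as the only option.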

@tyamaguchi-ucla
Contributor

So moving forward, as a rule of thumb, I need at least 2 times the total input fastq file size in scratch space, correct?

That's right. Yash added a process to remove intermediate files after sorting (see https://github.com/uclahs-cds/pipeline-Nextflow-module/tree/main/modules/common/intermediate_file_removal), but before that the pipeline used to require more scratch space.

remove_intermediate_files(
    run_sort_SAMtools.out.bam_for_deletion,
    "decoy_signal"
    )

@jarbet
Contributor

jarbet commented Jul 26, 2022

@tyamaguchi-ucla, @yashpatel6 : CPCG0196-F1 failed due to run_MarkDuplicatesSpark_GATK, which I think is related to issue #225. Here is the testing info:

Testing info

  • BWA-MEM2 (failed after 19 hours)

    • sample: CPCG0196-F1
    • input csv: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/input/csv/CPCG0196-F1.csv
    • config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/BWA-MEM2-CPCG0196-F1.config
    • output: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/align-DNA-8.0.0/CPCG0196-F1/log-align-DNA-8.0.0-20220725T174134Z/nextflow-log/report.html
      • /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/BWA-MEM2-CPCG0196-F1.log
  • HISAT2 (failed after 22 hours)

    • config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/HISAT2-CPCG0196-F1.config
    • output: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/align-DNA-8.0.0/CPCG0196-F1/log-align-DNA-8.0.0-20220725T174652Z/nextflow-log/report.html
      • /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/HISAT2-CPCG0196-F1.log

Note that BWA-MEM2 and HISAT2 give slightly different error messages. Both say the following:

Error executing process > 'align_DNA_HISAT2_workflow:run_MarkDuplicatesSpark_GATK'
Caused by:
Process align_DNA_HISAT2_workflow:run_MarkDuplicatesSpark_GATK terminated with an error exit status (3)

But only HISAT2 says the following (several times) with regard to run_MarkDuplicatesSpark_GATK:

No space left on device

Proposed plan

The good news is 15GB memory was enough for run_sort_SAMtools. Thus I'm proposing the following:

  • Keep the default configuration of 15GB for run_sort_SAMtools, and close this issue. The purpose of this issue was to check whether 15GB was enough for CPCG0196-F1, and it is.
  • Work on fixing the configuration of run_MarkDuplicatesSpark_GATK in #226 (add retry method for run_MarkDuplicatesSpark_GATK and run_sort_SAMtools).
    • Note I am adding a retry method for both run_sort_SAMtools and run_MarkDuplicatesSpark_GATK, in case they run out of RAM.
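
The thread does not show how the retry in #226 is implemented. Purely as an illustration of the idea, here is a Python sketch in which memory grows with each attempt; `task` is a hypothetical callable standing in for the process, and the linear scaling is an assumption. In the pipeline itself this would more naturally be expressed with Nextflow's `errorStrategy 'retry'` directive and `task.attempt`.

```python
def run_with_memory_retry(task, base_memory_gb=15, max_attempts=3):
    """Call task(memory_gb), scaling memory linearly with the attempt number.

    `task` is a hypothetical callable that raises MemoryError when its
    allocation is too small; any other exception is not retried.
    """
    for attempt in range(1, max_attempts + 1):
        memory_gb = base_memory_gb * attempt
        try:
            return task(memory_gb)
        except MemoryError:
            # Give up only after the final attempt has also failed.
            if attempt == max_attempts:
                raise
```

With the defaults, a task that needs 30 GB would fail once at 15 GB and succeed on the second attempt.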

Thoughts?

@tyamaguchi-ucla
Contributor

tyamaguchi-ucla commented Jul 26, 2022

Yes, I'm ok with closing this issue. We can create a new issue for the spark error and discuss it there.

@jarbet
Contributor

jarbet commented Jul 29, 2022

Yes, I'm ok with closing this issue. We can create a new issue for the spark error and discuss it there.

I created a new issue, #229. I will close this issue now.

@jarbet jarbet closed this as completed Jul 29, 2022