SAMtools sort memory allocation #199
Comments
Related to this PR - #189 |
See #213 for testing results on CPCG0196-B1. Here is the Nextflow report for the BWA-MEM2 aligner: |
Here are some interesting external testing results for

Interestingly, their results for the number of CPUs are similar to our results in #189, i.e. the runtime benefit seems to level off at 9-12 cores.

There does not appear to be a strong relationship between memory per thread and runtime: the lines appear flat when you have >6 threads, suggesting no benefit from adding more memory. This paper fixed its memory per thread at 3648 MiB, although the authors don't explain why.

Overall, my current understanding is that we need to increase the memory per thread beyond the default of 768 MiB, because this will throw errors under our current F72 config of 12 cores with 10 Gb total, as seen here. So I'm guessing we need somewhere between 1-5 Gb per thread in order to reliably avoid memory errors, and I would not expect much difference in runtime within this range, but we'll see. |
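For a rough sense of scale, the per-thread setting can be multiplied out against the thread count. This is only a back-of-the-envelope sketch: it assumes the sort buffers total roughly threads x per-thread memory, and the samtools documentation describes `-m` as an approximate per-thread maximum, so real usage may be higher. The 768 MiB and 3648 MiB values come from the discussion above; 1024 and 5120 MiB are illustrative stand-ins for the "1-5 Gb" range, and the function name is made up for this example.

```python
MIB_PER_GIB = 1024


def sort_memory_gib(threads: int, mib_per_thread: int) -> float:
    """Approximate sort-buffer memory in GiB: threads x per-thread cap."""
    return threads * mib_per_thread / MIB_PER_GIB


if __name__ == "__main__":
    threads = 12  # core count mentioned for the F72 config
    # 768 = samtools sort default, 3648 = value fixed in the cited paper;
    # 1024 and 5120 are illustrative endpoints of the proposed range.
    for mib in (768, 1024, 3648, 5120):
        total = sort_memory_gib(threads, mib)
        print(f"{threads} threads x {mib} MiB = {total:.2f} GiB")
```

Even at the default, 12 threads already account for about 9 GiB of sort buffers, which leaves little headroom in a 10 Gb allocation once any other overhead is counted.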
@uclahs-cds/nextflow-wg: I want to test align-DNA on a large sample in order to determine the memory configuration for The largest sample I tested on in the past was here: |
The paired tumor CPCG0196-F1 has a higher coverage (multiple libraries and lanes). It looks like one of the previous undergrad students processed the sample here - |
I thought we might want to look into sambamba at some point (I know Sanger uses, or used, the tools in their pipeline), but they dropped CRAM support: https://github.com/biod/sambamba#no-cram-support |
@jarbet I found the csv for you The registered files are under The CSV input was created by the student that I mentioned so please make sure that the file contains all the FASTQs and correct info if you're using the CSV. |
Thanks, I will start testing on HISAT2
BWA-MEM2
|
@tyamaguchi-ucla, @yashpatel6 : I tested the pipeline on CPCG0196-F1 but it failed after ~10 hours once getting to the Can you please take a look at the attached nextflow reports? I am not sure what the error message means but I think |
It looks like this is not a memory issue but a lack of scratch space. Did you take a look at the sbatch Nextflow log? (In general, tailing the last 100-1000 lines would show errors/issues, if any.)
Which node did you use? (F32 or F72?) |
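The log-tailing suggestion above could be scripted along these lines. This is a minimal sketch, not part of the pipeline: the function names and the keyword list are assumptions chosen for illustration, and the log path would need to point at the actual sbatch Nextflow log.

```python
from collections import deque


def tail_lines(path, n=1000):
    """Return the last n lines of a text file without loading all of it."""
    with open(path, "r", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


def find_error_lines(lines, keywords=("error", "no space left", "killed")):
    """Keep lines that mention any keyword (case-insensitive).

    The default keywords are illustrative guesses, not an exhaustive list.
    """
    lowered = tuple(keyword.lower() for keyword in keywords)
    return [line for line in lines if any(k in line.lower() for k in lowered)]
```

Usage would be something like `find_error_lines(tail_lines("path/to/sbatch.log", n=1000))`, where the path is a placeholder.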
Yes, I checked that log as well but I also thought it meant the process was running out of memory.
F32, should I try F72 instead? |
I would like you to propose which one to use and understand why at this point. (If you're not sure, please review the cluster training and cluster configuration on confluence) |
From the error message, it's unclear to me whether That being said, 500 GB scratch sounds like a lot to me, and I don't understand how the log files indicate more than 500 GB is needed. |
Yup, this can be calculated by the total input size. How big is it in total? |
I see from the I/O section in the failed nextflow html reports that the Max input memory was 785.1 Gb, so at least this amount of scratch space is needed, which explains why the job failed on F32. I am running the job on an F72 now! |
Hmm, I think we can make a better estimate by just looking at the input FASTQ files (I meant |
The total input size is only 385.2 Gb, which suggests F32 would be sufficient. However, the I/O section of the Nextflow HTML report shows that more than 500 Gb is being written to scratch, so F32 is not enough. |
Are you sure about this? How about intermediate files? |
You asked about the total file size of all input fastqs, which is 385.2 Gb. I checked this in 2 ways:
This shows 386 Gb. Then I wrote a simple R script to calculate the file size of all fastqs listed in an input.csv file, and then take the sum:
This gives |
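The same kind of check (summing the sizes of every FASTQ listed in an input CSV) might look like this in Python. This is not the R script from the comment above, just a sketch of the idea: the column names `read1_fastq` and `read2_fastq` are hypothetical and would need to match the real CSV header, and sizes are reported in decimal GB (1e9 bytes).

```python
import csv
import os

# Hypothetical column names; adjust to the header of the actual input CSV.
FASTQ_COLUMNS = ("read1_fastq", "read2_fastq")


def total_fastq_bytes(csv_path, columns=FASTQ_COLUMNS):
    """Sum on-disk sizes (bytes) of all FASTQ paths listed in the CSV.

    A missing file raises an OSError, which also flags bad paths in the CSV.
    """
    total = 0
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            for column in columns:
                path = (row.get(column) or "").strip()
                if path:
                    total += os.path.getsize(path)
    return total


def to_gb(n_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_bytes / 1e9
```

Because a missing file raises an error rather than being skipped, this also doubles as a check that every FASTQ in the CSV actually exists.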
Yup, what I was getting at is how much disk space you would need to hold both the aligned and sorted BAMs under |
That's right. Yash added a process to remove intermediate files after sorting (see https://github.com/uclahs-cds/pipeline-Nextflow-module/tree/main/modules/common/intermediate_file_removal), but it used to require more scratch space.
pipeline-align-DNA/module/align_DNA_BWA_MEM2.nf, lines 104 to 107 in acf77d9
|
@tyamaguchi-ucla, @yashpatel6 : Testing info
Note that
But only
Proposed plan
The good news is 15GB memory was enough for
Thoughts? |
Yes, I'm ok with closing this issue. We can create a new issue for the spark error and discuss it there. |
I created a new issue #229 . I will close this issue now. |
Benchmark samtools sort memory usage with a large sample and update the allocation if necessary.