
Sfitz by readgroup and add FastQC #62

Merged
merged 46 commits into main from sfitz-by-readgroup
Jul 19, 2024
Conversation

sorelfitzgibbon
Collaborator

@sorelfitzgibbon sorelfitzgibbon commented Jun 12, 2024

Description

  • Processing is now done with respect to samples, libraries, and readgroups.
  • SAMtools stats now runs at three levels (by sample, by library, and by readgroup) unless there is only a single library or readgroup.
  • FastQC runs at a single level, set by the input parameter fastqc_level.
  • CollectWgsMetrics and Qualimap are functionally unchanged.
  • Added a downsampled HG003 to NFTest.
  • Standardized algorithms to algorithm.
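
As a sketch of the level selection (the parameter name comes from the description above; 'sample' and 'readgroup' are exercised by the NFTest runs below, while 'library' as an accepted value is an assumption based on the three processing levels):

```groovy
// config sketch, not the pipeline's actual template:
// choose the single level at which FastQC runs
params {
    fastqc_level = 'readgroup'   // assumed values: 'sample', 'library', 'readgroup'
}
```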

Closes #42 #44 #45 #50 #63 #66 #67

Testing Results

  • NFTest Samples:
    1. GIAB HG003_0.05x-selected-readgroups
      • HG003 downsampled to ~0.05x coverage (1.4 GB BAM),
        • then 6 readgroups (3 libraries) selected out of 1,005 total readgroups (-> 11 MB BAM, 33,400 reads).
      • 1 sample (HG003) with...
        • library1: 1 readgroup
        • library2: 2 readgroups
        • library3: 3 readgroups
    2. A_mini
      • 2 samples, each with one library and one readgroup
  • NFTest runs
    • hg003-all-tools (sample-level FastQC)
    • a_mini-all-tools
    • a_mini-all-tools-multiple-samples
      • log for the above three runs: /hot/software/pipeline/pipeline-generate-SQC-BAM/Nextflow/development/unreleased/sfitz-by-readgroup/log-nftest-20240626T210242Z.log
    • hg003-fastqc (readgroup-level FastQC)
      • log: /hot/software/pipeline/pipeline-generate-SQC-BAM/Nextflow/development/unreleased/sfitz-by-readgroup/log-nftest-20240626T204641Z.log
    • a_mini-fastqc (sample-level FastQC)
      • log: /hot/software/pipeline/pipeline-generate-SQC-BAM/Nextflow/development/unreleased/sfitz-by-readgroup/log-nftest-20240626T204702Z.log

Checklist

  • I have read the code review guidelines and the code review best practice on GitHub check-list.

  • I have reviewed the Nextflow pipeline standards.

  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].

  • I have set up or verified the branch protection rule following the github standards before opening this pull request.

  • I have added my name to the contributors listings in the manifest block in the nextflow.config as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

  • I have added the changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.

  • I have updated the version number in the metadata.yaml and manifest block of the nextflow.config file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)

  • I have tested the pipeline using NFTest, or I have justified why I did not need to run NFTest above.

@sorelfitzgibbon sorelfitzgibbon requested a review from a team as a code owner June 12, 2024 04:06
Contributor

@yashpatel6 yashpatel6 left a comment


I'll review in greater depth in the next couple of days but a suggestion in the meantime:

Comment on lines 32 to 46
if (task.process == "run_stats_SAMtools_sample") {
    output_filename = generate_standard_filename("SAMtools-${params.samtools_version}",
        params.dataset_id,
        sm_id,
        [:])
    outdir = "."
    log_suffix = "${sm_id}"
} else if (task.process == "run_stats_SAMtools_library") {
    output_filename = generate_standard_filename("SAMtools-${params.samtools_version}",
        params.dataset_id,
        lib_id,
        [:])
    outdir = sm_id
    log_suffix = "${sm_id}/${lib_id}"
} else if (task.process == "run_stats_SAMtools_readgroup") {
    output_filename = generate_standard_filename("SAMtools-${params.samtools_version}",
        params.dataset_id,
        rg_id,
        [:])
    outdir = "${sm_id}/${lib_id}"
    log_suffix = "${sm_id}/${lib_id}/${rg_id}"
}

Contributor

suggestion: Consider re-organizing this block to something like:

if (params.stat_mode == 'library') {
    filename_id = lib_id
    outdir = ...
    log_suffix = ...
} else if (params.stat_mode == 'readgroup') {
    filename_id = rg_id
    outdir = ...
    log_suffix = ...
} else {
    filename_id = sm_id
    outdir = ...
    log_suffix = ...
}

output_filename = generate_standard_filename("SAMtools-${params.samtools_version}",
            params.dataset_id,
            filename_id,
            [:])

and then using addParams when importing the process to set stat_mode.

This reduces the amount of duplicated code (e.g. the standard filename generation call is coded only once) and also removes the need for the process itself to be imported with a specific name; its behavior is instead controlled by a parameter set during import.
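
A minimal sketch of the import side of that suggestion (module path and alias names are illustrative, not necessarily the pipeline's actual ones):

```groovy
// workflow-side sketch: one process definition, parameterized per level at import time;
// aliasing is still needed to invoke the process three times, but the process body
// branches on params.stat_mode instead of task.process
include { run_stats_SAMtools as run_stats_SAMtools_sample    } from './module/samtools' addParams(stat_mode: 'sample')
include { run_stats_SAMtools as run_stats_SAMtools_library   } from './module/samtools' addParams(stat_mode: 'library')
include { run_stats_SAMtools as run_stats_SAMtools_readgroup } from './module/samtools' addParams(stat_mode: 'readgroup')
```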

@yashpatel6 yashpatel6 self-assigned this Jun 13, 2024
@yashpatel6 yashpatel6 self-requested a review June 13, 2024 01:19
@sorelfitzgibbon sorelfitzgibbon changed the title Sfitz by readgroup Sfitz by readgroup and add FastQC Jun 14, 2024
@sorelfitzgibbon
Collaborator Author

@yashpatel6 this is ready for review

Channel
    .fromList(params.libraries_to_process)
    .map { lib ->
        def rg_arg = lib.rgs.collect { "-r ${it}" }.join(' ')
Collaborator Author

I forgot to change this. In a separate PR I will make this a list of readgroups and add the -r within the processes.

Collaborator Author

@sorelfitzgibbon sorelfitzgibbon Jun 18, 2024

Some of the GIAB samples, for example, have >1,000 readgroups. I think I should impose a maximum number of readgroups (and, I suppose, libraries) that will be processed separately by SAMtools stats. For FastQC, though, they will all be processed separately, as this is equivalent to running FastQC on the FASTQ files.

Collaborator Author

I added parameters for the maximum number of readgroups and libraries to process separately, with defaults of 20.

I also parameterized a choice of levels to run FastQC.
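
A config sketch of the caps described above (stats_max_rgs_per_sample matches the config template reviewed later in this PR; the library-level parameter name here is an assumption, so check the README for the actual name):

```groovy
// config sketch: caps on how many units SAMtools stats processes separately
stats_max_rgs_per_sample  = 20   // readgroup cap (default 20)
stats_max_libs_per_sample = 20   // assumed name for the library cap (default 20)
```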

module/fastqc.nf Outdated
Comment on lines 33 to 44
set -euo pipefail
mkdir -p ${sm_id}
samtools view -F 0x900 -h ${rg_arg} ${path} | \
samtools fastq | \
fastqc \
--outdir "${sm_id}" \
--format fastq \
--extract \
--delete \
${params.fastqc_additional_options} \
stdin:${output_filename}


@sorelfitzgibbon
question: I think I understand what this command tries to achieve, but can you elaborate on it a bit (can this -F 0x900 be optional, etc.)? Also, have you tried to incorporate threading? This would probably be really slow for large samples. Is it possible to use fastqc --format bam with threading for the sample-level run?

Collaborator Author

Thanks @tyamaguchi-ucla

  • -F 0x900 removes secondary and supplementary alignments. It is actually the default, but I thought I'd leave it here so there is no question that it's being done.
  • Surprisingly, fastqc --format bam does not allow per-readgroup analysis.
  • I have not tried to incorporate threading. All three commands allow threading; I will try it and test the speedup.
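
As a quick aside on the flag value (an illustration, not pipeline code): 0x900 is the bitwise OR of the SAM SECONDARY (0x100) and SUPPLEMENTARY (0x800) flag bits, which is why -F 0x900 drops exactly those two alignment classes.

```shell
# SAM flag arithmetic: 0x100 (secondary) | 0x800 (supplementary) = 0x900
printf '%d\n' $((0x100 | 0x800))    # combined mask in decimal: 2304
printf '0x%x\n' $((0x100 | 0x800))  # same mask in hex, as passed to -F: 0x900
```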

Collaborator Author

@sorelfitzgibbon sorelfitzgibbon Jun 16, 2024

UPDATE 2: After discussion with @yashpatel6:

  • Since, within metapipeline, this pipeline will eventually be used to gate the running of further pipelines, realtime will be more important than CPU hours; thus the default will be 2 CPUs. More than 2 CPUs provides very little time advantage for the large (more typically sized) readgroup.
  • CPUs will be adjusted down to 1 for input with large numbers of readgroups.

UPDATE: I now think it would be better to avoid threading entirely, at least when running in metapipeline.

  • Testing on a sample with 112 readgroups took twice as long with 2 threads vs 1 (9.5 hours vs 4.7 hours).
    /hot/software/pipeline/pipeline-generate-SQC-BAM/Nextflow/development/unreleased/sfitz-add-mosdepth/slurm-184021.out
    /hot/software/pipeline/pipeline-generate-SQC-BAM/Nextflow/development/unreleased/sfitz-add-mosdepth/slurm-184029.out
  • The per-CPU benefit of 2 threads on samples with fewer readgroups is minimal, if even real.

Below are the results for running with various thread counts on SGSALASC000015-T011-O04-F.bam (185 GB).

  • This sample has five readgroups, with one much larger than the others.
  • All output here: /hot/software/pipeline/pipeline-generate-SQC-BAM/Nextflow/development/unreleased/sfitz-by-readgroup
  • It looks like the sweet spot is 2 threads.

Speedup (single-thread realtime / realtime)

| threads | 1 | 2 | 3 | 4 | 6 | 12 |
| --- | --- | --- | --- | --- | --- | --- |
| small readgroup | 1 | 1.9 | 2.1 | 3.8 | 4 | 3.4 |
| large readgroup | 1 | 2 | 2.5 | 2.2 | 2.6 | 2.5 |
| total CPU hours | 9.3 | 9.2 | 12.2 | 12.4 | 16.7 | 37.0 |

1 thread; full-run CPU hours = 9.3

| name | threads | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
| --- | --- | --- | --- | --- | --- | --- | --- |
| assess_ReadQuality_FastQC (2) | 1 | 1h 24m 33s | 97.50% | 444.2 MB | 3.3 GB | 369.5 GB | 185.2 GB |
| assess_ReadQuality_FastQC (5) | 1 | 1h 27m 31s | 99.70% | 444.8 MB | 3.3 GB | 369.6 GB | 185.4 GB |
| assess_ReadQuality_FastQC (4) | 1 | 1h 27m 31s | 99.70% | 447.5 MB | 3.3 GB | 369.6 GB | 185.4 GB |
| assess_ReadQuality_FastQC (3) | 1 | 1h 27m 31s | 99.80% | 445.4 MB | 3.3 GB | 370.4 GB | 186.2 GB |
| assess_ReadQuality_FastQC (1) | 1 | 3h 31m 22s | 99.60% | 444.5 MB | 3.3 GB | 1.2 TB | 1 TB |

2 threads; full-run CPU hours = 9.2

| name | threads | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
| --- | --- | --- | --- | --- | --- | --- | --- |
| assess_ReadQuality_FastQC (3) | 2 | 44m 11s | 188.30% | 459.8 MB | 3.9 GB | 370.4 GB | 186.2 GB |
| assess_ReadQuality_FastQC (4) | 2 | 44m 11s | 190.20% | 465.7 MB | 3.9 GB | 369.6 GB | 185.4 GB |
| assess_ReadQuality_FastQC (5) | 2 | 44m 10s | 189.90% | 465.2 MB | 3.9 GB | 369.6 GB | 185.4 GB |
| assess_ReadQuality_FastQC (2) | 2 | 38m 54s | 191.90% | 462 MB | 3.9 GB | 369.5 GB | 185.2 GB |
| assess_ReadQuality_FastQC (1) | 2 | 1h 45m 35s | 200.20% | 459.1 MB | 3.9 GB | 1.2 TB | 1 TB |

3 threads; full-run CPU hours = 12.2

| name | threads | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
| --- | --- | --- | --- | --- | --- | --- | --- |
| assess_ReadQuality_FastQC (2) | 3 | 40m 48s | 182.00% | 877.1 MB | 4.9 GB | 369.5 GB | 185.2 GB |
| assess_ReadQuality_FastQC (3) | 3 | 40m 48s | 181.90% | 885.5 MB | 4.9 GB | 370.4 GB | 186.2 GB |
| assess_ReadQuality_FastQC (4) | 3 | 38m 42s | 187.30% | 885.5 MB | 4.9 GB | 369.6 GB | 185.4 GB |
| assess_ReadQuality_FastQC (5) | 3 | 38m 43s | 187.50% | 880.5 MB | 5 GB | 369.6 GB | 185.4 GB |
| assess_ReadQuality_FastQC (1) | 3 | 1h 24m 36s | 237.80% | 894.7 MB | 5 GB | 1.2 TB | 1 TB |

4 threads; full-run CPU hours = 12.4

| name | threads | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
| --- | --- | --- | --- | --- | --- | --- | --- |
| assess_ReadQuality_FastQC (2) | 4 | 22m 33s | 350.10% | 1.1 GB | 5.7 GB | 369.5 GB | 185.2 GB |
| assess_ReadQuality_FastQC (3) | 4 | 22m 19s | 358.50% | 905.9 MB | 5.8 GB | 370.4 GB | 186.2 GB |
| assess_ReadQuality_FastQC (5) | 4 | 22m 7s | 356.40% | 970.9 MB | 5.7 GB | 369.6 GB | 185.4 GB |
| assess_ReadQuality_FastQC (4) | 4 | 22m 10s | 353.60% | 971.2 MB | 5.7 GB | 369.6 GB | 185.4 GB |
| assess_ReadQuality_FastQC (1) | 4 | 1h 37m 8s | 233.10% | 1.1 GB | 5.7 GB | 1.2 TB | 1 TB |

6 threads; full-run CPU hours = 16.7

| name | threads | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
| --- | --- | --- | --- | --- | --- | --- | --- |
| assess_ReadQuality_FastQC (2) | 6 | 21m 36s | 332.00% | 1.5 GB | 6.9 GB | 369.5 GB | 185.2 GB |
| assess_ReadQuality_FastQC (3) | 6 | 21m 36s | 319.90% | 1.4 GB | 6.8 GB | 370.4 GB | 186.2 GB |
| assess_ReadQuality_FastQC (5) | 6 | 21m 36s | 332.20% | 1.5 GB | 6.9 GB | 369.6 GB | 185.4 GB |
| assess_ReadQuality_FastQC (4) | 6 | 21m 36s | 318.70% | 1.5 GB | 6.9 GB | 369.6 GB | 185.4 GB |
| assess_ReadQuality_FastQC (1) | 6 | 1h 20m 29s | 263.50% | 1.4 GB | 7 GB | 1.2 TB | 1 TB |

12 threads; full-run CPU hours = 37.0

| name | threads | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
| --- | --- | --- | --- | --- | --- | --- | --- |
| assess_ReadQuality_FastQC (5) | 12 | 25m 22s | 271.30% | 2.5 GB | 10.4 GB | 369.6 GB | 185.4 GB |
| assess_ReadQuality_FastQC (2) | 12 | 25m 22s | 265.00% | 2.3 GB | 10.6 GB | 369.5 GB | 185.2 GB |
| assess_ReadQuality_FastQC (4) | 12 | 25m 22s | 265.00% | 2.4 GB | 10.5 GB | 369.6 GB | 185.4 GB |
| assess_ReadQuality_FastQC (3) | 12 | 25m 22s | 271.30% | 2.5 GB | 10.4 GB | 370.4 GB | 186.2 GB |
| assess_ReadQuality_FastQC (1) | 12 | 1h 23m 44s | 250.30% | 2.1 GB | 10.5 GB | 1.2 TB | 1 TB |


> -F 0x900 removes secondary and supplementary alignments. It is actually the default, but I thought I'd leave it here so there is no question that it's being done.

Thanks for the benchmarking results. I wasn't aware of this default. Do we want to include them, though (i.e. -f?) to get a complete result for the RG-level FASTQs, etc.?

Collaborator Author

No, I don't think we want to include them. They are duplicates of reads that are already represented; removing them produces output that directly matches the original FASTQs.


@sorelfitzgibbon (suggestion/question: non-blocking) You've generated lots of benchmarking results, and we should post them on GitHub Discussions as well. Also, do you typically adjust for caching performance when conducting benchmarking?

Collaborator Author

> do you typically adjust for caching performance when conducting benchmarking

Is there a recommended way of doing that? I ran almost all of it on nodes that were already running and had space available, at more or less the same time. I could run them all concurrently on a single F72, although I assume even that background will change as jobs start to finish.

Contributor

Caching adjustments are hard to control in general; a way around it could be to make the tests totally independent of the cache by copying the input files to /scratch and then running all benchmarking runs against the files under scratch.
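
A hedged sketch of that staging step (paths and filenames here are illustrative; a real run would copy the actual BAMs from their /hot locations):

```shell
# Stage inputs on node-local scratch so benchmark reads bypass the shared-FS cache
scratch_dir=$(mktemp -d)                                      # stand-in for a /scratch allocation
echo "placeholder BAM contents" > "${scratch_dir}/input.bam"  # real run: cp the BAM from /hot
ls "${scratch_dir}"                                           # confirm the staged copy before benchmarking
rm -rf "${scratch_dir}"                                       # clean up afterwards
```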

Collaborator Author

I will do some cache independent benchmarking.


> Caching adjustments are hard to control in general; a way around it could be to make the tests totally independent of the cache by copying the input files to /scratch and then running all benchmarking runs against the files under scratch.

Yeah, I was going to suggest this. Maybe it's worth creating a Confluence page with general benchmarking recommendations? We've discussed pre-caching, etc. with OHIA and MSFT a while ago, but using /scratch would be the easiest solution.

@sorelfitzgibbon
Collaborator Author

@yashpatel6 I think this is ready for another review. I will update based on cache-independent benchmarking in the next PR.

@sorelfitzgibbon
Collaborator Author

> @yashpatel6 I think this is ready for another review. I will update based on cache-independent benchmarking in the next PR.

Actually hold off. I need to stop single libraries and readgroups being run multiple times.

@yashpatel6
Contributor

> @yashpatel6 I think this is ready for another review. I will update based on cache-independent benchmarking in the next PR.
>
> Actually hold off. I need to stop single libraries and readgroups being run multiple times.

Gotcha, feel free to re-request the review when ready!

@sorelfitzgibbon
Collaborator Author

In order to optimize this pipeline, it would be helpful to start specifying what QC results would trigger a cessation of the metapipeline run. For example, coverage calculations take the longest and may not be necessary before the gatekeeping; results from SAMtools stats should be sufficient, e.g. read QC and count, mapping QC, and % duplicated.

In this case, perhaps this pipeline could be run twice within metapipeline: first targeting the gatekeeping information (SAMtools stats) and second filling in the rest (FastQC and coverage (mosdepth or CollectWgsMetrics)).

Regardless, I hope to merge this PR and implement optimizations after adding mosdepth. I've created issues for the various items left hanging.

I started a Discussion about goals for integrating sample-level QC into metapipeline-DNA.

Contributor

@yashpatel6 yashpatel6 left a comment

A couple of comments and suggestions, but otherwise looks good:

}
}
}
withName: run_stats_SAMtools_readgroup {
Contributor

I'm ok with maintaining the three sections and using run_statsReadgroups_SAMtools

reference = '/hot/ref/reference/GRCh38-BI-20160721/Homo_sapiens_assembly38.fasta'
output_dir = '/path/to/output/directory'
blcds_registered_dataset = false // if you want the output to be registered
save_intermediate_files = true

// SAMtools stats options
samtools_remove_duplicates = false
samtools_stats_additional_options = ''
stats_max_rgs_per_sample = 20
Contributor

suggestion: Do we also want to add the max libs parameter here?

Collaborator Author

@sorelfitzgibbon sorelfitzgibbon Jun 27, 2024

@yashpatel6 I left it off in the interest of streamlining the template, since library counts large enough to be an issue are currently unlikely. It is in the README. Do you want me to add it?

Contributor

For consistency, I would suggest adding it

@sorelfitzgibbon
Collaborator Author

done and tested:
/hot/software/pipeline/pipeline-generate-SQC-BAM/Nextflow/development/unreleased/sfitz-by-readgroup/log-nftest-20240627T222505Z.log

Contributor

@yashpatel6 yashpatel6 left a comment

Generally looks good technically! The specific usage in terms of failing/passing QC we can discuss in the metapipeline context.

@sorelfitzgibbon sorelfitzgibbon merged commit ef50a59 into main Jul 19, 2024
7 checks passed
@sorelfitzgibbon sorelfitzgibbon deleted the sfitz-by-readgroup branch July 23, 2024 19:40