This pipeline takes BAMs and runs selected Quality Control (QC) steps. The currently available algorithms are SAMtools stats, Picard CollectWgsMetrics and Qualimap bamqc. Generally, either Qualimap bamqc or the combination of SAMtools stats and Picard CollectWgsMetrics should be run, not both. Qualimap bamqc uses a lot of memory and should not be run within uclahs-cds/metapipeline-DNA. Input can include any combination of tumor and normal BAMs from a single donor; each BAM is processed independently. RNA-specific QC is not yet implemented but is expected soon.
- Update the params section of the .config file (Example config).
- Update the input YAML (Template YAMLs).
- See the submission script to submit your pipeline.

Currently supported Nextflow versions: v23.04.2
Each of the below algorithms, if selected, will run in parallel subject to available resources.
- Note about duplicated reads: SAMtools stats does not ignore reads marked as duplicate by default. The option `stats_remove_duplicates` can be set to `true` to override this. Picard CollectWgsMetrics and Qualimap bamqc do ignore reads marked as duplicate by default.
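
To illustrate what "ignoring reads marked as duplicate" means, the sketch below counts reads with and without a duplicate filter. It relies on the SAM flag bit 0x400 (PCR or optical duplicate). The helper name and the flag values are made up for this example; this is not how the pipeline or the tools are implemented.

```python
from typing import Iterable

# SAM flag bit set by duplicate-marking tools (0x400 = PCR or optical duplicate).
DUPLICATE_FLAG = 0x400


def count_reads(flags: Iterable[int], ignore_duplicates: bool) -> int:
    """Count reads, optionally skipping those flagged as duplicates.

    Hypothetical helper for illustration only; it is not part of the pipeline.
    """
    return sum(
        1
        for flag in flags
        if not (ignore_duplicates and flag & DUPLICATE_FLAG)
    )


# Made-up SAM flags: two ordinary reads and the same two marked as duplicates.
example_flags = [99, 147, 99 | DUPLICATE_FLAG, 147 | DUPLICATE_FLAG]

default_count = count_reads(example_flags, ignore_duplicates=False)
filtered_count = count_reads(example_flags, ignore_duplicates=True)
print(default_count, filtered_count)
```

With the filter off, every read contributes to the statistics, which mirrors the default SAMtools stats behaviour described above; with it on, the two flagged reads are skipped.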

SAMtools stats collects basic statistics from BAM files including read counts, qualities, GC content, insert sizes, read lengths, proper pairing, and duplicated bases.

Picard CollectWgsMetrics collects coverage metrics from WGS BAM files.

Qualimap bamqc collects basic statistics and coverage metrics from BAM files. Example output: html, pdf. Qualimap bamqc uses a lot of memory and should not be run within uclahs-cds/metapipeline-DNA.

Example:

```yaml
---
patient_id: 'patient_id'
dataset_id: 'dataset_id'
input:
  normal:
    - path: /absolute/path/to/normal.bam
      read_length: length
  tumor:
    - path: /absolute/path/to/tumor.bam
      read_length: length
```
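
When many donors need input files, the layout above can be generated rather than written by hand. The following is a minimal sketch that uses only the field names shown in the example. The function name, the integer type for `read_length`, and the choice to omit a sample type that has no BAMs are assumptions, not pipeline behaviour confirmed by this README.

```python
from typing import List, Tuple

# (absolute path to BAM, read length); an integer read length is an assumption.
BamEntry = Tuple[str, int]


def render_input_yaml(
    patient_id: str,
    dataset_id: str,
    normal: List[BamEntry],
    tumor: List[BamEntry],
) -> str:
    """Render an input YAML in the layout shown in the example above.

    Hypothetical helper for illustration only; it is not part of the pipeline.
    """
    lines = [
        "---",
        f"patient_id: '{patient_id}'",
        f"dataset_id: '{dataset_id}'",
        "input:",
    ]
    for sample_type, entries in (("normal", normal), ("tumor", tumor)):
        if not entries:
            # Assumption: a sample type with no BAMs is simply left out.
            continue
        lines.append(f"  {sample_type}:")
        for path, read_length in entries:
            lines.append(f"    - path: {path}")
            lines.append(f"      read_length: {read_length}")
    return "\n".join(lines) + "\n"


print(
    render_input_yaml(
        "patient_id",
        "dataset_id",
        normal=[("/absolute/path/to/normal.bam", 150)],
        tumor=[("/absolute/path/to/tumor.bam", 150)],
    )
)
```

The read length of 150 is a placeholder value; substitute the actual read length of each BAM.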

Field | Type | Required | Description |
---|---|---|---|
algorithm | list | no | List of tools to be run: ['stats', 'collectwgsmetrics', 'bamqc'], default = ['stats', 'collectwgsmetrics'] |
reference | path | yes/no | Reference fasta is required only for CollectWgsMetrics |
output_dir | path | yes | Not required if blcds_registered_dataset = true |
blcds_registered_dataset | boolean | no | Default is false. Only uclahs_cds users should change this. When true, BLCDS folder structure is used |
work_dir | path | no | Path of working directory for Nextflow. When included, Nextflow intermediate files and logs will be saved to this directory. With uclahs_cds = true, the default is /scratch and should only be changed for testing/development. Changing this directory to /hot or /tmp can lead to high server latency and potential disk space limitations, respectively. |

SAMtools stats options:

Field | Type | Required | Description |
---|---|---|---|
stats_max_rgs_per_sample | integer | no | If a sample has more than this number of readgroups, SAMtools stats will not run per readgroup analysis. Default = 20 |
stats_max_libs_per_sample | integer | no | If a sample has more than this number of libraries, SAMtools stats will not run per library analysis. Default = 20 |
stats_remove_duplicates | boolean | no | Ignore reads marked as duplicate. Default = false |
stats_additional_options | string | no | Any additional options recognized by samtools stats |

FastQC options:

Field | Type | Required | Description |
---|---|---|---|
fastqc_level | string | yes | 'readgroup', 'library' or 'sample' |
fastqc_additional_options | string | no | Any additional options recognized by FastQC |

Picard CollectWgsMetrics options:

Field | Type | Required | Description |
---|---|---|---|
cwm_coverage_cap | integer | no | Cap coverage at this value. Default = 250 |
cwm_minimum_mapping_quality | integer | no | Ignore reads with mapping quality below this value. Default = 20 |
cwm_minimum_base_quality | integer | no | Ignore bases with quality below this value. Default = 20 |
cwm_use_fast_algorithm | boolean | no | If true, fast algorithm is used |
cwm_additional_options | string | no | Any additional options recognized by CollectWgsMetrics |

Qualimap bamqc options:

Field | Type | Required | Description |
---|---|---|---|
bamqc_output_format | string | no | Choice of 'pdf' or 'html', default = 'pdf' |
bamqc_additional_options | string | no | Any additional options recognized by bamqc |

To update the base resource (cpus or memory) allocations for processes, use the following structure. The default allocations can be found in the node-specific config files.

```
base_resource_update {
    memory = [
        [['process_name', 'process_name2'], <multiplier for resource>],
        [['process_name3', 'process_name4'], <different multiplier for resource>]
    ]
    cpus = [
        [['process_name', 'process_name2'], <multiplier for resource>],
        [['process_name3', 'process_name4'], <different multiplier for resource>]
    ]
}
```

Note: Resource updates are applied in the order they're provided, so if a process is included twice in the memory list, it is updated twice, in the order given.

Examples:

- To double memory of all processes:

```
base_resource_update {
    memory = [
        [[], 2]
    ]
}
```

- To double memory for `run_CollectWgsMetrics_Picard` and triple memory for `run_statsSamples_SAMtools` and `run_bamqc_Qualimap`:

```
base_resource_update {
    memory = [
        ['run_CollectWgsMetrics_Picard', 2],
        [['run_statsSamples_SAMtools', 'run_bamqc_Qualimap'], 3]
    ]
}
```

- To double CPUs and memory for `run_CollectWgsMetrics_Picard` and double memory for `run_statsSamples_SAMtools`:

```
base_resource_update {
    cpus = [
        ['run_CollectWgsMetrics_Picard', 2]
    ]
    memory = [
        [['run_CollectWgsMetrics_Picard', 'run_statsSamples_SAMtools'], 2]
    ]
}
```
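
The ordering rule from the note (a process listed twice is updated twice) can be sketched in Python. This is an illustration of the described semantics only, not the pipeline's implementation: the base memory values are made up, the function name is hypothetical, and the handling of an empty list as "all processes" and of a bare string as a single process is inferred from the examples above.

```python
from typing import Dict, List, Sequence, Tuple, Union

# One update: (process name or list of names, multiplier). An empty list means all processes.
Update = Tuple[Union[str, Sequence[str]], float]

# Hypothetical base memory allocations in GB; real defaults are in the node-specific config files.
BASE_MEMORY_GB: Dict[str, float] = {
    "run_CollectWgsMetrics_Picard": 4.0,
    "run_statsSamples_SAMtools": 2.0,
    "run_bamqc_Qualimap": 8.0,
}


def apply_updates(base: Dict[str, float], updates: List[Update]) -> Dict[str, float]:
    """Apply multipliers in the order given; repeated processes are scaled repeatedly."""
    result = dict(base)
    for names, multiplier in updates:
        if isinstance(names, str):
            targets = [names]          # single process given as a bare string
        elif len(names) == 0:
            targets = list(result)     # empty list selects every process
        else:
            targets = list(names)
        for name in targets:
            result[name] = result[name] * multiplier
    return result


# The Picard process appears twice, so it is scaled by 2 and then by 3 (6x overall).
updated = apply_updates(
    BASE_MEMORY_GB,
    [
        ("run_CollectWgsMetrics_Picard", 2),
        (["run_CollectWgsMetrics_Picard", "run_statsSamples_SAMtools"], 3),
    ],
)
print(updated)
```

Because the multipliers compound, listing a process in two entries is rarely what is wanted; a single entry with the intended final multiplier is easier to reason about.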

Output | Description |
---|---|
{SAMtools-version}_{dataset_id}_{sample_id}_stats.txt | SAMtools stats results |
{Picard-version}_{dataset_id}_{sample_id}_wgs-metrics.txt | Picard CollectWgsMetrics results |
{Qualimap-version}_{dataset_id}_{sample_id}_stats | Directory of Qualimap results, including genome_results.txt and either .pdf or .html and supporting directories |

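
The naming patterns in the table can be expanded programmatically, for example to check that a run produced what was expected. The sketch below is a hypothetical helper, not part of the pipeline. It treats each `{Tool-version}` placeholder as an opaque string supplied by the caller, because this README does not spell out how that placeholder is formatted.

```python
from typing import Dict, List

# Maps each `algorithm` value to (version placeholder, filename suffix) from the output table.
OUTPUT_PATTERNS = {
    "stats": ("SAMtools-version", "_stats.txt"),
    "collectwgsmetrics": ("Picard-version", "_wgs-metrics.txt"),
    "bamqc": ("Qualimap-version", "_stats"),
}


def expected_outputs(
    algorithms: List[str],
    version_tokens: Dict[str, str],
    dataset_id: str,
    sample_id: str,
) -> List[str]:
    """Build the expected output names for one sample (hypothetical helper)."""
    names = []
    for algorithm in algorithms:
        token_key, suffix = OUTPUT_PATTERNS[algorithm]
        prefix = version_tokens[token_key]  # caller supplies the expanded placeholder
        names.append(f"{prefix}_{dataset_id}_{sample_id}{suffix}")
    return names


# Placeholder tokens are kept literal here; replace them with the values from a real run.
tokens = {
    "SAMtools-version": "{SAMtools-version}",
    "Picard-version": "{Picard-version}",
}
for name in expected_outputs(["stats", "collectwgsmetrics"], tokens, "dataset_id", "sample_id"):
    print(name)
```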
- Issue Tracker to report errors and enhancement ideas.
- Discussions can take place in generate-SQC-BAM Discussions
- generate-SQC-BAM Pull Requests are also open for discussion
Please see list of Contributors at GitHub.
Generate-SQC-BAM is licensed under the GNU General Public License version 2. See the file LICENSE for the terms of the GNU GPL license.
Generate-SQC-BAM takes BAM files and generates per sample QC metrics
Copyright (C) 2024 University of California Los Angeles ("Boutros Lab") All rights reserved.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.