Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sfitz concat vcf #213

Merged
merged 18 commits into from
Aug 10, 2023
Merged
Show file tree
Hide file tree
Changes from 17 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 4 additions & 2 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]

### Added
- Add concatenation of consensus variants to one VCF
- Add variant intersection Venn diagram
- Add regions filter to variant intersections
- Add second BCFtools step to create full presence/absence variant table (including private)
Expand All @@ -15,8 +16,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Add `split_VCF_bcftools` to `Mutect2` workflow, separating SNVs, MNVs and Indels

### Changed
- reconfigure `intersect_regions` to use all contigs except `decoy`
- reconfigure `call_regions` to `intersect_regions`
- Update resource allocation to include new processes
- Reconfigure `intersect_regions` to use all contigs except `decoy`
- Reconfigure `call_regions` to `intersect_regions`
- Update to BCFtools v1.17
- Keep `bam-readcount` output in `SomaticSniper` QC folder
- Update `MuSE` to `v2.0.2`
Expand Down
22 changes: 21 additions & 1 deletion config/F16.config
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't done any runs with large samples since adding plot_VennDiagram_R or concat_VCFs_BCFtools so these are just guesses. These two processes will run together, but only after everything is done. I doubt they use much memory so I don't think it matters much. The next PR, add maf, will add one more process and may be the last PR before release. With that I could test with large samples and look at memory as well as which processes will use more cpus.

Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
// Other processes after create_IndelCandidate_SAMtools will only run one at a time, so
// Other processes will only run one at a time, so
// we don't need to control their resources.

process {
Expand Down Expand Up @@ -78,4 +78,24 @@ process {
}
}
}
withName: plot_VennDiagram_R {
cpus = 2
memory = 5.GB
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can VennDiagram take 2 CPUs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I sort of mentioned in the PR description, these are just placeholders until I do a large bam test run before the release. I will adjust these and the processes added in sfitz-add-maf at that time.

retry_strategy {
memory {
strategy = 'add'
operand = 10.GB
}
}
}
withName: concat_VCFs_BCFtools {
cpus = 2
memory = 5.GB
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this process doesn't use 2 CPUs?

retry_strategy {
memory {
strategy = 'add'
operand = 10.GB
}
}
}
}
8 changes: 8 additions & 0 deletions config/F2.config
Original file line number Diff line number Diff line change
Expand Up @@ -26,4 +26,12 @@ process {
withName: call_sSNV_Strelka2 {
cpus = 2
}
withName: plot_VennDiagram_R {
cpus = 2
memory = 1.GB
}
withName: concat_VCFs_BCFtools {
cpus = 2
memory = 2.GB
}
}
20 changes: 20 additions & 0 deletions config/F32.config
Original file line number Diff line number Diff line change
Expand Up @@ -78,4 +78,24 @@ process {
}
}
}
withName: plot_VennDiagram_R {
cpus = 2
memory = 5.GB
retry_strategy {
memory {
strategy = 'add'
operand = 10.GB
}
}
}
withName: concat_VCFs_BCFtools {
cpus = 2
memory = 5.GB
retry_strategy {
memory {
strategy = 'add'
operand = 10.GB
}
}
}
}
20 changes: 20 additions & 0 deletions config/F72.config
Original file line number Diff line number Diff line change
Expand Up @@ -78,4 +78,24 @@ process {
}
}
}
withName: plot_VennDiagram_R {
cpus = 2
memory = 5.GB
retry_strategy {
memory {
strategy = 'add'
operand = 10.GB
}
}
}
withName: concat_VCFs_BCFtools {
cpus = 2
memory = 5.GB
retry_strategy {
memory {
strategy = 'add'
operand = 10.GB
}
}
}
}
20 changes: 20 additions & 0 deletions config/M64.config
Original file line number Diff line number Diff line change
Expand Up @@ -78,4 +78,24 @@ process {
}
}
}
withName: plot_VennDiagram_R {
cpus = 2
memory = 5.GB
retry_strategy {
memory {
strategy = 'add'
operand = 10.GB
}
}
}
withName: concat_VCFs_BCFtools {
cpus = 2
memory = 5.GB
retry_strategy {
memory {
strategy = 'add'
operand = 10.GB
}
}
}
}
8 changes: 4 additions & 4 deletions main.nf
Original file line number Diff line number Diff line change
Expand Up @@ -63,29 +63,29 @@ include { somaticsniper } from './module/somaticsniper' addParams(
include { strelka2 } from './module/strelka2' addParams(
workflow_output_dir: "${params.output_dir_base}/Strelka2-${params.strelka2_version}",
workflow_log_output_dir: "${params.log_output_dir}/process-log/Strelka2-${params.strelka2_version}",
output_filename: generate_standard_filename("Strelka2_${params.strelka2_version}",
output_filename: generate_standard_filename("Strelka2-${params.strelka2_version}",
params.dataset_id,
params.sample_id,
[:]))
include { mutect2 } from './module/mutect2' addParams(
workflow_output_dir: "${params.output_dir_base}/Mutect2-${params.GATK_version}",
workflow_log_output_dir: "${params.log_output_dir}/process-log/Mutect2-${params.GATK_version}",
output_filename: generate_standard_filename("Mutect2_${params.GATK_version}",
output_filename: generate_standard_filename("Mutect2-${params.GATK_version}",
params.dataset_id,
params.sample_id,
[:]))
include { muse } from './module/muse' addParams(
workflow_output_dir: "${params.output_dir_base}/MuSE-${params.MuSE_version}",
workflow_log_output_dir: "${params.log_output_dir}/process-log/MuSE-${params.MuSE_version}",
output_filename: generate_standard_filename("MuSE_${params.MuSE_version}",
output_filename: generate_standard_filename("MuSE-${params.MuSE_version}",
params.dataset_id,
params.sample_id,
[:]))

include { intersect } from './module/intersect' addParams(
workflow_output_dir: "${params.output_dir_base}/intersect-BCFtools-${params.BCFtools_version}",
workflow_log_output_dir: "${params.log_output_dir}/process-log/intersect-BCFtools-${params.BCFtools_version}",
output_filename: generate_standard_filename("Consensus",
output_filename: generate_standard_filename("BCFtools-${params.BCFtools_version}",
params.dataset_id,
params.sample_id,
[:]))
Expand Down
48 changes: 40 additions & 8 deletions module/intersect-processes.nf
Original file line number Diff line number Diff line change
Expand Up @@ -14,14 +14,16 @@ process intersect_VCFs_BCFtools {
pattern: "*.vcf.gz*"
publishDir path: "${params.workflow_output_dir}/output",
mode: "copy",
pattern: "isec-2-or-more"
pattern: "isec-2-or-more/*.txt",
saveAs: { "${file(it).getParent().getName()}/${params.output_filename}_${file(it).getName()}" }
publishDir path: "${params.workflow_output_dir}/output",
mode: "copy",
pattern: "isec-1-or-more/*.txt"
pattern: "isec-1-or-more/*.txt",
saveAs: { "${file(it).getParent().getName()}/${params.output_filename}_${file(it).getName()}" }
publishDir path: "${params.workflow_log_output_dir}",
mode: "copy",
pattern: ".command.*",
saveAs: { "${task.process.replace(':', '/')}-${task.index}/log${file(it).getName()}" }
saveAs: { "${task.process.replace(':', '/')}/log${file(it).getName()}" }
tyamaguchi-ucla marked this conversation as resolved.
Show resolved Hide resolved

input:
path vcfs
Expand All @@ -33,8 +35,8 @@ process intersect_VCFs_BCFtools {
path "*.vcf.gz", emit: consensus_vcf
path "*.vcf.gz.tbi", emit: consensus_idx
path ".command.*"
path "isec-2-or-more"
path "isec-1-or-more", emit: isec_dir
path "isec-2-or-more/*.txt"
path "isec-1-or-more/*.txt", emit: isec

script:
vcf_list = vcfs.join(' ')
Expand All @@ -58,11 +60,11 @@ process intersect_VCFs_BCFtools {
publishDir path: "${params.workflow_log_output_dir}",
mode: "copy",
pattern: ".command.*",
saveAs: { "${task.process.replace(':', '/')}-${task.index}/log${file(it).getName()}" }
saveAs: { "${task.process.replace(':', '/')}/log${file(it).getName()}" }

input:
path script_dir
path isec_dir
path isec

output:
path ".command.*"
Expand All @@ -71,6 +73,36 @@ process intersect_VCFs_BCFtools {
script:
"""
set -euo pipefail
Rscript ${script_dir}/plot-venn.R --isec_dir ${isec_dir} --outfile ${params.output_filename}_Venn-diagram.tiff
Rscript ${script_dir}/plot-venn.R --isec_readme README.txt --isec_sites sites.txt --outfile ${params.output_filename}_Venn-diagram.tiff
tyamaguchi-ucla marked this conversation as resolved.
Show resolved Hide resolved
"""
}

process concat_VCFs_BCFtools {
container params.docker_image_BCFtools
publishDir path: "${params.workflow_output_dir}/intermediate/${task.process.split(':')[-1]}",
mode: "copy",
pattern: "*concat.vcf",
enabled: params.save_intermediate_files
Copy link
Contributor Author

@sorelfitzgibbon sorelfitzgibbon Jul 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this intermediate file will be used by vcf2maf (and has to be uncompressed)

publishDir path: "${params.workflow_log_output_dir}",
mode: "copy",
pattern: ".command.*",
saveAs: { "${task.process.replace(':', '/')}/log${file(it).getName()}" }

input:
path vcfs
path indices

output:
path "*concat.vcf", emit: concat_vcf
path ".command.*"

script:
vcf_list = vcfs.join(' ')
"""
set -euo pipefail
# BCFtools concat to create a single VCF with all nfiles +2 variants
# output header is a uniquified concatenation of all headers
# output `INFO` `FORMAT` `NORMAL` and `TUMOR` fields are from the first listed VCF that has the variant
yashpatel6 marked this conversation as resolved.
Show resolved Hide resolved
bcftools concat --output-type v --output ${params.output_filename}_SNV-concat.vcf --allow-overlaps --rm-dups all ${vcf_list}
yashpatel6 marked this conversation as resolved.
Show resolved Hide resolved
"""
}
48 changes: 42 additions & 6 deletions module/intersect.nf
Original file line number Diff line number Diff line change
@@ -1,5 +1,12 @@
include { generate_sha512sum } from './common'
include { intersect_VCFs_BCFtools; plot_VennDiagram_R } from './intersect-processes.nf'
include { intersect_VCFs_BCFtools; plot_VennDiagram_R; concat_VCFs_BCFtools } from './intersect-processes.nf'
include { compress_index_VCF } from '../external/pipeline-Nextflow-module/modules/common/index_VCF_tabix/main.nf' addParams(
options: [
output_dir: params.workflow_output_dir,
log_output_dir: params.workflow_log_output_dir,
bgzip_extra_args: params.bgzip_extra_args,
tabix_extra_args: params.tabix_extra_args
])

workflow intersect {
take:
Expand All @@ -8,22 +15,51 @@ workflow intersect {
script_dir_ch

main:
vcfs_ch = tool_vcfs
.map {
it.sort { a, b ->
def toolA = file(a).getName()
def toolB = file(b).getName()
return toolA.compareTo(toolB)
}
yashpatel6 marked this conversation as resolved.
Show resolved Hide resolved
}
intersect_VCFs_BCFtools(
tool_vcfs,
vcfs_ch,
tool_indices,
params.intersect_regions,
params.intersect_regions_index
)
plot_VennDiagram_R(
script_dir_ch,
intersect_VCFs_BCFtools.out.isec,
)
consensus_vcfs_ch = intersect_VCFs_BCFtools.out.consensus_vcf
.map {
it.sort { a, b ->
def toolA = file(a).getName()
def toolB = file(b).getName()
return toolA.compareTo(toolB)
}
}
concat_VCFs_BCFtools(
consensus_vcfs_ch,
intersect_VCFs_BCFtools.out.consensus_idx
)
compress_index_VCF(concat_VCFs_BCFtools.out.concat_vcf
.map{ it -> ['SNV', it]}
)
file_for_sha512 = intersect_VCFs_BCFtools.out.consensus_vcf
.flatten()
.map{ it -> ["${file(it).getName().split('_')[0]}-SNV-vcf", it]}
.mix(intersect_VCFs_BCFtools.out.consensus_idx
.flatten()
.map{ it -> ["${file(it).getName().split('_')[0]}-SNV-idx", it]}
)
.mix(compress_index_VCF.out.index_out
.map{ it -> ["intersect-${it[0]}-vcf", it[1]] }
)
.mix(compress_index_VCF.out.index_out
.map{ it -> ["intersect-${it[0]}-index", it[2]] }
)
generate_sha512sum(file_for_sha512)
plot_VennDiagram_R(
script_dir_ch,
intersect_VCFs_BCFtools.out.isec_dir,
)
}
22 changes: 14 additions & 8 deletions r-scripts/plot-venn.R
Original file line number Diff line number Diff line change
@@ -1,8 +1,9 @@
# Script to plot a Venn diagram of shared variants from the different SNV calling algorithms, using the output of BCFtools isec
# Initial commit: Sorel Fitz-Gibbon 2023-06-29
# Input:
# -i, --isec_dir: The directory containing the output from BCFtools intersect
# -d, --dataset: The dataset ID passed from nextflow
# -r, --isec_readme: The README.txt file from BCFtools intersect
# -s, --isec_sites: The sites.txt file from BCFtools intersect
# -o, --outfile: The output filename
# Output:
# - A Venn diagram of shared variant counts from the BCFtools intersection of the VCF files

Expand All @@ -12,7 +13,8 @@ library('VennDiagram');

## Parse the arguments #############################################################################
parser <- ArgumentParser();
parser$add_argument('-i', '--isec_dir', help = 'The directory containing the output from BCFtools intersect', type = 'character');
parser$add_argument('-r', '--isec_readme', help = 'The README.txt file from BCFtools intersect', type = 'character');
parser$add_argument('-s', '--isec_sites', help = 'The sites.txt file from BCFtools intersect', type = 'character');
parser$add_argument('-o', '--outfile', help = 'Output filename', type = 'character');
args <- parser$parse_args();

Expand All @@ -31,20 +33,24 @@ plot.venn <- function(tool.variants, outfile) {

### Main ###########################################################################################
# Get intersection counts from BCFtools isec output and format for plotting
algorithms <- readLines(paste0(args$isec_dir,'/README.txt'));
algorithms <- algorithms[grep(paste0('^', args$isec_dir), algorithms)];
algorithms <- gsub(paste0(args$isec_dir,'.*\t'), '', algorithms);
algorithms <- readLines(args$isec_readme);
tyamaguchi-ucla marked this conversation as resolved.
Show resolved Hide resolved
algorithms <- algorithms[grep('^isec-1-or-more', algorithms)];
algorithms <- gsub('isec-1-or-more.*\t', '', algorithms);
algorithms <- gsub('-.*', '', algorithms);
sites <- read.table(paste0(args$isec_dir,'/sites.txt'), header = FALSE, colClasses = 'character');

sites <- read.table(args$isec_sites, header = FALSE, colClasses = 'character');
split.col <- strsplit(as.character(sites$V5), '');
sites$col1 <- sapply(split.col, '[', 1);
sites$col2 <- sapply(split.col, '[', 2);
sites$col3 <- sapply(split.col, '[', 3);
sites$col4 <- sapply(split.col, '[', 4);
sites$V5 <- NULL;

header <- c('chrom', 'pos', 'ref', 'alt', algorithms);
colnames(sites) <- header

variants <- paste(sites$chrom, sites$pos, sep = '_');
tool.variants <- lapply(sites[, algorithms], function(x) variants[x == 1])
tool.variants <- lapply(sites[, algorithms], function(x) variants[x == 1]);
tool.variants.ordered <- tool.variants[order(lengths(tool.variants), decreasing = TRUE)];

plot.venn(tool.variants.ordered, args$outfile);
Loading