update how PipeVal is being used to validate input and output files #240

Closed
wants to merge 15 commits into from
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -10,6 +10,7 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
## [Unreleased]

### Changed
+ - Replace `blcdsdockerregistry/validate:2.1.5` (out of date) with the version from [pipeline-Nextflow-module](https://github.com/uclahs-cds/pipeline-Nextflow-module/tree/main/modules/PipeVal/validate).
Contributor

We might want to specify that we're using v3.0.0. The module may be updated to newer versions, but the version the pipeline actually uses is still dictated by default.config.

- Update input csv according to [here](https://confluence.mednet.ucla.edu/pages/viewpage.action?spaceKey=BOUTROSLAB&title=2022-07-27+Nextflow+Working+Group+Meeting+Notes) (Section "Input structures for alignment pipelines")
- `run_MarkDuplicatesSpark_GATK` now retries once with 130GB on F72, and 140GB on M64
- Update registered output function
3 changes: 2 additions & 1 deletion config/default.config
@@ -14,12 +14,13 @@ params {
// tools and their versions
bwa_version = "BWA-MEM2-2.2.1"
hisat2_version = "HISAT2-2.2.1"
+ pipeval_version = "2.1.6"
Contributor

@tyamaguchi-ucla Aug 31, 2022


docker_image_bwa_and_samtools = "blcdsdockerregistry/bwa-mem2_samtools-1.12:2.2.1"
docker_image_hisat2_and_samtools = "blcdsdockerregistry/hisat2_samtools-1.12:2.2.1"
docker_image_picardtools = "blcdsdockerregistry/picard:2.26.10"
docker_image_sha512sum = "blcdsdockerregistry/align-dna:sha512sum-1.0"
- docker_image_validate_params = "blcdsdockerregistry/validate:2.1.5"
+ docker_image_validate_params = "blcdsdockerregistry/validate:2.1.6"
docker_image_gatk = "broadinstitute/gatk:4.2.4.1"
docker_image_samtools = "blcdsdockerregistry/samtools:1.15.1"

20 changes: 12 additions & 8 deletions module/align_DNA_BWA_MEM2.nf
@@ -3,7 +3,13 @@
// here it actually saves cost, time, and memory to directly pipe the output into
// samtools due to the large size of the uncompressed SAM files.
include { run_sort_SAMtools ; run_merge_SAMtools } from './samtools.nf'
- include { run_validate_PipeVal; run_validate_PipeVal as validate_output_file } from './validation.nf'
+ include { run_validate_PipeVal } from '../external/nextflow-modules/modules/PipeVal/validate/main.nf' addParams(
+     options: [
+         docker_image_version: params.pipeval_version,
+         process_label: 'process_low',
+         main_process: "BWA-MEM2-${params.bwa_version}"
+     ]
+ )
include { run_MarkDuplicate_Picard } from './mark_duplicate_picardtools.nf'
include { run_MarkDuplicatesSpark_GATK } from './mark_duplicates_spark.nf'
include { generate_sha512sum } from './check_512sum.nf'
@@ -83,13 +89,12 @@ workflow align_DNA_BWA_MEM2_workflow {
run_validate_PipeVal(ich_samples_validate.mix(
ich_reference_fasta,
ich_reference_index_files
- ),
- aligner_log_dir
+ )
)

// change validation file name depending on whether inputs or outputs are being validated
//val_filename = ${task.process.split(':')[1].replace('_', '-')} == run-validate ? "input_validation.txt" : "output_validation.txt"
- run_validate_PipeVal.out.val_file.collectFile(
+ run_validate_PipeVal.out.validation_result.collectFile(
name: 'input_validation.txt',
storeDir: "${aligner_validation_dir}"
)
@@ -124,14 +129,13 @@ workflow align_DNA_BWA_MEM2_workflow {
}
}
generate_sha512sum(och_bam_index.mix(och_bam), aligner_output_dir)
- validate_output_file(
+ run_validate_PipeVal(
och_bam.mix(
och_bam_index,
Channel.from(params.work_dir, params.output_dir)
- ),
- aligner_log_dir
+ )
)
- validate_output_file.out.val_file.collectFile(
+ run_validate_PipeVal.out.validation_result.collectFile(
name: 'output_validation.txt',
storeDir: "${aligner_validation_dir}"
)
2 changes: 1 addition & 1 deletion module/align_DNA_HISAT2.nf
@@ -4,7 +4,7 @@
// samtools due to the large size of the uncompressed SAM files.

include { run_sort_SAMtools ; run_merge_SAMtools} from './samtools.nf'
- include { run_validate_PipeVal; run_validate_PipeVal as validate_output_file } from './validation.nf'
+ include { run_validate_PipeVal; run_validate_PipeVal as validate_output_file } from '../external/nextflow-modules/modules/PipeVal/validate/main.nf'
Contributor

Doesn't the HISAT2 workflow need the same updates as the BWA-MEM2 workflow?

Contributor Author

@jarbet Sep 27, 2022

The HISAT2 workflow needs to get the same updates. However, I cannot get PipeVal to work in the BWA-MEM2 workflow (see previous comments), so I am waiting to update HISAT2.

I don't understand what PipeVal is doing on the current main branch (ignoring changes from this PR), and I think there might be some issues. For example, here is A-mini-n1 testing output that uses the main branch (so no changes to PipeVal code):

/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-filename-standardized/align-DNA-8.1.0/a_mini_n1_standardized

BWA-MEM2:

cat input_validation.txt 
Input: genome.fa is valid file-fasta
Input: HG002.N-n1_R1.fq.gz is valid file-unknown
Input: genome.fa.bwt.2bit.64 is valid file-unknown
Input: genome.fa.0123 is valid file-unknown
Input: genome.fa.fai is valid file-unknown
Input: genome.fa.ann is valid file-unknown
Input: genome.fa.amb is valid file-unknown
Input: HG002.N-n1_R2.fq.gz is valid file-unknown
Input: genome.fa.pac is valid file-unknown
cat output_validation.txt 
Input: BWA-MEM2-2.2.1_000000_a-mini-n1-standardized.bam is valid file-unknown
Input: BWA-MEM2-2.2.1_000000_a-mini-n1-standardized.bam.bai is valid file-unknown
Input: jarbet-filename-standardized is valid file-unknown
Input: scratch is valid file-unknown

HISAT2:

cat input_validation.txt 
Input: Homo_sapiens_assembly38.7.ht2 is valid file-unknown
Input: Homo_sapiens_assembly38.5.ht2 is valid file-unknown
Input: Homo_sapiens_assembly38.2.ht2 is valid file-unknown
Input: Homo_sapiens_assembly38.6.ht2 is valid file-unknown
Input: HG002.N-n1_R2.fq.gz is valid file-unknown
Input: Homo_sapiens_assembly38.1.ht2 is valid file-unknown
Input: Homo_sapiens_assembly38.fasta is valid file-fasta
Input: Homo_sapiens_assembly38.8.ht2 is valid file-unknown
Input: Homo_sapiens_assembly38.3.ht2 is valid file-unknown
Input: HG002.N-n1_R1.fq.gz is valid file-unknown
Input: Homo_sapiens_assembly38.4.ht2 is valid file-unknown
cat output_validation.txt 
Input: HISAT2-2.2.1_000000_a-mini-n1-standardized.bam.bai is valid file-unknown
Input: scratch is valid file-unknown
Input: HISAT2-2.2.1_000000_a-mini-n1-standardized.bam is valid file-unknown
Input: jarbet-filename-standardized is valid file-unknown

Nearly all files are reported as "valid file-unknown", and in output_validation.txt it looks like the GitHub branch name and scratch are also being passed to PipeVal. Does this output look correct to you?
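
To put a number on "nearly all", the validation lines above can be tallied by reported type. This is a minimal Python sketch that assumes every result line has the form `Input: <name> is valid <type>`, as in the logs above; `tally_validation_types` is a hypothetical helper and not part of PipeVal or the pipeline.

```python
from collections import Counter


def tally_validation_types(lines):
    """Count reported types from lines shaped like 'Input: <name> is valid <type>'."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith("Input: ") or " is valid " not in line:
            continue  # skip anything that is not a validation result line
        _, reported_type = line.rsplit(" is valid ", 1)
        counts[reported_type] += 1
    return counts


# BWA-MEM2 input_validation.txt from the main-branch run quoted above
bwa_input = """\
Input: genome.fa is valid file-fasta
Input: HG002.N-n1_R1.fq.gz is valid file-unknown
Input: genome.fa.bwt.2bit.64 is valid file-unknown
Input: genome.fa.0123 is valid file-unknown
Input: genome.fa.fai is valid file-unknown
Input: genome.fa.ann is valid file-unknown
Input: genome.fa.amb is valid file-unknown
Input: HG002.N-n1_R2.fq.gz is valid file-unknown
Input: genome.fa.pac is valid file-unknown
""".splitlines()

bwa_counts = tally_validation_types(bwa_input)
print(dict(bwa_counts))
```

For the BWA-MEM2 input log this gives eight `file-unknown` entries against one `file-fasta`.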

Contributor

Most of the files are file-unknown because they don't have an expected file extension, which is fine. Also, the main branch uses v2.1.5 of PipeVal, which had trouble detecting multi-part extensions such as fq.gz, so those files were marked as file-unknown as well.
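
The multi-part extension problem can be illustrated with a small sketch. This is not PipeVal's code: the `EXTENSION_TYPES` table, the `file-fastq` type name, and both helper functions are hypothetical, and they only show why checking the last suffix alone misses `fq.gz`.

```python
from pathlib import Path

# Hypothetical lookup table; the real PipeVal mapping may differ.
EXTENSION_TYPES = {
    ".fa": "file-fasta",
    ".fasta": "file-fasta",
    ".fq": "file-fastq",     # assumed type name
    ".fq.gz": "file-fastq",  # assumed type name
}


def detect_last_suffix(name):
    """Look up only the final suffix, e.g. '.gz' for 'reads.fq.gz'."""
    return EXTENSION_TYPES.get(Path(name).suffix, "file-unknown")


def detect_longest_suffix(name):
    """Try progressively shorter suffix chains, so '.fq.gz' is tried before '.gz'."""
    suffixes = Path(name).suffixes
    for start in range(len(suffixes)):
        candidate = "".join(suffixes[start:])
        if candidate in EXTENSION_TYPES:
            return EXTENSION_TYPES[candidate]
    return "file-unknown"


for filename in ("genome.fa", "HG002.N-n1_R1.fq.gz", "genome.fa.bwt.2bit.64"):
    print(filename, detect_last_suffix(filename), detect_longest_suffix(filename))
```

With the last-suffix lookup, `HG002.N-n1_R1.fq.gz` falls through to `file-unknown`; the longest-suffix lookup matches `.fq.gz` first. Index files such as `genome.fa.bwt.2bit.64` stay `file-unknown` either way, which matches the "which is fine" case.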

Align-DNA also validates work_dir and output_dir, but it appears to validate them as file-input even though they are directories, which is what produces the "valid file-unknown" lines for jarbet-filename-standardized and scratch. With v3.0.0, these two would need to be validated as directories rather than as files.
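
One way to picture that change is to pick the validation type from what the path actually is before handing it to the validator. This is a minimal sketch, not the pipeline's or PipeVal's implementation: `pick_validation_mode` is hypothetical, and the `directory-rw` type name is an assumption about how a directory check would be requested.

```python
import os
import tempfile


def pick_validation_mode(path):
    """Return a validation type based on whether path is a directory or a file."""
    if os.path.isdir(path):
        return "directory-rw"  # assumed name for a read/write directory check
    if os.path.isfile(path):
        return "file-input"    # type name mentioned in the comment above
    raise FileNotFoundError(path)


# Stand-ins for work_dir and an output BAM; the names are illustrative only.
with tempfile.TemporaryDirectory() as work_dir:
    bam_path = os.path.join(work_dir, "example.bam")
    with open(bam_path, "w"):
        pass
    dir_mode = pick_validation_mode(work_dir)
    file_mode = pick_validation_mode(bam_path)

print(dir_mode, file_mode)
```

In the workflow itself, the equivalent step would be to tag the `work_dir` and `output_dir` entries separately from the BAM and index files before they reach the validation process.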

We may need to wait for pipeval to get updated before we can proceed with this PR.

include { run_MarkDuplicate_Picard } from './mark_duplicate_picardtools.nf'
include { run_MarkDuplicatesSpark_GATK } from './mark_duplicates_spark.nf'
include { generate_sha512sum } from './check_512sum.nf'
23 changes: 0 additions & 23 deletions module/validation.nf

This file was deleted.