-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update how PipeVal is being used to validate input and output files #240
Conversation
config/default.config
Outdated
@@ -14,12 +14,13 @@ params { | |||
// tools and their versions | |||
bwa_version = "BWA-MEM2-2.2.1" | |||
hisat2_version = "HISAT2-2.2.1" | |||
pipeval_version = "2.1.6" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aybeshlikyan, @yashpatel6 : I'm having trouble figuring out how to switch to PipeVal/validate. Any suggestions on how to modify my code? Here is the error I'm currently getting:
Yash mentioned I might need to first process the input files using a line similar to here. I'm not sure what this line is doing, and I noticed |
@jarbet For the first warning/error, it seems like it's caused by lines 134-137 in "module/align_DNA_BWA_MEM2.nf" ; the function For the second warning/error, it's caused by lines 91-94 in the same file. It seems like it's complaining about the cardinality again, but also the index file (I think it's talking about The line that Yash linked in call-gSNP basically sets up a channel with the correct cardinality, mapping the unique BAM files from the input csv into items that are tuples of the form I think setting up both |
@jarbet Also a couple other things I noticed:
|
Generally, you're correct! Some minor clarifications: The process does expect each input to be a tuple with cardinality 2. So each emission in the input channel to the process must be a tuple of cardinality 2. The channel itself can contain as many emissions as you want. The cause for the error is that the input channel contains 4 emissions that each have a single item (cardinality 1). So the problem isn't that there are 4 items but that these 4 items have cardinality 1 when they need to have cardinality 2. The second error/warning is actually caused because the input only had cardinality 1 so the second expected item (the actual path to the file for validating) is PipeVal doesn't have explicit validation for index files but it is still able to verify that the file actually exists I believe. So passing |
@aybeshlikyan , @yashpatel6 : thanks for the help so far. I understand the new Conceptually, I think I need something like the following: For the input validation:
For the output validation:
I don't know how to do this in groovy but here is a very rough guess based on here: For the input validation:
For the output validation:
Then somehow I would have to send the |
Generally, that's correct. You'll want to use |
@yashpatel6 @aybeshlikyan : I think I fixed the input cardinality problem when validating the input files and output files, since I am no longer getting the cardinality warnings. However, I am now getting the following error:
Although the cardinality is correct now, I think the contents of what is being passed to PipeVal seems wrong:
Any thoughts on how to fix this? |
This seems to be because |
I wasn't aware of the change. @aybeshlikyan thoughts? |
@@ -10,6 +10,7 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm | |||
## [Unreleased] | |||
|
|||
### Changed | |||
- Replace `blcdsdockerregistry/validate:2.1.5` (out of date) with the version from [pipeline-Nextflow-module](https://github.com/uclahs-cds/pipeline-Nextflow-module/tree/main/modules/PipeVal/validate). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want to specify that we're using v3.0.0
; the module might get updated to newer versions but the version used by the pipeline is still dictated by in default.config
|
||
docker_image_bwa_and_samtools = "blcdsdockerregistry/bwa-mem2_samtools-1.12:2.2.1" | ||
docker_image_hisat2_and_samtools = "blcdsdockerregistry/hisat2_samtools-1.12:2.2.1" | ||
docker_image_picardtools = "blcdsdockerregistry/picard:2.26.10" | ||
docker_image_sha512sum = "blcdsdockerregistry/align-dna:sha512sum-1.0" | ||
docker_image_validate_params = "blcdsdockerregistry/validate:2.1.5" | ||
docker_image_validate_params = "blcdsdockerregistry/pipeval:3.0.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docker_image_validate_params = "blcdsdockerregistry/pipeval:3.0.0" | |
docker_image_validate_params = "blcdsdockerregistry/pipeval:${pipeval_version}" |
To avoid hard-coding and duplication
@@ -3,7 +3,15 @@ | |||
// here it actually saves cost, time, and memory to directly pipe the output into | |||
// samtools due to the large size of the uncompressed SAM files. | |||
include { run_sort_SAMtools ; run_merge_SAMtools } from './samtools.nf' | |||
include { run_validate_PipeVal; run_validate_PipeVal as validate_output_file } from './validation.nf' | |||
|
|||
// need to rename run_validate_PipeVal since using twice. Use run_validate_PipeVal for input validation; validate_output_file for output validation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// need to rename run_validate_PipeVal since using twice. Use run_validate_PipeVal for input validation; validate_output_file for output validation | |
// Use run_validate_PipeVal for input validation; validate_output_file for output validation |
The first part of the comment is unnecessary for the actual pipeline I think
@@ -80,16 +88,20 @@ workflow align_DNA_BWA_MEM2_workflow { | |||
ich_reference_fasta | |||
ich_reference_index_files | |||
main: | |||
run_validate_PipeVal(ich_samples_validate.mix( | |||
|
|||
run_validate_PipeVal_inputs = ich_samples_validate.mix( // |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
run_validate_PipeVal_inputs = ich_samples_validate.mix( // | |
run_validate_PipeVal_inputs = ich_samples_validate.mix( |
Empty comment
@@ -80,16 +88,20 @@ workflow align_DNA_BWA_MEM2_workflow { | |||
ich_reference_fasta | |||
ich_reference_index_files | |||
main: | |||
run_validate_PipeVal(ich_samples_validate.mix( | |||
|
|||
run_validate_PipeVal_inputs = ich_samples_validate.mix( // |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, generally for input channels, we want to use input_ch_*
@@ -4,7 +4,7 @@ | |||
// samtools due to the large size of the uncompressed SAM files. | |||
|
|||
include { run_sort_SAMtools ; run_merge_SAMtools} from './samtools.nf' | |||
include { run_validate_PipeVal; run_validate_PipeVal as validate_output_file } from './validation.nf' | |||
include { run_validate_PipeVal; run_validate_PipeVal as validate_output_file } from '../external/nextflow-modules/modules/PipeVal/validate/main.nf' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the workflow for HISAT2 not require updates like the ones with BWA-MEM2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The HISAT2 workflow needs to get the same updates. However, I cannot get PipeVal to work in the BWA-MEM2 workflow (see previous comments), so I am waiting to update HISAT2.
I don't understand what PipeVal is doing on the current main
branch (ignoring changes from this PR), and I think there might be some issues. For example, here is A-mini-n1
testing output that uses the main
branch (so no changes to PipeVal code):
/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-filename-standardized/align-DNA-8.1.0/a_mini_n1_standardized
BWA-MEM2:
cat input_validation.txt
Input: genome.fa is valid file-fasta
Input: HG002.N-n1_R1.fq.gz is valid file-unknown
Input: genome.fa.bwt.2bit.64 is valid file-unknown
Input: genome.fa.0123 is valid file-unknown
Input: genome.fa.fai is valid file-unknown
Input: genome.fa.ann is valid file-unknown
Input: genome.fa.amb is valid file-unknown
Input: HG002.N-n1_R2.fq.gz is valid file-unknown
Input: genome.fa.pac is valid file-unknown
cat output_validation.txt
Input: BWA-MEM2-2.2.1_000000_a-mini-n1-standardized.bam is valid file-unknown
Input: BWA-MEM2-2.2.1_000000_a-mini-n1-standardized.bam.bai is valid file-unknown
Input: jarbet-filename-standardized is valid file-unknown
Input: scratch is valid file-unknown
HISAT2:
cat input_validation.txt
Input: Homo_sapiens_assembly38.7.ht2 is valid file-unknown
Input: Homo_sapiens_assembly38.5.ht2 is valid file-unknown
Input: Homo_sapiens_assembly38.2.ht2 is valid file-unknown
Input: Homo_sapiens_assembly38.6.ht2 is valid file-unknown
Input: HG002.N-n1_R2.fq.gz is valid file-unknown
Input: Homo_sapiens_assembly38.1.ht2 is valid file-unknown
Input: Homo_sapiens_assembly38.fasta is valid file-fasta
Input: Homo_sapiens_assembly38.8.ht2 is valid file-unknown
Input: Homo_sapiens_assembly38.3.ht2 is valid file-unknown
Input: HG002.N-n1_R1.fq.gz is valid file-unknown
Input: Homo_sapiens_assembly38.4.ht2 is valid file-unknown
cat output_validation.txt
Input: HISAT2-2.2.1_000000_a-mini-n1-standardized.bam.bai is valid file-unknown
Input: scratch is valid file-unknown
Input: HISAT2-2.2.1_000000_a-mini-n1-standardized.bam is valid file-unknown
Input: jarbet-filename-standardized is valid file-unknown
Nearly all files are valid file-unknown
and it looks like the github branch name, as well as scratch
are also passed to PipeVal in output_validation.txt
? Does this output look correct to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the files are file-unknown
because they don't have an expected file extension, which is fine. Also, the main branch uses v2.1.5
of pipeval, which had issues detecting multiple extensions like fq.gz
so those also got marked as file-unknown
.
Align-DNA also tries to validate the work_dir
and output_dir
but it looks like they tried to validate them as file-input
even though they're directories, leading to that statement. So with v3.0.0
, these two would need to be validated as directories rather than as files.
We may need to wait for pipeval to get updated before we can proceed with this PR.
Closing this PR as it's old and will be handled in a different PR |
Description
Replace
blcdsdockerregistry/validate:2.1.5
(out of date) with the version from pipeline-Nextflow-module.Closes #239
Testing Results
/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-validate
/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/input/csv/a_mini_n1_standardized
/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-validate/BWA-MEM2-a_mini_n1.config
/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-validate/align-DNA-8.1.0
Checklist
I have read the code review guidelines and the code review best practice on GitHub check-list.
I have reviewed the Nextflow pipeline standards.
The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
I have set up the branch protection rule following the github standards before opening this pull request, or the branch protection rule has already been set up.
I have added my name to the contributors listings in the
manifest
block in thenextflow.config
as part of this pull request, am listedalready, or do not wish to be listed. (This acknowledgement is optional.)
I have added the changes included in this pull request to the
CHANGELOG.md
under the next release version or unreleased, and updated the date.I have updated the version number in the
metadata.yaml
andmanifest
block of thenextflow.config
file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)I have tested the pipeline on at least one A-mini sample with aligner setting to
BWA-MEM2
,HISAT2
, and both. The paths to the test config files and output directories were attached in the Testing Results section.