Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update how PipeVal is being used to validate input and output files #240

Closed
wants to merge 15 commits into from

Conversation

jarbet
Copy link
Contributor

@jarbet jarbet commented Aug 31, 2022

Description

Replace blcdsdockerregistry/validate:2.1.5 (out of date) with the version from pipeline-Nextflow-module.

Closes #239

Testing Results

/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-validate

  • BWA-MEM2
    • input csv: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/input/csv/a_mini_n1_standardized
    • config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-validate/BWA-MEM2-a_mini_n1.config
    • output: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-validate/align-DNA-8.1.0

Checklist

  • I have read the code review guidelines and the code review best practice on GitHub check-list.

  • I have reviewed the Nextflow pipeline standards.

  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].

  • I have set up the branch protection rule following the github standards before opening this pull request, or the branch protection rule has already been set up.

  • I have added my name to the contributors listings in the manifest block in the nextflow.config as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

  • I have added the changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.

  • I have updated the version number in the metadata.yaml and manifest block of the nextflow.config file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)

  • I have tested the pipeline on at least one A-mini sample with aligner setting to BWA-MEM2, HISAT2, and both. The paths to the test config files and output directories were attached in the Testing Results section.

@@ -14,12 +14,13 @@ params {
// tools and their versions
bwa_version = "BWA-MEM2-2.2.1"
hisat2_version = "HISAT2-2.2.1"
pipeval_version = "2.1.6"
Copy link
Contributor

@tyamaguchi-ucla tyamaguchi-ucla Aug 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jarbet
Copy link
Contributor Author

jarbet commented Aug 31, 2022

@aybeshlikyan, @yashpatel6 : I'm having trouble figuring out how to switch to PipeVal/validate. Any suggestions on how to modify my code?

Here is the error I'm currently getting:

WARN: Input tuple does not match input set cardinality declared by process align_DNA_BWA_MEM2_workflow:validate_output_file -- offending value: /scratch
WARN: Input tuple does not match input set cardinality declared by process align_DNA_BWA_MEM2_workflow:run_validate_PipeVal -- offending value: /hot/ref/tool-specific-input/BWA-MEM2-2.2.1/GRCh38-BI-20160721/index/genome.fa.bwt.2bit.64
Error executing process > 'align_DNA_BWA_MEM2_workflow:run_validate_PipeVal (6)'

Caused by:
Not a valid path value type: org.codehaus.groovy.runtime.NullObject (null)

Yash mentioned I might need to first process the input files using a line similar to here. I'm not sure what this line is doing, and I noticed align-RNA does not include a similar step, so I'm not sure if this line is needed?

@aybeshlikyan
Copy link

@jarbet For the first warning/error, it seems like it's caused by lines 134-137 in "module/align_DNA_BWA_MEM2.nf" ; the function validate_output_file (AKA run_validate_PipeVal from the submodule) is expecting a channel with 2 items: the mode of the file to validate (in this case it would be 'file-bam') and the file to be validated (i.e. och_bam), but it's getting a channel with the items och_bam, och_bam_index, params.work_dir (i.e. "/scratch"), and params.output_dir. That's why it's complaining about the input set cardinality-- it's expecting 2 items but getting 4. The old run_validate_PipeVal in "validate.nf" took in the output log dir as an input, but the submodule handles that with the options passed into "addParams" with the include statement at the top of the file.

For the second warning/error, it's caused by lines 91-94 in the same file. It seems like it's complaining about the cardinality again, but also the index file (I think it's talking about ich_reference_index_files?); PipeVal can't handle that type of file.

The line that Yash linked in call-gSNP basically sets up a channel with the correct cardinality, mapping the unique BAM files from the input csv into items that are tuples of the form [<PipeVal file type>, <path to BAM file>] (i.e. ['file-input', it]).

I think setting up both validate_output_file and run_validate_PipeVal to take in channels with the correct number/type of items should fix it or at least get it closer to running. However I might not be understanding what's going on fully because I'm confused about how/why the index files are passed into run_validate_PipeVal, because I thought PipeVal couldn't run on those; maybe @yashpatel6 can make sure everything I said is accurate and give more insight if I misunderstood anything.

@aybeshlikyan
Copy link

I noticed align-RNA does not include a similar step, so I'm not sure if this line is needed?

@jarbet Align-RNA doesn't use that exact way to set up the channel but it does something similar here, here, and here.

@aybeshlikyan
Copy link

@jarbet Also a couple other things I noticed:

  • the pipeval_version variable in "default.config" is set to 3.0.0 but the submodule is still the version of the repo that's using 2.1.6; you can run git submodule update --remote external/nextflow-modules/ from the root of the align-DNA repo to pull the latest version of the submodule
  • You can change docker_image_validate_params = "blcdsdockerregistry/validate:2.1.6" to docker_image_validate_params = "blcdsdockerregistry/pipeval:3.0.0" in the default.config.

@yashpatel6
Copy link
Contributor

@jarbet For the first warning/error, it seems like it's caused by lines 134-137 in "module/align_DNA_BWA_MEM2.nf" ; the function validate_output_file (AKA run_validate_PipeVal from the submodule) is expecting a channel with 2 items: the mode of the file to validate (in this case it would be 'file-bam') and the file to be validated (i.e. och_bam), but it's getting a channel with the items och_bam, och_bam_index, params.work_dir (i.e. "/scratch"), and params.output_dir. That's why it's complaining about the input set cardinality-- it's expecting 2 items but getting 4. The old run_validate_PipeVal in "validate.nf" took in the output log dir as an input, but the submodule handles that with the options passed into "addParams" with the include statement at the top of the file.

For the second warning/error, it's caused by lines 91-94 in the same file. It seems like it's complaining about the cardinality again, but also the index file (I think it's talking about ich_reference_index_files?); PipeVal can't handle that type of file.

The line that Yash linked in call-gSNP basically sets up a channel with the correct cardinality, mapping the unique BAM files from the input csv into items that are tuples of the form [<PipeVal file type>, <path to BAM file>] (i.e. ['file-input', it]).

I think setting up both validate_output_file and run_validate_PipeVal to take in channels with the correct number/type of items should fix it or at least get it closer to running. However I might not be understanding what's going on fully because I'm confused about how/why the index files are passed into run_validate_PipeVal, because I thought PipeVal couldn't run on those; maybe @yashpatel6 can make sure everything I said is accurate and give more insight if I misunderstood anything.

Generally, you're correct! Some minor clarifications:

The process does expect each input to be a tuple with cardinality 2. So each emission in the input channel to the process must be a tuple of cardinality 2. The channel itself can contain as many emissions as you want. The cause for the error is that the input channel contains 4 emissions that each have a single item (cardinality 1). So the problem isn't that there are 4 items but that these 4 items have cardinality 1 when they need to have cardinality 2.

The second error/warning is actually caused because the input only had cardinality 1 so the second expected item (the actual path to the file for validating) is null and so you get that error.

PipeVal doesn't have explicit validation for index files but it is still able to verify that the file actually exists I believe. So passing ['file-input', <index_file>] should still work fine as a basic check that the file exists and if there's a checksum, I believe PipeVal will validate that (I could be wrong on the checksum part).

@jarbet
Copy link
Contributor Author

jarbet commented Sep 2, 2022

@aybeshlikyan , @yashpatel6 : thanks for the help so far. I understand the new run_validate_PipeVal requires 2 parameters, although Yash mentioned it can take a channel/set of many inputs, as long as each has the 2 required parameters. I don't know how to setup the inputs/channel to do this, but I'm reading more of the Nextflow documentation to try and figure it out.

Conceptually, I think I need something like the following:

For the input validation:

run_validate_PipeVal(
         // This needs to be a vector containing 'file-input' (or 'file-fasta'?), repeated X times, where X is the number of fasta files to validate
         [ file-input, file-input, ...., file-input]
         // I think this is a vector of paths to multiple fasta files? 
         ich_samples_validate.mix(.
         ich_reference_index_files
         )
         )

For the output validation:

validate_output_file(
         // This needs to be a vector containing 'file-input' (or 'file-bam'?), repeated X times, where X is the number of bam files to validate
         [ file-input, file-input, ...., file-input]
         // I think this is a vector of paths to multiple bam files?
         och_bam.mix(
            och_bam_index,
            Channel.from(params.work_dir, params.output_dir)
            )
         )

I don't know how to do this in groovy but here is a very rough guess based on here:

For the input validation:

ich_samples_validate.mix(.
         ich_reference_index_files
         ).flatMap{it -> ['file-input', it}.set{pipeval_input}
         // this assumes 'it' is every element of the ich_samples_validate.mix channel vector? 

For the output validation:

och_bam.mix(
            och_bam_index,
            Channel.from(params.work_dir, params.output_dir)
            ).flatMap{it -> ['file-input', it}.set{pipeval_output}

Then somehow I would have to send the pipeval_input and pipeval_output stuff to the corresponding calls to run_validate_PipeVal and validate_output_file ?

@yashpatel6
Copy link
Contributor

Generally, that's correct. You'll want to use map instead of flatMap though. flatMap is meant for doing a mapping operation on a channel and then flattening the result so if you use that, you'll end up with with each emission being a single element rather than a tuple of size 2. The reason there's a flatMap in call-gSNP is that the input channel contained values that didn't need to be validated so we map to keep only the things we want to validate -> flatten the result so each file to validate is a single emission -> map to ['file-input', it] to get it to the required input format. The flatMap operator handles the first two steps and the map handles the last step.

@jarbet
Copy link
Contributor Author

jarbet commented Sep 20, 2022

@yashpatel6 @aybeshlikyan : I think I fixed the input cardinality problem when validating the input files and output files, since I am no longer getting the cardinality warnings.

However, I am now getting the following error:
/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-validate/BWA-MEM2-a_mini_n2.log

Error executing process > 'align_DNA_BWA_MEM2_workflow:run_validate_PipeVal (1)'

Caused by:
  Process `align_DNA_BWA_MEM2_workflow:run_validate_PipeVal (1)` terminated with an error exit status (1)

Command executed:

  set -euo pipefail
  validate -t file-input genome.fa > 'validation.txt'

Command exit status:
  1

Command output:
  (empty)

Work dir:
  /scratch/0a/231f887f90535270d8b01aa4749ea5

Although the cardinality is correct now, I think the contents of what is being passed to PipeVal seems wrong:

/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-validate/align-DNA-8.1.0/a_mini_n2/BWA-MEM2-2.2.1/validation:

[[email protected] validation]$ cat input_validation.txt 
Input: HG002.N-n2_R2.fq.gz is valid file-fastq
Input: genome.fa.0123 is valid file-unknown
Input: genome.fa.fai is valid file-unknown
Input: genome.fa.bwt.2bit.64 is valid file-unknown
Input: genome.fa.ann is valid file-unknown
[[email protected] validation]$ cat output_validation.txt 
Input: scratch is valid file-unknown
Input: jarbet-validate is valid file-unknown

Any thoughts on how to fix this?
I'm guessing this line is no longer needed, but I don't understand the error from run_validate_PipeVal.

@jarbet jarbet changed the title Replace docker validate with pipeline-Nextflow-module validate update how PipeVal is being used to validate input and output files Sep 20, 2022
@yashpatel6
Copy link
Contributor

This seems to be because PipeVal got updated to throw an error for any file type that's not BAM, VCF, or FASTQ. Previous versions of PipeVal used to allow for other extensions and would check checksums if available and otherwise just pass the validation. @aybeshlikyan @tyamaguchi-ucla Any reason PipeVal was set up to only accept BAM, VCF, and FASTQ files in 3.0.0?

@tyamaguchi-ucla
Copy link
Contributor

@aybeshlikyan @tyamaguchi-ucla Any reason PipeVal was set up to only accept BAM, VCF, and FASTQ files in 3.0.0?

I wasn't aware of the change. @aybeshlikyan thoughts?

@jarbet jarbet marked this pull request as ready for review September 26, 2022 21:38
@jarbet jarbet requested a review from a team as a code owner September 26, 2022 21:38
@@ -10,6 +10,7 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
## [Unreleased]

### Changed
- Replace `blcdsdockerregistry/validate:2.1.5` (out of date) with the version from [pipeline-Nextflow-module](https://github.com/uclahs-cds/pipeline-Nextflow-module/tree/main/modules/PipeVal/validate).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to specify that we're using v3.0.0; the module might get updated to newer versions but the version used by the pipeline is still dictated by in default.config


docker_image_bwa_and_samtools = "blcdsdockerregistry/bwa-mem2_samtools-1.12:2.2.1"
docker_image_hisat2_and_samtools = "blcdsdockerregistry/hisat2_samtools-1.12:2.2.1"
docker_image_picardtools = "blcdsdockerregistry/picard:2.26.10"
docker_image_sha512sum = "blcdsdockerregistry/align-dna:sha512sum-1.0"
docker_image_validate_params = "blcdsdockerregistry/validate:2.1.5"
docker_image_validate_params = "blcdsdockerregistry/pipeval:3.0.0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
docker_image_validate_params = "blcdsdockerregistry/pipeval:3.0.0"
docker_image_validate_params = "blcdsdockerregistry/pipeval:${pipeval_version}"

To avoid hard-coding and duplication

@@ -3,7 +3,15 @@
// here it actually saves cost, time, and memory to directly pipe the output into
// samtools due to the large size of the uncompressed SAM files.
include { run_sort_SAMtools ; run_merge_SAMtools } from './samtools.nf'
include { run_validate_PipeVal; run_validate_PipeVal as validate_output_file } from './validation.nf'

// need to rename run_validate_PipeVal since using twice. Use run_validate_PipeVal for input validation; validate_output_file for output validation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// need to rename run_validate_PipeVal since using twice. Use run_validate_PipeVal for input validation; validate_output_file for output validation
// Use run_validate_PipeVal for input validation; validate_output_file for output validation

The first part of the comment is unnecessary for the actual pipeline I think

@@ -80,16 +88,20 @@ workflow align_DNA_BWA_MEM2_workflow {
ich_reference_fasta
ich_reference_index_files
main:
run_validate_PipeVal(ich_samples_validate.mix(

run_validate_PipeVal_inputs = ich_samples_validate.mix( //
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
run_validate_PipeVal_inputs = ich_samples_validate.mix( //
run_validate_PipeVal_inputs = ich_samples_validate.mix(

Empty comment

@@ -80,16 +88,20 @@ workflow align_DNA_BWA_MEM2_workflow {
ich_reference_fasta
ich_reference_index_files
main:
run_validate_PipeVal(ich_samples_validate.mix(

run_validate_PipeVal_inputs = ich_samples_validate.mix( //
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, generally for input channels, we want to use input_ch_*

@@ -4,7 +4,7 @@
// samtools due to the large size of the uncompressed SAM files.

include { run_sort_SAMtools ; run_merge_SAMtools} from './samtools.nf'
include { run_validate_PipeVal; run_validate_PipeVal as validate_output_file } from './validation.nf'
include { run_validate_PipeVal; run_validate_PipeVal as validate_output_file } from '../external/nextflow-modules/modules/PipeVal/validate/main.nf'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the workflow for HISAT2 not require updates like the ones with BWA-MEM2?

Copy link
Contributor Author

@jarbet jarbet Sep 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HISAT2 workflow needs to get the same updates. However, I cannot get PipeVal to work in the BWA-MEM2 workflow (see previous comments), so I am waiting to update HISAT2.

I don't understand what PipeVal is doing on the current main branch (ignoring changes from this PR), and I think there might be some issues. For example, here is A-mini-n1 testing output that uses the main branch (so no changes to PipeVal code):

/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-filename-standardized/align-DNA-8.1.0/a_mini_n1_standardized

BWA-MEM2:

cat input_validation.txt 
Input: genome.fa is valid file-fasta
Input: HG002.N-n1_R1.fq.gz is valid file-unknown
Input: genome.fa.bwt.2bit.64 is valid file-unknown
Input: genome.fa.0123 is valid file-unknown
Input: genome.fa.fai is valid file-unknown
Input: genome.fa.ann is valid file-unknown
Input: genome.fa.amb is valid file-unknown
Input: HG002.N-n1_R2.fq.gz is valid file-unknown
Input: genome.fa.pac is valid file-unknown
cat output_validation.txt 
Input: BWA-MEM2-2.2.1_000000_a-mini-n1-standardized.bam is valid file-unknown
Input: BWA-MEM2-2.2.1_000000_a-mini-n1-standardized.bam.bai is valid file-unknown
Input: jarbet-filename-standardized is valid file-unknown
Input: scratch is valid file-unknown

HISAT2:

cat input_validation.txt 
Input: Homo_sapiens_assembly38.7.ht2 is valid file-unknown
Input: Homo_sapiens_assembly38.5.ht2 is valid file-unknown
Input: Homo_sapiens_assembly38.2.ht2 is valid file-unknown
Input: Homo_sapiens_assembly38.6.ht2 is valid file-unknown
Input: HG002.N-n1_R2.fq.gz is valid file-unknown
Input: Homo_sapiens_assembly38.1.ht2 is valid file-unknown
Input: Homo_sapiens_assembly38.fasta is valid file-fasta
Input: Homo_sapiens_assembly38.8.ht2 is valid file-unknown
Input: Homo_sapiens_assembly38.3.ht2 is valid file-unknown
Input: HG002.N-n1_R1.fq.gz is valid file-unknown
Input: Homo_sapiens_assembly38.4.ht2 is valid file-unknown
cat output_validation.txt 
Input: HISAT2-2.2.1_000000_a-mini-n1-standardized.bam.bai is valid file-unknown
Input: scratch is valid file-unknown
Input: HISAT2-2.2.1_000000_a-mini-n1-standardized.bam is valid file-unknown
Input: jarbet-filename-standardized is valid file-unknown

Nearly all files are valid file-unknown and it looks like the github branch name, as well as scratch are also passed to PipeVal in output_validation.txt? Does this output look correct to you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the files are file-unknown because they don't have an expected file extension, which is fine. Also, the main branch uses v2.1.5 of pipeval, which had issues detecting multiple extensions like fq.gz so those also got marked as file-unknown.

Align-DNA also tries to validate the work_dir and output_dir but it looks like they tried to validate them as file-input even though they're directories, leading to that statement. So with v3.0.0, these two would need to be validated as directories rather than as files.

We may need to wait for pipeval to get updated before we can proceed with this PR.

@yashpatel6
Copy link
Contributor

Closing this PR as it's old and will be handled in a different PR

@yashpatel6 yashpatel6 closed this May 15, 2023
@yashpatel6 yashpatel6 deleted the jarbet-validate branch March 30, 2024 00:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replace docker validate with pipeline-Nextflow-module validate
4 participants