Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

host_filter.wdl modernization #70

Merged
merged 83 commits into from
Apr 25, 2023
Merged

host_filter.wdl modernization #70

merged 83 commits into from
Apr 25, 2023

Conversation

mlin
Copy link
Collaborator

@mlin mlin commented May 31, 2022

copilot:summary

Comment on lines 21 to 22
File human_bowtie2_index_tar
File human_hisat2_index_tar
Copy link
Contributor

@valenzuelaomar valenzuelaomar Oct 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if the host genome is not a human? would I still provide the custom host genome file path as a value to human_bowtie2_index_tar?

I'm trying to figure out how the inputs here change depending on human vs non-human hosts

Copy link
Collaborator Author

@mlin mlin Oct 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ovalenzuela19 Good question. If the host genome is non-human then we filter against both the host and human genomes. This protects privacy in case for example, we're sequencing a mosquito carrying some blood from a human it recently bit. So the pipeline always takes in the human index files in addition to the host index files (human or not).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh this makes sense! Thanks so much!

String host_genome
String genome_dir = "STAR_genome/part-0/"
# Adapter trimming and QC filtering
call fastp_qc {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we rename the steps slightly? I believe our step names are in camelCase and start with a verb

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. fastp_qc => RunFastpQc

@mlin mlin marked this pull request as ready for review March 31, 2023 08:44
@mlin mlin changed the title host_filter.wdl modernization WIP host_filter.wdl modernization Mar 31, 2023
@mlin
Copy link
Collaborator Author

mlin commented Mar 31, 2023

@ovalenzuela19 @morsecodist @rzlim08 @katrinakalantar Please have a look over this PR before merging at last. I took out the indexing from this since that's subsumed in #182. Note that this replaces the existing host_filter.wdl, not sure if there might be value in keeping both versions around (on HEAD)?

@mlin
Copy link
Collaborator Author

mlin commented Apr 6, 2023

@ovalenzuela19 @morsecodist @rzlim08 @katrinakalantar In today's pipeline meeting we discussed that indeed, it seems like a good idea to keep both versions of the WDL alive on the main branch, since we expect to continue supporting the original version for some time.

One way to do this would be to keep the current diff that replaces host_filter.wdl and rewires other parts that refers to it; and simply add original_host_filter.wdl alongside. There are other ways of course, but let me put that one forward as a strawperson.

Thoughts? would this need some change to whatever part of the system actually launches the original version if so requested?

@@ -31,7 +31,7 @@ def test_bench3_viral(short_read_mngs_bench3_viral_outputs):
taxon_counts = json.load(infile)["pipeline_output"]["taxon_counts_attributes"]

taxa = set(entry["tax_id"] for entry in taxon_counts)
assert len(taxa) == 177
assert abs(len(taxa) - 184) < 16
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we expecting to introduce some variability into the host filtering? Or was this just a holdover from the old assert?

@rzlim08
Copy link
Collaborator

rzlim08 commented Apr 6, 2023

@ovalenzuela19 @morsecodist @rzlim08 @katrinakalantar In today's pipeline meeting we discussed that indeed, it seems like a good idea to keep both versions of the WDL alive on the main branch, since we expect to continue supporting the original version for some time.

One way to do this would be to keep the current diff that replaces host_filter.wdl and rewires other parts that refers to it; and simply add original_host_filter.wdl alongside. There are other ways of course, but let me put that one forward as a strawperson.

Thoughts? would this need some change to whatever part of the system actually launches the original version if so requested?

Thanks Mike and sorry for the late review.

One option is to just use host_filter.wdl as a workflow manager and reference other wdl files e.g. (old_host_filter.wdl, new_host_filter.wdl). The wdl files already get miniwdl zipped up so we could just call the zip files from the webapp.

I'm not sure if it's possible to have wdl expect different inputs given a flag, but that would be nice. Otherwise I guess we'd have to made almost all of the differing inputs optional.

@mlin
Copy link
Collaborator Author

mlin commented Apr 7, 2023

I'm not sure if it's possible to have wdl expect different inputs given a flag, but that would be nice. Otherwise I guess we'd have to made almost all of the differing inputs optional.

@rzlim08 I think it would be closer to the second case unfortunately. Would you still want to see that approach in view of that potential awkwardness? (Thx- I'm just not familiar enough with the pieces of code actually invoking these workflows to understand the constraints here)

@rzlim08
Copy link
Collaborator

rzlim08 commented Apr 12, 2023

I'm not sure if it's possible to have wdl expect different inputs given a flag, but that would be nice. Otherwise I guess we'd have to made almost all of the differing inputs optional.

@rzlim08 I think it would be closer to the second case unfortunately. Would you still want to see that approach in view of that potential awkwardness? (Thx- I'm just not familiar enough with the pieces of code actually invoking these workflows to understand the constraints here)

I think the other option would be to run with 2 WDL files and the webapp could choose between them. This would likely mean one of the wdl's would be set as a "default" for e.g. local testing/ benchmarking, but the webapp could have the logic to run one vs the other. This might be the easiest solution for now.

@mlin
Copy link
Collaborator Author

mlin commented Apr 14, 2023

@ovalenzuela19 The tentative solution here is to merge with the modern host filtering WDL replacing the original one as host_filter.wdl, and adding the original one alongside as original_host_filter.wdl. Everything else in the repo (local_driver.wdl, amr/run.wdl, unit tests, etc.) would be wired to host_filter.wdl and not original_host_filter.wdl.

Question for you- is that going to work for the webapp as far as how it invokes the original pipeline when asked to do so? Or does it need the zip to have host_filter.wdl as the original pipeline, in which case there would not be much point (and might cause additional confusion) to add original_host_filter.wdl at all.

Thanks, we're obviously trying to avoid breaking anything that depends on the structure of this repo that aren't necessarily obvious when making these kinds of changes.

@valenzuelaomar
Copy link
Contributor

@mlin so would original_host_filter.wdl just be there for reference and not really usable?

I think from a webapp perspective that solution should be fine

@mlin
Copy link
Collaborator Author

mlin commented Apr 14, 2023

@ovalenzuela19

would original_host_filter.wdl just be there for reference and not really usable?

Mostly yes, for reference. The unanswered bit is what would we do if a user absolutely needs us to make some further change/bugfix to the original pipeline they're still depending on. How would we roll that out as a tagged release & zip file the webapp can use, once it's been renamed to original_host_filter.wdl?

If we're optimists we would say that's fairly unlikely to happen, to the extent we can punt now and figure it out later if it does, but, thought we should at least air it out here. cc @rzlim08 @morsecodist @katrinakalantar for visibility

@rzlim08
Copy link
Collaborator

rzlim08 commented Apr 14, 2023

@ovalenzuela19

would original_host_filter.wdl just be there for reference and not really usable?

Mostly yes, for reference. The unanswered bit is what would we do if a user absolutely needs us to make some further change/bugfix to the original pipeline they're still depending on. How would we roll that out as a tagged release & zip file the webapp can use, once it's been renamed to original_host_filter.wdl?

If we're optimists we would say that's fairly unlikely to happen, to the extent we can punt now and figure it out later if it does, but, thought we should at least air it out here. cc @rzlim08 @morsecodist @katrinakalantar for visibility

Yeah I think the main question is how to support both pipelines at the same time indefinitely. If we have both WDLs we'd at least be able to patch either. I'll see if I can make a small customization to pin a release to a major/minor version. That being said I don't think we should be running with this forever and would like to drop support for the old version eventually if we're comfortable enough with the new one.

@valenzuelaomar
Copy link
Contributor

valenzuelaomar commented Apr 14, 2023

Yeah I think the main question is how to support both pipelines at the same time indefinitely. If we have both WDLs we'd at least be able to patch either. I'll see if I can make a small customization to pin a release to a major/minor version. That being said I don't think we should be running with this forever and would like to drop support for the old version eventually if we're comfortable enough with the new one.

What if we just create a new workflow that is just the old short-read-mngs pipeline with the old host filtering wdl? That way we can still maintain it & package it like normal and it's up to the web app which wdl will be used? By default it'll use the existing short-read-mngs workflow which uses the modern host filtering stage.

That way if we need to run the old v7 short-read-mngs pipeline, the web app can just have some conditional in place to fetch the docker image for the short-read-mngs old version. LMK if this approach sounds absurd

valenzuelaomar and others added 9 commits April 14, 2023 16:18
* fix bowtie2 counts for single file

* fix extra expansions

* relieve hisat2 dependency

* single sample hisat2

* fix hisat2

* fix dockerfile for hisat2

---------

Co-authored-by: Omar Valenzuela <[email protected]>
…219)

* Revert "output gene id in primary output file (#209)"

This reverts commit 2d9ff56.

* Revert "Output non host reads and non host contigs for AMR (#205)"

This reverts commit 9de3fc2.
* legacy-host-filter-inital-commit

* linting

* add stage io map

* remove stage io map swp file
@rzlim08 rzlim08 merged commit 60a7e78 into main Apr 25, 2023
@rzlim08 rzlim08 deleted the mlin/modernize-host-filter branch April 25, 2023 19:16
@kislyuk
Copy link

kislyuk commented May 6, 2023

Whoa. Just noticed this! Huge step forward for czid. Congrats on landing this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants