@tomkinsc
Contributor

@tomkinsc tomkinsc commented May 3, 2024

add workflow to download multiple SRA accessions to multiple bams: fetch_multiple_sra_to_bams

This is useful when a sample is associated with multiple sequencing runs (i.e., more than one SRR###). It also adjusts the `Fetch_SRA_to_BAM` task to find the metadata for the requested run when multi-run metadata is returned for a given accession.
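As a rough illustration of the metadata adjustment described above, here is a minimal Python sketch of picking the one record that belongs to the requested run out of a multi-run response. It is not the task's actual implementation: the function name `select_run_metadata`, the `run_accession` key, and the record layout are all assumptions made for the example.

```python
from __future__ import annotations

from typing import Any


def select_run_metadata(
    records: list[dict[str, Any]],
    run_id: str,
    key: str = "run_accession",  # hypothetical field name, not taken from the PR
) -> dict[str, Any]:
    """Return the single metadata record whose `key` equals `run_id`.

    Raises ValueError if zero or more than one record matches, so an
    ambiguous multi-run response is never silently accepted.
    """
    matches = [record for record in records if record.get(key) == run_id]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one record for {run_id}, found {len(matches)}"
        )
    return matches[0]


if __name__ == "__main__":
    # Made-up records standing in for a multi-run metadata response.
    example = [
        {"run_accession": "SRR0000001", "platform": "A"},
        {"run_accession": "SRR0000002", "platform": "B"},
    ]
    print(select_run_metadata(example, "SRR0000002"))
```

Failing loudly on zero or multiple matches is a design choice of this sketch; the real task may handle those cases differently.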

tomkinsc added 4 commits May 3, 2024 19:04
…tch_multiple_sra_to_bams

add workflow to download multiple SRA accessions to multiple bams: fetch_multiple_sra_to_bams; this is useful in the event a sample is associated with multiple sequencing runs (i.e. more than one SRR###). It also adjusts the Fetch_SRA_to_BAM task to find metadata for the requested run, in the event multi-run metadata is returned for a given accession
…) function not available until WDL >= 1.1)

name the output tsv file using the first specified ID since we are operating under WDL 1.0 (the `sep()` function is not available to join arrays of strings until WDL >= 1.1, and `~{sep="_" variable}` seemingly does not work outside a command block)
…dinstitute/viral-pipelines into ct-pluralize-fetch-sra-to-bam
@tomkinsc tomkinsc requested a review from dpark01 May 3, 2024 23:34
Member

@dpark01 dpark01 left a comment

Thanks -- since I missed the use case motivating this I just have some questions:

  1. Would it be useful to emit a merged BAM `File` output at the workflow level (either instead of or in addition to the `Array[File]`, again depending on the use case motivating this)?
  2. I might feel better if we threw in a check / assertion that there is one and only one unique `biosample_accession` value across all the results. If that's not in conflict with the use case, maybe we can do that at the end of the workflow?
  3. And if we can assume that, it might also be nice if the workflow could emit an output like `Map[String,String] biosample_metadata` (which of course should be identical for all elements of the scatter, so this just presents the deduplicated map). This would just contain the keys that start with `sample_` but not the other ones that are tied to the SRA entry.
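
To make points 2 and 3 concrete, here is a minimal Python sketch of the logic only (not WDL, and not code from this PR). It assumes each scattered call yields a flat string-to-string metadata map; the function name `dedup_biosample_metadata` and the example values are hypothetical, while `biosample_accession` and the `sample_` prefix come from the comment above.

```python
from __future__ import annotations


def dedup_biosample_metadata(
    per_run: list[dict[str, str]],
    accession_key: str = "biosample_accession",
    prefix: str = "sample_",
) -> dict[str, str]:
    """Check that all runs share one biosample, then return its metadata.

    Only keys starting with `prefix` are kept; run-specific keys tied to
    the SRA entry are dropped.
    """
    # Point 2: exactly one unique biosample accession across all results.
    accessions = {meta[accession_key] for meta in per_run}
    if len(accessions) != 1:
        raise ValueError(
            f"expected one unique {accession_key}, found {sorted(accessions)}"
        )

    # Point 3: keep only the sample-level keys from each run's metadata.
    sample_views = [
        {k: v for k, v in meta.items() if k.startswith(prefix)}
        for meta in per_run
    ]
    first = sample_views[0]
    # The sample-level view should be identical for every scatter element.
    if any(view != first for view in sample_views[1:]):
        raise ValueError("sample-level metadata differs between runs")
    return first


if __name__ == "__main__":
    # Made-up metadata for two runs of the same sample.
    runs = [
        {"biosample_accession": "SAMN000", "sample_strain": "x", "run_id": "SRR1"},
        {"biosample_accession": "SAMN000", "sample_strain": "x", "run_id": "SRR2"},
    ]
    print(dedup_biosample_metadata(runs))
```

Note that `biosample_accession` itself does not start with `sample_`, so under this filter it is not part of the returned map; whether it should be included is left open here.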

3 participants