host_filter.wdl modernization (#70)
* fastp

* fastp single

* bowtie2 run

* hisat2 run

* dedup run

* run subsample

* run kallisto

* adjust index tar filenames

* polishing

* polishing

* count reads in each step

* Create host_filter_indexing.wdl

* boost fastp complexity threshold

* output fastp report

* build fastp from our fork with SDUST complexity filtering

* use fastp --sdust_complexity_filter

* bump

* bump

* tune

* stub the remaining step descriptions

* wire to tests

* and auto_benchmark

* fixup tests

* fixup tests

* fixup tests

* fixup tests

* fixup tests

* fixup tests

* add back in picard CollectInsertSizeMetrics

* picard step description

* host_filter_2022.wdl => host_filter.wdl

* polish

* restore fastqs_0 and fastqs_1 to minimize collateral changes

* add minimap2 index build

* picard_insert_metrics.txt

* amr/run.wdl workaround

* index multiple transcripts_fasta_gz

* make gtf optional

* allow uncompressed genome fasta

* allow uncompressed genome fasta

* allow uncompressed genome fasta

* bump minimap2 memory

* bump minimap2 memory

* step descriptions -- first draft

* add indexing driver & draft readme

* include invocations in step descriptions

* rebase amr fix

* load card_json

* run kallisto every time

* fix amr wdl

* fix short-read-mngs rebase weirdness

* add final things

* [modernized host filter] add ERCC and gene-level outputs to kallisto (#175)

The kallisto step gains two new derivative output files:
* `ERCC_counts.tsv`: Estimated read counts for the ERCC sequences only (two-column TSV: ERCC_id, est_counts)
* `gene_abundance.tsv`: gene-level est_counts and tpm, computed by summing over all transcripts for each gene
* (and `abundance.tsv` is renamed to `transcript_abundance.tsv`)

To produce `gene_abundance.tsv` we need a new input `gtf_gz`, the Ensembl GTF file for the host species, which maps the transcript IDs in `transcript_abundance.tsv` onto gene IDs for the roll-up. The input is optional; if it is absent, the `gene_abundance.tsv` output is omitted too.

Note: docker image update needed to install & upgrade some dependencies.
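
For illustration, the gene-level roll-up and the ERCC extraction described above can be sketched in a few lines of Python. This is a minimal sketch, not the pipeline's actual implementation: the function names, the `transcript_to_gene` dictionary (standing in for a mapping parsed from the GTF), the `ERCC-` ID prefix, and the example rows are all assumptions. Only the column names (`target_id`, `est_counts`, `tpm`) follow kallisto's standard abundance table.

```python
import csv
import io
from collections import defaultdict


def gene_abundance(transcript_rows, transcript_to_gene):
    """Sum est_counts and tpm over all transcripts belonging to each gene."""
    totals = defaultdict(lambda: {"est_counts": 0.0, "tpm": 0.0})
    for row in transcript_rows:
        gene_id = transcript_to_gene.get(row["target_id"])
        if gene_id is None:
            # Transcript has no gene in the mapping (e.g. an ERCC spike-in).
            continue
        totals[gene_id]["est_counts"] += float(row["est_counts"])
        totals[gene_id]["tpm"] += float(row["tpm"])
    return dict(totals)


def ercc_counts(transcript_rows, prefix="ERCC-"):
    """Return (ERCC_id, est_counts) pairs; the ID prefix is an assumption."""
    return [
        (row["target_id"], float(row["est_counts"]))
        for row in transcript_rows
        if row["target_id"].startswith(prefix)
    ]


# Hypothetical transcript-level table in kallisto's abundance.tsv layout.
EXAMPLE_TSV = (
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "TX1\t1000\t800\t10\t5.0\n"
    "TX2\t1200\t1000\t30\t15.0\n"
    "TX3\t900\t700\t7\t3.5\n"
    "ERCC-00002\t1061\t861\t100\t50.0\n"
)

# Hypothetical transcript -> gene mapping, as would be derived from the GTF.
TRANSCRIPT_TO_GENE = {"TX1": "GENE_A", "TX2": "GENE_A", "TX3": "GENE_B"}

rows = list(csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter="\t"))
genes = gene_abundance(rows, TRANSCRIPT_TO_GENE)
ercc = ercc_counts(rows)
```

Here `genes` holds one summed entry per gene and `ercc` holds the two-column ERCC table. Transcripts missing from the mapping are skipped rather than treated as errors, which is one possible design choice when the GTF does not cover spike-in sequences.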

* load card_json explicitly

* add ~

* fix host_filter unit tests

* fix host_filter unit tests

* bowtie2: sort by read name for better reproducibility

* update minimap2 indexing invocation

* add chelonia_mydas, drosophila_melanogaster, gray_whale, pea-aphid

* copy-paste {bowtie2,hisat2}_human_filter to support pipeline viz

* allow kallisto nonzero exit

* rename modern host filtering inputs/outputs and create a 1-1 mapping between inputs/outputs

* fix lint issue

* rename reads_in_count to input_read_count

* auto_benchmark updates

* fix test_RunCZIDDedup_safe_csv

* rename kallisto output files

* update mosquitos with several Culicidae

* add files to wdl output for pipeline viz compatibility

* convert headers in descriptions to bolded text

* delete host_filter_indexing since it's subsumed in #182

* fix glob patterns in read counting

* Revert "fix glob patterns in read counting"

This reverts commit aeb234f.

* [Bug] fix count expansion for single file short-read-mngs (#216)

* fix bowtie2 counts for single file

* fix extra expansions

* relieve hisat2 dependency

* single sample hisat2

* fix hisat2

* fix dockerfile for hisat2

---------

Co-authored-by: Omar Valenzuela <[email protected]>

* Remove AMR changes that are a WIP from modern host filtering branch (#219)

* Revert "output gene id in primary output file (#209)"

This reverts commit 2d9ff56.

* Revert "Output non host reads and non host contigs for AMR (#205)"

This reverts commit 9de3fc2.

* tune hisat2 memory usage (#223)

* Legacy Host Filter initial commit (#224)

* legacy-host-filter-inital-commit

* linting

* add stage io map

* remove stage io map swp file

* Revert "Remove AMR changes that are a WIP from modern host filtering branch (#219)" (#226)

This reverts commit 227a489.

---------

Co-authored-by: Mike Lin <[email protected]>
Co-authored-by: Omar Valenzuela <[email protected]>
Co-authored-by: Omar Valenzuela <[email protected]>
Co-authored-by: rzlim08 <[email protected]>
5 people authored Apr 25, 2023
1 parent aed2db8 commit 60a7e78
Showing 30 changed files with 1,894 additions and 733 deletions.
12 changes: 6 additions & 6 deletions workflows/amr/run.wdl
@@ -32,8 +32,8 @@ workflow amr {
input:
non_host_reads = select_all(
[
- host_filter_stage.gsnap_filter_out_gsnap_filter_1_fa,
- host_filter_stage.gsnap_filter_out_gsnap_filter_2_fa
+ host_filter_stage.subsampled_out_subsampled_1_fa,
+ host_filter_stage.subsampled_out_subsampled_2_fa
]
),
min_contig_length = min_contig_length,
@@ -45,8 +45,8 @@ workflow amr {
non_host_reads = select_first([non_host_reads,
select_all(
[
- host_filter_stage.gsnap_filter_out_gsnap_filter_1_fa,
- host_filter_stage.gsnap_filter_out_gsnap_filter_2_fa
+ host_filter_stage.subsampled_out_subsampled_1_fa,
+ host_filter_stage.subsampled_out_subsampled_2_fa
]
)]),
card_json = card_json,
@@ -102,8 +102,8 @@ workflow amr {
non_host_reads,
select_all(
[
- host_filter_stage.gsnap_filter_out_gsnap_filter_1_fa,
- host_filter_stage.gsnap_filter_out_gsnap_filter_2_fa
+ host_filter_stage.subsampled_out_subsampled_1_fa,
+ host_filter_stage.subsampled_out_subsampled_2_fa
]
)
]),
141 changes: 141 additions & 0 deletions workflows/legacy-host-filter/Dockerfile
@@ -0,0 +1,141 @@
# syntax=docker/dockerfile:1.4
FROM ubuntu:18.04
ARG DEBIAN_FRONTEND=noninteractive
ARG MINIWDL_VERSION=1.1.5

LABEL maintainer="CZ ID Team <[email protected]>"

RUN sed -i s/archive.ubuntu.com/us-west-2.ec2.archive.ubuntu.com/ /etc/apt/sources.list; \
echo 'APT::Install-Recommends "false";' > /etc/apt/apt.conf.d/98czid; \
echo 'APT::Install-Suggests "false";' > /etc/apt/apt.conf.d/99czid

RUN apt-get -q update && apt-get -q install -y \
jq \
moreutils \
pigz \
pixz \
aria2 \
httpie \
curl \
wget \
zip \
unzip \
zlib1g-dev \
pkg-config \
apt-utils \
libbz2-dev \
liblzma-dev \
software-properties-common \
libarchive-tools \
liblz4-tool \
lbzip2 \
docker.io \
python3-dev \
python3-pip \
python3-setuptools \
python3-wheel \
python3-requests \
python3-yaml \
python3-dateutil \
python3-psutil \
python3-cutadapt \
python3-scipy \
samtools \
fastx-toolkit \
seqtk \
bedtools \
dh-autoreconf \
nasm \
build-essential

# The following packages pull in python2.7
RUN apt-get -q install -y \
bowtie2 \
spades \
ncbi-blast+

RUN pip3 install boto3==1.23.10 marisa-trie==0.7.7 pytest
RUN pip3 install miniwdl==${MINIWDL_VERSION} miniwdl-s3parcp==0.0.5 miniwdl-s3upload==0.0.4
RUN pip3 install https://github.com/chanzuckerberg/miniwdl-plugins/archive/f0465b0.zip#subdirectory=sfn-wdl
RUN pip3 install https://github.com/chanzuckerberg/s3mi/archive/v0.8.0.tar.gz

ADD https://raw.githubusercontent.com/chanzuckerberg/miniwdl/v${MINIWDL_VERSION}/examples/clean_download_cache.sh /usr/local/bin
RUN chmod +x /usr/local/bin/clean_download_cache.sh

# docker.io is the largest package at 250MB+ / half of all package disk space usage.
# The docker daemons never run inside the container - removing them saves 150MB+
RUN rm -f /usr/bin/dockerd /usr/bin/containerd*

RUN cd /usr/bin; curl -O https://amazon-ecr-credential-helper-releases.s3.amazonaws.com/0.4.0/linux-amd64/docker-credential-ecr-login
RUN chmod +x /usr/bin/docker-credential-ecr-login
RUN mkdir -p /root/.docker
RUN jq -n '.credsStore="ecr-login"' > /root/.docker/config.json

RUN curl -L -o /usr/bin/czid-dedup https://github.com/chanzuckerberg/czid-dedup/releases/download/v0.1.2/czid-dedup-Linux; chmod +x /usr/bin/czid-dedup

# Note: bsdtar is available in libarchive-tools
# Note: python3-scipy pulls in gcc (fixed in Ubuntu 19.10)
# TODO: kSNP3 (separate phylotree image?)

# Note: the NonHostAlignment stage uses a different version of gmap custom to CZ ID, installed here:
# https://github.com/chanzuckerberg/czid/blob/master/workflows/docker/gsnap/Dockerfile#L16-L20
# TODO: migrate both to https://packages.ubuntu.com/focal/gmap (updates to gmap require revalidation)
RUN apt-get -q install -y gmap

# FIXME: replace trimmomatic with cutadapt (trimmomatic pulls in too many deps)
RUN apt-get -q install -y trimmomatic
RUN ln -sf /usr/share/java/trimmomatic-0.36.jar /usr/local/bin/trimmomatic-0.38.jar

# FIXME: replace PriceSeqFilter with cutadapt quality/N-fraction cutoff
RUN curl -s https://idseq-prod-pipeline-public-assets-us-west-2.s3-us-west-2.amazonaws.com/PriceSource140408/PriceSeqFilter > /usr/bin/PriceSeqFilter
RUN chmod +x /usr/bin/PriceSeqFilter

RUN curl -Ls https://github.com/chanzuckerberg/s3parcp/releases/download/v0.2.0-alpha/s3parcp_0.2.0-alpha_Linux_x86_64.tar.gz | tar -C /usr/bin -xz s3parcp

# FIXME: check if use of pandas, pysam is necessary
RUN pip3 install pysam==0.14.1 pandas==1.1.5

# Picard for average fragment size https://github.com/broadinstitute/picard
# r-base is a dependency of collecting input size metrics https://github.com/bioconda/bioconda-recipes/pull/16398
RUN apt-get install -y r-base
RUN curl -L -o /usr/local/bin/picard.jar https://github.com/broadinstitute/picard/releases/download/2.21.2/picard.jar
# Create a single executable so we can use SingleCommand
RUN printf '#!/bin/bash\njava -jar /usr/local/bin/picard.jar "$@"\n' > /usr/local/bin/picard
RUN chmod +x /usr/local/bin/picard

# install STAR, the package rna-star does not include STARlong
RUN curl -L https://github.com/alexdobin/STAR/archive/2.5.3a.tar.gz | tar xz
RUN mv STAR-2.5.3a/bin/Linux_x86_64_static/* /usr/local/bin
RUN rm -rf STAR-2.5.3a


RUN apt-get -y update && apt-get install -y build-essential libz-dev git python3-pip cmake

# Host filtering (2022 version) dependencies
# fastp (libdeflate libisal (dh-autoreconf nasm))
# hisat2
# bowtie2 [already installed]
# kallisto + python gtfparse
WORKDIR /tmp
RUN wget -nv -O - https://github.com/intel/isa-l/archive/refs/tags/v2.30.0.tar.gz | tar zx
RUN cd isa-l-* && ./autogen.sh && ./configure && make -j8 && make install
RUN wget -nv -O - https://github.com/ebiggers/libdeflate/archive/refs/tags/v1.12.tar.gz | tar zx
RUN cd libdeflate-* && make -j8 && make install
RUN ldconfig
RUN git clone https://github.com/mlin/fastp.git && git -C fastp checkout 37edd60
RUN cd fastp && make -j8 && ./fastp test && cp fastp /usr/local/bin
WORKDIR /
RUN wget -nv -O /tmp/HISAT2.zip https://czid-public-references.s3.us-west-2.amazonaws.com/test/hisat2/hisat2.zip \
&& unzip /tmp/HISAT2.zip && rm /tmp/HISAT2.zip
RUN curl -L https://github.com/pachterlab/kallisto/releases/download/v0.46.1/kallisto_linux-v0.46.1.tar.gz | tar xz -C /
RUN pip3 install gtfparse==1.2.1

# Uninstall build only dependencies
RUN apt-get purge -y g++ libperl4-corelibs-perl make

COPY --from=lib idseq-dag /tmp/idseq-dag
RUN pip3 install /tmp/idseq-dag && rm -rf /tmp/idseq-dag

COPY --from=lib idseq_utils /tmp/idseq_utils
RUN pip3 install /tmp/idseq_utils && rm -rf /tmp/idseq_utils
