Substantial rewrite of the pipeline #32

Merged
merged 54 commits into from
Jun 7, 2024
eaf5dac
Add subworkflow to validate samplesheets
robomics May 30, 2024
5be09d3
Add subworkflow to process TADs
robomics May 30, 2024
cc78cb4
Add subworkflow to run NCHG
robomics May 30, 2024
bec9685
Refactor
robomics May 30, 2024
3134d7f
Add subworkflow to run call TAD cliques
robomics May 30, 2024
8474004
Update main.nf and nextflow.config
robomics May 30, 2024
4698c92
Add new Dockerfiles
robomics May 30, 2024
97e7379
Update env.yml
robomics May 30, 2024
64b230b
Remove unnecessary import
robomics May 30, 2024
033b108
Update base config
robomics May 30, 2024
cbdb35f
Remove debug code
robomics May 30, 2024
b823019
Update dependabot
robomics May 30, 2024
b9f464c
Update CI
robomics May 30, 2024
b64b3fc
Merge branch 'main' into rewrite
robomics May 30, 2024
b8a9e76
Merge branch 'main' into rewrite
robomics May 30, 2024
26cc61a
Update container uris in nextflow.config
robomics May 30, 2024
d448fa1
Bugfix
robomics May 30, 2024
b67bbea
Bugfix
robomics May 30, 2024
f4f2bec
Merge branch 'main' into rewrite
robomics May 30, 2024
ce2a614
Add missing tag
robomics May 30, 2024
7c2ba76
Merge branch 'main' into rewrite
robomics May 30, 2024
9e9b3f9
Debugging
robomics May 31, 2024
b8a68e1
Debugging
robomics May 31, 2024
9e016f9
Remove debug code
robomics May 31, 2024
3d2039c
Speed up CI
robomics May 31, 2024
20bbda8
Bugfix
robomics May 31, 2024
6bbfeec
Fail CI on test dataset cache miss
robomics May 31, 2024
dd15c20
Update NCHG Dockerfile
robomics May 31, 2024
fdc6ca9
Update test config
robomics May 31, 2024
5e2bfa8
Merge branch 'main' into rewrite
robomics May 31, 2024
c5eb6f0
Update apply_normalization.py
robomics May 31, 2024
0120739
Update plot_maximal_clique_sizes.py
robomics Jun 3, 2024
dd6d6e3
Update NCHG subworkflow
robomics Jun 3, 2024
e23524c
Update cliques workflow
robomics Jun 3, 2024
7602528
Bugfix
robomics Jun 3, 2024
91444e5
Bugfix
robomics Jun 3, 2024
8d62699
Bugfix
robomics Jun 4, 2024
a4dc5b0
Merge branch 'main' into rewrite
robomics Jun 4, 2024
2032607
Update nextflow.config
robomics Jun 4, 2024
c3b9a40
Add label process_very_high
robomics Jun 4, 2024
4abf826
Let call_cliques.py run for longer
robomics Jun 4, 2024
70ed778
Refactor
robomics Jun 4, 2024
15ef31e
Support specifying different cutoffs for cis and trans cliques
robomics Jun 4, 2024
53156bf
Merge branch 'main' into rewrite
robomics Jun 4, 2024
a81fe9f
Update CI
robomics Jun 4, 2024
c1e91ea
Bugfix
robomics Jun 4, 2024
6365671
Bugfix
robomics Jun 4, 2024
eccb270
Bugfix
robomics Jun 5, 2024
325c919
Update cliques workflow
robomics Jun 5, 2024
d6086ed
Output NCHG expected values
robomics Jun 6, 2024
8194d3d
Bugfix
robomics Jun 7, 2024
c4565d6
Update plot_significant_interactions.py
robomics Jun 7, 2024
c879406
Update README
robomics Jun 7, 2024
59c4316
Update nextflow.config
robomics Jun 7, 2024
2 changes: 1 addition & 1 deletion .github/dependabot.yml
Expand Up @@ -9,6 +9,6 @@ updates:
interval: "daily"

- package-ecosystem: "docker"
directory: "/"
directory: "/containers"
schedule:
interval: "daily"
61 changes: 9 additions & 52 deletions .github/workflows/ci.yml
Expand Up @@ -9,7 +9,6 @@ on:
paths:
- ".github/workflows/ci.yml"
- ".dockerignore"
- "Dockerfile"
- "env.yml"
- "main.nf"
- "nextflow.config"
Expand All @@ -20,7 +19,6 @@ on:
paths:
- ".github/workflows/ci.yml"
- ".dockerignore"
- "Dockerfile"
- "env.yml"
- "main.nf"
- "nextflow.config"
Expand All @@ -44,46 +42,6 @@ env:
NXF_ANSI_LOG: false

jobs:
build-image:
name: Build Dockerfile
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4

- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
images: ghcr.io/${{ github.repository }}
flavor: |
latest=true
tags: |
type=semver,priority=1000,pattern={{version}}
type=sha,priority=900
type=ref,priority=800,event=branch
type=ref,priority=700,event=pr

- name: Login to GitHub Container Registry
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.repository_owner }}
password: ${{ secrets.GITHUB_TOKEN }}

- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3

- name: Build Docker image and push to registries
uses: docker/build-push-action@v5
with:
context: ${{ github.workspace }}
push: ${{ github.event_name != 'pull_request' }}
cache-from: type=gha
cache-to: type=gha,mode=min
tags: ${{ steps.meta.outputs.tags }}


preproc-test-dataset:
name: Preprocess test dataset
runs-on: ubuntu-latest
Expand Down Expand Up @@ -120,13 +78,13 @@ jobs:
- name: Generate requirements.txt
if: steps.cache-dataset.outputs.cache-hit != 'true'
run: |
echo 'cooler==0.9.1' > requirements.txt
echo 'cooler==0.10.0' > requirements.txt

- name: Setup Python
if: steps.cache-dataset.outputs.cache-hit != 'true'
uses: actions/setup-python@v5
with:
python-version: 3.9
python-version: 3.12
cache: pip

- name: Remove unused resolutions
Expand All @@ -137,17 +95,15 @@ jobs:

pip3 install -r requirements.txt

for res in 100000 1000000; do
cooler cp "$src::/resolutions/$res" "$dest::/resolutions/$res"
done
cooler cp "$src::/resolutions/100000" "$dest"

mv "$dest" "$src"


test-workflow:
name: Test workflow
runs-on: ubuntu-latest
needs: [ build-image, preproc-test-dataset ]
needs: [ preproc-test-dataset ]

strategy:
matrix:
Expand All @@ -162,6 +118,7 @@ jobs:
with:
key: ${{ needs.preproc-test-dataset.outputs.cache-key }}
path: data/
fail-on-cache-miss: true

- name: Install Nextflow
uses: nf-core/setup-nextflow@v2
Expand All @@ -172,12 +129,12 @@ jobs:
run: |
nextflow run -c config/test.config \
--sample="test" \
--cooler_cis="data/$(basename "$TEST_DATASET_URL")::/resolutions/100000" \
--cooler_trans="data/$(basename "$TEST_DATASET_URL")::/resolutions/1000000" \
--hic_file="data/$(basename "$TEST_DATASET_URL")" \
--resolution=100000 \
--outdir=data/out \
--max_cpus=2 \
--max_cpus=$(nproc) \
--max_memory=6.GB \
--max_time=2.h \
.

ls -lah data/out/test_{all,cis,trans}_{cliques,domains}*.gz
ls -lah data/out/cliques/test_cis_{cliques,domains}*.gz
128 changes: 75 additions & 53 deletions README.md
Expand Up @@ -16,7 +16,7 @@ The workflow is largely based on [this](https://github.com/Chrom3D/INC-tutorial)

### Software requirements

- Nextflow (at least version: v20.07.1. Pipeline was developed using v22.10.7)
- Nextflow (at least version: v20.07.1. Pipeline was developed using v22.10.8)
- Docker or Singularity/Apptainer

Running the pipeline without containers is technically possible, but it is not recommended.
Expand All @@ -32,9 +32,7 @@ To install the dependencies in a Conda environment named `myenv`, run the follow
conda env update --name myenv --file env.yml --prune
```

You will also need to compile `NCHG` from the source code available at [Chrom3D/preprocess_scripts](https://github.com/Chrom3D/preprocess_scripts).

Check out the `Dockerfile` from this repo for an example of how this can be done using Conda.
You will also need to compile `NCHG` from the source code available at [paulsengroup/NCHG](https://github.com/paulsengroup/NCHG).

</details>

Expand All @@ -48,14 +46,15 @@ The workflow can be run in two ways:

The samplesheet should be a TSV file with the following columns:

| sample | cooler_cis | cooler_trans | tads |
|--------------|-------------------------------------------|---------------------------------------------|--------------------------|
| sample_name | high_resolution.cool | low_resolution.cool | tads.bed |
| 4DNFI74YHN5W | 4DNFI74YHN5W.mcool::/resolutions/50000 | 4DNFI74YHN5W.mcool::/resolutions/1000000 | 4DNFI74YHN5W_domains.bed |
| sample | hic_file | resolution | tads |
|--------------|----------------------------------------|------------|---------------------------|
| sample_name | myfile.hic | 50000 | tads.bed |
| 4DNFI74YHN5W | 4DNFI74YHN5W.mcool::/resolutions/50000 | 50000 | 4DNFI74YHN5W_domains.bed |


- __sample__: Sample names/ids. This field will be used as a prefix in the output file names (see [below](#running-the-workflow)).
- __cooler_cis__: path to a cooler file with the contact matrix used to read intra-chromosomal/cis interactions (usually a matrix with resolution ~50kbp).
- __cooler_trans__: path to a cooler file with the contact matrix used to read intra-chromosomal/trans interactions (usually a matrix with resolution ~1Mbp).
- __hic_file__: Path to a file in .hic or Cooler format.
- __resolution__: Resolution to be used for the data analysis (50-100kbp are good starting points).
- __tads__ (optional) : path to a BED3+ file with the list of TADs. When not specified, the workflow will use [hicFindTADs](https://hicexplorer.readthedocs.io/en/latest/content/tools/hicFindTADs.html) from the [HiCExplorer](https://github.com/deeptools/HiCExplorer) suite to call TADs.

URI syntax for multi-resolution Cooler files is supported (e.g. `myfile.mcool::/resolutions/bin_size`).
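As an illustration of the URI syntax above, here is a minimal Python sketch that splits a Cooler URI into its file path and HDF5 group, and recovers the bin size when the group follows the `/resolutions/<bin_size>` layout. The helper names `split_cooler_uri` and `resolution_from_uri` are hypothetical and are not part of this pipeline; the sketch only assumes that `::` separates the file path from the group.

```python
from typing import NamedTuple, Optional


class CoolerURI(NamedTuple):
    path: str             # path to the .cool/.mcool file
    group: Optional[str]  # e.g. "/resolutions/50000"; None when no "::" is present


def split_cooler_uri(uri: str) -> CoolerURI:
    """Split 'file.mcool::/resolutions/50000' into file path and HDF5 group."""
    path, sep, group = uri.partition("::")
    return CoolerURI(path, group if sep else None)


def resolution_from_uri(uri: str) -> Optional[int]:
    """Return the bin size encoded in the URI, or None when it cannot be inferred."""
    group = split_cooler_uri(uri).group
    prefix = "/resolutions/"
    if group is None or not group.startswith(prefix):
        return None
    tail = group[len(prefix):]
    return int(tail) if tail.isdigit() else None
```

A helper like this could be used to check that the __resolution__ column agrees with the resolution embedded in a URI before launching the workflow.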
Expand All @@ -68,17 +67,17 @@ Furthermore, all contact matrices (and TADs when provided) should use the same r
When running the workflow without a samplesheet, the following parameters are required:

- __sample__
- __cooler_cis__
- __cooler_trans__
- __hic_file__
- __tads__

Parameters have the same meaning as the header fields outlined in the [previous section](#using-a-samplesheet).

The above parameters can be passed directly through the CLI when calling `nextflow run`:

```bash
nextflow run --sample='4DNFI74YHN5W' \
--cooler_cis='data/4DNFI74YHN5W.mcool::/resolutions/100000' \
--cooler_trans='data/4DNFI74YHN5W.mcool::/resolutions/1000000' \
--hic_file='data/4DNFI74YHN5W.mcool' \
--resolution=50000
...
```

Expand All @@ -87,8 +86,8 @@ Alternatively, parameters can be written to a `config` file:
user@dev:/tmp$ cat myconfig.txt

sample = '4DNFI74YHN5W'
cooler_cis = 'data/4DNFI74YHN5W.mcool::/resolutions/100000'
cooler_trans = 'data/4DNFI74YHN5W.mcool::/resolutions/1000000'
hic_file = 'data/4DNFI74YHN5W.mcool'
resolution = 50000
```

and the `config` file is then passed to `nextflow run`:
Expand All @@ -100,12 +99,26 @@ nextflow run -c myconfig.txt ...

### Optional files and parameters

In addition to the mandatory parameters, providing the following parameters is highly recommended:
In addition to the mandatory parameters, the pipeline accepts the following parameters:

- __cytoband__: path to a [cytoband](https://software.broadinstitute.org/software/igv/cytoband) file. Used to mask centromeric regions.
- __assembly_gaps__: path to a BED file with the list of assembly gaps/unmappable regions.

UCSC publishes the above files for common reference genomes: [link](https://hgdownload.cse.ucsc.edu/goldenPath/) (files are usually named `cytoBand.txt.gz` and `gaps.txt.gz` respectively).
- __custom_mask__: path to a BED file with a list of custom regions to be masked out.

Note that NCHG by default uses the `MAD-max` filter to remove bins with suspiciously high or low marginals, so providing the above files is usually not required.
One exception is when dealing with genomes affected by structural variants, in which case we recommend masking out these regions using __custom_mask__.

- __hicexplorer_hic_norm__: normalization to use when calling TADs from .hic files.
- __hicexplorer_cool_norm__: normalization to use when calling TADs from .\[m\]cool files.
- __nchg_mad_max__: cutoff used by NCHG when performing the `MAD-max` filtering.
- __nchg_bad_bin_fraction__: bad bin fraction used by NCHG to discard domains overlapping with a high fraction of bad bins.
- __nchg_fdr_cis__: adjusted pvalue used by NCHG to filter significant cis interactions.
- __nchg_log_ratio_cis__: log ratio used by NCHG to filter significant cis interactions.
- __nchg_fdr_trans__: adjusted pvalue used by NCHG to filter significant trans interactions.
- __nchg_log_ratio_trans__: log ratio used by NCHG to filter significant trans interactions.
- __clique_size_thresh__: minimum clique size.
- __call_cis_cliques__: call cliques overlapping cis regions of the Hi-C matrix.
- __call_trans_cliques__: call cliques overlapping trans regions of the Hi-C matrix.

By default, the workflow results are published under `result/`. The output folder can be customized through the __outdir__ parameter.

Expand All @@ -123,8 +136,8 @@ utils/download_example_datasets.sh data/
Next, create a `samplesheet.tsv` file like the following (make sure you are using tabs, not spaces!):

```tsv
sample cooler_cis cooler_trans tads
example data/4DNFI74YHN5W.mcool::/resolutions/50000 data/4DNFI74YHN5W.mcool::/resolutions/1000000
sample hic_file resolution tads
example data/4DNFI74YHN5W.mcool 50000
```
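Because tabs and spaces are easy to confuse in a text editor, the samplesheet can also be generated programmatically. The following is a minimal sketch using Python's standard `csv` module; the function name `write_samplesheet` is hypothetical, and the sketch assumes that an empty trailing field is an acceptable way to leave the optional `tads` column unset.

```python
import csv
import io

# Column names taken from the samplesheet header described in the README
COLUMNS = ["sample", "hic_file", "resolution", "tads"]


def write_samplesheet(rows, handle):
    """Write rows (a list of dicts) as a tab-separated samplesheet."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        # Missing optional columns (e.g. tads) are written as empty fields
        writer.writerow({col: row.get(col, "") for col in COLUMNS})


buffer = io.StringIO()
write_samplesheet(
    [{"sample": "example", "hic_file": "data/4DNFI74YHN5W.mcool", "resolution": 50000}],
    buffer,
)
samplesheet_text = buffer.getvalue()
```

To write to disk instead, pass a file opened with `open("samplesheet.tsv", "w", newline="")` as `handle`.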

Finally, run the workflow with:
Expand All @@ -133,45 +146,54 @@ user@dev:/tmp$ nextflow run --max_cpus=8 \
--max_memory=16.GB \
--max_time=2.h \
--sample_sheet=samplesheet.tsv \
--cytoband=data/cytoBand.txt.gz \
--assembly_gaps=data/gaps.txt.gz \
--outdir=data/results/ \
https://github.com/robomics/call_tad_cliques \
-r v0.3.2 \
-r v0.4.0 \
-with-singularity # Replace this with -with-docker to use Docker instead

N E X T F L O W ~ version 22.10.7
Launching `https://github.com/robomics/call_tad_cliques` [focused_bohr] DSL2 - revision: 9a02af259a [v0.3.2]
executor > local (16)
[71/d00e78] process > check_sample_sheet [100%] 1 of 1 ✔
[ad/a7b3ba] process > process_sample_sheet [100%] 1 of 1 ✔
[03/43e432] process > extract_chrom_sizes_from_cooler [100%] 1 of 1 ✔
[81/142f17] process > generate_bed_mask [100%] 1 of 1 ✔
[be/c84dae] process > process_tads (1) [100%] 1 of 1 ✔
[de/142949] process > fill_gaps_between_tads (1) [100%] 1 of 1 ✔
[89/098ccb] process > bedtools_bed_setdiff (1) [100%] 1 of 1 ✔
[cc/368819] process > map_intrachrom_interactions (1) [100%] 1 of 1 ✔
[4d/ecb1ac] process > select_significant_intrachrom_interactions (1) [100%] 1 of 1 ✔
[70/1cfa53] process > collect_interchrom_interactions (1) [100%] 1 of 1 ✔
[cb/feb585] process > select_significant_interchrom_interactions (1) [100%] 1 of 1 ✔
[73/4ca264] process > map_interchrom_interactions_to_tads (1) [100%] 1 of 1 ✔
[ad/03bff0] process > merge_interactions (1) [100%] 1 of 1 ✔
[eb/5d93d5] process > call_cliques (3) [100%] 3 of 3 ✔
[e5/6cf437] process > plot_maximal_clique_sizes (1) [100%] 3 of 3 ✔
Completed at: 26-Feb-2023 20:40:59
Duration : 7m 30s
N E X T F L O W ~ version 24.04.2

Launching `./main.nf` [irreverent_wescoff] DSL2 - revision: 0be6cbedc5

executor > local (41)
[7b/fb313f] SAMPLESHEET:CHECK_SYNTAX | 1 of 1 ✔
[b0/4f77c1] SAMPLESHEET:CHECK_FILES | 1 of 1 ✔
[0b/0ba9ff] TADS:SELECT_NORMALIZATION_METHOD (example) | 1 of 1 ✔
[09/ca2567] TADS:APPLY_NORMALIZATION (example (50000; weight)) | 1 of 1 ✔
[da/a14762] TADS:HICEXPLORER_FIND_TADS (example) | 1 of 1 ✔
[- ] TADS:COPY -
[92/f261fc] NCHG:GENERATE_MASK | 1 of 1 ✔
[8e/40c02a] NCHG:MASK_DOMAINS (example) | 1 of 1 ✔
[36/f950a6] NCHG:EXPECTED (example) | 1 of 1 ✔
[bf/6fbee1] NCHG:GENERATE_CHROMOSOME_PAIRS (example) | 1 of 1 ✔
[47/945cac] NCHG:DUMP_CHROM_SIZES (example) | 1 of 1 ✔
[16/bb5e75] NCHG:COMPUTE (example (chr1:chr1)) | 21 of 21 ✔
[51/55dfbb] NCHG:MERGE (example (cis)) | 1 of 1 ✔
[f1/196069] NCHG:FILTER (example (cis)) | 1 of 1 ✔
[01/c17c93] NCHG:VIEW (example (cis)) | 1 of 1 ✔
[a7/68bab8] NCHG:CONCAT (example) | 1 of 1 ✔
[bd/e70a4d] NCHG:PLOT_EXPECTED (example) | 1 of 1 ✔
[a4/235495] NCHG:GET_HIC_PLOT_RESOLUTION (example) | 1 of 1 ✔
[78/19ba93] NCHG:PLOT_SIGNIFICANT (example) | 1 of 1 ✔
[76/b21803] CLIQUES:CALL (example) | 1 of 1 ✔
[e7/812637] CLIQUES:PLOT_MAXIMAL_CLIQUE_SIZE_DISTRIBUTION_BY_TAD (cis) | 1 of 1 ✔
[b6/d1f15a] CLIQUES:PLOT_CLIQUE_SIZE_DISTRIBUTION (cis) | 1 of 1 ✔
Completed at: 07-Jun-2024 16:10:27
Duration : 1m 28s
CPU hours : 0.1
Succeeded : 16
Succeeded : 41
```

This will create a `data/results/` folder with the following files:
- `example_all_cliques.tsv.gz` - TSV with the list of cliques computed from all significant interactions (i.e. both intra and inter-chromosomal interactions).
- `example_all_domains.bed.gz` - BED file with the list of domains part of cliques computed from all significant interactions. The last column encodes the domain ID.
- `example_cis_cliques.tsv.gz` - Same as `example_all_cliques.tsv.gz`, but for intra-chromosomal interactions only.
- `example_cis_domains.bed.gz` - Same as `example_all_domains.bed.gz`, but for intra-chromosomal interactions only.
- `example_trans_cliques.tsv.gz` - Same as `example_all_cliques.tsv.gz`, but for inter-chromosomal interactions only.
- `example_trans_domains.bed.gz` - Same as `example_all_domains.bed.gz`, but for inter-chromosomal interactions only.
- `plots/*.{png,svg}` - Plots showing the maximal clique size distribution.
- `cliques/example_cis_cliques.tsv.gz` - TSV with the list of cliques computed from significant cis interactions (i.e. intra-chromosomal interactions only).
- `cliques/example_cis_domains.bed.gz` - BED file with the list of domains that are part of cliques computed from significant cis interactions. The last column encodes the domain ID.
- `nchg/example.filtered.tsv.gz` - TSV with the statistically significant interactions detected by NCHG.
- `nchg/expected_values_example.cis.h5` - HDF5 file with the expected values computed by NCHG.
- `plots/cliques/cis_clique_size_distribution*` - Plots showing the clique size distribution.
- `plots/cliques/cis_tad_max_clique_size_distribution*` - Plots showing the maximal clique size distribution.
- `plots/nchg/example/example.*.*.png` - Plots showing the log ratio computed by NCHG for each chromosome pair analyzed.
- `plots/nchg/example/example_cis.png` - Plot showing the expected value profile computed by NCHG.
- `tads/example_tads.bed.gz` - TADs used to generate the list of genomic coordinates to be tested for significance.

The list of pairs of interacting domains can be generated using `bin/generate_cliques_bedpe.py`

Expand All @@ -197,7 +219,7 @@ If you get permission errors when using `-with-docker`:

If you get an error similar to:
```
Cannot find revision `v0.3.2` -- Make sure that it exists in the remote repository `https://github.com/robomics/call_tad_cliques`
Cannot find revision `v0.4.0` -- Make sure that it exists in the remote repository `https://github.com/robomics/call_tad_cliques`
```

try to remove folder `~/.nextflow/assets/robomics/call_tad_cliques` before running the workflow