robomics/call_tad_cliques

Nextflow workflow to call TAD cliques

This repository hosts a Nextflow workflow to call TAD cliques.

The workflow is largely based on this tutorial.

Requirements

Software requirements

  • Nextflow (v22.10.8 or newer; the pipeline was developed using v24.04.2)
  • Docker or Singularity/Apptainer

Running the pipeline without containers is technically possible, but it is not recommended.

If you absolutely cannot use containers...

Have a look at the env.yml for the list of dependencies to be installed.

To install the dependencies in a Conda environment named myenv, run the following:

conda env update --name myenv --file env.yml --prune

You will also need to compile NCHG from the source code available at paulsengroup/NCHG.

Required input files

The workflow can be run in two ways:

  1. Using a samplesheet (recommended, supports processing multiple samples at once)
  2. By specifying options directly on the CLI or using a config file

Using a samplesheet

The samplesheet should be a TSV file with the following columns:

sample        hic_file                                resolution  tads
sample_name   myfile.hic                              50000       tads.bed
4DNFI74YHN5W  4DNFI74YHN5W.mcool::/resolutions/50000  50000       4DNFI74YHN5W_domains.bed

  • sample: Sample name/ID. This field is used as the prefix of the output file names (see below).
  • hic_file: Path to a file in .hic or Cooler format.
  • resolution: Resolution to be used for the data analysis (50-100 kbp are good starting points).
  • tads (optional): Path to a BED3+ file with the list of TADs. When not specified, the workflow will use hicFindTADs from the HiCExplorer suite to call TADs.

URI syntax for multi-resolution Cooler files is supported (e.g. myfile.mcool::/resolutions/bin_size).

Furthermore, all contact matrices (and TADs when provided) should use the same reference genome assembly.
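
The samplesheet rules above can be checked before launching the workflow. The following is a minimal sketch, not part of the workflow itself: the function names check_samplesheet and split_cooler_uri are hypothetical, and the checks only cover what is described in this section (tab-separated columns, non-empty sample and hic_file, a positive integer resolution, and the file::/group form of multi-resolution Cooler URIs).

```python
import csv
import io

# Columns described in this README; "tads" is optional and is not checked here
REQUIRED_COLUMNS = ["sample", "hic_file", "resolution"]


def split_cooler_uri(uri):
    """Split 'file.mcool::/resolutions/50000' into (path, group).

    Plain paths (e.g. 'myfile.hic') are returned with group set to None.
    """
    path, sep, group = uri.partition("::")
    return path, (group if sep else None)


def check_samplesheet(handle):
    """Return a list of problems found in a samplesheet (an empty list means OK)."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        # A header typed with spaces parses as a single column and ends up here
        return [f"missing column(s): {', '.join(missing)} (is the file tab-separated?)"]

    problems = []
    for lineno, row in enumerate(reader, start=2):
        if not (row["sample"] or "").strip():
            problems.append(f"line {lineno}: empty sample name")
        if not (row["hic_file"] or "").strip():
            problems.append(f"line {lineno}: empty hic_file")
        resolution = (row["resolution"] or "").strip()
        if not resolution.isdecimal() or int(resolution) <= 0:
            problems.append(f"line {lineno}: resolution must be a positive integer")
    return problems


if __name__ == "__main__":
    sheet = "sample\thic_file\tresolution\ttads\nexample\tdata/4DNFI74YHN5W.mcool\t50000\t\n"
    print(check_samplesheet(io.StringIO(sheet)))
    print(split_cooler_uri("4DNFI74YHN5W.mcool::/resolutions/50000"))
```

To check a file on disk, open it with open("samplesheet.tsv", newline="") and pass the handle to check_samplesheet.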

Without using a samplesheet

To run the workflow without a samplesheet, the following parameters are required:

  • sample
  • hic_file
  • resolution

The tads parameter is optional.

Parameters have the same meaning as the header fields outlined in the previous section.

The above parameters can be passed directly through the CLI when calling nextflow run:

nextflow run --sample='4DNFI74YHN5W' \
             --hic_file='data/4DNFI74YHN5W.mcool' \
             --resolution=50000 \
             ...

Alternatively, parameters can be written to a config file:

user@dev:/tmp$ cat myconfig.txt

sample       = '4DNFI74YHN5W'
hic_file     = 'data/4DNFI74YHN5W.mcool'
resolution   = 50000

and the config file is then passed to nextflow run:

nextflow run -c myconfig.txt ...

Optional files and parameters

In addition to the mandatory parameters, the pipeline accepts the following parameters:

  • cytoband: path to a cytoband file. Used to mask centromeric regions.
  • assembly_gaps: path to a BED file with the list of assembly gaps/unmappable regions.
  • custom_mask: path to a BED file with a list of custom regions to be masked out.

Note that by default NCHG uses the MAD-max filter to remove bins with suspiciously high or low marginals, so providing the above files is usually not required. One exception is when dealing with genomes affected by structural variants, in which case we recommend masking out the affected regions using custom_mask.

  • hicexplorer_hic_norm: normalization to use when calling TADs from .hic files.
  • hicexplorer_cool_norm: normalization to use when calling TADs from .[m]cool files.
  • nchg_mad_max: cutoff used by NCHG when performing the MAD-max filtering.
  • nchg_bad_bin_fraction: bad bin fraction used by NCHG to discard domains overlapping with a high fraction of bad bins.
  • nchg_fdr_cis: adjusted p-value threshold used by NCHG to filter significant cis interactions.
  • nchg_log_ratio_cis: log ratio threshold used by NCHG to filter significant cis interactions.
  • nchg_fdr_trans: adjusted p-value threshold used by NCHG to filter significant trans interactions.
  • nchg_log_ratio_trans: log ratio threshold used by NCHG to filter significant trans interactions.
  • clique_size_thresh: minimum clique size.
  • call_cis_cliques: call cliques overlapping cis regions of the Hi-C matrix.
  • call_trans_cliques: call cliques overlapping trans regions of the Hi-C matrix.

By default, the workflow results are published under result/. The output folder can be customized through the outdir parameter.

For a complete list of parameters supported by the workflow refer to the workflow main config file.

Running the workflow

First, download the example datasets using the utils/download_example_datasets.sh script.

# This will download files inside folder data/
utils/download_example_datasets.sh data/

Next, create a samplesheet.tsv file like the following (make sure you are using tabs, not spaces!):

sample	hic_file	resolution	tads
example	data/4DNFI74YHN5W.mcool	50000	
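
Because editors often replace tabs with spaces silently, it can be safer to generate the samplesheet programmatically. The following is a small sketch under that assumption; format_samplesheet is a hypothetical helper, not something shipped with the workflow. It writes the same header and row shown above, leaving the optional tads column empty.

```python
import csv
import io

COLUMNS = ["sample", "hic_file", "resolution", "tads"]


def format_samplesheet(rows):
    """Render a list of dicts as tab-separated samplesheet text.

    Keys missing from a row (e.g. the optional "tads") are written as empty fields.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n", restval=""
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


text = format_samplesheet(
    [{"sample": "example", "hic_file": "data/4DNFI74YHN5W.mcool", "resolution": 50000}]
)
print(text, end="")
```

The returned text can then be saved with pathlib.Path("samplesheet.tsv").write_text(text).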

Finally, run the workflow with:

user@dev:/tmp$ nextflow run --max_cpus=8 \
                            --max_memory=16.GB \
                            --max_time=2.h \
                            --sample_sheet=samplesheet.tsv \
                            --outdir=data/results/ \
                            https://github.com/robomics/call_tad_cliques \
                            -r v0.4.0 \
                            -with-singularity  # Replace this with -with-docker to use Docker instead

 N E X T F L O W   ~  version 24.04.2

Launching `https://github.com/robomics/call_tad_cliques` [desperate_easley] DSL2 - revision: f89b6c923c [v0.4.0]

executor >  local (41)
[c4/ae5ade] SAMPLESHEET:CHECK_SYNTAX                                   | 1 of 1 ✔
[6a/eca471] SAMPLESHEET:CHECK_FILES                                    | 1 of 1 ✔
[43/a52d61] TADS:SELECT_NORMALIZATION_METHOD (example)                 | 1 of 1 ✔
[6b/6a7925] TADS:APPLY_NORMALIZATION (example (50000; weight))         | 1 of 1 ✔
[34/a35f83] TADS:HICEXPLORER_FIND_TADS (example)                       | 1 of 1 ✔
[-        ] TADS:COPY                                                  -
[27/c7fe33] NCHG:GENERATE_MASK                                         | 1 of 1 ✔
[3e/21fbf1] NCHG:MASK_DOMAINS (example)                                | 1 of 1 ✔
[4e/de0183] NCHG:EXPECTED (example)                                    | 1 of 1 ✔
[a5/1d67bc] NCHG:GENERATE_CHROMOSOME_PAIRS (example)                   | 1 of 1 ✔
[c8/2ab36d] NCHG:DUMP_CHROM_SIZES (example)                            | 1 of 1 ✔
[68/a97434] NCHG:COMPUTE (example (chr1:chr1))                         | 21 of 21 ✔
[ef/423514] NCHG:MERGE (example (cis))                                 | 1 of 1 ✔
[de/121f6a] NCHG:FILTER (example (cis))                                | 1 of 1 ✔
[0e/f58cfb] NCHG:VIEW (example (cis))                                  | 1 of 1 ✔
[e0/15439e] NCHG:CONCAT (example)                                      | 1 of 1 ✔
[0d/307019] NCHG:PLOT_EXPECTED (example)                               | 1 of 1 ✔
[7d/27fffc] NCHG:GET_HIC_PLOT_RESOLUTION (example)                     | 1 of 1 ✔
[ba/c1f985] NCHG:PLOT_SIGNIFICANT (example)                            | 1 of 1 ✔
[a1/87e559] CLIQUES:CALL (example)                                     | 1 of 1 ✔
[26/a2cab8] CLIQUES:PLOT_MAXIMAL_CLIQUE_SIZE_DISTRIBUTION_BY_TAD (cis) | 1 of 1 ✔
[f8/a17299] CLIQUES:PLOT_CLIQUE_SIZE_DISTRIBUTION (cis)                | 1 of 1 ✔
Completed at: 07-Jun-2024 16:38:56
Duration    : 1m 21s
CPU hours   : 0.1
Succeeded   : 41

This will create a data/results/ folder with the following files:

  • cliques/example_cis_cliques.tsv.gz - TSV with the list of cliques computed from cis significant interactions (i.e. intra-chromosomal interactions).
  • cliques/example_cis_domains.bed.gz - BED file with the list of domains that are part of cliques computed from cis significant interactions. The last column encodes the domain ID.
  • nchg/example.filtered.tsv.gz - TSV with the statistically significant interactions detected by NCHG.
  • nchg/expected_values_example.cis.h5 - HDF5 file with the expected values computed by NCHG.
  • plots/cliques/cis_clique_size_distribution* - Plots showing the clique size distribution.
  • plots/cliques/cis_tad_max_clique_size_distribution* - Plots showing the maximal clique size distribution.
  • plots/nchg/example/example.*.*.png - Plots showing the log ratio computed by NCHG for each chromosome pair analyzed.
  • plots/nchg/example/example_cis.png - Plot showing the expected value profile computed by NCHG.
  • tads/example_tads.bed.gz - TADs used to generate the list of genomic coordinates to be tested for significance.

The list of pairs of interacting domains can be generated using bin/generate_cliques_bedpe.py:

user@dev:/tmp$ bin/generate_cliques_bedpe.py data/results/cliques/example_cis_domains.bed.gz data/results/cliques/example_cis_cliques.tsv.gz |
chr3	94300000	95850000	chr3	94300000	95850000	CLIQUE_#0
chr3	94300000	95850000	chr3	108150000	108950000	CLIQUE_#0
chr3	94300000	95850000	chr3	116000000	116900000	CLIQUE_#0
chr3	94300000	95850000	chr3	137750000	138500000	CLIQUE_#0
chr3	94300000	95850000	chr3	152100000	153900000	CLIQUE_#0
chr3	108150000	108950000	chr3	94300000	95850000	CLIQUE_#0
chr3	108150000	108950000	chr3	108150000	108950000	CLIQUE_#0
chr3	108150000	108950000	chr3	116000000	116900000	CLIQUE_#0
chr3	108150000	108950000	chr3	137750000	138500000	CLIQUE_#0
chr3	108150000	108950000	chr3	152100000	153900000	CLIQUE_#0
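
The BEDPE output above lists every pair of domains in a clique, so it is often convenient to collapse it back into one list of domains per clique. The following is a minimal sketch, assuming the 7-column tab-separated layout shown in the example output (two sets of chrom/start/end coordinates followed by the clique ID); group_pairs_by_clique is a hypothetical helper, not part of the workflow.

```python
from collections import defaultdict


def group_pairs_by_clique(lines):
    """Map each clique ID to the sorted list of distinct domains found in its pairs.

    Assumes tab-separated lines: chrom1 start1 end1 chrom2 start2 end2 clique_id
    """
    domains = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        chrom1, start1, end1, chrom2, start2, end2, clique_id = line.split("\t")[:7]
        domains[clique_id].add((chrom1, int(start1), int(end1)))
        domains[clique_id].add((chrom2, int(start2), int(end2)))
    return {clique_id: sorted(doms) for clique_id, doms in domains.items()}


if __name__ == "__main__":
    import sys

    # Reads BEDPE records from stdin and prints the number of domains per clique
    for clique_id, doms in group_pairs_by_clique(sys.stdin).items():
        print(clique_id, len(doms), sep="\t")
```

Saved under a name of your choice, the script can read the output of bin/generate_cliques_bedpe.py through a pipe.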

Troubleshooting

If you get permission errors when using -with-docker:

  • Pass option -process.containerOptions="--user root" to nextflow run

If you get an error similar to:

Cannot find revision `v0.4.0` -- Make sure that it exists in the remote repository `https://github.com/robomics/call_tad_cliques`

try removing the folder ~/.nextflow/assets/robomics/call_tad_cliques before running the workflow again.

Getting help

If you are having trouble running the workflow feel free to reach out by starting a new discussion here.

Bug reports and feature requests can be submitted by opening an issue.