
Genebuild statistics pipeline

The pipeline provides Busco and OMArk completeness scores and calculates statistics for the Ensembl website when the core database is available. If only the assembly accession and the taxon ID are available, the pipeline provides the Busco score (mode=genome) for the assembly.


Requires Nextflow version <= 22.10.1 (version 21.10.5.5658 is currently available on Slurm).
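To check which Nextflow version is available on your cluster before launching the pipeline, you can run:

nextflow -version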

Running options

The following options require a list of mandatory arguments (see Mandatory arguments).

Busco pipeline --run_busco_core

Busco measures the completeness of a genome assembly and of the annotated gene set. See the Busco user guide for further details.

A Docker image is available at https://hub.docker.com/r/ezlabgva/busco

--busco_mode

Select the Busco mode: genome (assess a genome assembly), protein (assess a gene set), or both. By default, both modes are run.

--busco_dataset

Select the Busco dataset; if not specified, the pipeline will choose the closest lineage according to the NCBI taxonomy classification.
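For example, to assess only the gene set against a specific lineage dataset, the two options could be combined as below; the mode keyword (protein) and the dataset name (mammalia_odb10) are illustrative and should be adapted to your species:

--busco_mode protein --busco_dataset mammalia_odb10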

--copyToFtp

Boolean option to copy the output to the Ensembl FTP site; default false.

--apply_busco_metakeys

Boolean option to load the Busco meta keys into the database.

--host

The host name for the databases

--port

The port number of the host

--user

The read/write username for the host.

--user_r

The read only username for the host.

--password

The database password.

nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS --bioperl <bioperl_lib> --enscode $ENSCODE --csvFile <csv_file_path> --outDir <output_dir_path> --host <mysql_host> --port <mysql_port> --user <user> --user_r <read_user> --password <mysql_password> --busco_mode <busco_mode> --run_busco_core true --apply_busco_metakeys true --run_ensembl_stats true --apply_ensembl_stats true --run_ensembl_beta_metakeys true --apply_ensembl_beta_metakeys true --team <team> -profile slurm

OMArk pipeline --run_omark

OMArk is a software tool for proteome (protein-coding gene repertoire) quality assessment. It provides measures of proteome completeness, characterizes all protein-coding genes in the light of existing homologs, and identifies contamination from other species. Further information is available in the official repository: https://github.com/DessimozLab/OMArk

--copyToFtp

Boolean option to copy the output to the Ensembl FTP site; default false.

--host

The host name for the databases

--port

The port number of the host

--user

The read/write username for the host.

--user_r

The read only username for the host.

--password

The database password.

nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS --bioperl <bioperl_lib> --enscode $ENSCODE --csvFile <csv_file_path> --outDir <output_dir_path> --host <mysql_host> --port <mysql_port> --user <user> --user_r <read_user>  --password <mysql_password> --run_omark true -profile slurm

Ensembl statistics and Beta Metakeys pipeline --run_ensembl_stats, --run_ensembl_beta_metakeys

The pipeline calculates core statistics for the Ensembl browser.

--run_ensembl_stats

Boolean option to run the Ensembl statistics on a MySQL database; default false.

--apply_ensembl_stats

Boolean option to load the Ensembl statistics into a MySQL database; default false.

--run_ensembl_beta_metakeys

Boolean option to run the Ensembl beta metakeys on a MySQL database; default false.

--apply_ensembl_beta_metakeys

Boolean option to load the Ensembl beta metakeys into a MySQL database; default false.

--host

The host name for the databases

--port

The port number of the host

--user

The read/write username for the host.

--user_r

The read only username for the host.

--password

The database password.

--team

Required by the Ensembl metakey script when --run_ensembl_beta_metakeys is enabled.

nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS --bioperl <bioperl_lib> --enscode $ENSCODE --csvFile <csv_file_path> --outDir <output_dir_path> --host <mysql_host> --port <mysql_port> --user <user> --user_r <read_user>  --password <mysql_password>  --run_ensembl_stats true --apply_ensembl_stats true  --run_ensembl_beta_metakeys true --apply_ensembl_beta_metakeys true --team <team> -profile slurm

Busco NCBI genome pipeline --run_busco_ncbi

Option to check the quality of a genome assembly by running Busco in genome mode; only the assembly accession and taxon ID are required.

nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS --bioperl <bioperl_lib> --enscode $ENSCODE --csvFile <csv_file_path> --outDir <output_dir_path>  --run_busco_ncbi true -profile slurm

Requirements

Mandatory arguments

--csvFile

The structure of the file changes according to the running options.

Running mode        csv file format
--run_busco_core    core (header)
                    <db_name>
--run_omark         core (header)
                    <db_name>
--run_busco_ncbi    gca,taxon_id (header)
                    <gca>,<taxon_id>

For example, to run Busco on a list of core databases, the file should contain the core header followed by one database name per line:

core
db1
db2
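Similarly, a csv file for the --run_busco_ncbi mode would hold the gca,taxon_id header followed by one assembly accession and taxon ID per line; the values below are only illustrative:

gca,taxon_id
GCA_000001405.29,9606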

--enscode

Path to the root directory containing the Perl repositories (ensembl-analysis)

--outDir

Path to the directory where to store the results of the pipeline

Optional arguments

--bioperl

Path to the directory containing the BioPerl 1.6.924 library. If not provided, the value passed to --enscode will be used as root, i.e. <enscode>/bioperl-1.6.924.

--cacheDir

Path to the directory to use as cache for the intermediate files. If not provided, the value passed to --outDir will be used as root, i.e. <outDir>/cache.

--files_latency

Sleep time (in seconds) after the genome and proteins have been fetched. Needed by several file systems due to their internal latency. By default, 60 seconds.

Pipeline configuration

Using the provided nextflow.config

We are using profiles to be able to run the pipeline on different HPC clusters. The default is standard.

  • standard: uses LSF to run the compute-heavy jobs. It expects scratch to be set to a low-latency filesystem.
  • slurm: uses SLURM to run the compute-heavy jobs. It expects scratch to be set to a low-latency filesystem.

Using a local configuration file

You can use a local config with -c to fine-tune your pipeline. All parameters can be configured; we recommend setting these ones as well (see the sketch after this list):

  • process.scratch: The path to the scratch directory to use
  • workDir: The directory where Nextflow stores its intermediate files
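A minimal sketch of such a local configuration file, with illustrative paths, could look like the following and be passed to Nextflow with -c local.config:

// local.config
process.scratch = '/path/to/scratch'
workDir = '/path/to/nextflow_work'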

Information about all the parameters

nextflow run ./ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf --help

Ensembl dependencies

These are the Ensembl repositories required by this pipeline:

Repository name     branch    URL
ensembl             default   https://github.com/Ensembl/ensembl.git
ensembl-analysis    main      https://github.com/Ensembl/ensembl-analysis.git
ensembl-io          default   https://github.com/Ensembl/ensembl-io.git
ensembl-genes       default   https://github.com/Ensembl/ensembl-genes.git

It is recommended that all the repositories are cloned into the same folder.
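For example, assuming $ENSCODE points to that shared folder (the path itself is up to you), the repositories listed above could be cloned as follows:

mkdir -p $ENSCODE && cd $ENSCODE
git clone https://github.com/Ensembl/ensembl.git
git clone -b main https://github.com/Ensembl/ensembl-analysis.git
git clone https://github.com/Ensembl/ensembl-io.git
git clone https://github.com/Ensembl/ensembl-genes.git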

Remember that, following the instructions in Ensembl's Perl API installation, you will also need BioPerl v1.6.924 available on your system. If you do not have it, you can install it by executing the following commands:

wget https://github.com/bioperl/bioperl-live/archive/release-1-6-924.zip
unzip release-1-6-924.zip
mv bioperl-live-release-1-6-924 bioperl-1.6.924

It is recommended to install it in the same folder as the Ensembl repositories.