The pipeline provides Busco, Omark completeness scores, calculates statistics for Ensembl website when the core database is available. If only the assembly accession and the taxon id are available the pipeline provide Busco score (mode=genome) for the assembly.
Nextflow version nextflow <= 22.10.1. (21.10.5.5658 currently available on Slurm)
The following options require a list of mandatory arguments (see Mandatory arguments
).
Busco is a measure of completeness of genome assembly and annotation of the gene set. See the documentation for further details Busco userguide
Docker image available in https://hub.docker.com/r/ezlabgva/busco
Select Busco mode, i.e. genome mode (assess a genome assembly), protein mode (assess a gene set) or both. By default, run both modes.
Select Busco dataset; if not specified the pipeline will choose the closest lineage according to the ncbi taxonomy classification.
Boolean option to copy output in Ensembl ftp, default false
Boolean option to load Busco metakey into db
The host name for the databases
The port number of the host
The read/wrote username for the host.
The read only username for the host.
The database password.
nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS --bioperl <bioperl_lib> --enscode $ENSCODE --csvFile <csv_file_path> --outDir <output_dir_path> --host <mysql_host> --port <mysql_port> --user <user> --user_r <read_user> --password <mysql_password> --busco_mode <busco_mode> --run_busco_core true --apply_busco_metakeys true --run_ensembl_stats true --apply_ensembl_stats true --run_ensembl_beta_metakeys true --apply_ensembl_beta_metakeys true --team -profile slurm
OMArk is a software of proteome (protein-coding gene repertoire) quality assessment. It provides measure of proteome completeness, characterize all protein coding genes in the light of existing homologs, and identify the presence of contamination from other species. Further information available in the official repo https://github.com/DessimozLab/OMArk
Boolean option to copy output in Ensembl ftp, default false
The host name for the databases
The port number of the host
The read/wrote username for the host.
The read only username for the host.
The database password.
nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS --bioperl <bioperl_lib> --enscode $ENSCODE --csvFile <csv_file_path> --outDir <output_dir_path> --host <mysql_host> --port <mysql_port> --user <user> --user_r <read_user> --password <mysql_password> --run_omark true -profile slurm
The pipeline calculate core statistics for Ensembl browser.
Boolean option to run Ensembl statistics in a mysql db, default false
Boolean option to load Ensembl statistics in a mysql db, default false
Boolean option to run Ensembl beta metakeys in a mysql db, default false
Boolean option to load Ensembl beta metakeys in a mysql db, default false
The host name for the databases
The port number of the host
The read/wrote username for the host.
The read only username for the host.
The database password.
Required by Ensembl metakey script if run_ensembl_beta_metakeys is enabled.
nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS --bioperl <bioperl_lib> --enscode $ENSCODE --csvFile <csv_file_path> --outDir <output_dir_path> --host <mysql_host> --port <mysql_port> --user <user> --user_r <read_user> --password <mysql_password> --run_ensembl_stats true --apply_ensembl_stats true --run_ensembl_beta_metakeys true --apply_ensembl_beta_metakeys true --team <team> -profile slurm
Option available to check the quality of the genome by running Busco in genome mode.
nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS --bioperl <bioperl_lib> --enscode $ENSCODE --csvFile <csv_file_path> --outDir <output_dir_path> --run_busco_ncbi true -profile slurm
The structure of the file can cahnge according to the running options
Running mode | csv file format |
---|---|
--run_busco_core | core (header) |
<db_name> | |
--run_omark | core (header) |
<db_name> | |
--run_busco_ncbi | gca,taxon_id (header) |
,<taxon_id> |
For example tu run busco on a list of core dbs the file should be |core | |db1 | |db2 |
Path to the root directory containing the Perl repositories (ensembl-analysis)
Path to the directory where to store the results of the pipeline
Path to the directory containing the BioPerl 1.6.924 library. If not provided, the value passed to --enscode
will be used as root, i.e. <enscode>/bioperl-1.6.924
.
Path to the directory to use as cache for the intermediate files. If not provided, the value passed to --outDir
will be used as root, i.e. <outDir>/cache
.
Sleep time (in seconds) after the genome and proteins have been fetched. Needed by several file systems due to their internal latency. By default, 60 seconds.
We are using profiles to be able to run the pipeline on different HPC clusters. The default is standard
.
standard
: uses LSF to run the compute heavy jobs. It expects the usage ofscratch
to use a low latency filesystem.slurm
: uses SLURM to run the compute heavy jobs. It expects the usage ofscratch
to use a low latency filesystem.
You can use a local config with -c
to finely configure your pipeline. All parameters can be configured, we recommend setting these ones as well:
process.scratch
: The patch to the scratch directory to useworkDir
: The directory where nextflow stores any file
nextflow run ./ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf --help
These are the Ensembl repositories required by this pipeline:
Repository name | branch | URL |
---|---|---|
ensembl | default | https://github.com/Ensembl/ensembl.git |
ensembl-analysis | main | https://github.com/Ensembl/ensembl-analysis.git |
ensembl-io | default | https://github.com/Ensembl/ensembl-io.git |
ensembl-genes | default | https://github.com/Ensembl/ensembl-genes.git |
It is recommended that all the repositories are cloned into the same folder.
Remember that, following the instructions in Ensembl's Perl API installation, you will also need to have BioPerl v1.6.924 available in your system. If you do not, you can install it executing the following commands:
wget https://github.com/bioperl/bioperl-live/archive/release-1-6-924.zip
unzip release-1-6-924.zip
mv bioperl-live-release-1-6-924 bioperl-1.6.924
It is recommended to install it in the same folder as the Ensembl repositories.