Ensembl · ens-ftricomi · Apr 12, 2024
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,138 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
diff --git a/NOTICE b/NOTICE
@@ -0,0 +1,5 @@
+Ensembl
+Copyright 2020-2022 EMBL-European Bioinformatics Institute
+
+This product includes software developed at:
+- EMBL-European Bioinformatics Institute
diff --git a/README.md b/README.md
@@ -1,81 +1,204 @@
-# Busco Nextflow pipeline
+# Genebuild statistics pipeline
 
-Busco is a measure of completeness of genome assembly and annotation of the gene set. See the documentation for further details [Busco userguide](https://busco.ezlab.org/busco_userguide.html)
+The pipeline computes Busco and OMArk completeness scores and calculates statistics for the Ensembl website when a core database is available.
+If only the assembly accession and the taxon ID are available, the pipeline provides the Busco score (mode=genome) for the assembly.
 
-## Requirements
+![plot](./plot.jpeg)
 
-### Busco
-We are using the Docker image available in https://hub.docker.com/r/ezlabgva/busco
+Requires Nextflow version <= 22.10.1 (21.10.5.5658 is currently available on Slurm).
 
-### Perl EnsEMBL repositories you need to have
-We recommend that you clone all the repositories into one directory
-| Repository name | branch | URL|
-|-----------------|--------|----|
-| ensembl | default | https://github.com/Ensembl/ensembl.git |
-| ensembl-analysis | default | https://github.com/Ensembl/ensembl-analysis.git |
-| ensembl-io | default | https://github.com/Ensembl/ensembl-io.git |
+## Running options  
 
+The following options require a list of mandatory arguments (see `Mandatory arguments`).
 
-## Running the pipeline
+## Busco pipeline `--run_busco_core`
 
+Busco measures the completeness of a genome assembly and of the annotation of its gene set. See the [Busco user guide](https://busco.ezlab.org/busco_userguide.html) for further details.
 
-### Mandatory options
+The Docker image is available at https://hub.docker.com/r/ezlabgva/busco
 
-#### csvFile
-A file containing the list of databases you want to run Busco on. The databases need to have DNA.
+#### `--busco_mode`
+Select Busco mode, i.e. genome mode (assess a genome assembly), protein mode (assess a gene set) or both. By default, run both modes.
 
-#### host
-The host name for the databases
+#### `--busco_dataset`
+Select the Busco dataset; if not specified, the pipeline will choose the closest lineage according to the NCBI taxonomy classification.
 
-#### port
-The port number of the host
+#### `--copyToFtp`
+Boolean option to copy the output to the Ensembl FTP; default is false.
 
-#### user
-The read only username for the host. The password is expected to be empty.
+#### `--apply_busco_metakeys`
+Boolean option to load the Busco metakeys into the database.
 
-#### enscode
-The directory containing the Perl repositories
+#### `--host`
+The host name for the databases 
 
+#### `--port`
+The port number of the host 
 
-### Using the provided nextflow.config
-We are using profiles to be able to run the pipeline on different HPC. The default is 'standard'
+#### `--user`
+The read/write username for the host.
 
-#### standard
-Uses LSF to run the compute heavy jobs. It expects the usage of `scratch` to use a low latency filesystem
+#### `--user_r`
+The read-only username for the host.
 
-#### cluster
-Uses SLURM to run the compute heavy jobs. It expects the usage of `scratch` to use a low latency filesystem
+#### `--password`
+The database password. 
 
+```bash
+nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS --bioperl <bioperl_lib> --enscode $ENSCODE --csvFile <csv_file_path> --outDir <output_dir_path> --host <mysql_host> --port <mysql_port> --user <user> --user_r <read_user> --password <mysql_password> --busco_mode <busco_mode> --run_busco_core true --apply_busco_metakeys true --run_ensembl_stats true --apply_ensembl_stats true --run_ensembl_beta_metakeys true --apply_ensembl_beta_metakeys true --team <team> -profile slurm
+```
 
-### Using a local config
-You can use a local config with `-c` to finely configure your pipeline. All parameters can be configured, we recommend setting the ones mentionned below.
+## OMArk pipeline `--run_omark`
+
+OMArk is a software tool for proteome (protein-coding gene repertoire) quality assessment. It provides measures of proteome completeness, characterizes all protein-coding genes in the light of existing homologs, and identifies the presence of contamination from other species.
+Further information is available in the official repository: https://github.com/DessimozLab/OMArk
 
-#### process.scratch
-The patch to the scratch directory to use
+#### `--copyToFtp`
+Boolean option to copy the output to the Ensembl FTP; default is false.
 
-#### workDir
-The directory where nextflow stores any file
+#### `--host`
+The host name for the databases 
 
-#### outDir
-The directory to use to store the results of the pipeline
+#### `--port`
+The port number of the host 
 
+#### `--user`
+The read/write username for the host.
 
-### Running the different Busco modes
-The default option is to run busco in both genome and protein mode
+#### `--user_r`
+The read-only username for the host.
 
-#### BUSCO in genome mode
+#### `--password`
+The database password. 
 
+```bash
+nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS --bioperl <bioperl_lib> --enscode $ENSCODE --csvFile <csv_file_path> --outDir <output_dir_path> --host <mysql_host> --port <mysql_port> --user <user> --user_r <read_user>  --password <mysql_password> --run_omark true -profile slurm
 ```
-/hps/software/users/ensembl/genebuild/bin/nextflow run ./ensembl-genes-nf/busco_pipeline.nf --enscode $ENSCODE --csvFile dbname.csv --genome_file genome.fa  --mode genome -w ../../work
-``` 
-#### BUSCO in protein mode
 
+## Ensembl statistics and Beta Metakeys pipeline `--run_ensembl_stats`, `--run_ensembl_beta_metakeys`
+
+The pipeline calculates core statistics for the Ensembl browser.
+
+### `--run_ensembl_stats`
+Boolean option to run the Ensembl statistics on a MySQL database; default is false.
+
+#### `--apply_ensembl_stats`
+Boolean option to load the Ensembl statistics into a MySQL database; default is false.
+
+### `--run_ensembl_beta_metakeys`
+Boolean option to run the Ensembl beta metakeys on a MySQL database; default is false.
+
+#### `--apply_ensembl_beta_metakeys`
+Boolean option to load the Ensembl beta metakeys into a MySQL database; default is false.
+
+#### `--host`
+The host name for the databases 
+
+#### `--port`
+The port number of the host 
+
+#### `--user`
+The read/write username for the host.
+
+#### `--user_r`
+The read-only username for the host.
+
+#### `--password`
+The database password. 
+
+#### `--team`
+Required by the Ensembl metakey script when `--run_ensembl_beta_metakeys` is enabled.
+
+```bash
+nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS --bioperl <bioperl_lib> --enscode $ENSCODE --csvFile <csv_file_path> --outDir <output_dir_path> --host <mysql_host> --port <mysql_port> --user <user> --user_r <read_user>  --password <mysql_password>  --run_ensembl_stats true --apply_ensembl_stats true  --run_ensembl_beta_metakeys true --apply_ensembl_beta_metakeys true --team <team> -profile slurm
 ```
-/hps/software/users/ensembl/genebuild/bin/nextflow run ./ensembl-genes-nf/busco_pipeline.nf -profile slurm --enscode $ENSCODE --csvFile dbname.csv --mode protein -w ../../work
+
+## Busco NCBI genome pipeline `--run_busco_ncbi`
+
+Option available to check the quality of the genome by running Busco in genome mode.
+
+```bash
+nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS --bioperl <bioperl_lib> --enscode $ENSCODE --csvFile <csv_file_path> --outDir <output_dir_path>  --run_busco_ncbi true -profile slurm
 ```
 
+
+## Requirements
+
+### Mandatory arguments
+
+#### `--csvFile`
+The structure of the file changes according to the running option:
+| Running mode | CSV file format |
+|-----------------|--------|
+| `--run_busco_core` | `core` (header) |
+|                  | `<db_name>` |
+| `--run_omark` | `core` (header) |
+|                  | `<db_name>` |
+| `--run_busco_ncbi` | `gca,taxon_id` (header) |
+|                  | `<gca>,<taxon_id>` |
+
+For example, to run Busco on a list of core databases, the file should be:
+|core |
+|db1  |
+|db2  |
+
+#### `--enscode`
+Path to the root directory containing the Perl repositories (ensembl-analysis)
+
+#### `--outDir`
+Path to the directory in which to store the results of the pipeline
+
+### Optional arguments
+
+#### `--bioperl`
+Path to the directory containing the BioPerl 1.6.924 library. If not provided, the value passed to `--enscode` will be used as root, i.e. `<enscode>/bioperl-1.6.924`.
+
+#### `--cacheDir`
+Path to the directory to use as cache for the intermediate files. If not provided, the value passed to `--outDir` will be used as root, i.e. `<outDir>/cache`.
+
+#### `--files_latency`
+Sleep time (in seconds) after the genome and proteins have been fetched. Needed by several file systems due to their internal latency. By default, 60 seconds.
+
+### Pipeline configuration
+
+#### Using the provided nextflow.config
+We are using profiles to be able to run the pipeline on different HPC clusters. The default is `standard`.
+
+* `standard`: uses LSF to run the compute heavy jobs. It expects the usage of `scratch` to use a low latency filesystem.
+* `slurm`: uses SLURM to run the compute heavy jobs. It expects the usage of `scratch` to use a low latency filesystem.
+
+
+#### Using a local configuration file
+You can use a local config with `-c` to finely configure your pipeline. All parameters can be configured; we recommend setting these as well:
+
+* `process.scratch`: The path to the scratch directory to use
+* `workDir`: The directory where Nextflow stores its working files
+
 ### Information about all the parameters
 
+```bash
+nextflow run ./ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf --help
 ```
-/hps/software/users/ensembl/genebuild/bin/nextflow run ./ensembl-genes-nf/busco_pipeline.nf --help
+
+
+#### Ensembl dependencies
+These are the Ensembl repositories required by this pipeline:
+
+| Repository name | branch | URL|
+|-----------------|--------|----|
+| ensembl | default | https://github.com/Ensembl/ensembl.git |
+| ensembl-analysis | main | https://github.com/Ensembl/ensembl-analysis.git |
+| ensembl-io | default | https://github.com/Ensembl/ensembl-io.git |
+| ensembl-genes | default | https://github.com/Ensembl/ensembl-genes.git |
+
+It is recommended that all the repositories are cloned into the same folder.
+
+Remember that, following the instructions in [Ensembl's Perl API installation](http://www.ensembl.org/info/docs/api/api_installation.html), you will also need to have BioPerl v1.6.924 available on your system. If you do not, you can install it by executing the following commands:
+
+```bash
+wget https://github.com/bioperl/bioperl-live/archive/release-1-6-924.zip
+unzip release-1-6-924.zip
+mv bioperl-live-release-1-6-924 bioperl-1.6.924
 ```
+
+It is recommended to install it in the same folder as the Ensembl repositories.
diff --git a/annotation_stats/.python-version b/annotation_stats/.python-version