Statistics pipe #10

Open
wants to merge 90 commits into
base: main
336370e
first commit
ens-ftricomi Apr 12, 2024
877060a
small fixes
ens-ftricomi Apr 12, 2024
04a95dc
fix in config
ens-ftricomi Apr 12, 2024
dbf48dd
moved config in main dir
ens-ftricomi Apr 12, 2024
3bd21b7
added subworkflow configs
ens-ftricomi Apr 12, 2024
f2d318f
parameters fix
ens-ftricomi Apr 12, 2024
e9f7feb
fix false upper case
ens-ftricomi Apr 12, 2024
ef5bc79
fixed parameters
ens-ftricomi Apr 12, 2024
7a3375b
error clean cache
ens-ftricomi Apr 12, 2024
866de80
bug fix busco workflow
ens-ftricomi Apr 12, 2024
268dc22
adjust path for config
ens-ftricomi Apr 12, 2024
f68323d
wrong module name
ens-ftricomi Apr 12, 2024
535bce5
install dependencies
ens-ftricomi Apr 12, 2024
803f4d9
wrong path
ens-ftricomi Apr 12, 2024
8c315b0
replaced take with input
ens-ftricomi Apr 12, 2024
b1d11e0
bug fix busco output
ens-ftricomi Apr 12, 2024
bcc7819
typo in varaiable declaration
ens-ftricomi Apr 12, 2024
3231ba7
fix variable declarations
ens-ftricomi Apr 12, 2024
5bf80d5
removed double quotes
ens-ftricomi Apr 12, 2024
3812837
fixed config parameters
ens-ftricomi Apr 27, 2024
f55f156
fixed channels, busco commands, internal functions to get information…
ens-ftricomi Apr 27, 2024
e467a58
omark subpipeline: fixed channels, internal functions to get informat…
ens-ftricomi Apr 27, 2024
ff3a189
fetch file: fixed channels, internal functions to get information fro…
ens-ftricomi Apr 27, 2024
cdc57f5
busco pipeline: redefined channels, defined parallelisation
ens-ftricomi Apr 27, 2024
69a346d
omark pipeline: redefined channels, defined parallelisation
ens-ftricomi Apr 27, 2024
cdf341d
renamed main
ens-ftricomi Apr 27, 2024
6d072f6
added bin folder for scripts
ens-ftricomi Apr 27, 2024
d28d4b2
added lib folder for mysql jar
ens-ftricomi Apr 27, 2024
dbad198
first commit ensembl statistics
ens-ftricomi Apr 27, 2024
eb69010
changed order of the channels
ens-ftricomi Apr 27, 2024
6aec0ff
tested copy on ftp
ens-ftricomi Apr 27, 2024
5acf65d
core statistics pipeline
ens-ftricomi Apr 28, 2024
fb5079d
fix wrong output dir
ens-ftricomi Apr 28, 2024
5204fdc
blank space
ens-ftricomi Apr 28, 2024
30fac83
fix omark file renaming
ens-ftricomi Apr 28, 2024
326559e
upload stats in ftp
ens-ftricomi Apr 28, 2024
7d08daf
cleaning
ens-ftricomi Apr 28, 2024
0edb7be
add maxForks for busco singularity
ens-ftricomi May 13, 2024
cf850a1
add maxForks to fetch data
ens-ftricomi May 13, 2024
20b2af5
added maxForks to the singularity, fixed omark output name
ens-ftricomi May 13, 2024
539c402
commented channel print
ens-ftricomi May 13, 2024
d582fbf
cleaning, added core_meta_updates repo, tentative to add maxForks dir…
ens-ftricomi May 13, 2024
56cf2f6
added maxFork to reduce interaction with dbs
ens-ftricomi May 13, 2024
ed6c2ac
fixed permissions for statistics dir in the ftp
ens-ftricomi May 13, 2024
5c7a9d6
added README
ens-ftricomi May 13, 2024
229ae5c
missing link for repo
ens-ftricomi May 13, 2024
b05cdc6
specified header
ens-ftricomi May 15, 2024
96645b8
added busco dataset param
ens-ftricomi May 16, 2024
a992df4
added dataset option and adjusted clade selector using taxonomy id an…
ens-ftricomi Jun 23, 2024
74cc47d
fixed busco dataset in a tuple
ens-ftricomi Jun 27, 2024
bbabbee
added plot
ens-ftricomi Aug 6, 2024
923a05f
replaced diagram image
ens-ftricomi Aug 6, 2024
9c5fe20
cleaning
ens-ftricomi Aug 14, 2024
aae5447
force copy in the ftp
ens-ftricomi Sep 20, 2024
d2bd020
added busco miniprot version
ens-ftricomi Sep 20, 2024
e09ab26
script to get busco scores, prepare json and patches for core
ens-ftricomi Sep 23, 2024
8069f1e
apply busco patches to the core
ens-ftricomi Sep 25, 2024
b3e7196
ignore metakeys already present
ens-ftricomi Sep 27, 2024
09775e1
module for loading busco score
ens-ftricomi Sep 27, 2024
c7171b3
added option for applying patches
ens-ftricomi Sep 27, 2024
6ce34d7
bug fixed python docker image
ens-ftricomi Sep 27, 2024
1ba1e40
tested busco patches
ens-ftricomi Sep 27, 2024
4c71103
removed print statements
ens-ftricomi Sep 27, 2024
13cb49e
added repo required in enscode
ens-ftricomi Oct 1, 2024
aa57d6e
added python script to run beta metakeys, added options in the pipeli…
ens-ftricomi Oct 3, 2024
4e52a16
bugfix wrong db name
ens-ftricomi Oct 3, 2024
3edf65a
cleaned unused files
ens-ftricomi Oct 3, 2024
74c2a51
cleaned unused files
ens-ftricomi Oct 3, 2024
608ff3f
updated help
ens-ftricomi Oct 3, 2024
846789e
cleaned unused modules
ens-ftricomi Oct 3, 2024
87471e3
update option documentation
ens-ftricomi Oct 3, 2024
8996274
updated plot
ens-ftricomi Oct 4, 2024
2640d71
updated nextflow version
ens-ftricomi Oct 9, 2024
1cdac6f
updated nextflow version
ens-ftricomi Oct 9, 2024
2ae8818
cleaning
ens-ftricomi Oct 10, 2024
e73ed02
added missing option to apply Beta patches and statistics on example …
ens-ftricomi Oct 16, 2024
f66c96e
removed unused script
ens-ftricomi Oct 16, 2024
606ca99
make the python script failing when team is not defined
ens-ftricomi Oct 24, 2024
105f0b6
adjust busco_mode in the busco example command
ens-ftricomi Oct 24, 2024
019443d
fixed primates taxon id
ens-ftricomi Nov 14, 2024
1126f9f
bug fix missing parameters
ens-ftricomi Nov 19, 2024
d6760b9
module to get seq from core
ens-ftricomi Nov 19, 2024
9934ad9
renamed list of lineages
ens-ftricomi Jan 20, 2025
51b051d
manage parameter initialisation
ens-ftricomi Jan 20, 2025
2036029
removed copy into the ftp
ens-ftricomi Jan 20, 2025
b9e4d11
cleaning
ens-ftricomi Jan 20, 2025
7b16797
updated groovy version to adapta to nextflow version 21 and above
ens-ftricomi Jan 20, 2025
3e7906c
removed unused plugin
ens-ftricomi Jan 20, 2025
141a666
force symlink overriding
ens-ftricomi Jan 27, 2025
f4b1cf5
bug fix input dbname variable
ens-ftricomi Jan 28, 2025
138 changes: 138 additions & 0 deletions .gitignore
@@ -0,0 +1,138 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/
5 changes: 5 additions & 0 deletions NOTICE
@@ -0,0 +1,5 @@
Ensembl
Copyright 2020-2022 EMBL-European Bioinformatics Institute

This product includes software developed at:
- EMBL-European Bioinformatics Institute
215 changes: 169 additions & 46 deletions README.md
@@ -1,81 +1,204 @@
# Busco Nextflow pipeline
# Genebuild statistics pipeline

Busco is a measure of completeness of genome assembly and annotation of the gene set. See the documentation for further details [Busco userguide](https://busco.ezlab.org/busco_userguide.html)
The pipeline computes Busco and OMArk completeness scores and calculates statistics for the Ensembl website when the core database is available.
If only the assembly accession and the taxon id are available, the pipeline provides the Busco score (mode=genome) for the assembly.

## Requirements
![plot](./plot.jpeg)

### Busco
We are using the Docker image available in https://hub.docker.com/r/ezlabgva/busco
Nextflow version: <= 22.10.1 (21.10.5.5658 is currently available on Slurm).
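
A minimal sketch for checking a version string against the limit stated above, assuming GNU `sort -V` is available. The helper name `version_le` and the hard-coded `installed` value are illustrative only; substitute the version reported by your own Nextflow installation.

```bash
# Succeeds when version $1 is lower than or equal to version $2.
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

installed="21.10.5.5658"   # assumption: replace with your installed Nextflow version
max_supported="22.10.1"    # upper limit taken from the README

if version_le "$installed" "$max_supported"; then
    echo "Nextflow $installed is within the supported range"
else
    echo "Nextflow $installed is newer than $max_supported" >&2
fi
```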

### Perl EnsEMBL repositories you need to have
We recommend that you clone all the repositories into one directory
| Repository name | branch | URL|
|-----------------|--------|----|
| ensembl | default | https://github.com/Ensembl/ensembl.git |
| ensembl-analysis | default | https://github.com/Ensembl/ensembl-analysis.git |
| ensembl-io | default | https://github.com/Ensembl/ensembl-io.git |
## Running options

The following options require a list of mandatory arguments (see `Mandatory arguments`).

## Running the pipeline
## Busco pipeline `--run_busco_core`

Busco is a measure of completeness of genome assembly and annotation of the gene set. See the documentation for further details [Busco userguide](https://busco.ezlab.org/busco_userguide.html)

### Mandatory options
Docker image available in https://hub.docker.com/r/ezlabgva/busco

#### csvFile
A file containing the list of databases you want to run Busco on. The databases need to have DNA.
#### `--busco_mode`
Select Busco mode, i.e. genome mode (assess a genome assembly), protein mode (assess a gene set) or both. By default, run both modes.

#### host
The host name for the databases
#### `--busco_dataset`
Select Busco dataset; if not specified the pipeline will choose the closest lineage according to the ncbi taxonomy classification.
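
The idea of picking the closest lineage can be sketched as a walk up the taxonomy from the species towards the root, stopping at the first taxon that has a Busco dataset. This is an illustration only, not the pipeline's actual implementation: the function names and the short taxon-to-dataset mapping below are assumptions chosen for the example.

```bash
# Map an NCBI taxon id to a Busco dataset name; prints nothing if no dataset is known.
dataset_for_taxon() {
    case "$1" in
        9443)  echo primates_odb10 ;;
        40674) echo mammalia_odb10 ;;
        7742)  echo vertebrata_odb10 ;;
        33208) echo metazoa_odb10 ;;
        2759)  echo eukaryota_odb10 ;;
    esac
}

# Arguments: taxon ids ordered from the species up to the root.
# Prints the dataset of the first (most specific) taxon that has one.
closest_dataset() {
    for taxon in "$@"; do
        dataset=$(dataset_for_taxon "$taxon")
        if [ -n "$dataset" ]; then
            echo "$dataset"
            return 0
        fi
    done
    return 1
}

# Human lineage: Homo sapiens, Homo, Hominidae, Primates, Mammalia, Vertebrata, Metazoa, Eukaryota
closest_dataset 9606 9605 9604 9443 40674 7742 33208 2759
```

Passing `--busco_dataset` explicitly skips any automatic selection of this kind.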

#### port
The port number of the host
#### `--copyToFtp`
Boolean option to copy output in Ensembl ftp, default false

#### user
The read only username for the host. The password is expected to be empty.
#### `--apply_busco_metakeys`
Boolean option to load Busco metakey into db

#### enscode
The directory containing the Perl repositories
#### `--host`
The host name for the databases

#### `--port`
The port number of the host

### Using the provided nextflow.config
We are using profiles to be able to run the pipeline on different HPC. The default is 'standard'
#### `--user`
The read/write username for the host.

#### standard
Uses LSF to run the compute heavy jobs. It expects the usage of `scratch` to use a low latency filesystem
#### `--user_r`
The read only username for the host.

#### cluster
Uses SLURM to run the compute heavy jobs. It expects the usage of `scratch` to use a low latency filesystem
#### `--password`
The database password.

```bash
nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS --bioperl <bioperl_lib> --enscode $ENSCODE --csvFile <csv_file_path> --outDir <output_dir_path> --host <mysql_host> --port <mysql_port> --user <user> --user_r <read_user> --password <mysql_password> --busco_mode <busco_mode> --run_busco_core true --apply_busco_metakeys true --run_ensembl_stats true --apply_ensembl_stats true --run_ensembl_beta_metakeys true --apply_ensembl_beta_metakeys true --team <team> -profile slurm
```

### Using a local config
You can use a local config with `-c` to finely configure your pipeline. All parameters can be configured; we recommend setting the ones mentioned below.
## OMArk pipeline `--run_omark`

OMArk is a software tool for proteome (protein-coding gene repertoire) quality assessment. It provides a measure of proteome completeness, characterizes all protein-coding genes in the light of existing homologs, and identifies the presence of contamination from other species.
Further information is available in the official repository: https://github.com/DessimozLab/OMArk

#### process.scratch
The path to the scratch directory to use
#### `--copyToFtp`
Boolean option to copy output in Ensembl ftp, default false

#### workDir
The directory where nextflow stores any file
#### `--host`
The host name for the databases

#### outDir
The directory to use to store the results of the pipeline
#### `--port`
The port number of the host

#### `--user`
The read/write username for the host.

### Running the different Busco modes
The default option is to run busco in both genome and protein mode
#### `--user_r`
The read only username for the host.

#### BUSCO in genome mode
#### `--password`
The database password.

```bash
nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS --bioperl <bioperl_lib> --enscode $ENSCODE --csvFile <csv_file_path> --outDir <output_dir_path> --host <mysql_host> --port <mysql_port> --user <user> --user_r <read_user> --password <mysql_password> --run_omark true -profile slurm
```
/hps/software/users/ensembl/genebuild/bin/nextflow run ./ensembl-genes-nf/busco_pipeline.nf --enscode $ENSCODE --csvFile dbname.csv --genome_file genome.fa --mode genome -w ../../work
```
#### BUSCO in protein mode

## Ensembl statistics and Beta Metakeys pipeline `--run_ensembl_stats, --run_ensembl_beta_metakeys`

The pipeline calculates core statistics for the Ensembl browser.

### `--run_ensembl_stats`
Boolean option to run Ensembl statistics in a mysql db, default false

#### `--apply_ensembl_stats`
Boolean option to load Ensembl statistics in a mysql db, default false

### `--run_ensembl_beta_metakeys`
Boolean option to run Ensembl beta metakeys in a mysql db, default false

#### `--apply_ensembl_beta_metakeys`
Boolean option to load Ensembl beta metakeys in a mysql db, default false

#### `--host`
The host name for the databases

#### `--port`
The port number of the host

#### `--user`
The read/write username for the host.

#### `--user_r`
The read only username for the host.

#### `--password`
The database password.

#### `--team`
Required by Ensembl metakey script if run_ensembl_beta_metakeys is enabled.

```bash
nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS --bioperl <bioperl_lib> --enscode $ENSCODE --csvFile <csv_file_path> --outDir <output_dir_path> --host <mysql_host> --port <mysql_port> --user <user> --user_r <read_user> --password <mysql_password> --run_ensembl_stats true --apply_ensembl_stats true --run_ensembl_beta_metakeys true --apply_ensembl_beta_metakeys true --team <team> -profile slurm
```
/hps/software/users/ensembl/genebuild/bin/nextflow run ./ensembl-genes-nf/busco_pipeline.nf -profile slurm --enscode $ENSCODE --csvFile dbname.csv --mode protein -w ../../work

## Busco NCBI genome pipeline `--run_busco_ncbi`

Option available to check the quality of the genome by running Busco in genome mode.

```bash
nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS --bioperl <bioperl_lib> --enscode $ENSCODE --csvFile <csv_file_path> --outDir <output_dir_path> --run_busco_ncbi true -profile slurm
```


## Requirements

### Mandatory arguments

#### `--csvFile`
The structure of the file changes according to the running options:
| Running mode | csv file format |
|-----------------|--------|
| `--run_busco_core` | `core` (header) |
| | `<db_name>` |
| `--run_omark` | `core` (header) |
| | `<db_name>` |
| `--run_busco_ncbi` | `gca,taxon_id` (header) |
| | `<gca>,<taxon_id>` |

For example, to run Busco on a list of core dbs, the file should be:
| core |
|------|
| db1 |
| db2 |
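
As a sketch, the two CSV layouts described above can be written from the shell as follows. The file names, the database names `db1` and `db2`, and the accession `GCA_000000000.1` are placeholders, not real inputs.

```bash
# Input for --run_busco_core or --run_omark: header "core", then one database name per line.
printf '%s\n' 'core' 'db1' 'db2' > core_dbs.csv

# Input for --run_busco_ncbi: header "gca,taxon_id", then one "<gca>,<taxon_id>" pair per line.
# GCA_000000000.1 is a placeholder accession; 9606 is the NCBI taxon id of Homo sapiens.
printf '%s\n' 'gca,taxon_id' 'GCA_000000000.1,9606' > ncbi_assemblies.csv

cat core_dbs.csv
```

Either file is then passed to the pipeline with `--csvFile`.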

#### `--enscode`
Path to the root directory containing the Perl repositories (ensembl-analysis)

#### `--outDir`
Path to the directory where the results of the pipeline are stored

### Optional arguments

#### `--bioperl`
Path to the directory containing the BioPerl 1.6.924 library. If not provided, the value passed to `--enscode` will be used as root, i.e. `<enscode>/bioperl-1.6.924`.

#### `--cacheDir`
Path to the directory to use as cache for the intermediate files. If not provided, the value passed to `--outDir` will be used as root, i.e. `<outDir>/cache`.

#### `--files_latency`
Sleep time (in seconds) after the genome and proteins have been fetched. Needed by several file systems due to their internal latency. By default, 60 seconds.

### Pipeline configuration

#### Using the provided nextflow.config
We are using profiles to be able to run the pipeline on different HPC clusters. The default is `standard`.

* `standard`: uses LSF to run the compute heavy jobs. It expects the usage of `scratch` to use a low latency filesystem.
* `slurm`: uses SLURM to run the compute heavy jobs. It expects the usage of `scratch` to use a low latency filesystem.


#### Using a local configuration file
You can use a local config with `-c` to finely configure your pipeline. All parameters can be configured; we recommend setting these as well:

* `process.scratch`: The path to the scratch directory to use
* `workDir`: The directory where Nextflow stores its working files
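
A minimal sketch of such a local configuration file, written from the shell. The file name `local.config` and both paths are placeholders to adapt to your own cluster.

```bash
# Write a small Nextflow config that overrides the two recommended settings.
cat > local.config <<'EOF'
// Placeholder paths: replace with directories available on your cluster
process.scratch = '/path/to/scratch'
workDir = '/path/to/nextflow_work'
EOF

cat local.config
```

The file can then be supplied to Nextflow with `-c local.config`.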

### Information about all the parameters

```bash
nextflow run ./ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf --help
```
/hps/software/users/ensembl/genebuild/bin/nextflow run ./ensembl-genes-nf/busco_pipeline.nf --help


#### Ensembl dependencies
These are the Ensembl repositories required by this pipeline:

| Repository name | branch | URL|
|-----------------|--------|----|
| ensembl | default | https://github.com/Ensembl/ensembl.git |
| ensembl-analysis | main | https://github.com/Ensembl/ensembl-analysis.git |
| ensembl-io | default | https://github.com/Ensembl/ensembl-io.git |
| ensembl-genes | default | https://github.com/Ensembl/ensembl-genes.git |

It is recommended that all the repositories are cloned into the same folder.
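
One possible way to fetch the repositories listed above into a single folder is sketched below. The function name `list_clone_commands` and the default `ENSCODE` location are assumptions for the example; the function only prints the `git clone` commands so they can be reviewed first.

```bash
# Assumed target folder; override by exporting ENSCODE before running.
ENSCODE="${ENSCODE:-$PWD/enscode}"

# Print one git clone command per repository in the table above.
list_clone_commands() {
    for repo in ensembl ensembl-analysis ensembl-io ensembl-genes; do
        url="https://github.com/Ensembl/${repo}.git"
        case "$repo" in
            # The table lists "main" for ensembl-analysis; the others use their default branch.
            ensembl-analysis) echo "git clone --branch main $url $ENSCODE/$repo" ;;
            *)                echo "git clone $url $ENSCODE/$repo" ;;
        esac
    done
}

list_clone_commands
```

After reviewing the printed commands, they can be executed with `list_clone_commands | sh`.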

Remember that, following the instructions in [Ensembl's Perl API installation](http://www.ensembl.org/info/docs/api/api_installation.html), you will also need to have BioPerl v1.6.924 available in your system. If you do not, you can install it by executing the following commands:

```bash
wget https://github.com/bioperl/bioperl-live/archive/release-1-6-924.zip
unzip release-1-6-924.zip
mv bioperl-live-release-1-6-924 bioperl-1.6.924
```

It is recommended to install it in the same folder as the Ensembl repositories.
1 change: 0 additions & 1 deletion annotation_stats/.python-version

This file was deleted.
