This is the main pipeline used internally for loading data into the RNAcentral database. The pipeline is Nextflow based and the main entry point is `main.nf`.
The pipeline is typically run as:
```
nextflow run -profile env -with-singularity pipeline.sif main.nf
```
The pipeline requires a local.config file to exist and contain some
information. Notably a PGDATABASE environment variable must be defined so
data can be imported or fetched. In addition, to import specific databases
there must be a params.import_data.databases dict defined. The keys must be
known database names and the values should be truthy to indicate the databases
should be imported.
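As an illustration, a minimal `local.config` might look like the sketch below. The connection details and the database names (`ensembl`, `mirbase`) are assumptions used only for this example, and whether `PGDATABASE` is set through Nextflow's `env` scope or simply exported in the shell before launching may depend on your setup.

```
// Sketch of a minimal local.config; the connection details and database
// names shown here are assumptions, not pipeline defaults.

// Assumption: PGDATABASE is exposed to tasks via Nextflow's env scope; it
// could instead be exported in the shell before running the pipeline.
env {
  PGDATABASE = "postgres://user:password@localhost:5432/rnacentral"
}

// Databases to import: keys are known database names, truthy values mean
// "import this database". The names below are illustrative.
params.import_data.databases = [
  ensembl: true,
  mirbase: false
]
```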
There are also some more advanced configuration options available, such as turning specific parts of the pipeline (genome mapping, qa, etc.) on or off.
The pipeline is meant to run in Docker or Singularity. You should build or fetch a suitable container. Some example commands are below.
- Build the container:

  ```
  docker build -t rnacentral-import-pipeline .
  ```

- Open an interactive shell inside a running container:

  ```
  docker run -v `pwd`:/rnacentral/rnacentral-import-pipeline -v /path/to/data:/rnacentral/data/ -it rnacentral-import-pipeline bash
  ```
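If you intend to run with Singularity, as in the run command above, one way to obtain `pipeline.sif` is to build it from the locally built Docker image. This is a sketch; the source image tag `rnacentral-import-pipeline:latest` is an assumption:

```
# Sketch: convert the locally built Docker image into a Singularity image.
# The source tag (rnacentral-import-pipeline:latest) is an assumption.
singularity build pipeline.sif docker-daemon://rnacentral-import-pipeline:latest
```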
Several tests require fetching some data files prior to testing. The files can be fetched with:
```
./scripts/fetch-test-data.sh
```

The tests can then be run using py.test. For example, the Ensembl importing tests can be run with:
```
py.test tests/databases/ensembl/
```

The pipeline requires the `NXF_OPTS` environment variable to be set to
`-Dnxf.pool.type=sync -Dnxf.pool.maxThreads=10000`; a module for setting this is
provided in modules/cluster. Some configuration settings for efficient usage on
EBI's LSF cluster are in config/cluster.config.
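If you are not using the module in modules/cluster, the same options can be set by exporting the variable in the shell before launching the pipeline, for example:

```
# Set the required Nextflow JVM options, then launch the pipeline as shown above.
export NXF_OPTS="-Dnxf.pool.type=sync -Dnxf.pool.maxThreads=10000"
nextflow run -profile env -with-singularity pipeline.sif main.nf
```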
See LICENSE for more information.