This repository contains a generic data pipeline used to ETL eJP XML dumps stored in S3 buckets.
Being generic, it needs to be configured with a data source, a data sink, and the transformations to apply.
A sample configuration for this data pipeline can be found in the sample_data_config
directory of this project.
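As a purely illustrative sketch, a configuration of this kind typically names the S3 location to read from, the destination to write to, and where the pipeline keeps its state. Every key and value below is a hypothetical placeholder, not the real schema, which is defined by the files in sample_data_config:

```yaml
# Hypothetical example only -- the real schema is in sample_data_config.
# All keys and values here are illustrative placeholders.
source:
  s3Bucket: ejp-xml-dumps            # assumed: bucket holding the eJP XML dumps
  objectKeyPrefix: ejp/production/   # assumed: key prefix under which dumps live
sink:
  projectName: my-gcp-project        # assumed: GCP project to load the data into
  datasetName: ejp_xml               # assumed: dataset for the transformed data
stateFile:
  bucketName: ejp-xml-dumps          # assumed: where the pipeline state is kept
  objectName: state/ejp-xml-pipeline-state.json
```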
This repository is designed to run as a containerized application in the development environment.
To run it locally, review the docker-compose.dev.override.yml
and docker-compose.yaml
files, and ensure that the credentials files required by the different data pipelines are correctly provided.
The following credentials may need to be provided (see the sketch after this list):
- GCP service account JSON key (mandatory for all data pipelines)
- AWS credentials
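As a minimal sketch of how such credentials might be wired in, the fragment below mounts a service account key into the container and points the standard GOOGLE_APPLICATION_CREDENTIALS variable at it. The service name and file paths are assumptions for illustration only; the actual values are defined by this repository's compose files:

```yaml
# Hypothetical docker-compose.dev.override.yml fragment.
# The service name and mount paths below are assumptions; check the
# compose files in this repository for the actual values.
services:
  ejp-xml-pipeline:  # assumed service name
    volumes:
      # Mount the GCP service account JSON key (mandatory)
      - ./credentials/gcp-service-account.json:/tmp/gcp-credentials.json:ro
      # Mount local AWS credentials for S3 access
      - ~/.aws/credentials:/root/.aws/credentials:ro
    environment:
      # Standard variable GCP client libraries use to locate the key
      - GOOGLE_APPLICATION_CREDENTIALS=/tmp/gcp-credentials.json
```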
To run the full test suite for the application, including the end-to-end tests:
make build-dev end2end-test
To run the tests excluding the end-to-end tests:
make build-dev test-exclude-e2e
To run the pipeline (which uses a state file):
make data-hub-pipelines-run-ejp-xml-pipeline
To set up the development environment:
# initial setup
make dev-venv
# update dependencies
make dev-install
To run the tests:
make dev-test
To run the pipeline (which uses a state file):
make dev-run-ejp-xml-pipeline
To clear the state (e.g. to make the next run start from the beginning):
make dev-clear-state
- ejp_xml_pipeline: package containing the modules, libraries, and functions needed to run the pipeline
- tests: the tests run on this implementation, which include two types:
  - unit tests
  - end-to-end tests
- sample_data_config: sample configurations for the data pipeline
CI/CD runs on Jenkins and follows the standard approaches used by the eLife Data Team.
Note that, as part of CI/CD, another Jenkins pipeline is triggered whenever there is a commit to the develop branch. The reference of the latest commit on the develop
branch is passed as a parameter to the triggered Jenkins pipeline.