Data pipeline for loading eJP XML files from an S3 bucket into BigQuery


elifesciences/data-hub-ejp-xml-pipeline


eJP XML Data Pipeline

This repository consists of a generic data pipeline that is used to ETL eJP XML dumps stored in S3 buckets. Being generic, it needs to be configured with its data source, data sink, and transformations. The sample configuration for this data pipeline can be found in the sample_data_config directory of this project.
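As a rough illustration of what such a source/sink configuration covers (all field names below are assumptions, not the actual schema — consult sample_data_config for the real format), a pipeline of this kind typically names an S3 source and a BigQuery sink:

```yaml
# Hypothetical sketch only; field names are illustrative.
# See sample_data_config for the actual configuration schema.
ejpXmlDataPipeline:
  source:
    s3:
      bucketName: my-ejp-dump-bucket      # assumption: example bucket
      objectKeyPattern: "ejp/*.zip"       # assumption: example key pattern
  sink:
    bigQuery:
      projectName: my-gcp-project         # assumption: example project
      datasetName: ejp                    # assumption: example dataset
```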

Running The Pipeline Locally

This repo is designed to run as a containerized application in the development environment. To run it locally, review the docker-compose.dev.override.yml and docker-compose.yaml files, and ensure that the credentials files required by the different data pipelines are correctly provided. The credentials you may need to provide are:

  • A GCP service account JSON key (mandatory for all data pipelines)
  • AWS credentials
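For example, the credentials files might be mounted into the container via docker-compose.dev.override.yml along the following lines (the service name and paths here are assumptions for illustration; check the actual compose files in this repo):

```yaml
# Hypothetical volume mounts for credentials; the real service name
# and target paths are defined in docker-compose.dev.override.yml.
services:
  ejp-xml-pipeline:                        # assumption: service name
    volumes:
      # GCP service account JSON key (mandatory)
      - ~/.config/gcloud/credentials.json:/tmp/credentials.json
      # AWS credentials for S3 access
      - ~/.aws/credentials:/root/.aws/credentials
    environment:
      GOOGLE_APPLICATION_CREDENTIALS: /tmp/credentials.json
```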

Using Docker

To run the full test suite, including the end-to-end tests:

make build-dev end2end-test

To run the tests excluding the end-to-end tests:

make build-dev test-exclude-e2e

To run the pipeline (which uses a state file):

make data-hub-pipelines-run-ejp-xml-pipeline

Using a Virtual Environment

To set up the development environment:

# initial setup
make dev-venv
# update dependencies
make dev-install

To run the tests:

make dev-test

To run the pipeline (which uses a state file):

make dev-run-ejp-xml-pipeline

To clear the state:

make dev-clear-state

Project Folder/Package Organisation

  • ejp_xml_pipeline contains the packages, libraries, and functions needed to run the pipeline.
  • tests contains the tests for this implementation. These include:
    • unit tests
    • end-to-end tests
  • sample_data_config contains the sample configurations for the data pipeline.

CI/CD

This runs on Jenkins and follows the standard approaches used by the eLife Data Team for CI/CD. Note that, as part of CI/CD, another Jenkins pipeline is triggered whenever there is a commit to the develop branch; the latest commit reference on the develop branch is passed as a parameter to that triggered pipeline.
