Data pipeline for loading eJP XML files from an S3 bucket into BigQuery


elifesciences/data-hub-ejp-xml-pipeline


eJP XML Data Pipeline

This repository consists of a generic data pipeline that is used to ETL eJP XML dumps stored in S3 buckets. Being generic, it needs to be configured with its data source, data sink, and transformations. The sample configuration for this data pipeline can be found in the sample_data_config directory of this project.
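As a rough illustration of what such a source/sink configuration covers (all field names below are assumptions, not the actual schema — consult sample_data_config for the real format), a pipeline of this kind typically names an S3 source and a BigQuery sink:

```yaml
# Hypothetical sketch only; field names are illustrative.
# See sample_data_config for the actual configuration schema.
ejpXmlDataPipeline:
  source:
    s3:
      bucketName: my-ejp-dump-bucket      # assumption: example bucket
      objectKeyPattern: "ejp/*.zip"       # assumption: example key pattern
  sink:
    bigQuery:
      projectName: my-gcp-project         # assumption: example project
      datasetName: ejp                    # assumption: example dataset
```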

Running The Pipeline Locally

This repo is designed to run as a containerized application in the development environment. To run it locally, review the docker-compose.dev.override.yml and docker-compose.yaml files, and ensure that the credentials files required by the different data pipelines are correctly provided. The credentials you may need to provide are:

  • A GCP service account JSON key (mandatory for all data pipelines)
  • AWS credentials
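For example, the credentials files might be mounted into the container via docker-compose.dev.override.yml along the following lines (the service name and paths here are assumptions for illustration; check the actual compose files in this repo):

```yaml
# Hypothetical volume mounts for credentials; the real service name
# and target paths are defined in docker-compose.dev.override.yml.
services:
  ejp-xml-pipeline:                        # assumption: service name
    volumes:
      # GCP service account JSON key (mandatory)
      - ~/.config/gcloud/credentials.json:/tmp/credentials.json
      # AWS credentials for S3 access
      - ~/.aws/credentials:/root/.aws/credentials
    environment:
      GOOGLE_APPLICATION_CREDENTIALS: /tmp/credentials.json
```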

Using Docker

To run the full test suite, including the end-to-end tests:

make build-dev end2end-test

To run the tests excluding the end-to-end tests:

make build-dev test-exclude-e2e

To run the pipeline (which uses a state file):

make data-hub-pipelines-run-ejp-xml-pipeline

Using a Virtual Environment

To set up the development environment:

# initial setup
make dev-venv
# update dependencies
make dev-install

To run the tests:

make dev-test

To run the pipeline (which uses a state file):

make dev-run-ejp-xml-pipeline

To clear the state:

make dev-clear-state

Project Folder/Package Organisation

  • ejp_xml_pipeline contains the packages, libraries, and functions needed to run the pipeline.
  • tests contains the tests for this implementation. These include:
    • unit tests
    • end-to-end tests
  • sample_data_config contains the sample configurations for the data pipeline.

CI/CD

This runs on Jenkins and follows the standard approaches used by the eLife Data Team for CI/CD. Note that, as part of CI/CD, another Jenkins pipeline is triggered whenever there is a commit to the develop branch; the latest commit reference on the develop branch is passed as a parameter to that triggered pipeline.
