This application forms the Data Ingest portion of the data processing pipeline.
The application reads JSON files from an input S3 directory and writes the processed JSON objects as JSON files to an output S3 directory.
Declarations of the input and output object types can be found in `src.navigator_data_ingest.base.types`:
- Input Object: Document
- Output Object: DocumentParserInput
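For orientation, a minimal sketch of the shape these declarations might take is shown below. Every field in it is an assumption made for illustration; the authoritative definitions are the ones in `src.navigator_data_ingest.base.types`.

```python
# Illustrative only: the real declarations live in
# src.navigator_data_ingest.base.types and may differ in both
# fields and base classes (pydantic is an assumption here).
from typing import Optional

from pydantic import BaseModel


class Document(BaseModel):
    """Input object read from the input S3 directory (fields assumed)."""

    import_id: str
    source_url: Optional[str] = None


class DocumentParserInput(BaseModel):
    """Output object written to the output S3 directory (fields assumed)."""

    document_id: str
    content_type: Optional[str] = None
    document_cdn_object: Optional[str] = None
```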
The main job of the ingest stage is to download each source document, identify its content type, and, if the document is a PDF, upload it to the relevant S3 document store.
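As a rough sketch of that flow (the function below is hypothetical; the real implementation lives inside the `navigator_data_ingest` package):

```python
# Hypothetical sketch of the per-document ingest flow; not the
# actual implementation in navigator_data_ingest.
import boto3
import requests

s3 = boto3.client("s3")


def ingest_document(source_url: str, document_bucket: str, key: str) -> str:
    """Download a source document and, if it is a PDF, cache it in S3.

    Returns the detected content type either way.
    """
    response = requests.get(source_url, timeout=30)
    response.raise_for_status()

    # Identify the content type from the response headers.
    content_type = response.headers.get("Content-Type", "").split(";")[0].strip()

    # Only PDFs are uploaded to the S3 document store.
    if content_type == "application/pdf":
        s3.put_object(
            Bucket=document_bucket,
            Key=key,
            Body=response.content,
            ContentType=content_type,
        )
    return content_type
```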
The application is a command-line tool and can be run with the following command:
python -m navigator_data_ingest
The following options exist for configuring the tool at runtime:
- `--pipeline-bucket`: S3 bucket name from which to read input files and write output files
- `--document-bucket`: S3 bucket name in which to store cached documents
- `--input-file`: location of the JSON Document array input file
- `--output-prefix`: prefix to apply to output files
- `--worker-count`: number of workers downloading/uploading cached documents
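Putting the options together, an invocation might look like the following (all bucket names and paths here are illustrative):

```bash
python -m navigator_data_ingest \
  --pipeline-bucket example-pipeline-bucket \
  --document-bucket example-document-bucket \
  --input-file input/documents.json \
  --output-prefix ingest-output \
  --worker-count 4
```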
The unit tests use a mock S3 client to simulate the S3 interactions; each of the update functions is then tested for functionality.
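As a sketch, such a test might use the moto library to stand in for S3 (moto is an assumption here; the repository's actual fixtures may mock S3 differently, and all bucket and key names below are illustrative):

```python
# Sketch of an S3-mocked unit test using moto (>= 5.x); the
# repository's actual test fixtures may differ.
import json

import boto3
from moto import mock_aws


@mock_aws
def test_reads_documents_from_input_file():
    # Inside @mock_aws, boto3 talks to an in-memory fake of S3.
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="test-pipeline-bucket")

    # Seed the mocked bucket with a JSON Document array input file.
    s3.put_object(
        Bucket="test-pipeline-bucket",
        Key="input/documents.json",
        Body=json.dumps([{"import_id": "test.doc.1"}]),
    )

    # Code under test that reads from S3 would see only this mocked state.
    body = s3.get_object(
        Bucket="test-pipeline-bucket", Key="input/documents.json"
    )["Body"].read()
    assert json.loads(body)[0]["import_id"] == "test.doc.1"
```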
The test procedure follows the deploy-test-destroy pattern: a test environment is created, the tests are run against it, and the test environment is then destroyed.
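One common way to express that cycle is a pytest fixture whose setup deploys the environment and whose teardown destroys it. The sketch below is hypothetical, with stub deploy/destroy helpers standing in for whatever the repository actually provisions:

```python
# Hypothetical illustration of the deploy-test-destroy pattern as a
# pytest fixture; the repository's real harness may differ.
import pytest


def deploy_test_environment() -> dict:
    """Stand up the resources the tests need (illustrative stub)."""
    return {"pipeline_bucket": "test-pipeline-bucket"}


def destroy_test_environment(env: dict) -> None:
    """Tear down everything deploy_test_environment created (illustrative stub)."""


@pytest.fixture
def test_environment():
    env = deploy_test_environment()  # deploy
    yield env                        # tests run against the environment here
    destroy_test_environment(env)    # destroy, even if the tests failed


def test_environment_is_available(test_environment):
    assert "pipeline_bucket" in test_environment
```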