s3-pq-to-pg

Concurrently downloads parquet files from s3, in batches, working from a todo list. Serially reads the specified fields from each parquet file into a postgres db table. All configuration is via a yaml file.

Got data in parquet files you need to get into a postgres db? Are the parquet files stored in S3?

If you say yes to both, we're probably working for the same company and you don't need to reinvent the wheel. Here you go.

RUN

# install the latest release binary, then:
s3-parquet-to-postgres /path/to/config.yml

# build from this repo if you have rustc 1.80.x or higher
cargo build -r
target/release/s3-parquet-to-postgres /path/to/config.yml
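
The yaml keys below are purely illustrative - a sketch of the settings the description above implies (s3 source, todo list, batch size, destination table and columns) - so check this repo's example config or source for the real schema:

# ILLUSTRATIVE ONLY - these key names are made up; check this
# repo's example config for the real schema.
cat > /path/to/config.yml <<'EOF'
s3:
  bucket: my-bucket
  region: eu-west-1
todo_list: /tmp/parquet-keys.txt   # one s3 key per line
batch_size: 8                      # parquet files downloaded concurrently
postgres:
  conn_str: host=localhost user=me dbname=stats
  table: my_table
  columns: [id, price, created_at_ms]
EOF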

CAVEAT - MAPPING PARQUET TYPES TO POSTGRES TYPES

The columns in your destination db table are expected to have data types compatible with the corresponding parquet file's column types as resolved by the parquet crate.

So if you have a parquet DOUBLE value (stored as rust f64), and you try to stick that in a postgres SMALLINT, expect a stack trace, even if the values ought to fit.
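
For instance, a sketch (table and column names are made up, and $DSN stands in for your connection string): a parquet DOUBLE column wants a DOUBLE PRECISION destination, not a SMALLINT one.

# parquet DOUBLE (rust f64) into DOUBLE PRECISION: fine
psql "$DSN" -c 'CREATE TABLE prices_ok (price DOUBLE PRECISION);'

# parquet DOUBLE into SMALLINT: expect the stack trace,
# even for values like 1.0 that would fit
psql "$DSN" -c 'CREATE TABLE prices_bad (price SMALLINT);'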

Check the parquet crate docs for the Field enum to see the internal rust types that correspond to the parquet types.

Check the rust postgres crate's docs (the ToSql and FromSql traits) for the mapping of rust to postgres data types.

unsupported postgres types

Only the types supported by Parquet are mapped by default to the equivalent postgres column types. Don't expect to store data directly into an ip address (inet) column.

This module doesn't do any fancy casting. Timestamp fields, for example, turn up as plain epoch integers at these resolutions:

Field::TimestampMillis    date +%s%3N
Field::TimestampMicros    $(( $(date +%s%3N) * 1000 ))
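
So, assuming the usual rust i64 to postgres BIGINT mapping (an assumption - check the crate docs above), a TimestampMillis field wants a BIGINT destination column, and any conversion to a real timestamp happens after the load:

# hypothetical destination column: epoch millis land in a BIGINT
psql "$DSN" -c 'CREATE TABLE events (id BIGINT, created_at_ms BIGINT);'

# cast to a real timestamp after loading, in sql
psql "$DSN" -c 'SELECT to_timestamp(created_at_ms / 1000.0) FROM events LIMIT 5;'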

PERFORMANCE

Honestly, the downloading from s3 will be the biggest time suck.

Optimising the rust code is left as an exercise for those with more patience. I clone Strings like a mo'fo' as it doesn't hurt my use-case. Really this'll only hurt if your parquet has many, many columns you wish to push to the db. If you're pulling a few dozen columns, this isn't going to matter.

In practical terms, consider disabling any indexes on the db table before running. This'll greatly speed up the COPY that this tool does under the hood. However, if your indexes are there to prevent duplicate keys etc, don't drop them unless you're sure about the incoming parquet data.
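
Postgres has no simple "disable index" switch, so one common approach is to drop and rebuild around the load (index and table names below are hypothetical):

# drop the index, load, then rebuild
psql "$DSN" -c 'DROP INDEX IF EXISTS my_table_price_idx;'

s3-parquet-to-postgres /path/to/config.yml

psql "$DSN" -c 'CREATE INDEX my_table_price_idx ON my_table (price);'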

LOCAL

# activate the pre-commit hooks in repo's root dir:
pre-commit install

# set logging for further development
export RUST_LOG=info,s3_parquet_to_postgres=debug,work_lists=debug,s3_download=debug

# or if you just want to run the binary
export RUST_LOG=info

## local/reset-local.sh

# build debug binary and run the tests
./local/reset-local.sh # requires docker / docker-compose

# build for release
./local/reset-local.sh -r
