
Feat/1492 extend timestamp config #1669

Merged: 42 commits merged into devel from feat/1492-extend-timestamp-config on Aug 30, 2024
Conversation

@donotpush (Collaborator) commented Aug 7, 2024

Description

dlt does not support timestamps without timezones, and to pass such timestamps from source to destination we need to extend the core library.

The scope of this PR is to implement a proof of concept (POC) for DuckDB and parquet. This includes updating the type mapper and adding tests where the timezone is passed via columns.
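
For illustration, here is a minimal sketch of what passing the timezone via columns could look like end to end — my assumed usage based on this PR's description (the `timezone` hint name mirrors the existing `precision` hint), not code taken from the PR:

```py
import dlt

# Assumed usage of the new column-level hint: declare a timestamp column that should NOT
# carry a timezone, next to the already supported precision hint. The destination's type
# mapper is then expected to emit a naive timestamp type (e.g. DuckDB TIMESTAMP instead
# of TIMESTAMP WITH TIME ZONE).
@dlt.resource(
    name="events",
    columns={"event_at": {"data_type": "timestamp", "timezone": False, "precision": 6}},
)
def events():
    yield {"id": 1, "event_at": "2024-07-30T10:00:00.123456"}  # naive value, no offset
    yield {"id": 2, "event_at": "2024-07-30T11:00:00.123456"}

pipeline = dlt.pipeline(pipeline_name="tz_poc", destination="duckdb", dataset_name="poc")
load_info = pipeline.run(events(), loader_file_format="parquet")
print(load_info)
```

With `timezone` left at its default (true), the column would keep mapping to the timezone-aware type as before.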

Additionally, a complementary notebook with various experiments is included in this PR.

https://github.com/donotpush/dlt-notebooks/blob/main/notebooks/duckdb-exploration.ipynb

@rudolfix requirements to solve #1492:

  1. We'd need a new hint in TColumnType that will say whether data_type contains a timezone or not (or define the time zone as a string, defaulting to "UTC"). See the precision and scale hints already there; the idea is the same.
  2. Each of our destinations has a type mapper which needs to be able to interpret this new hint.
  3. pyarrow.py contains functions that convert a dlt schema into an arrow schema and vice versa; those functions need to be upgraded (see the sketch after this list).
  4. We'll need plenty of tests: loading non-tz-aware datetimes via json and parquet into all possible destinations.
  5. Let's make it work for duckdb+postgres+snowflake first.
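
Regarding point 3, a sketch of how the flag could round-trip through the dlt ↔ arrow schema conversion; the helper names here are hypothetical, only the pyarrow calls are real — in arrow, tz-awareness is carried by the `tz` attribute of the timestamp type:

```py
import pyarrow as pa

# Illustration only (hypothetical helpers, not the functions in dlt/common/libs/pyarrow.py):
# an arrow timestamp with tz=None is naive, while tz="UTC" (or any zone name) is tz-aware.
def column_to_arrow_field(name: str, timezone: bool = True, unit: str = "us") -> pa.Field:
    return pa.field(name, pa.timestamp(unit, tz="UTC" if timezone else None))

def arrow_field_to_column(field: pa.Field) -> dict:
    assert pa.types.is_timestamp(field.type)
    # reflect tz-awareness back into the dlt column schema as the timezone flag
    return {"name": field.name, "data_type": "timestamp", "timezone": field.type.tz is not None}

naive = column_to_arrow_field("event_at", timezone=False)
print(naive.type)                    # timestamp[us]
print(arrow_field_to_column(naive))  # {'name': 'event_at', 'data_type': 'timestamp', 'timezone': False}
```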

Related Issues

netlify bot commented Aug 7, 2024

Deploy Preview for dlt-hub-docs canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 6ef862a |
| 🔍 Latest deploy log | https://app.netlify.com/sites/dlt-hub-docs/deploys/66cf3ab0fab24a00086b8862 |

@rudolfix (Collaborator) left a comment:

LGTM! implementation idea and code structure are good. duckdb was a hard case because we had technical debt where we silently created columns without timezones if precision required it. now we can do it right

my take: try to implement postgres and snowflake next. if you need any credentials ping us but I think you have access to CI vault

thx for the tests

Review threads (resolved): dlt/common/libs/pyarrow.py (outdated), dlt/destinations/impl/duckdb/duck.py
@donotpush (Collaborator, Author) replied:
> LGTM! implementation idea and code structure are good. duckdb was a hard case because we had technical debt where we silently created columns without timezones if precision required it. now we can do it right
>
> my take: try to implement postgres and snowflake next. if you need any credentials ping us but I think you have access to CI vault
>
> thx for the tests

@rudolfix I've implemented your review comments and added more tests.

I'll review and fix the failing CI/CD tests.

@donotpush (Collaborator, Author) commented Aug 20, 2024

Added dlt's postgresql implementation and tests.

Links to my notebooks:

| Notebook | Link |
| --- | --- |
| DuckDB Timestamps Timezone | 1492-duckdb-timestamps-timezone.ipynb |
| PostgreSQL Timestamps Timezone | 1492-postgres-timestamps-timezone.ipynb |

Working on Snowflake.

@rudolfix (Collaborator) left a comment:

this is really good! I think we are really close to closing it.

  1. could we support BigQuery and Athena? those are popular destinations
  2. for other destinations, log a warning that we do not support ts without timezones (a sketch of such a warning follows below)

then we are good to go. I like your tests
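
For point 2, a rough sketch of the kind of warning a destination's type mapper could log when it cannot honour the hint — hypothetical function and destination names, not the code this PR ultimately added:

```py
import logging

logger = logging.getLogger("dlt")

# Hypothetical sketch: a destination that only has a tz-aware timestamp type keeps its
# default mapping and warns that the column-level timezone hint will be ignored.
def timestamp_type_for(column: dict, destination_name: str, supports_naive: bool) -> str:
    wants_timezone = column.get("timezone", True)
    if wants_timezone is False and not supports_naive:
        logger.warning(
            "Destination %s does not support timestamps without timezone; "
            "the 'timezone' hint on column %s will be ignored.",
            destination_name,
            column.get("name"),
        )
        return "TIMESTAMP WITH TIME ZONE"  # fall back to the default tz-aware type
    return "TIMESTAMP" if wants_timezone is False else "TIMESTAMP WITH TIME ZONE"

# e.g. a destination without naive-timestamp support just logs and keeps the default
print(timestamp_type_for({"name": "event_at", "timezone": False}, "somedb", supports_naive=False))
```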

@donotpush (Collaborator, Author) commented Aug 23, 2024

@rudolfix - Comment updated.

Warning log added.

BigQuery - I attempted to implement the timezone flag for BigQuery using both the TIMESTAMP and DATETIME data types, but I encountered issues. BigQuery does not automatically handle timezones in incoming data for the DATETIME type, leading to errors. Transforming (or stripping) JSON or other data types before insertion doesn't seem to be a viable solution.

I have documented my findings here:
BigQuery Exploration Notebook

Athena - Athena only supports a single TIMESTAMP type, which does not allow for timezone or precision configuration. AWS documentation can be confusing; they list DDL and DML columns, with "Not Available" indicating a lack of support for other TIMESTAMP types. I ran a few tests on my AWS account with the following data:

File uploaded to s3://test-athena-julian/test/1.txt:

1 2024-07-30T10:00:00.12
2 2024-07-30T11:00:00.123456
3 2024-07-30T12:00:00.1234
4 2024-07-30T10:00:00.1234567
5 2024-07-30T11:00:00.123456789

I used the following DDL statement to create the table:

CREATE EXTERNAL TABLE IF NOT EXISTS EVENTS(
  id INT,
  t TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 's3://test-athena-julian/test';

Amazon Athena Data Types Documentation

@donotpush force-pushed the feat/1492-extend-timestamp-config branch from 58a2150 to 3a4613c on August 23, 2024 at 21:32
@donotpush requested a review from rudolfix on August 26, 2024 at 14:41
@rudolfix (Collaborator) left a comment:

LGTM! I have a suggestion to move the docs to different sections. the code is good. thanks for investigating and cleaning up the timestamp mess :)


`duckdb` can create unique indexes for columns with `unique` hints. However, **this feature is disabled by default** as it can significantly slow down data loading.

## Supported column flags
@rudolfix (Collaborator) commented on the docs section quoted above:

those are still called hints. I'd move this section under

Data loading

and replace this header with

Data types

(it is used in several other destinations)

please apply this to other docs

@donotpush (Collaborator, Author) replied:

Done!

Question: What about this section?

## Supported column hints

`duckdb` can create unique indexes for columns with `unique` hints. However, **this feature is disabled by default** as it can significantly slow down data loading.

It was there before this PR. Should we keep it as it is, or move it under Data Loading as well?

@donotpush requested a review from rudolfix on August 28, 2024 at 13:41
@rudolfix (Collaborator) left a comment:
LGTM!

@rudolfix merged commit fbf0ef4 into devel on Aug 30, 2024
56 checks passed
@rudolfix deleted the feat/1492-extend-timestamp-config branch on August 30, 2024 at 09:41
willi-mueller pushed a commit that referenced this pull request Sep 2, 2024
* feat: add timezone flag to configure timestamp data

* fix: delete timezone init

* test: add duckdb timestamps with timezone

* test: fix resource hints for timestamp

* test: correct duckdb timestamps

* test: timezone tests for parquet files

* exp: add notebook with timestamp exploration

* test: refactor timestamp tests

* test: simplified tests and extended experiments

* exp: timestamp exp for duckdb and parquet

* fix: add pyarrow reflection for timezone flag

* fix lint errors

* fix: CI/CD move tests pyarrow module

* fix: pyarrow timezone defaults true

* refactor: typemapper signatures

* fix: duckdb timestamp config

* docs: updated duckdb.md timestamps

* fix: revert duckdb timestamp defaults

* fix: restore duckdb timestamp default

* fix: duckdb timestamp mapper

* fix: delete notebook

* docs: added timestamp and timezone section

* refactor: duckdb precision exception message

* feat: postgres timestamp timezone config

* fix: postgres timestamp precision

* fix: postgres timezone false case

* feat: add snowflake timezone and precision flag

* test: postgres invalid timestamp precision

* test: unified timestamp invalid precision

* test: unified column flag timezone

* chore: add warn log for unsupported timezone or precision flag

* docs: timezone and precision flags for timestamps

* fix: none case error

* docs: add duckdb default precision

* fix: typing errors

* rebase: formatted files from upstream devel

* fix: warning message and reference TODO

* test: delete duplicated input_data array

* docs: moved timestamp config to data types section

* fix: lint and format

* fix: lint local errors
@rudolfix mentioned this pull request on Sep 3, 2024
Successfully merging this pull request may close these issues.

Option to keep default date columns as TIMESTAMP_NTZ in oracle to snowflake migration