[Bug] When performing data freshness tests in a timezone other than UTC, a full table scan may be executed, resulting in longer test execution times.
#11066 · Open
Benjamin0313 opened this issue on Nov 28, 2024 · 0 comments
Is this a new bug in dbt-core?
I have searched the existing issues, and I could not find an existing issue for this bug.
Current Behavior
Currently, when running dbt source freshness tests in a timezone other than UTC, a query like the following is compiled:
```sql
select
    max(convert_timezone('UTC', created_at_local)) as max_loaded_at,
    convert_timezone('UTC', current_timestamp()) as snapshotted_at
from raw.jaffle_shop.orders
```
In this case, the execution order to obtain max_loaded_at is as follows:
• Convert the timezone.
• Retrieve the maximum value of created_at_local in the converted timezone.
At this stage, the timezone conversion is applied to every record, and the maximum date is then calculated from the converted timestamps. The table is therefore fully scanned before the aggregation, so when the table contains many records, the data freshness test takes a significant amount of time.
Expected Behavior
The ideal steps would be as follows:
• First, retrieve the latest date from the table.
• Then, perform the timezone conversion.
```sql
select
    convert_timezone('UTC', max(created_at_local)) as max_loaded_at,
    convert_timezone('UTC', current_timestamp()) as snapshotted_at
from raw.jaffle_shop.orders
```
Because a conversion to a fixed target timezone is order-preserving, both queries return the same max_loaded_at; the rewritten form simply lets the warehouse aggregate the raw column first and convert only the single resulting value.
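As a possible interim workaround (not an official fix), the freshness query could be customized by shadowing dbt's collect_freshness macro in the project's macros/ directory, since root-project macros override global-project macros of the same name. The sketch below assumes the collect_freshness(source, loaded_at_field, filter) signature used by dbt's default implementation, and assumes loaded_at_field is then configured as the bare column (e.g. TEST_DATE) rather than a convert_timezone(...) expression:

```sql
-- macros/collect_freshness.sql (sketch, not an official fix)
{% macro collect_freshness(source, loaded_at_field, filter) %}
  {% call statement('collect_freshness', fetch_result=True, auto_begin=False) -%}
    select
      {# aggregate first, then convert: max() runs over the raw column,
         so the conversion is evaluated once instead of once per row #}
      convert_timezone('UTC', max({{ loaded_at_field }})) as max_loaded_at,
      convert_timezone('UTC', current_timestamp()) as snapshotted_at
    from {{ source }}
    {% if filter %}
      where {{ filter }}
    {% endif %}
  {% endcall %}
  {{ return(load_result('collect_freshness')) }}
{% endmacro %}
```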
Steps To Reproduce
In a dbt-snowflake environment:
Create a sample dbt project and set up the dbt profile.
Make a sample table in Snowflake for testing freshness:
```sql
create or replace table D_HARATO_DB.SAMPLE_SCHEMA.SAMPLE_TABLE as
select '2024-11-24' as TEST_DATE, 'test' as SAMPLE_VALUES
union
select '2024-11-25' as TEST_DATE, 'test' as SAMPLE_VALUES
;
```
Set up the data freshness test in dbt:
```yaml
# _source.yml
version: 2
sources:
  - name: sample_source
    database: D_HARATO_DB
    schema: SAMPLE_SCHEMA
    description: This source includes raw order data
    tables:
      - name: SAMPLE_TABLE
        description: Raw data for orders
        freshness:
          warn_after: {count: 18, period: hour}
          error_after: {count: 24, period: hour}
        loaded_at_field: convert_timezone('UTC', TEST_DATE)
```
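Based on the compiled pattern shown under Current Behavior, this configuration should compile to a freshness query roughly like the following (reconstructed for illustration; the exact compiled SQL may differ):

```sql
select
    max(convert_timezone('UTC', TEST_DATE)) as max_loaded_at,
    convert_timezone('UTC', current_timestamp()) as snapshotted_at
from D_HARATO_DB.SAMPLE_SCHEMA.SAMPLE_TABLE
```

Here the convert_timezone call sits inside max(), so it must be evaluated for every row before the aggregate can be computed.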
Run `dbt source freshness`.
Relevant log output
No response
Environment
- OS: macOS
- Python: 3.11.6
- dbt: 1.9.0
Which database adapter are you using with dbt?
snowflake
Additional Context
This is the query profile after running the freshness test: it scans the two records created, applies the timezone conversion to each, and then retrieves the latest date from the converted results.