Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] When performing data freshness tests in a timezone other than UTC, a full table scan may be executed, resulting in longer test execution times. #11066

Open
2 tasks done
Benjamin0313 opened this issue Nov 28, 2024 · 0 comments
Labels
bug Something isn't working triage

Comments

@Benjamin0313
Copy link

Benjamin0313 commented Nov 28, 2024

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Currently, when running dbt source freshness tests in a timezone other than UTC, a query like the following is compiled:

select
  max(convert_timezone('UTC', created_at_local)) as max_loaded_at,
  convert_timezone('UTC', current_timestamp()) as snapshotted_at
from raw.jaffle_shop.orders

In this case, the execution order to obtain max_loaded_at is as follows:
• Convert the timezone.
• Retrieve the maximum value of created_at_local in the converted timezone.

At this stage, the timezone conversion is performed for all records, and the maximum date is then calculated from the converted timestamps. As a result, the table is fully scanned once, and aggregation is performed.

As a result, when the number of records is large, the data freshness test takes a significant amount of time.

Expected Behavior

The ideal steps would be as follows:
• First, retrieve the latest date from the table.
• Then, perform the timezone conversion.

select
  convert_timezone('UTC', max(created_at_local)) as max_loaded_at,
  convert_timezone('UTC', current_timestamp()) as snapshotted_at
from raw.jaffle_shop.orders

Steps To Reproduce

  1. In a dbt-snowflake environment in
  2. Create sample dbt project and set up dbt profile
  3. Make sample table in snowflake for testing freshness
create or replace table D_HARATO_DB.SAMPLE_SCHEMA.SAMPLE_TABLE as
select '2024-11-24' as TEST_DATE ,'test' as  SAMPLE_VALUES
union
select '2024-11-25' as TEST_DATE ,'test' as  SAMPLE_VALUES
;
  1. “Set up data freshness tests in dbt”
# _source.yml
version: 2

sources:
  - name: sample_source
    database: D_HARATO_DB
    schema: SAMPLE_SCHEMA
    description: This source includes raw order data
    tables:
      - name: SAMPLE_TABLE
        description: Raw data for orders
        freshness:
          warn_after: {count: 18, period: hour}
          error_after: {count: 24, period: hour}
        loaded_at_field: convert_timezone('UTC', TEST_DATE)
  1. Run dbt source freshness

Relevant log output

No response

Environment

- OS:mac 
- Python:3.11.6
- dbt: 1.9.0

Which database adapter are you using with dbt?

snowflake

Additional Context

This is the profile after running the freshness test. It scans the two records created and applies timezone conversion and retrieves the latest date from the scan results.

スクリーンショット 2024-11-28 11 12 33
@Benjamin0313 Benjamin0313 added bug Something isn't working triage labels Nov 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage
Projects
None yet
Development

No branches or pull requests

1 participant