support MicrobatchConcurrency capability #1259

Merged
merged 7 commits into from
Dec 5, 2024

Conversation

MichelleArk
Contributor

@MichelleArk MichelleArk commented Nov 27, 2024

resolves #1260
docs dbt-labs/docs.getdbt.com/#

Problem

dbt-snowflake does not yet support running the microbatch incremental strategy in concurrent threads.

Solution

  • Ensure temp tables are written to unique locations when batch.id is available (in microbatch model context)
  • Enable MicrobatchConcurrency capability
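A minimal sketch of the naming idea in the first bullet, written in Python purely for illustration. The function `tmp_relation_name` and its parameters are hypothetical and not part of dbt-snowflake; the `__dbt_tmp_<batch id>` pattern is an assumption inferred from the example relation name reported in the validation notes of this thread.

```python
from typing import Optional


def tmp_relation_name(model_name: str, batch_id: Optional[str] = None) -> str:
    """Build a temp relation identifier, suffixed with the batch id when one exists.

    Hypothetical helper for illustration; not dbt-snowflake's actual implementation.
    """
    base = f"{model_name}__dbt_tmp"
    # Outside a microbatch context there is no batch id, so keep the plain name.
    if batch_id is None:
        return base
    # Inside a microbatch context each batch gets its own temp relation,
    # so batches running in parallel threads do not overwrite each other.
    return f"{base}_{batch_id}"


print(tmp_relation_name("microbatch"))
print(tmp_relation_name("microbatch", "19980801"))
```

Keeping the unsuffixed name when no batch id is present leaves non-microbatch models unaffected under this sketch.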

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

@cla-bot cla-bot bot added the cla:yes label Nov 27, 2024
Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the dbt-snowflake contributing guide.

Contributor

@QMalcolm QMalcolm left a comment

:shipit:

@MichelleArk MichelleArk marked this pull request as ready for review December 2, 2024 14:53
@MichelleArk MichelleArk requested a review from a team as a code owner December 2, 2024 14:53
@MichelleArk
Contributor Author

MichelleArk commented Dec 4, 2024

📉 🎩 I've done a bunch of benchmarking around this enablement!

Setup

  • Microbatch input model with several years of data, although 30 days is all that was strictly necessary here.
  • Microbatch model configured with a 'day' batch size, producing a variable-sized batch (1, 1k, 10k, or 50k records) for each day
  • Running dbt run --select microbatch --full-refresh --event-time-start 1998-07-01 --event-time-end 1998-08-02 in three configurations: concurrency disabled, concurrency enabled with 4 threads, and concurrency enabled with 8 threads

Results

batch size = 1 row

| # concurrent threads          | overall time (s) | sum of batch time (s) | avg time per batch (s) | % of time spent in concurrency-related overhead |
|-------------------------------|------------------|-----------------------|------------------------|-------------------------------------------------|
| 1 (concurrent_batches: False) | 53.96            | 50.98                 | 1.6                    | N/A                                             |
| 4                             | 28.35            | 87.17                 | 2.7                    | 40%                                             |
| 8                             | 22.21            | 120.96                | 3.8                    | 57%                                             |

batch size = 1k rows

^ no significant difference to 1 row result

batch size = 10k rows

| # concurrent threads          | overall time (s) | sum of batch time (s) | avg time per batch (s) | % of time spent in concurrency-related overhead |
|-------------------------------|------------------|-----------------------|------------------------|-------------------------------------------------|
| 1 (concurrent_batches: False) | 75.32            | 72.31                 | 2.25                   | N/A                                             |
| 4                             | 31.02            | 99.94                 | 3.1                    | 27%                                             |
| 8                             | 28.4             | 140.78                | 4.4                    | 48%                                             |

batch size = 50k rows

| # concurrent threads          | overall time (s) | sum of batch time (s) | avg time per batch (s) | % of time spent in concurrency-related overhead |
|-------------------------------|------------------|-----------------------|------------------------|-------------------------------------------------|
| 1 (concurrent_batches: False) | 98.79            | 95.99                 | 3                      | N/A                                             |
| 4                             | 39.63            | 131.33                | 4                      | 25%                                             |
| 8                             | 30.80            | 172.94                | 5.4                    | 44%                                             |
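
The thread does not state how the overhead column was computed. One reading that reproduces the reported percentages is the extra average time per batch relative to the concurrent average, truncated to a whole percent. The sketch below assumes that formula; `overhead_pct` is a hypothetical helper name, and the inputs are the "avg time per batch" values from the tables.

```python
def overhead_pct(serial_avg: float, concurrent_avg: float) -> int:
    """Share of a concurrent batch's average time not explained by the serial average.

    Assumed formula (not confirmed by the author):
    (concurrent_avg - serial_avg) / concurrent_avg, truncated to a whole percent.
    """
    return int((concurrent_avg - serial_avg) / concurrent_avg * 100)


# (serial avg, avg with 4 threads, avg with 8 threads), in seconds, per batch size
avg_batch_times = {
    "1 row": (1.6, 2.7, 3.8),
    "10k rows": (2.25, 3.1, 4.4),
    "50k rows": (3.0, 4.0, 5.4),
}

for batch_size, (serial, four, eight) in avg_batch_times.items():
    print(batch_size, overhead_pct(serial, four), overhead_pct(serial, eight))
```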

Validation:

  • For each scenario, I confirmed that the resulting dataset when running in concurrent mode is equivalent to the serial dataset produced. I've also manually confirmed the temp table suffixing is working as expected (e.g. create or replace temporary view analytics.dbt_marky.microbatch__dbt_tmp_19980801).

Conclusions

It's clear that there is some overhead associated with enabling concurrency (longer average runs for individual batches), with the benefit of significantly reduced overall runtime. This overhead appears to be non-linear, however, and it should decline relative to the overall runtime as the batches being merged get larger.
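
To put a number on the overall benefit, here is a rough sketch that derives wall-clock speedup from the "overall time (s)" column of the tables above. The `speedup` helper is a hypothetical name, and the ratio is a simple derived metric rather than something reported in the benchmark itself.

```python
def speedup(serial_overall: float, concurrent_overall: float) -> float:
    """Wall-clock speedup of a concurrent run relative to the serial run."""
    return serial_overall / concurrent_overall


# (serial, 4 threads, 8 threads) overall times in seconds, per batch size
overall_times = {
    "1 row": (53.96, 28.35, 22.21),
    "10k rows": (75.32, 31.02, 28.4),
    "50k rows": (98.79, 39.63, 30.80),
}

for batch_size, (serial, four, eight) in overall_times.items():
    print(
        f"{batch_size}: {speedup(serial, four):.2f}x with 4 threads, "
        f"{speedup(serial, eight):.2f}x with 8 threads"
    )
```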

I believe we should proceed with enabling concurrency for dbt-snowflake, and document that the benefit of running batches concurrently is faster overall backfills, with the side effect of longer-running individual batches, since the platform incurs some overhead to merge into the main dataset safely and concurrently.

@MichelleArk MichelleArk merged commit 86cf6e6 into main Dec 5, 2024
14 checks passed
@MichelleArk MichelleArk deleted the microbatch-concurrency-capability branch December 5, 2024 20:37
github-actions bot pushed a commit that referenced this pull request Dec 5, 2024
Development

Successfully merging this pull request may close these issues.

Enable MicrobatchConcurrency for dbt-snowflake
3 participants