support MicrobatchConcurrency capability #1259

MichelleArk · 2024-11-27T22:18:45Z

resolves #1260
docs dbt-labs/docs.getdbt.com/#

Problem

dbt-snowflake does not yet support running the microbatch incremental strategy in concurrent threads.

Solution

Ensure temp tables are written to unique locations when batch.id is available (in microbatch model context)
Enable MicrobatchConcurrency capability
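
To illustrate the first point, here is a minimal Python sketch of the naming idea, not the actual implementation (the real change lives in the Jinja macro in incremental.sql). The function names `temp_relation_name` and `day_batch_id` are hypothetical, and the `YYYYMMDD` suffix format is assumed from the temp relation name quoted in the validation notes further down this thread.

```python
from datetime import datetime
from typing import Optional


def day_batch_id(batch_start: datetime) -> str:
    # Assumption: a 'day' batch is identified by its start date as YYYYMMDD.
    return batch_start.strftime("%Y%m%d")


def temp_relation_name(model_name: str, batch_id: Optional[str] = None) -> str:
    # Hypothetical helper: without a batch id every run shares one temp name,
    # so concurrent batches would overwrite each other's temp relation.
    base = f"{model_name}__dbt_tmp"
    # With a batch id, each batch gets its own temp relation name.
    return f"{base}_{batch_id}" if batch_id else base


if __name__ == "__main__":
    print(temp_relation_name("microbatch"))
    print(temp_relation_name("microbatch", day_batch_id(datetime(1998, 8, 1))))
```

Because the suffix is derived from the batch itself, two batches of the same model running in different threads never target the same temp relation.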

Checklist

I have read the contributing guide and understand what's expected of me
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

github-actions · 2024-11-27T22:18:58Z

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the dbt-snowflake contributing guide.

QMalcolm

dbt/include/snowflake/macros/materializations/incremental.sql

MichelleArk · 2024-12-04T22:22:16Z

📉 🎩 I've done a bunch of benchmarking around this enablement!

Setup

Microbatch input model with several years of data, although 30 days is all that was strictly necessary here.
Microbatch model configured with a 'day' batch size, producing a variable-sized batch (1, 1k, 10k, or 50k records) for each day
Running `dbt run --select microbatch --full-refresh --event-time-start 1998-07-01 --event-time-end 1998-08-02` in three configurations: concurrency disabled, concurrency enabled with 4 threads, and concurrency enabled with 8 threads
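
As a rough sketch of how the date range above maps to batches, the following Python snippet enumerates one 'day' batch per date. The helper `day_batches` is hypothetical, and treating the end date as exclusive is an assumption; it is consistent with the figures in this comment (sum of batch time divided by average time per batch is roughly 32, and the last temp relation mentioned in the validation notes is suffixed 19980801).

```python
from datetime import date, timedelta


def day_batches(start: date, end: date) -> list[date]:
    # Assumption: the end bound is exclusive, so the last batch starts
    # the day before `end`.
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days)]


if __name__ == "__main__":
    batches = day_batches(date(1998, 7, 1), date(1998, 8, 2))
    print(len(batches), batches[0], batches[-1])
    # prints: 32 1998-07-01 1998-08-01
```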

Results

batch size = 1 row

| # concurrent threads          | overall time (s) | sum of batch time (s) | avg time per batch (s) | % of time spent in concurrency-related overhead |
|-------------------------------|------------------|-----------------------|------------------------|-------------------------------------------------|
| 1 (concurrent_batches: False) | 53.96            | 50.98                 | 1.6                    | N/A                                             |
| 4                             | 28.35            | 87.17                 | 2.7                    | 40%                                             |
| 8                             | 22.21            | 120.96                | 3.8                    | 57%                                             |

batch size = 1k rows

^ no significant difference to 1 row result

batch size = 10k rows

| # concurrent threads          | overall time (s) | sum of batch time (s) | avg time per batch (s) | % of time spent in concurrency-related overhead |
|-------------------------------|------------------|-----------------------|------------------------|-------------------------------------------------|
| 1 (concurrent_batches: False) | 75.32            | 72.31                 | 2.25                   | N/A                                             |
| 4                             | 31.02            | 99.94                 | 3.1                    | 27%                                             |
| 8                             | 28.4             | 140.78                | 4.4                    | 48%                                             |

batch size = 50k rows

| # concurrent threads          | overall time (s) | sum of batch time (s) | avg time per batch (s) | % of time spent in concurrency-related overhead |
|-------------------------------|------------------|-----------------------|------------------------|-------------------------------------------------|
| 1 (concurrent_batches: False) | 98.79            | 95.99                 | 3                      | N/A                                             |
| 4                             | 39.63            | 131.33                | 4                      | 25%                                             |
| 8                             | 30.80            | 172.94                | 5.4                    | 44%                                             |
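
The overhead column in the tables above is not defined explicitly. The reported values appear consistent with taking the extra average time per batch relative to the serial run, as a share of the concurrent average, truncated to a whole percent. The sketch below is that inferred formula, not a confirmed definition; `overhead_pct` and `speedup` are hypothetical helper names.

```python
import math


def overhead_pct(serial_avg_s: float, concurrent_avg_s: float) -> int:
    # Inferred formula (an assumption): the share of a concurrent batch's
    # average time that exceeds the serial average, truncated to a percent.
    return math.floor((concurrent_avg_s - serial_avg_s) / concurrent_avg_s * 100)


def speedup(serial_overall_s: float, concurrent_overall_s: float) -> float:
    # Wall-clock speedup of the whole run relative to the serial run.
    return serial_overall_s / concurrent_overall_s


if __name__ == "__main__":
    # batch size = 1 row, taken from the first table
    print(overhead_pct(1.6, 2.7), overhead_pct(1.6, 3.8))
    print(round(speedup(53.96, 28.35), 2), round(speedup(53.96, 22.21), 2))
```

Read this way, per-batch overhead grows with thread count while overall wall-clock time still drops, which is the trade-off discussed in the conclusions.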

Validation:

For each scenario, I confirmed that the resulting dataset when running in concurrent mode is equivalent to the serial dataset produced. I've also manually confirmed the temp table suffixing is working as expected (e.g. create or replace temporary view analytics.dbt_marky.microbatch__dbt_tmp_19980801).

Conclusions

It's clear that enabling concurrency carries some overhead (longer average runs for individual batches), with the benefit of significantly reduced overall runtime. This overhead appears to be non-linear, however, and it should decline relative to the overall runtime as the batch to be merged gets larger.

I believe we should proceed with enabling concurrency for dbt-snowflake, and document that the benefit of running models concurrently is faster overall backfilling capabilities, with the side effect of longer-running batches, as some amount of overhead is incurred by the platform to merge into the main dataset safely + concurrently.

support MicrobatchConcurrency capability

7259135

cla-bot bot added the cla:yes label Nov 27, 2024

QMalcolm approved these changes Nov 27, 2024

MichelleArk and others added 2 commits December 2, 2024 09:13

Merge branch 'main' into microbatch-concurrency-capability

d236ec6

changelog entry

53a9010

MichelleArk marked this pull request as ready for review December 2, 2024 14:53

MichelleArk requested a review from a team as a code owner December 2, 2024 14:53

MichelleArk added 2 commits December 2, 2024 12:42

Merge branch 'main' into microbatch-concurrency-capability

d04f4dd

Merge branch 'main' into microbatch-concurrency-capability

3821905

MichelleArk commented Dec 5, 2024

MichelleArk added 2 commits December 5, 2024 14:16

leverage new global make_temp_relation

7829b10

remove empty line

c69b33b

mikealfare approved these changes Dec 5, 2024

MichelleArk merged commit 86cf6e6 into main Dec 5, 2024
14 checks passed

MichelleArk deleted the microbatch-concurrency-capability branch December 5, 2024 20:37

MichelleArk added the backport 1.9.latest label Dec 5, 2024

github-actions bot pushed a commit that referenced this pull request Dec 5, 2024

support MicrobatchConcurrency capability (#1259)

fca4c59

(cherry picked from commit 86cf6e6)

github-actions bot mentioned this pull request Dec 5, 2024

[Backport 1.9.latest] support MicrobatchConcurrency capability #1264

Merged

MichelleArk pushed a commit that referenced this pull request Dec 5, 2024

support MicrobatchConcurrency capability (#1259) (#1264)

af09301

thewchan mentioned this pull request Dec 11, 2024

[bot-automerge] dbt-snowflake v1.9.0 conda-forge/dbt-snowflake-feedstock#31

Open

3 tasks
