Replies: 13 comments 4 replies
-
+1 for including sharding as a dbt feature. To highlight another use case: when we hit the max partitioning limit of 4000 in BigQuery (as of today) for large tables, an alternative would be to shard them (by txn year, as an example) and then partition each shard on the originally intended partition column (e.g. txn date).
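A minimal sketch of that workaround, assuming a `txn_date` timestamp column and made-up dataset/table names (none of these identifiers come from the thread): each yearly shard is day-partitioned, so no single table approaches the 4000-partition ceiling.

```sql
-- Sketch only: table, dataset, and column names are assumptions.
-- One shard per transaction year, each partitioned by day on txn_date;
-- daily partitions on a single table would exceed BigQuery's 4000-partition
-- limit after roughly 11 years of data, per-year shards stay well under it.
CREATE TABLE IF NOT EXISTS analytics.orders_2023
PARTITION BY DATE(txn_date) AS
SELECT *
FROM staging.orders_raw
WHERE EXTRACT(YEAR FROM txn_date) = 2023;
```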
-
This would be super useful for us as well; we'd love to physically separate some of the data in our tables based on the customer's id. It would dramatically simplify security.
-
+1
-
+1 It would be very useful for us as well. We usually need to shard our orders data (10TB) by country and other fields.
-
As I've mentioned in some of the other issues that linked back here: I'd be interested in taking another look at sharded tables next year. The basic idea—one model, multiple objects, with a "union" view atop them—is quite similar to what we'd need to natively support lambda views. One open question for me is the extent to which we want to plug into existing database capabilities that approximate this functionality, such as BigQuery's older school ingestion-time-partitioned tables. We officially deprecated support for those in
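For readers unfamiliar with that shape, here is a rough illustration in plain SQL of "one model, multiple objects, with a union view atop them"; the shard and view names are invented for the example, and this is not necessarily how dbt would implement it.

```sql
-- Illustration only: object names are hypothetical.
-- Several physical shards (fct_orders_2022, fct_orders_2023, ...) plus one
-- logical relation that downstream models would reference.
CREATE OR REPLACE VIEW analytics.fct_orders AS
SELECT * FROM analytics.fct_orders_2022
UNION ALL
SELECT * FROM analytics.fct_orders_2023
UNION ALL
SELECT * FROM analytics.fct_orders_2024;
```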
-
+1 Not sure what the current status of this is, but I was able to work around it using two things:
A little troublesome at the moment, but it works for me
-
+1
-
In the meantime, I am thinking of taking the approach below, as we have permission issues for dbt users on those external sharded tables:
Benefits:
To be added: partitions on the newly created table; we can also build an external process to create our new table in incremental mode so it is efficient, only processing new date tables on a daily basis and appending them to the existing master table (which will again be partitioned, so downstream models can use it efficiently by filtering on partitions).
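A rough sketch of that incremental consolidation step as a dbt model, assuming date-suffixed source shards at a made-up `my-project.raw.orders_*` location; the model name, project, and column names are illustrative, not taken from the comment.

```sql
-- models/orders_master.sql (hypothetical): on each run, append only newly
-- arrived date shards to a single date-partitioned master table.
{{
  config(
    materialized='incremental',
    partition_by={'field': 'txn_date', 'data_type': 'date'}
  )
}}

select
  parse_date('%Y%m%d', _table_suffix) as txn_date,
  *
from `my-project.raw.orders_*`

{% if is_incremental() %}
  -- only read shards newer than what the master table already contains
  where parse_date('%Y%m%d', _table_suffix) > (select max(txn_date) from {{ this }})
{% endif %}
```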
-
Feature:
Feature description
Currently in dbt, there is a hard-and-fast rule that one model file results in at most one object (table or view) in the database. This is a core part of the design of the product and has been true since the very first commit. In most situations, this works OK. There is one specific case where it does not, however: sharded tables.
Sharding isn't a term that's often used outside of the BigQuery world, but it's a pattern that in practice is used on Snowflake and Redshift semi-frequently when organizations are dealing with large enough datasets. Essentially, sharding is simply creating a series of physical tables that are "sharded" on a key (the ones I've seen most often are `customer_id` and `created_date`) that, when taken all together, represent a complete view of the entire table. These tables are typically named in the format `table_name_[shard]`, i.e. `fct_orders_190401`.
BigQuery provides a wildcard operator to allow all shards in a logical table to be selected from at the same time, and building tables using this paradigm was well-supported in even early public versions of BigQuery. Redshift and Snowflake do not have quite such native support for this style, but the Redshift docs specifically talk about this strategy, and I've heard from the Snowflake internal analytics team that they use this pattern as well.
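As an illustration of the wildcard pattern described above (the project, dataset, and column names are invented for the example):

```sql
-- Scans every fct_orders_* shard for April 2019; _TABLE_SUFFIX is BigQuery's
-- pseudo-column holding the suffix of the matched shard.
SELECT
  order_id,
  customer_id,
  order_total
FROM `my-project.analytics.fct_orders_*`
WHERE _TABLE_SUFFIX BETWEEN '190401' AND '190430';
```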
I can imagine multiple ways that dbt could theoretically be modified in order to output this type of data structure in a more idiomatic way, but this is far enough from dbt's standard paradigm today that I don't want to be prescriptive here: I legitimately don't know what the ideal answer is from either a dbt user's perspective or from a technology perspective. Instead, I just want to flag this as a real need—one that I have personally felt on recent projects, and one that several teams I've spoken to would get value out of. Currently, those teams are employing some fascinating hacks, escaping to Python and Airflow, to work around dbt's inability to handle this type of data structure.
Why would anyone want to use this?
Who will this benefit?
This will benefit dbt users who are using dbt to process large tables, typically 50GB+ but often 1TB+, who want to apply the fairly common data engineering design pattern of sharding data into multiple physical tables.