Replies: 13 comments 4 replies
-
+1 for including sharding as a dbt feature. To highlight another use case: when we hit the max partitioning limit of 4000 in BigQuery (as of today) for large tables, an alternative would be to shard them (by txn year, as an example) and then partition each shard on the originally intended partition column (e.g. txn date).
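A minimal sketch of that workaround, assuming a `txn_date` timestamp column and made-up dataset/table names (none of these identifiers come from the thread): each yearly shard is day-partitioned, so no single table approaches the 4000-partition ceiling.

```sql
-- Sketch only: table, dataset, and column names are assumptions.
-- One shard per transaction year, each partitioned by day on txn_date;
-- daily partitions on a single table would exceed BigQuery's 4000-partition
-- limit after roughly 11 years of data, per-year shards stay well under it.
CREATE TABLE IF NOT EXISTS analytics.orders_2023
PARTITION BY DATE(txn_date) AS
SELECT *
FROM staging.orders_raw
WHERE EXTRACT(YEAR FROM txn_date) = 2023;
```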
-
This would be super useful for us as well; we'd love to physically separate some of the data in our tables based on the customer's id. It would dramatically simplify security.
-
+1
-
+1 It would be very useful for us as well. We usually need to shard our orders data (10TB) by country and other fields.
-
As I've mentioned in some of the other issues that linked back here: I'd be interested in taking another look at sharded tables next year. The basic idea—one model, multiple objects, with a "union" view atop them—is quite similar to what we'd need to natively support lambda views. One open question for me is the extent to which we want to plug into existing database capabilities that approximate this functionality, such as BigQuery's older school ingestion-time-partitioned tables. We officially deprecated support for those in
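For readers unfamiliar with that shape, here is a rough illustration in plain SQL of "one model, multiple objects, with a union view atop them"; the shard and view names are invented for the example, and this is not necessarily how dbt would implement it.

```sql
-- Illustration only: object names are hypothetical.
-- Several physical shards (fct_orders_2022, fct_orders_2023, ...) plus one
-- logical relation that downstream models would reference.
CREATE OR REPLACE VIEW analytics.fct_orders AS
SELECT * FROM analytics.fct_orders_2022
UNION ALL
SELECT * FROM analytics.fct_orders_2023
UNION ALL
SELECT * FROM analytics.fct_orders_2024;
```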
-
+1 Not sure what the current status of this is, but I was able to work around it using two things:
A little troublesome at the moment, but it works for me
-
+1
-
In the meantime, I am thinking of taking the approach below, as we have permission issues for dbt users on those external sharded tables:
Benefits:
To be added: partitions on the newly created table; we can also build an external process to create our new table in incremental mode so it is efficient, only processing new date tables on a daily basis and appending them to the existing master table (which will again be partitioned, so downstream models can use it efficiently by filtering on partitions).
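A rough sketch of that incremental consolidation step as a dbt model, assuming date-suffixed source shards at a made-up `my-project.raw.orders_*` location; the model name, project, and column names are illustrative, not taken from the comment.

```sql
-- models/orders_master.sql (hypothetical): on each run, append only newly
-- arrived date shards to a single date-partitioned master table.
{{
  config(
    materialized='incremental',
    partition_by={'field': 'txn_date', 'data_type': 'date'}
  )
}}

select
  parse_date('%Y%m%d', _table_suffix) as txn_date,
  *
from `my-project.raw.orders_*`

{% if is_incremental() %}
  -- only read shards newer than what the master table already contains
  where parse_date('%Y%m%d', _table_suffix) > (select max(txn_date) from {{ this }})
{% endif %}
```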
-
Feature:
Feature description
Currently in dbt, there is a hard-and-fast rule that one model file results in at most one object (table or view) in the database. This is a core part of the design of the product and has been true since the very first commit. In most situations, this works OK. There is one specific case where it does not, however: sharded tables.
Sharding isn't a term that's often used outside of the BigQuery world, but it's a pattern that in practice is used on Snowflake and Redshift semi-frequently when organizations are dealing with large enough datasets. Essentially, sharding is simply creating a series of physical tables that are "sharded" on a key (the ones I've seen most often are `customer_id` and `created_date`) that, when taken all together, represent a complete view of the entire table. These tables are typically named in the format `table_name_[shard]`, i.e. `fct_orders_190401`.
BigQuery provides a wildcard operator to allow all shards in a logical table to be selected from at the same time, and building tables using this paradigm was well-supported in even early public versions of BigQuery. Redshift and Snowflake do not have quite such native support for this style, but the Redshift docs specifically talk about this strategy, and I've heard from the Snowflake internal analytics team that they use this pattern as well.
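As an illustration of the wildcard pattern described above (the project, dataset, and column names are invented for the example):

```sql
-- Scans every fct_orders_* shard for April 2019; _TABLE_SUFFIX is BigQuery's
-- pseudo-column holding the suffix of the matched shard.
SELECT
  order_id,
  customer_id,
  order_total
FROM `my-project.analytics.fct_orders_*`
WHERE _TABLE_SUFFIX BETWEEN '190401' AND '190430';
```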
I can imagine multiple ways that dbt could theoretically be modified in order to output this type of data structure in a more idiomatic way, but this is far enough from dbt's standard paradigm today that I don't want to be prescriptive here: I legitimately don't know what the ideal answer is from either a dbt user's perspective or from a technology perspective. Instead, I just want to flag this as a real need—one that I have personally felt on recent projects, and one that several teams I've spoken to would get value out of. Currently, those teams are employing some fascinating hacks, escaping to Python and Airflow, to work around dbt's inability to handle this type of data structure.
Why would anyone want to use this?
Who will this benefit?
This will benefit dbt users who are using dbt to process large tables, typically 50GB+ but often 1TB+, who want to apply the fairly common data engineering design pattern of sharding data into multiple physical tables.