Make it possible for DBT to generate models from a piece of JINJA code #5101
Replies: 16 comments 13 replies
-
@bashyroger Thanks for opening! Apologies for the delay in my response. There's a lot here to chew on. Today, one of the foundational assumptions underlying dbt is that 1 model file = 1 model node = 1 object (view or table) in the database. What's the problem here? Whenever we want a new dbt-managed database object, it requires a PR to add a new model file. That's hardly an automated approach to data warehousing. I've seen three quite different approaches that try to solve this problem:
At the risk of writing much too much here, I want to discuss each approach in some detail.

1. One model, many objects

In the most common use case, someone wants a large model result—same logic, same columns—to be split in the database across multiple tables, with perhaps a few parametrized differences across them. There may be a good reason (such as PII) that each subset needs to be isolated from other records, with the same column schema, transformed in the exact same manner. We see this sometimes with marketing or consulting agencies who offer modeled data as a service to their clients. There are parts about this that are tricky for dbt today:
In the meantime, while this functionality isn't natively supported, I've seen valid approaches here go in two different directions:
I wanted to bring up this approach because these have often been part of the same conversation. It's not exactly the issue you describe here, and I don't think it would apply nearly as well to unstructured data that doesn't share a common column schema.

2. One file, many models

This is an old, old issue: #184. There was a long time when we were convinced that model blocks were the ultimate way to go. They've fallen out of fashion recently, but this vein of thinking was especially popular around the time that we created snapshots (dbt v0.14.0, summer 2019).

-- models/many_models.sql
{% model stg_stripe_payments %}
select * from {{ source('stripe', 'payments') }}
where is_deleted = false
{% endmodel %}
{% model net_revenue %}
{{ config(materialized = 'table') }}
select
orders.order_id,
payments.amount - orders.cogs as net_revenue
from {{ ref('stg_stripe_payments') }} as payments
join {{ ref('stg_orders') }} as orders
on payments.order_id = orders.order_id
{% endmodel %}

There are a few things that model blocks have going for them:
This may still be a direction we head in, ultimately. We won't get there before dbt v1.0, but who's to say it couldn't be an essential component of dbt v2.0? I bring up model blocks here because the prospect of decoupling 1 model-per-file also suggests the possibility of taking this further, by treating the larger file as a single Jinja template that could produce many model blocks dynamically:

-- models/many_models.sql
{% for source in graph.sources.values() | selectattr('source_name', 'equalto', 'stripe') | list %}
{% model stg_stripe_{{ source.name }} %} {# how would this work? #}
select * from {{ source(source.source_name, source.name) }}
where is_deleted = false
{% endmodel %}
{% endfor %}

If you try to do this with snapshots today, you'll get an error message:
For good reason: this gets really complex to parse, and it makes us even more reliant on Jinja at a time when we're investing in static analysis to speed up parse-time manifest construction. There's an alternative syntax toyed with in #1374 (comment). That issue was about dynamic columns, to avoid repeating the same yaml properties over and over. (That's ground we've since retrod in #2995, among other places, and we're definitely interested in thinking about it more.) There, Drew mentioned the theoretical idea of a more imperative way to register models:

-- models/many_models.sql
{% for source in graph.sources.values() | selectattr('source_name', 'equalto', 'stripe') | list %}
{% set model_sql %}
select * from {{ source(source.source_name, source.name) }}
{% endset %}
{% set model_name = 'stg_' + source.source_name + '_' + source.name %}
{% do manifest.add_model(name=model_name, sql=model_sql) %}
{% endfor %}

While it's easier to imagine writing code like this, it would still be fiendishly complex to handle at parse time. This approach quickly runs into a big limitation in how dbt works today, given these two premises:
So the SQL for each individual model could be dynamically generated, just as it can today, based on the latest content or metadata in the database. But it would not be possible to create more or fewer models on the basis of information stored in the database—exactly the sort of use case that your issue is getting at. Fuller capabilities would require big foundational changes to the way that dbt works.

3. Code generation

After all, maybe dbt cannot be—ought not be—a data warehouse automator itself, but merely the best-possible substrate for "true" DWH automation? I feel this gets at the heart of the matter. In my early years using dbt, I thought the extensibility and flexibility of its tooling were its strongest features; with Jinja, all things are possible, certainly compared to SQL. Nowadays, I'm inclined to think that dbt's greatest strength is the rigidity of its opinionated framework, its demand for decisiveness and verbosity in the places that matter. The guardrails are there for a reason.

So, let's say we keep living in a world where one database object has to be one model has to be one file (or at least one Jinja block). An automated process could still be of tremendous help by generating those files; opening PRs; running CI checks; requiring a human reviewer (or not); merging; reverting if necessary. In that world, data warehouse automation is a time-lapse photograph: a double-speed version of the current git-based dbt development workflow, with human intervention at only the appropriate moments. The key difference is that data warehouse automation is decoupled from data warehouse execution—their handoff point is dbt model code, explicitly defined.

What should that automated process look like? Should the premier executor of dbt code also try to be its own code generator? Or should there be a separate tool? There are fair arguments on either side:
Why not both?

tl;dr

Eventually, I could see us doing pieces of all three of the things mentioned above:
Ultimately, I find the possibilities offered by the third approach most compelling. Having written 1500 words on dbt and model automation, I'm coming around to the idea that a native dbt task (#1082), or even just a way to plug together existing tools, could be the right starting point.
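For a concrete taste of what that plumbing can look like today, the dbt-codegen package already exposes macros such as generate_base_model that print boilerplate staging SQL; an automated process (or a human) can save that output as a model file and open a PR. A minimal sketch, reusing the stripe/payments example from above:

# prints boilerplate staging SQL for a source table; the output still needs to be
# saved into a model file by whatever automation wraps this command
dbt run-operation generate_base_model \
  --args '{"source_name": "stripe", "table_name": "payments"}'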
-
A masterful play in three acts! Thank you Jeremy.
-
We built workflows like this at Aula and we'll be talking about it at Coalesce 2021! Jeremy's description, above, is an elegant summary of what our automation does. Source schema files, snapshots, and the first staging layer are all auto-generated and auto-maintaining. The automation relies on the following components:
I didn't know about dbt-helper until just now, so I'm interested to take a look and see how importing dbt-core directly might help standardize the methods and make the code more usable for others. Assuming my coworker from Aula is ok with it, I'd also love to contribute it somewhere in an open-source way. I see somewhat frequent posts over in the dbt Slack from people looking for similar functionality.
-
Great discussion. If I get it right, approaches 1 and 2 above do not currently work, correct? Even if we end up with just approach 3, a "template" or reference implementation would be handy.
-
My team would also like this feature!
-
I'm curious if the ability from Jinja2 to extend from another template and replace predefined blocks of code (template inheritance) was considered for this feature? Then a user could define something along the lines of the sketch below.
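For readers unfamiliar with the concept, here is a minimal sketch of plain Jinja2 template inheritance. This is standard Jinja2 rather than anything dbt supports for models today, and the file names, block names, and columns are made up for illustration:

{# base_staging.sql.j2: a hypothetical parent template with overridable blocks #}
select
    {% block columns %}*{% endblock %}
from {% block source_table %}raw.some_table{% endblock %}
where {% block filter %}true{% endblock %}

{# stg_stripe_payments.sql.j2: a child template overriding only what differs #}
{% extends "base_staging.sql.j2" %}
{% block columns %}id, amount, status{% endblock %}
{% block source_table %}raw.stripe_payments{% endblock %}
{% block filter %}is_deleted = false{% endblock %}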
-
We would still really love this feature!
-
would love this feature!
-
Hi @Gwildor, there is an old feature request about Jinja2 template inheritance: #1337
-
My use case would be for a multi-tenant platform where we create an access layer (view) for every tenant. I would like to configure a list of tenant identifiers and generate a view for each tenant using one generic jinja template.
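For what it's worth, the closest approximation available today (without new dbt features) is probably a run-operation macro that loops over a configured tenant list and creates one view per tenant on top of a single shared model. A rough sketch, in which the macro name, the tenants var, the orders model, and the schema naming are all assumptions; note that views created this way are not dbt nodes, so they won't appear in the DAG or docs:

-- macros/create_tenant_views.sql (hypothetical workaround; run with
-- dbt run-operation create_tenant_views --vars '{"tenants": ["acme", "globex"]}')
{% macro create_tenant_views() %}
  {% for tenant in var('tenants', []) %}
    {# one access schema per tenant, named after the target schema plus the tenant id #}
    {% do run_query('create schema if not exists ' ~ target.schema ~ '_' ~ tenant) %}
    {% set sql %}
      create or replace view {{ target.schema }}_{{ tenant }}.orders as
      select * from {{ ref('orders') }}
      where tenant_id = '{{ tenant }}'
    {% endset %}
    {% do run_query(sql) %}
  {% endfor %}
{% endmacro %}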
-
Thanks for your extensive comment @jtcohen6, I have been giving your writing a bit more thought.
I still think it can / should be, partially. Partially by indeed introducing 'template models': models with a recursive (Jinja) loop that can spawn the creation of multiple model files. From what I can infer from your writing, this initial request would largely be fulfilled if #3428 were implemented. Regarding your comment on the parse-time vs. execution-time problem: those template models would not have to run at execution time initially. As a developer, I would purely see them as a way to automate the creation of multiple model files from a template, something that the current codegen package indeed cannot do. Practically, I would expect a command to exist for generating the files from such a template (a hypothetical sketch follows below). Then later, further along the road, I could see them being added to the execution context as well.
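To make that concrete, here is one purely hypothetical sketch of such a 'template model'. None of this syntax exists in dbt today: the template_model tag, the file location, and a generate-style command that would expand it into one physical model file per Stripe source table are all invented for illustration.

-- models/_templates/stg_stripe.sql (hypothetical syntax, not valid dbt today)
{% for src in graph.sources.values() | selectattr('source_name', 'equalto', 'stripe') %}
{% template_model 'stg_stripe_' ~ src.name %}
select * from {{ source(src.source_name, src.name) }}
where is_deleted = false
{% end_template_model %}
{% endfor %}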
-
This would be amazing!
-
Seems like an extremely important feature that is missing in dbt when we are talking about scalability. Discussion started in 202, is there any "movement" towards achieving this? An example of how easy it is to achieve this within Dataform: https://docs.dataform.co/guides/javascript/js-api#example-dynamic-dataset-generation
-
I would like to suggest an alternative solution not covered in @jtcohen6's original response post, let's call it...

4. Native templated models

One pattern I see amongst some dbt developers is a templating step to generate dbt models prior to running dbt. The main pain dbt developers experience with the "one model, one database object" constraint is that a new file must be created and maintained for each transformation. When there are 100s (or more) of transformations that can be expressed with a single macro, it becomes a burden to the developer to add and maintain transformations via the filesystem. It has emerged in the developer base that there is a common desire to manage individual transformations using fewer files (e.g., YAML files), possibly in a tabular format (e.g., spreadsheet, CSV, or database table). I propose here that dbt adds the capability for native templated models, with the main purpose of eliminating pre-processing templating steps for models that can be expressed as a single macro. The benefits should be an improved developer experience and a streamlined workflow. In essence, dbt developers who use a native templated model would be managing metadata about their transformations with a YAML file. This feature could also unlock some additional functionality for dbt down the line, as a templated model would be considerably more structured than a free-form SQL or Python model -- it would be machine-readable.

Existing templating examples (workarounds)

In both cases shown below, developers have the ability to express their transformations in configuration. Then, they run a pre-processing step to generate dbt models. Finally, they run dbt.

Example 1: Data Vault dbt packages

Two example projects in the Data Vault space are turbovault4dbt and dbtvault-generator. Both of these projects allow developers to manage their transformations in CSV files or database tables by specifying parameters to the Hubs, Links, and Satellites macros in datavault4dbt and AutomateDV respectively.

Before:
After running the pre-processing step:
Example 2: Custom generic templating

I have also seen presentations and workflows by other users who have built custom scripts to achieve similar results. One such templated model looks as follows (I call this one "yo dawg, I heard you like Jinja").

Before:
After running the pre-processing step:
Proposed solution

Allow developers to add dbt models via YAML configuration (a hypothetical sketch of what that could look like follows below). Then, with these models defined in YAML, the additional models would appear in the DAG as though they were separate SQL files. This would not require explicit generation of physical SQL files prior to running dbt -- they would be interpolated during dbt runtime.
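As a sketch of what that YAML could look like, assuming a hypothetical templated_models property and a generic staging macro (neither of which exists in dbt today):

# models/staging/templated_models.yml (hypothetical syntax)
templated_models:
  - name: stg_stripe_payments
    macro: generate_staging_model   # hypothetical macro that returns the model SQL
    args:
      source_name: stripe
      table_name: payments
  - name: stg_stripe_customers
    macro: generate_staging_model
    args:
      source_name: stripe
      table_name: customers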
This does not allow for complete dynamism in model generation. Notably, there are no programming constructs like for-loops and if-statements. But I believe this would suffice for many use cases of "generated models", particularly in the Data Vault space.

Related art
-
This would be great!
-
My team is moving from Dataform to dbt because it seemed to be a much more powerful tool. I'm trying to migrate a model from Dataform and just can't believe that I won't be able to do that because dbt doesn't have this feature yet :( When will this type of feature be available?
-
Describe the feature
While I like DBT a lot, it is currently quite limited in supporting true metadata-based 'data warehouse automation'. Sure, a lot of things are automated for you in DBT:
However, what is lacking is the possibility for DBT to:
Describe alternatives you've considered
At my current client we store our raw data in a Snowflake variant column, with the schema metadata of that column in a separate variant column.
We do this to keep loading data simple and to never be affected by source schema changes in the raw data zone.
To actually use this data, however, we have written a piece of Python code that parses the stored schema metadata to generate a series of DBT models (materialized as views) on top of the raw data tables.
While this all works fine, this code is obviously disconnected from DBT, which currently hinders development as it requires code/environment/context switching.
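For context, a single generated model of this kind might look roughly like the following. This is a hypothetical sketch: the source, column names, and JSON paths are invented, but the pattern of casting fields out of a Snowflake variant column is the relevant part.

-- models/parsed/stg_orders_parsed.sql (hypothetical generated model)
{{ config(materialized='view') }}

select
    raw_payload:order_id::integer        as order_id,
    raw_payload:customer:email::varchar  as customer_email,
    raw_payload:amount::number(38, 2)    as amount,
    loaded_at
from {{ source('raw', 'orders') }}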
Additional context
Who will this benefit?
Everyone who wants to do more extensive data warehouse automation using DBT.
Are you interested in contributing this feature?
I am not a solid programmer, but I would definitely like to contribute to the requirements and use cases, and to beta-test this kind of functionality.