Replies: 33 comments 30 replies
-
@dataexpertz-blr Thanks for opening, I'm surprised there wasn't already an issue for this :) It's something we're hearing and talking about a lot these days. MechanismsI view column-level lineage as existing in two orders of complexity:
Use casesAs with any compelling feature, column-level lineage feels both immensely valuable and a bit vague. If dbt could produce an EXPLAIN-style plan, of every single SQL function performed to produce a single column, that would be very cool, and also tricky to read and reason about as a human being. So I do find it useful to think concretely about the kinds of things we'd hope to enable here:
I'm curious to hear what other things come to mind! |
Beta Was this translation helpful? Give feedback.
-
+4 for the use case you've mentioned @jtcohen6 , I'd like to add the following, IMO important use case:
Instead of building this yourself, you could also think about integrating DBT with the only data lineage focused SAAS product I have heard of: https://getmanta.com/integrations/ |
Beta Was this translation helpful? Give feedback.
-
we are planning to use DBT Cloud for our project and this table by table lineage is really a killer feature which we would like to have as This would make data lineage and debugging much much easier |
Beta Was this translation helpful? Give feedback.
-
Hi 👋 , does anyone know how Datafold + DBT claims to provide this? |
Beta Was this translation helpful? Give feedback.
-
Datafold use their own lineage capabilities to do this — while they do read in your dbt project, the column level lineage is part of their platform (it's an awesome feature though!) |
Beta Was this translation helpful? Give feedback.
-
Monzo (a UK startup bank) have written about how they created column lineage using BigQuery audit logs and parsing the compiled dbt queries with ZetaSQL. Hadn't heard about ZetaSQL before but it's an open source project from Google for analysing SQL. That solution is BigQuery specific but still interesting to see an implemented approach to this problem. |
Beta Was this translation helpful? Give feedback.
-
Is there a huge benefit here where you could define your docs/descriptions once at the source table level and then have that trickle down to all columns that reference the original source column? Still wrestling how to populate descriptions for all columns effectively and efficiently. |
Beta Was this translation helpful? Give feedback.
-
one new application i thought of: dbt build could only block downstream nodes that are directly affected by a test failure. let's say a random column column1 fails in model A, and model A -> model B. B uses column2 from A but not column1. if a test on column1 fails, let's say a not null test, it shouldn't block model B since model B doesn't use column1. |
Beta Was this translation helpful? Give feedback.
-
It looks like the following company is working on this functionality: https://www.hivedive.io/ |
Beta Was this translation helpful? Give feedback.
-
I think there are countless vendors out there specializing on a certain database vendor (i.e. for Snowflake column level lineage is easier as they provide great meta data around accessed objects on column level already). Another very concrete use case I wanted to add: Optimize your tracking plan (and thus, saving compute) We are using the open source version of Snowplow and are lacking a unified UI for frontend + backend + analytics to align for. That lead to situations where very similar events where added but nothing was deprecated as "it could break things downstream". Being able to analyse which events from the raw source tables are used in downstream models would make that way more convenient. |
Beta Was this translation helpful? Give feedback.
-
I was expecting dbt to generate (inherit) column descriptions for downstream tables if it's already defined on source table. Then I faced this discussion and realised I also need column level lineage because it takes so much time to understand the affect of changing a column. |
Beta Was this translation helpful? Give feedback.
-
I address some of these limitations (specifically the maintenance ones, i.e. the ability to "forward" unchanged column descriptions and tests from upstream to downstream models -- this doesn't expose that lineage anywhere) with some dbt macros on top of the scripts I defined for #5093 (comment). It's a bit convoluted so not going to go into full detail right now unless folks ask. Summary: After I had a solution for #5093 via SQL comments & a regex extraction script, I started using a modified version of |
Beta Was this translation helpful? Give feedback.
-
I'm currently ramping up on an existing DBT stack, and I just wanted to stress how column-level lineage would help me do so more efficiently. Without it, figuring out the upstream flow of a transformed table field takes me forever, and I'd dream to be able to see it in one click! This would be a game-changer for my productivity and I hope we'll move forward on this topic 🙏 |
Beta Was this translation helpful? Give feedback.
-
Amundsen (second link) is open-source, perhaps someone can try it out in a real-world environment and report back (so I don't have to be the guinea pig) :)
|
Beta Was this translation helpful? Give feedback.
-
@jtcohen6, any news about doing it in DBT directly? |
Beta Was this translation helpful? Give feedback.
-
Enterprise compliance for columnar lineage is causing heartache for my corp as we adopt DBT; this is a well desired feature that would provide exponential value to us. Native DBT functionality in the capacity of DBTerd would be fantastic. We are working on creating a templated repo that would be used by hundreds of engineers across the corporation, all of which would need to support columnar lineage. |
Beta Was this translation helpful? Give feedback.
-
Hi! I'd love to read if there has been any update in this front. As I understand sqlmesh can do column lineage. Does anyone know how they achieved it? Is it something that could be translated somehow to dbt? |
Beta Was this translation helpful? Give feedback.
-
Would ColumnInfo be the right place to try to put this in the manifest? Or am I barking up the wrong tree? Would it be helpful if we could get something like {
"nodes": {
"model.my_dbt.child_model": {
"database": "dev_vince",
"schema": "my_schema",
"name": "child_model",
"resource_type": "model",
"package_name": "mydbt",
"columns": {
"id": {
"name": "id",
"description": "my_pk",
"depends_on": []
},
"source_column": {
"name": "source_column",
"description": "some column passed through from a source table",
"depends_on": [
{"node": "model.my_dbt.parent_source", "column": "source_column"}
]
},
"model_column": {
"name": "model_column",
"description": "some column passed through from a model",
"depends_on": [
{"node": "model.my_dbt.parent_model", "column": "model_column"}
]
},
"surrogate_key": {
"name": "surrogate_key",
"description": "this is a derived surrogate key from source_column/model_column",
"depends_on": [
{"node": "source.my_dbt.source_schema.parent_source", "column": "source_column"},
{"node": "model.my_dbt.parent_model", "column": "model_column"}
]
}
}
}
}
} |
Beta Was this translation helpful? Give feedback.
-
Is this on roadmap ? SQLMesh is able to use dbt project as source and generate column level lineage. Since that is open source, could that be used to bring the column level lineage to DBT ? |
Beta Was this translation helpful? Give feedback.
-
The underlying library, sqlglot makes this very easy. |
Beta Was this translation helpful? Give feedback.
-
After the great announcements at coalesce, is this on roadmap ? |
Beta Was this translation helpful? Give feedback.
-
Hello friends, is there any updates on this one? |
Beta Was this translation helpful? Give feedback.
-
Is progress still being made on this? |
Beta Was this translation helpful? Give feedback.
-
If anyone is interested into building something for BigQuery, this parser will give you full column lineage for your queries, it covers pretty much the whole BigQuery syntax! https://github.com/borjavb/bq-lineage-tool |
Beta Was this translation helpful? Give feedback.
-
It looks like dbt decided to make column level lineage for the cloud only. That's sad. |
Beta Was this translation helpful? Give feedback.
-
Is anyone surprised?
…On Wed, Feb 14, 2024 at 10:57 AM indy-jonesy ***@***.***> wrote:
Sad indeed.
—
Reply to this email directly, view it on GitHub
<#4458 (reply in thread)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AS7LO2FSVVAFALMJGAKQ4DTYTTNFLAVCNFSM5JWKHCA2U5DIOJSWCZC7NNSXTOKENFZWG5LTONUW63SDN5WW2ZLOOQ5TQNBWG43TGMQ>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
Beta Was this translation helpful? Give feedback.
-
DBT is just too expensive - it's competitors who are building better
features are cheaper - see Paradime.io
…On Wed, Feb 14, 2024 at 3:07 PM Carl Vander ***@***.***> wrote:
I don't understand the disappointment.
Most of us are working for companies that needs to earn money to keep on
living. DBT is no exception. Why on earth should they be giving away the
product for free?
Of course we can go build something new and free, but it will take a long
time, and eventuelly we will also have to pay our bills.
—
Reply to this email directly, view it on GitHub
<#4458 (reply in thread)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AS7LO2ECHRVVSQ2UNJLUCHDYTUKRXAVCNFSM5JWKHCA2U5DIOJSWCZC7NNSXTOKENFZWG5LTONUW63SDN5WW2ZLOOQ5TQNBXGE3DGOI>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
Beta Was this translation helpful? Give feedback.
-
Is there any plan to release this for dbt core? |
Beta Was this translation helpful? Give feedback.
-
Would it be accepted if community adds this feature to dbt core? im happy to help. |
Beta Was this translation helpful? Give feedback.
-
We just released an open-source utility for dbt column-level lineage. This was primarily developed to provide a programmatic interface to dbt column level lineage, enabling further development and use, such as creating automated tools for tagging sensitive column data. But you can also visualized the output json using tools such as jsoncrack.com Existing column-level lineage tools, like Atlan, dbt Cloud, SQLMesh, and Turntable, lack this programmatic capability and also face challenges such as subscription fees, indexing delays, complexity, or concerns about transmitting organizational code/data to vendor servers, which hinder their wider adoption. Hope it helps. |
Beta Was this translation helpful? Give feedback.
-
Describe the feature
Currently have table/view level lineage captured, can this be extended to field/column based?
Additional context
Not specific to a database, applies to the product.
Who will this benefit?
There could be multiple renames to a field and with multiple joins the complexity to track field level changes and source of a particular column becomes difficult. From an Ops perspective it always good to know where a field is coming from to quickly solve data issues.
Beta Was this translation helpful? Give feedback.
All reactions