Column level lineage #4458

sou-joshi · 2021-04-05T17:38:06Z

sou-joshi
Apr 5, 2021

Describe the feature

Table/view-level lineage is currently captured. Can this be extended to the field/column level?

Additional context

Not specific to a database, applies to the product.

Who will this benefit?

There could be multiple renames to a field, and with multiple joins, tracking field-level changes and the source of a particular column becomes difficult. From an Ops perspective, it is always good to know where a field is coming from to quickly solve data issues.

jtcohen6 · 2021-04-08T18:01:02Z

jtcohen6
Apr 8, 2021
Maintainer

@dataexpertz-blr Thanks for opening, I'm surprised there wasn't already an issue for this :) It's something we're hearing and talking about a lot these days.

Mechanisms

I view column-level lineage as existing in two orders of complexity:

Extending existing constructs. Today, dbt developers have to duplicate a lot of resource properties (descriptions, tags, meta, tests) across models, even when model Y is just select * from model X. Over in Doc (and potentially, Test) Inheritance #2995, and dbt doc blocks #1158 before then, we've been discussing ways that YAML anchors make a version of this possible today, and how it could be better in the future (cross-file anchors, souped-up docs blocks, hierarchical properties-as-config).
Massive new capabilities. To capture column-level lineage for real for real, we'd need a validating SQL grammar—same as we would, incidentally, for a decent linter / auto-formatter (Automatic formatter for SQL #2356). In a world where we had this, and built it into dbt, we'd also have an AST representation of every column name, from relation, and SQL function. This is definitely on our minds; it's still a ways away.

Use cases

As with any compelling feature, column-level lineage feels both immensely valuable and a bit vague. If dbt could produce an EXPLAIN-style plan, of every single SQL function performed to produce a single column, that would be very cool, and also tricky to read and reason about as a human being.

So I do find it useful to think concretely about the kinds of things we'd hope to enable here:

Property (tag/meta/etc) inheritance. E.g. If a source PII/PHI column is the indirect input to a column in a downstream model, being able to mark the latter model as sensitive.
Saving code (+ time). Extending what I described above, if a column has not been transformed from model X to model Y—no renames, no aggregations—dbt could natively inherit its properties, such as description + tests.
- Or, better yet, dbt would understand that the column has been tested upstream, has not changed, and so does not need to run the same tests again.
"dbt column advisor." If dbt has a full picture of how a column is produced—which input columns, which transformations—it could flag when there are potentially duplicative columns across models, and help avoid the repeating of business logic.
Slim CI to the max: If you've only changed one column across a few models, rather than running the changed models and all their children, you'd only need to run + test downstream models that use the affected column.
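
To make the property-inheritance use case above concrete, here is a minimal Python sketch. It assumes column-level edges are already available as a plain dict; the `EDGES` data and the `propagate_tags` helper are hypothetical and not part of dbt. It only shows how a tag such as `pii` could flow downstream once lineage exists.

```python
from collections import deque

# Hypothetical column-level edges: upstream (model, column) -> downstream columns.
EDGES = {
    ("raw_users", "email"): [("stg_users", "email")],
    ("stg_users", "email"): [("dim_users", "email_hash")],
    ("raw_users", "id"): [("stg_users", "user_id")],
    ("stg_users", "user_id"): [("dim_users", "user_id")],
}


def propagate_tags(edges, source_tags):
    """Return the tags of every column reachable from a tagged source column."""
    tags = {column: set(t) for column, t in source_tags.items()}
    queue = deque(source_tags)
    while queue:
        column = queue.popleft()
        for child in edges.get(column, []):
            before = len(tags.setdefault(child, set()))
            tags[child] |= tags[column]
            # Only revisit a column when it gained a new tag, so cycles terminate.
            if len(tags[child]) > before:
                queue.append(child)
    return tags


inherited = propagate_tags(EDGES, {("raw_users", "email"): {"pii"}})
for column, column_tags in sorted(inherited.items()):
    print(column, sorted(column_tags))
```

In this toy graph the `pii` tag reaches `dim_users.email_hash` even though the column was renamed, while the `user_id` columns stay untagged.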

I'm curious to hear what other things come to mind!

0 replies

bashyroger · 2021-04-26T12:47:19Z

bashyroger
Apr 26, 2021

+4 for the use case you've mentioned @jtcohen6 , I'd like to add the following, IMO important use case:

A variant of Slim CI to the max: being able to do an impact analysis at development time when you're changing a source / model's column name. After all, the fact that a column changes does not necessarily mean that the column is also used in any dependent models. For that, you really need column-level lineage.

Instead of building this yourself, you could also think about integrating DBT with the only data-lineage-focused SaaS product I have heard of: https://getmanta.com/integrations/

0 replies

tufanrakshit · 2021-07-22T13:10:06Z

tufanrakshit
Jul 22, 2021

We are planning to use DBT Cloud for our project, and this table-by-table lineage is really a killer feature that we would like to have, as it would make data lineage and debugging much, much easier.

0 replies

devstein · 2021-09-23T00:22:40Z

devstein
Sep 23, 2021

Hi 👋 , does anyone know how Datafold + DBT claims to provide this?

0 replies

jaypeedevlin · 2021-09-23T00:26:22Z

jaypeedevlin
Sep 23, 2021

Datafold use their own lineage capabilities to do this — while they do read in your dbt project, the column level lineage is part of their platform (it's an awesome feature though!)

0 replies

MarkMacArdle · 2021-12-09T12:16:51Z

MarkMacArdle
Dec 9, 2021

Monzo (a UK startup bank) have written about how they created column lineage using BigQuery audit logs and parsing the compiled dbt queries with ZetaSQL. Hadn't heard about ZetaSQL before but it's an open source project from Google for analysing SQL.

That solution is BigQuery specific but still interesting to see an implemented approach to this problem.

1 reply

after-ephemera Aug 4, 2022

Here is GCP's post describing a similar approach.

christopherridley · 2022-01-14T23:30:14Z

christopherridley
Jan 14, 2022

Is there a huge benefit here where you could define your docs/descriptions once at the source table level and then have that trickle down to all columns that reference the original source column? Still wrestling with how to populate descriptions for all columns effectively and efficiently.

1 reply

KayakinKoder Jul 2, 2022

This, in my opinion, is a huge unmet need. Documenting data has become a huge time suck, and the lack of description propagation is a reason why.

jtalmi · 2022-04-25T20:23:33Z

jtalmi
Apr 25, 2022

one new application i thought of: dbt build could only block downstream nodes that are directly affected by a test failure. let's say a random column column1 fails in model A, and model A -> model B. B uses column2 from A but not column1. if a test on column1 fails, let's say a not null test, it shouldn't block model B since model B doesn't use column1.
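
A rough Python sketch of that idea, assuming we already know which upstream columns each downstream model reads. The `COLUMNS_USED` mapping, the extra `model_c`, and the `models_to_block` helper are hypothetical; this is not how dbt build works today, and it only looks at direct children.

```python
# Hypothetical input: the upstream (model, column) pairs each downstream model selects.
COLUMNS_USED = {
    "model_b": {("model_a", "column2")},
    "model_c": {("model_a", "column1"), ("model_a", "column2")},
}


def models_to_block(failed_model, failed_column, columns_used):
    """Return the downstream models that read the column whose test failed."""
    failed = (failed_model, failed_column)
    return sorted(model for model, used in columns_used.items() if failed in used)


blocked = models_to_block("model_a", "column1", COLUMNS_USED)
print(blocked)
```

With model-level lineage both children would be skipped; with column-level information only the model that actually reads `column1` is held back.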

0 replies

bashyroger · 2022-05-18T12:27:30Z

bashyroger
May 18, 2022

It looks like the following company is working on this functionality: https://www.hivedive.io/
Has anyone test-driven this already?

0 replies

stephanclaus · 2022-05-21T11:03:01Z

stephanclaus
May 21, 2022

I think there are countless vendors out there specializing in a certain database vendor (e.g. for Snowflake, column-level lineage is easier, as they already provide great metadata around accessed objects at the column level).

Another very concrete use case I wanted to add: optimizing your tracking plan (and thus saving compute).

We are using the open source version of Snowplow and are lacking a unified UI for frontend + backend + analytics to align on. That led to situations where very similar events were added but nothing was deprecated, as "it could break things downstream". Being able to analyse which events from the raw source tables are used in downstream models would make that way more convenient.

0 replies

mertbakir · 2022-05-28T21:52:52Z

mertbakir
May 28, 2022

I was expecting dbt to generate (inherit) column descriptions for downstream tables if they're already defined on the source table. Then I found this discussion and realised I also need column-level lineage, because it takes so much time to understand the effect of changing a column.

0 replies

vergenzt · 2022-07-15T21:05:56Z

vergenzt
Jul 15, 2022

I address some of these limitations (specifically the maintenance ones, i.e. the ability to "forward" unchanged column descriptions and tests from upstream to downstream models -- this doesn't expose that lineage anywhere) with some dbt macros on top of the scripts I defined for #5093 (comment). It's a bit convoluted so not going to go into full detail right now unless folks ask.

Summary: After I had a solution for #5093 via SQL comments & a regex extraction script, I started using a modified version of dbt_utils.star (I called my macro ref_columns_from) to pull from graph.nodes instead of the database, so I can get dbt's metadata about that model and its columns instead of just what the database knows about it from get_columns_in_relation. Then, since a single dbt compile pass can only propagate columns one level, I iterate through "dbt compile" → extract column definitions from compiled SQL → "dbt compile" → ..., until the generated "schema.yml" file (column doc & test definition) stops changing.
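
The iterate-until-stable loop described above can be sketched in Python as follows. This is not the actual macro or script: `one_pass` is a hypothetical stand-in for one "dbt compile" plus extraction step, and the toy `PARENT` mapping assumes every column passes through unchanged.

```python
# Hypothetical stand-in for one "dbt compile" + extraction pass: each model
# copies missing column descriptions from its direct parent only.
PARENT = {"stg_orders": "raw_orders", "fct_orders": "stg_orders"}


def one_pass(docs):
    """Propagate descriptions exactly one level, reading only the previous state."""
    updated = {model: dict(columns) for model, columns in docs.items()}
    for child, parent in PARENT.items():
        for column, description in docs.get(parent, {}).items():
            updated.setdefault(child, {}).setdefault(column, description)
    return updated


def propagate_until_stable(docs, max_passes=20):
    """Repeat one_pass until the generated docs stop changing."""
    for passes in range(1, max_passes + 1):
        new_docs = one_pass(docs)
        if new_docs == docs:
            return docs, passes
        docs = new_docs
    raise RuntimeError("descriptions did not stabilise")


docs, passes = propagate_until_stable({"raw_orders": {"order_id": "Primary key"}})
print(passes, docs["fct_orders"])
```

A chain two models deep needs two productive passes plus one final pass that detects no change, which is why the loop compares whole snapshots rather than counting levels.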

1 reply

z3z1ma Mar 29, 2023

https://github.com/z3z1ma/dbt-osmosis

This tool does it already with a very easy CLI @vergenzt

YohanGrember · 2022-08-10T15:27:21Z

YohanGrember
Aug 10, 2022

I'm currently ramping up on an existing DBT stack, and I just wanted to stress how column-level lineage would help me do so more efficiently.

Without it, figuring out the upstream flow of a transformed table field takes me forever, and I dream of being able to see it in one click! This would be a game-changer for my productivity and I hope we'll move forward on this topic 🙏

0 replies

ijoseph · 2022-08-25T18:21:19Z

ijoseph
Aug 25, 2022

Amundsen (second link) is open-source, perhaps someone can try it out in a real-world environment and report back (so I don't have to be the guinea pig) :)

We have added native lineage support in Amundsen so you can ingest lineage metadata (both at table and column level) straight into the graph backend.

1 reply

after-ephemera Aug 26, 2022

I've played with Amundsen and it's great for bringing lineage in, but I think this issue was specifically focused on capturing column-level lineage within dbt, which is a bit of a different beast.

axelborja · 2022-11-08T14:58:02Z

axelborja
Nov 8, 2022

@jtcohen6, any news about doing it in DBT directly?

0 replies

JoePollo · 2023-06-20T12:44:48Z

JoePollo
Jun 20, 2023

Enterprise compliance for columnar lineage is causing heartache for my corp as we adopt DBT; this is a much-desired feature that would provide exponential value to us. Native DBT functionality in the capacity of DBTerd would be fantastic. We are working on creating a templated repo that would be used by hundreds of engineers across the corporation, all of whom would need to support columnar lineage.

0 replies

nicoteiza · 2023-06-28T15:21:28Z

nicoteiza
Jun 28, 2023

Hi! I'd love to hear if there has been any update on this front. As I understand it, sqlmesh can do column lineage. Does anyone know how they achieved it? Is it something that could be translated somehow to dbt?

2 replies

indy-jonesy Jul 17, 2023

The creator of sqlmesh also created sqlglot.
An independent dev team was able to build a proprietary dbt lineage solution here using sqlglot as a backbone. It would be a matter of leveraging sqlglot as a compiler and incorporating it into dbt core.

ismailsimsek May 8, 2024

@indy-jonesy @nicoteiza it seems like dbt's proprietary feature is also based on sqlglot: link1, link2

VDFaller · 2023-07-27T15:48:53Z

VDFaller
Jul 27, 2023

Would ColumnInfo be the right place to try to put this in the manifest? Or am I barking up the wrong tree? Would it be helpful if we could get something like

{
    "nodes": {
        "model.my_dbt.child_model": {
            "database": "dev_vince",
            "schema": "my_schema",
            "name": "child_model",
            "resource_type": "model",
            "package_name": "my_dbt",
            "columns": {
                "id": {
                    "name": "id",
                    "description": "my_pk",
                    "depends_on": []
                },
                "source_column": {
                    "name": "source_column",
                    "description": "some column passed through from a source table",
                    "depends_on": [
                        {"node": "source.my_dbt.source_schema.parent_source", "column": "source_column"}
                    ]
                },
                "model_column": {
                    "name": "model_column",
                    "description": "some column passed through from a model",
                    "depends_on": [
                        {"node": "model.my_dbt.parent_model", "column": "model_column"}
                    ]
                },
                "surrogate_key": {
                    "name": "surrogate_key",
                    "description": "this is a derived surrogate key from source_column/model_column",
                    "depends_on": [
                        {"node": "source.my_dbt.source_schema.parent_source", "column": "source_column"},
                        {"node": "model.my_dbt.parent_model", "column": "model_column"}
                    ]
                }
            }
        }
    }
}
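
If the manifest carried a per-column `depends_on` list like the one proposed above, tracing a column back to its origins could look roughly like this Python sketch. Note that `depends_on` on columns is only the shape suggested in this comment, not an existing manifest field, and `root_columns` plus the trimmed-down `MANIFEST` are hypothetical.

```python
import json

# Trimmed-down manifest in the proposed shape (per-column depends_on is hypothetical).
MANIFEST = json.loads("""
{"nodes": {
  "model.my_dbt.child_model": {"columns": {
    "surrogate_key": {"depends_on": [
      {"node": "model.my_dbt.parent_model", "column": "model_column"}]}}},
  "model.my_dbt.parent_model": {"columns": {
    "model_column": {"depends_on": [
      {"node": "source.my_dbt.source_schema.parent_source", "column": "source_column"}]}}}
}}
""")


def root_columns(manifest, node, column, seen=None):
    """Follow depends_on links upstream; return columns with no recorded parents."""
    seen = set() if seen is None else seen
    if (node, column) in seen:  # guard against accidental cycles
        return set()
    seen.add((node, column))
    info = manifest["nodes"].get(node, {}).get("columns", {}).get(column, {})
    parents = info.get("depends_on", [])
    if not parents:
        return {(node, column)}
    roots = set()
    for parent in parents:
        roots |= root_columns(manifest, parent["node"], parent["column"], seen)
    return roots


roots = root_columns(MANIFEST, "model.my_dbt.child_model", "surrogate_key")
print(roots)
```

Anything missing from `nodes` (here, the source) is treated as an origin, which keeps the walk simple at the cost of not distinguishing real sources from incomplete metadata.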

0 replies

shishircc · 2023-09-16T14:32:50Z

shishircc
Sep 16, 2023

Is this on the roadmap? SQLMesh is able to use a dbt project as a source and generate column-level lineage. Since that is open source, could it be used to bring column-level lineage to DBT?

0 replies

shishircc · 2023-09-22T14:06:46Z

shishircc
Sep 22, 2023

The underlying library, sqlglot, makes this very easy.
E.g. see the simple 3 lines of Python code creating a column-level lineage graph here: https://twitter.com/Captaintobs/status/1619185992111624197

3 replies

robomill Jan 9, 2024

It's definitely more complicated than this depending on the complexity of the sql. But this is essentially how datahub generates lineage:

https://github.com/datahub-project/datahub/blob/8415fc214b1b9d0f99eafbffc01f3e46043f54dc/metadata-ingestion/src/datahub/utilities/sqlglot_lineage.py#L178

ismailsimsek May 8, 2024

@robomill it seems like dbt's proprietary feature is also based on sqlglot: link1, link2

robomill May 9, 2024

@ismailsimsek It's kinda weird they didn't implement it themselves for enterprise.

They already have all the ingredients to write it themselves built into DBT. They already have the sqlparse library that generates an AST that they could use to generate the inputs and outputs of a model. They already have code that can introspect the schema of tables as well. It's easy enough with a good engineer to reverse engineer what SQLglot have done.

Seems kinda lazy on their part and less of a reason to exclude from dbt-core if they are using a third party open source library

shishircc · 2023-11-07T04:59:29Z

shishircc
Nov 7, 2023

After the great announcements at Coalesce, is this on the roadmap?

3 replies

matfior-finn Nov 7, 2023

While we wait, I suggest checking https://www.turntable.so/, it's quite nifty and really fast.

shishircc Nov 12, 2023

Turntable is great. My main concern is that it requires developer level access to get the lineage in VS Code. All people who benefit from lineage may not have developer level access.

ianrtracey May 9, 2024

@shishircc Ian from Turntable here. What do you mean by developer level access? We're currently building a cloud version if you'd like to learn more

matteoannotell · 2023-11-23T17:28:48Z

matteoannotell
Nov 23, 2023

Hello friends, are there any updates on this one?

1 reply

matfior-finn Nov 24, 2023

I would like to know as well. I tried asking during Coalesce in London but I got some evasive answers.

AlexThomas90210 · 2023-11-28T20:28:25Z

AlexThomas90210
Nov 28, 2023

Is progress still being made on this?

3 replies

rrivera-ut Dec 8, 2023

I'm going to take a stab and say probably not

matfior-finn Dec 12, 2023

would be cool to get some sort of official answer about it.

ViniciusRaphael Jan 5, 2024

That's true. Some official answer about status would be great.

borjavb · 2024-01-19T17:02:07Z

borjavb
Jan 19, 2024

If anyone is interested in building something for BigQuery, this parser will give you full column lineage for your queries. It covers pretty much the whole BigQuery syntax! https://github.com/borjavb/bq-lineage-tool

0 replies

mertbakir · 2024-02-14T10:53:02Z

mertbakir
Feb 14, 2024

It looks like dbt decided to make column level lineage for the cloud only. That's sad.

9 replies

indy-jonesy Feb 14, 2024

Sad indeed.

odikia Jan 11, 2025

Are they trying to push developers toward sqlmesh? Cause locking the critical CLL feature behind a MASSIVE paywall is how you do it...

ianrtracey Jan 11, 2025

For anyone looking for CLL, we've built a free extension for VS Code

https://marketplace.visualstudio.com/items?itemName=turntable.turntable-for-dbt-core

odikia Jan 11, 2025

Hey @ianrtracey , I do enjoy turntable, but I'm wondering if enterprise security would be happy with data transfer to outside servers for information parsing? Some of the information in most stacks regarding database objects is proprietary, and if they're not happy with sending data to OpenAI via API (leading to current trend in local LLM deployment), I don't think they'd be okay with sending data to Turntable's servers for LLM interaction. It's not entirely clear how security teams at an enterprise can evaluate the transfer of data at present with Turntable given the tool's present documentation (that I am aware of, at least).

ianrtracey Jan 11, 2025

@odikia everything in Turntable runs locally on your machine (besides LLM auto-documentation generation). This includes column-level lineage, the docs GUI, query preview, and language server features (unlike dbt-power user which I believe does send metadata to servers for column-level lineage).

Unless you click "auto generate docs", nothing ever leaves your device. You can also disable this feature (as well as all server communication besides auth) in our settings panel. We built this setting recently to support some folks at Anduril (which have intense info sec requirements).

We're also more than happy to provide source code for teams that want that extra peace of mind as well (and we're also considering open sourcing our extension this year).

rrivera-ut · 2024-02-14T16:05:21Z

rrivera-ut
Feb 14, 2024

Is anyone surprised?

2 replies

calleo Feb 14, 2024

I don't understand the disappointment.

Most of us are working for companies that need to earn money to keep on living. DBT is no exception. Why on earth should they be giving away the product for free?

Of course we can go build something new and free, but it will take a long time, and eventually we will also have to pay our bills.

KayakinKoder Feb 15, 2024

@calleo I'm a bit confused. The Team plan is $100/month per developer, and as of right now it doesn't include column level lineage. You consider that pricing to be free?

Let's all just hope that the folks at dbt realize that making column-level lineage available to SMBs on the Team plan would add a lot of value/use cases that could justify teams purchasing plans.

rrivera-ut · 2024-02-14T20:18:16Z

rrivera-ut
Feb 14, 2024

DBT is just too expensive - its competitors, who are building better features, are cheaper - see Paradime.io

0 replies

jyotidhiman0610 · 2024-05-07T04:06:42Z

jyotidhiman0610
May 7, 2024

Is there any plan to release this for dbt core?

1 reply

1cadumagalhaes May 7, 2024

Pretty sure there isn't

ismailsimsek · 2024-05-08T17:45:21Z

ismailsimsek
May 8, 2024

Would it be accepted if the community added this feature to dbt core? I'm happy to help.

3 replies

1cadumagalhaes May 8, 2024

I was thinking about that this week, maybe creating a dbt package that adds column level lineage would work.
I'm not sure if this could be added to the core itself.
It should be possible since both dbt (on cloud) and sqlmesh have column level lineage

ismailsimsek May 8, 2024

+1
Also, it looks like both (dbt on cloud and sqlmesh) are using open-source sqlglot for column-level lineage: link1, link2

Side note: AFAIU, using sqlglot it's also possible to get rid of the {{ref('model_a')}} syntax and use pure SQL (table names) instead. That's how sqlmesh finds dependencies.

z3z1ma May 9, 2024

A few thoughts. At this point just use sqlmesh. Really. Why jump through these hoops? Dbt implemented CLL (and the backing UI) as a paid feature, which isn't inherently wrong. They are a business at the end of the day. It would not benefit them to pursue this in the oss product after releasing it as a differentiator in the paid product. Either it's worse and people are unhappy and they must maintain it, or they devalue their cloud product's value proposition. So probably just make another pypi package if you want to attempt it.

And doing away with refs is shortsighted, if only because people override the ref macro constantly and it's used in so many places. Jinja and dbt are quite married at this point if you dig under the hood. Furthermore, people depend on dbt behaving consistently and not discovering new "refs" because of some new parsing.

Sqlglot is not just 'some oss package' with a one way relationship to sqlmesh but is in fact built and maintained by the creators of sqlmesh. So their usage of it is in the blood of the software and not bolted on.

wenwu35 · 2024-10-15T04:51:02Z

wenwu35
Oct 15, 2024

We just released an open-source utility for dbt column-level lineage.

This was primarily developed to provide a programmatic interface to dbt column-level lineage, enabling further development and use, such as creating automated tools for tagging sensitive column data. But you can also visualize the output JSON using tools such as jsoncrack.com

Existing column-level lineage tools, like Atlan, dbt Cloud, SQLMesh, and Turntable, lack this programmatic capability and also face challenges such as subscription fees, indexing delays, complexity, or concerns about transmitting organizational code/data to vendor servers, which hinder their wider adoption.

Hope it helps.

2 replies

ismailsimsek Oct 15, 2024

@wenwu35 Would it be possible to add this feature to opendbt? opendbt is dbt-core plus additional features that are not in dbt-core, and this would be a great addition. Whoever is interested, contributions are welcome.

I am in the process of adding SQL formatting to it.

1cadumagalhaes Oct 15, 2024

Hey, this looks like an interesting project. Some of the features that are there and work with airflow are in the astronomer cosmos project, maybe you could take a look

Replies: 33 comments · 34 replies