docs: Add documentation on working with SQL in Kedro #4706
Conversation
- Outline key concepts and assumptions for SQL usage
- Describe approaches for integrating SQL, including Pandas SQL, Spark-JDBC, and Ibis
- Highlight advantages and limitations of each SQL approach
- Include recommended Ibis development workflow for building scalable pipelines
- Discuss Kedro's limitations and when to consider SQL-first alternatives
Not going to review this but this is so cool! ❤️
Rendered result: https://kedro--4706.org.readthedocs.build/en/4706/integrations/working_with_sql.html
I finally reviewed this! Thanks a lot for putting it together @datajoely
My main high level comments are:
- Having this in our docs is much better than not having anything, contains lots of useful guidance
- However, the bullet-point style is not very consistent with other parts of the Kedro docs (maybe something @stichbury can help us with if/when she has time)
- Some parts read as very definitive or give the impression that certain workflows will never change, which I'd like to soften somehow
- I would move the "Recommended Ibis development workflow" to a separate page (somewhat clashing with #4560 by @Stephanieewelu)
From a logistics point of view, better to rebase or recreate this PR on top of develop, which contains the most current MkDocs-based docs
> ### 2. CRUD Scope
> - Kedro handles the parts of CRUD which do not require some concept of "state."
Probably worth spelling out CRUD, since it might be a term that not all Kedro users are familiar with
> | Approach | Status | Setup Complexity |
> |----------------|--------|--------------------|
> | **Pandas SQL** | Legacy | Minimal |
Worth mentioning Polars SQL? https://docs.pola.rs/user-guide/sql/intro/
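For reference, Polars' in-process SQL interface looks roughly like this (the DataFrame contents below are made up for illustration):

```python
import polars as pl

orders = pl.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 5.0, 7.5]})

# Register the DataFrame and query it with SQL, all in-process.
ctx = pl.SQLContext()
ctx.register("orders", orders)
totals = ctx.execute(
    "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id",
    eager=True,
)
print(totals)
```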
> ### 2. Spark-JDBC (legacy)
Wondering if it's a bit too bold calling Spark "legacy"
I think I would combine the first section with some of these sections. For example, you said the two ways of working with data in SQL are:
- You can treat a SQL database purely as a data sink/source, or
- Leverage the database engine's compute power whenever possible.
I think a nice (and easier to grasp) structure could be explaining when you want to do each, and then showing the workflow/examples. For instance, maybe you want to do ML things that can't be expressed in SQL/done efficiently on the database, in which case you might read data in with pandas or Polars using the datasets you mention.
For the second section, I think it could be worth mentioning that a lot of the modern databases double as execution engines. For example, if you can process code on BigQuery or Snowflake, it will almost certainly be more efficient and effective than doing the bad thing of loading it into pandas to transform. I guess here is also a good time to expound upon the benefits of in-memory transformation with DuckDB.
I would like to come out with a clear understanding of when I should use Ibis vs. when I should use pandas/Polars/Spark (I think we can group this into two options to make it simple).
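For illustration, a rough sketch of those two options (the connection strings, table, and column names below are made up):

```python
import ibis
import pandas as pd
from sqlalchemy import create_engine

# Option 1: treat the database as a plain source/sink and do the work in pandas.
# Sensible when the transform (e.g. ML feature engineering) can't be expressed well in SQL.
engine = create_engine("sqlite:///example.db")
orders = pd.read_sql("SELECT customer_id, amount FROM orders", engine)
features = orders.groupby("customer_id", as_index=False)["amount"].sum()

# Option 2: push the compute down to the database engine via Ibis.
# The expression compiles to SQL and runs where the data lives; only the result comes back.
con = ibis.duckdb.connect("example.duckdb")
orders_t = con.table("orders")
totals = orders_t.group_by("customer_id").aggregate(total=orders_t.amount.sum())
features_df = totals.execute()  # executed by DuckDB, returned as a pandas DataFrame
```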
> ```{tip}
> Please also check out our [blog on building scalable data pipelines with Kedro and Ibis](https://kedro.org/blog/building-scalable-data-pipelines-with-kedro-and-ibis), [SQL data processing in Kedro ML pipelines](https://kedro.org/blog/sql-data-processing-in-kedro-ml-pipelines), and our [PyData London talk](https://www.youtube.com/watch?v=ffDHdtz_vKc).
> ```
This whole {tip} calls for a separate page in our docs about Ibis
> Consult the [Ibis support matrix](https://ibis-project.org/backends/support/matrix) to verify that needed SQL functions are available. You can still write raw SQL for missing features.
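To make the raw-SQL escape hatch concrete, here is a minimal sketch (the DuckDB backend, table, and column names are made up for illustration):

```python
import ibis

# In-memory DuckDB backend with a small illustrative table.
con = ibis.duckdb.connect()
con.create_table("events", ibis.memtable({"id": [1, 2, 3], "payload": ["a", "b", "c"]}))

# Fall back to raw SQL for anything the Ibis API doesn't cover,
# then keep composing on the result as a normal Ibis expression.
raw = con.sql("SELECT id, upper(payload) AS payload FROM events")
result = raw.filter(raw.id > 1).execute()  # returns a pandas DataFrame
print(result)
```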
> ### Recommended Ibis development workflow
I think this whole section should be a separate page
I like this section, but I wonder if it should be a blog post? I think it's a sensible way of working, but I think it is also a pretty specific workflow.
> More broadly, there are some wider limitations to be aware of when working with Kedro & SQL:
>
> - **No conditional branching**
>   Kedro does not support conditional nodes, making UPSERT logic difficult. Kedro favors reproducibility with append & overwrite modes over in-place updates/deletes.
This confuses me. How is "overwrite" so different from "update"? Both "break lineage" (see table at the beginning of the page)
Sorry for the massive delay!
> | Operation | Supported | Notes |
> |-----------|------------|---------------------------------------------------------|
> | Create | ✅ | Writing new tables (or views) |
> | Read | ✅ | Loading data into DataFrames or Ibis expressions |
> | Update | ❌ | Requires custom SQL outside Kedro's DAG; breaks lineage |
We should support Update (Upsert), right? Or what do you mean by breaks lineage?
> This page outlines the various approaches and recommended practices when building a Kedro pipeline with SQL databases.
> ## Key concepts and assumptions
I wonder who the audience for this page is. The key concepts and assumptions starts out pretty advanced IMO; I think there are a lot of people who still don't think about the issue of pulling data out of the database, transforming it, and writing back, for instance.
> - JVM + Spark cluster required
> - Overkill for smaller workloads
Despite predicate pushdown, it's still a lot of data transfer!
> - **Mechanism**
>   Spark's DataFrame API over JDBC. Leverages predicate pushdown so filters/projections occur in-database.
Again, there's a question of who this is written for. It does feel more like notes at this time; I think predicate pushdown is definitely something that requires at least some level of knowledge to know of (which is not all Kedro users)
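For context, reading over Spark-JDBC with pushdown looks roughly like this (the JDBC URL, credentials, and column names are illustrative, and the appropriate JDBC driver jar has to be on Spark's classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-example").getOrCreate()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/warehouse")
    .option("dbtable", "public.orders")
    .option("user", "reader")
    .option("password", "secret")
    .load()
)

# Filters and projections applied here can be pushed down to the database,
# so only matching rows/columns are transferred over JDBC - but anything
# beyond that still means moving the data into the Spark cluster.
recent = orders.filter(orders.order_date >= "2024-01-01").select("order_id", "amount")
recent.show()
```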
> Kedro does not support conditional nodes, making UPSERT logic difficult. Kedro favors reproducibility with append & overwrite modes over in-place updates/deletes.
> - **When to seek SQL-first alternatives**
>   If your workflow is entirely SQL, tools like [dbt](https://github.com/dbt-labs/dbt-core) or [SQLMesh](https://github.com/TobikoData/sqlmesh) offer richer lineage and transformation management than Kedro's Python-centric approach.
I don't even think it's about lineage and transformation management as much as it is simply about whether you're using SQL vs. Python.
I think it is good to point out that Kedro shouldn't be your orchestrator for all SQL transforms. On the flip side, if you want to do DE and DS in a unified Python codebase, that can be a good point to use Kedro, and hopefully the SQL recommendations help here.
Description
Development notes
N/A
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a `Signed-off-by` line in the commit message. See our wiki for guidance.
If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
`RELEASE.md` file