docs: Add documentation on working with SQL in Kedro #4706
Conversation
- Outline key concepts and assumptions for SQL usage
- Describe approaches for integrating SQL, including Pandas SQL, Spark-JDBC, and Ibis
- Highlight advantages and limitations of each SQL approach
- Include recommended Ibis development workflow for building scalable pipelines
- Discuss Kedro's limitations and when to consider SQL-first alternatives
Not going to review this but this is so cool! ❤️
Rendered result: https://kedro--4706.org.readthedocs.build/en/4706/integrations/working_with_sql.html
I finally reviewed this! Thanks a lot for putting it together @datajoely
My main high level comments are:
- Having this in our docs is much better than not having anything, contains lots of useful guidance
- However, the bullet-point style is not very consistent with other parts of the Kedro docs (maybe something @stichbury can help us with if/when she has time)
- Some parts read as very definitive or give the impression that certain workflows will never change, which I'd like to soften somehow
- I would move the "Recommended Ibis development workflow" to a separate page (somewhat clashing with #4560 by @Stephanieewelu)
From a logistics point of view, better to rebase or recreate this PR on top of develop, which contains the most current MkDocs-based docs
> ### 2. CRUD Scope
> - Kedro handles the parts of CRUD which do not require some concept of "state."
Probably worth spelling out CRUD, since it might be a term that not all Kedro users are familiar with
> | Approach | Status | Setup Complexity |
> |----------------|--------|--------------------|
> | **Pandas SQL** | Legacy | Minimal |
Worth mentioning Polars SQL? https://docs.pola.rs/user-guide/sql/intro/
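For reference, Polars' in-process SQL interface looks roughly like this (the DataFrame contents below are made up for illustration):

```python
import polars as pl

orders = pl.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 5.0, 7.5]})

# Register the DataFrame and query it with SQL, all in-process.
ctx = pl.SQLContext()
ctx.register("orders", orders)
totals = ctx.execute(
    "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id",
    eager=True,
)
print(totals)
```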
> ### 2. Spark-JDBC (legacy)
Wondering if it's a bit too bold calling Spark "legacy"
I think I would combine the first section with some of these sections. For example, you said the two ways of working with data in SQL are:
- You can treat a SQL database purely as a data sink/source, or
- Leverage the database engine's compute power whenever possible.
I think a nice (and easier to grasp) structure could be explaining when you want to do each, and then showing the workflow/examples. For instance, maybe you want to do ML things that can't be expressed in SQL/done efficiently on the database, in which case you might read data in with pandas or Polars using the datasets you mention.
For the second section, I think it could be worth mentioning that a lot of the modern databases double as execution engines. For example, if you can process code on BigQuery or Snowflake, it will almost certainly be more efficient and effective than doing the bad thing of loading it into pandas to transform. I guess here is also a good time to expound upon the benefits of in-memory transformation with DuckDB.
I would like to come out with a clear understanding of when I should use Ibis vs. when I should use pandas/Polars/Spark (I think we can group this into two options to make it simple).
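For illustration, a rough sketch of those two options (the connection strings, table, and column names below are made up):

```python
import ibis
import pandas as pd
from sqlalchemy import create_engine

# Option 1: treat the database as a plain source/sink and do the work in pandas.
# Sensible when the transform (e.g. ML feature engineering) can't be expressed well in SQL.
engine = create_engine("sqlite:///example.db")
orders = pd.read_sql("SELECT customer_id, amount FROM orders", engine)
features = orders.groupby("customer_id", as_index=False)["amount"].sum()

# Option 2: push the compute down to the database engine via Ibis.
# The expression compiles to SQL and runs where the data lives; only the result comes back.
con = ibis.duckdb.connect("example.duckdb")
orders_t = con.table("orders")
totals = orders_t.group_by("customer_id").aggregate(total=orders_t.amount.sum())
features_df = totals.execute()  # executed by DuckDB, returned as a pandas DataFrame
```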
> ```{tip}
> Please also check out our [blog on building scalable data pipelines with Kedro and Ibis](https://kedro.org/blog/building-scalable-data-pipelines-with-kedro-and-ibis), [SQL data processing in Kedro ML pipelines](https://kedro.org/blog/sql-data-processing-in-kedro-ml-pipelines), and our [PyData London talk](https://www.youtube.com/watch?v=ffDHdtz_vKc).
> ```
This whole {tip} calls for a separate page in our docs about Ibis
> Consult the [Ibis support matrix](https://ibis-project.org/backends/support/matrix) to verify that needed SQL functions are available. You can still write raw SQL for missing features.
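To make the raw-SQL escape hatch concrete, here is a minimal sketch (the DuckDB backend, table, and column names are made up for illustration):

```python
import ibis

# In-memory DuckDB backend with a small illustrative table.
con = ibis.duckdb.connect()
con.create_table("events", ibis.memtable({"id": [1, 2, 3], "payload": ["a", "b", "c"]}))

# Fall back to raw SQL for anything the Ibis API doesn't cover,
# then keep composing on the result as a normal Ibis expression.
raw = con.sql("SELECT id, upper(payload) AS payload FROM events")
result = raw.filter(raw.id > 1).execute()  # returns a pandas DataFrame
print(result)
```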
> ### Recommended Ibis development workflow
I think this whole section should be a separate page
I like this section, but I wonder if it should be a blog post? I think it's a sensible way of working, but I think it is also a pretty specific workflow.
> More broadly, there are some wider limitations to be aware of when working with Kedro & SQL:
>
> - **No conditional branching**
>   Kedro does not support conditional nodes, making UPSERT logic difficult. Kedro favors reproducibility with append & overwrite modes over in-place updates/deletes.
This confuses me. How is "overwrite" so different from "update"? Both "break lineage" (see table at the beginning of the page)
Sorry for the massive delay!
> | Operation | Supported | Notes |
> |-----------|------------|---------------------------------------------------------|
> | Create | ✅ | Writing new tables (or views) |
> | Read | ✅ | Loading data into DataFrames or Ibis expressions |
> | Update | ❌ | Requires custom SQL outside Kedro's DAG; breaks lineage |
We should support Update (Upsert), right? Or what do you mean by breaks lineage?
> This page outlines the various approaches and recommended practices when building a Kedro pipeline with SQL databases.
> ## Key concepts and assumptions
I wonder who the audience for this page is. The key concepts and assumptions starts out pretty advanced IMO; I think there are a lot of people who still don't think about the issue of pulling data out of the database, transforming it, and writing back, for instance.
> - JVM + Spark cluster required
> - Overkill for smaller workloads
Despite predicate pushdown, it's still a lot of data transfer!
> - **Mechanism**
>   Spark's DataFrame API over JDBC. Leverages predicate pushdown so filters/projections occur in-database.
Again, there's a question of who this is written for. It does feel more like notes at this time; I think predicate pushdown is definitely something that requires at least some level of knowledge to know of (which is not all Kedro users)
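For context, reading over Spark-JDBC with pushdown looks roughly like this (the JDBC URL, credentials, and column names are illustrative, and the appropriate JDBC driver jar has to be on Spark's classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-example").getOrCreate()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/warehouse")
    .option("dbtable", "public.orders")
    .option("user", "reader")
    .option("password", "secret")
    .load()
)

# Filters and projections applied here can be pushed down to the database,
# so only matching rows/columns are transferred over JDBC - but anything
# beyond that still means moving the data into the Spark cluster.
recent = orders.filter(orders.order_date >= "2024-01-01").select("order_id", "amount")
recent.show()
```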
> Kedro does not support conditional nodes, making UPSERT logic difficult. Kedro favors reproducibility with append & overwrite modes over in-place updates/deletes.
> - **When to seek SQL-first alternatives**
>   If your workflow is entirely SQL, tools like [dbt](https://github.com/dbt-labs/dbt-core) or [SQLMesh](https://github.com/TobikoData/sqlmesh) offer richer lineage and transformation management than Kedro's Python-centric approach.
I don't even think it's about lineage and transformation management as much as it is simply about whether you're using SQL vs. Python.
I think it is good to point out that Kedro shouldn't be your orchestrator for all SQL transforms. On the flip side, if you want to do DE and DS in a unified Python codebase, that can be a good point to use Kedro, and hopefully the SQL recommendations help here.
Description
Development notes
N/A
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a `Signed-off-by` line in the commit message. See our wiki for guidance.
If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
`RELEASE.md` file