
Conversation

@datajoely (Contributor)

Description

  • πŸ“ Outline key concepts and assumptions for SQL usage
  • πŸ“Š Describe approaches for integrating SQL, including Pandas SQL, Spark-JDBC, and Ibis
  • βœ… Highlight advantages and limitations of each SQL approach
  • ⚠️ Include recommended Ibis development workflow for building scalable pipelines
  • πŸ› οΈ Discuss Kedro's limitations and when to consider SQL-first alternatives

Development notes

N/A

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

 - πŸ“ Outline key concepts and assumptions for SQL usage
 - πŸ“Š Describe approaches for integrating SQL, including Pandas SQL, Spark-JDBC, and Ibis
 - βœ… Highlight advantages and limitations of each SQL approach
 - ⚠️ Include recommended Ibis development workflow for building scalable pipelines
 - πŸ› οΈ Discuss Kedro's limitations and when to consider SQL-first alternatives
@datajoely datajoely requested review from astrojuanlu and deepyaman May 7, 2025 15:54
@datajoely datajoely requested a review from yetudada as a code owner May 7, 2025 15:54
Signed-off-by: datajoely <[email protected]>
@yetudada (Contributor) commented May 7, 2025

Not going to review this but this is so cool! ❀️

datajoely added 3 commits May 7, 2025 17:02
datajoely and others added 3 commits May 8, 2025 08:32
@astrojuanlu (Member) left a comment

I finally reviewed this! Thanks a lot for putting it together @datajoely πŸ™πŸΌ

My main high level comments are:

  • Having this in our docs is much better than not having anything, contains lots of useful guidance
  • However, the bullet-point style is not very consistent with other parts of the Kedro docs (maybe something @stichbury can help us with if/when she has time)
  • Some parts read as very definitive or give the impression that certain workflows will never change, which I'd like to soften somehow
  • I would move the "Recommended Ibis development workflow" to a separate page (somewhat clashing with #4560 by @Stephanieewelu)

From a logistics point of view, it would be better to rebase or recreate this PR on top of `develop`, which contains the most current MkDocs-based docs.


### 2. CRUD Scope

- Kedro handles the parts of CRUD which do not require some concept of β€œstate.”
Comment (Member):

Probably worth spelling out CRUD (Create, Read, Update, Delete), since it might be a term that not all Kedro users are familiar with.


| Approach | Status | Setup Complexity |
|----------------|--------|--------------------|
| **Pandas SQL** | Legacy | Minimal |
Comment (Member):

Worth mentioning Polars SQL? https://docs.pola.rs/user-guide/sql/intro/


---

### 2. Spark-JDBC (legacy)
Comment (Member):

Wondering if it's a bit too bold calling Spark "legacy" πŸ˜„

Another comment (Member):

I think I would combine the first section with some of these sections. For example, you said the two ways of working with data in SQL are:

  • You can treat a SQL database purely as a data sink/source, or
  • Leverage the database engine’s compute power whenever possible.

I think a nice (and easier to grasp) structure could be explaining when you want to do each, and then showing the workflow/examples. For instance, maybe you want to do ML things that can't be expressed in SQL/done efficiently on the database, in which case you might read data in with pandas or Polars using the datasets you mention.

For the second section, I think it could be worth mentioning that a lot of the modern databases double as execution engines. For example, if you can process code on BigQuery or Snowflake, it will almost certainly be more efficient and effective than doing the bad thing of loading it into pandas to transform. I guess here is also a good time to expound upon the benefits of in-memory transformation with DuckDB.

I would like to come away with a clear understanding of when I should use Ibis vs. when I should use pandas/Polars/Spark (I think we can group this into two options to make it simple).
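The first of the two ways of working described above (the database as a pure sink/source) can be sketched with the standard library's sqlite3 and pandas; the table and column names here are invented for illustration:

```python
import sqlite3
import pandas as pd

# Pattern 1: the database is purely a sink/source; all compute happens in Python.
con = sqlite3.connect(":memory:")
pd.DataFrame({"city": ["London", "Paris"], "sales": [100, 250]}).to_sql(
    "orders", con, index=False
)

df = pd.read_sql("SELECT * FROM orders", con)  # extract
df["sales_gbp"] = df["sales"] * 0.8            # transform in pandas
df.to_sql("orders_gbp", con, index=False)      # load the result back
```

Under the second pattern, the `* 0.8` transform would instead be expressed in SQL (or as an Ibis expression) so it executes on the database engine rather than in Python.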

Comment on lines +96 to +98
```{tip}
Please also check out our [blog on building scalable data pipelines with Kedro and Ibis](https://kedro.org/blog/building-scalable-data-pipelines-with-kedro-and-ibis), [SQL data processing in Kedro ML pipelines](https://kedro.org/blog/sql-data-processing-in-kedro-ml-pipelines), and our [PyData London talk](https://www.youtube.com/watch?v=ffDHdtz_vKc).
```
Comment (Member):

This whole {tip} calls for a separate page in our docs about Ibis

Consult the [Ibis support matrix](https://ibis-project.org/backends/support/matrix) to verify that needed SQL functions are available. You can still write raw SQL for missing features.

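The raw-SQL escape hatch mentioned in the quoted text can be illustrated generically with the standard library's sqlite3 (a minimal sketch; names are invented):

```python
import sqlite3

# Escape hatch: when a needed function isn't exposed by the abstraction
# layer (e.g. missing from a backend's support matrix), raw SQL still works.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (city TEXT, sales INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?)", [("London", 100), ("Paris", 250)])

rows = con.execute("SELECT city FROM orders WHERE sales > ?", (150,)).fetchall()
```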
### Recommended Ibis development workflow
Comment (Member):

I think this whole section should be a separate page

Another comment (Member):

I like this section, but I wonder if it should be a blog post? I think it's a sensible way of working, but I think it is also a pretty specific workflow.

More broadly, there are some wider limitations to be aware of when working with Kedro & SQL:

- **No conditional branching**
Kedro does not support conditional nodes, making UPSERT logic difficult. Kedro favors reproducibility with append & overwrite modes over in-place updates/deletes.
Comment (Member):

This confuses me. How is "overwrite" so different from "update"? Both "break lineage" (see table at the beginning of the page)

@deepyaman (Member) left a comment

Sorry for the massive delay!

| Operation | Supported | Notes                                                    |
|-----------|------------|---------------------------------------------------------|
| Create | ✅ | Writing new tables (or views) |
| Read | ✅ | Loading data into DataFrames or Ibis expressions |
| Update | ❌ | Requires custom SQL outside Kedro’s DAG; breaks lineage |
Comment (Member):

We should support Update (Upsert), right? Or what do you mean by breaks lineage?



This page outlines the various approaches and recommended practices when building a Kedro pipeline with SQL databases.

## Key concepts and assumptions
Comment (Member):

I wonder who the audience for this page is. The key concepts and assumptions starts out pretty advanced IMO; I think there are a lot of people who still don't think about the issue of pulling data out of the database, transforming it, and writing back, for instance.




- JVM + Spark cluster required
- Overkill for smaller workloads

Comment (Member):

Despite predicate pushdown, it's still a lot of data transfer!


- **Mechanism**
Spark’s DataFrame API over JDBC. Leverages predicate pushdown so filters/projections occur in-database.
Comment (Member):

Again, there's a question of who this is written for. It does feel more like notes at this time; I think predicate pushdown is definitely something that requires at least some level of knowledge to know of (which is not all Kedro users)
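For readers meeting the term here first: predicate pushdown simply means the filter executes inside the database before any data transfer. A backend-agnostic illustration with the standard library's sqlite3 and pandas (table and column names invented):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
pd.DataFrame({"city": ["London", "Paris", "Berlin"], "sales": [100, 250, 300]}).to_sql(
    "orders", con, index=False
)

# Without pushdown: every row is transferred, then filtered client-side.
all_rows = pd.read_sql("SELECT * FROM orders", con)
filtered_locally = all_rows[all_rows["sales"] > 150]

# With pushdown: the WHERE clause runs in-database; only matching rows transfer.
pushed_down = pd.read_sql("SELECT * FROM orders WHERE sales > 150", con)
```

Both paths produce the same rows; the difference is how much data crosses the wire, which is the point being made above.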

Kedro does not support conditional nodes, making UPSERT logic difficult. Kedro favors reproducibility with append & overwrite modes over in-place updates/deletes.

- **When to seek SQL-first alternatives**
If your workflow is entirely SQL, tools like [dbt](https://github.com/dbt-labs/dbt-core) or [SQLMesh](https://github.com/TobikoData/sqlmesh) offer richer lineage and transformation management than Kedro’s Python-centric approach.
Comment (Member):

I don't even think it's about lineage and transformation management as much as it is simply about whether you're using SQL vs. Python.

I think it is good to point out that Kedro shouldn't be your orchestrator for all SQL transforms. On the flip side, if you want to do data engineering and data science in a unified Python codebase, that can be a good point to use Kedro, and hopefully the SQL recommendations help here.


5 participants