
DuckDB sqlglot backend [WIP] #6996

Closed
wants to merge 222 commits

Conversation

gforsyth (Member)

Opening this so I can refer to a big ol' diff on the browser and in case anyone wants to start taking a look.

Still very much in-progress / incomplete.

- Maps are mostly not working
- Arrays are largely broken
- Aggregates are definitely broken

Most other stuff is mostly working. There are a couple of annoying dtype mismatches that are currently failing a bunch of the tests (NaN where we want None, etc).

Comment on lines 108 to 83
def raw_sql(self, query: str, **kwargs: Any) -> Any:
return self.con.execute(query, **kwargs)
gforsyth (Member, Author)

this should probably be self.con.sql instead, but we should be able to change it in just the one spot.
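
A minimal sketch of what that change might look like, assuming `self.con` is a plain `duckdb.DuckDBPyConnection` (`con.sql()` returns a lazy `DuckDBPyRelation`, whereas `con.execute()` runs the statement and returns the connection/cursor object):

```python
from typing import Any


def raw_sql(self, query: str, **kwargs: Any) -> Any:
    # sketch: duckdb's sql() hands back a DuckDBPyRelation instead of the
    # connection returned by execute(); kwargs are passed through unchanged,
    # as in the original, though the two methods accept different keywords
    return self.con.sql(query, **kwargs)
```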

kszucs (Member) commented Sep 1, 2023

@gforsyth thanks for putting this up! Regarding the datatype conversions, I'd suggest rebasing this PR on top of #6876, which adds proper dtype conversions between sqlglot and ibis (including parsing and string generation).

@@ -859,6 +859,11 @@ def test_string(backend, alltypes, df, result_func, expected_func):
["mysql", "mssql", "druid", "oracle"],
raises=com.OperationNotDefinedError,
)
@pytest.mark.broken(
["duckdb"],
reason="no idea, generated SQL looks very correct but this fails",
Member

Might be related to backslash escaping.



@functools.singledispatch
def serialize(ty) -> str:
kszucs (Member) commented Sep 7, 2023

It shouldn't be needed anymore since DuckDBType.to_string(dtype) is supposed to handle the same task.
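
For reference, a rough sketch of the replacement usage; the import path for `DuckDBType` is an assumption about where the sqlglot type mapper lives after #6876, and the rendered strings are only expected outputs:

```python
import ibis.expr.datatypes as dt

# assumed import path for the sqlglot-based type mapper
from ibis.backends.base.sqlglot.datatypes import DuckDBType

# render ibis dtypes as DuckDB type strings via the shared mapper
DuckDBType.to_string(dt.Decimal(18, 3))   # expected: something like "DECIMAL(18, 3)"
DuckDBType.to_string(dt.Array(dt.int64))  # expected: something like "BIGINT[]"
```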

gforsyth (Member, Author)

Thanks @kszucs ! Yeah, I think I ripped out all usage from values.py but I'll go ahead and remove it from datatypes.py

cpcloud added the refactor (Issues or PRs related to refactoring the codebase) and duckdb (The DuckDB backend) labels on Sep 17, 2023
cpcloud (Member) commented Sep 17, 2023

I've gotten this down to a single test failure:

…/ibis on  gil/duckdb_sqlglot is 📦 v6.1.0 via 🐍 v3.10.12 via ❄️  impure (ibis-3.10.12-env) took 27s
❯ pytest -m duckdb --snapshot-update --allow-snapshot-deletion -n auto --dist loadgroup -q
bringing up nodes...
[pytest progress output elided]
FAILED ibis/backends/tests/test_temporal.py::test_timestamp_with_timezone_literal[duckdb-iso] - AssertionError: assert '2022-02-04 16:20:00PST' == '2022-02-04 08:20:00PST'

Of course it's related to time zones 🙄 😒 😂 ⏲️

cpcloud force-pushed the gil/duckdb_sqlglot branch 2 times, most recently from 11219d3 to cdb8b31, on September 18, 2023
cpcloud (Member) commented Sep 18, 2023

I believe the non-cloud backends are all passing now.

Going to start looking at the clouds, then will take a look at TPC-H failures.

cpcloud (Member) commented Sep 18, 2023

The clouds have parted, and BigQuery and Snowflake are passing:

…/ibis on  gil/duckdb_sqlglot is 📦 v6.1.0 via 🐍 v3.10.12 via ❄️  impure (ibis-3.10.12-env)
❯ pytest -m 'bigquery or snowflake' -n auto --dist=loadgroup --snapshot-update -q
bringing up nodes...
[pytest progress output elided]
2300 passed, 16 skipped, 441 xfailed in 436.29s (0:07:16)

@@ -177,7 +177,7 @@ with closing(con.raw_sql("CREATE TEMP TABLE my_table AS SELECT * FROM RANGE(10)"

Here's an example:

```{python}
```python
Member

I need to revisit the return value of raw_sql for DuckDB; it doesn't seem like it can ever be closed when using raw duckdb.
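
A small sketch of the issue, assuming a vanilla duckdb connection (not the ibis backend wrapper):

```python
import duckdb

con = duckdb.connect()

# execute() returns a DuckDBPyConnection, not a separate closable result
# handle, so `with closing(con.raw_sql(...))` as in the docs hunk above has
# nothing meaningful to close once raw_sql delegates to plain duckdb
con.execute("CREATE TEMP TABLE my_table AS SELECT * FROM RANGE(10)")
```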

sge.DataTypeParam(this=sge.Literal.number(dtype.precision)),
sge.DataTypeParam(this=sge.Literal.number(dtype.scale)),
sge.DataTypeParam(this=sge.Literal.number(precision)),
sge.DataTypeParam(this=sge.Literal.number(scale)),
Member

This change fixes an actual bug: we weren't handling the default precision and scale when converting from ibis decimal types to sqlglot decimal types.
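
A minimal sketch of the kind of fix being described; the helper name and the fallback values of 18 and 3 (DuckDB's default DECIMAL precision and scale) are assumptions, not the exact code from this PR:

```python
import sqlglot.expressions as sge

import ibis.expr.datatypes as dt


def decimal_to_sqlglot(dtype: dt.Decimal) -> sge.DataType:
    # fall back to defaults when the ibis type leaves precision/scale unset
    precision = dtype.precision if dtype.precision is not None else 18
    scale = dtype.scale if dtype.scale is not None else 3
    return sge.DataType(
        this=sge.DataType.Type.DECIMAL,
        expressions=[
            sge.DataTypeParam(this=sge.Literal.number(precision)),
            sge.DataTypeParam(this=sge.Literal.number(scale)),
        ],
    )
```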

import ibis.expr.operations as ops


def unalias(op: ops.Value) -> ops.Value:
Member

I have mixed feelings about this.

It solves a problem: some clauses cannot accept an aliased expression, even though in ibis it's the same reused expression.

Here's an example:

select key as my_key
from group by key as my_key -- not allowed, but in ibis these are the same expression

so we have to compile the unaliased group by expressions in the GROUP BY, and the aliased ones in the SELECT.
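
A minimal sketch of what unalias could look like under that reasoning (an assumption, not necessarily the exact implementation in this PR):

```python
import ibis.expr.operations as ops


def unalias(op: ops.Value) -> ops.Value:
    # strip a single top-level Alias so the expression can be compiled
    # without the trailing `AS name`, e.g. inside a GROUP BY clause
    return op.arg if isinstance(op, ops.Alias) else op
```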

Member

I believe this must be handled by compilers and doesn't indicate any design problems with ibis's IR.


by = tuple(map(tr_val, op.by))
metrics = tuple(map(tr_val, op.metrics))
selections = (by + metrics) or "*"
sel = sg.select(*selections).from_(table)

if group_keys := op.by:
sel = sel.group_by(*map(tr_val_no_alias, group_keys), dialect="clickhouse")
sel = sel.group_by(*map(tr_val, map(unalias, group_keys)), dialect="clickhouse")
Member

Examples of where unalias is necessary.

@@ -64,11 +65,9 @@ def _column(op, *, aliases, **_):


@translate_val.register(ops.Alias)
def _alias(op, render_aliases: bool = True, **kw):
Member

render_alias is superseded by calling unalias where needed.

render_alias was also recursive, which isn't actually the desired behavior.

from collections.abc import Mapping


def translate(op: ops.TableNode, params: Mapping[ir.Value, Any]) -> sg.exp.Expression:
Member

A follow-up is to move this into ibis/backends/base/sqlglot/core.py and reuse it for both the clickhouse and duckdb backends.

I think we may want to extract subqueries before doing that, though.

I did play around with using sqlglot's optimizer, which can remove duplicate subqueries, but it's also defeated by a bunch of dot_sql tests and functionality.
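
For context, a rough sketch of that kind of experiment using plain sqlglot's optimizer (the query and schema here are made up for illustration):

```python
import sqlglot
from sqlglot.optimizer import optimize

sql = """
SELECT t1.a, t2.b
FROM (SELECT a, b FROM t) AS t1
JOIN (SELECT a, b FROM t) AS t2
  ON t1.a = t2.a
"""

# optimize() runs a pipeline of rewrite rules; among other things it can
# hoist duplicated subqueries into shared CTEs
optimized = optimize(
    sqlglot.parse_one(sql, read="duckdb"),
    schema={"t": {"a": "int", "b": "int"}},
)
print(optimized.sql(dialect="duckdb", pretty=True))
```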

Member

An immediate follow-up that I believe is fairly self-contained is to deduplicate this file with the clickhouse one, and see what we can move into ibis/backends/base/sqlglot/relations.py to be shared by both backends.

Member

We should prefer the approach in this file, because @gforsyth did a great job of using dialect-free sqlglot objects here, and this approach is more likely to work across multiple backends.

CASE t0.continent
Member

I may look into putting this back.

raises=AssertionError,
reason="duckdb 0.8.0 returns DateOffset columns",
reason="duckdb returns dateoffsets",
Member

We need to look into handling this somehow.

Member

But that's a follow-up.

@@ -1458,12 +1458,12 @@ def test_strftime(backend, alltypes, df, expr_fn, pandas_pattern):
pytest.mark.notimpl(
["pyspark"],
raises=com.UnsupportedArgumentError,
reason="PySpark backend does not support timestamp from unix time with unit ms. Supported unit is s.",
reason="PySpark backend does not support timestamp from unix time with unit ns. Supported unit is s.",
Member

Suggested change
reason="PySpark backend does not support timestamp from unix time with unit ns. Supported unit is s.",
reason="PySpark backend does not support timestamp from unix time with unit us. Supported unit is s.",

),
pytest.mark.notimpl(
["duckdb", "mssql", "clickhouse"],
raises=com.UnsupportedOperationError,
reason="`ms` unit is not supported!",
reason="`ns` unit is not supported!",
Member

Suggested change
reason="`ns` unit is not supported!",
reason="`us` unit is not supported!",

expected = df.rename({"A": "a"}, axis=1)
with pytest.warns(FutureWarning):
new_df = schema.apply_to(df.copy())
tm.assert_frame_equal(new_df, expected)
Member

Finally removed apply_to.

@@ -851,9 +851,13 @@ def __hash__(self) -> int:
return super().__hash__()

def __eq__(self, other: Value) -> ir.BooleanValue:
if other is None:
return _binop(ops.IdenticalTo, self, other)
Member

We currently allow None in equals and not-equals expressions, like a == None, to mean a is None. This was previously handled by sqlalchemy (which turns == None into IS NULL); here we are moving that transformation into the ibis API.

Member

I think this should be handled at the backend instead.

Is the difference between EqualTo and IdenticalTo well defined?

Member

Yes. Equals is not null-safe; IdenticalTo is null-safe.
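
As a small illustration of that distinction (the DuckDB SQL in the comments is an assumption about typical output, not copied from this PR):

```python
import ibis

t = ibis.table({"a": "int64", "b": "int64"}, name="t")

eq = t.a == t.b                # ops.Equals: yields NULL when either side is NULL
ident = t.a.identical_to(t.b)  # ops.IdenticalTo: null-safe; on DuckDB this
                               # typically compiles to `a IS NOT DISTINCT FROM b`
```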

cpcloud (Member) commented Sep 26, 2023

Rebasing this now ...

gforsyth (Member, Author)

Closing in favor of #7796

gforsyth closed this on Dec 20, 2023