DuckDB sqlglot backend [WIP] #6996
Conversation
ibis/backends/duckdb/__init__.py (outdated)

```python
def raw_sql(self, query: str, **kwargs: Any) -> Any:
    return self.con.execute(query, **kwargs)
```

this should probably be `self.con.sql` instead, but we should be able to change it in just the one spot.
```diff
@@ -859,6 +859,11 @@ def test_string(backend, alltypes, df, result_func, expected_func):
     ["mysql", "mssql", "druid", "oracle"],
     raises=com.OperationNotDefinedError,
 )
+@pytest.mark.broken(
+    ["duckdb"],
+    reason="no idea, generated SQL looks very correct but this fails",
```
Might be related to backslash escaping.
force-pushed eba5041 to 4a484b5
ibis/backends/duckdb/datatypes.py (outdated)

```python
@functools.singledispatch
def serialize(ty) -> str:
```

It shouldn't be needed anymore, since `DuckDBType.to_string(dtype)` is supposed to handle the same task.
Thanks @kszucs! Yeah, I think I ripped out all usage from `values.py`, but I'll go ahead and remove it from `datatypes.py`.
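For context, the `serialize` function being removed is a `functools.singledispatch` dispatcher. A toy sketch of that pattern, using invented stand-in type classes rather than ibis's real dtype hierarchy:

```python
import functools


class DataType:  # toy stand-in for an ibis dtype base class
    pass


class Int64(DataType):
    pass


class String(DataType):
    pass


@functools.singledispatch
def serialize(ty) -> str:
    # Fallback when no more specific implementation is registered.
    raise NotImplementedError(f"no serializer for {type(ty).__name__}")


@serialize.register(Int64)
def _(ty) -> str:
    return "BIGINT"


@serialize.register(String)
def _(ty) -> str:
    return "VARCHAR"


assert serialize(Int64()) == "BIGINT"
assert serialize(String()) == "VARCHAR"
```

The dispatcher picks an implementation by the runtime type of its first argument, which is why a single `DuckDBType.to_string(dtype)` method can subsume it.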
force-pushed 00c54b5 to 04d12ca

force-pushed 4760c59 to b61c363
I've gotten this down to a single test failure:
Of course it's related to time zones 🙄 😒 😂 ⏲️
force-pushed 11219d3 to cdb8b31
I believe the non-cloud backends are all passing now. Going to start looking at the clouds, then will take a look at TPC-H failures.
The clouds have parted, and BigQuery and Snowflake are passing.
@@ -177,7 +177,7 @@ with closing(con.raw_sql("CREATE TEMP TABLE my_table AS SELECT * FROM RANGE(10)"

Here's an example:

    -```{python}
    +```python
I need to revisit the return of `raw_sql` for duckdb; it doesn't seem like it can ever be closed when using raw duckdb.
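The docs snippet above wraps `raw_sql` in `contextlib.closing`, which only works if the returned object exposes a `close()` method. A minimal sketch of that contract, using a hypothetical stand-in object rather than DuckDB's actual result type:

```python
from contextlib import closing


class FakeCursor:
    """Hypothetical stand-in for a raw_sql result.

    The point of the comment above is that DuckDB's raw result may not
    expose close() at all, which breaks this pattern.
    """

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


with closing(FakeCursor()) as cur:
    # closing() guarantees cur.close() runs when the block exits,
    # whether normally or via an exception.
    assert not cur.closed
assert cur.closed
```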
```diff
-            sge.DataTypeParam(this=sge.Literal.number(dtype.precision)),
-            sge.DataTypeParam(this=sge.Literal.number(dtype.scale)),
+            sge.DataTypeParam(this=sge.Literal.number(precision)),
+            sge.DataTypeParam(this=sge.Literal.number(scale)),
```
This change fixes an actual bug: we weren't handling the default precision and scale when converting from ibis decimal types to sqlglot decimal types.
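A minimal sketch of the shape of that fix. The helper name and the `DECIMAL(18, 3)` fallback values are assumptions for illustration (DuckDB-style defaults), not the PR's actual code:

```python
def decimal_params(precision, scale):
    # Hypothetical helper: substitute defaults when the ibis decimal
    # dtype leaves precision/scale unset, instead of passing None
    # through to sqlglot's DataTypeParam literals.
    if precision is None:
        precision = 18  # assumed DuckDB-style default width
    if scale is None:
        scale = 3  # assumed DuckDB-style default scale
    return precision, scale
```

With resolved values in hand, the `sge.DataTypeParam(this=sge.Literal.number(...))` calls above always receive real numbers.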
```python
import ibis.expr.operations as ops


def unalias(op: ops.Value) -> ops.Value:
```
I have mixed feelings about this. It solves a real problem, which is that certain expressions cannot carry aliases, despite being reuses of the same expression. Here's an example:

```sql
select key as my_key
from group by key as my_key -- not allowed, but in ibis these are the same expression
```

so we have to compile the unaliased group by expressions in the `GROUP BY`, and the aliased ones in the `SELECT`.
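A self-contained sketch of what an `unalias` helper could look like, using toy node classes rather than ibis's real IR:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    """Toy stand-in for ops.Value."""


@dataclass(frozen=True)
class Field(Value):
    name: str


@dataclass(frozen=True)
class Alias(Value):
    arg: Value
    name: str


def unalias(op: Value) -> Value:
    # Strip a single top-level alias so the expression can be compiled
    # bare (e.g. inside GROUP BY), leaving everything else untouched.
    return op.arg if isinstance(op, Alias) else op


key = Field("key")
assert unalias(Alias(key, "my_key")) == key  # alias stripped
assert unalias(key) == key  # non-aliased values pass through
```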
I believe this must be handled by compilers and doesn't indicate any design problems with ibis's IR.
```diff
     by = tuple(map(tr_val, op.by))
     metrics = tuple(map(tr_val, op.metrics))
     selections = (by + metrics) or "*"
     sel = sg.select(*selections).from_(table)

     if group_keys := op.by:
-        sel = sel.group_by(*map(tr_val_no_alias, group_keys), dialect="clickhouse")
+        sel = sel.group_by(*map(tr_val, map(unalias, group_keys)), dialect="clickhouse")
```
Examples of where `unalias` is necessary.
```diff
@@ -64,11 +65,9 @@ def _column(op, *, aliases, **_):


-@translate_val.register(ops.Alias)
-def _alias(op, render_aliases: bool = True, **kw):
```
`render_aliases` is superseded by calling `unalias` where needed. `render_aliases` was also recursive, which isn't actually the desired behavior.
```python
from collections.abc import Mapping


def translate(op: ops.TableNode, params: Mapping[ir.Value, Any]) -> sg.exp.Expression:
```
A follow-up is to move this into `ibis/backends/base/sqlglot/core.py` and reuse it for both the clickhouse and duckdb backends. I think we may want to extract subqueries before doing that, though. I did play around with using sqlglot's optimizer, which can remove duplicate subqueries, but it's also defeated by a bunch of `dot_sql` tests and functionality.
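A toy sketch of the kind of duplicate-subquery elimination described above. This is purely illustrative (sqlglot's optimizer works on parsed ASTs, not raw strings, and the `cte_` naming is invented):

```python
def dedupe_subqueries(subqueries):
    # Map each distinct subquery text to a shared CTE name, so a
    # repeated subquery is emitted once and referenced thereafter.
    names = {}
    for sql in subqueries:
        names.setdefault(sql, f"cte_{len(names)}")
    return names


queries = [
    "SELECT key FROM t",
    "SELECT other FROM s",
    "SELECT key FROM t",  # duplicate, collapses onto the first entry
]
assert dedupe_subqueries(queries) == {
    "SELECT key FROM t": "cte_0",
    "SELECT other FROM s": "cte_1",
}
```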
An immediate follow-up that I believe is fairly self-contained is to deduplicate this file with the clickhouse one, and see whether we can move the shared parts into `ibis/backends/base/sqlglot/relations.py` to be shared by both backends.
We should prefer the approach in this file, because @gforsyth did a great job of using dialect-free sqlglot objects here, and this approach is more likely to work across multiple backends.
`CASE t0.continent`
I may look into putting this back.
```diff
     raises=AssertionError,
-    reason="duckdb 0.8.0 returns DateOffset columns",
+    reason="duckdb returns dateoffsets",
```
We need to look into handling this somehow.
But that's a follow-up.
```diff
@@ -1458,12 +1458,12 @@ def test_strftime(backend, alltypes, df, expr_fn, pandas_pattern):
         pytest.mark.notimpl(
             ["pyspark"],
             raises=com.UnsupportedArgumentError,
-            reason="PySpark backend does not support timestamp from unix time with unit ms. Supported unit is s.",
+            reason="PySpark backend does not support timestamp from unix time with unit ns. Supported unit is s.",
```
Suggested change:

```diff
-            reason="PySpark backend does not support timestamp from unix time with unit ns. Supported unit is s.",
+            reason="PySpark backend does not support timestamp from unix time with unit us. Supported unit is s.",
```
```diff
         ),
         pytest.mark.notimpl(
             ["duckdb", "mssql", "clickhouse"],
             raises=com.UnsupportedOperationError,
-            reason="`ms` unit is not supported!",
+            reason="`ns` unit is not supported!",
```
Suggested change:

```diff
-            reason="`ns` unit is not supported!",
+            reason="`us` unit is not supported!",
```
```python
expected = df.rename({"A": "a"}, axis=1)
with pytest.warns(FutureWarning):
    new_df = schema.apply_to(df.copy())
tm.assert_frame_equal(new_df, expected)
```
Finally removed `apply_to`.
@@ -851,9 +851,13 @@ def __hash__(self) -> int:

```python
        return super().__hash__()

    def __eq__(self, other: Value) -> ir.BooleanValue:
        if other is None:
            return _binop(ops.IdenticalTo, self, other)
```
We currently allow `None` in equals and not-equals expressions, like `a == None`, to mean `a is None`. This was being turned into `a is None` by sqlalchemy; here we are moving that transformation into the ibis API.
I think this should be handled at the backend instead. Is it well defined what the difference between `EqualTo` and `IdenticalTo` is?
Yes. `Equals` is not null safe; `IdenticalTo` is null safe.
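The semantic difference can be sketched in plain Python, modeling SQL `NULL` as `None`. This is a toy model of the two operators' semantics, not ibis's implementation:

```python
def equals(a, b):
    # SQL-style Equals: NULL-propagating. Comparing anything with
    # NULL yields NULL (unknown), not False.
    if a is None or b is None:
        return None
    return a == b


def identical_to(a, b):
    # IdenticalTo (SQL's IS NOT DISTINCT FROM): null safe.
    # NULL compared with NULL is True; NULL vs. non-NULL is False.
    if a is None or b is None:
        return a is None and b is None
    return a == b


assert equals(1, 1) is True
assert equals(None, 1) is None  # unknown, not False
assert identical_to(None, None) is True
assert identical_to(None, 1) is False
```

This is why `a == None` must compile to the null-safe form: plain `Equals` against `NULL` never returns `True`.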
Rebasing this now ...
force-pushed 19f9918 to ff5855e
Closing in favor of #7796
Opening this so I can refer to a big ol' diff in the browser, and in case anyone wants to start taking a look.

Still very much in-progress / incomplete:

- Maps are mostly not working
- Arrays are largely broken
- Aggregates are definitely broken
- Most other stuff is mostly working. There are a couple of annoying dtype mismatches that are currently failing a bunch of the tests (`NaN` where we want `None`, etc.).
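The `NaN`-vs-`None` mismatch mentioned above is easy to reproduce in plain Python: `NaN` is a float value that never compares equal, even to itself, while `None` is an absent value. A generic illustration, not tied to any particular failing test:

```python
import math

nan = float("nan")

# NaN is a real float, and it never compares equal -- not even to itself.
assert isinstance(nan, float)
assert nan != nan
assert math.isnan(nan)

# None is the absence of a value; it is a different object entirely.
assert None is not nan
assert None != nan

# So a result cell holding NaN never matches an expected cell holding
# None, which is exactly the kind of value mismatch that fails tests.
expected, got = None, nan
assert expected != got
```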