DuckDB sqlglot backend [WIP] #6996
Conversation
ibis/backends/duckdb/__init__.py (outdated)

```python
def raw_sql(self, query: str, **kwargs: Any) -> Any:
    return self.con.execute(query, **kwargs)
```

this should probably be `self.con.sql` instead, but we should be able to change it in just the one spot.
```diff
@@ -859,6 +859,11 @@ def test_string(backend, alltypes, df, result_func, expected_func):
     ["mysql", "mssql", "druid", "oracle"],
     raises=com.OperationNotDefinedError,
 )
+@pytest.mark.broken(
+    ["duckdb"],
+    reason="no idea, generated SQL looks very correct but this fails",
```
Might be related to backslash escaping.
force-pushed eba5041 to 4a484b5
ibis/backends/duckdb/datatypes.py (outdated)

```python
@functools.singledispatch
def serialize(ty) -> str:
```

It shouldn't be needed anymore, since `DuckDBType.to_string(dtype)` is supposed to handle the same task.
Thanks @kszucs! Yeah, I think I ripped out all usage from `values.py`, but I'll go ahead and remove it from `datatypes.py`.
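For context, the `serialize` function being removed is a `functools.singledispatch` dispatcher. A toy sketch of that pattern, using invented stand-in type classes rather than ibis's real dtype hierarchy:

```python
import functools


class DataType:  # toy stand-in for an ibis dtype base class
    pass


class Int64(DataType):
    pass


class String(DataType):
    pass


@functools.singledispatch
def serialize(ty) -> str:
    # Fallback when no more specific implementation is registered.
    raise NotImplementedError(f"no serializer for {type(ty).__name__}")


@serialize.register(Int64)
def _(ty) -> str:
    return "BIGINT"


@serialize.register(String)
def _(ty) -> str:
    return "VARCHAR"


assert serialize(Int64()) == "BIGINT"
assert serialize(String()) == "VARCHAR"
```

The dispatcher picks an implementation by the runtime type of its first argument, which is why a single `DuckDBType.to_string(dtype)` method can subsume it.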
force-pushed 00c54b5 to 04d12ca

force-pushed 4760c59 to b61c363
I've gotten this down to a single test failure:
Of course it's related to time zones 🙄 😒 😂 ⏲️
force-pushed 11219d3 to cdb8b31
I believe the non-cloud backends are all passing now. Going to start looking at the clouds, then will take a look at TPC-H failures.
The clouds have parted, and BigQuery and Snowflake are passing.
@@ -177,7 +177,7 @@ with closing(con.raw_sql("CREATE TEMP TABLE my_table AS SELECT * FROM RANGE(10)"

Here's an example:

    -```{python}
    +```python
I need to revisit the return of `raw_sql` for duckdb; it doesn't seem like it can ever be closed when using raw duckdb.
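The docs snippet above wraps `raw_sql` in `contextlib.closing`, which only works if the returned object exposes a `close()` method. A minimal sketch of that contract, using a hypothetical stand-in object rather than DuckDB's actual result type:

```python
from contextlib import closing


class FakeCursor:
    """Hypothetical stand-in for a raw_sql result.

    The point of the comment above is that DuckDB's raw result may not
    expose close() at all, which breaks this pattern.
    """

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


with closing(FakeCursor()) as cur:
    # closing() guarantees cur.close() runs when the block exits,
    # whether normally or via an exception.
    assert not cur.closed
assert cur.closed
```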
```diff
-            sge.DataTypeParam(this=sge.Literal.number(dtype.precision)),
-            sge.DataTypeParam(this=sge.Literal.number(dtype.scale)),
+            sge.DataTypeParam(this=sge.Literal.number(precision)),
+            sge.DataTypeParam(this=sge.Literal.number(scale)),
```
This change fixes an actual bug: we weren't handling the default precision and scale when converting from ibis decimal types to sqlglot decimal types.
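A minimal sketch of the shape of that fix. The helper name and the `DECIMAL(18, 3)` fallback values are assumptions for illustration (DuckDB-style defaults), not the PR's actual code:

```python
def decimal_params(precision, scale):
    # Hypothetical helper: substitute defaults when the ibis decimal
    # dtype leaves precision/scale unset, instead of passing None
    # through to sqlglot's DataTypeParam literals.
    if precision is None:
        precision = 18  # assumed DuckDB-style default width
    if scale is None:
        scale = 3  # assumed DuckDB-style default scale
    return precision, scale
```

With resolved values in hand, the `sge.DataTypeParam(this=sge.Literal.number(...))` calls above always receive real numbers.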
```python
import ibis.expr.operations as ops


def unalias(op: ops.Value) -> ops.Value:
```
I have mixed feelings about this. It solves a real problem, which is that certain expressions cannot carry aliases, despite being reuses of the same expression. Here's an example:

```sql
select key as my_key
from group by key as my_key -- not allowed, but in ibis these are the same expression
```

so we have to compile the unaliased group by expressions in the `GROUP BY`, and the aliased ones in the `SELECT`.
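A self-contained sketch of what an `unalias` helper could look like, using toy node classes rather than ibis's real IR:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    """Toy stand-in for ops.Value."""


@dataclass(frozen=True)
class Field(Value):
    name: str


@dataclass(frozen=True)
class Alias(Value):
    arg: Value
    name: str


def unalias(op: Value) -> Value:
    # Strip a single top-level alias so the expression can be compiled
    # bare (e.g. inside GROUP BY), leaving everything else untouched.
    return op.arg if isinstance(op, Alias) else op


key = Field("key")
assert unalias(Alias(key, "my_key")) == key  # alias stripped
assert unalias(key) == key  # non-aliased values pass through
```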
I believe this must be handled by compilers and doesn't indicate any design problems with ibis's IR.
```diff
     by = tuple(map(tr_val, op.by))
     metrics = tuple(map(tr_val, op.metrics))
     selections = (by + metrics) or "*"
     sel = sg.select(*selections).from_(table)

     if group_keys := op.by:
-        sel = sel.group_by(*map(tr_val_no_alias, group_keys), dialect="clickhouse")
+        sel = sel.group_by(*map(tr_val, map(unalias, group_keys)), dialect="clickhouse")
```
Examples of where `unalias` is necessary.
```diff
@@ -64,11 +65,9 @@ def _column(op, *, aliases, **_):


-@translate_val.register(ops.Alias)
-def _alias(op, render_aliases: bool = True, **kw):
```
`render_aliases` is superseded by calling `unalias` where needed. `render_aliases` was also recursive, which isn't actually the desired behavior.
```python
from collections.abc import Mapping


def translate(op: ops.TableNode, params: Mapping[ir.Value, Any]) -> sg.exp.Expression:
```
A follow-up is to move this into `ibis/backends/base/sqlglot/core.py` and reuse it for both the clickhouse and duckdb backends. I think we may want to extract subqueries before doing that, though. I did play around with using sqlglot's optimizer, which can remove duplicate subqueries, but it's also defeated by a bunch of `dot_sql` tests and functionality.
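A toy sketch of the kind of duplicate-subquery elimination described above. This is purely illustrative (sqlglot's optimizer works on parsed ASTs, not raw strings, and the `cte_` naming is invented):

```python
def dedupe_subqueries(subqueries):
    # Map each distinct subquery text to a shared CTE name, so a
    # repeated subquery is emitted once and referenced thereafter.
    names = {}
    for sql in subqueries:
        names.setdefault(sql, f"cte_{len(names)}")
    return names


queries = [
    "SELECT key FROM t",
    "SELECT other FROM s",
    "SELECT key FROM t",  # duplicate, collapses onto the first entry
]
assert dedupe_subqueries(queries) == {
    "SELECT key FROM t": "cte_0",
    "SELECT other FROM s": "cte_1",
}
```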
An immediate follow-up that I believe is fairly self-contained is to deduplicate this file with the clickhouse one, and see whether we can move the shared parts into `ibis/backends/base/sqlglot/relations.py` to be shared by both backends.
We should prefer the approach in this file, because @gforsyth did a great job of using dialect-free sqlglot objects here, and this approach is more likely to work across multiple backends.
`CASE t0.continent`
I may look into putting this back.
```diff
     raises=AssertionError,
-    reason="duckdb 0.8.0 returns DateOffset columns",
+    reason="duckdb returns dateoffsets",
```
We need to look into handling this somehow.
But that's a follow-up.
```diff
@@ -1458,12 +1458,12 @@ def test_strftime(backend, alltypes, df, expr_fn, pandas_pattern):
         pytest.mark.notimpl(
             ["pyspark"],
             raises=com.UnsupportedArgumentError,
-            reason="PySpark backend does not support timestamp from unix time with unit ms. Supported unit is s.",
+            reason="PySpark backend does not support timestamp from unix time with unit ns. Supported unit is s.",
```
Suggested change:

```diff
-            reason="PySpark backend does not support timestamp from unix time with unit ns. Supported unit is s.",
+            reason="PySpark backend does not support timestamp from unix time with unit us. Supported unit is s.",
```
```diff
         ),
         pytest.mark.notimpl(
             ["duckdb", "mssql", "clickhouse"],
             raises=com.UnsupportedOperationError,
-            reason="`ms` unit is not supported!",
+            reason="`ns` unit is not supported!",
```
Suggested change:

```diff
-            reason="`ns` unit is not supported!",
+            reason="`us` unit is not supported!",
```
```python
expected = df.rename({"A": "a"}, axis=1)
with pytest.warns(FutureWarning):
    new_df = schema.apply_to(df.copy())
tm.assert_frame_equal(new_df, expected)
```
Finally removed `apply_to`.
@@ -851,9 +851,13 @@ def __hash__(self) -> int:

```python
        return super().__hash__()

    def __eq__(self, other: Value) -> ir.BooleanValue:
        if other is None:
            return _binop(ops.IdenticalTo, self, other)
```
We currently allow `None` in equals and not-equals expressions, like `a == None`, to mean `a is None`. This was being turned into `a is None` by sqlalchemy; here we are moving that transformation into the ibis API.
I think this should be handled at the backend instead. Is it well defined what the difference between `EqualTo` and `IdenticalTo` is?
Yes. `Equals` is not null safe; `IdenticalTo` is null safe.
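The semantic difference can be sketched in plain Python, modeling SQL `NULL` as `None`. This is a toy model of the two operators' semantics, not ibis's implementation:

```python
def equals(a, b):
    # SQL-style Equals: NULL-propagating. Comparing anything with
    # NULL yields NULL (unknown), not False.
    if a is None or b is None:
        return None
    return a == b


def identical_to(a, b):
    # IdenticalTo (SQL's IS NOT DISTINCT FROM): null safe.
    # NULL compared with NULL is True; NULL vs. non-NULL is False.
    if a is None or b is None:
        return a is None and b is None
    return a == b


assert equals(1, 1) is True
assert equals(None, 1) is None  # unknown, not False
assert identical_to(None, None) is True
assert identical_to(None, 1) is False
```

This is why `a == None` must compile to the null-safe form: plain `Equals` against `NULL` never returns `True`.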
Rebasing this now ...
force-pushed 19f9918 to ff5855e
Closing in favor of #7796
Opening this so I can refer to a big ol' diff in the browser, and in case anyone wants to start taking a look.

Still very much in-progress / incomplete:

- Maps are mostly not working
- Arrays are largely broken
- Aggregates are definitely broken
- Most other stuff is mostly working. There are a couple of annoying dtype mismatches that are currently failing a bunch of the tests (`NaN` where we want `None`, etc.).
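The `NaN`-vs-`None` mismatch mentioned above is easy to reproduce in plain Python: `NaN` is a float value that never compares equal, even to itself, while `None` is an absent value. A generic illustration, not tied to any particular failing test:

```python
import math

nan = float("nan")

# NaN is a real float, and it never compares equal -- not even to itself.
assert isinstance(nan, float)
assert nan != nan
assert math.isnan(nan)

# None is the absence of a value; it is a different object entirely.
assert None is not nan
assert None != nan

# So a result cell holding NaN never matches an expected cell holding
# None, which is exactly the kind of value mismatch that fails tests.
expected, got = None, nan
assert expected != got
```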