Add support to SQL parser for unnest function. #5214

Closed
wants to merge 13 commits into from

@vincev vincev commented Feb 7, 2023

Which issue does this PR close?

This PR introduces initial changes to support calling the unnest function in SQL, as requested in #212.

Rationale for this change

Following #5106, which adds unnest_column to DataFrame, this PR adds support for the following SQL:

DataFusion CLI v18.0.0
❯ create external table shapes stored as parquet location 'nested.parquet';
0 rows in set. Query took 0.007 seconds.
❯ select count(*) from shapes;
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 100000          |
+-----------------+
1 row in set. Query took 0.009 seconds.
❯ select * from shapes limit 10;
+----------+------------------------------------------------------------+--------------------------+
| shape_id | points                                                     | tags                     |
+----------+------------------------------------------------------------+--------------------------+
| 1        | [{"x": -3, "y": -4}, {"x": -3, "y": 6}, {"x": 2, "y": -2}] | [tag1]                   |
| 2        | [{"x": -9, "y": 2}, {"x": -10, "y": -4}]                   |                          |
| 3        | [{"x": -3, "y": 5}, {"x": 2, "y": -1}]                     | [tag4, tag7, tag3]       |
| 4        | [{"x": -2, "y": -10}, {"x": 6, "y": -5}, {"x": 0, "y": 6}] | [tag4]                   |
| 5        | [{"x": -7, "y": -6}, {"x": -10, "y": 7}]                   | [tag5, tag9, tag6]       |
| 6        |                                                            | [tag1, tag6, tag6]       |
| 7        | [{"x": -9, "y": -1}, {"x": -1, "y": -3}]                   | [tag5, tag3]             |
| 8        | [{"x": 8, "y": -7}, {"x": -1, "y": -1}]                    | [tag1, tag7, tag4]       |
| 9        |                                                            | [tag4, tag1, tag9, tag4] |
| 10       |                                                            | [tag2, tag1]             |
+----------+------------------------------------------------------------+--------------------------+
10 rows in set. Query took 0.008 seconds.
❯ select shape_id, unnest(tags) from shapes limit 10;
+----------+------+
| shape_id | tags |
+----------+------+
| 1        | tag1 |
| 2        |      |
| 3        | tag4 |
| 3        | tag7 |
| 3        | tag3 |
| 4        | tag4 |
| 5        | tag5 |
| 5        | tag9 |
| 5        | tag6 |
| 6        | tag1 |
+----------+------+
10 rows in set. Query took 0.039 seconds.
❯ select shape_id, unnest(points) from shapes limit 10;
+----------+---------------------+
| shape_id | points              |
+----------+---------------------+
| 1        | {"x": -3, "y": -4}  |
| 1        | {"x": -3, "y": 6}   |
| 1        | {"x": 2, "y": -2}   |
| 2        | {"x": -9, "y": 2}   |
| 2        | {"x": -10, "y": -4} |
| 3        | {"x": -3, "y": 5}   |
| 3        | {"x": 2, "y": -1}   |
| 4        | {"x": -2, "y": -10} |
| 4        | {"x": 6, "y": -5}   |
| 4        | {"x": 0, "y": 6}    |
+----------+---------------------+
10 rows in set. Query took 0.070 seconds.
❯ select count(tags) from shapes;
+--------------------+
| COUNT(shapes.tags) |
+--------------------+
| 80273              |
+--------------------+
1 row in set. Query took 0.022 seconds.
❯ select count(unnest(tags)) from shapes;
+--------------------+
| COUNT(shapes.tags) |
+--------------------+
| 200863             |
+--------------------+
1 row in set. Query took 0.036 seconds.
❯ select count(distinct(unnest(tags))) from shapes;
+-----------------------------+
| COUNT(DISTINCT shapes.tags) |
+-----------------------------+
| 9                           |
+-----------------------------+
1 row in set. Query took 0.038 seconds.
❯ select shape_id, unnest(tags), unnest(points) from shapes where shape_id < 10 limit 10;
+----------+------+---------------------+
| shape_id | tags | points              |
+----------+------+---------------------+
| 1        | tag1 | {"x": -3, "y": -4}  |
| 1        | tag1 | {"x": -3, "y": 6}   |
| 1        | tag1 | {"x": 2, "y": -2}   |
| 2        |      | {"x": -9, "y": 2}   |
| 2        |      | {"x": -10, "y": -4} |
| 3        | tag4 | {"x": -3, "y": 5}   |
| 3        | tag4 | {"x": 2, "y": -1}   |
| 3        | tag7 | {"x": -3, "y": 5}   |
| 3        | tag7 | {"x": 2, "y": -1}   |
| 3        | tag3 | {"x": -3, "y": 5}   |
+----------+------+---------------------+
10 rows in set. Query took 0.045 seconds.
❯ select unnest(tags) as tag, count(*) from shapes group by tag; 
+------+-----------------+
| tag  | COUNT(UInt8(1)) |
+------+-----------------+
| tag6 | 22357           |
| tag7 | 22597           |
| tag5 | 22182           |
| tag9 | 22401           |
| tag4 | 22165           |
| tag3 | 22446           |
|      | 19727           |
| tag8 | 22307           |
| tag1 | 22529           |
| tag2 | 21879           |
+------+-----------------+
10 rows in set. Query took 0.040 seconds.
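
The row semantics visible in the transcript above can be summarised in a small, engine-independent sketch. This is plain Rust over in-memory vectors, not DataFusion code: the `Row` struct and the `expand`, `unnest_tags` and `unnest_tags_and_points` functions are hypothetical names chosen for illustration. It assumes, as the output suggests, that a null list yields a single row with a null value and that unnesting two columns yields the per-row cross product.

```rust
/// One input row with a nullable list per nested column.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    shape_id: u32,
    tags: Option<Vec<String>>,
    points: Option<Vec<(i32, i32)>>,
}

/// Expands a nullable list into its elements.
/// A null list becomes a single null element, matching shape_id 2 in the
/// transcript. An empty list yields no elements here; the transcript does
/// not show that case, so this is an assumption.
fn expand<T: Clone>(list: &Option<Vec<T>>) -> Vec<Option<T>> {
    match list {
        None => vec![None],
        Some(items) => items.iter().cloned().map(Some).collect(),
    }
}

/// Sketch of `select shape_id, unnest(tags) from shapes`.
fn unnest_tags(rows: &[Row]) -> Vec<(u32, Option<String>)> {
    rows.iter()
        .flat_map(|r| expand(&r.tags).into_iter().map(move |t| (r.shape_id, t)))
        .collect()
}

/// Sketch of `select shape_id, unnest(tags), unnest(points) from shapes`:
/// every tag of a row is paired with every point of the same row.
fn unnest_tags_and_points(
    rows: &[Row],
) -> Vec<(u32, Option<String>, Option<(i32, i32)>)> {
    let mut out = Vec::new();
    for r in rows {
        let points = expand(&r.points);
        for tag in expand(&r.tags) {
            for point in &points {
                out.push((r.shape_id, tag.clone(), *point));
            }
        }
    }
    out
}
```

Feeding it shapes 2 and 3 from the sample data gives two rows with a null tag for shape 2 and six rows (three tags times two points) for shape 3, in the same order as the transcript.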

What changes are included in this PR?

This PR adds code to extract the unnest function call from the SQL AST and create a logical plan for it.
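
As a rough illustration of what extracting an unnest call from an expression tree can look like, here is a minimal sketch over a toy AST. It is not the code in this PR and does not use the sqlparser or DataFusion types: `Expr` and `extract_unnest` are hypothetical. The idea it assumes is that each `unnest(col)` call is replaced by a plain reference to `col`, while the column name is recorded so that a planner could place an unnest step for it below the projection or aggregate. That would be consistent with `count(unnest(tags))` being displayed as `COUNT(shapes.tags)` in the output above.

```rust
/// A toy expression tree; hypothetical, much smaller than a real SQL AST.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Function { name: String, args: Vec<Expr> },
}

/// Rewrites `expr`, replacing each `unnest(col)` call with `col` and
/// recording `col` in `unnested` (without duplicates).
/// Calls whose argument is not a plain column are left in place.
fn extract_unnest(expr: Expr, unnested: &mut Vec<String>) -> Expr {
    match expr {
        Expr::Column(name) => Expr::Column(name),
        Expr::Function { name, args } => {
            let is_unnest = name.eq_ignore_ascii_case("unnest") && args.len() == 1;
            // Rewrite the arguments first so nested calls are handled too.
            let mut args: Vec<Expr> = args
                .into_iter()
                .map(|arg| extract_unnest(arg, unnested))
                .collect();
            if is_unnest {
                if let Expr::Column(col) = &args[0] {
                    if !unnested.contains(col) {
                        unnested.push(col.clone());
                    }
                    // Drop the unnest wrapper and keep the column reference.
                    return args.remove(0);
                }
            }
            Expr::Function { name, args }
        }
    }
}
```

For `count(unnest(tags))` this returns `count(tags)` and records `tags` as a column to unnest.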

Are these changes tested?

Added tests here.

Are there any user-facing changes?

Yes, this adds the unnest function to SQL.

@github-actions github-actions bot added logical-expr Logical plan and expressions sql SQL Planner labels Feb 7, 2023
@github-actions github-actions bot removed the logical-expr Logical plan and expressions label Feb 9, 2023
@github-actions github-actions bot removed the sql SQL Planner label Feb 9, 2023
@github-actions github-actions bot added the sql SQL Planner label Feb 9, 2023
@github-actions github-actions bot added the core Core DataFusion crate label Feb 13, 2023
@vincev vincev marked this pull request as ready for review February 13, 2023 15:03

vincev commented Feb 13, 2023

I just found a problem while adding more tests. If I run the following:

let ctx = SessionContext::new();
ctx.sql("create external table shapes stored as parquet location 'nested.parquet'").await?;
ctx.sql("select count(unnest(tags)) from shapes").await?.show().await?;

I get the correct output:

+--------------------+
| COUNT(shapes.tags) |
+--------------------+
| 8                  |
+--------------------+

but if I run the same query with register_parquet:

let ctx = SessionContext::new();
ctx.register_parquet("shapes", "nested.parquet", ParquetReadOptions::default()).await?;
ctx.sql("select count(unnest(tags)) from shapes").await?.show().await?;

the UnnestExec somehow gets optimized away (its execute method is never called) and I get this:

+--------------------+
| COUNT(shapes.tags) |
+--------------------+
| 4                  |
+--------------------+

I'll investigate.
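
One way to read the two numbers: if the unnest step is skipped, the aggregate counts non-null lists instead of list elements. The sketch below is plain Rust, not DataFusion code, and `count_lists` and `count_unnested` are hypothetical helpers. The input is the `tags` column of the first five rows printed in the PR description; that these are the rows in the test file is an assumption, although they happen to give the same two counts.

```rust
/// COUNT over the list column itself: counts rows whose list is not null.
fn count_lists(tags: &[Option<Vec<&str>>]) -> usize {
    tags.iter().filter(|list| list.is_some()).count()
}

/// COUNT over the unnested column: counts the individual list elements.
/// Null lists contribute nothing, since COUNT ignores null values.
fn count_unnested(tags: &[Option<Vec<&str>>]) -> usize {
    tags.iter().flatten().map(|list| list.len()).sum()
}

/// The `tags` column of shapes 1 to 5 as printed in the description.
fn sample_tags() -> Vec<Option<Vec<&'static str>>> {
    vec![
        Some(vec!["tag1"]),
        None,
        Some(vec!["tag4", "tag7", "tag3"]),
        Some(vec!["tag4"]),
        Some(vec!["tag5", "tag9", "tag6"]),
    ]
}
```

On this input `count_unnested` gives 8 and `count_lists` gives 4, which is the pattern one would expect if the second query evaluates `count(tags)` without the unnest step.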


vincev commented Feb 13, 2023

I'm going to close this PR: I have found more problems with WHERE conditions on unnested columns.

@vincev vincev closed this Feb 13, 2023
@vincev vincev deleted the unnest branch February 13, 2023 21:32