Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for SQL explode / unnest function #212

Closed
andygrove opened this issue Apr 27, 2021 · 4 comments · Fixed by #10044
Closed

Add support for SQL explode / unnest function #212

andygrove opened this issue Apr 27, 2021 · 4 comments · Fixed by #10044
Labels
datafusion Changes in the datafusion crate enhancement New feature or request sql SQL Planner

Comments

@andygrove
Copy link
Member

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I would like to be able to expand an array into rows.

Describe the solution you'd like
See unnest in https://www.postgresql.org/docs/current/functions-array.html

Describe alternatives you've considered
None

Additional context
None

@andygrove andygrove added enhancement New feature or request datafusion Changes in the datafusion crate sql SQL Planner labels Apr 27, 2021
This was referenced Jan 18, 2022
@mildbyte
Copy link
Contributor

mildbyte commented Oct 6, 2022

Is there any prior art / existing design documents for this either in the DataFusion repo or an application that uses DataFusion? We're interested in getting this working for splitgraph/seafowl#137. In particular:

  • a json_each()/json_array_elements() function has to be set-returning
  • we could implement user-defined set-returning functions: Set-returning UDFs #1604
  • we could also go the DuckDB route of adding a function that turns JSON into an Arrow List and then relying on UNNEST
    • it'll still require implementing UNNEST, but then everything can be implemented as a List-returning function + UNNEST

We could take a stab at implementing this, either as a PR to DataFusion or as a separate crate with planner nodes/optimizer rules.

I did some digging into how PostgreSQL implements this and it seems like there are two positions for a set-returning function/UNNEST (which is implemented as a set-returning function in PG):

Inside the FROM clause

postgres=# WITH t AS (SELECT * FROM (VALUES 
    (1, '{11,12,13}'::int[]),
    (2, '{21,22,23}'::int[])) t(k, ks))
SELECT * FROM t, UNNEST(t.ks);

 k |     ks     | unnest 
---+------------+--------
 1 | {11,12,13} |     11
 1 | {11,12,13} |     12
 1 | {11,12,13} |     13
 2 | {21,22,23} |     21
 2 | {21,22,23} |     22
 2 | {21,22,23} |     23
(6 rows)

                             QUERY PLAN                             
--------------------------------------------------------------------
 Nested Loop  (cost=0.00..0.43 rows=20 width=40)
   ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=36)
   ->  Function Scan on unnest  (cost=0.00..0.10 rows=10 width=4)

In this case it's a NestLoop that grabs values from the outer loop and passes them as parameters to the FuncScan.

As an expression

postgres=# WITH t AS (SELECT * FROM (VALUES 
    (1, '{11,12,13}'::int[]),
    (2, '{21,22,23}'::int[])) t(k, ks))
SELECT t.*, UNNEST(t.ks) FROM t;

                             QUERY PLAN                             
--------------------------------------------------------------------
 ProjectSet  (cost=0.00..0.14 rows=20 width=40)
   ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=36)
(2 rows)

This turns it into a ProjectSet which returns a product {k: 1, ks: [11, 12, 13], unnest: UNNEST([11, 12, 13])} -> {k: 1, ks: [11, 12, 13], unnest: 11}, ..., {k: 1, ks: [11, 12, 13], unnest: 13}, ...

The way to go?

The NestLoop + FuncScan way requires parameterized plans, which don't yet seem to be supported in DataFusion (from inspecting the ExecutionPlan trait). I also haven't wrapped my head around the transformations we'd need to write to wire the whole thing up. It's possible there's some work that's been done with #2248 that makes this easier to build?

The ProjectSet way is easier to figure out. At plan time, we'd inspect every Projection node for calls to functions that return a List and are somehow tagged as set-returning. then we'd replace the Projection with a ProjectSet that "unnests" the List, e.g.

SELECT t.k, UNNEST(t.ks) ...

becomes

Projection([Column("t", "k"), Unnest(Column("t", "ks"))])

becomes

ProjectSet(scalarValues=[Column("k")], toUnnest=[Column("t", "ks")])

Something like

-- assume t.ks is a string that represents a JSON array 
-- assume the implementation for json_array_elements has the type
--   Utf8 -> List[Utf8]
SELECT t.k, json_array_elements(t.ks) ...

would become

Projection([Column("t", "k"), SetReturningUDF(JsonArrayElements, Column("t", "ks"))])

or we could do

Projection([Column("t", "k"), Unnest(ScalarUDF(JsonArrayElements, Column("t", "ks")))])

which would become

ProjectSet(scalarValues=[Column("k")], toUnnest=[ScalarUDF(JsonArrayElements, Column("t", "ks"))]])

We could also modify the Projection node itself, but this way we'll possibly be able to implement it faster as a separate opt-in DataFusion extension (via custom planner nodes and optimizer rules).

This won't let us place set-returning functions in the FROM clause, but I think the user will be able to emulate it via CTEs:

SELECT
  t.*, json_extract_path(a, '{some, path}')
FROM t, json_array_elements(t.ks) a

-- becomes

WITH unnested AS (
  SELECT t.*, json_array_elements(t.ks) AS a
  FROM t
)
SELECT unnested.*, json_extract_path(unnested.a, '{some, path}')
FROM unnested

Any thoughts/advice?

@ghost
Copy link

ghost commented Nov 7, 2022

This feature would really make this awesome project more usable. Are there any plans to add it?

@kesavkolla
Copy link

Yes really looking forward for this feature.

@vincev
Copy link
Contributor

vincev commented Feb 2, 2023

I have created PR #5106 that adds support for unnest to DataFrame, some test on parquet files with 1.9M rows that unnest to 15M:

Data set rows: 1941258 elapsed: 0.443
Dataset unnested rows: 15939196 elapsed: 0.799

I can try to add support to the SQL parser in another PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate enhancement New feature or request sql SQL Planner
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants