Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support conditional joins in cudf-polars #16926

Open
wence- opened this issue Sep 26, 2024 · 2 comments
Open

[FEA] Support conditional joins in cudf-polars #16926

wence- opened this issue Sep 26, 2024 · 2 comments
Labels
cudf.polars Issues specific to cudf.polars feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@wence-
Copy link
Contributor

wence- commented Sep 26, 2024

Is your feature request related to a problem? Please describe.

Recently, polars gained support for a restricted class of conditional joins. Specifically those where the conditions are combinations of equality conditions and inequality conditions.

As a consequence the query optimiser in polars now turns queries like left.join(right, on="a").filter("b" < "c") into a conditional join if b and c are not from the same table (if they are from the same table then this is rewritten as (say) left.filter("b" < "c").join(right, on="a"): this one we handle fine). There is also a new join_where primitive that directly allows user to write out conditional joins.

For us, this means that queries that might previously have run are currently not supported (because we don't have handlers for the new inequality join node (called IEJoin).

Describe the solution you'd like

Libcudf has a general conditional join implementation that takes an AST expression that is applied to determine if a pair of rows should be kept in the join output. It should be possible to run these inequality joins using that implementation.

However, since we know that we only need to support equality and inequality (for which the key columns must admit a total order) it should be possible to do something smarter by sorting the inequality columns to match and then intersecting the ranges that the inequality conditions impose. For multiple such conditions the ranges describe masks that must be permuted through to the original order (since we must sort each column pair individually), but it (in my head) seems doable in a way that isn't naively $\mathcal{O}(n m)$ when the left table has $n$ rows and the right has $m$.

Describe alternatives you've considered

Expose the new IR node in polars and rewrite it in translation into a combination of cross joins + filters, which potentially blows up the memory requirement a lot.

@wence- wence- added feature request New feature or request cudf.polars Issues specific to cudf.polars libcudf Affects libcudf (C++/CUDA) code. labels Sep 26, 2024
@TNieuwdorp
Copy link

TNieuwdorp commented Oct 9, 2024

Until this is resolved, when using the GPU implementation of Polars, it must remain pinned < 1.8.2.

@wence-
Copy link
Contributor Author

wence- commented Oct 9, 2024

We can start to support this with polars 1.9.1 (which will include pola-rs/polars#19104). There is a first cut pass at doing so the "dumb" way (without exposing libcudfs mixed/conditional join implementation to python) in #17000

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf.polars Issues specific to cudf.polars feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

No branches or pull requests

2 participants