Add more JOIN strategies #14518

Open
Jackie-Jiang opened this issue Nov 22, 2024 · 3 comments
Labels
feature request multi-stage Related to the multi-stage query engine

Comments

@Jackie-Jiang
Contributor

Currently we support 4 JOIN algorithms:

  1. Hash: hash-partition both sides on the join key and shuffle matching partitions to the same worker to join
  2. Random left + Broadcast right: for non-equi joins, randomly distribute the left side and broadcast the right side to every worker
  3. Colocated: when the left and right tables are partitioned the same way, join locally without shuffling
  4. Lookup: when the right table is a dimension table and the join is an equality join on the right table's primary key, join locally by probing the dimension table's lookup table
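
For readers less familiar with the shuffle cost being discussed, here is a minimal Python sketch of strategy 1 (hash join). It is an illustration only, not Pinot code: the function name `hash_shuffle_join`, the dict-based rows, and the `num_workers` parameter are all assumptions made for the example. The point it shows is that both sides are moved before any joining happens.

```python
from collections import defaultdict


def hash_shuffle_join(left_rows, right_rows, key, num_workers):
    """Illustrative only: route BOTH sides by hash(key) so matching rows meet on one worker."""
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)

    # "Shuffle": every row of both inputs is sent to the worker that owns its key.
    for row in left_rows:
        left_parts[hash(row[key]) % num_workers].append(row)
    for row in right_rows:
        right_parts[hash(row[key]) % num_workers].append(row)

    joined = []
    for worker in range(num_workers):
        # Each worker builds a hash table on its right partition
        # and probes it with its left partition.
        table = defaultdict(list)
        for r in right_parts[worker]:
            table[r[key]].append(r)
        for l in left_parts[worker]:
            for r in table.get(l[key], []):
                joined.append({**r, **l})
    return joined
```

Colocated and lookup joins skip the two routing loops entirely, which is why they avoid the shuffle.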

Strategies 3 and 4 both avoid data shuffling. We want to add more strategies that reduce shuffling:

  1. When the right table is fully replicated to all servers (similar to a dimension table, but not necessarily with a lookup table), join locally
  2. Broadcast the right table and join on the left table's local worker
  3. When the left table is partitioned and the join is on the partition key, partition the right table the same way and join on the left table's local worker
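
A rough Python sketch of proposed strategies 2 and 3, under the assumption that the left side is already grouped by worker and never moves. All names here (`local_hash_join`, `broadcast_join`, `partition_right_join`, `partition_fn`) are hypothetical and chosen for illustration; they do not correspond to Pinot classes or APIs.

```python
from collections import defaultdict


def local_hash_join(left_rows, right_rows, key):
    """Join two row lists that already live on the same worker."""
    table = defaultdict(list)
    for r in right_rows:
        table[r[key]].append(r)
    return [{**r, **l} for l in left_rows for r in table.get(l[key], [])]


def broadcast_join(left_by_worker, right_rows, key):
    """Proposed strategy 2 (sketch): each left worker receives the full right table."""
    out = []
    for left_rows in left_by_worker.values():
        out.extend(local_hash_join(left_rows, right_rows, key))
    return out


def partition_right_join(left_by_worker, right_rows, key, partition_fn):
    """Proposed strategy 3 (sketch): the left side is assumed to be partitioned by
    partition_fn(key) already, so only the right side is routed."""
    right_by_worker = defaultdict(list)
    for r in right_rows:
        right_by_worker[partition_fn(r[key])].append(r)
    out = []
    for worker, left_rows in left_by_worker.items():
        out.extend(local_hash_join(left_rows, right_by_worker.get(worker, []), key))
    return out
```

In both functions the left rows stay where they are. The difference is how much of the right side each worker receives: a full copy under broadcast, or only the matching partition when the join key is the left table's partition key.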
@Jackie-Jiang Jackie-Jiang added multi-stage Related to the multi-stage query engine feature request labels Nov 22, 2024
@Jackie-Jiang
Contributor Author

Jackie-Jiang commented Nov 22, 2024

cc @gortiz @yashmayya @bziobrowski

@gortiz
Contributor

gortiz commented Jan 16, 2025

This PR may be a useful source of inspiration for how to implement strategies 1 and 3

@Jackie-Jiang
Contributor Author

1 (local replicated) is supported with #14893
2 (broadcast) is supported with #14797
