@Tmonster Tmonster commented Jan 21, 2025

This PR adds support for left join reordering to the join order optimizer.

This PR includes the following pieces of logic:

  1. Removing unnecessary projections. This is needed because queries like (Select * from (select * from t1, t2) left join (select * from t3, t4) on (a = d)) have projections above the left and right subqueries. Currently, the Join Order Optimizer treats projections as standalone relations and optimizes their children in isolation from the rest of the plan. In cases like this, however, the projection is not necessary and is only added as part of the process of planning a subquery.
  2. Different estimation logic for Left Joins. Right now our cardinality estimator supports inner joins, which is a relatively simple estimation. However, given that the cardinality of a table resulting from many joins is the same regardless of the join order, we would like that to hold true for estimates as well. Finding an equation that respects this property took some time. The formula to get this number is below
  3. A refactor of how filters are extracted from Logical filters/Joins. This logic was sort of in two places before (query_graph_manager and relation_manager). Now it is all in relation_manager.cpp, and is a bit more readable (in my opinion).
  4. The actual logic to add the left join. After reading the paper On the correct and complete enumeration of the core search space, I gathered that left joins should be treated the same as Semi and Anti joins. This means any join happening in the RHS of the Left Join cannot be pulled out of the RHS. Likewise, any join happening outside the RHS of the Left Join cannot be pushed into the Left Join. Another rule that must be followed (and that makes LEFT different from SEMI/ANTI) is that any filter operating on a column of the RHS of a left join after the join must not be pushed into the left join. To enforce these rules, once a left join is extracted, all previous filters are visited and checked to see if they use a relation from the RHS of the left join. If yes, then all LHS relations of the left join are required in order to apply the filter.
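
To make the filter rule in item 4 concrete, here is a small Python sketch. It is not the code in this PR: the `left_join` helper, the `Filter` class, and `add_left_join_constraints` are hypothetical names used only for illustration, and the tables are made-up toy data. The first half shows why a filter on an RHS column cannot be pushed into the left join (the result changes), and the second half shows one way the "require all LHS relations" rule could be expressed on sets of relation names.

```python
from dataclasses import dataclass


def left_join(left, right, key_l, key_r):
    """Naive LEFT JOIN over lists of dicts (illustration only).

    Left rows without a match are kept, with None for every right column.
    """
    right_cols = sorted({c for r in right for c in r})
    out = []
    for l_row in left:
        matches = [r for r in right if r[key_r] == l_row[key_l]]
        if matches:
            out.extend({**l_row, **r} for r in matches)
        else:
            out.append({**l_row, **{c: None for c in right_cols}})
    return out


# Toy data (assumption): t1 is the LHS, t3 is the RHS of the left join.
t1 = [{"a": 1}, {"a": 2}]
t3 = [{"d": 1, "x": 10}, {"d": 2, "x": 99}]

# Filter applied after the left join, as written in the query.
after = [row for row in left_join(t1, t3, "a", "d") if row["x"] == 10]

# Filter wrongly pushed into the RHS before the join: the row a = 2
# survives with NULLs, so the result differs from `after`.
pushed = left_join(t1, [r for r in t3 if r["x"] == 10], "a", "d")


@dataclass
class Filter:
    """Hypothetical stand-in for an extracted filter."""
    name: str
    required: set  # relations that must be joined before the filter applies


def add_left_join_constraints(filters, lhs_relations, rhs_relations):
    """Widen the requirements of filters that touch the left join's RHS.

    Any filter using an RHS relation additionally requires every LHS
    relation, so it can only be applied once the left join is complete.
    """
    for f in filters:
        if f.required & rhs_relations:
            f.required |= lhs_relations
    return filters


filters = [Filter("t3.x = 10", {"t3"}), Filter("t1.a > 0", {"t1"})]
add_left_join_constraints(filters, {"t1", "t2"}, {"t3", "t4"})
```

After the call, the filter on `t3.x` requires `t1`, `t2` and `t3`, while the filter on `t1.a` is unchanged and can still be applied early.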

I've investigated the regression on the realnest benchmark and I don't think it's a huge issue. The benchmark performs a number of left joins on rowid, which is not the best way to join many nested tables. In addition, the realnest benchmark is mostly there to test our performance on nested data structures. I think tpcds/tpch are better indicators of how well we reorder left joins.

I ran tpcds on this branch at sf100, and two queries that stood out were q40 and q80:

benchmark/tpcds/sf1/q40.benchmark
Old timing: 0.106831
New timing: 0.025253

benchmark/tpcds/sf1/q80.benchmark
Old timing: 0.804367
New timing: 0.46505
