Conversation
@knassre-bodo (Contributor) commented Aug 19, 2025

Adds an optimization pass that identifies cases where a join occurs after an aggregation, the join is on the aggregation keys, and placing the join before the aggregation would reduce the number of rows fed into the aggregation; when these conditions hold, the pass swaps the two:

  • Only allowed on a semi join (if the aggregation is on the left) or an inner join (if either side is the aggregation)
    • Also allowed for left joins (if the aggregation is on the right) with special circumstances (referred to as the left join case)
  • Only allowed if the cardinality from the aggregate to the other side is singular & filtering
  • All columns from the non-aggregation side are passed through via ANY_VALUE (except equijoin keys, which are just replaced with the equivalent join key from the aggregation side, which should be one of the aggregation keys).
  • If the left join case occurs, then COUNT(*) is replaced with COUNT(x) (where x is one of the equi-join keys from the right-hand side, aborting the attempt if there is no such key).
  • Similarly, COUNT(x) calls have an additional KEEP_IF(..., y != 0) post-processing step, where y is a sentinel column to determine whether there was a match or not from the left join. If there was a COUNT(*) rewritten as COUNT(x), then that column is used as y. Otherwise, a new COUNT(*) is inserted in the aggregation and used as y.
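As an illustration of the transpose (a sketch, not the PyDough implementation), the before/after plan shapes can be checked in SQLite with made-up orders/customers tables; MAX stands in for ANY_VALUE, which SQLite lacks:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE orders (cust_id INTEGER, amount INTEGER)")
cur.execute("CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, region TEXT)")
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 10), (1, 20), (2, 5), (3, 7), (3, 8)])
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "EU"), (2, "US")])  # cust_id 3 has no match

# Original shape: aggregate first, then inner-join on the grouping key.
before = cur.execute("""
    SELECT a.cust_id, c.region, a.total
    FROM (SELECT cust_id, SUM(amount) AS total
          FROM orders GROUP BY cust_id) AS a
    JOIN customers AS c ON a.cust_id = c.cust_id
    ORDER BY a.cust_id
""").fetchall()

# Transposed shape: join first (filtering rows early), then aggregate;
# the non-aggregation column `region` is passed through an aggregate call
# (MAX here, standing in for ANY_VALUE).
after = cur.execute("""
    SELECT o.cust_id, MAX(c.region) AS region, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON o.cust_id = c.cust_id
    GROUP BY o.cust_id
    ORDER BY o.cust_id
""").fetchall()

assert before == after
```

Because the join is filtering (customer 3 has no match), the transposed plan aggregates fewer rows while producing the same result.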

Also added some new simplification rules to the SQLGlot simplifier override file to account for newly generated patterns:

  • CASE WHEN x != y THEN x ELSE NULL END -> NULLIF(x, y)
  • COALESCE(NULLIF(x, y), y) -> COALESCE(x, y)
  • COALESCE(NULLIF(x, y), z) -> CASE WHEN x = y THEN z ELSE x END
  • SUM(NULLIF(x, 0)) -> SUM(x)
  • COALESCE(COUNT(x)) -> COUNT(x)
  • COALESCE(COUNT_IF(x)) -> COUNT_IF(x)
  • NULLIF(COALESCE(x, y), y) -> NULLIF(x, y)
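A quick sanity check (a sketch evaluated via SQLite, not the actual simplifier code) that three of these rewrites are value-preserving. Note the CASE -> NULLIF rule is only an exact match when y cannot be NULL (e.g. a literal), which the simplifier presumably guards for, so y is kept non-NULL here:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# (x, y) pairs, covering NULL, equal, unequal, and zero inputs for x.
cases = [(None, 5), (5, 5), (3, 5), (0, 5)]
results = []
for x, y in cases:
    # Each pair of columns is (rewritten form, original form).
    row = con.execute(
        "SELECT CASE WHEN ? != ? THEN ? ELSE NULL END, NULLIF(?, ?),"
        "       COALESCE(NULLIF(?, ?), ?),             COALESCE(?, ?),"
        "       NULLIF(COALESCE(?, ?), ?),             NULLIF(?, ?)",
        (x, y, x, x, y, x, y, y, x, y, x, y, y, x, y),
    ).fetchone()
    results.append(row)

for a1, b1, a2, b2, a3, b3 in results:
    assert a1 == b1 and a2 == b2 and a3 == b3
```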

The net result of all of these changes is the following:

  • Plans will tend to be slightly more compact, and closer to how a human would write them
  • Performance of queries will sometimes change, most often with a slight performance gain (though occasionally with a slight performance regression)

A side effect of these changes is that the extra ANY_VALUE calls are sometimes unnecessary, since those columns could instead be replaced with the grouping keys. A follow-up may be warranted to change how uniqueness information is propagated so that column pruning can make this substitution when a grouping key is unused but one of the MIN / MAX / ANY_VALUE columns is also unique.


@knassre-bodo changed the title from "WIP: Join-Aggregate transpose" to "Adding relational optimization for Join-Aggregate transpose" Oct 23, 2025
@knassre-bodo marked this pull request as ready for review October 23, 2025 02:15
@knassre-bodo requested review from a team, hadia206, john-sanchez31 and juankx-bodo and removed request for a team October 23, 2025 02:15
Comment on lines -626 to +632
-for i in (138841, 36091, 54952, 103768, 46081)
+for i in (6434, 45280, 60493, 87616, 132775)
 ],
-"n_orders": [21, 20, 19, 19, 17],
+"n_orders": [2, 2, 2, 2, 2],
}
),
"common_prefix_y",
order_sensitive=True,
Contributor (author):
Changed the test slightly so it isn't as time-consuming to compute.

@john-sanchez31 (Contributor) left a comment:
Good job Kian! Just some type hint reminders and a suggestion below

"""
if base not in used_names:
return base
i = 0
Contributor:

Check type hints on i and name

i += 1
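For reference, a hedged reconstruction of the quoted name-generation helper with the type hints in question; the actual signature and suffix scheme in the PR may differ:

```python
def generate_name(base: str, used_names: set[str]) -> str:
    """Return `base`, or `base_<i>` for the first i that is unused."""
    if base not in used_names:
        return base
    i: int = 0
    name: str = f"{base}_{i}"
    # Bump the counter until the candidate name is free.
    while name in used_names:
        i += 1
        name = f"{base}_{i}"
    return name
```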

def join_aggregate_transpose(
self, join: Join, aggregate: Aggregate, is_left: bool
Contributor:

Can we change the name of the parameter is_left to is_left_agg to make it clear we're talking about the aggregation and not the join.

join.cardinality if is_left else join.reverse_cardinality
)

left_join_case = (
Contributor:

Type hint missing


# Extract the join key references from both sides of the join in the
# order they appear in the join condition.
agg_key_refs, non_agg_key_refs = extract_equijoin_keys(join)
Contributor:

Missing type hints

Contributor (author):

Parallel assignments like these can't have type hints, but I can pre-declare them above
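A minimal illustration of the pre-declaration the author describes (hypothetical names): tuple-unpacking assignments cannot carry inline annotations, but the targets can be annotated on their own lines first:

```python
def split_pairs(pairs: list[tuple[int, str]]) -> tuple[list[int], list[str]]:
    # Pre-declared annotations; the parallel assignment below cannot
    # carry them inline.
    keys: list[int]
    labels: list[str]
    keys, labels = [p[0] for p in pairs], [p[1] for p in pairs]
    return keys, labels
```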

# `col` is one of the aggregation keys. If this is not possible, then
# abort. Also abort if any of the aggregation keys are not used as
# equi-join keys.
sentinel_column: RelationalExpression | None = None
Contributor:

What is a sentinel column?

Contributor (author):

In this case, the sentinel column is a COUNT(*) expression used to know when the number of rows in the group is actually zero, since in the left join transform case we need to know when the RHS has no matches.
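A small SQLite sketch (hypothetical lhs/rhs tables, not project code) of why this sentinel is needed: after the left join, an unmatched group still produces one all-NULL row, so COUNT(*) counts it, while COUNT of a right-hand key yields 0 and distinguishes "no match":

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE lhs (k INTEGER)")
con.execute("CREATE TABLE rhs (k INTEGER)")
con.executemany("INSERT INTO lhs VALUES (?)", [(1,), (2,)])
con.executemany("INSERT INTO rhs VALUES (?)", [(1,), (1,)])  # k=2 unmatched

rows = con.execute("""
    SELECT l.k, COUNT(*), COUNT(r.k)
    FROM lhs AS l LEFT JOIN rhs AS r ON l.k = r.k
    GROUP BY l.k ORDER BY l.k
""").fetchall()
# k=1 matches twice; k=2 has no match but still yields one all-NULL row,
# so COUNT(*) = 1 while COUNT(r.k) = 0 -- the sentinel tells them apart.
assert rows == [(1, 2, 2), (2, 1, 0)]
```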

break
else:
# Otherwise, create a new COUNT(*) column for that purpose.
agg_name = self.generate_name("n_rows", aggregate.columns)
Contributor:

Missing type hint

@knassre-bodo (author), Nov 24, 2025:

The type hint is further up in the function, as a pre-declaration (since agg_name is a variable that gets used several times)

node = simplify_concat(node)
node = simplify_conditionals(node)

# PyDough Change: new pre-order transformations
Contributor:

Are these functions overridden from sqlglot or PyDough specific functions?

@knassre-bodo (author), Oct 30, 2025:

Brand new rules we created, but using SQLGlot functionalities to operate on their AST (basically, extending the patterns available in their simplifier).

"type": "simple table",
"table path": "main.sbCustomer",
"unique properties": ["_id"],
"unique properties": ["_id", "name", "email", "address1"],
Contributor:

Why did this change?

Contributor (author):

These were columns that are actually unique, but previously weren't marked as such in the defog metadata. I happened to correct these while making other changes, since I was looking through these.

Contributor:

Not sure about this change. Is this required for a specific test case?
While this set of fields may be unique in the actual data, we generally can't assume the same for 'real' databases where data is changing. I think simulating live database behavior with the defog datasets would be better for testing purposes, as this reflects how they would be used in the real world.

@knassre-bodo (author), Nov 24, 2025:

It was a change I made incidentally. Also, for each database, we should be filling out the metadata based on the actual data for that database, not based on behavior for other databases. As for "real" databases, many of them will have multiple unique columns besides the primary key.

 )
 return Broker.CALCULATE(
-    n_transactions=KEEP_IF(COUNT(selected_txns), COUNT(selected_txns) > 0),
+    n_transactions=KEEP_IF(COUNT(selected_txns), COUNT(selected_txns) != 0),
@john-sanchez31, Oct 24, 2025:

Why do these queries change?

Contributor (author):

Changing other occurrences of this pattern to match for consistency, since COUNT(x) > 0 is the same as COUNT(x) != 0 (since it can never be negative). The difference between the two is that COUNT(x) != 0 can be optimized into NULLIF patterns in SQLGlot, which opens doors for further simplification.
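A quick SQLite check of the cited equivalence (a sketch, not project code): for a non-negative count, cnt > 0, cnt != 0, and the NULLIF form all agree:

```python
import sqlite3

con = sqlite3.connect(":memory:")
for cnt in (0, 1, 7):
    gt, ne, nf = con.execute(
        "SELECT CASE WHEN ? > 0 THEN ? END,"
        "       CASE WHEN ? != 0 THEN ? END,"
        "       NULLIF(?, 0)",
        (cnt, cnt, cnt, cnt, cnt),
    ).fetchone()
    # COUNT results are never negative, so all three expressions agree,
    # and the != form is the one SQLGlot can fold into NULLIF.
    assert gt == ne == nf
```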

"type": "simple table",
"table path": "main.sbCustomer",
"unique properties": ["_id"],
"unique properties": ["_id", "name", "email", "address1"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this change. Is this required for a specific test case?
While this set of fields may be unique in the actual data, we generally can't assume the same for 'real' databases where data is changing. I think simulating live database behavior with the defog datasets would be better for testing purposes, as this reflects how they would be used in the real world.

return expr

lhs = first.args.get("this")
rhs = first.args.get("expression")
Contributor:

Type hints for first, second, lhs & rhs.
Same on rewrite_case_to_nullif() and rewrite_sum_nullif()

@knassre-bodo merged commit 8b30bca into main Nov 24, 2025
12 checks passed
@knassre-bodo deleted the kian/join_aggregate_transpose branch November 24, 2025 18:35