Conversation
@knassre-bodo (Contributor) commented Aug 19, 2025

Adds an optimization pass that identifies cases where a join occurs after an aggregation, the join is on the aggregation keys, and placing the join before the aggregation would reduce the number of rows fed into the aggregation; when these conditions hold, the pass swaps the two:

  • Only allowed on a semi join (if the aggregation is on the left) or an inner join (if either side is the aggregation)
    • Also allowed for left joins (if the aggregation is on the right) with special circumstances (referred to as the left join case)
  • Only allowed if the cardinality from the aggregate to the other side is singular & filtering
  • All columns from the non-aggregation side are passed through via ANY_VALUE (except equijoin keys, which are just replaced with the equivalent join key from the aggregation side, which should be one of the aggregation keys).
  • If the left join case occurs, then COUNT(*) is replaced with COUNT(x) (where x is one of the equi-join keys from the right-hand side, aborting the attempt if there is no such key).
  • Similarly, COUNT(x) calls have an additional KEEP_IF(..., y != 0) post-processing step, where y is a sentinel column to determine whether there was a match or not from the left join. If there was a COUNT(*) rewritten as COUNT(x), then that column is used as y. Otherwise, a new COUNT(*) is inserted in the aggregation and used as y.
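As an illustration of the transpose (a sketch, not the PyDough implementation), the before/after plan shapes can be checked in SQLite with made-up orders/customers tables; MAX stands in for ANY_VALUE, which SQLite lacks:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE orders (cust_id INTEGER, amount INTEGER)")
cur.execute("CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, region TEXT)")
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 10), (1, 20), (2, 5), (3, 7), (3, 8)])
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "EU"), (2, "US")])  # cust_id 3 has no match

# Original shape: aggregate first, then inner-join on the grouping key.
before = cur.execute("""
    SELECT a.cust_id, c.region, a.total
    FROM (SELECT cust_id, SUM(amount) AS total
          FROM orders GROUP BY cust_id) AS a
    JOIN customers AS c ON a.cust_id = c.cust_id
    ORDER BY a.cust_id
""").fetchall()

# Transposed shape: join first (filtering rows early), then aggregate;
# the non-aggregation column `region` is passed through an aggregate call
# (MAX here, standing in for ANY_VALUE).
after = cur.execute("""
    SELECT o.cust_id, MAX(c.region) AS region, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON o.cust_id = c.cust_id
    GROUP BY o.cust_id
    ORDER BY o.cust_id
""").fetchall()

assert before == after
```

Because the join is filtering (customer 3 has no match), the transposed plan aggregates fewer rows while producing the same result.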

Also added some new simplification rules to the SQLGlot simplifier override file to account for newly generated patterns:

  • CASE WHEN x != y THEN x ELSE NULL END -> NULLIF(x, y)
  • COALESCE(NULLIF(x, y), y) -> COALESCE(x, y)
  • COALESCE(NULLIF(x, y), z) -> CASE WHEN x = y THEN z ELSE x END
  • SUM(NULLIF(x, 0)) -> SUM(x)
  • COALESCE(COUNT(x)) -> COUNT(x)
  • COALESCE(COUNT_IF(x)) -> COUNT_IF(x)
  • NULLIF(COALESCE(x, y), y) -> NULLIF(x, y)
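A quick sanity check (a sketch evaluated via SQLite, not the actual simplifier code) that three of these rewrites are value-preserving. Note the CASE -> NULLIF rule is only an exact match when y cannot be NULL (e.g. a literal), which the simplifier presumably guards for, so y is kept non-NULL here:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# (x, y) pairs, covering NULL, equal, unequal, and zero inputs for x.
cases = [(None, 5), (5, 5), (3, 5), (0, 5)]
results = []
for x, y in cases:
    # Each pair of columns is (rewritten form, original form).
    row = con.execute(
        "SELECT CASE WHEN ? != ? THEN ? ELSE NULL END, NULLIF(?, ?),"
        "       COALESCE(NULLIF(?, ?), ?),             COALESCE(?, ?),"
        "       NULLIF(COALESCE(?, ?), ?),             NULLIF(?, ?)",
        (x, y, x, x, y, x, y, y, x, y, x, y, y, x, y),
    ).fetchone()
    results.append(row)

for a1, b1, a2, b2, a3, b3 in results:
    assert a1 == b1 and a2 == b2 and a3 == b3
```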

The net result of all of these changes is the following:

  • Plans will tend to be slightly more compact, and closer to how a human would write them
  • Performance of queries will sometimes change, most often with a slight performance gain (though occasionally with a slight performance regression)

A side effect of these changes is that the extra ANY_VALUE calls are sometimes unnecessary, since those columns could instead be replaced with the grouping keys. A follow-up may be warranted to change how uniqueness information is propagated so that column pruning can make this substitution when a grouping key is unused but one of the MIN / MAX / ANY_VALUE columns is also unique.


@knassre-bodo changed the title from "WIP: Join-Aggregate transpose" to "Adding relational optimization for Join-Aggregate transpose" Oct 23, 2025
@knassre-bodo marked this pull request as ready for review October 23, 2025 02:15
@knassre-bodo requested review from a team, hadia206, john-sanchez31 and juankx-bodo and removed request for a team October 23, 2025 02:15
Comment on lines -626 to +632
-for i in (138841, 36091, 54952, 103768, 46081)
+for i in (6434, 45280, 60493, 87616, 132775)
 ],
-"n_orders": [21, 20, 19, 19, 17],
+"n_orders": [2, 2, 2, 2, 2],
}
),
"common_prefix_y",
order_sensitive=True,
Contributor (author):
Changed the test slightly so it isn't as time-consuming to compute.

@john-sanchez31 (Contributor) left a comment:
Good job Kian! Just some type hint reminders and a suggestion below

"""
if base not in used_names:
return base
i = 0
Contributor:

Check type hints on i and name

i += 1
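For reference, a hedged reconstruction of the quoted name-generation helper with the type hints in question; the actual signature and suffix scheme in the PR may differ:

```python
def generate_name(base: str, used_names: set[str]) -> str:
    """Return `base`, or `base_<i>` for the first i that is unused."""
    if base not in used_names:
        return base
    i: int = 0
    name: str = f"{base}_{i}"
    # Bump the counter until the candidate name is free.
    while name in used_names:
        i += 1
        name = f"{base}_{i}"
    return name
```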

def join_aggregate_transpose(
self, join: Join, aggregate: Aggregate, is_left: bool
Contributor:

Can we change the name of the parameter is_left to is_left_agg to make it clear we're talking about the aggregation and not the join.

join.cardinality if is_left else join.reverse_cardinality
)

left_join_case = (
Contributor:

Type hint missing


# Extract the join key references from both sides of the join in the
# order they appear in the join condition.
agg_key_refs, non_agg_key_refs = extract_equijoin_keys(join)
Contributor:

Missing type hints

Contributor (author):

Parallel assignments like these can't have type hints, but I can pre-declare them above
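A minimal illustration of the pre-declaration the author describes (hypothetical names): tuple-unpacking assignments cannot carry inline annotations, but the targets can be annotated on their own lines first:

```python
def split_pairs(pairs: list[tuple[int, str]]) -> tuple[list[int], list[str]]:
    # Pre-declared annotations; the parallel assignment below cannot
    # carry them inline.
    keys: list[int]
    labels: list[str]
    keys, labels = [p[0] for p in pairs], [p[1] for p in pairs]
    return keys, labels
```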

# `col` is one of the aggregation keys. If this is not possible, then
# abort. Also abort if any of the aggregation keys are not used as
# equi-join keys.
sentinel_column: RelationalExpression | None = None
Contributor:

What is a sentinel column?

Contributor (author):

In this case, the sentinel column is a COUNT(*) expression used to know when the number of rows in the group is actually zero, since in the left join transform case we need to know when the RHS has no matches.
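A small SQLite sketch (hypothetical lhs/rhs tables, not project code) of why this sentinel is needed: after the left join, an unmatched group still produces one all-NULL row, so COUNT(*) counts it, while COUNT of a right-hand key yields 0 and distinguishes "no match":

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE lhs (k INTEGER)")
con.execute("CREATE TABLE rhs (k INTEGER)")
con.executemany("INSERT INTO lhs VALUES (?)", [(1,), (2,)])
con.executemany("INSERT INTO rhs VALUES (?)", [(1,), (1,)])  # k=2 unmatched

rows = con.execute("""
    SELECT l.k, COUNT(*), COUNT(r.k)
    FROM lhs AS l LEFT JOIN rhs AS r ON l.k = r.k
    GROUP BY l.k ORDER BY l.k
""").fetchall()
# k=1 matches twice; k=2 has no match but still yields one all-NULL row,
# so COUNT(*) = 1 while COUNT(r.k) = 0 -- the sentinel tells them apart.
assert rows == [(1, 2, 2), (2, 1, 0)]
```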

break
else:
# Otherwise, create a new COUNT(*) column for that purpose.
agg_name = self.generate_name("n_rows", aggregate.columns)
Contributor:

Missing type hint

@knassre-bodo (author), Nov 24, 2025:

The type hint is further up in the function, as a pre-declaration (since agg_name is a variable that gets used several times)

node = simplify_concat(node)
node = simplify_conditionals(node)

# PyDough Change: new pre-order transformations
Contributor:

Are these functions overridden from sqlglot or PyDough specific functions?

@knassre-bodo (author), Oct 30, 2025:

Brand new rules we created, but using SQLGlot functionalities to operate on their AST (basically, extending the patterns available in their simplifier).

"type": "simple table",
"table path": "main.sbCustomer",
"unique properties": ["_id"],
"unique properties": ["_id", "name", "email", "address1"],
Contributor:

Why did this change?

Contributor (author):

These were columns that are actually unique, but previously weren't marked as such in the defog metadata. I happened to correct these while making other changes, since I was looking through these.

Contributor:

Not sure about this change. Is this required for a specific test case?
While this set of fields may be unique in the actual data, we generally can't assume the same for 'real' databases where data is changing. I think simulating live database behavior with the defog datasets would be better for testing purposes, as this reflects how they would be used in the real world.

@knassre-bodo (author), Nov 24, 2025:

It was a change I made incidentally. Also, for each database, we should be filling out the metadata based on the actual data for that database, not based on behavior for other databases. As for "real" databases, many of them will have multiple unique columns besides the primary key.

 )
 return Broker.CALCULATE(
-    n_transactions=KEEP_IF(COUNT(selected_txns), COUNT(selected_txns) > 0),
+    n_transactions=KEEP_IF(COUNT(selected_txns), COUNT(selected_txns) != 0),
@john-sanchez31, Oct 24, 2025:

Why do these queries change?

Contributor (author):

Changing other occurrences of this pattern to match for consistency, since COUNT(x) > 0 is the same as COUNT(x) != 0 (since it can never be negative). The difference between the two is that COUNT(x) != 0 can be optimized into NULLIF patterns in SQLGlot, which opens doors for further simplification.
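A quick SQLite check of the cited equivalence (a sketch, not project code): for a non-negative count, cnt > 0, cnt != 0, and the NULLIF form all agree:

```python
import sqlite3

con = sqlite3.connect(":memory:")
for cnt in (0, 1, 7):
    gt, ne, nf = con.execute(
        "SELECT CASE WHEN ? > 0 THEN ? END,"
        "       CASE WHEN ? != 0 THEN ? END,"
        "       NULLIF(?, 0)",
        (cnt, cnt, cnt, cnt, cnt),
    ).fetchone()
    # COUNT results are never negative, so all three expressions agree,
    # and the != form is the one SQLGlot can fold into NULLIF.
    assert gt == ne == nf
```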

"type": "simple table",
"table path": "main.sbCustomer",
"unique properties": ["_id"],
"unique properties": ["_id", "name", "email", "address1"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this change. Is this required for a specific test case?
While this set of fields may be unique in the actual data, we generally can't assume the same for 'real' databases where data is changing. I think simulating live database behavior with the defog datasets would be better for testing purposes, as this reflects how they would be used in the real world.

return expr

lhs = first.args.get("this")
rhs = first.args.get("expression")
Contributor:

Type hints for first, second, lhs & rhs.
Same on rewrite_case_to_nullif() and rewrite_sum_nullif()

@knassre-bodo merged commit 8b30bca into main Nov 24, 2025
12 checks passed
@knassre-bodo deleted the kian/join_aggregate_transpose branch November 24, 2025 18:35