Adding relational optimization for Join-Aggregate transpose #407
Conversation
```diff
-            for i in (138841, 36091, 54952, 103768, 46081)
+            for i in (6434, 45280, 60493, 87616, 132775)
             ],
-            "n_orders": [21, 20, 19, 19, 17],
+            "n_orders": [2, 2, 2, 2, 2],
         }
     ),
     "common_prefix_y",
     order_sensitive=True,
```
Changed the test slightly so it isn't as time-consuming to compute.
john-sanchez31 left a comment:
Good job, Kian! Just some type-hint reminders and a suggestion below.
| """ | ||
| if base not in used_names: | ||
| return base | ||
| i = 0 |
Check the type hints on `i` and `name`.
```python
        i += 1


def join_aggregate_transpose(
    self, join: Join, aggregate: Aggregate, is_left: bool
```
Can we change the name of the parameter `is_left` to `is_left_agg`, to make it clear we're talking about the aggregation and not the join?
```python
        join.cardinality if is_left else join.reverse_cardinality
    )

    left_join_case = (
```
Type hint missing
```python
    # Extract the join key references from both sides of the join in the
    # order they appear in the join condition.
    agg_key_refs, non_agg_key_refs = extract_equijoin_keys(join)
```
Missing type hints
Parallel assignments like these can't have type hints, but I can pre-declare them above
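For illustration, the pre-declaration pattern being described (a minimal standalone sketch using the stdlib `divmod` as a stand-in for the PR's `extract_equijoin_keys` helper):

```python
# Tuple unpacking can't carry inline annotations:
#     x: int, y: int = divmod(7, 2)   # SyntaxError
# Pre-declaring each name keeps the type checker informed without
# changing runtime behavior:
x: int
y: int
x, y = divmod(7, 2)
print(x, y)  # 3 1
```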
```python
    # `col` is one of the aggregation keys. If this is not possible, then
    # abort. Also abort if any of the aggregation keys are not used as
    # equi-join keys.
    sentinel_column: RelationalExpression | None = None
```
What is a sentinel column?
In this case, the sentinel column is a `COUNT(*)` expression used to detect when the number of rows in the group is actually zero, since in the left-join transform case we need to know when the RHS has no matches.
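To make that concrete, here is a minimal standalone sqlite3 sketch (hypothetical `customers`/`orders` tables, not the PR's test data) showing why `COUNT(*)` alone can't distinguish an unmatched left-join row from a single match, while counting the RHS join key can:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (cust_id INTEGER);
    INSERT INTO customers VALUES (1), (2), (3);
    INSERT INTO orders VALUES (1), (1), (2);
""")

# Customer 3 has no orders. After transposing the aggregate below the
# LEFT JOIN, its group still contains one NULL-extended row, so COUNT(*)
# reports 1. Counting the RHS join key instead yields 0 for that group,
# which is exactly what the sentinel column detects.
rows = con.execute("""
    SELECT c.id, COUNT(*) AS n_rows, COUNT(o.cust_id) AS sentinel
    FROM customers c
    LEFT JOIN orders o ON c.id = o.cust_id
    GROUP BY c.id
    ORDER BY c.id
""").fetchall()

print(rows)  # [(1, 2, 2), (2, 1, 1), (3, 1, 0)]
```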
```python
            break
    else:
        # Otherwise, create a new COUNT(*) column for that purpose.
        agg_name = self.generate_name("n_rows", aggregate.columns)
```
Missing type hint
The type hint is further up in the function, as a pre-declaration (since `agg_name` is a variable that gets used several times).
```python
    node = simplify_concat(node)
    node = simplify_conditionals(node)

    # PyDough Change: new pre-order transformations
```
Are these functions overridden from SQLGlot, or PyDough-specific functions?
Brand-new rules we created, but using SQLGlot functionality to operate on their AST (basically, extending the patterns available in their simplifier).
| "type": "simple table", | ||
| "table path": "main.sbCustomer", | ||
| "unique properties": ["_id"], | ||
| "unique properties": ["_id", "name", "email", "address1"], |
Why did this change?
These were columns that are actually unique, but previously weren't marked as such in the defog metadata. I happened to correct these while making other changes, since I was looking through these.
Not sure about this change. Is this required for a specific test case?
While this set of fields may be unique in the actual data, we generally can't assume the same for 'real' databases where data is changing. I think simulating live database behavior with the defog datasets would be better for testing purposes, as this reflects how they would be used in the real world.
It was a change I made incidentally. Also, for each database, we should be filling out the metadata based on the actual data for that database, not based on behavior for other databases. As for "real" databases, many of them will have multiple unique columns besides the primary key.
```diff
     )
     return Broker.CALCULATE(
-        n_transactions=KEEP_IF(COUNT(selected_txns), COUNT(selected_txns) > 0),
+        n_transactions=KEEP_IF(COUNT(selected_txns), COUNT(selected_txns) != 0),
```
Why do these queries change?
Changing other occurrences of this pattern to match for consistency: `COUNT(x) > 0` is the same as `COUNT(x) != 0`, since a count can never be negative. The difference between the two is that `COUNT(x) != 0` can be optimized into `NULLIF` patterns in SQLGlot, which opens doors for further simplification.
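A small standalone sqlite3 sketch (hypothetical tables, not the defog data) of why the `!= 0` form matters: the `KEEP_IF(..., COUNT(x) != 0)` gate, spelled as a searched CASE in SQL, is exactly `NULLIF(COUNT(x), 0)`:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE txns (cust_id INTEGER);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO txns VALUES (1), (1);
""")

rows = con.execute("""
    SELECT c.id,
           -- KEEP_IF(COUNT(x), COUNT(x) != 0), spelled out as a CASE:
           CASE WHEN COUNT(t.cust_id) <> 0
                THEN COUNT(t.cust_id) END AS keep_if_form,
           -- ...which the new CASE -> NULLIF rule collapses into:
           NULLIF(COUNT(t.cust_id), 0) AS nullif_form
    FROM customers c
    LEFT JOIN txns t ON c.id = t.cust_id
    GROUP BY c.id
    ORDER BY c.id
""").fetchall()

# Customer 2 has no transactions: both forms yield NULL instead of 0.
print(rows)  # [(1, 2, 2), (2, None, None)]
```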
| "type": "simple table", | ||
| "table path": "main.sbCustomer", | ||
| "unique properties": ["_id"], | ||
| "unique properties": ["_id", "name", "email", "address1"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about this change. Is this required for a specific test case?
While this set of fields may be unique in the actual data, we generally can't assume the same for 'real' databases where data is changing. I think simulating live database behavior with the defog datasets would be better for testing purposes, as this reflects how they would be used in the real world.
pydough/sqlglot/override_simplify.py (outdated)
```python
        return expr

    lhs = first.args.get("this")
    rhs = first.args.get("expression")
```
Type hints for `first`, `second`, `lhs` & `rhs`. Same in `rewrite_case_to_nullif()` and `rewrite_sum_nullif()`.
This PR adds an optimization pass that identifies cases where a join occurs after an aggregation, the join is on the aggregation keys, and placing the join before the aggregation would reduce the number of rows fed into the aggregation. When this happens, the pass switches the two, with the following adjustments:

- Columns from the non-aggregate side are wrapped in `ANY_VALUE` (except equi-join keys, which are just replaced with the equivalent join key from the aggregation side, which should be one of the aggregation keys).
- `COUNT(*)` is replaced with `COUNT(x)` (where `x` is one of the equi-join keys from the right-hand side, aborting the attempt if there is no such key).
- `COUNT(x)` calls have an additional `KEEP_IF(..., y != 0)` post-processing step, where `y` is a sentinel column used to determine whether or not there was a match from the left join. If there was a `COUNT(*)` rewritten as `COUNT(x)`, then that column is used as `y`. Otherwise, a new `COUNT(*)` is inserted into the aggregation and used as `y`.

Also added some new simplification rules to the SQLGlot simplifier override file to account for newly generated patterns:

- `CASE WHEN x != y THEN x ELSE NULL END` -> `NULLIF(x, y)`
- `COALESCE(NULLIF(x, y), y)` -> `COALESCE(x, y)`
- `COALESCE(NULLIF(x, y), z)` -> `CASE WHEN x = y THEN z ELSE x END`
- `SUM(NULLIF(x, 0))` -> `SUM(x)`
- `COALESCE(COUNT(x))` -> `COUNT(x)`
- `COALESCE(COUNT_IF(x))` -> `COUNT_IF(x)`
- `NULLIF(COALESCE(x, y), y)` -> `NULLIF(x, y)`

The net result of all of these changes is the following:

A side effect of these changes is the occasional inclusion of extraneous `ANY_VALUE` calls, when those columns could instead be switched out with the grouping keys. A followup may be warranted to change how uniqueness information is propagated, so that column pruning can make this switch when the grouping key is unused but one of these `MIN`/`MAX`/`ANY_VALUE` columns is also unique.
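To illustrate the transpose end-to-end, here is a minimal sqlite3 sketch (hypothetical `customers`/`orders` tables, inner-join case only) showing that joining first and aggregating after, with `COUNT(*)` rewritten to count the RHS equi-join key, produces the same result as joining against the pre-aggregated subquery:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (cust_id INTEGER);
    INSERT INTO customers VALUES (1), (2), (3);
    INSERT INTO orders VALUES (1), (1), (2);
""")

# Before the pass: aggregate first, then join on the aggregation key.
before = con.execute("""
    SELECT c.id, a.n_orders
    FROM customers c
    JOIN (SELECT cust_id, COUNT(*) AS n_orders
          FROM orders GROUP BY cust_id) a
      ON c.id = a.cust_id
    ORDER BY c.id
""").fetchall()

# After the pass: join first, then aggregate, with COUNT(*) rewritten
# to count the RHS equi-join key (customer 3 has no orders, so it is
# absent from both results under an inner join).
after = con.execute("""
    SELECT c.id, COUNT(o.cust_id) AS n_orders
    FROM customers c
    JOIN orders o ON c.id = o.cust_id
    GROUP BY c.id
    ORDER BY c.id
""").fetchall()

assert before == after == [(1, 2), (2, 1)]
```

When the join is selective (e.g. a filter on `customers` removes most rows), the transposed plan aggregates far fewer rows, which is the motivation for the pass.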