-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split column pruner into two phases #1501
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this logic looks great!
I had a bit of trouble reading the code (sorry for the slow review - that's why!), but I think this was only due to the naming of some of the classes / variables / etc. I've left some suggestions to improve readability, and most all of them are just related to naming.
f"SQL, but this is a bug and should be investigated." | ||
) | ||
return node | ||
|
||
pruned_select_columns = tuple( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is tangential, but I've frequently read this code and found this variable name confusing (pruned_select_columns
). We frequently refer to "pruned columns" when we mean the ones that have been removed, but in this case we mean the columns that have been kept. I think the word pruned can technically be used both ways, but it typically is used to refer to what has been removed. Can we change this to a more clear variable name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, this could be renamed. To understand where you're coming from, you mention that [pruned] typically is used to refer to what has been removed
. What examples were you thinking?
When I think of pruned, I think about an overgrown tree. Once I prune it, I would call it a pruned tree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's true - the tree has been pruned. But if you're referring to the branches, I think the "pruned" branches would typically refer to the ones removed. In this case the tree is the SQL node and the columns are the branches.
I just did a quick search through the code to see how we use this word, and there are a couple places where we use the opposite meaning of prune:
required_alias_mapping: Describes columns aliases that should be kept / not pruned for each node. |
metricflow/tests_metricflow/sql/optimizer/test_column_pruner.py
Lines 349 to 350 in d01cbea
def test_dont_prune_if_in_where( | |
request: FixtureRequest, |
metricflow/tests_metricflow/sql/optimizer/test_column_pruner.py
Lines 394 to 395 in d01cbea
def test_dont_prune_with_str_expr( | |
request: FixtureRequest, |
And this is silly but I just did a quick google search for a gut check here and it does look like the pruned leaves are the ones that have been removed:
Not totally related to this PR so this isn't blocking! But if you don't update the naming here I probably will the next time I come across it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, I'm thinking we use different terms like "removed" and "retained" then, but will have to handle in a follow up.
self._column_alias_tagger = tagged_column_alias_set | ||
|
||
def _search_for_expressions( | ||
self, select_node: SqlSelectStatementNode, pruned_select_columns: Tuple[SqlSelectColumn, ...] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same concern re: the name pruned_select_columns
here
@courtneyholcomb Updated with naming changes + other alterations. I'm going to rename the files later as they cause conflicts in the stack. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New names look great, overall code looks great! Thank you!
f"SQL, but this is a bug and should be investigated." | ||
) | ||
return node | ||
|
||
pruned_select_columns = tuple( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's true - the tree has been pruned. But if you're referring to the branches, I think the "pruned" branches would typically refer to the ones removed. In this case the tree is the SQL node and the columns are the branches.
I just did a quick search through the code to see how we use this word, and there are a couple places where we use the opposite meaning of prune:
required_alias_mapping: Describes columns aliases that should be kept / not pruned for each node. |
metricflow/tests_metricflow/sql/optimizer/test_column_pruner.py
Lines 349 to 350 in d01cbea
def test_dont_prune_if_in_where( | |
request: FixtureRequest, |
metricflow/tests_metricflow/sql/optimizer/test_column_pruner.py
Lines 394 to 395 in d01cbea
def test_dont_prune_with_str_expr( | |
request: FixtureRequest, |
And this is silly but I just did a quick google search for a gut check here and it does look like the pruned leaves are the ones that have been removed:
Not totally related to this PR so this isn't blocking! But if you don't update the naming here I probably will the next time I come across it.
Currently, the column pruner checks the columns that are needed in each `SELECT` statement and generates the pruned SQL in a single pass. For better readability and easier modification, this splits the column pruner into two phases. First, the SQL nodes are traversed to figure out which columns are required and which can be pruned. Then, the SQL nodes are rewritten with the pruned columns.
Currently, the column pruner checks the columns that are needed in each
SELECT
statement and generates the pruned SQL in a single pass. For better readability and easier modification, this splits the column pruner into two phases.First, the SQL nodes are traversed to figure out which columns are required and which can be pruned. Then, the SQL nodes are rewritten with the pruned columns.
The logic in
SqlTagRequiredColumnAliasesVisitor
has been copied from the original implementation.