Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split column pruner into two phases #1501

Merged
merged 5 commits into from
Nov 10, 2024
Merged

Split column pruner into two phases #1501

merged 5 commits into from
Nov 10, 2024

Conversation

plypaul
Copy link
Contributor

@plypaul plypaul commented Nov 4, 2024

Currently, the column pruner checks the columns that are needed in each SELECT statement and generates the pruned SQL in a single pass. For better readability and easier modification, this splits the column pruner into two phases.

First, the SQL nodes are traversed to figure out which columns are required and which can be pruned. Then, the SQL nodes are rewritten with the pruned columns.

The logic in SqlTagRequiredColumnAliasesVisitor has been copied from the original implementation.

@cla-bot cla-bot bot added the cla:yes label Nov 4, 2024
@plypaul plypaul marked this pull request as ready for review November 4, 2024 17:07
Copy link
Contributor

@courtneyholcomb courtneyholcomb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this logic looks great!
I had a bit of trouble reading the code (sorry for the slow review - that's why!), but I think this was only due to the naming of some of the classes / variables / etc. I've left some suggestions to improve readability, and most all of them are just related to naming.

f"SQL, but this is a bug and should be investigated."
)
return node

pruned_select_columns = tuple(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is tangential, but I've frequently read this code and found this variable name confusing (pruned_select_columns). We frequently refer to "pruned columns" when we mean the ones that have been removed, but in this case we mean the columns that have been kept. I think the word pruned can technically be used both ways, but it typically is used to refer to what has been removed. Can we change this to a more clear variable name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, this could be renamed. To understand where you're coming from, you mention that [pruned] typically is used to refer to what has been removed. What examples were you thinking?

When I think of pruned, I think about an overgrown tree. Once I prune it, I would call it a pruned tree.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's true - the tree has been pruned. But if you're referring to the branches, I think the "pruned" branches would typically refer to the ones removed. In this case the tree is the SQL node and the columns are the branches.

I just did a quick search through the code to see how we use this word, and there are a couple places where we use the opposite meaning of prune:

required_alias_mapping: Describes columns aliases that should be kept / not pruned for each node.

def test_dont_prune_if_in_where(
request: FixtureRequest,

def test_dont_prune_with_str_expr(
request: FixtureRequest,

And this is silly but I just did a quick google search for a gut check here and it does look like the pruned leaves are the ones that have been removed:
Screenshot 2024-11-08 at 1 10 32 PM

Not totally related to this PR so this isn't blocking! But if you don't update the naming here I probably will the next time I come across it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, I'm thinking we use different terms like "removed" and "retained" then, but will have to handle in a follow up.

metricflow/sql/optimizer/tag_required_column_aliases.py Outdated Show resolved Hide resolved
self._column_alias_tagger = tagged_column_alias_set

def _search_for_expressions(
self, select_node: SqlSelectStatementNode, pruned_select_columns: Tuple[SqlSelectColumn, ...]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concern re: the name pruned_select_columns here

metricflow/sql/optimizer/tag_column_aliases.py Outdated Show resolved Hide resolved
metricflow/sql/optimizer/tag_required_column_aliases.py Outdated Show resolved Hide resolved
metricflow/sql/optimizer/tag_required_column_aliases.py Outdated Show resolved Hide resolved
metricflow/sql/optimizer/tag_required_column_aliases.py Outdated Show resolved Hide resolved
metricflow/sql/optimizer/tag_required_column_aliases.py Outdated Show resolved Hide resolved
@plypaul
Copy link
Contributor Author

plypaul commented Nov 8, 2024

@courtneyholcomb Updated with naming changes + other alterations. I'm going to rename the files later as they cause conflicts in the stack.

Copy link
Contributor

@courtneyholcomb courtneyholcomb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New names look great, overall code looks great! Thank you!

f"SQL, but this is a bug and should be investigated."
)
return node

pruned_select_columns = tuple(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's true - the tree has been pruned. But if you're referring to the branches, I think the "pruned" branches would typically refer to the ones removed. In this case the tree is the SQL node and the columns are the branches.

I just did a quick search through the code to see how we use this word, and there are a couple places where we use the opposite meaning of prune:

required_alias_mapping: Describes columns aliases that should be kept / not pruned for each node.

def test_dont_prune_if_in_where(
request: FixtureRequest,

def test_dont_prune_with_str_expr(
request: FixtureRequest,

And this is silly but I just did a quick google search for a gut check here and it does look like the pruned leaves are the ones that have been removed:
Screenshot 2024-11-08 at 1 10 32 PM

Not totally related to this PR so this isn't blocking! But if you don't update the naming here I probably will the next time I come across it.

Base automatically changed from p--cte--05 to main November 9, 2024 02:00
Currently, the column pruner checks the columns that are needed in each `SELECT`
statement and generates the pruned SQL in a single pass. For better readability
and easier modification, this splits the column pruner into two phases.

First, the SQL nodes are traversed to figure out which columns are required and
which can be pruned. Then, the SQL nodes are rewritten with the pruned columns.
@plypaul plypaul merged commit dd090a2 into main Nov 10, 2024
15 checks passed
@plypaul plypaul deleted the p--cte--06 branch November 10, 2024 02:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants