Remove unnecessary FilterElementsNodes before measure aggregation #1506
Conversation
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.
Force-pushed from 83bb17b to 5e18253
This looks right to me! The constant filtering of nodes throughout does seem a little silly in a world where we have the optimizers, but removing it does make the unoptimized plans a bit more wild. That said, I don't see a use case where we'd ever drop the optimizer, so I think this is a good change!
```python
specs_to_keep_after_join = InstanceSpecSet(measure_specs=(measure_spec,)).merge(
    InstanceSpecSet.create_from_specs(measure_recipe.all_linkable_specs_required_for_source_nodes.as_tuple),
)

after_join_filtered_node = FilterElementsNode.create(
    parent_node=filtered_measures_with_joined_elements, include_specs=specs_to_keep_after_join
)
unaggregated_measure_node = after_join_filtered_node
```
Q: This logic is all unnecessary because it gets filtered out in the optimizer anyway?
Yep exactly
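To illustrate why those nodes are redundant, here is a simplified sketch (not MetricFlow's actual optimizer code, and all names below are hypothetical): a filter node whose included specs exactly match what its parent already outputs is a no-op, so an optimizer pass can drop it from the plan.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Simplified stand-in for a dataflow plan node: a name, the specs it
# outputs, and a single parent (a linear chain keeps the sketch small).
@dataclass(frozen=True)
class PlanNode:
    name: str
    output_specs: FrozenSet[str]
    parent: Optional["PlanNode"] = None

def prune_redundant_filters(node: PlanNode) -> PlanNode:
    """Drop any filter node whose output specs equal its parent's output specs."""
    if node.parent is None:
        return node
    pruned_parent = prune_redundant_filters(node.parent)
    if node.name.startswith("filter") and node.output_specs == pruned_parent.output_specs:
        # The filter keeps exactly what its parent already produces: a no-op.
        return pruned_parent
    return PlanNode(node.name, node.output_specs, pruned_parent)

source = PlanNode("source", frozenset({"bookings", "metric_time"}))
noop_filter = PlanNode("filter_1", frozenset({"bookings", "metric_time"}), source)
plan = PlanNode("aggregate", frozenset({"bookings"}), noop_filter)

optimized = prune_redundant_filters(plan)  # "filter_1" is pruned away
```

Because the optimizer removes these no-op filters anyway, building them in the first place only clutters the unoptimized plan.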
Measure aggregation logic applies several `FilterElementsNodes` that do not actually impact the outcome of optimized queries. Will and I are making an effort to align the logic in the dataflow plan builder as much as possible. To align measure aggregation with the other logic in the dataflow plan builder, we need to remove these nodes.
Force-pushed from 5e18253 to a2a01c0
The difference between `queried_linkable_specs`, `required_linkable_specs`, and `extraneous_linkable_specs` has often been overlooked and therefore been the source of many bugs. In [this PR](#1506), we removed the unneeded `FilterElementsNodes` that were the main reason we needed those different spec variables. So now we can remove the `extraneous_linkable_specs` variables altogether, and hopefully reduce confusion here. There is also some related cleanup in here. It's not a huge PR but I still recommend reviewing by commit.
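For context, the relationship between those spec variables can be sketched with plain set arithmetic. This is a hedged illustration, not code from the repo: the spec names are made up, and it assumes the usual reading of the terms, where "required" covers everything needed to compute the query (including specs used only by where filters) and "extraneous" is whatever is required but not queried.

```python
# Illustrative spec names; the set arithmetic is the point.
queried_linkable_specs = {"metric_time__day", "listing__country"}
where_filter_only_specs = {"booking__is_instant"}  # needed only to evaluate a filter

required_linkable_specs = queried_linkable_specs | where_filter_only_specs
extraneous_linkable_specs = required_linkable_specs - queried_linkable_specs
```

With the unneeded `FilterElementsNodes` gone, the extraneous set no longer needs to be tracked separately.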
We are working to dedupe logic in the dataflow plan builder in order to reduce bugs. There are currently three different paths the dataflow plan may take, each of which duplicates much of the dataflow plan building logic.
And I am in the process of adding one more path related to time spine nodes.
The differences in these paths are frequently the source of bugs that could be easily prevented. For example, forgetting to include a where filter spec in the original source node is a bug that frequently pops up for new paths in the dataflow plan.
Further up the stack, I'm going to dedupe a lot of the dataflow plan logic in order to prevent these bugs from popping up in the future. Specifically, I'll move all the logic after building a source node recipe and before applying aggregations to a helper function that can be reused in all paths (current or future). That will help to ensure that applying joins, joining custom granularities, applying where filters, and applying time constraints, always uses the correct logic.
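A minimal sketch of what such a shared helper could look like, under stated assumptions: none of these function or parameter names are MetricFlow's actual API, and each plan node is modeled as a `(step_name, parent)` tuple so the example stays self-contained.

```python
from typing import Optional, Sequence, Tuple

# A plan node is (step_name, parent); None terminates the chain.
Node = Tuple[str, Optional[tuple]]

def build_pre_aggregation_subplan(
    recipe_node: Node,
    linkable_specs: Sequence[str],
    where_filter_specs: Sequence[str],
    custom_granularities: Sequence[str] = (),
    time_constraint: Optional[str] = None,
) -> Node:
    """Shared steps between building a source-node recipe and aggregation:
    joins, custom granularities, where filters, then time constraints."""
    node: Node = ("join:" + ",".join(linkable_specs), recipe_node)
    for granularity in custom_granularities:
        node = ("custom_granularity:" + granularity, node)
    for where_spec in where_filter_specs:
        node = ("where:" + where_spec, node)
    if time_constraint is not None:
        node = ("time_constraint:" + time_constraint, node)
    return node

plan = build_pre_aggregation_subplan(
    recipe_node=("source", None),
    linkable_specs=["metric_time__day"],
    where_filter_specs=["booking__is_instant"],
    time_constraint="2020-01-01..2020-01-31",
)
```

Because every path calls the same helper, a new path cannot forget a step such as applying where filter specs: the step ordering lives in one place.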
Measure aggregation logic applies several `FilterElementsNodes` that do not actually impact the outcome of optimized queries. Other paths do not have these unnecessary nodes. Removing them is a step toward ensuring measures use the same logic as other dataflow plans, which will then allow us to dedupe that logic.