Enable granularities for derived metrics with time offset #726
Conversation
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.
Force-pushed from 8d57703 to 9a31896.
```diff
@@ -1,12 +1,12 @@
 -- Compute Metrics via Expressions
 SELECT
-  subq_4.ds__month
-  , subq_4.txn_revenue AS trailing_2_months_revenue
+  subq_3.ds__month
```
This is the SQL change from removing the unused `JoinOverTimeRangeNode` visit.
I'm having a hard time reasoning about the changes here because this is an old and gnarly part of the codebase, it's been a while, and the PR itself is doing three seemingly independent but possibly interconnected things.
- Re-ordering the dataflow plan
- Removing some extra join
- Re-structuring the granularity adjustment source from the parent node of the time spine join to the first metric_time reference in the query parameters, and moving the adjustment logic inline in the visitor method
The last of these does not seem like it'll be deterministically correct. If a user groups by `metric_time__day` but applies a filter on `metric_time__month`, they'll potentially get different results than if they did the opposite and filtered on day while grouping by month, because the group-by ones are added to the spec set first, but the filter happens after the joins.
Are all of these changes necessary to address the core problem with derived metric offsets? And can they be split up into stages so we can see what they're doing independently of each other? Even if it's rough, like if there's a commit pointer in this stack that splits things a bit more clearly, or if you can squash and reorder some of the commits and maybe do some `git add --patch` maneuvering to get to a commit breakpoint that divides this up (and especially if you can run the snapshots in between), that'd really help me.
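As a toy illustration of the ordering concern (plain tuples are used as stand-ins for the actual spec classes, and `first_metric_time_grain` is a hypothetical helper), a first-match scan over the combined spec set returns a different granularity depending on whether a grain arrived via the group-by (added first) or the filter:

```python
# Hypothetical, simplified illustration -- not MetricFlow's actual spec handling.
def first_metric_time_grain(specs: list[tuple[str, str]]) -> str:
    """Return the grain of the first metric_time spec in insertion order."""
    for name, grain in specs:
        if name == "metric_time":
            return grain
    raise ValueError("no metric_time spec in the query")


# Group-by specs are added to the spec set before filter specs.
group_by_day_filter_month = [("metric_time", "day"), ("metric_time", "month")]
group_by_month_filter_day = [("metric_time", "month"), ("metric_time", "day")]

print(first_metric_time_grain(group_by_day_filter_month))  # prints: day
print(first_metric_time_grain(group_by_month_filter_day))  # prints: month
```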
```python
metric_time_dimension_spec: Optional[TimeDimensionSpec] = None
for linkable_spec in queried_linkable_specs.time_dimension_specs:
    if linkable_spec.element_name == self._metric_time_dimension_reference.element_name:
        metric_time_dimension_spec = linkable_spec
```
This might not be the correct spec for metric_time - as far as I can tell, the queried_linkable_specs may include a variety of metric_time granularities. We probably want the smallest granularity one, right?
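A rough sketch of what picking the smallest granularity could look like; the `Granularity` and `MetricTimeSpec` types below are simplified stand-ins, not MetricFlow's actual `TimeGranularity` / `TimeDimensionSpec` classes:

```python
# Standalone sketch: choose the finest-grained metric_time spec rather than the
# first (or last) one encountered while iterating over the queried specs.
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional, Sequence


class Granularity(IntEnum):
    DAY = 1
    WEEK = 2
    MONTH = 3
    YEAR = 4


@dataclass(frozen=True)
class MetricTimeSpec:
    element_name: str
    granularity: Granularity


def smallest_metric_time_spec(
    specs: Sequence[MetricTimeSpec], metric_time_name: str = "metric_time"
) -> Optional[MetricTimeSpec]:
    """Return the metric_time spec with the smallest (finest) granularity, if any."""
    candidates = [spec for spec in specs if spec.element_name == metric_time_name]
    return min(candidates, key=lambda spec: spec.granularity) if candidates else None
```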
Interesting point. What if the query included two incompatible granularities like week and month? We might need to take a list of `metric_time_specs` in the time spine builder!
To wrap up this thread - I ended up implementing the logic to handle multiple metric time dimensions in the dataflow to SQL logic, not the time spine builder.
```python
new_time_dim_spec = TimeDimensionSpec(
    element_name=original_time_dim_instance.spec.element_name,
    entity_links=original_time_dim_instance.spec.entity_links,
    time_granularity=node.metric_time_dimension_spec.time_granularity,
```
If this is a larger-than-expected time granularity I think we'll end up producing incorrect results here.
Why's that? I've included some integration tests with granularities larger than day for reference if you want to point to a SQL example.
...low_to_sql_plan.py/SqlQueryPlan/BigQuery/test_derived_metric_with_offset_to_grain__plan0.sql
```diff
@@ -718,6 +718,7 @@ class JoinToTimeSpineNode(Generic[SourceDataSetT], BaseOutput[SourceDataSetT], A
     def __init__(
         self,
         parent_node: BaseOutput[SourceDataSetT],
+        metric_time_dimension_spec: TimeDimensionSpec,
```
Do we want this here?
Previously, the granularity was sourced from the parent node used to make the time spine dataset, which contained all of the available granularities. Now it's coming in from the query parameters: it's threaded through from the callsite, and what we're doing at the callsite is taking one of the potentially many query instances and wiring it through this node.
The parent dataset will no longer be narrowed down to only the requested `metric_time_dimension_instances`, since the measure aggregation is being moved to after joining to time spine in this PR. Because of that, I thought it would be more efficient to pass through the requested granularity and only aggregate time spine columns based on that request. However, I can see a couple of potential issues with that:

1. There might be multiple `metric_time` dimensions in the query. To handle this case, we could accept a list of `metric_time_dimension_instances` in `_make_time_spine_data_set` and create `DATE_TRUNC` columns for each.
2. Maybe we don't want to prematurely optimize this query, and I should leave that to the optimizer. In that case we could create a `DATE_TRUNC` column for every possible granularity in `_make_time_spine_data_set`, and expect the unrequested ones to get filtered out by `FilterElementsNode`.
I'll definitely need to update the logic to handle case #1, but not sure about #2. Thoughts on that one?
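For concreteness, a standalone sketch of what option 1 could look like; `time_spine_select_columns` is a hypothetical helper, not the actual `_make_time_spine_data_set` logic:

```python
# Hypothetical sketch: build one DATE_TRUNC column per requested metric_time
# granularity for the time spine dataset.
from typing import Sequence


def time_spine_select_columns(time_column: str, granularities: Sequence[str]) -> list[str]:
    """Return one DATE_TRUNC select expression per requested metric_time granularity."""
    return [
        f"DATE_TRUNC('{grain}', {time_column}) AS metric_time__{grain}"
        for grain in granularities
    ]


# e.g. a query grouping by metric_time__week and metric_time__month:
print(time_spine_select_columns("ds", ["week", "month"]))
# ["DATE_TRUNC('week', ds) AS metric_time__week",
#  "DATE_TRUNC('month', ds) AS metric_time__month"]
```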
Actually, after some noodling on this I understood more what you meant. Ended up going back to inheriting the lowest granularity available from the parent dataset after all. So you can ignore this thread!
@tlento I'll take this PR and separate it into smaller PRs to make this all clearer. I didn't totally follow what you meant in #3, which probably means it was not the logic I intended. Hopefully breaking this into smaller PRs will make it clearer to both of us, and I can adjust that logic if needed!
That'd be amazing, thanks so much! There are some fairly large changes I want to make this week, so hopefully we can get this all in ahead of those and avoid the looming merge conflict pain. I should be back to my normal review cadence this week, assuming no other major disruptions happen over here. Stupid yellow jackets....
Resolves #SL-767
Description
With the current placement of the `JoinToTimeSpineNode` in the dataflow plan, we filter down to time columns with the requested granularity BEFORE joining to time spine. Since joining to time spine expects a time column with `DAY` granularity, this means queries with any granularity besides `DAY` don't return expected results. Here, we move the time spine join to happen before filtering elements to fix that issue. So instead of `FilterElementsNode` -> `JoinToTimeSpineNode`, we do `JoinToTimeSpineNode` -> `FilterElementsNode`.

For cumulative metrics, we need to factor in the `JoinOverTimeRangeNode`. This PR updates the dataflow plan from `FilterElementsNode` -> `JoinOverTimeRangeNode` (both required) to `JoinOverTimeRangeNode` (optional, only used if a time dimension is requested) -> `JoinToTimeSpineNode` (optional, only used for offset metrics) -> `FilterElementsNode`. This ensures we'll have the columns needed when joining to time spine.

Unrelated to time offset, this change also removes the unnecessary `JoinOverTimeRangeNode` step for a cumulative metric queried without a time dimension, which 1) does nothing in SQL and 2) might result in inaccurate costing for that query.

Note: there is currently a table missing in the Postgres & BQ test warehouses that is blocking generation of snapshots for some tests.
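For readers following the node reordering, here is a simplified sketch of the new ordering; the `PlanNode` class and `build_measure_branch` helper are hypothetical illustrations, not MetricFlow's actual dataflow plan builder:

```python
# Hypothetical sketch of the reordered dataflow plan branch described above.
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlanNode:
    name: str
    parent: Optional["PlanNode"] = None


def build_measure_branch(
    source: PlanNode,
    cumulative_with_time_dimension: bool,  # adds the over-time-range join
    offset_metric: bool,  # adds the time spine join
) -> PlanNode:
    node = source
    if cumulative_with_time_dimension:
        node = PlanNode("JoinOverTimeRangeNode", parent=node)
    if offset_metric:
        node = PlanNode("JoinToTimeSpineNode", parent=node)
    # Filtering to the requested columns now happens last, so the time spine join
    # still sees a DAY-grained time column even when coarser grains are requested.
    return PlanNode("FilterElementsNode", parent=node)
```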