
Enable granularities for derived metrics with time offset #726

Closed · wants to merge 18 commits

Conversation

@courtneyholcomb (Contributor) commented Aug 16, 2023

Resolves #SL-767

Description

With the current placement of the JoinToTimeSpineNode in the dataflow plan, we filter down to time columns with the requested granularity BEFORE joining to the time spine. Since the time spine join expects a time column with DAY granularity, queries at any granularity other than DAY don't return the expected results. Here, we move the time spine join to happen before filtering elements to fix that issue: instead of FilterElementsNode -> JoinToTimeSpineNode, we do JoinToTimeSpineNode -> FilterElementsNode.
For cumulative metrics, we need to factor in the JoinOverTimeRangeNode. This PR updates the dataflow plan from FilterElementsNode -> JoinOverTimeRangeNode (both required) to JoinOverTimeRangeNode (optional, only used if a time dimension is requested) -> JoinToTimeSpineNode (optional, only used for offset metrics) -> FilterElementsNode. This ensures we'll have the columns needed when joining to time spine.
Unrelated to time offset, this change also removes the unnecessary JoinOverTimeRangeNode step for a cumulative metric queried without a time dimension, which 1) does nothing in SQL and 2) might result in inaccurate costing for that query.
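To make the reordering concrete, here's a minimal runnable sketch that models the before/after node ordering with toy stand-ins; PlanNode, chain(), and the node names below are illustrative placeholders, not MetricFlow's actual classes or builder code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlanNode:
    """Toy stand-in for a dataflow plan node (not MetricFlow's real classes)."""

    name: str
    parent: Optional["PlanNode"] = None

    def chain(self) -> str:
        # Render the plan from source to sink, e.g. "A -> B -> C".
        return (self.parent.chain() + " -> " if self.parent else "") + self.name


source = PlanNode("ReadSqlSource")

# Before: elements were filtered to the requested granularity (e.g. month) first,
# so the DAY-granularity column the time spine join relies on was already gone.
old_plan = PlanNode("JoinToTimeSpine", PlanNode("FilterElements(metric_time__month)", source))

# After: join to the time spine while the DAY column is still available, then filter.
# For cumulative offset metrics, an optional JoinOverTimeRange node would sit before
# the (also optional) time spine join.
new_plan = PlanNode("FilterElements(metric_time__month)", PlanNode("JoinToTimeSpine", source))

print(old_plan.chain())  # ReadSqlSource -> FilterElements(metric_time__month) -> JoinToTimeSpine
print(new_plan.chain())  # ReadSqlSource -> JoinToTimeSpine -> FilterElements(metric_time__month)
```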

Note: there is currently a table missing in the Postgres & BQ test warehouses that is blocking generation of snapshots for some tests.

@github-actions

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@courtneyholcomb force-pushed the court/derived-offset-granularity branch from 8d57703 to 9a31896 on August 16, 2023 22:17
@@ -1,12 +1,12 @@
 -- Compute Metrics via Expressions
 SELECT
-  subq_4.ds__month
-  , subq_4.txn_revenue AS trailing_2_months_revenue
+  subq_3.ds__month
Contributor Author
This is the SQL change from removing the unused JoinOverTimeRangeNode visit.

@courtneyholcomb changed the title from "Fix granularity for derived metrics with time offset" to "Enable granularities for derived metrics with time offset" on Aug 22, 2023
@courtneyholcomb marked this pull request as ready for review August 22, 2023 04:19
@tlento (Contributor) left a comment

I'm having a hard time reasoning about the changes here because this is an old and gnarly part of the codebase, it's been a while, and the PR itself is doing three seemingly independent but possibly interconnected things.

  1. Re-ordering the dataflow plan
  2. Removing some extra join
  3. Re-structuring the granularity adjustment source from the parent node of the time spine join to the first metric_time reference in the query parameters, and moving the adjustment logic inline in the visitor method

The last of these does not seem like it'll be deterministically correct. If a user groups by metric_time__day but applies a filter on metric_time__month they'll potentially get different results than if they did the opposite, and filtered on day while grouping by month, because the group by ones are added to the spec set first, but the filter happens after the joins.

Are all of these changes necessary to address the core problem with derived metric offsets? And can they be split up into stages so we can see what they're doing independently of each other? Even if it's rough, like if there's a commit pointer in this stack that splits things a bit more clearly, or if you can squash and reorder some of the commits and maybe do some git add --patch maneuvering to get to a commit breakpoint that divides this up (and especially if you can run the snapshots in between), that'd really help me.

metric_time_dimension_spec: Optional[TimeDimensionSpec] = None
for linkable_spec in queried_linkable_specs.time_dimension_specs:
    if linkable_spec.element_name == self._metric_time_dimension_reference.element_name:
        metric_time_dimension_spec = linkable_spec
Contributor
This might not be the correct spec for metric_time - as far as I can tell, the queried_linkable_specs may include a variety of metric_time granularities. We probably want the smallest granularity one, right?
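A hedged sketch of that suggestion, meant to slot into the same visitor method as the snippet above: gather every queried metric_time spec and keep the smallest-granularity one rather than the first match. GRANULARITY_ORDER and granularity_sort_key are hypothetical helpers, and the assumption that TimeGranularity values are lowercase names like "day" may not match the real enum.

```python
# Hypothetical alternative to the snippet above: collect every queried
# metric_time spec and keep the one with the smallest granularity.
GRANULARITY_ORDER = ["day", "week", "month", "quarter", "year"]


def granularity_sort_key(spec: TimeDimensionSpec) -> int:
    # Assumes TimeGranularity values are lowercase names like "day".
    return GRANULARITY_ORDER.index(spec.time_granularity.value)


metric_time_specs = [
    spec
    for spec in queried_linkable_specs.time_dimension_specs
    if spec.element_name == self._metric_time_dimension_reference.element_name
]
metric_time_dimension_spec = (
    min(metric_time_specs, key=granularity_sort_key) if metric_time_specs else None
)
```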

Contributor Author
Interesting point. What if the query included two incompatible granularities like week and month? We might need to take a list of metric_time_specs in the time spine builder!

Contributor Author
To wrap up this thread - I ended up implementing the logic to handle multiple metric time dimensions in the dataflow to SQL logic, not the time spine builder.

new_time_dim_spec = TimeDimensionSpec(
    element_name=original_time_dim_instance.spec.element_name,
    entity_links=original_time_dim_instance.spec.entity_links,
    time_granularity=node.metric_time_dimension_spec.time_granularity,
Contributor
If this is a larger-than-expected time granularity I think we'll end up producing incorrect results here.

Contributor Author
Why's that? I've included some integration tests with granularities larger than day for reference if you want to point to a SQL example.

metricflow/plan_conversion/dataflow_to_sql.py (thread resolved)
@@ -718,6 +718,7 @@ class JoinToTimeSpineNode(Generic[SourceDataSetT], BaseOutput[SourceDataSetT], A
     def __init__(
         self,
         parent_node: BaseOutput[SourceDataSetT],
+        metric_time_dimension_spec: TimeDimensionSpec,
Contributor
Do we want this here?

Previously, the granularity was sourced from the parent node for making the time spine dataset, which contained all of the available granularities. Now it's coming in from the query parameters.

Now it's getting threaded through from the callsite, and what we're doing at the callsite is taking one of the potentially many query instances and wiring it through this node.

Contributor Author
The parent dataset will no longer be narrowed down to only the requested metric_time_dimension_instances since the measure aggregation is being moved to after joining to time spine in this PR. Because of that, I thought it would be more efficient to pass through the requested granularity and only aggregate time spine columns based on that request. However, I can see a couple of potential issues with that.

  1. There might be multiple metric_time dimensions in the query. To handle this case, we could accept a list of metric_time_dimension_instances in _make_time_spine_data_set and create DATE_TRUNC columns for each.
  2. Maybe we don't want to prematurely optimize this query, and I should leave that to the optimizer. In that case we could create a DATE_TRUNC column for every possible granularity in _make_time_spine_data_set, and expect the unrequested ones to get filtered out by FilterElementsNode.

I'll definitely need to update the logic to handle case #1, but not sure about #2. Thoughts on that one?
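A rough sketch of option 2, assuming a DAY-grained ds column in an mf_time_spine table and a metric_time__<granularity> column-naming scheme; this is an illustration, not the actual _make_time_spine_data_set implementation, and the unrequested columns would be dropped downstream by the FilterElementsNode.

```python
# Illustrative only: emit a DATE_TRUNC column for every supported granularity
# when building the time spine data set, and rely on the later FilterElementsNode
# to drop the ones the query didn't ask for.
GRANULARITIES = ["week", "month", "quarter", "year"]

select_columns = ["ds AS metric_time__day"] + [
    f"DATE_TRUNC('{granularity}', ds) AS metric_time__{granularity}"
    for granularity in GRANULARITIES
]
time_spine_sql = "SELECT\n  " + "\n  , ".join(select_columns) + "\nFROM mf_time_spine"
print(time_spine_sql)
```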

Contributor Author
Actually, after some noodling on this I understood more what you meant. Ended up going back to inheriting the lowest granularity available from the parent dataset after all. So you can ignore this thread!

metricflow/plan_conversion/dataflow_to_sql.py (thread resolved)
@courtneyholcomb (Contributor, Author) commented:

@tlento I'll take this PR and separate it into smaller PRs to make this all clearer. I didn't totally follow what you meant in #3 which probably means it was not the logic I intended. Hopefully breaking into smaller PRs will help make this more clear to both of us and I can adjust that logic if needed!

@tlento (Contributor) commented Aug 28, 2023


That'd be amazing, thanks so much!

There are some fairly large changes I want to make this week, so hopefully we can get this all in ahead of those and avoid the looming merge conflict pain. I should be back to my normal review cadence this week, assuming no other major disruptions happen over here. Stupid yellow jackets....

@courtneyholcomb (Contributor, Author) commented:
Splitting into smaller PRs:
#743
#744
