
Enable granularities for derived metrics with time offset #726

Closed · wants to merge 18 commits

Conversation

@courtneyholcomb (Contributor) commented Aug 16, 2023

Resolves #SL-767

Description

With the current placement of the JoinToTimeSpineNode in the dataflow plan, we filter down to time columns with the requested granularity BEFORE joining to the time spine. Since the time spine join expects a time column with DAY granularity, queries at any granularity other than DAY don't return the expected results. Here, we move the time spine join to happen before filtering elements to fix that issue: instead of FilterElementsNode -> JoinToTimeSpineNode, we do JoinToTimeSpineNode -> FilterElementsNode.
For cumulative metrics, we need to factor in the JoinOverTimeRangeNode. This PR updates the dataflow plan from FilterElementsNode -> JoinOverTimeRangeNode (both required) to JoinOverTimeRangeNode (optional, only used if a time dimension is requested) -> JoinToTimeSpineNode (optional, only used for offset metrics) -> FilterElementsNode. This ensures we'll have the columns needed when joining to time spine.
Unrelated to time offset, this change also removes the unnecessary JoinOverTimeRangeNode step for a cumulative metric queried without a time dimension, which 1) does nothing in SQL and 2) might result in inaccurate costing for that query.
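To make the reordering concrete, here's a minimal runnable sketch that models the before/after node ordering with toy stand-ins; PlanNode, chain(), and the node names below are illustrative placeholders, not MetricFlow's actual classes or builder code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlanNode:
    """Toy stand-in for a dataflow plan node (not MetricFlow's real classes)."""

    name: str
    parent: Optional["PlanNode"] = None

    def chain(self) -> str:
        # Render the plan from source to sink, e.g. "A -> B -> C".
        return (self.parent.chain() + " -> " if self.parent else "") + self.name


source = PlanNode("ReadSqlSource")

# Before: elements were filtered to the requested granularity (e.g. month) first,
# so the DAY-granularity column the time spine join relies on was already gone.
old_plan = PlanNode("JoinToTimeSpine", PlanNode("FilterElements(metric_time__month)", source))

# After: join to the time spine while the DAY column is still available, then filter.
# For cumulative offset metrics, an optional JoinOverTimeRange node would sit before
# the (also optional) time spine join.
new_plan = PlanNode("FilterElements(metric_time__month)", PlanNode("JoinToTimeSpine", source))

print(old_plan.chain())  # ReadSqlSource -> FilterElements(metric_time__month) -> JoinToTimeSpine
print(new_plan.chain())  # ReadSqlSource -> JoinToTimeSpine -> FilterElements(metric_time__month)
```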

Note: there is currently a table missing in the Postgres & BQ test warehouses that is blocking generation of snapshots for some tests.

@github-actions

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@courtneyholcomb force-pushed the court/derived-offset-granularity branch from 8d57703 to 9a31896 on August 16, 2023 22:17
@@ -1,12 +1,12 @@
 -- Compute Metrics via Expressions
 SELECT
-  subq_4.ds__month
-  , subq_4.txn_revenue AS trailing_2_months_revenue
+  subq_3.ds__month
Contributor Author
This is the SQL change from removing the unused JoinOverTimeRangeNode visit.

@courtneyholcomb changed the title from "Fix granularity for derived metrics with time offset" to "Enable granularities for derived metrics with time offset" on Aug 22, 2023
@courtneyholcomb marked this pull request as ready for review August 22, 2023 04:19
@tlento (Contributor) left a comment

I'm having a hard time reasoning about the changes here because this is an old and gnarly part of the codebase, it's been a while, and the PR itself is doing three seemingly independent but possibly interconnected things.

  1. Re-ordering the dataflow plan
  2. Removing some extra join
  3. Re-structuring the granularity adjustment source from the parent node of the time spine join to the first metric_time reference in the query parameters, and moving the adjustment logic inline in the visitor method

The last of these does not seem like it'll be deterministically correct. If a user groups by metric_time__day but applies a filter on metric_time__month they'll potentially get different results than if they did the opposite, and filtered on day while grouping by month, because the group by ones are added to the spec set first, but the filter happens after the joins.

Are all of these changes necessary to address the core problem with derived metric offsets? And can they be split up into stages so we can see what they're doing independently of each other? Even if it's rough, like if there's a commit pointer in this stack that splits things a bit more clearly, or if you can squash and reorder some of the commits and maybe do some git add --patch maneuvering to get to a commit breakpoint that divides this up (and especially if you can run the snapshots in between), that'd really help me.

metric_time_dimension_spec: Optional[TimeDimensionSpec] = None
for linkable_spec in queried_linkable_specs.time_dimension_specs:
    if linkable_spec.element_name == self._metric_time_dimension_reference.element_name:
        metric_time_dimension_spec = linkable_spec
Contributor
This might not be the correct spec for metric_time - as far as I can tell, the queried_linkable_specs may include a variety of metric_time granularities. We probably want the smallest granularity one, right?
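A hedged sketch of that suggestion, meant to slot into the same visitor method as the snippet above: gather every queried metric_time spec and keep the smallest-granularity one rather than the first match. GRANULARITY_ORDER and granularity_sort_key are hypothetical helpers, and the assumption that TimeGranularity values are lowercase names like "day" may not match the real enum.

```python
# Hypothetical alternative to the snippet above: collect every queried
# metric_time spec and keep the one with the smallest granularity.
GRANULARITY_ORDER = ["day", "week", "month", "quarter", "year"]


def granularity_sort_key(spec: TimeDimensionSpec) -> int:
    # Assumes TimeGranularity values are lowercase names like "day".
    return GRANULARITY_ORDER.index(spec.time_granularity.value)


metric_time_specs = [
    spec
    for spec in queried_linkable_specs.time_dimension_specs
    if spec.element_name == self._metric_time_dimension_reference.element_name
]
metric_time_dimension_spec = (
    min(metric_time_specs, key=granularity_sort_key) if metric_time_specs else None
)
```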

Contributor Author
Interesting point. What if the query included two incompatible granularities like week and month? We might need to take a list of metric_time_specs in the time spine builder!

Contributor Author
To wrap up this thread - I ended up implementing the logic to handle multiple metric time dimensions in the dataflow to SQL logic, not the time spine builder.

new_time_dim_spec = TimeDimensionSpec(
    element_name=original_time_dim_instance.spec.element_name,
    entity_links=original_time_dim_instance.spec.entity_links,
    time_granularity=node.metric_time_dimension_spec.time_granularity,
Contributor
If this is a larger-than-expected time granularity I think we'll end up producing incorrect results here.

Contributor Author
Why's that? I've included some integration tests with granularities larger than day for reference if you want to point to a SQL example.

metricflow/plan_conversion/dataflow_to_sql.py (thread resolved)
@@ -718,6 +718,7 @@ class JoinToTimeSpineNode(Generic[SourceDataSetT], BaseOutput[SourceDataSetT], A
     def __init__(
         self,
         parent_node: BaseOutput[SourceDataSetT],
+        metric_time_dimension_spec: TimeDimensionSpec,
Contributor
Do we want this here?

Previously, the granularity was sourced from the parent node for making the time spine dataset, which contained all of the available granularities. Now it's coming in from the query parameters.

Now it's getting threaded through from the callsite, and what we're doing at the callsite is taking one of the potentially many query instances and wiring it through this node.

Contributor Author
The parent dataset will no longer be narrowed down to only the requested metric_time_dimension_instances since the measure aggregation is being moved to after joining to time spine in this PR. Because of that, I thought it would be more efficient to pass through the requested granularity and only aggregate time spine columns based on that request. However, I can see a couple of potential issues with that.

  1. There might be multiple metric_time dimensions in the query. To handle this case, we could accept a list of metric_time_dimension_instances in _make_time_spine_data_set and create DATE_TRUNC columns for each.
  2. Maybe we don't want to prematurely optimize this query, and I should leave that to the optimizer. In that case we could create a DATE_TRUNC column for every possible granularity in _make_time_spine_data_set, and expect the unrequested ones to get filtered out by FilterElementsNode.

I'll definitely need to update the logic to handle case #1, but not sure about #2. Thoughts on that one?
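A rough sketch of option 2, assuming a DAY-grained ds column in an mf_time_spine table and a metric_time__<granularity> column-naming scheme; this is an illustration, not the actual _make_time_spine_data_set implementation, and the unrequested columns would be dropped downstream by the FilterElementsNode.

```python
# Illustrative only: emit a DATE_TRUNC column for every supported granularity
# when building the time spine data set, and rely on the later FilterElementsNode
# to drop the ones the query didn't ask for.
GRANULARITIES = ["week", "month", "quarter", "year"]

select_columns = ["ds AS metric_time__day"] + [
    f"DATE_TRUNC('{granularity}', ds) AS metric_time__{granularity}"
    for granularity in GRANULARITIES
]
time_spine_sql = "SELECT\n  " + "\n  , ".join(select_columns) + "\nFROM mf_time_spine"
print(time_spine_sql)
```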

Contributor Author
Actually, after some noodling on this I understood more what you meant. Ended up going back to inheriting the lowest granularity available from the parent dataset after all. So you can ignore this thread!

metricflow/plan_conversion/dataflow_to_sql.py (thread resolved)
@courtneyholcomb (Contributor, Author) commented:

@tlento I'll take this PR and separate it into smaller PRs to make this all clearer. I didn't totally follow what you meant in #3 which probably means it was not the logic I intended. Hopefully breaking into smaller PRs will help make this more clear to both of us and I can adjust that logic if needed!

@tlento (Contributor) commented Aug 28, 2023


That'd be amazing, thanks so much!

There are some fairly large changes I want to make this week, so hopefully we can get this all in ahead of those and avoid the looming merge conflict pain. I should be back to my normal review cadence this week, assuming no other major disruptions happen over here. Stupid yellow jackets....

@courtneyholcomb (Contributor, Author) commented:
Splitting into smaller PRs:
#743
#744
