Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ADAP-544: Add configuration change options #659

Closed
wants to merge 19 commits into from

Conversation

mikealfare
Copy link
Contributor

@mikealfare mikealfare commented Jun 14, 2023

resolves #603

Description

Implement dynamic tables as a materialization option for Snowflake.

Workflow:

  • Refresh existing dynamic table with no changes
  • Accept and process changes to a dynamic table:
    • target_lag: alter dynamic table in place
    • warehouse: full refresh

Configuration:

  • on_configuration_change: determines the action to take when changes are detected:
    • apply: full refresh on warehouse changes, alter on target_lag changes
    • continue: soft fail, skip the processing of this model, but process downstream
    • fail: fail entire run

Checklist

@mikealfare mikealfare self-assigned this Jun 14, 2023
@cla-bot cla-bot bot added the cla:yes label Jun 14, 2023
@github-actions
Copy link
Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the dbt-snowflake contributing guide.

@mikealfare mikealfare changed the title Feature/dynamic tables/adap 544 ADAP-544: Add configuration change options Jun 14, 2023
@mikealfare mikealfare marked this pull request as ready for review June 21, 2023 02:50
@mikealfare mikealfare requested a review from a team as a code owner June 21, 2023 02:50
@mikealfare mikealfare requested review from McKnight-42 and Fleid and removed request for McKnight-42 June 21, 2023 02:50
return SnowflakeRelationType

@classmethod
def from_runtime_config(cls, runtime_config: RuntimeConfigObject) -> RelationConfigBase:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this function mean? I tried figuring that out from reading where it is invoked. That said, it's not immediately clear. You want the model from the runtime config which the materialization invokes? Just a lot of indirection here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely need a docstring here; I'll add it. The idea is to produce a RelationConfig from the RuntimeConfigObject; in this case it would produce a SnowflakeDynamicTableConfig instance. RuntimeConfigObject contains all of the information needed to create a SnowflakeDynamicTableConfig, plus a lot more. There are two key differences. The latter is in the parlance of the Snowflake object, and not the generic RuntimeConfigObject. And producing the latter provides a much better place for validation checks and structural checks; we don't need to do it at the SnowflakeRelation level or in the jinja.

model_node: ModelNode = runtime_config.model
relation_type = SnowflakeRelationType(model_node.config.materialized)

if relation_config := cls.relation_configs.get(relation_type):
Copy link
Contributor

@VersusFacit VersusFacit Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh the walrus operator is now usable by us? Oh good.

What's your intended semantics? Non-Optional assignment semantics frighten me. What does get do if relation_type is None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case we control the value, and it can't be None because relation_type is a Enum. But I do explicitly use that in all of the config parsers. Passing None is effectively not passing it at all. I think I covered the scenarios there, where None is not actually a valid value for that config parameter. But agreed, we need to watch out for that and check for the existence of the key in those scenarios. I could also make the argument that we don't allow that value in our interface.


@dataclass
class SnowflakeDynamicTableConfigChangeset:
target_lag: Optional[SnowflakeDynamicTableTargetLagConfigChange] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't remember. Could you just have the vars without the = None and get the same thing? Are you just erring to explicit? Is one considered cleaner? Can I find a way to ask more questions about this one line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remember correctly, dataclass yells at you if you have an Optional and you don't provide the variable because you haven't defined a default, even if that default is None. I think it doesn't assume. That being said, these are one of two varieties. There can either be a change that can only happen once (e.g. target_lag), or a set of changes all of the same type (e.g. indexes on Postgres). So I'm also being explicit that the null case here is None and not set(). I've also thought about collapsing this entirely and just having one single set of changes, but that leaves the parsing of "what type of change is this?" to the caller, which in this case is jinja.

existing_dynamic_table = SnowflakeDynamicTableConfig.from_relation_results(
relation_results
)
assert isinstance(existing_dynamic_table, SnowflakeDynamicTableConfig)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover assert? Should it be an exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to make the linter happy. from_relation_results() returns a generic RelationConfigBase() because it's on the base class. The base class has no platform-specific attributes like target_lag, warehouse, etc. Instead of doing a # type: ignore and dealing with the warnings in the IDE, I solve both with this assert.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what happens when this crashes on a user for some reason down the line against all our assumptions? We fine with this being the possible bug report?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I should probably wrap that in a try / catch and make it pretty, fair point.

@classmethod
def dynamic_table_config_changeset(
cls,
dynamic_table: SnowflakeDynamicTableConfig,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is pointing to what we're creating? Should this be like target_ prefixed or something like that to highlight this is the to-be-built copy of an existing table?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm definitely focusing on the new snapshot, which lives in the user config, as the "primary" view of the dynamic table. I did that because that's where I am when I'm running this, in the future looking back at the past. However, as noted above, this should probably change. I also don't like relation_results since it's very vague. This was the result of several back and forth's in reviews. I think I just might bite the bullet, and do the parsing for the existing_dynamic_table outside of this method as well, and then wind up with existing_dynamic_table and new_dynamic_table as the arguments, which is more clear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love a new_dynamic_table name everywhere personally!


if config_changeset.has_changes:
return config_changeset
return None
Copy link
Contributor

@VersusFacit VersusFacit Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: complete this with an else

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

::gets on soap box and raises pinky:: My understanding is that it's more idiomatic to return the None outside of the else block since the else block isn't doing anything. That's how I've generally seen it, so the else: return None construct looks strange to me, as if we could get past it somehow and not return anything (because it's indented). A compromise could be:

return config_changeset if config_changeset.has_changes else None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bruh I trust you but wow. Conventions be something. I'm happy with the compromise hehe

)

@classmethod
def dynamic_table_config_changeset(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this function is building the changeset between the new and existing version of the table? I find the name doesn't capture these semantics, and it would help self-document

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went around the block on naming this and I agree that it's not super intuitive. I'll see if I can come up with something.

For context, here was my thought process leading up to this name. At this point in the process, we know that there's an existing database object and an object in the config that points to that database object, which may or may not have changes. Both things are really the same thing, the same dynamic table. We're merely taking another snapshot and then determining if the state of the object (minus data) in the config matches the already existing object in the database. Since they both represent the same dynamic table at two points in time, we're effectively determining the changeset between those two points in time on the same dynamic table; hence dynamic_table_config_changeset(). This makes me wonder if maybe the method name isn't that bad, but the arguments should be updated, as you point out below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with more specific args and doc strings!

@staticmethod
def _render_part(component: ComponentName, value: str) -> Optional[str]:
if SnowflakeIncludePolicy().get_part(component):
if SnowflakeQuotePolicy().get_part(component):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the QuotePolicy intended to be subordinate to the IncludePolicy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're not including that part, then we don't care about the quote policy for that part. But if we are including that part, then we need to see if we're quoting it.

query: str
target_lag: SnowflakeDynamicTableTargetLagConfig
warehouse: str

@property
def path(self) -> str:
Copy link
Contributor

@VersusFacit VersusFacit Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Howabout relation path or db_relation_path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this will always be used in the form dynamic_table.path, I'm not keen on dynamic_table.relation_path since it feels like doublespeak. I chose path because I believe we call it that in dbt-core somewhere. However, path is generic; what this really is is the "fully qualified path". And since we're in relation_config land, we should be talking in database parlance. I'll update it to fully_qualified_path, but let me know your thoughts on that name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love fully_qualified_path.

@property
def path(self) -> str:
return ".".join(
part for part in [self.database_name, self.schema_name, self.name] if part is not None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what happens if self.schema_name is None but the other's aren't?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol, I thought the same thing, but this is how we roll in dbt-core. fwiw, those three attributes are required on this class, but could be provided as "". I'm fine dealing with that edge case since the user would effectively be choosing to break it. That becomes more of an issue when this winds up in a base class in dbt-core (down the road). If you're still concerned, I would address it with a validation rule, checking that all three parts are of non-zero length (rather than complicate this logic).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this but I'll take the explanation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put validation checks here.

@@ -21,6 +35,147 @@ class SnowflakeDynamicTableConfig(RelationConfigBase):
"""

name: str
schema_name: str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am confused. Down on line 67 you are using if not none assignment style guards' but these are str (as opposed to options?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, I did that in cases when there are defaults because passing in my_arg=None does not result in the default for my_arg, it results in None. However, I don't have defaults. I'll remove those guards below.

create or replace dynamic table {{ dynamic_table.path }}
target_lag = '{{ dynamic_table.target_lag }}'
warehouse = {{ dynamic_table.warehouse }}
as ({{ dynamic_table.query }})
Copy link
Contributor

@VersusFacit VersusFacit Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay so I think this is one of those places that without spacing to

as (
  {{ dynamic_table.query }}
)

would result in bugs (e.g. user puts comments at bottom of file).

We've seen this in RS (or whichever db 🙈 ). Did you prescrub out comments? If not, we should probably code defensively here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, I forgot about that. Nice catch. To be honest, I made it look like that because there were extra returns in the logs and this looked prettier, lol. Will change.

validation_check=(
(
self.duration >= 1
and self.period != SnowflakeDynamicTableTargetLagPeriod.seconds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seconds -> minute?

Copy link
Contributor

@VersusFacit VersusFacit Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And as insane as it sounds, could someone say 0 hours by chance? 🙊 Might want to guard against.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is already guarding against the 0 hours scenario because it's requiring a positive number. But it's a bit hidden since I do a check against the type and then combine it with another check. I'll break this into two checks: positive integer, and >= 60 if your unit is seconds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh excellent point about the boolean logic though!

return target_lag

@classmethod
def parse_model_node(cls, model_node: ModelNode) -> dict:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this paradigm you've made but I want to! Who is calling this? Why is the name so generic when the result is so specific to the class? (I see you do this elsewhere)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a poor man's mashumaro, pydantic, marshmallow, etc. I could see replacing this in the long term with a framework like that. The idea is that there are several ways to create this "generic" dynamic view representation. This dataclass is just one giant set of parsers with validation rules. The two parsers I found so far is database > config and user-config > config. I wouldn't be surprised if we wind up with more. Maybe someone wants to get crazy and do a Redshift MV user-config > Snowflake DT config to do a migration. Whatever it is, this config class is our anchor. And we generally want two representations of this config. We want a nice shiny, python object with dot path-ing, etc. so that the jinja reads cleanly, other dependent code reads cleanly, etc. And then we usually want the dictionary version that is ubiquitous; it's nice for logs, it translates out of python nicely if necessary (basically json). In order to do that in an hierarchical object, each object needs both and we need to split apart the foreign format > dict and dict > object computation. A framework like mashumaro does all of that behind the scenes so you don't need to worry about it. However, we still have the added complexity of needing more than one parser since our objects come from more than one source. I'd be happy to discuss this further, I think it need sync discussions to make sense.

target_lag: str = model_node.config.extra.get("target_lag")
try:
duration, period = target_lag.split(" ")
except IndexError:
Copy link
Contributor

@VersusFacit VersusFacit Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also catch the AttributeError when target_lag is None (since get can do that) unless that's guarded against elsewhere?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be in dbt/adapters/snowflake/relation_configs/dynamic_table.py around line 78? Still, if you ever instantiate this without the AttributeError check, you'll get an exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that sounds right. I think I originally had this in an if:

if target_lag := model_node.config.extra.get("target_lag"):
    try:
        duration, period = target_lag.split(" ")
    except IndexError:
        duration, period = None, None

But then I removed the if and forgot about that case. I'll update.

return config_dict

@classmethod
def parse_relation_results(cls, relation_results: RelationResults) -> dict:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this method get used? For output or something else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It get's used in super().from_relation_results(), which merely chains together this and from_dict(). This may not make as much sense here, but I'm going for a consistent look/feel across adapters and RelationConfigBase classes. When there are attributes which are themselves RelationConfigBase classes, e.g. target_lag below, then we need this method as a stopping point to produce a dict. Hence it gets called if this instance were an attribute on a bigger object.

tl;dr; Consistency? And to fit into the base class framework

"text",
"target_lag",
"warehouse"
from table(result_scan(last_query_id()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concurrency nightmare possible? Maybe.....we could do a dynamic query call? Or are you confident this is thread-safe/otherwise-collision-safe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the recommended way to read the results from show dynamic tables according to the docs I found. I could see if I can find that again if you want. I'm not tied to it though; it worked so I went with it.

@@ -1,46 +1,96 @@
{% macro snowflake__get_alter_dynamic_table_as_sql(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this file a bunch of helper DDL for the snowflake methods for building out this table? If I'm right, would love to see that documented at filetop. If I'm wrong, another reason to document our intentions 😁

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's the former. It's the pieces that go into dynamic_table/materialization.sql. When the materialization is in dbt-core, this boils down to just a file dynamic_table.sql. I broke it out here to emphasize that the materialization is local (and because it's a lot of code with slightly different uses). Agreed, I should (and will) add a doc block at the top.


"""
open_paren = query.find("as (") + len("as (")
close_paren = query.rindex(")")
Copy link
Contributor

@VersusFacit VersusFacit Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Damn, that's stressful to me -- so many edge cases -- but it does seem like you've accounted to the 5 my brain came up with!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I basically looked at what Snowflake returned on the most basic dynamic table. That being said, we control the format of what goes in there since we're writing the DDL that wraps this statement (the user's SQL). We're effectively unwrapping the statement here. I'm sure we'll find an edge case eventually and this will grow. The upside? We can actually compare the query of the object in the database and the query the user provided to see if there's a difference before deciding to deploy (this could be used to simplify view processing in the future).

{% elif full_refresh_mode or not existing_relation.is_dynamic_table %}
{% set build_sql = snowflake__get_replace_dynamic_table_as_sql(target_relation, sql, existing_relation, backup_relation, intermediate_relation) %}
{% set build_sql = snowflake__replace_dynamic_table_sql(dynamic_table, existing_relation, backup_relation, intermediate_relation) %}
{% else %}

-- get config options
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dangling/incomplete comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very similar to the new built-in materialized view materialization in dbt-core (if you find/replace "dynamic table" with "materialized view", you might even say they're the same). That being said, I had these comments at the time to try and lead myself (and the next unfortunate soul) through this giant, nested, jinja-dwelling if block. But it goes back to your comment of "comments should say why, not what". And I think what this is suggesting is that maybe it should be its own macro, which would self-document what I'm doing here. I'll try some combination of all of that to make this better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That'd be rad!

Copy link
Contributor

@VersusFacit VersusFacit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WOW! That's a lot of code! Still, all the boilerplate you wrote made it pretty starightforward to read so kudos.

@mikealfare mikealfare marked this pull request as draft June 22, 2023 18:21
@mikealfare mikealfare marked this pull request as ready for review June 26, 2023 20:51
@Fleid
Copy link
Contributor

Fleid commented Jun 28, 2023

Refresh existing dynamic table with no changes

Hey @mikealfare looking at the original spec, the DT should not be refreshed if there are no changes and their status isn't failed.
I've re-worded the expected lifecycle here, hopefully clarifying it.

full refresh on warehouse changes

I'm not sure what that means? Is that still an ALTER statement? Or are you doing a CREATE OR REPLACE?

mikealfare and others added 5 commits July 5, 2023 15:26
# Conflicts:
#	dbt/include/snowflake/macros/materializations/dynamic_table/ddl.sql
#	tests/functional/adapter/dynamic_table_tests/fixtures.py
#	tests/functional/adapter/dynamic_table_tests/test_dynamic_table.py
@mikealfare mikealfare marked this pull request as draft July 25, 2023 14:59
@mikealfare
Copy link
Contributor Author

This PR is way out of date. The branch will be saved for reference, but the PR is stale.

@mikealfare mikealfare closed this Sep 11, 2023
@mikealfare mikealfare deleted the feature/dynamic-tables/adap-544 branch July 17, 2024 23:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ADAP-544] [Feature] Support Dynamic Tables as a materialization (following base Materialized View ergonomics)
3 participants