Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1859651 Change stage name prefix for tables matched with regex #1030

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

sfc-gh-mbobowski
Copy link
Contributor

Overview

SNOW-1859651

This change prevents running file cleaner for Snowpipe-based connector when topic to table map contains any regular expression. With regular expressions being used it is highly probable that more than one topic is being ingested into a single table.

Pre-review checklist

  • This change should be part of a Behavior Change Release. See go/behavior-change.
  • This change has passed Merge gate tests
  • Snowpipe Changes
  • Snowpipe Streaming Changes
  • This change is TEST-ONLY
  • This change is README/Javadocs only
  • This change is protected by a config parameter <PARAMETER_NAME> eg snowflake.ingestion.method.
    • Yes - Added end to end and Unit Tests.
    • No - Suggest why it is not param protected
  • Is his change protected by parameter <PARAMETER_NAME> on the server side?
    • The parameter/feature is not yet active in production (partial rollout or PrPr, see Changes for Unreleased Features and Fixes).
    • If there is an issue, it can be safely mitigated by turning the parameter off. This is also verified by a test (See go/ppp).

@sfc-gh-mbobowski sfc-gh-mbobowski requested a review from a team as a code owner December 18, 2024 13:11

public class TopicToTableModeExtractor {

private static Pattern topicRegexPattern =
Pattern.compile("\\[([0-9]-[0-9]|[a-z]-[a-z]|[A-Z]-[A-Z]|[!-@])\\][*+]?");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of curiosity - do you think this expression could be extended somehow? the initial expression i've came up in the ticket was kinda quick thinking, I'm wonder if there could be other/better approach to tackle that?

One idea - in case we'd figue out we need to extend this - do you think it would make sense to embed this regexp in configuration, provide reasonable default and have possiblity to change it at runtime for some extreme corner cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively I was thinking about simply detecting any of the special characters like [ or ]. But I am not sure if it's better.

If we want to be more defensive with this change I would create a feature flag instead of exposing the regex. The worst case for messing sth up is that cleaner will stop working - it's not good for sure but also not terrible.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imo having a Regex for finding if a String is a multi-matching Regex is not what we want to do. (not only because it is hard and error prone). I think that we should calculate resulting table for the whole set of topics on each rebalance to create a literal topic to table map for the actual set of topics — only then we know if a topic defined in topic2table.map should be a literal topic or a topic-regex, e.g. "topic.name" could be both, literal matching "topic.name" or a regex for "topic[a-z|A-Z|0-9|._-]name".
Having a concrete mapping for the actual topics allows us to give the definite answer whether we have a MANY_TOPICS_SINGLE_TABLE case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do like this idea a lot to undwind all the topics up front to their final form rather than rely on regexes. There are two (at least) corner cases I'm worried about:

  • the order of initialization is sequential - I can see the scenario, when first topic is matched to the regex, is added to the table; at this moment - we have a table containing exactly one topic - so the logic would assume this is a signle topic and there is no need to salt the name with topic's hash.
  • the potential distribution of topics across nodes - if there is a regexp matching say 2 topics and we distribute these topics on 2 different nodes - the map would contain only 1 entry and we are back to square one.

Potentially, we could consider one more option - Michal, correct me if i'm wrong on this one - the connector name is salted with timestamp on every deployment, right? if that's the case - i think we could simply enable the salting logic for file prefixes regardless of the topic2table configuration - the fact that connector's name is changing will also impact the stage prefix, so trying to keep the naming scheme backwards compatible is already invalidated - that's something i've noticed when analysing this cleaner issue.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the order of initialization is sequential - I can see the scenario, when first topic is matched to the regex, is added to the table; at this moment - we have a table containing exactly one topic - so the logic would assume this is a signle topic and there is no need to salt the name with topic's hash.

This is not a problem since we first unwind the topics, then check if we have duplicates.

the potential distribution of topics across nodes - if there is a regexp matching say 2 topics and we distribute these topics on 2 different nodes - the map would contain only 1 entry and we are back to square one.

This is a real problem if we cannot assume that each connector name is unique. Alternatively we could generate a UUID (or something) for each connector start that would not change on rebalances. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you checked the code and that's the logic - than ok. i thought it was running the init in a sequential loop per task.

the second part - this could be affecting the assignor policy and honestly - i doubt we should be doing that.

the more i think about it - the more i'm leaning toward enabling that topic salting regardless of the topic2table detected mode....

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, if the table name was resolved using topic2table map we should salt the file prefix with the topic — otherwise there could be a clash anyway (see example below). This should not break anything but some files could stay as leftovers on the Stage after an upgrade and potentially that data could be duplicated. No data loss though.

Example:
topics=topicA,topicB or topics.regex=^topic[A-Z]$
and
topic2table.map=topicB:topicA
In the above situation, despite having only a single, non-regex entry in topic2table.map we would have a clash on the cleaners. Reversing a regex to check where the clash could happen is not feasible.

Copy link
Contributor

@sfc-gh-gjachimko sfc-gh-gjachimko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - just added one comment.

thanks for addressing it!

@sfc-gh-mbobowski sfc-gh-mbobowski changed the title SNOW-1859651 Topic2TableMode for patterns SNOW-1859651 Disable cleaner for tables matched with regex Dec 18, 2024
@sfc-gh-mbobowski sfc-gh-mbobowski changed the title SNOW-1859651 Disable cleaner for tables matched with regex SNOW-1859651 Change stage name prefix for tables matched with regex Dec 18, 2024

public class TopicToTableModeExtractor {

private static Pattern topicRegexPattern =
Pattern.compile("\\[([0-9]-[0-9]|[a-z]-[a-z]|[A-Z]-[A-Z]|[!-@])\\][*+]?");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imo having a Regex for finding if a String is a multi-matching Regex is not what we want to do. (not only because it is hard and error prone). I think that we should calculate resulting table for the whole set of topics on each rebalance to create a literal topic to table map for the actual set of topics — only then we know if a topic defined in topic2table.map should be a literal topic or a topic-regex, e.g. "topic.name" could be both, literal matching "topic.name" or a regex for "topic[a-z|A-Z|0-9|._-]name".
Having a concrete mapping for the actual topics allows us to give the definite answer whether we have a MANY_TOPICS_SINGLE_TABLE case.

Comment on lines 224 to 237
@Test
public void testTopic2TableCorrectlyDeterminesMode() {
Map<String, String> config = getConfig();
config.put(TOPICS_TABLES_MAP, "src1:target1,src2:target2,src3:target1");
connectorConfigValidator.validateConfig(config);
Map<String, String> topic2Table = Utils.parseTopicToTableMap(config.get(TOPICS_TABLES_MAP));
assertThat(TopicToTableModeExtractor.determineTopic2TableMode(topic2Table, "src1"))
.isEqualTo(TopicToTableModeExtractor.Topic2TableMode.MANY_TOPICS_SINGLE_TABLE);
assertThat(TopicToTableModeExtractor.determineTopic2TableMode(topic2Table, "src2"))
.isEqualTo(TopicToTableModeExtractor.Topic2TableMode.SINGLE_TOPIC_SINGLE_TABLE);
assertThat(TopicToTableModeExtractor.determineTopic2TableMode(topic2Table, "src3"))
.isEqualTo(TopicToTableModeExtractor.Topic2TableMode.MANY_TOPICS_SINGLE_TABLE);
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops, will revert

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants