SNOW-1859651 Change stage name prefix for tables matched with regex #1030

sfc-gh-mbobowski · 2024-12-18T13:11:01Z

Overview

SNOW-1859651

This change prevents running file cleaner for Snowpipe-based connector when topic to table map contains any regular expression. With regular expressions being used it is highly probable that more than one topic is being ingested into a single table.

Pre-review checklist

sfc-gh-gjachimko · 2024-12-18T13:31:54Z

src/main/java/com/snowflake/kafka/connector/config/TopicToTableModeExtractor.java


 public class TopicToTableModeExtractor {

+  private static Pattern topicRegexPattern =
+      Pattern.compile("\\[([0-9]-[0-9]|[a-z]-[a-z]|[A-Z]-[A-Z]|[!-@])\\][*+]?");


out of curiosity - do you think this expression could be extended somehow? the initial expression i've came up in the ticket was kinda quick thinking, I'm wonder if there could be other/better approach to tackle that?

One idea - in case we'd figue out we need to extend this - do you think it would make sense to embed this regexp in configuration, provide reasonable default and have possiblity to change it at runtime for some extreme corner cases?

Alternatively I was thinking about simply detecting any of the special characters like [ or ]. But I am not sure if it's better.

If we want to be more defensive with this change I would create a feature flag instead of exposing the regex. The worst case for messing sth up is that cleaner will stop working - it's not good for sure but also not terrible.

Imo having a Regex for finding if a String is a multi-matching Regex is not what we want to do. (not only because it is hard and error prone). I think that we should calculate resulting table for the whole set of topics on each rebalance to create a literal topic to table map for the actual set of topics — only then we know if a topic defined in topic2table.map should be a literal topic or a topic-regex, e.g. "topic.name" could be both, literal matching "topic.name" or a regex for "topic[a-z|A-Z|0-9|._-]name".
Having a concrete mapping for the actual topics allows us to give the definite answer whether we have a MANY_TOPICS_SINGLE_TABLE case.

I do like this idea a lot to undwind all the topics up front to their final form rather than rely on regexes. There are two (at least) corner cases I'm worried about:

the order of initialization is sequential - I can see the scenario, when first topic is matched to the regex, is added to the table; at this moment - we have a table containing exactly one topic - so the logic would assume this is a signle topic and there is no need to salt the name with topic's hash.

the potential distribution of topics across nodes - if there is a regexp matching say 2 topics and we distribute these topics on 2 different nodes - the map would contain only 1 entry and we are back to square one.

Potentially, we could consider one more option - Michal, correct me if i'm wrong on this one - the connector name is salted with timestamp on every deployment, right? if that's the case - i think we could simply enable the salting logic for file prefixes regardless of the topic2table configuration - the fact that connector's name is changing will also impact the stage prefix, so trying to keep the naming scheme backwards compatible is already invalidated - that's something i've noticed when analysing this cleaner issue.

the order of initialization is sequential - I can see the scenario, when first topic is matched to the regex, is added to the table; at this moment - we have a table containing exactly one topic - so the logic would assume this is a signle topic and there is no need to salt the name with topic's hash.

This is not a problem since we first unwind the topics, then check if we have duplicates.

the potential distribution of topics across nodes - if there is a regexp matching say 2 topics and we distribute these topics on 2 different nodes - the map would contain only 1 entry and we are back to square one.

This is a real problem if we cannot assume that each connector name is unique. Alternatively we could generate a UUID (or something) for each connector start that would not change on rebalances. WDYT?

if you checked the code and that's the logic - than ok. i thought it was running the init in a sequential loop per task.

the second part - this could be affecting the assignor policy and honestly - i doubt we should be doing that.

the more i think about it - the more i'm leaning toward enabling that topic salting regardless of the topic2table detected mode....

Yeah, if the table name was resolved using topic2table map we should salt the file prefix with the topic — otherwise there could be a clash anyway (see example below). This should not break anything but some files could stay as leftovers on the Stage after an upgrade and potentially that data could be duplicated. No data loss though.

Example:
topics=topicA,topicB or topics.regex=^topic[A-Z]$
and
topic2table.map=topicB:topicA
In the above situation, despite having only a single, non-regex entry in topic2table.map we would have a clash on the cleaners. Reversing a regex to check where the clash could happen is not feasible.

sfc-gh-gjachimko

LGTM - just added one comment.

thanks for addressing it!

sfc-gh-dseweryn · 2024-12-18T15:46:23Z

src/main/java/com/snowflake/kafka/connector/config/TopicToTableModeExtractor.java


 public class TopicToTableModeExtractor {

+  private static Pattern topicRegexPattern =
+      Pattern.compile("\\[([0-9]-[0-9]|[a-z]-[a-z]|[A-Z]-[A-Z]|[!-@])\\][*+]?");


Imo having a Regex for finding if a String is a multi-matching Regex is not what we want to do. (not only because it is hard and error prone). I think that we should calculate resulting table for the whole set of topics on each rebalance to create a literal topic to table map for the actual set of topics — only then we know if a topic defined in topic2table.map should be a literal topic or a topic-regex, e.g. "topic.name" could be both, literal matching "topic.name" or a regex for "topic[a-z|A-Z|0-9|._-]name".
Having a concrete mapping for the actual topics allows us to give the definite answer whether we have a MANY_TOPICS_SINGLE_TABLE case.

sfc-gh-dseweryn · 2024-12-19T15:22:05Z

src/test/java/com/snowflake/kafka/connector/ConnectorConfigValidatorTest.java

-  @Test
-  public void testTopic2TableCorrectlyDeterminesMode() {
-    Map<String, String> config = getConfig();
-    config.put(TOPICS_TABLES_MAP, "src1:target1,src2:target2,src3:target1");
-    connectorConfigValidator.validateConfig(config);
-    Map<String, String> topic2Table = Utils.parseTopicToTableMap(config.get(TOPICS_TABLES_MAP));
-    assertThat(TopicToTableModeExtractor.determineTopic2TableMode(topic2Table, "src1"))
-        .isEqualTo(TopicToTableModeExtractor.Topic2TableMode.MANY_TOPICS_SINGLE_TABLE);
-    assertThat(TopicToTableModeExtractor.determineTopic2TableMode(topic2Table, "src2"))
-        .isEqualTo(TopicToTableModeExtractor.Topic2TableMode.SINGLE_TOPIC_SINGLE_TABLE);
-    assertThat(TopicToTableModeExtractor.determineTopic2TableMode(topic2Table, "src3"))
-        .isEqualTo(TopicToTableModeExtractor.Topic2TableMode.MANY_TOPICS_SINGLE_TABLE);
-  }
-


whoops, will revert

SNOW-1859651 Topic2TableMode for patterns

3ec125f

sfc-gh-mbobowski requested a review from a team as a code owner December 18, 2024 13:11

sfc-gh-gjachimko reviewed Dec 18, 2024

View reviewed changes

sfc-gh-gjachimko approved these changes Dec 18, 2024

View reviewed changes

sfc-gh-mbobowski changed the title ~~SNOW-1859651 Topic2TableMode for patterns~~ SNOW-1859651 Disable cleaner for tables matched with regex Dec 18, 2024

sfc-gh-mbobowski changed the title ~~SNOW-1859651 Disable cleaner for tables matched with regex~~ SNOW-1859651 Change stage name prefix for tables matched with regex Dec 18, 2024

sfc-gh-dseweryn requested changes Dec 18, 2024

View reviewed changes

sfc-gh-dseweryn reviewed Dec 19, 2024

View reviewed changes

sfc-gh-dseweryn force-pushed the mbobowski-SNOW-1859651 branch from 9fb6116 to 3ec125f Compare December 20, 2024 10:24

sfc-gh-dseweryn mentioned this pull request Dec 20, 2024

SNOW-1859651 Salt prefix for files mapped from different topics #1035

Open

12 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SNOW-1859651 Change stage name prefix for tables matched with regex #1030

SNOW-1859651 Change stage name prefix for tables matched with regex #1030

sfc-gh-mbobowski commented Dec 18, 2024

sfc-gh-gjachimko Dec 18, 2024

sfc-gh-mbobowski Dec 18, 2024

sfc-gh-dseweryn Dec 18, 2024

sfc-gh-gjachimko Dec 19, 2024

sfc-gh-dseweryn Dec 19, 2024

sfc-gh-gjachimko Dec 19, 2024

sfc-gh-dseweryn Dec 19, 2024

sfc-gh-gjachimko left a comment

sfc-gh-dseweryn Dec 18, 2024

sfc-gh-dseweryn Dec 19, 2024

SNOW-1859651 Change stage name prefix for tables matched with regex #1030

Are you sure you want to change the base?

SNOW-1859651 Change stage name prefix for tables matched with regex #1030

Conversation

sfc-gh-mbobowski commented Dec 18, 2024

Overview

Pre-review checklist

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sfc-gh-gjachimko left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment