Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1829870: Allow structured types to be enabled by default #2727

Open
wants to merge 24 commits into
base: main
Choose a base branch
from

Conversation

sfc-gh-jrose
Copy link
Contributor

@sfc-gh-jrose sfc-gh-jrose commented Dec 6, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1829870

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
  3. Please describe how your code solves the related issue.

    This PR addresses a few misc issue involved with enabling structured types by default.

  • Change to_sql to support structured types. There were some edge cases where it was generating incorrect sql.
  • When the structured type semantic mode is enabled structured types are assumed to be structured and to have semi-structured types you have to set the element types to None for ArrayType, MapType, and StructType.
  • In cases where StringType was being passed to init for those types None is now passed by default allowing the type to decide if it should default to StringType.
  • When the new mode is enabled types will prefer to be structured if element typing is available.

Before this change the in order to have a structured schema you would have to set the structured parameter to True on each type that was expected to be structured:

structured_schema = StructType([
    StructField("array", ArrayType(IntegerType(), structured=True)),
    StructField("map", MapType(IntegerType(), IntegerType(), structured=True)),
    StructField("struct", StructType([StructField("field", IntegerType())], structured=True)),
])

semi_structured_schema = StructType([
    StructField("array", ArrayType(IntegerType())),
    StructField("map", MapType(IntegerType(), IntegerType())),
    StructField("struct", StructType([StructField("field", IntegerType())])),
])

also_semi_structured_schema = StructType([
    StructField("array", ArrayType()),
    StructField("map", MapType()),
    StructField("struct", StructType()),
])

# Both of the semi-structured schemas actually materialize to this:
StructType([
	StructField('ARRAY', ArrayType(StringType()), nullable=True),
	StructField('MAP', MapType(StringType(), StringType()), nullable=True),
	StructField('STRUCT', MapType(StringType(), StringType()), nullable=True)
])

After this change the second case above will instead be inferred to be structured which should be more understandable to users. We've seen confusion in the past with the inner types being converted to StringType being non-inuitive.

After the change the definitions will look like this:

structured_schema = StructType([
    StructField("array", ArrayType(IntegerType())),
    StructField("map", MapType(IntegerType(), IntegerType())),
    StructField("struct", StructType([StructField("field", IntegerType())])),
])

semi_structured_schema = StructType([
    StructField("array", ArrayType()),
    StructField("map", MapType()),
    StructField("struct", StructType()),
])

# Semi structured actually materializes to what it is set to.

@sfc-gh-jrose sfc-gh-jrose added DO-NOT-MERGE NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md labels Dec 6, 2024
Copy link

github-actions bot commented Dec 6, 2024

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

Copy link

github-actions bot commented Dec 6, 2024

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jrose sfc-gh-jrose requested a review from a team December 6, 2024 21:57
@sfc-gh-jrose sfc-gh-jrose marked this pull request as ready for review December 16, 2024 18:56
@sfc-gh-jrose sfc-gh-jrose requested review from a team as code owners December 16, 2024 18:56
Comment on lines +343 to +344
self.element_type = element_type
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can element_type be None here? What does it mean for the column type to be so?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If element_type is None than this is a semi-structured array column and could contain anything.

)
value_type = (
python_type_to_snow_type(tp_args[1], is_return_type_of_sproc)[0]
if tp_args
else StringType()
else None
Copy link
Contributor

@sfc-gh-aling sfc-gh-aling Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there an example to illustrate how the code change would work? I'm wondering changing StringType to None would cause any change in non-structured type cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See here:

self.key_type = key_type if key_type else StringType()
self.value_type = value_type if value_type else StringType()

The None here passes responsibility to the type initializer which still defaults to StringType when dealing with non-structured cases.

sfc-gh-vbudati and others added 3 commits December 16, 2024 15:33
)

1. Which Jira issue is this PR addressing? Make sure that there is an
accompanying issue to your PR.

   Fixes SNOW-1852779

2. Fill out the following pre-review checklist:

- [ ] I am adding a new automated test(s) to verify correctness of my
new code
- [ ] If this test skips Local Testing mode, I'm requesting review from
@snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
- [ ] If this is a new feature/behavior, I'm adding the Local Testing
parity changes.
- [x] I acknowledge that I have ensured my changes to be thread-safe.
Follow the link for more information: [Thread-safe Developer
Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

3. Please describe how your code solves the related issue.

Fixed AST encoding for Column `in_`, `asc`, and `desc`. Removed an
unused entity and renamed `sp_column_in__seq` to `sp_column_in`. Changed
the `nulls_first` optional boolean param to be an enum.
Copy link

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jrose sfc-gh-jrose requested a review from a team December 17, 2024 00:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants