
[SPARK-48739][SQL] Disable writing collated data to file formats that don't support them in non managed tables #47127

Status: Open. Wants to merge 8 commits into base: master
Conversation


@stefankandic stefankandic commented Jun 27, 2024

What changes were proposed in this pull request?

Disable writing collated types to data sources that don't support them. However, Spark managed tables should still work, as their schema is stored in the Hive Metastore (HMS) rather than in the data files themselves.
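The gist of the proposed check can be sketched as follows. This is a simplified, self-contained model, not the actual Spark implementation: the type names (`StringType`, `StructField`) only mimic Spark's, and the `verifyCollations` helper is hypothetical.

```scala
// Sketch only: model a schema as a list of fields and reject non-default
// collations when writing to a sink whose files cannot persist collation
// metadata (i.e. anything that is not a Spark managed table).
sealed trait DataType
case class StringType(collation: String = "UTF8_BINARY") extends DataType
case object IntegerType extends DataType
case class ArrayType(elementType: DataType) extends DataType

case class StructField(name: String, dataType: DataType)

object CollationCheck {
  // Recursively look for a collated string anywhere in the type.
  def hasCollatedString(dt: DataType): Boolean = dt match {
    case StringType(c) => c != "UTF8_BINARY"
    case ArrayType(e)  => hasCollatedString(e)
    case _             => false
  }

  // Throw when writing collated data outside a managed table; managed
  // tables are allowed because the schema lives in the metastore.
  def verifyCollations(schema: Seq[StructField], isManagedTable: Boolean): Unit =
    if (!isManagedTable) {
      schema.find(f => hasCollatedString(f.dataType)).foreach { f =>
        throw new IllegalArgumentException(
          s"Column '${f.name}' uses a collated string type, which this " +
            "file format cannot persist outside a managed table.")
      }
    }
}
```

Under this model, a write to a plain path with a collated column fails fast, while the same schema in a managed table passes the check.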

Why are the changes needed?

Right now, when users write a collated type directly to a file format such as JSON, text, or ORC, they will not see that collation when reading the data back.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jun 27, 2024
@stefankandic stefankandic changed the title [SPARK-48739] Disable writing collated data to file formats that don't support them in non managed tables [SPARK-48739][SQL] Disable writing collated data to file formats that don't support them in non managed tables Jun 27, 2024

stefankandic commented Jun 27, 2024

@cloud-fan Please take a look when you find the time

@stefankandic stefankandic marked this pull request as ready for review June 27, 2024 20:22
@@ -536,10 +536,12 @@ case class DataSource(
dataSource.toString, field)
}
}
DataSourceUtils.verifyCollations(dataSource, data.schema)
A reviewer (Contributor) commented on the diff above:

Shall we combine this with the data.schema.foreach check mentioned above?


cloud-fan commented Jun 28, 2024

For the internal file source API, I think we can simply update FileFormat#supportDataType in certain formats, such as CSV, to return false for strings with collation. So no new API is needed.

For data source v1, we can add a new API to CreatableRelationProvider, like supportsStringCollation. Ideally this is not needed, as we already have CreatableRelationProvider#supportsDataType. But string collation is special: a collated string is still a StringType, so existing v1 data sources may mistakenly support it if they do `case _: StringType => true`.

UPDATE: actually, CreatableRelationProvider#supportsDataType was newly added in Spark 4.0 (not released yet). We can change it to not support strings with collation, so that no existing v1 source supports them unless it overrides supportsDataType to explicitly do so.
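The pattern-matching pitfall described above can be illustrated with a small self-contained sketch. The `StringType` here only mimics Spark's (where collation is tracked on the type itself); the two `supportsDataType` variants are illustrative, not the real `CreatableRelationProvider` API.

```scala
// Sketch only: a collated string is still a StringType, so a bare type
// pattern accepts it by accident.
case class StringType(collation: String = "UTF8_BINARY") {
  def isDefaultCollation: Boolean = collation == "UTF8_BINARY"
}

// Existing v1-style check: matches on the type alone, so it accidentally
// accepts collated strings too.
def supportsDataTypeNaive(dt: Any): Boolean = dt match {
  case _: StringType => true
  case _             => false
}

// Proposed default: only accept strings with the default (binary) collation,
// so a source must explicitly opt in to support collated strings.
def supportsDataTypeSafe(dt: Any): Boolean = dt match {
  case st: StringType => st.isDefaultCollation
  case _              => false
}
```

Flipping the default this way means every existing v1 source rejects collated strings for free, and only sources that override the check ever see them.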

stefankandic (Contributor, Author) replied:
@cloud-fan

> For internal file source API, I think we can simply update FileFormat#supportDataType in certain formats such as CSV to return false for string with collation. So no new API is needed.

This would mean that we wouldn't be able to create Spark managed tables with collations for those formats. Is that something we want to do?

cloud-fan (Contributor) replied:

> This would mean that we wouldn't be able to create Spark managed tables with collations for those formats. Is that something we want to do?

To confirm the goal of this PR: we want a new API for file sources to indicate that a type is supported only with a catalog? I think we should be more specific about this, as there are several ways to use a file source:

  1. read a path with a user-specified schema
  2. write to a path
  3. create external table with a path
  4. create managed table

3 participants