
[SPARK-48739][SQL] Disable writing collated data to file formats that don't support them in non managed tables #47127

Status: Open. Wants to merge 8 commits into base: master
Conversation


@stefankandic stefankandic commented Jun 27, 2024

What changes were proposed in this pull request?

Disable writing collated types to data sources that don't support them. However, Spark managed tables should still work, as their schema is stored in the Hive Metastore (HMS) rather than in the data files themselves.
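The gist of the proposed check can be sketched as follows. This is a simplified, self-contained model, not the actual Spark implementation: the type names (`StringType`, `StructField`) only mimic Spark's, and the `verifyCollations` helper is hypothetical.

```scala
// Sketch only: model a schema as a list of fields and reject non-default
// collations when writing to a sink whose files cannot persist collation
// metadata (i.e. anything that is not a Spark managed table).
sealed trait DataType
case class StringType(collation: String = "UTF8_BINARY") extends DataType
case object IntegerType extends DataType
case class ArrayType(elementType: DataType) extends DataType

case class StructField(name: String, dataType: DataType)

object CollationCheck {
  // Recursively look for a collated string anywhere in the type.
  def hasCollatedString(dt: DataType): Boolean = dt match {
    case StringType(c) => c != "UTF8_BINARY"
    case ArrayType(e)  => hasCollatedString(e)
    case _             => false
  }

  // Throw when writing collated data outside a managed table; managed
  // tables are allowed because the schema lives in the metastore.
  def verifyCollations(schema: Seq[StructField], isManagedTable: Boolean): Unit =
    if (!isManagedTable) {
      schema.find(f => hasCollatedString(f.dataType)).foreach { f =>
        throw new IllegalArgumentException(
          s"Column '${f.name}' uses a collated string type, which this " +
            "file format cannot persist outside a managed table.")
      }
    }
}
```

Under this model, a write to a plain path with a collated column fails fast, while the same schema in a managed table passes the check.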

Why are the changes needed?

Right now, when users write a collated type directly to a file format such as JSON, text, or ORC, they will not see that collation when reading the data back.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jun 27, 2024
@stefankandic stefankandic changed the title [SPARK-48739] Disable writing collated data to file formats that don't support them in non managed tables [SPARK-48739][SQL] Disable writing collated data to file formats that don't support them in non managed tables Jun 27, 2024

stefankandic commented Jun 27, 2024

@cloud-fan Please take a look when you find the time

@stefankandic stefankandic marked this pull request as ready for review June 27, 2024 20:22
@@ -536,10 +536,12 @@ case class DataSource(
dataSource.toString, field)
}
}
DataSourceUtils.verifyCollations(dataSource, data.schema)
A reviewer (Contributor) commented on the diff above:

Shall we combine this with the data.schema.foreach check mentioned above?


cloud-fan commented Jun 28, 2024

For the internal file source API, I think we can simply update FileFormat#supportDataType in certain formats, such as CSV, to return false for strings with collation. So no new API is needed.

For data source v1, we can add a new API to CreatableRelationProvider, like supportsStringCollation. Ideally this is not needed, as we already have CreatableRelationProvider#supportsDataType. But string collation is special: a collated string is still a StringType, so existing v1 data sources may mistakenly support it if they do `case _: StringType => true`.

UPDATE: actually, CreatableRelationProvider#supportsDataType was newly added in Spark 4.0 (not released yet). We can change it to not support strings with collation, so that no existing v1 source supports them unless it overrides supportsDataType to explicitly do so.
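The pattern-matching pitfall described above can be illustrated with a small self-contained sketch. The `StringType` here only mimics Spark's (where collation is tracked on the type itself); the two `supportsDataType` variants are illustrative, not the real `CreatableRelationProvider` API.

```scala
// Sketch only: a collated string is still a StringType, so a bare type
// pattern accepts it by accident.
case class StringType(collation: String = "UTF8_BINARY") {
  def isDefaultCollation: Boolean = collation == "UTF8_BINARY"
}

// Existing v1-style check: matches on the type alone, so it accidentally
// accepts collated strings too.
def supportsDataTypeNaive(dt: Any): Boolean = dt match {
  case _: StringType => true
  case _             => false
}

// Proposed default: only accept strings with the default (binary) collation,
// so a source must explicitly opt in to support collated strings.
def supportsDataTypeSafe(dt: Any): Boolean = dt match {
  case st: StringType => st.isDefaultCollation
  case _              => false
}
```

Flipping the default this way means every existing v1 source rejects collated strings for free, and only sources that override the check ever see them.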

stefankandic (Contributor, Author) replied:
@cloud-fan

> For internal file source API, I think we can simply update FileFormat#supportDataType in certain formats such as CSV to return false for string with collation. So no new API is needed.

This would mean that we wouldn't be able to create Spark managed tables with collations for those formats. Is that something we want to do?

cloud-fan (Contributor) replied:

> This would mean that we wouldn't be able to create Spark managed tables with collations for those formats. Is that something we want to do?

To confirm the goal of this PR: we want a new API for file sources to indicate that a type is supported only with a catalog? I think we should be more specific about this, as there are several ways to use a file source:

  1. read a path with a user-specified schema
  2. write to a path
  3. create external table with a path
  4. create managed table

3 participants