Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48682][SQL] Use ICU in InitCap expression for UTF8_BINARY strings #47100

Closed
wants to merge 2 commits into from

Conversation

uros-db
Copy link
Contributor

@uros-db uros-db commented Jun 26, 2024

What changes were proposed in this pull request?

Update InitCap Spark expressions to use ICU case mappings for UTF8_BINARY collation, instead of the currently used JVM case mappings. This behaviour is put under the ICU_CASE_MAPPINGS_ENABLED flag in SQLConf, which is true by default. Note: the same flag is used for Lower & Upper expressions, with changes introduced in: #47043.

Why are the changes needed?

To keep the consistency between collations - all collations shouls use ICU-based case mappings, including the UTF8_BINARY collation.

Does this PR introduce any user-facing change?

Yes, the behaviour of initcap string function for UTF8_BINARY will now rely on ICU-based case mappings. However, by turning the ICU_CASE_MAPPINGS_ENABLED flag off, users can get the old JVM-based case mappings. Note that the difference between the two is really subtle.

How was this patch tested?

Existing tests, with extended CollationSupport unit tests for InitCap to verify both ICU and JVM behaviour.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jun 26, 2024
@uros-db
Copy link
Contributor Author

uros-db commented Jun 26, 2024

@mkaravel ready for review

@uros-db
Copy link
Contributor Author

uros-db commented Jun 27, 2024

@mkaravel please review

Copy link
Contributor

@mkaravel mkaravel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@uros-db
Copy link
Contributor Author

uros-db commented Jun 27, 2024

@cloud-fan ready to merge

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 2c9eb1c Jun 28, 2024
asl3 pushed a commit to asl3/spark that referenced this pull request Jul 1, 2024
### What changes were proposed in this pull request?
Update `InitCap` Spark expressions to use ICU case mappings for UTF8_BINARY collation, instead of the currently used JVM case mappings. This behaviour is put under the `ICU_CASE_MAPPINGS_ENABLED` flag in SQLConf, which is true by default. Note: the same flag is used for `Lower` & `Upper` expressions, with changes introduced in: apache#47043.

### Why are the changes needed?
To keep the consistency between collations - all collations shouls use ICU-based case mappings, including the UTF8_BINARY collation.

### Does this PR introduce _any_ user-facing change?
Yes, the behaviour of `initcap` string function for UTF8_BINARY will now rely on ICU-based case mappings. However, by turning the `ICU_CASE_MAPPINGS_ENABLED` flag off, users can get the old JVM-based case mappings. Note that the difference between the two is really subtle.

### How was this patch tested?
Existing tests, with extended `CollationSupport` unit tests for InitCap to verify both ICU and JVM behaviour.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47100 from uros-db/change-initcap.

Authored-by: Uros Bojanic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants